carryless multiplication builtin #9631

travisstaloch · 2021-08-27T00:37:40Z

I would like to work toward creating a carryless multiplication builtin in zig. This is a fast instruction used in simdjson for example to convert binary quote boundaries from json strings into masks. In the following example, Q is a 64 bit quote boundary marker. The last line is the result of a carryless multiplication between Q and 0xfffffffffffffff

{ "\\\" Nam[{": [ 116,"\\\\" , 234, "true", false ], "t":"\\\"" }: input data
__1___1_____1________1____1________1____1___________1_1_1___11__ : Q
______1_____________________________________________________1___ : OD
__1_________1________1____1________1____1___________1_1_1____1__ : Q &=~OD
__1111111111_________11111_________11111____________11__11111___ : CLMUL(Q,~0)

In simdjson this is known as prefix_xor and is implemented here:

Here are some references to this instruction in the zig repo:

Make poly1305 faster #6116 (comment)
zig/lib/std/crypto/ghash.zig

Line 93 in d298719

inline fn clmul_pclmul(x: u64, y: u64) u64 {

The llvm x86 intrinsic is llvm.x86.pclmulqdq

Name ideas:

@mulCarryless()
@mulWithoutCarry()

I hope to use this in my simdjson port to get rid off hacky llvm intrinsic calls such as the following which may not be possible in stage 2:

@"llvm.x86.pclmulqdq"(@bitCast(i64x2, a), @bitCast(i64x2, b), 0)

Related to #903

If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.

The text was updated successfully, but these errors were encountered:

andrewrk · 2021-08-27T00:48:30Z

If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.

The good news and bad news is that you are the pioneer of the first such builtin.

One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction. In this case it may emit a call to a compiler-rt function, which we can make sure is implemented for other architectures in addition to x86.

Regardless, I do think that if you start on this feature, it can be tackled one bit at a time and I'd be happy to help at any point along the way.

travisstaloch · 2021-08-27T02:24:53Z

One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction.

Interesting. Not sure if this is what you meant, but I tried the following. I guess my clang-foo is failing. Any suggestions? Not sure 'i386' is a correct option for -mcpu. I tried several others like 'westmere', 'haswell' w/ same results.

$ clang-12 -c builtin-things.c -o foo.bc -emit-llvm --target=x86_64-linux -mcpu=i386 -mpclmul &&  llvm-dis-12 foo.bc -o foo.ll

clang: warning: argument unused during compilation: '-mcpu=i386' [-Wunused-command-line-argument]

$ cat builtin-things.c
#include <stdint.h>
#include <emmintrin.h>
#include <immintrin.h>

uint64_t prefix_xor(const uint64_t bitmask) {
  __m128i all_ones = _mm_set1_epi8('\xFF');
  __m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
  return _mm_cvtsi128_si64(result);
}

andrewrk · 2021-08-27T02:30:13Z

$ clang-12 -c builtin-things.c -emit-llvm -S
builtin-things.c:7:20: error: '__builtin_ia32_pclmulqdq128' needs target feature pclmul
  __m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
                   ^
/home/andy/local/llvm12-release/lib/clang/12.0.1/include/__wmmintrin_pclmul.h:45:13: note: expanded from macro '_mm_clmulepi64_si128'
  ((__m128i)__builtin_ia32_pclmulqdq128((__v2di)(__m128i)(X), \
            ^

So now we know how clang handles this situation: compile error

For Zig, we should make the builtin always work, by providing an implementation if necessary.

matu3ba · 2021-08-27T23:20:39Z

uses different intrinsics per platform (and a custom implementation on arm

You can reuse parts of my open PR #9578 to select the correct function at comptime and expose the respective symbol.
Note that I did not implement big endian support (yet), since there is no CI for testing due to LLVM MIPS regression.

Fortunately Rust has implemented usage of this very intrinsic resolving how this gets lowered: rust-lang/stdarch#318.

Feature detection is in LLVM in lib/Support/Host.cpp. Take note to use and reference the MIT release on porting, if possible. There is code in compiler_rt linking that release.

Probably it would also be good to have a central place for intrinsics.
LLVM has intrinsics defined in llvm/lib/IR, but zig uses lib/std/special/compiler_rt for compiler_rt stuff.
So probably using lib/std/special/intrinsics and according intrinsics.zig should be fine to keep it separate, but indicate that things work similar to compiler_rt. However that can also be decided during review.

travisstaloch · 2021-10-01T03:53:34Z

I started working on implementing this. I'm currently able to generate the llvm intrinsic but having a name mangling issue that i'm not sure how to fix. I've posted the issue to llvm irc / discord.

The error is:

ld.lld: error: undefined symbol: llvm.x86.pclmulqdq.v2i64

The correct name is just llvm.x86.pclmulqdq. I'm not sure how to get rid of the trailing .v2i64. There must be some way to get irbuilder::CreateIntrinsic to not add the type name. Or maybe there is an alternative to CreateIntrinsic I should be using?

Of course the code is very hacky and messy so far. Just thought i would share my progress incase anyone has any thoughts on this mangling issue or any other thoughts about how to proceed.

travisstaloch · 2021-10-04T23:09:29Z

looks like the error from my previous comment was solved in llvm-13.

lin72h · 2022-11-06T12:04:01Z

riscv B extension has clmul too, simde has a cross platform c implementation we can learn from. And one of the use case of clmul is blazing fast clhash for zig cache

addresses ziglang#9631. only works with llvm backend/x86 so far. allows new test/behavior/mul_carryless.zig to pass with -Denable-llvm. doesn't do any backend/arch/cpu-feature testing.

farteryhr · 2023-03-19T07:12:43Z

some other important bit manipulation instructions: PDEP and PEXT (and CLMUL, which can also be used for constructing bitty steps other than crypto algorithms). it could be used in wide variety of data co/decompressing, de/encoding algorithms.

polyfill: https://github.com/zwegner/zp7 (also see how AMD fails)

there are many use cases mentioned on the web:

Use BMI2 pdep/pext instructions Forceflow/libmorton#6
https://www.zhihu.com/question/27824125/answer/2621765689 (chinese, database engine developer)
https://haskell-works.github.io/posts/2018-08-22-pdep-and-pext-bit-manipulation-functions.html

"elegance": https://news.ycombinator.com/item?id=20205743

though elegance can't be quantized, in my humble opinion, they're like the new "CLZ CTZ POPCNT triad" as standard bit manipulation units. that's useful, non-trivial to polyfill and makes pain.

please consider also adding them to builtins..

andrewrk added this to the 0.10.0 milestone Aug 27, 2021

andrewrk added accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. labels Aug 27, 2021

travisstaloch mentioned this issue Nov 8, 2022

@mulCarryless() - add a builtin to llvm backend #13483

Closed

Validark mentioned this issue Mar 18, 2023

Add @pext and @pdep builtins #14995

Open

Validark mentioned this issue May 24, 2023

Add packing and unpacking selection operations for Vectors #15837

Open

Validark mentioned this issue Jul 5, 2023

Add operator for carryless multiply #16326

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

carryless multiplication builtin #9631

carryless multiplication builtin #9631

travisstaloch commented Aug 27, 2021

andrewrk commented Aug 27, 2021

travisstaloch commented Aug 27, 2021 •

edited

andrewrk commented Aug 27, 2021

matu3ba commented Aug 27, 2021 •

edited

travisstaloch commented Oct 1, 2021

travisstaloch commented Oct 4, 2021

lin72h commented Nov 6, 2022

farteryhr commented Mar 19, 2023

carryless multiplication builtin #9631

carryless multiplication builtin #9631

Comments

travisstaloch commented Aug 27, 2021

andrewrk commented Aug 27, 2021

travisstaloch commented Aug 27, 2021 • edited

andrewrk commented Aug 27, 2021

matu3ba commented Aug 27, 2021 • edited

travisstaloch commented Oct 1, 2021

travisstaloch commented Oct 4, 2021

lin72h commented Nov 6, 2022

farteryhr commented Mar 19, 2023

travisstaloch commented Aug 27, 2021 •

edited

matu3ba commented Aug 27, 2021 •

edited