Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

carryless multiplication builtin #9631

Open
travisstaloch opened this issue Aug 27, 2021 · 8 comments
Open

carryless multiplication builtin #9631

travisstaloch opened this issue Aug 27, 2021 · 8 comments
Labels
accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Milestone

Comments

@travisstaloch
Copy link
Sponsor Contributor

I would like to work toward creating a carryless multiplication builtin in zig. This is a fast instruction used in simdjson for example to convert binary quote boundaries from json strings into masks. In the following example, Q is a 64 bit quote boundary marker. The last line is the result of a carryless multiplication between Q and 0xfffffffffffffff

{ "\\\" Nam[{": [ 116,"\\\\" , 234, "true", false ], "t":"\\\"" }: input data
__1___1_____1________1____1________1____1___________1_1_1___11__ : Q
______1_____________________________________________________1___ : OD
__1_________1________1____1________1____1___________1_1_1____1__ : Q &=~OD
__1111111111_________11111_________11111____________11__11111___ : CLMUL(Q,~0)

In simdjson this is known as prefix_xor and is implemented here:

Here are some references to this instruction in the zig repo:

The llvm x86 intrinsic is llvm.x86.pclmulqdq

Name ideas:

  • @mulCarryless()
  • @mulWithoutCarry()

I hope to use this in my simdjson port to get rid off hacky llvm intrinsic calls such as the following which may not be possible in stage 2:

@"llvm.x86.pclmulqdq"(@bitCast(i64x2, a), @bitCast(i64x2, b), 0)

Related to #903

If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.

@andrewrk andrewrk added this to the 0.10.0 milestone Aug 27, 2021
@andrewrk andrewrk added accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. labels Aug 27, 2021
@andrewrk
Copy link
Member

If accepted, I'm not sure where I would begin. If anyone can suggest a similar builtin which uses different intrinsics per platform (and a custom implementation on arm) , perhaps i can follow its implementation.

The good news and bad news is that you are the pioneer of the first such builtin.

One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction. In this case it may emit a call to a compiler-rt function, which we can make sure is implemented for other architectures in addition to x86.

Regardless, I do think that if you start on this feature, it can be tackled one bit at a time and I'd be happy to help at any point along the way.

@travisstaloch
Copy link
Sponsor Contributor Author

travisstaloch commented Aug 27, 2021

One trick you could try would be using clang to emit LLVM IR, using the pclmulqdq intrinsic, but specifying an x86 CPU that does not have the instruction.

Interesting. Not sure if this is what you meant, but I tried the following. I guess my clang-foo is failing. Any suggestions? Not sure 'i386' is a correct option for -mcpu. I tried several others like 'westmere', 'haswell' w/ same results.

$ clang-12 -c builtin-things.c -o foo.bc -emit-llvm --target=x86_64-linux -mcpu=i386 -mpclmul &&  llvm-dis-12 foo.bc -o foo.ll

clang: warning: argument unused during compilation: '-mcpu=i386' [-Wunused-command-line-argument]

$ cat builtin-things.c
#include <stdint.h>
#include <emmintrin.h>
#include <immintrin.h>

uint64_t prefix_xor(const uint64_t bitmask) {
  __m128i all_ones = _mm_set1_epi8('\xFF');
  __m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
  return _mm_cvtsi128_si64(result);
}

@andrewrk
Copy link
Member

$ clang-12 -c builtin-things.c -emit-llvm -S
builtin-things.c:7:20: error: '__builtin_ia32_pclmulqdq128' needs target feature pclmul
  __m128i result = _mm_clmulepi64_si128(_mm_set_epi64x(0ULL, bitmask), all_ones, 0);
                   ^
/home/andy/local/llvm12-release/lib/clang/12.0.1/include/__wmmintrin_pclmul.h:45:13: note: expanded from macro '_mm_clmulepi64_si128'
  ((__m128i)__builtin_ia32_pclmulqdq128((__v2di)(__m128i)(X), \
            ^

So now we know how clang handles this situation: compile error

For Zig, we should make the builtin always work, by providing an implementation if necessary.

@matu3ba
Copy link
Contributor

matu3ba commented Aug 27, 2021

uses different intrinsics per platform (and a custom implementation on arm

You can reuse parts of my open PR #9578 to select the correct function at comptime and expose the respective symbol.
Note that I did not implement big endian support (yet), since there is no CI for testing due to LLVM MIPS regression.

Fortunately Rust has implemented usage of this very intrinsic resolving how this gets lowered: rust-lang/stdarch#318.

Feature detection is in LLVM in lib/Support/Host.cpp. Take note to use and reference the MIT release on porting, if possible. There is code in compiler_rt linking that release.

Probably it would also be good to have a central place for intrinsics.
LLVM has intrinsics defined in llvm/lib/IR, but zig uses lib/std/special/compiler_rt for compiler_rt stuff.
So probably using lib/std/special/intrinsics and according intrinsics.zig should be fine to keep it separate, but indicate that things work similar to compiler_rt. However that can also be decided during review.

@travisstaloch
Copy link
Sponsor Contributor Author

I started working on implementing this. I'm currently able to generate the llvm intrinsic but having a name mangling issue that i'm not sure how to fix. I've posted the issue to llvm irc / discord.

The error is:

ld.lld: error: undefined symbol: llvm.x86.pclmulqdq.v2i64

The correct name is just llvm.x86.pclmulqdq. I'm not sure how to get rid of the trailing .v2i64. There must be some way to get irbuilder::CreateIntrinsic to not add the type name. Or maybe there is an alternative to CreateIntrinsic I should be using?

Of course the code is very hacky and messy so far. Just thought i would share my progress incase anyone has any thoughts on this mangling issue or any other thoughts about how to proceed.

@travisstaloch
Copy link
Sponsor Contributor Author

looks like the error from my previous comment was solved in llvm-13.

@lin72h
Copy link

lin72h commented Nov 6, 2022

riscv B extension has clmul too, simde has a cross platform c implementation we can learn from. And one of the use case of clmul is blazing fast clhash for zig cache

travisstaloch pushed a commit to travisstaloch/zig that referenced this issue Nov 8, 2022
addresses ziglang#9631. only works with llvm backend/x86 so far. allows new
test/behavior/mul_carryless.zig to pass with -Denable-llvm. doesn't do
any backend/arch/cpu-feature testing.
travisstaloch added a commit to travisstaloch/zig that referenced this issue Nov 8, 2022
addresses ziglang#9631. only works with llvm backend/x86 so far. allows new
test/behavior/mul_carryless.zig to pass with -Denable-llvm. doesn't do
any backend/arch/cpu-feature testing.
travisstaloch added a commit to travisstaloch/zig that referenced this issue Nov 8, 2022
addresses ziglang#9631. only works with llvm backend/x86 so far. allows new
test/behavior/mul_carryless.zig to pass with -Denable-llvm. doesn't do
any backend/arch/cpu-feature testing.
andrewrk pushed a commit to travisstaloch/zig that referenced this issue Dec 12, 2022
addresses ziglang#9631. only works with llvm backend/x86 so far. allows new
test/behavior/mul_carryless.zig to pass with -Denable-llvm. doesn't do
any backend/arch/cpu-feature testing.
@farteryhr
Copy link

some other important bit manipulation instructions: PDEP and PEXT (and CLMUL, which can also be used for constructing bitty steps other than crypto algorithms). it could be used in wide variety of data co/decompressing, de/encoding algorithms.

polyfill: https://github.com/zwegner/zp7 (also see how AMD fails)

there are many use cases mentioned on the web:

"elegance": https://news.ycombinator.com/item?id=20205743

though elegance can't be quantized, in my humble opinion, they're like the new "CLZ CTZ POPCNT triad" as standard bit manipulation units. that's useful, non-trivial to polyfill and makes pain.

please consider also adding them to builtins..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Projects
None yet
Development

No branches or pull requests

5 participants