Better SIMD shuffles #36624
Mark-Simulacrum added the A-LLVM and I-slow labels on May 13, 2017
Mark-Simulacrum added the C-enhancement label on Jul 26, 2017
jneem referenced this issue in rust-lang/regex on Mar 13, 2018
This still occurs even when using the specific intrinsics in
Can reproduce; will try to file an LLVM bug for this. Once the LLVM bug is filed and recognized as a real bug, we could work around this in
EDIT: reported https://bugs.llvm.org/show_bug.cgi?id=36933
jneem commented Sep 21, 2016
I'm trying to do the following in AVX2 using intrinsics: shift `x` one byte to the right, while shifting the rightmost byte of `y` into the leftmost byte of `x`. This is best done using two instructions: `vperm2i128` followed by `vpalignr`. However, `simd_shuffle32` generates four instructions: `vmovdqa` (to load a constant), `vpblendvb`, then `vperm2i128` and `vpalignr`. Here is a full example, which may be compiled with `rustc -O -C target_feature=+avx2 --crate-type=lib --emit=asm shuffle.rs`.

This might be considered a bug in LLVM, in the sense that it's generating a sub-optimal shuffle. However, I think it should be addressed in rustc, because if I know what the right sequence of instructions is then I shouldn't have to hope that LLVM can generate it. Moreover, it's possible to get the right code from clang (compile with `clang -emit-llvm -mavx2 -O -S shuffle.c`).

A possibly interesting observation is that the unoptimized LLVM IR from clang contains an `llvm.x86.avx2.vperm2i128` intrinsic followed by a `shufflevector`. The optimized LLVM IR from clang contains two `shufflevector` instructions. In order to try to get the same output from rustc, I first patched it to support `llvm.x86.avx2.vperm2i128`. After modifying `right_shift_1` to use the new intrinsic, I got rustc to produce `llvm.x86.avx2.vperm2i128` followed by a `shufflevector`. However, the optimized LLVM IR from rustc still contains a single `shufflevector`, and it still ends up producing the bad asm.

I think this means that the fault is from some optimization pass in rustc that isn't in clang, but I haven't had time to investigate it yet...
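For readers who want to see the two-instruction sequence concretely, here is a minimal sketch of the operation described above. It is not the example from the issue, and it uses today's stable `std::arch` intrinsics rather than `simd_shuffle32`. The function names, the argument order, and the convention that "leftmost" means byte index 0 are all assumptions made for illustration. The scalar function models the shuffle as picking indices 31 through 62 of the 64-byte concatenation `y ++ x`. The AVX2 function expresses the same thing as a `vperm2i128` followed by a `vpalignr`.

```rust
// Sketch only: the names and the "index 0 is leftmost" convention are
// assumptions for illustration, not taken from the original example.

/// Scalar model: out[0] = y[31], out[i] = x[i - 1] for i in 1..32.
/// Equivalent to selecting indices 31..=62 of the concatenation y ++ x.
pub fn right_shift_1_scalar(x: [u8; 32], y: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    out[0] = y[31];
    out[1..].copy_from_slice(&x[..31]);
    out
}

/// AVX2 version built from the two intrinsics that map to
/// `vperm2i128` and `vpalignr`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn right_shift_1_avx2(x: [u8; 32], y: [u8; 32]) -> [u8; 32] {
    use std::arch::x86_64::*;
    let xv = _mm256_loadu_si256(x.as_ptr() as *const __m256i);
    let yv = _mm256_loadu_si256(y.as_ptr() as *const __m256i);
    // vperm2i128 with 0x21: low lane = high half of y, high lane = low half of x.
    let t = _mm256_permute2x128_si256::<0x21>(yv, xv);
    // vpalignr by 15: in each 128-bit lane, take bytes 15 through 30 of
    // (t_lane ++ x_lane), i.e. the last byte of t_lane, then 15 bytes of x_lane.
    let r = _mm256_alignr_epi8::<15>(xv, t);
    let mut out = [0u8; 32];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, r);
    out
}

/// Uses the AVX2 path when the CPU supports it, otherwise the scalar model.
pub fn right_shift_1(x: [u8; 32], y: [u8; 32]) -> [u8; 32] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // The runtime check above guarantees AVX2 is available here.
            return unsafe { right_shift_1_avx2(x, y) };
        }
    }
    right_shift_1_scalar(x, y)
}
```

Calling `right_shift_1(x, y)` with `x` holding bytes 0 through 31 should give a result that starts with `y[31]` and continues with `x[0]` through `x[30]`. Whether the compiler emits exactly two instructions for the AVX2 path depends on the rustc and LLVM versions, so it is worth checking the output of `--emit=asm` rather than assuming it.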