
Slice::contains generates suboptimal assembly code #88204

Closed
SadiinsoSnowfall opened this issue Aug 21, 2021 · 8 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments


SadiinsoSnowfall commented Aug 21, 2021

Given val: u8, slice: &[u8; 8], and arr: [u8; 8], I expected the following statements to compile down to the same thing:

// a
slice.contains(&val);

// b
slice[0] == val
  || slice[1] == val
  || slice[2] == val
  || slice[3] == val
  || slice[4] == val
  || slice[5] == val
  || slice[6] == val
  || slice[7] == val;
  
// c
arr.contains(&val);

// d
arr[0] == val
  || arr[1] == val
  || arr[2] == val
  || arr[3] == val
  || arr[4] == val
  || arr[5] == val
  || arr[6] == val
  || arr[7] == val;

However, the resulting assembly differs quite a lot:

  • The a statement compiles down to a loop, checking one element at a time, except for T = u8|i8 and N < 16, where it instead takes the fast path of memchr, which gets optimized a little better.
  • The b statement compiles down to an unrolled loop, checking one element at a time in a branchless fashion. Most of the time it doesn't yield any SIMD instructions.
  • The c statement always compiles down to a loop, checking one element at a time, except for T = u8|i8 and N >= 16, where it instead calls memchr_general_case.
  • The d statement always compiles down to a few branchless SIMD instructions, for any primitive type and any array size.

Because the slice/array size is known at compile time and the type checker guarantees that it will be respected by any calling function, I expected the compiler to take this into account when optimizing the resulting assembly. However, this information seems to be lost at some point when the contains method is used.

arr.contains(&val) and slice.contains(&val) are simplified to arr.as_ref().iter().any(|e| *e == val) and slice.iter().any(|e| *e == val), if I'm not mistaken (which is weird, because for some N and T they don't yield the same assembly). The compiler does not seem to be able to unroll this case.
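For anyone reproducing this, the two forms under discussion can be compared side by side in godbolt with a minimal sketch like the following (function names are mine, not from the issue):

```rust
// Minimal reproduction sketch; function names are illustrative.

// The library method under discussion.
pub fn contains_u8(slice: &[u8; 8], val: u8) -> bool {
    slice.contains(&val)
}

// The rough desugaring mentioned above: contains as iter().any.
pub fn any_u8(slice: &[u8; 8], val: u8) -> bool {
    slice.iter().any(|e| *e == val)
}

fn main() {
    let arr: [u8; 8] = [1, 2, 3, 4, 5, 6, 7, 8];
    // Both forms agree on the result; only the generated assembly differs.
    assert_eq!(contains_u8(&arr, 5), any_u8(&arr, 5));
    assert_eq!(contains_u8(&arr, 9), any_u8(&arr, 9));
    println!("ok");
}
```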

godbolt links for
T=u8; N=8
T=u16; N=8
T=u32; N=8
T=u64; N=8

T=u8; N=16
T=u16; N=16
T=u32; N=16
T=u64; N=16


hkratz commented Aug 21, 2021

@rustbot label +I-slow

@rustbot rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 21, 2021

the8472 commented Aug 21, 2021

It can be made vectorizer-friendly by using as_chunks. It also works for slices of unknown size.

https://rust.godbolt.org/z/4GKrxoKxz
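The linked example uses as_chunks, which was nightly-only at the time; a similar effect can be sketched on stable with chunks_exact, which likewise hands the optimizer a fixed chunk width (my sketch, not the code from the godbolt link; the chunk width 8 is an arbitrary choice here):

```rust
// Sketch of a vectorizer-friendly contains over fixed-width chunks.
// Works for slices of unknown length: full chunks are checked branchlessly,
// then the leftover tail is checked element by element.
pub fn contains_chunked(slice: &[u64], val: u64) -> bool {
    let mut chunks = slice.chunks_exact(8);
    // Non-short-circuiting `|` inside each chunk keeps the comparisons
    // branchless, which is what lets the vectorizer kick in.
    let found = chunks.any(|c| c.iter().fold(false, |acc, &e| acc | (e == val)));
    found || chunks.remainder().iter().any(|&e| e == val)
}

fn main() {
    let v: Vec<u64> = (0..19).collect();
    assert!(contains_chunked(&v, 3)); // found in a full chunk
    assert!(contains_chunked(&v, 18)); // found in the remainder
    assert!(!contains_chunked(&v, 99)); // not present
    println!("ok");
}
```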


SadiinsoSnowfall commented Aug 25, 2021

It looks like something changed in one of the latest nightly versions (I retested this on a49e38e67 2021-08-23, according to godbolt): the contains method is now fully unrolled by the compiler and produces a series of cmp and jumps, but almost never produces any SIMD code. Maybe using an unrolled loop like that is preferable for small slices, but for larger slices the compiler does not even unroll the loop.

I can't figure out why the generated assembly when using arrays is so different from when using borrowed slices.

@SadiinsoSnowfall
Author

Apparently, both Clang and GCC also avoid using SIMD in this case (preferring unrolled loops or not, depending on the target architecture), so this may be a non-issue.


the8472 commented Aug 28, 2021

It looks like something changed in one of the latest nightly version

Update to LLVM 13. #87570


SadiinsoSnowfall commented Sep 16, 2021

On the current nightly available on godbolt (rustc 1.57.0-nightly (9bb77da74 2021-09-13)), every function compiles to a series of cmp and jmp instructions, except for the memchr_general_case case (u8 slice with more than 8 elements), because that function is not marked as inline.

The only weird thing I found was in the assembly generated by the b and d expressions (see first message) for [T=u8; N=16] slices/arrays. In this case, the d expression produces a series of jmp and cmp instructions, but the b expression produces a lot more assembly code, including SIMD instructions.

I would like to know if there is a reason for this discrepancy, and if a series of cmp/jmp is in fact the fastest way to perform the operation described in my first message. If everything is OK, I will close the issue.

PS: rustc 1.55 vs current nightly for [T=u64; N=16] case:

pub fn test(slice: &[u64; 16], val: u64) -> bool {
    slice[0] == val
    || slice[1] == val
    || slice[2] == val
    || slice[3] == val
    || slice[4] == val
    || slice[5] == val
    || slice[6] == val
    || slice[7] == val
    || slice[8] == val
    || slice[9] == val
    || slice[10] == val
    || slice[11] == val
    || slice[12] == val
    || slice[13] == val
    || slice[14] == val
    || slice[15] == val
}
rustc_1_55::test:
        vmovq   xmm0, rsi
        vpbroadcastq    ymm0, xmm0
        vpcmpeqq        ymm1, ymm0, ymmword ptr [rdi + 96]
        vpcmpeqq        ymm2, ymm0, ymmword ptr [rdi + 64]
        vpcmpeqq        ymm3, ymm0, ymmword ptr [rdi + 32]
        vpcmpeqq        ymm0, ymm0, ymmword ptr [rdi]
        vpackssdw       ymm1, ymm2, ymm1
        vpackssdw       ymm0, ymm0, ymm3
        vpermq  ymm1, ymm1, 216
        vpermq  ymm0, ymm0, 216
        vpackssdw       ymm0, ymm0, ymm1
        vpmovmskb       eax, ymm0
        test    eax, -1431655766
        setne   al
        vzeroupper
        ret
        
nightly::test:
        mov     al, 1
        cmp     qword ptr [rdi], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 8], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 16], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 24], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 32], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 40], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 48], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 56], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 64], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 72], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 80], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 88], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 96], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 104], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 112], rsi
        je      .LBB0_16
        cmp     qword ptr [rdi + 120], rsi
        sete    al
.LBB0_16:
        ret


the8472 commented Sep 16, 2021

If you use | instead of || it vectorizes again.
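Concretely (my sketch of that suggestion, reusing the 16-element test function from the previous comment): since `|` on bool evaluates both sides unconditionally, the comparison chain becomes branchless and LLVM can vectorize it again.

```rust
// Same chain as before, but with short-circuiting `||` replaced by
// bitwise `|`, so every comparison is evaluated branchlessly.
pub fn test(slice: &[u64; 16], val: u64) -> bool {
    (slice[0] == val)
        | (slice[1] == val)
        | (slice[2] == val)
        | (slice[3] == val)
        | (slice[4] == val)
        | (slice[5] == val)
        | (slice[6] == val)
        | (slice[7] == val)
        | (slice[8] == val)
        | (slice[9] == val)
        | (slice[10] == val)
        | (slice[11] == val)
        | (slice[12] == val)
        | (slice[13] == val)
        | (slice[14] == val)
        | (slice[15] == val)
}

fn main() {
    let mut arr = [0u64; 16];
    arr[15] = 42;
    assert!(test(&arr, 42));
    assert!(!test(&arr, 7));
    println!("ok");
}
```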

@SadiinsoSnowfall
Author

Could this be linked to #83623, in relation to the update to LLVM 13? The code vectorizes when using rustc 1.53-1.55 but not on the current nightly.
