Add AVX2 implementation of `first_max_element` #2806

redzic · 2021-09-15T19:58:14Z

This also now requires BMI1 and BMI2 for AVX2 in CpuFeatureLevel (for tzcnt).

tdaede · 2021-09-15T21:23:15Z

Resulting assembly:

example::first_max_element_avx2:
        vmovdqu xmm0, xmmword ptr [rdi]
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
        vpshufd xmm1, xmm0, 238
        vpmaxsd xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85
        vpmaxsd xmm0, xmm0, xmm1
        vmovd   edx, xmm0
        vpbroadcastd    ymm0, xmm0
        vpcmpeqd        ymm0, ymm0, ymmword ptr [rdi]
        vmovmskps       eax, ymm0
        tzcnt   eax, eax
        vzeroupper
        ret

tdaede · 2021-09-15T21:29:28Z

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

coveralls · 2021-09-15T21:32:41Z

Coverage decreased (-0.1%) to 83.273% when pulling cff1af9 on redzic:find-first-max-avx2 into 2ec4e67 on xiph:master.

redzic · 2021-09-15T22:04:45Z

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

I don't think it's possible to use ymm registers for getting the max of 8 32-bit integers, or if it is there probably wouldn't be any benefit.

As far as I understand, this is how the algorithm LLVM uses works:

max:
        ; rdi is a pointer to the first element, and it is statically known that there are 8 elements

        ; load the first 4 elements of the slice into xmm0
        vmovdqu xmm0, xmmword ptr [rdi]

        ; do a packed max the last 4 elements of the slice (rdi + 16)
        ; basically, xmm0 now looks like this (where x represents the original 8-element array):
        ; xmm0 = [max(x[0], x[4]), max(x[1], x[5]), max(x[2], x[6]), max(x[3], x[7])]
        ;
        ; at this point, we can forget about the entire array and just worry about finding the max of
        ; the 4 elements that are now in xmm0, since the max is definitely inside xmm0
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]

        ; shuffle: xmm1 = xmm0[2,3,2,3]
        vpshufd xmm1, xmm0, 238

        ; packed max of xmm0 and xmm1, store the result in xmm0
        ; here is what the registers look like before this step:
        ; xmm0 = [x[0], x[1], x[2], x[3]]
        ; xmm1 = [x[2], x[3], x[2], x[3]]
        ; so the last 2 elements of xmm0 after this instruction are going to be irrelevant,
        ; and as you can see, the max will be either the first or second element of xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; shuffle : xmm1 = xmm0[1,1,1,1]
        ; xmm0 = [max(x[0], x[2]), max(x[1], x[3]), ...]
        ; xmm1 = [max(x[1], x[3]), max(x[1], x[3]), ...]
        ; (other elements omitted since we don't care about them any more)
        vpshufd xmm1, xmm0, 85

        ; packed max of xmm0 and xmm1, store the result in xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; now xmm0 = [max(max(x[0], x[2]), max(x[1], x[3])), ...]

        ; so move the first element of xmm0, which is the final max, into eax
        vmovd   eax, xmm0
        ret

tdaede · 2021-09-15T22:15:35Z

Ah yeah sorry, yeah the first max comparison can only be 4 vs 4.

src/cdef.rs

This also now requires BMI1 and BMI2 for AVX2 in `CpuFeatureLevel`.

KyleSiefring · 2021-09-16T02:00:14Z

I'm confused. Isn't this already optimized from the assembly called from here? https://github.com/xiph/rav1e/blob/master/src/asm/x86/cdef.rs#L195

redzic · 2021-09-16T03:52:14Z

@KyleSiefring I think you're right, I didn't notice this.

If I'm not mistaken, the actual assembly corresponding to this particular function seems to be in here:

rav1e/src/x86/cdef_avx2.asm

Lines 1773 to 1785 in 2ec4e67

    
               pmaxsd         xm2, xm0, xm1 
        
               pshufd         xm3, xm2, q1032 
        
               pmaxsd         xm2, xm3 
        
               pshufd         xm3, xm2, q2301 
        
               pmaxsd         xm2, xm3 ; best cost 
        
               ; find the idx using minpos 
        
               ; make everything other than the best cost negative via subtraction 
        
               ; find the min of unsigned 16-bit ints to sort out the negative values 
        
               psubd          xm4, xm1, xm2 
        
               psubd          xm3, xm0, xm2 
        
               packssdw       xm3, xm4 
        
               phminposuw     xm3, xm3

However, I think there could still be at least some value for making first_max_element in the rust code take a &[i32; 8], as it allows the compiler to eliminate some bounds checks and generate less code.

lu-zero · 2021-09-16T09:46:04Z

Could you please split the scalar &[i32; 8] part from the patch?

Your simd code would be good in the not-nasm situation but maybe it would need further discussion.

tdaede · 2021-09-17T20:55:55Z

Could we turn this into an implementation for a non-nasm platform, e.g. ARM/PPC/RISC-V, that doesn't implement CDEF yet?

redzic · 2021-09-18T00:39:53Z

We probably can, I just have to figure out how to do that. I will mark the PR as WIP for now as I try to learn how to write the SIMD implementation for other platforms.

redzic · 2021-11-16T03:58:06Z

@lu-zero @tdaede Now that std::simd has reached nightly, we could probably implement a portable SIMD version of this function with it, but gate it behind nightly, similar to how aarch64 intrinsics are currently being used.

redzic force-pushed the find-first-max-avx2 branch 3 times, most recently from 248542a to 28cb0a9 Compare September 15, 2021 20:52

redzic mentioned this pull request Sep 15, 2021

cdef: enforce 8 element size for first_max_element #2807

Closed

redzic force-pushed the find-first-max-avx2 branch from 28cb0a9 to db76e3c Compare September 15, 2021 21:18

tdaede approved these changes Sep 15, 2021

View reviewed changes

lu-zero reviewed Sep 15, 2021

View reviewed changes

src/cdef.rs Outdated Show resolved Hide resolved

redzic force-pushed the find-first-max-avx2 branch from db76e3c to 47ad5a4 Compare September 15, 2021 23:00

Add AVX2 implementation of first_max_element

cff1af9

This also now requires BMI1 and BMI2 for AVX2 in `CpuFeatureLevel`.

redzic force-pushed the find-first-max-avx2 branch from 47ad5a4 to cff1af9 Compare September 15, 2021 23:06

redzic marked this pull request as draft September 18, 2021 00:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AVX2 implementation of `first_max_element` #2806

Add AVX2 implementation of `first_max_element` #2806

redzic commented Sep 15, 2021

tdaede commented Sep 15, 2021

tdaede commented Sep 15, 2021

coveralls commented Sep 15, 2021 •

edited

redzic commented Sep 15, 2021 •

edited

tdaede commented Sep 15, 2021

KyleSiefring commented Sep 16, 2021

redzic commented Sep 16, 2021

lu-zero commented Sep 16, 2021

tdaede commented Sep 17, 2021

redzic commented Sep 18, 2021

redzic commented Nov 16, 2021 •

edited

Add AVX2 implementation of first_max_element #2806

Are you sure you want to change the base?

Add AVX2 implementation of first_max_element #2806

Conversation

redzic commented Sep 15, 2021

tdaede commented Sep 15, 2021

tdaede commented Sep 15, 2021

coveralls commented Sep 15, 2021 • edited

redzic commented Sep 15, 2021 • edited

tdaede commented Sep 15, 2021

KyleSiefring commented Sep 16, 2021

redzic commented Sep 16, 2021

lu-zero commented Sep 16, 2021

tdaede commented Sep 17, 2021

redzic commented Sep 18, 2021

redzic commented Nov 16, 2021 • edited

Add AVX2 implementation of `first_max_element` #2806

Add AVX2 implementation of `first_max_element` #2806

coveralls commented Sep 15, 2021 •

edited

redzic commented Sep 15, 2021 •

edited

redzic commented Nov 16, 2021 •

edited