How to generate more efficient SIMD code #2632
The following code simulates _mm256_movemask_ps. Although it runs correctly, its execution overhead is higher than that of _mm256_movemask_ps. Is there a more efficient way to write the code so that it matches the performance of _mm256_movemask_ps? For detailed timing comparisons, please see the repository: https://github.com/zengdelang/ISPC_Sort
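For readers unfamiliar with the intrinsic, here is a minimal scalar sketch in C of what _mm256_movemask_ps computes: bit i of the result is the sign bit of lane i. This is not the code from the question. The function name movemask_ps_ref is hypothetical, and the sketch assumes float is a 32-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm256_movemask_ps (hypothetical helper name).
   Assumes float is a 32-bit IEEE 754 value. */
static int movemask_ps_ref(const float v[8])
{
    int mask = 0;
    for (int i = 0; i < 8; ++i) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);  /* read the raw bit pattern */
        mask |= (int)(bits >> 31) << i;     /* sign bit of lane i -> bit i */
    }
    return mask;
}
```

Note that the sign bit is taken literally, so a lane holding -0.0f contributes a set bit even though it compares equal to zero.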
Replies: 2 comments 1 reply
Another question: I need to simulate the _mm512_mask_compressstoreu_pd instruction on the AVX-512 instruction set to implement the partition function of quicksort. Can the following simulation code be compiled into the _mm512_mask_compressstoreu_pd instruction?
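As a point of reference, the behavior of _mm512_mask_compressstoreu_pd can be sketched in scalar C: the lanes whose mask bit is set are written contiguously to the destination, in lane order. This is only a model of the semantics, not the simulation code from the question. The name compress_store_pd_ref and the returned element count are assumptions added for illustration; the intrinsic itself returns nothing.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of _mm512_mask_compressstoreu_pd (hypothetical helper name).
   Writes the selected lanes of src back-to-back into dst and returns how
   many were written. The return value is a convenience for this sketch. */
static size_t compress_store_pd_ref(double *dst, uint8_t mask,
                                    const double src[8])
{
    size_t n = 0;
    for (int i = 0; i < 8; ++i) {
        if (mask & (1u << i))
            dst[n++] = src[i];  /* active lane: append to the output */
    }
    return n;
}
```

In a quicksort partition, the mask would typically come from comparing each lane against the pivot, and the returned count would advance the output pointer.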
I am not sure I follow the details, but the quoted part looks like it can be expressed using packmask.
Compiling with the command:
$ ispc -O2 --target=avx2-i8x32 example.ispc -o example.o
generates the following code:
foo: # @foo
vmovdqu ymm0, ymmword ptr [rdi]
vpminub ymm1, ymm0, ymmword ptr [rsi]
vpcmpeqb ymm0, ymm0, ymm1
vpmovmskb eax, ymm0
not eax
vzeroupper
ret
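To make the listing easier to follow, here is a scalar C sketch of what the generated instructions compute, step by step. It assumes that rdi and rsi point to two arrays of 32 unsigned bytes (called a and b here), which the listing suggests but does not state. The function name gt_mask_u8x32_ref is hypothetical.

```c
#include <stdint.h>

/* Scalar model of the assembly above (hypothetical helper name).
   Assumes a and b are the 32-byte arrays pointed to by rdi and rsi. */
static uint32_t gt_mask_u8x32_ref(const uint8_t a[32], const uint8_t b[32])
{
    uint32_t mask = 0;
    for (int i = 0; i < 32; ++i) {
        uint8_t m = a[i] < b[i] ? a[i] : b[i];  /* vpminub: unsigned min */
        uint32_t eq = (a[i] == m);              /* vpcmpeqb: true when a <= b */
        mask |= eq << i;                        /* vpmovmskb: one bit per byte */
    }
    return ~mask;                               /* not eax: bit set when a > b */
}
```

Under those assumptions, bit i of the result is set exactly when a[i] > b[i] as unsigned bytes, so the whole comparison and mask extraction takes a handful of instructions.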