Add Avx2 constant mask swizzle 8/16 and improve 32/64 #1201
Conversation
Force-pushed from 809f983 to cff4b80 (compare)
Ping @serge-sans-paille, @JohanMabille, @DiamonDinoia and @pitrou, following the conversation in GH-1141.

I'm surprised by that last failure, since it was green before and seems unrelated to these last changes. I thought the "lane" logic was AVX-specific, but I see it is also present in SSE2 and AVX512 (where there are in fact four lanes). We should write a meta-algorithm that can be specialized with each architecture's intrinsics, but that is a longer-term effort. At least we'll have these in for the release.
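For context, a minimal sketch of the kind of lane-aware meta-algorithm mentioned above, assuming a hypothetical free function and a plain index array rather than xsimd's `batch_constant`; a real generalization would have to plug in each architecture's lane width and intrinsics:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not xsimd code): does a constant swizzle mask pull data
// across lane boundaries? ElemsPerLane is lane_width / sizeof(element), e.g.
// 4 for uint32_t with 128-bit lanes (SSE/AVX); the same per-lane logic would
// apply to AVX512's four lanes.
template <typename T, std::size_t N, std::size_t ElemsPerLane>
constexpr bool crosses_lanes(const T (&mask)[N])
{
    for (std::size_t i = 0; i < N; ++i)
    {
        if (mask[i] / ElemsPerLane != i / ElemsPerLane)
            return true; // element i reads from another lane
    }
    return false;
}

// Example: reversing each 128-bit lane of 8 x uint32_t stays within lanes.
constexpr std::uint32_t in_lane_reverse[8] = { 3, 2, 1, 0, 7, 6, 5, 4 };
static_assert(!crosses_lanes<std::uint32_t, 8, 4>(in_lane_reverse),
              "per-lane pattern, cheaper intrinsics apply");
```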
```cpp
{
    const auto self_bytes = bitwise_cast<uint8_t>(self);
    // If a mask entry is k, we want 2k in low byte and 2k+1 in high byte
    auto constexpr mask_2k_2kp1 = batch_constant<
```
I like using auto too. But the codebase seems to do `constexpr batch_constant<....> mask_2k_2kp1 {};`
That's fine, we can improve the code base!
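As a side note, the `2k` / `2k+1` expansion mentioned in the snippet above can be illustrated with plain intrinsics. This is only a sketch of the idea (SSE, single lane, hypothetical function name), not the PR's implementation:

```cpp
#include <immintrin.h> // SSSE3: _mm_shuffle_epi8

// Sketch: swizzle the eight 16-bit lanes of an SSE register with compile-time
// indices I0..I7 (each in [0, 8)) by expanding each index k into the byte pair
// (2k, 2k+1), so a single byte shuffle does the job.
template <int I0, int I1, int I2, int I3, int I4, int I5, int I6, int I7>
__m128i swizzle_u16_sse(__m128i self)
{
    const __m128i byte_mask = _mm_setr_epi8(
        2 * I0, 2 * I0 + 1, 2 * I1, 2 * I1 + 1,
        2 * I2, 2 * I2 + 1, 2 * I3, 2 * I3 + 1,
        2 * I4, 2 * I4 + 1, 2 * I5, 2 * I5 + 1,
        2 * I6, 2 * I6 + 1, 2 * I7, 2 * I7 + 1);
    return _mm_shuffle_epi8(self, byte_mask);
}
```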
include/xsimd/arch/xsimd_avx2.hpp (Outdated)

```cpp
return _mm256_permute_pd(self, imm);
constexpr auto lane_mask = mask % make_batch_constant<uint32_t, (mask.size / 2), A>();
// Cheaper intrinsics when not crossing lanes
// We could also use _mm256_permute_ps which uses a imm8 constant, though it has the
```
Then why not use `_mm256_permute_ps`? I think there is a mask() method that returns an immediate? Or did I add it in masked_memory_ops?
And this one uses one more register, and one extra load to build the mask, so yeah, why?
> Then why not use `_mm256_permute_ps`?
They had the same latency so I was not sure the immediate was better. I can make the change!
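For reference, a hedged sketch of what the immediate form could look like (hypothetical helper, not the PR's code): when an in-lane float swizzle uses the same pattern in both 128-bit lanes, the four 2-bit indices pack into an `imm8`, so no extra register or load is needed to materialize the mask.

```cpp
#include <immintrin.h>

// Sketch: in-lane permute of 8 floats where both 128-bit lanes use the same
// pattern I0..I3. The indices are packed into an immediate, so
// _mm256_permute_ps needs no mask register, unlike _mm256_permutevar_ps.
template <unsigned I0, unsigned I1, unsigned I2, unsigned I3>
__m256 permute_in_lane_ps(__m256 self)
{
    static_assert(I0 < 4 && I1 < 4 && I2 < 4 && I3 < 4, "in-lane indices only");
    constexpr int imm = int(I0 | (I1 << 2) | (I2 << 4) | (I3 << 6));
    return _mm256_permute_ps(self, imm);
}
```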
serge-sans-paille left a comment
Thanks for the extra optimization! I assume you compared the generated assembly with GCC and clang.
I agree we need a generalization step for per-lane swizzle, and I also agree to postpone this to « after the release ». I will handle this.
```cpp
}

template <typename T>
constexpr bool swizzle_val_is_defined(T val, T size)
```
Note for my future self: I wonder if we should assert that val is always defined. I'm not quite sure we have a portable behavior across architectures if that's not the case.
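A possible shape for such an assertion, as a sketch only (assuming C++14, a plausible in-range body for the predicate since the diff excerpt does not show it, and a hypothetical `swizzle_all_defined` wrapper):

```cpp
#include <cstddef>
#include <cstdint>

// Plausible body for the predicate in the diff: an index is "defined" when it
// is in range. The actual definition is not shown in the excerpt above.
template <typename T>
constexpr bool swizzle_val_is_defined(T val, T size)
{
    return val < size;
}

// Hypothetical aggregate check over a (non-empty) constant index pack, usable
// in a static_assert so out-of-range indices fail at compile time instead of
// falling back to architecture-specific shuffle behavior.
template <typename T, T... Vals>
constexpr bool swizzle_all_defined(T size)
{
    const T vals[] = { Vals... };
    for (std::size_t i = 0; i < sizeof...(Vals); ++i)
    {
        if (!swizzle_val_is_defined(vals[i], size))
            return false;
    }
    return true;
}

// e.g. static_assert(swizzle_all_defined<std::uint32_t, Vals...>(8),
//                    "swizzle index out of range");
```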
@serge-sans-paille In fact no (I'm bad at it); my changes are based on the number of intrinsics called and their reported latency (the dispatch being done at compile time).

I run Compiler Explorer locally on my machine. Otherwise breakpoint + gdb also works.

If you're a decent assembly reader, just dump the assembly from your source code. You can take inspiration from this:
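As a generic illustration (hypothetical stub; the `batch_constant` signature follows the snippets in this PR and may differ across xsimd versions), the kernel can be wrapped in a plain non-template function and compiled with `g++ -O3 -mavx2 -S` or `clang++ -O3 -mavx2 -S` so its assembly is easy to locate:

```cpp
#include <cstdint>
#include <xsimd/xsimd.hpp>

// Hypothetical stub to inspect code generation; the constant mask below is
// just an arbitrary lane-crossing example.
xsimd::batch<std::uint8_t, xsimd::avx2>
swizzle_u8_under_test(xsimd::batch<std::uint8_t, xsimd::avx2> const& x)
{
    constexpr xsimd::batch_constant<std::uint8_t, xsimd::avx2,
        0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,
        30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 29, 23, 19, 19, 16>
        mask {};
    return xsimd::swizzle(x, mask);
}
```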
For many of them, I unsurprisingly see the instruction associated with the intrinsic. For the most complex one, the `uint8_t` general case, clang seems to not respect the intrinsics:

```cpp
__m256i swapped = _mm256_permute2x128_si256(self, self, 0x01); // [high | low]
constexpr auto self_mask = detail::swizzle_make_self_batch<uint8_t, A, Vals...>();
constexpr auto cross_mask = detail::swizzle_make_cross_batch<uint8_t, A, Vals...>();
__m256i r0 = _mm256_shuffle_epi8(self, self_mask.as_batch());
__m256i r1 = _mm256_shuffle_epi8(swapped, cross_mask.as_batch());
return _mm256_or_si256(r0, r1);
```

gcc-15 (good instructions):

```asm
.cfi_startproc
vperm2i128 $1, %ymm0, %ymm0, %ymm1
vpshufb .LC0(%rip), %ymm0, %ymm0
vpshufb .LC1(%rip), %ymm1, %ymm1
vpor %ymm1, %ymm0, %ymm0
ret
```

gcc-15 dynamic version:

```asm
.cfi_startproc
movl $252645135, %eax
vperm2i128 $1, %ymm0, %ymm0, %ymm3
vmovd %eax, %xmm2
movl $269488144, %eax
vpbroadcastd %xmm2, %ymm2
vpand %ymm2, %ymm1, %ymm2
vpshufb %ymm2, %ymm3, %ymm3
vpshufb %ymm2, %ymm0, %ymm0
vmovd %eax, %xmm2
vpbroadcastd %xmm2, %ymm2
vpand %ymm2, %ymm1, %ymm1
vpcmpeqb .LC2(%rip), %ymm1, %ymm1
vpblendvb %ymm1, %ymm0, %ymm3, %ymm0
ret
.cfi_endproc
```

clang-21 (not the instructions asked for):

```asm
.cfi_startproc
# %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
andq $-32, %rsp
subq $32, %rsp
movq %rdi, %rax
vinserti128 $1, 16(%rbp), %ymm0, %ymm0
vmovdqa 16(%rbp), %ymm1
vmovdqa .LCPI0_0(%rip), %ymm2 # ymm2 = [0,0,0,0,u,u,u,u,u,u,u,u,u,u,u,u,255,255,255,255,u,u,255,255,u,u,u,u,255,255,0,0]
vpblendvb %ymm2, %ymm0, %ymm1, %ymm0
vpshufb .LCPI0_1(%rip), %ymm0, %ymm0 # ymm0 = ymm0[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,30,30,30,30,30,30,30,30,30,30,30,29,23,19,19,16]
vmovdqa %ymm0, (%rdi)
movq %rbp, %rsp
popq %rbp
.cfi_def_cfa %rsp, 8
vzeroupper
retq
```

clang-21 dynamic version:

```asm
.cfi_startproc
# %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
andq $-32, %rsp
subq $32, %rsp
movq %rdi, %rax
vmovdqa 48(%rbp), %ymm0
vmovdqa 16(%rbp), %ymm1
vpermq $78, %ymm1, %ymm2 # ymm2 = ymm1[2,3,0,1]
vpand .LCPI0_0(%rip), %ymm0, %ymm3
vpshufb %ymm3, %ymm1, %ymm1
vpshufb %ymm3, %ymm2, %ymm2
vpand .LCPI0_1(%rip), %ymm0, %ymm0
vpcmpeqb .LCPI0_2(%rip), %ymm0, %ymm0
vpblendvb %ymm0, %ymm1, %ymm2, %ymm0
vmovdqa %ymm0, (%rdi)
movq %rbp, %rsp
popq %rbp
.cfi_def_cfa %rsp, 8
vzeroupper
retq
```
Force-pushed from ebc033e to cce8173 (compare)
In this case, I usually write a gtest stub to measure the performance between compilers; if clang is much slower, I look for a workaround or raise the issue with LLVM.
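As an illustration of that kind of stub (a rough sketch using plain std::chrono rather than gtest or a benchmark framework; the xsimd calls follow the current API and the constant mask is arbitrary):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <xsimd/xsimd.hpp>

// Rough timing stub: run the constant-mask swizzle over a buffer and compare
// the wall-clock numbers produced by different compilers.
int main()
{
    using batch = xsimd::batch<std::uint8_t, xsimd::avx2>;
    constexpr xsimd::batch_constant<std::uint8_t, xsimd::avx2,
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0>
        reverse {};

    std::vector<std::uint8_t> data(1 << 22, 1);
    batch acc(std::uint8_t(0));

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i + batch::size <= data.size(); i += batch::size)
        acc = acc ^ xsimd::swizzle(batch::load_unaligned(&data[i]), reverse);
    auto t1 = std::chrono::steady_clock::now();

    acc.store_unaligned(data.data()); // keep the result observable
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%lld us (checksum byte: %u)\n", (long long)us, unsigned(data[0]));
    return 0;
}
```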
It's generally fine for a compiler to further optimize a kernel we described in "the best way possible", and as suggested by @DiamonDinoia, if that's an issue we can open a bug on the upstream compiler. Because clang represents vector intrinsics in a generic way in LLVM IR, optimizes them, and then performs instruction selection to generate assembly, it may happen that the end result is worse than what we expressed in the beginning. Or better :-) Tracking the quality of the generated code for each compiler and each compiler version is a tremendous goal; we make a best effort on that topic.
BTW, @AntoinePrv, if you're happy with the result (I am), can you squash your commit stack and write a proper aggregated commit message so I can merge? Thanks!
Force-pushed from cce8173 to 88e88a1 (compare)
@serge-sans-paille All good for me, thanks!
Merged commit 7f3e01c into xtensor-stack:master
Summary of the changes:

- `uint8_t`, `uint16_t`: add the constant-mask swizzle.
- `uint32_t`: revisit the `all_different` limitation and add an `is_identity` case.
- `uint64_t`: make it closer to the `uint32_t` implementation (`is_all_different`, `is_no_duplicates`).
- Similar to the dynamic version, all functions are implemented for a sized uint, and other types forward to the proper size (see the sketch below).
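To illustrate the forwarding scheme described in the last point (only a sketch, assuming `bitwise_cast` and the constant-mask `swizzle` behave as in the snippets above; the function name is made up):

```cpp
#include <cstdint>
#include <xsimd/xsimd.hpp>

// Sketch of "other types forward to the proper size": a float swizzle is
// implemented by reinterpreting to the same-width unsigned integer, swizzling
// there, and reinterpreting back.
template <class A, std::uint32_t... Vals>
xsimd::batch<float, A> swizzle_float_via_uint32(
    xsimd::batch<float, A> const& self,
    xsimd::batch_constant<std::uint32_t, A, Vals...> mask)
{
    const auto as_u32 = xsimd::bitwise_cast<std::uint32_t>(self);
    return xsimd::bitwise_cast<float>(xsimd::swizzle(as_u32, mask));
}
```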