
Conversation

@AntoinePrv
Contributor

@AntoinePrv AntoinePrv commented Nov 11, 2025

  • Add an AVX2 implementation of constant-mask swizzle for uint8_t and uint16_t
  • Remove the all_different limitation from the uint32_t version, add an is_identity case
  • No significant changes for uint64_t; mainly bring it closer to the uint32_t implementation
  • Remove unused utility functions (is_all_different, is_no_duplicates).

As in the dynamic version, all functions are implemented for sized unsigned integer types, and the other types forward to the implementation of the matching size.
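For reference, a minimal usage sketch of the new constant-mask path (the mask values are arbitrary, and the batch_constant<T, Arch, Vals...> spelling is assumed from the current interface):

#include <xsimd/xsimd.hpp>
#include <cstdint>

using A = xsimd::avx2;

// Reverse the 16 uint16_t elements of an AVX2 batch with a compile-time mask;
// this exercises the new uint16_t constant-mask kernel.
xsimd::batch<uint16_t, A> reverse16(xsimd::batch<uint16_t, A> x)
{
    constexpr auto mask = xsimd::batch_constant<uint16_t, A,
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0> {};
    return xsimd::swizzle(x, mask);
}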

@AntoinePrv AntoinePrv force-pushed the swizzle-avx2 branch 5 times, most recently from 809f983 to cff4b80 on November 12, 2025 22:19
@AntoinePrv AntoinePrv changed the title Improve Avx2 swizzle Add Avx2 constant mask swizzle 8/16 and improve 32/64 Nov 12, 2025
@AntoinePrv AntoinePrv marked this pull request as ready for review November 12, 2025 22:30
@AntoinePrv
Contributor Author

Ping @serge-sans-paille, @JohanMabille, @DiamonDinoia and @pitrou, following the conversation in GH-1141.

@AntoinePrv
Contributor Author

AntoinePrv commented Nov 12, 2025

I'm surprised at that last failure, since it was green before and seems unrelated to these last changes.

I thought the "lane" logic was AVX specific but I see it is also there in SSE2 and AVX512 (where there are in fact four lanes). We should write a meta-algorithm that can be specialized with each intrinsics but this is a longer shot. At least we'll have these in for the release.
For uint8_t can also improve the is_dup_lo is_dup_hi logic with smaller sizes that can be permuted.
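For context, the cross-lane issue for uint8_t on AVX2 boils down to something like this (simplified sketch, not the exact code in the PR; it assumes the usual convention where a mask byte with its high bit set zeroes the corresponding result byte):

#include <immintrin.h>

// _mm256_shuffle_epi8 only picks bytes within each 128-bit lane, so a general
// byte swizzle needs the lane-swapped input as a second source plus two
// disjoint shuffle masks (one per source), combined with an OR.
inline __m256i swizzle_u8_general(__m256i self, __m256i same_lane_mask, __m256i other_lane_mask)
{
    __m256i swapped = _mm256_permute2x128_si256(self, self, 0x01); // [high | low]
    __m256i r0 = _mm256_shuffle_epi8(self, same_lane_mask);        // indices resolved from the same lane
    __m256i r1 = _mm256_shuffle_epi8(swapped, other_lane_mask);    // indices resolved from the other lane
    return _mm256_or_si256(r0, r1);                                // masks are disjoint, OR merges them
}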

{
const auto self_bytes = bitwise_cast<uint8_t>(self);
// If a mask entry is k, we want 2k in low byte and 2k+1 in high byte
auto constexpr mask_2k_2kp1 = batch_constant<
Contributor

I like using auto too, but the codebase seems to do `constexpr batch_constant<...> mask_2k_2kp1 {};`.

Contributor

That's fine, we can improve the code base!

return _mm256_permute_pd(self, imm);
constexpr auto lane_mask = mask % make_batch_constant<uint32_t, (mask.size / 2), A>();
// Cheaper intrinsics when not crossing lanes
// We could also use _mm256_permute_ps which uses an imm8 constant, though it has the
Contributor

Then why not use _mm256_permute_ps? I think there is a mask() method that returns an immediate? Or did I add it in masked_memory_ops?

Contributor

And this one uses one more register, and one extra load to build the mask, so yeah, why?

Contributor Author

why not use _mm256_permute_ps

They had the same latency so I was not sure the immediate was better. I can make the change!

Contributor

@serge-sans-paille serge-sans-paille left a comment

Thanks for the extra optimization! I assume you compared the generated assembly with GCC and clang.

I agree we need a generalization step for per-lane swizzle, and I also agree to postpone this to « after the release ». I will handle this.

}

template <typename T>
constexpr bool swizzle_val_is_defined(T val, T size)
Contributor

Note for my future self: I wonder if we should assert that val is always defined. I'm not quite sure we have portable behavior across architectures if that's not the case.
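A standalone sketch of what such an assertion could look like (illustration only, not the PR's code; it assumes "defined" means the index is strictly below the batch size):

#include <cstdint>

// Same predicate as above, under the in-range assumption.
template <typename T>
constexpr bool swizzle_val_is_defined(T val, T size)
{
    return val < size;
}

// Hypothetical compile-time check over a whole constant mask (C++14 constexpr).
template <typename T, T... Vals>
constexpr bool swizzle_mask_is_defined(T size)
{
    bool ok = true;
    T vals[] = { Vals... };
    for (T v : vals)
        ok = ok && swizzle_val_is_defined(v, size);
    return ok;
}

// Usage:
// static_assert(swizzle_mask_is_defined<uint8_t, 3, 2, 1, 0>(4), "undefined swizzle index");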


@AntoinePrv
Contributor Author

I assume you compared the generated assembly with GCC and clang.

@serge-sans-paille in fact no (I'm bad at it), my changes are based on the number of intrinsics called and their reported latency (the dispatch being done at compile time).
Do you recommend a particular way of doing it?

@DiamonDinoia
Contributor

I assume you compared the generated assembly with GCC and clang.

@serge-sans-paille in fact no (I'm bad at it), my changes are based on the number of intrinsics called and their reported latency (the dispatch being done at compile time). Do you recommend a particular way of doing it?

I run compiler explorer locally on my machine.

Otherwise breakpoint + gdb also works.

@serge-sans-paille
Contributor

I assume you compared the generated assembly with GCC and clang.

@serge-sans-paille in fact no (I'm bad at it), my changes are based on the number of intrinsics called and their reported latency (the dispatch being done at compile time). Do you recommend a particular way of doing it?

If you're a decent assembly reader, just dump the assembly from your source code. You can take inspiration from this:

printf '#include <xsimd/xsimd.hpp>\nauto foo(xsimd::batch<float> a, xsimd::batch<float> b) { return xsimd::avg(a, b); }' | clang++ -mavx2 -O2 -S -xc++ -o- - -Iinclude -DNDEBUG | pygmentize -lasm
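For the swizzle under discussion, the same pipeline works with a probe along these lines (the mask values are just an example, and the batch_constant<T, Arch, Vals...> spelling is assumed):

printf '#include <xsimd/xsimd.hpp>\nauto foo(xsimd::batch<uint32_t> a) { return xsimd::swizzle(a, xsimd::batch_constant<uint32_t, xsimd::default_arch, 7, 6, 5, 4, 3, 2, 1, 0>{}); }' | clang++ -mavx2 -O2 -S -xc++ -o- - -Iinclude -DNDEBUG | pygmentize -lasm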

@AntoinePrv AntoinePrv marked this pull request as draft November 13, 2025 22:15
@AntoinePrv
Contributor Author

AntoinePrv commented Nov 13, 2025

For many of them, I unsurprisingly see the instruction associated with the intrinsic.

For the most complex one, the uint8_t general case, clang does not seem to respect the intrinsics:

C++ source

__m256i swapped = _mm256_permute2x128_si256(self, self, 0x01); // [high | low]
constexpr auto self_mask = detail::swizzle_make_self_batch<uint8_t, A, Vals...>();
constexpr auto cross_mask = detail::swizzle_make_cross_batch<uint8_t, A, Vals...>();
__m256i r0 = _mm256_shuffle_epi8(self, self_mask.as_batch());
__m256i r1 = _mm256_shuffle_epi8(swapped, cross_mask.as_batch());
return _mm256_or_si256(r0, r1);

gcc-15 (good instructions)

.cfi_startproc
vperm2i128      $1, %ymm0, %ymm0, %ymm1
vpshufb .LC0(%rip), %ymm0, %ymm0
vpshufb .LC1(%rip), %ymm1, %ymm1
vpor    %ymm1, %ymm0, %ymm0
ret
gcc-15 dynamic version
.cfi_startproc
movl    $252645135, %eax
vperm2i128      $1, %ymm0, %ymm0, %ymm3
vmovd   %eax, %xmm2
movl    $269488144, %eax
vpbroadcastd    %xmm2, %ymm2
vpand   %ymm2, %ymm1, %ymm2
vpshufb %ymm2, %ymm3, %ymm3
vpshufb %ymm2, %ymm0, %ymm0
vmovd   %eax, %xmm2
vpbroadcastd    %xmm2, %ymm2
vpand   %ymm2, %ymm1, %ymm1
vpcmpeqb        .LC2(%rip), %ymm1, %ymm1
vpblendvb       %ymm1, %ymm0, %ymm3, %ymm0
ret
.cfi_endproc

clang-21 (not the instructions asked for)

.cfi_startproc
# %bb.0:
pushq   %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq    %rsp, %rbp
.cfi_def_cfa_register %rbp
andq    $-32, %rsp
subq    $32, %rsp
movq    %rdi, %rax
vinserti128     $1, 16(%rbp), %ymm0, %ymm0
vmovdqa 16(%rbp), %ymm1
vmovdqa .LCPI0_0(%rip), %ymm2           # ymm2 = [0,0,0,0,u,u,u,u,u,u,u,u,u,u,u,u,255,255,255,255,u,u,255,255,u,u,u,u,255,255,0,0]
vpblendvb       %ymm2, %ymm0, %ymm1, %ymm0
vpshufb .LCPI0_1(%rip), %ymm0, %ymm0    # ymm0 = ymm0[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,30,30,30,30,30,30,30,30,30,30,30,29,23,19,19,16]
vmovdqa %ymm0, (%rdi)
movq    %rbp, %rsp
popq    %rbp
.cfi_def_cfa %rsp, 8
vzeroupper
retq
clang-21 dynamic version
.cfi_startproc
# %bb.0:
pushq   %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq    %rsp, %rbp
.cfi_def_cfa_register %rbp
andq    $-32, %rsp
subq    $32, %rsp
movq    %rdi, %rax
vmovdqa 48(%rbp), %ymm0
vmovdqa 16(%rbp), %ymm1
vpermq  $78, %ymm1, %ymm2               # ymm2 = ymm1[2,3,0,1]
vpand   .LCPI0_0(%rip), %ymm0, %ymm3
vpshufb %ymm3, %ymm1, %ymm1
vpshufb %ymm3, %ymm2, %ymm2
vpand   .LCPI0_1(%rip), %ymm0, %ymm0
vpcmpeqb        .LCPI0_2(%rip), %ymm0, %ymm0
vpblendvb       %ymm0, %ymm1, %ymm2, %ymm0
vmovdqa %ymm0, (%rdi)
movq    %rbp, %rsp
popq    %rbp
.cfi_def_cfa %rsp, 8
vzeroupper
retq

@AntoinePrv AntoinePrv marked this pull request as ready for review November 13, 2025 23:54
@DiamonDinoia
Contributor

DiamonDinoia commented Nov 14, 2025 via email

@serge-sans-paille
Contributor

It's generally fine for a compiler to further optimize a kernel we described in "the best way possible". And, as suggested by @DiamonDinoia, if that's an issue we can open a bug against the upstream compiler.

clang represents vector intrinsics in a generic way in LLVM IR, optimizes them, and then performs instruction selection to generate assembly; it may happen that the end result is worse than what we expressed in the beginning. Or better :-)

Tracking the quality of the generated code for each compiler and each compiler version is a tremendous undertaking; we make a best effort on that front.

@serge-sans-paille
Contributor

BTW, @AntoinePrv, if you're happy with the result (I am), can you squash your commit stack and write a proper aggregated commit message so I can merge? Thanks!

@AntoinePrv
Contributor Author

@serge-sans-paille All good for me, thanks!

@serge-sans-paille serge-sans-paille merged commit 7f3e01c into xtensor-stack:master Nov 16, 2025
58 of 60 checks passed