
Conversation

@AntoinePrv
Contributor

@AntoinePrv AntoinePrv commented Nov 11, 2025

  • Add an AVX2 implementation of constant-mask swizzle for uint8_t and uint16_t
  • Remove the all_different limitation from the uint32_t version, add an is_identity case
  • No significant changes for uint64_t; mainly bring it closer to the uint32_t implementation
  • Remove unused utility functions (is_all_different, is_no_duplicates).

As in the dynamic version, all functions are implemented for sized unsigned integer types, and the other types forward to the implementation of the matching size.
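For reference, a minimal usage sketch of the new constant-mask path (the mask values are arbitrary, and the batch_constant<T, Arch, Vals...> spelling is assumed from the current interface):

#include <xsimd/xsimd.hpp>
#include <cstdint>

using A = xsimd::avx2;

// Reverse the 16 uint16_t elements of an AVX2 batch with a compile-time mask;
// this exercises the new uint16_t constant-mask kernel.
xsimd::batch<uint16_t, A> reverse16(xsimd::batch<uint16_t, A> x)
{
    constexpr auto mask = xsimd::batch_constant<uint16_t, A,
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0> {};
    return xsimd::swizzle(x, mask);
}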

@AntoinePrv AntoinePrv force-pushed the swizzle-avx2 branch 5 times, most recently from 809f983 to cff4b80 on November 12, 2025 22:19
@AntoinePrv AntoinePrv changed the title Improve Avx2 swizzle Add Avx2 constant mask swizzle 8/16 and improve 32/64 Nov 12, 2025
@AntoinePrv AntoinePrv marked this pull request as ready for review November 12, 2025 22:30
@AntoinePrv
Contributor Author

Ping @serge-sans-paille, @JohanMabille, @DiamonDinoia and @pitrou, following the conversation in GH-1141.

@AntoinePrv
Contributor Author

AntoinePrv commented Nov 12, 2025

I'm surprised at that last failure, since it was green before and seems unrelated to these last changes.

I thought the "lane" logic was AVX specific but I see it is also there in SSE2 and AVX512 (where there are in fact four lanes). We should write a meta-algorithm that can be specialized with each intrinsics but this is a longer shot. At least we'll have these in for the release.
For uint8_t can also improve the is_dup_lo is_dup_hi logic with smaller sizes that can be permuted.
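For context, the cross-lane issue for uint8_t on AVX2 boils down to something like this (simplified sketch, not the exact code in the PR; it assumes the usual convention where a mask byte with its high bit set zeroes the corresponding result byte):

#include <immintrin.h>

// _mm256_shuffle_epi8 only picks bytes within each 128-bit lane, so a general
// byte swizzle needs the lane-swapped input as a second source plus two
// disjoint shuffle masks (one per source), combined with an OR.
inline __m256i swizzle_u8_general(__m256i self, __m256i same_lane_mask, __m256i other_lane_mask)
{
    __m256i swapped = _mm256_permute2x128_si256(self, self, 0x01); // [high | low]
    __m256i r0 = _mm256_shuffle_epi8(self, same_lane_mask);        // indices resolved from the same lane
    __m256i r1 = _mm256_shuffle_epi8(swapped, other_lane_mask);    // indices resolved from the other lane
    return _mm256_or_si256(r0, r1);                                // masks are disjoint, OR merges them
}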

{
const auto self_bytes = bitwise_cast<uint8_t>(self);
// If a mask entry is k, we want 2k in low byte and 2k+1 in high byte
auto constexpr mask_2k_2kp1 = batch_constant<
Contributor

I like using auto too, but the codebase seems to do `constexpr batch_constant<...> mask_2k_2kp1 {};`.

Contributor

That's fine, we can improve the code base!

return _mm256_permute_pd(self, imm);
constexpr auto lane_mask = mask % make_batch_constant<uint32_t, (mask.size / 2), A>();
// Cheaper intrinsics when not crossing lanes
// We could also use _mm256_permute_ps which uses an imm8 constant, though it has the
Contributor

Then why not use _mm256_permute_ps? I think there is a mask() method that returns an immediate? Or did I add it in masked_memory_ops?

Contributor

And this one uses one more register, and one extra load to build the mask, so yeah, why?

Contributor Author

why not use _mm256_permute_ps

They had the same latency so I was not sure the immediate was better. I can make the change!

Contributor

@serge-sans-paille serge-sans-paille left a comment

Thanks for the extra optimization! I assume you compared the generated assembly with GCC and clang.

I agree we need a generalization step for per-lane swizzle, and I also agree to postpone this to « after the release ». I will handle this.

}

template <typename T>
constexpr bool swizzle_val_is_defined(T val, T size)
Contributor

Note for my future self: I wonder if we should assert that val is always defined. I'm not quite sure we have portable behavior across architectures if that's not the case.
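A standalone sketch of what such an assertion could look like (illustration only, not the PR's code; it assumes "defined" means the index is strictly below the batch size):

#include <cstdint>

// Same predicate as above, under the in-range assumption.
template <typename T>
constexpr bool swizzle_val_is_defined(T val, T size)
{
    return val < size;
}

// Hypothetical compile-time check over a whole constant mask (C++14 constexpr).
template <typename T, T... Vals>
constexpr bool swizzle_mask_is_defined(T size)
{
    bool ok = true;
    T vals[] = { Vals... };
    for (T v : vals)
        ok = ok && swizzle_val_is_defined(v, size);
    return ok;
}

// Usage:
// static_assert(swizzle_mask_is_defined<uint8_t, 3, 2, 1, 0>(4), "undefined swizzle index");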


@AntoinePrv
Contributor Author

I assume you compared the generated assembly with GCC and clang.

@serge-sans-paille in fact no (I'm bad at it), my changes are based on the number of intrinsics called and their reported latency (the dispatch being done at compile time).
Do you recommend a particular way of doing it?

@DiamonDinoia
Contributor

I assume you compared the generated assembly with GCC and clang.

@serge-sans-paille in fact no (I'm bad at it), my changes are based on the number of intrinsics called and their reported latency (the dispatch being done at compile time). Do you recommend a particular way of doing it?

I run compiler explorer locally on my machine.

Otherwise breakpoint + gdb also works.

@serge-sans-paille
Contributor

I assume you compared the generated assembly with GCC and clang.

@serge-sans-paille in fact no (I'm bad at it), my changes are based on the number of intrinsics called and their reported latency (the dispatch being done at compile time). Do you recommend a particular way of doing it?

If you're a decent assembly reader, just dump the assembly from your source code. You can take inspiration from this:

printf '#include <xsimd/xsimd.hpp>\nauto foo(xsimd::batch<float> a, xsimd::batch<float> b) { return xsimd::avg(a, b); }' | clang++ -mavx2 -O2 -S -xc++ -o- - -Iinclude -DNDEBUG | pygmentize -lasm
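For the swizzle under discussion, the same pipeline works with a probe along these lines (the mask values are just an example, and the batch_constant<T, Arch, Vals...> spelling is assumed):

printf '#include <xsimd/xsimd.hpp>\nauto foo(xsimd::batch<uint32_t> a) { return xsimd::swizzle(a, xsimd::batch_constant<uint32_t, xsimd::default_arch, 7, 6, 5, 4, 3, 2, 1, 0>{}); }' | clang++ -mavx2 -O2 -S -xc++ -o- - -Iinclude -DNDEBUG | pygmentize -lasm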

@AntoinePrv AntoinePrv marked this pull request as draft November 13, 2025 22:15
@AntoinePrv
Contributor Author

AntoinePrv commented Nov 13, 2025

For many of them, I unsurprisingly see the instruction associated with the intrinsic.

For the most complex one, the uint8_t general case, clang does not seem to respect the intrinsics:

C++ source

__m256i swapped = _mm256_permute2x128_si256(self, self, 0x01); // [high | low]
constexpr auto self_mask = detail::swizzle_make_self_batch<uint8_t, A, Vals...>();
constexpr auto cross_mask = detail::swizzle_make_cross_batch<uint8_t, A, Vals...>();
__m256i r0 = _mm256_shuffle_epi8(self, self_mask.as_batch());
__m256i r1 = _mm256_shuffle_epi8(swapped, cross_mask.as_batch());
return _mm256_or_si256(r0, r1);

gcc-15 (good instructions)

.cfi_startproc
vperm2i128      $1, %ymm0, %ymm0, %ymm1
vpshufb .LC0(%rip), %ymm0, %ymm0
vpshufb .LC1(%rip), %ymm1, %ymm1
vpor    %ymm1, %ymm0, %ymm0
ret
gcc-15 dynamic version
.cfi_startproc
movl    $252645135, %eax
vperm2i128      $1, %ymm0, %ymm0, %ymm3
vmovd   %eax, %xmm2
movl    $269488144, %eax
vpbroadcastd    %xmm2, %ymm2
vpand   %ymm2, %ymm1, %ymm2
vpshufb %ymm2, %ymm3, %ymm3
vpshufb %ymm2, %ymm0, %ymm0
vmovd   %eax, %xmm2
vpbroadcastd    %xmm2, %ymm2
vpand   %ymm2, %ymm1, %ymm1
vpcmpeqb        .LC2(%rip), %ymm1, %ymm1
vpblendvb       %ymm1, %ymm0, %ymm3, %ymm0
ret
.cfi_endproc

clang-21 (not the instructions asked for)

.cfi_startproc
# %bb.0:
pushq   %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq    %rsp, %rbp
.cfi_def_cfa_register %rbp
andq    $-32, %rsp
subq    $32, %rsp
movq    %rdi, %rax
vinserti128     $1, 16(%rbp), %ymm0, %ymm0
vmovdqa 16(%rbp), %ymm1
vmovdqa .LCPI0_0(%rip), %ymm2           # ymm2 = [0,0,0,0,u,u,u,u,u,u,u,u,u,u,u,u,255,255,255,255,u,u,255,255,u,u,u,u,255,255,0,0]
vpblendvb       %ymm2, %ymm0, %ymm1, %ymm0
vpshufb .LCPI0_1(%rip), %ymm0, %ymm0    # ymm0 = ymm0[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,30,30,30,30,30,30,30,30,30,30,30,29,23,19,19,16]
vmovdqa %ymm0, (%rdi)
movq    %rbp, %rsp
popq    %rbp
.cfi_def_cfa %rsp, 8
vzeroupper
retq
clang-21 dynamic version
.cfi_startproc
# %bb.0:
pushq   %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq    %rsp, %rbp
.cfi_def_cfa_register %rbp
andq    $-32, %rsp
subq    $32, %rsp
movq    %rdi, %rax
vmovdqa 48(%rbp), %ymm0
vmovdqa 16(%rbp), %ymm1
vpermq  $78, %ymm1, %ymm2               # ymm2 = ymm1[2,3,0,1]
vpand   .LCPI0_0(%rip), %ymm0, %ymm3
vpshufb %ymm3, %ymm1, %ymm1
vpshufb %ymm3, %ymm2, %ymm2
vpand   .LCPI0_1(%rip), %ymm0, %ymm0
vpcmpeqb        .LCPI0_2(%rip), %ymm0, %ymm0
vpblendvb       %ymm0, %ymm1, %ymm2, %ymm0
vmovdqa %ymm0, (%rdi)
movq    %rbp, %rsp
popq    %rbp
.cfi_def_cfa %rsp, 8
vzeroupper
retq

@AntoinePrv AntoinePrv marked this pull request as ready for review November 13, 2025 23:54
@DiamonDinoia
Contributor

DiamonDinoia commented Nov 14, 2025 via email

@serge-sans-paille
Contributor

It's generally fine for a compiler to further optimize a kernel we described in "the best way possible". And, as suggested by @DiamonDinoia, if that's an issue we can open a bug against the upstream compiler.

clang represents vector intrinsics in a generic way in LLVM IR, optimizes them, and then performs instruction selection to generate assembly; it may happen that the end result is worse than what we expressed in the beginning. Or better :-)

Tracking the quality of the generated code for each compiler and each compiler version is a tremendous undertaking; we make a best effort on that front.

@serge-sans-paille
Contributor

BTW, @AntoinePrv, if you're happy with the result (I am), can you squash your commit stack and write a proper aggregated commit message so I can merge? Thanks!

@AntoinePrv
Contributor Author

@serge-sans-paille All good for me, thanks!

@serge-sans-paille serge-sans-paille merged commit 7f3e01c into xtensor-stack:master Nov 16, 2025
58 of 60 checks passed