feat: add runtime batch_bool mask overloads for load_masked/store_masked #1332

Open

DiamonDinoia wants to merge 2 commits into xtensor-stack:master from DiamonDinoia:feat/dynamic-masks

Conversation

@DiamonDinoia
Contributor

Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment.
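
A rough usage sketch of what the new overloads enable, assuming the batch::load / batch::store member signatures discussed in the review below (the exact public entry points may differ):

#include <xsimd/xsimd.hpp>

// Write back only the lanes selected by a mask computed at runtime.
void clamp_negatives(float* data)
{
    using batch_t = xsimd::batch<float>;
    batch_t v = batch_t::load_unaligned(data);
    xsimd::batch_bool<float> negative = v < 0.f;           // runtime mask from a comparison
    batch_t zero(0.f);
    zero.store(data, negative, xsimd::unaligned_mode {});  // assumed masked-store member: only negative lanes are overwritten
}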

@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch 5 times, most recently from 7484c4b to d5f21c7 on May 1, 2026 at 19:51
@serge-sans-paille
Contributor

Could you split the head / tail part into another PR? This one is already quite dense...

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
Contributor

To make it worse, building a mask is not always a single operation, depending on the target...

Contributor Author

Addressed in [latest push] — switched the common-arch fallback to use mask.get(i) directly instead of materialising mask.mask() once and shifting per lane. The .mask() call is now gone from the hot path on architectures that fall back to common, so the per-target cost of building the bit mask no longer matters here.

// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
Contributor

This array assignment forces everything to zero, while some of the stores are not needed, and the compiler is not able to optimize this away in the generic case.

Contributor Author

Addressed — dropped the value-init {} on the buffer and instead write every lane unconditionally in one pass: buffer[i] = mask.get(i) ? mem[i] : T(0);. No double-write on inactive lanes anymore.
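
For reference, a minimal sketch of the resulting common-arch fallback (the shape is approximate; the real overload also takes the convert<T> and Mode tags shown in the AVX2 snippet below):

template <class T, class A>
XSIMD_INLINE batch<T, A> load_masked(T const* mem, batch_bool<T, A> const& mask, requires_arch<common>) noexcept
{
    constexpr std::size_t size = batch<T, A>::size;
    alignas(A::alignment()) std::array<T, size> buffer;
    for (std::size_t i = 0; i < size; ++i)
        buffer[i] = mask.get(i) ? mem[i] : T(0); // one write per lane, no zero-init pass
    return batch<T, A>::load_aligned(buffer.data());
}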

Comment thread include/xsimd/arch/xsimd_avx2.hpp Outdated
XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>>
load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept
{
using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;
Contributor

Why long long and not int64_t? There's no guarantee that sizeof(long long) == 8.

Contributor Author

long long is what the Intel intrinsic _mm256_maskload_epi64 takes (long long const*). Using int64_t* would force a reinterpret_cast at every call site. Added static_assert(sizeof(long long) == 8, ...) next to the helpers so the assumption is pinned at compile time — the dispatcher uses the pointer-width to pick the right intrinsic.
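
For illustration, the pinned assumption looks roughly like this (helper names follow the detail::maskload mentioned above, but the exact code may differ):

namespace detail
{
    static_assert(sizeof(int) == 4, "AVX2 32-bit maskload assumes 32-bit int");
    static_assert(sizeof(long long) == 8, "_mm256_maskload_epi64 takes long long const*, assumed 64-bit");

    XSIMD_INLINE __m256i maskload(int const* mem, __m256i mask) noexcept
    {
        return _mm256_maskload_epi32(mem, mask);
    }
    XSIMD_INLINE __m256i maskload(long long const* mem, __m256i mask) noexcept
    {
        return _mm256_maskload_epi64(mem, mask);
    }
}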

Comment thread include/xsimd/arch/xsimd_avx2.hpp Outdated
}
else
{
_mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src));
Contributor

OK, I guess that's a constraint of the Intel intrinsic; at least static_assert that sizeof(long long) == 8 and sizeof(int) == 4 if you're using this to distinguish between the two?

Contributor Author

Done — added static_assert(sizeof(int) == 4, ...) and static_assert(sizeof(long long) == 8, ...) next to the detail::maskload/maskstore helpers. Refactored both the constant-mask and runtime-mask AVX2 paths to share these helpers so the size assumption lives in one place. The runtime store now goes through detail::maskstore(reinterpret_cast<int_t*>(mem), ...) like the load already did, removing the open-coded XSIMD_IF_CONSTEXPR(sizeof(T) == 4) branch.
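
The store side mirrors it; something like the following (again, an assumed shape rather than the exact code):

namespace detail
{
    XSIMD_INLINE void maskstore(int* mem, __m256i mask, __m256i src) noexcept
    {
        _mm256_maskstore_epi32(mem, mask, src);
    }
    XSIMD_INLINE void maskstore(long long* mem, __m256i mask, __m256i src) noexcept
    {
        _mm256_maskstore_epi64(mem, mask, src);
    }
}

// so the runtime-mask store body reduces to, roughly:
// detail::maskstore(reinterpret_cast<int_t*>(mem), __m256i(mask), __m256i(src));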

// constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which
// is replication-based and does not correctly express a per-lane
// mask on SVE wider than 128 bits — going through ``as_batch_bool``
// gives the right predicate for every vector width. ``int32``/
Contributor

Do you know if the pmask approach would be faster? If so, we could still if constexpr its usage when the SVE size allows it.

Contributor Author

@DiamonDinoia DiamonDinoia May 5, 2026

Good question — pmask (svdupq_b{8,16,32,64}) is replication-based: it builds a 128-bit chunk and replicates it across the SVE register, so it only expresses a per-lane mask correctly when the SVE vector length is exactly 128 bits. For 256/512/1024/2048-bit SVE it would silently produce the wrong predicate. The current path through as_batch_bool is correct for every VL and lowers to a single cmpne against the integer-domain mask. We could in principle gate pmask on __ARM_FEATURE_SVE_BITS == 128, but on my qemu build that case lowers to the same ptrue + cmpne pair, so it would just be conditional code without a measured win. Happy to add the gate if you have a benchmark that shows a delta on a 128-bit-VL system.
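
A condensed illustration of the two approaches (intrinsic usage only; not the actual xsimd code):

#include <arm_sve.h>

svfloat32_t masked_load_f32(float const* mem, svint32_t lane_mask)
{
    // Correct for any vector length: non-zero lanes of the integer-domain
    // mask become active predicate lanes.
    svbool_t pg = svcmpne_n_s32(svptrue_b32(), lane_mask, 0);
    // svdupq_n_b32(...) would instead replicate a 128-bit predicate chunk,
    // which only expresses a per-lane mask when the VL is exactly 128 bits.
    return svld1_f32(pg, mem); // inactive lanes read as zero, faults suppressed
}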

Comment thread include/xsimd/types/xsimd_api.hpp Outdated
* so partial loads across a page boundary are safe. \c stream_mode is not
* supported.
*
* \warning Runtime-mask loads carry a significant performance penalty on
Contributor

I don't think we should go into details here:

  1. it's difficult to maintain this kind of documentation (what about newly added architectures?)
  2. we already have the same situation for other operations and we don't specify it there.

I think it's important to communicate that info, but until we have an automated way to do so, better not to just throw documentation at it.

Contributor Author

Agreed — collapsed the four runtime-mask doxygen blocks down to one short paragraph each (one for load, one for store), with \overload on the unaligned variant. Removed the per-architecture rundown of which targets do/do not have native maskload, since that information rots quickly as the project picks up new arches.

Comment thread include/xsimd/types/xsimd_batch.hpp Outdated
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported load mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
Contributor

I'm unsure we want that extra call to mask(), which may be costly, plus the extra tests... if masking is supported, is it beneficial? If it's not, we're already slow...

Contributor Author

Agreed — dropped the bits == 0 / bits == full_mask early-out in both batch::load(ptr, batch_bool, mode) and batch::store(ptr, batch_bool, mode). The runtime-mask member now just forwards straight to kernel::load_masked / kernel::store_masked. Targets with native predicated instructions (AVX2/AVX-512/SVE/RVV) absorb the all-zero / all-one mask via the hardware predicate, and on the common scalar fallback the per-lane loop handles those cases for free. The extra mask.mask() call and the two compares are gone.
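
The member now amounts to a plain forward, roughly (assumed shape; the real declaration sits inside the batch class body in xsimd_batch.hpp):

// inside batch<T, A>:
template <class Mode>
static XSIMD_INLINE batch load(T const* mem, batch_bool<T, A> const& mask, Mode) noexcept
{
    return kernel::load_masked<A>(mem, mask, kernel::convert<T> {}, Mode {}, A {});
}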

Comment thread include/xsimd/types/xsimd_batch.hpp Outdated
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
"supported store mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
Contributor

same here

Contributor Author

Same fix — early-out and static_assert are gone from batch::store(T*, batch_bool, Mode); it now forwards directly to kernel::store_masked.

@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch from d5f21c7 to 665925b on May 5, 2026 at 14:16
DiamonDinoia added a commit to DiamonDinoia/xsimd that referenced this pull request May 5, 2026
…1332 review

- common: drop zero-init buffer + mask.mask() pack; use mask.get(i) directly
- batch::load/store(batch_bool, Mode): drop bits==0/full early-out, forward
  to kernel (native arches absorb the all-zero/all-one mask in hardware)
- avx2: pin sizeof(int)==4 / sizeof(long long)==8 next to detail::maskload
  helpers; runtime store routes through detail::maskstore symmetrically
- avx2_128: introduce detail::maskload_128 / maskstore_128; constant- and
  runtime-mask paths share them; fix stale convert<double>/avx_128 dispatch
  tags on the int64/uint64 constant-mask overloads
- xsimd_api.hpp + data_transfer.rst: shorten runtime-mask docs (drop the
  per-architecture rundown that would rot as new arches land)
- test_load_store: collapse run_*_mask_pattern / run_*_runtime_mask_pattern
  into one helper parameterized on a MaskKind policy; drop first_N/last_N
  patterns (covered by load_head/load_tail in the follow-up branch)

Codegen on AVX2/AVX2-128 verified: each runtime-mask load/store reduces
to a single vpmaskmov{d,q}; the early-out removal eliminates an extra
vmovmskps + test + cmp + branch tail.

Adds runtime batch_bool mask overloads of xsimd::load_masked and
xsimd::store_masked across AVX, AVX2, AVX-512, SSE, SVE, RVV, and NEON;
generic common-path fallback collapsed to a whole-vector select. SVE
compile-time masked load/store forwarded through the runtime path so the
per-lane predicate is correct on SVE wider than 128 bits.

Adds arch-specific runtime-mask overloads of load_masked / store_masked
for the avx_128 and avx2_128 arches so they inherit the hardware
predicated load/store path on x86.

Squashed from:
  b57a766  feat: add runtime batch_bool mask overloads for load_masked/store_masked
  d5f21c7  feat: add runtime batch_bool mask overloads for avx_128 / avx2_128
@DiamonDinoia DiamonDinoia force-pushed the feat/dynamic-masks branch from 7d7bbc3 to d240cef on May 5, 2026 at 18:07
…lpers

Shorten verbose comments around masked load/store paths, drop the
sizeof(int)/sizeof(long long) static_asserts (intrinsic boundaries now
reinterpret_cast at the call site), and collapse the four
maskload_128/maskstore_128 detail overloads into two XSIMD_IF_CONSTEXPR-
dispatched templates. Public surface unchanged.
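
An illustrative shape for the collapsed helpers (assumed; the real avx2_128 detail code may differ):

namespace detail
{
    template <class T>
    XSIMD_INLINE __m128i maskload_128(T const* mem, __m128i mask) noexcept
    {
        XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
        {
            return _mm_maskload_epi32(reinterpret_cast<int const*>(mem), mask);
        }
        else
        {
            return _mm_maskload_epi64(reinterpret_cast<long long const*>(mem), mask);
        }
    }

    template <class T>
    XSIMD_INLINE void maskstore_128(T* mem, __m128i mask, __m128i src) noexcept
    {
        XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
        {
            _mm_maskstore_epi32(reinterpret_cast<int*>(mem), mask, src);
        }
        else
        {
            _mm_maskstore_epi64(reinterpret_cast<long long*>(mem), mask, src);
        }
    }
}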