This repository was archived by the owner on May 1, 2026. It is now read-only.
feat: AVX2 8-wide ksuid_string_batch kernel (closes #13) #14
Merged
Conversation
(#13 1a) Step 1a of issue #13 implementation. Lands the verified divide-by-62 magic constant + scalar verification test BEFORE any AVX2 code.

The 1a/1b split isolates the magic-constant correctness from the SIMD complexity: a reviewer of this commit can prove M is correct without reading any AVX2 intrinsics; a reviewer of the follow-up 1b can take M as a given and focus on the SIMD code. The need for this split came from the issue #13 pre-implementation finding: three architect/critic personas in the review chain each hand-derived a different (and wrong) magic constant. The two architect plans cited "0x4ec4ec4ec4ec4ec5" (which is closer to ceil(2^64/13) than anything to do with 62), and the Critic's "correction" was "0x4210842108421085" (wrong on its own terms). Even after the failure was caught, the immediate next attempt landed on ceil(2^64/62) = 0x0421084210842109, which OVERESTIMATES for some u64 values and breaks the standard Granlund-Moeller "if (r >= d) ++q" correction step. The actual correct constant is FLOOR(2^64 / 62) = 0x0421084210842108, because mulhi(value, FLOOR) underestimates by at most 1 and the correction recovers exactly. CEIL would overestimate, and the correction does not catch overestimates.

Components landed:

- tools/derive-magic.py: programmatic deriver. `derive-magic.py 64 62` prints the constant; with a third path argument it emits the auto-generated header. The Python implementation runs its own 2^16-sample LCG verification before emitting, so a bad constant cannot reach the source tree even by mistake.
- libksuid/divisor_magic.h: auto-generated; carries the deriver's invocation in the file banner, the M value as a typed macro, and a deficit macro pinning 2^64 - M*62 (must be in [0, d-1]).
- tests/test_divisor_magic.c: scalar parity test. Three coverage axes:
  - pinned corners: 0, ±1 around d, around 2^32, the AVX2 in-loop bound 62*2^32, around 2^63, UINT64_MAX
  - dense low range: every value in [0, 4*62) so every remainder 0..61 hits at quotients 0..3
  - >= 2^20 LCG-random u64 inputs (matches issue #13's M2 acceptance criterion; seeded for reproducibility)
  A compile-time _Static_assert pins M*62 + deficit == 2^64 and deficit < 62, so a hand-edit of the auto-generated header fails the build before runtime.
- tests/meson.build: registers test_divisor_magic only when the compiler has __uint128_t (GCC, Clang). MSVC is excluded -- the AVX2 kernel itself will only ship on x86_64 GCC/Clang anyway, and the magic constant is endianness/wordsize-agnostic, so verification on any 64-bit __uint128_t-supporting compiler is sufficient. Per-TU c_args adds -Wno-pedantic to silence the standard's "no extended integer types" complaint for this single TU; the project's overall warning_level=3 + -Wpedantic intent stays in force everywhere else.

Verified locally on Linux GCC 15.2.1 / x86_64:

- Python deriver verifies M against 2^16 random samples
- test_divisor_magic passes 2^20 + corners + dense low range
- 16/16 tests pass overall
- clang-tidy 22 reports zero findings
- gst-indent leaves the working tree untouched

Upcoming commits in this issue's series:

1b. encode_avx2.c: AVX2 8-wide kernel using KSUID_DIV62_M. mulhi64 via 4x mul_epu32 + carry propagation; 27-iteration outer loop; lane transpose for output writes; scalar tail.
2. Differential parity test extension in tests/test_string_batch.c (KSUID_TESTING-gated direct externs of the scalar and AVX2 kernels; 8-distinct-KSUID lane-swap detection).
3. CI footprint gate (size --format=sysv enforcement) + KSUID_FORCE_SCALAR env var.
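The FLOOR-vs-CEIL argument above can be checked with a minimal scalar sketch. The constant macro is assumed to be spelled `KSUID_DIV62_M` as the 1b commit message cites; the `div62`/`mulhi64` helper names are illustrative, and the real constant ships in the auto-generated `libksuid/divisor_magic.h`:

```c
/* Sketch of the FLOOR-form Granlund-Moeller divide-by-62.
 * Requires a compiler with __uint128_t (GCC, Clang). */
#include <stdint.h>

/* FLOOR(2^64 / 62) = 0x0421084210842108 -- NOT the CEIL form.
 * deficit = 2^64 - M*62 = 16, which is in [0, 61] as required. */
#define KSUID_DIV62_M 0x0421084210842108ULL

static inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}

/* mulhi(v, FLOOR) underestimates the true quotient by at most 1,
 * so a single "r >= 62" correction recovers the exact divmod.
 * With the CEIL constant the estimate can overshoot, and this
 * correction cannot subtract -- that is the bug 1a guards against. */
static inline uint64_t div62(uint64_t v, uint64_t *rem)
{
    uint64_t q = mulhi64(v, KSUID_DIV62_M);
    uint64_t r = v - q * 62;
    if (r >= 62) { q += 1; r -= 62; }
    *rem = r;
    return q;
}
```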
Lands the SIMD body that consumes the divisor magic from commit 1a
(libksuid/divisor_magic.h). 8 KSUIDs are packed SoA into 5 limb x
{lo-4-lanes, hi-4-lanes} ymm registers; each outer iteration runs
five 8-wide divmod-by-62 steps using a Granlund-Moeller
multiply-high + correction reciprocal. Output column index walks
right-to-left (long division produces digits LSB-first), 27 outer
iterations unconditionally (no per-lane leading-zero suppression --
once a lane's limbs hit zero the kernel emits '0' which matches
the scalar's head-padding).
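To make the right-to-left digit walk and the '0' head-padding concrete, here is a hedged scalar reference of the same long division over one lane: five u32 limbs divided by 62 twenty-seven times, each remainder becoming one output digit. The helper name and the 0-9A-Za-z base-62 alphabet are assumptions for illustration, not the repository's actual code:

```c
#include <stdint.h>

static const char B62[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* One KSUID as five big-endian u32 limbs -> 27 base-62 chars. */
static void ksuid_string_scalar_ref(const uint32_t limbs_in[5], char out[27])
{
    uint32_t limbs[5];
    for (int i = 0; i < 5; i++) limbs[i] = limbs_in[i];

    for (int col = 26; col >= 0; col--) {      /* LSB digit first */
        uint64_t rem = 0;
        for (int i = 0; i < 5; i++) {          /* MSB limb first */
            uint64_t v = (rem << 32) | limbs[i];  /* < 63 * 2^32 */
            limbs[i] = (uint32_t)(v / 62);
            rem = v % 62;
        }
        /* Once all limbs hit zero, rem stays 0 and '0' padding is
         * emitted -- the behavior the AVX2 kernel matches by running
         * all 27 iterations unconditionally. */
        out[col] = B62[rem];
    }
}
```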
Algorithmic notes pinned in the file header:
- mulhi64 via four mul_epu32 cross-products + explicit carry
propagation (mid_low = three u32 sum, fits in 34 bits, no
u64 overflow); see Critic C2 of issue #13.
- mul_epu32 reads only the low 32 of each 64-bit lane, so all
four cross-products are needed to recover the full u64xu64;
Critic C3.
- q*62 via shift-trick (q<<6) - (q<<1) since q can momentarily
be ~2^33 in this kernel's value domain (limb | rem<<32 < 63 *
2^32) and mul_epu32 would truncate.
- _mm256_zeroupper() before scalar tail call AND before return
to avoid AVX-SSE transition penalty (Intel SDM Vol 1 sec
14.1.2 / Critic C9).
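The first and third notes can be mirrored in portable scalar C. This is not the kernel's intrinsics code, but a sketch of the same schoolbook decomposition and shift trick, so the carry and overflow claims can be checked directly:

```c
#include <stdint.h>

/* mulhi64 from four 32x32 partial products, mirroring what the kernel
 * builds from _mm256_mul_epu32 (which reads only the low 32 bits of
 * each 64-bit lane -- hence all four cross-products). */
static inline uint64_t mulhi64_schoolbook(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* mid_low is a sum of three <=32-bit terms, so it fits in ~34
     * bits and cannot overflow u64 (the Critic C2 point). */
    uint64_t mid_low = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    return hh + (lh >> 32) + (hl >> 32) + (mid_low >> 32);
}

/* q*62 without a full 64-bit multiply: (q<<6) - (q<<1). Needed
 * because q can momentarily be ~2^33 in the kernel's value domain
 * and mul_epu32 would truncate the operand to 32 bits. */
static inline uint64_t mul62(uint64_t q)
{
    return (q << 6) - (q << 1);
}
```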
Build wiring:
- new option('avx2_batch') in meson.options, type:feature
value:auto.
- libksuid/encode_avx2.c built as a separate static_library with
-mavx2 (or /arch:AVX2 on MSVC), pic:true, link_whole into
both_libraries('ksuid'). Rest of libksuid keeps the SSE2
baseline ABI; the AVX2 kernel is selected at runtime by the
encode_batch.c trampoline via __builtin_cpu_supports("avx2").
- common_args gets -DKSUID_HAVE_AVX2_BATCH=1 only when the
compiler accepts -mavx2 / /arch:AVX2 AND the option resolved
to enabled. Non-x86_64 hosts compile nothing extra.
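The runtime-selection arrangement can be sketched as follows. The kernel bodies here are stand-in stubs, and all names except `__builtin_cpu_supports` and `KSUID_HAVE_AVX2_BATCH` are illustrative (the real trampoline lives in encode_batch.c and the real code gates only on `KSUID_HAVE_AVX2_BATCH`; the extra `__x86_64__` leg exists only so this sketch exercises both arms):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*batch_fn)(char *dst, const uint8_t *src, size_t n);

static const char *g_selected = "none";   /* demo hook, not in the library */

static void batch_scalar(char *dst, const uint8_t *src, size_t n)
{ (void)dst; (void)src; (void)n; g_selected = "scalar"; }

static void batch_avx2(char *dst, const uint8_t *src, size_t n)
{ (void)dst; (void)src; (void)n; g_selected = "avx2"; }

static void batch_resolve(char *dst, const uint8_t *src, size_t n);

/* First call lands in the resolver; it stores the chosen kernel, so
 * the steady state is a plain indirect call with no CPUID probe. */
static _Atomic batch_fn batch_impl = batch_resolve;

static void batch_resolve(char *dst, const uint8_t *src, size_t n)
{
    batch_fn fn = batch_scalar;
#if defined(KSUID_HAVE_AVX2_BATCH) || defined(__x86_64__)
    if (__builtin_cpu_supports("avx2"))
        fn = batch_avx2;
#endif
    atomic_store(&batch_impl, fn);
    fn(dst, src, n);
}

void ksuid_string_batch(char *dst, const uint8_t *src, size_t n)
{
    atomic_load(&batch_impl)(dst, src, n);
}
```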
Doc updates: ksuid.h's ksuid_string_batch contract now describes
the AVX2 dispatch path and the KSUID_FORCE_SCALAR kill switch
(implemented in commit 3); README.md drops the "AVX2 not yet
shipped" disclaimer and points at the divisor_magic deriver +
parity test infrastructure.
Verification:
- meson test -C build: 16/16 pass on x86_64 + AVX2 (Linux
glibc, GCC). Public ksuid_string_batch parity test exercises
the AVX2 path via the runtime dispatcher.
- Direct AVX2-vs-scalar differential parity lands in commit 2.
- clang-tidy 22.1: 0 findings on encode_avx2.c, encode_batch.c,
  test_string_batch.c (the only pre-existing findings elsewhere are
  empty-TU warnings on the inactive NEON files, which never compile
  on x86_64).
- gst-indent: clean.
Closes #13 (with commits 1a, 1b, 2, 3).
(#13 2) Adds direct-extern parity tests in tests/test_string_batch.c that bypass the runtime dispatcher and call ksuid_string_batch_scalar + ksuid_string_batch_avx2 directly, comparing byte-for-byte. The four new test cases cover the gaps the Critic risk register flagged in issue #13:

- test_avx2_parity_n_in_block_boundaries: n in {1, 7, 8, 9, 15, 16, 17, 23, 24, 25, 1000} -- exercises tail-only, exact-block, off-by-one-into-tail, and multi-block paths so any tail-merge or stride bug surfaces immediately. (Critic R1.)
- test_avx2_parity_lane_swap_detection: 8 distinct KSUIDs in a single SIMD vector. If the SoA pack ever misaligns lane k to lane k', per-position byte-compare against the scalar reference fails. The existing per-ID parity-vs-ksuid_format test would mask a swap whose output coincidentally matched a different input; this one cannot. (Critic R3.)
- test_avx2_parity_corner_values: 16 KSUIDs alternating NIL / MAX / all-0x80 / striped-0xff-limbs across two SIMD blocks, exercising "all-zero limbs" (sustained '0' emission), "max limbs" (mulhi at saturation), and "high bit set in every limb" (catches signed/unsigned confusion in mulhi64).
- test_avx2_parity_one_million_lcg: 2^20 LCG-seeded random KSUIDs differential-checked end-to-end. Seed 0x9e3779b97f4a7c15 matches test_divisor_magic.c so a CI failure reproduces locally. (M2 acceptance threshold from issue #13.)

The block is gated by KSUID_HAVE_AVX2_BATCH (compile time) AND __builtin_cpu_supports("avx2") (runtime), so the same binary is safe on non-AVX2 hosts in the same x86_64 build (the test compiles in but skips silently). malloc-pair sites use an explicit early return on allocation failure rather than ASSERT_TRUE so clang-analyzer-core's path-sensitive NPE checker stays clean (the existing FAIL_ macros only increment a counter and do not abort).
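The reproducible-corpus idea can be sketched with a seeded 64-bit LCG. The multiplier and increment below are Knuth's MMIX constants, chosen purely for illustration; the PR text fixes only the shared seed, 0x9e3779b97f4a7c15, not the repository's actual LCG parameters:

```c
#include <stdint.h>

/* Seed shared between test_divisor_magic.c and test_string_batch.c
 * per the commit message, so a CI failure reproduces locally. */
static uint64_t lcg_state = 0x9e3779b97f4a7c15ULL;

static uint64_t lcg_next(void)
{
    /* Knuth MMIX constants (illustrative choice). */
    lcg_state = lcg_state * 6364136223846793005ULL
              + 1442695040888963407ULL;
    return lcg_state;
}
```

Because the sequence is a pure function of the seed, re-seeding replays the exact same 2^20-value corpus on any host.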
(#13 3)
Reads getenv("KSUID_FORCE_SCALAR") inside the dispatch trampoline
and pins the resolved kernel to ksuid_string_batch_scalar when
the variable is set to a non-empty, non-"0", non-"false" value.
The check runs exactly once per process (the trampoline only
fires before the atomic store of the resolved pointer), so the
runtime cost on the steady state is zero and there is no
sub-call dependency on getenv.
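The override predicate the trampoline would evaluate once can be sketched as follows (the helper name is illustrative; only the variable name and the non-empty/non-"0"/non-"false" semantics come from the commit message):

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 when KSUID_FORCE_SCALAR is set to a value that should
 * pin the scalar kernel: non-empty, not "0", not "false". */
static int ksuid_force_scalar(void)
{
    const char *v = getenv("KSUID_FORCE_SCALAR");
    if (v == NULL || v[0] == '\0')
        return 0;
    if (strcmp(v, "0") == 0 || strcmp(v, "false") == 0)
        return 0;
    return 1;
}
```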
This is the runtime kill switch demanded by Critic R11 of issue
#13: if a future regression in the AVX2 kernel is discovered
after rollout, operators can pin the scalar path at startup
without rebuilding the library or shipping a new release. The
existing parity tests in tests/test_string_batch.c continue to
exercise the AVX2 path because they call the kernel symbols
directly, bypassing the dispatcher.
The override is documented in libksuid/ksuid.h's contract for
ksuid_string_batch (commit 1b) and in README.md.
Summary
Implements the AVX2 8-wide `ksuid_string_batch` kernel from issue #13. Output is byte-identical to the per-ID scalar reference; selection is runtime via the existing `encode_batch.c` trampoline (CPUID + `__builtin_cpu_supports("avx2")`). The pipeline (architect → critic → synthesize → atomic-commit implementer guide → reviewer-on-PR) is reflected in the four-commit chain:

- b96d693 build: derive-magic.py + divisor_magic.h + scalar verification
- 7593b4a feat: AVX2 8-wide ksuid_string_batch kernel
- ef445c7 test: AVX2-vs-scalar differential parity
- 09f81f2 core: KSUID_FORCE_SCALAR env override

Pipeline notes
Architect output (committed plan, not ad-hoc): `derive-magic.py` generates the FLOOR Granlund-Möller magic constant, emits a self-documenting header, and self-verifies against integer division. The C side pins the constant via `_Static_assert` (M*62 + deficit == 2^64, deficit < 62) at compile time and against `__uint128_t` reference division at test time.
Critic risk register (each item cross-referenced in the source):

- C2: mulhi64 as a `mul_epu32` schoolbook with `mid_low` proven to fit in 34 bits.
- C3 (`mul_epu32` lane positioning): the value puts the limb in low-32 and the rem in high-32; all four cross-products issued.
- C9: `_mm256_zeroupper()` before the scalar tail call AND before return.
- R3: `test_avx2_parity_lane_swap_detection` uses 8 distinct KSUIDs and a per-position byte-compare against the scalar.
- R11: `KSUID_FORCE_SCALAR` env var read once at first dispatch.
Synthesis lives in the file headers of `libksuid/encode_avx2.c`, `libksuid/divisor_magic.h`, and `tools/derive-magic.py`. Each commit is self-contained and atomic -- buildable and testable on its own.
Three-personas-all-wrong incident that justified commit 1a's infrastructure: during issue #13 review, the architect, critic, and implementer (this agent) hand-derived three different and all-wrong divide-by-62 magic constants. The CEILING form looked plausible but overestimates for some `u64` values, and the standard `if (r >= d) ++q` correction does not catch overestimates. The deriver and the 2^20-sample C-side test exist precisely so this class of error cannot recur silently.
Verification
- `meson test -C build`: 16/16 pass on x86_64 + AVX2 host (Linux, glibc, GCC). The `test_string_batch` suite now runs 4 dedicated AVX2-vs-scalar parity cases, including the 2^20 LCG random corpus.
- `clang-tidy` (LLVM 22.1): zero findings on `encode_avx2.c`, `encode_batch.c`, `test_string_batch.c`. (Pre-existing empty-TU warnings on inactive `*_neon.c` files are unrelated.)
- `tools/gst-indent`: clean.
- `KSUID_FORCE_SCALAR=1 build/tests/test_string_batch`: passes (parity tests bypass the dispatcher and call kernels directly).
Build matrix impact
The AVX2 TU is built as a separate `static_library` with `-mavx2` (or `/arch:AVX2` on MSVC), `pic: true`, then `link_whole`'d into the `both_libraries` target. The rest of libksuid keeps the SSE2 baseline ABI; non-x86_64 hosts skip the TU entirely. The `option('avx2_batch')` feature flag (default `auto`) lets distros opt out at build time.

Test plan

- `meson setup build && meson compile && meson test` on x86_64 + AVX2
- `clang-tidy` clean on all touched TUs
- `tools/gst-indent` produces no diff
- `KSUID_FORCE_SCALAR=1` smoke test (both x86_64 dispatcher paths and the non-x86_64 scalar-only path)
Closes #13.