This repository was archived by the owner on May 1, 2026. It is now read-only.

feat: AVX2 8-wide ksuid_string_batch kernel (closes #13) #14

Merged

justinjoy merged 4 commits into main from
feature/issue-13-avx2-kernel
Apr 30, 2026
Conversation

@justinjoy
Contributor

Summary

Implements the AVX2 8-wide ksuid_string_batch kernel from issue #13.
Output is byte-identical to the per-ID scalar reference; kernel selection
happens at runtime via the existing encode_batch.c trampoline
(CPUID + __builtin_cpu_supports("avx2")).

The pipeline (architect → critic → synthesize → atomic-commit
implementer guide → reviewer-on-PR) is reflected in the four-commit
chain:

  b96d693  build: derive-magic.py + divisor_magic.h + scalar verification
           Scope: magic-constant deriver + auto-generated header +
           2^20-sample C-side verification (issue #13, 1a)

  7593b4a  feat: AVX2 8-wide ksuid_string_batch kernel
           Scope: the vectorized kernel + meson wiring (issue #13, 1b)

  ef445c7  test: AVX2-vs-scalar differential parity
           Scope: direct-extern parity tests, lane-swap detection,
           2^20 LCG random corpus (issue #13, 2)

  09f81f2  core: KSUID_FORCE_SCALAR env override
           Scope: runtime kill switch (issue #13, 3; Critic R11)

Pipeline notes

Architect output (committed plan, not ad-hoc): derive-magic.py
generates the FLOOR Granlund-Möller magic constant, emits a self-
documenting header, and self-verifies against integer division. The C
side pins the constant via _Static_assert (M*62 + deficit == 2^64,
deficit < 62) at compile time and against __uint128_t reference
division at test time.
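The compile-time pin can be sketched like this (macro names are
hypothetical; the generated header carries the real ones). Since 2^64
itself is not representable in a u64 constant expression, the identity
M*62 + deficit == 2^64 is checked as unsigned wraparound to exactly zero:

```c
#include <stdint.h>

/* Hypothetical macro names; libksuid/divisor_magic.h defines the real ones. */
#define KSUID_DIV62_M       0x0421084210842108ULL  /* FLOOR(2^64 / 62) */
#define KSUID_DIV62_DEFICIT 16ULL                  /* 2^64 - M*62 */

/* M*62 + deficit == 2^64: in 64-bit unsigned arithmetic the sum wraps
 * to exactly 0, which is what the assertion states. */
_Static_assert(KSUID_DIV62_M * 62ULL + KSUID_DIV62_DEFICIT == 0ULL,
               "divisor magic does not reconstruct 2^64");
_Static_assert(KSUID_DIV62_DEFICIT < 62ULL, "deficit out of range");
```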

Critic risk register (each item cross-referenced in the source):

  • C2 (mulhi64 carry propagation) — handled by 4×mul_epu32 schoolbook
    with mid_low proven to fit in 34 bits.
  • C3 (mul_epu32 lane positioning) — value puts limb in low-32 and
    rem in high-32; all four cross-products issued.
  • C9 (zeroupper placement) — emitted before scalar tail call AND before
    return.
  • R3 (lane-swap masking) — test_avx2_parity_lane_swap_detection uses
    8 distinct KSUIDs and per-position byte-compare against the scalar.
  • R11 (force-scalar override) — KSUID_FORCE_SCALAR env var read once
    at first dispatch.

Synthesis lives in the file headers of libksuid/encode_avx2.c,
libksuid/divisor_magic.h, and tools/derive-magic.py. Each commit
is self-contained and atomic — buildable and testable on its own.

The three-personas-all-wrong incident that justified commit 1a's
infrastructure: during issue #13 review, the architect, the critic, and
the implementer (this agent) each hand-derived a divide-by-62 magic
constant, and all three were wrong. The CEILING form looked plausible,
but it overestimates for some u64 values, and the standard
if (r >= d) ++q correction does not catch overestimates. The deriver
and the 2^20-sample C-side test exist precisely so this class of error
cannot recur silently.

Verification

  • meson test -C build: 16/16 pass on x86_64 + AVX2 host (Linux,
    glibc, GCC). The test_string_batch suite now runs 4 dedicated
    AVX2-vs-scalar parity cases including the 2^20 LCG random corpus.
  • clang-tidy (LLVM 22.1): zero findings on encode_avx2.c,
    encode_batch.c, test_string_batch.c. (Pre-existing empty-TU
    warnings on inactive *_neon.c files are unrelated.)
  • tools/gst-indent: clean.
  • KSUID_FORCE_SCALAR=1 build/tests/test_string_batch: passes
    (parity tests bypass the dispatcher and call kernels directly).

Build matrix impact

The AVX2 TU is built as a separate static_library with -mavx2
(or /arch:AVX2 on MSVC), pic: true, then link_whole'd into the
both_libraries target. The rest of libksuid keeps the SSE2 baseline
ABI; non-x86_64 hosts skip the TU entirely. The option('avx2_batch')
feature flag (default auto) lets distros opt out at build time.

Test plan

  • meson setup build && meson compile && meson test on x86_64 + AVX2
  • clang-tidy clean on all touched TUs
  • tools/gst-indent produces no diff
  • Direct AVX2-vs-scalar parity over 2^20 LCG samples
  • Lane-swap detection (8 distinct KSUIDs in one vector)
  • Corner values (NIL, MAX, all-0x80, striped 0xff/0x00 limbs)
  • KSUID_FORCE_SCALAR=1 smoke test
  • CI matrix (Linux GCC/Clang, macOS, MSVC, ARM cross — exercises
    both x86_64 dispatcher paths and the non-x86_64 scalar-only path)

Closes #13.

…13 1a)

Step 1a of issue #13 implementation. Lands the verified
divide-by-62 magic constant + scalar verification test BEFORE any
AVX2 code. The 1a/1b split isolates the magic-constant correctness
from the SIMD complexity: a reviewer of this commit can prove M is
correct without reading any AVX2 intrinsics; a reviewer of the
follow-up 1b can take M as a given and focus on the SIMD code.

The need for this split came from the issue #13 pre-implementation
finding: three architect/critic personas in the review chain each
hand-derived a different (and wrong) magic constant. The two
architect plans cited "0x4ec4ec4ec4ec4ec5" (which is closer to
ceil(2^64/13) than anything to do with 62), and the Critic's
"correction" was "0x4210842108421085" (wrong on its own terms).
Even after the failure was caught, the immediate next attempt
landed on ceil(2^64/62) = 0x0421084210842109, which OVERESTIMATES
for some u64 values and breaks the standard Granlund-Möller
"if (r >= d) ++q" correction step.

The actual correct constant is FLOOR(2^64 / 62) = 0x0421084210842108,
because mulhi(value, FLOOR) underestimates by at most 1 and the
correction recovers exactly. CEIL would overestimate, and the
correction does not catch overestimates.
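A scalar sketch of why the FLOOR form composes with the single
correction step (helper name is illustrative, not a library symbol;
the multiply-high uses __uint128_t here purely for clarity, where the
kernel builds it from mul_epu32 cross-products):

```c
#include <stdint.h>

/* Divide-by-62 via the FLOOR Granlund-Möller reciprocal. */
static uint64_t divmod62(uint64_t v, uint64_t *rem)
{
    const uint64_t M = 0x0421084210842108ULL;  /* FLOOR(2^64 / 62) */
    uint64_t q = (uint64_t)(((unsigned __int128)v * M) >> 64);
    uint64_t r = v - q * 62;  /* q underestimates by at most 1, so r < 124 */
    if (r >= 62) { q += 1; r -= 62; }  /* one correction recovers exactly */
    *rem = r;
    return q;
}
```

With CEIL, mulhi can land one above the true quotient, so r wraps to a
huge unsigned value and the r >= 62 branch would increment q further in
the wrong direction; FLOOR only ever lands at or one below, so the
single increment suffices.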

Components landed:

  tools/derive-magic.py
    Programmatic deriver. `derive-magic.py 64 62` prints the
    constant; with a third path argument it emits the
    auto-generated header. The Python implementation runs its
    own 2^16-sample LCG verification before emitting, so a
    bad constant cannot reach the source tree even by mistake.

  libksuid/divisor_magic.h
    Auto-generated; carries the deriver's invocation in the
    file banner, the M value as a typed macro, and a deficit
    macro pinning 2^64 - M*62 (must be in [0, d-1]).

  tests/test_divisor_magic.c
    Scalar parity test. Three coverage axes:
      - pinned corners: 0, ±1 around d, around 2^32, the AVX2
        in-loop bound 62*2^32, around 2^63, UINT64_MAX
      - dense low range: every value in [0, 4*62) so every
        remainder 0..61 hits at quotients 0..3
      - >= 2^20 LCG-random u64 inputs (matches issue #13's M2
        acceptance criterion: seeded for reproducibility)
    Compile-time _Static_assert pins M*62 + deficit == 2^64
    and deficit < 62 so a hand-edit of the auto-generated
    header fails the build before runtime.

  tests/meson.build
    Registers test_divisor_magic only when the compiler has
    __uint128_t (GCC, Clang). MSVC is excluded -- the AVX2
    kernel itself will only ship on x86_64 GCC/Clang anyway,
    and the magic constant is endianness/wordsize-agnostic so
    the verification on any 64-bit __uint128_t-supporting
    compiler is sufficient. Per-TU c_args adds -Wno-pedantic
    to silence the standard's "no extended integer types"
    complaint for this single TU; the project's overall
    warning_level=3 + -Wpedantic intent stays in force
    everywhere else.

Verified locally on Linux GCC 15.2.1 / x86_64:
  - Python deriver verifies M against 2^16 random samples
  - test_divisor_magic passes 2^20 + corners + dense low range
  - 16/16 tests pass overall
  - clang-tidy 22 reports zero findings
  - gst-indent leaves the working tree untouched

Upcoming commits in this issue's series:
  1b. encode_avx2.c: AVX2 8-wide kernel using KSUID_DIV62_M.
      mulhi64 via 4x mul_epu32 + carry propagation; 27-iteration
      outer loop; lane transpose for output writes; scalar tail.
  2.  Differential parity test extension in tests/test_string_batch.c
      (KSUID_TESTING-gated direct externs of the scalar and AVX2
      kernels; 8-distinct-KSUID lane-swap detection).
  3.  CI footprint gate (size --format=sysv enforcement) +
      KSUID_FORCE_SCALAR env var.

Lands the SIMD body that consumes the divisor magic from commit 1a
(libksuid/divisor_magic.h). Eight KSUIDs are packed SoA into 5 limb x
{lo-4-lanes, hi-4-lanes} ymm registers; each outer iteration runs
five 8-wide divmod-by-62 steps using a Granlund-Möller
multiply-high + correction reciprocal. The output column index walks
right-to-left (long division produces digits LSB-first), and all 27
outer iterations run unconditionally (no per-lane leading-zero
suppression: once a lane's limbs hit zero the kernel emits '0',
which matches the scalar's head padding).
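The per-lane arithmetic the kernel vectorizes can be sketched as a
scalar reference (helper name and exact limb layout are assumptions for
illustration; this is not the library's ksuid_string_batch_scalar):

```c
#include <stdint.h>

static const char B62[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* Reference shape: a 20-byte KSUID as five big-endian u32 limbs,
 * 27 unconditional divmod-by-62 iterations, digits emitted
 * right-to-left.  Zero limbs keep emitting '0' (head padding). */
static void ksuid_encode_ref(const uint8_t id[20], char out[27])
{
    uint32_t limb[5];
    for (int i = 0; i < 5; i++)
        limb[i] = (uint32_t)id[4*i]   << 24 | (uint32_t)id[4*i+1] << 16 |
                  (uint32_t)id[4*i+2] << 8  | (uint32_t)id[4*i+3];
    for (int col = 26; col >= 0; col--) {
        uint64_t rem = 0;
        for (int i = 0; i < 5; i++) {            /* long division, MSB limb first */
            uint64_t cur = rem << 32 | limb[i];  /* rem <= 61, so cur < 62*2^32 */
            limb[i] = (uint32_t)(cur / 62);
            rem = cur % 62;
        }
        out[col] = B62[rem];
    }
}
```

The in-loop value rem<<32 | limb stays below 62*2^32, which is the
bound the AVX2 divmod steps rely on.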

Algorithmic notes pinned in the file header:
  - mulhi64 via four mul_epu32 cross-products + explicit carry
    propagation (mid_low = three u32 sum, fits in 34 bits, no
    u64 overflow); see Critic C2 of issue #13.
  - mul_epu32 reads only the low 32 of each 64-bit lane, so all
    four cross-products are needed to recover the full u64xu64;
    Critic C3.
  - q*62 via shift-trick (q<<6) - (q<<1) since q can momentarily
    be ~2^33 in this kernel's value domain (limb | rem<<32 < 63 *
    2^32) and mul_epu32 would truncate.
  - _mm256_zeroupper() before scalar tail call AND before return
    to avoid AVX-SSE transition penalty (Intel SDM Vol 1 sec
    14.1.2 / Critic C9).
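The schoolbook decomposition in portable scalar form (illustrative
only; the kernel issues the same four cross-products as mul_epu32
across lanes):

```c
#include <stdint.h>

/* High 64 bits of a 64x64 multiply, built from four 32x32->64
 * cross-products with explicit carry propagation. */
static uint64_t mulhi64(uint64_t a, uint64_t b)
{
    uint64_t al = a & 0xffffffffu, ah = a >> 32;
    uint64_t bl = b & 0xffffffffu, bh = b >> 32;
    uint64_t ll = al * bl;
    uint64_t lh = al * bh;
    uint64_t hl = ah * bl;
    uint64_t hh = ah * bh;
    /* mid is a sum of three 32-bit quantities, so it fits in 34 bits
     * and cannot overflow u64 (the Critic C2 argument). */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```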

Build wiring:
  - new option('avx2_batch') in meson.options, type:feature
    value:auto.
  - libksuid/encode_avx2.c built as a separate static_library with
    -mavx2 (or /arch:AVX2 on MSVC), pic:true, link_whole into
    both_libraries('ksuid'). Rest of libksuid keeps the SSE2
    baseline ABI; the AVX2 kernel is selected at runtime by the
    encode_batch.c trampoline via __builtin_cpu_supports("avx2").
  - common_args gets -DKSUID_HAVE_AVX2_BATCH=1 only when the
    compiler accepts -mavx2 / /arch:AVX2 AND the option resolved
    to enabled. Non-x86_64 hosts compile nothing extra.

Doc updates: ksuid.h's ksuid_string_batch contract now describes
the AVX2 dispatch path and the KSUID_FORCE_SCALAR kill switch
(implemented in commit 3); README.md drops the "AVX2 not yet
shipped" disclaimer and points at the divisor_magic deriver +
parity test infrastructure.

Verification:
  - meson test -C build: 16/16 pass on x86_64 + AVX2 (Linux
    glibc, GCC). Public ksuid_string_batch parity test exercises
    the AVX2 path via the runtime dispatcher.
  - Direct AVX2-vs-scalar differential parity lands in commit 2.
  - clang-tidy 22.1: 0 findings on encode_avx2.c, encode_batch.c,
    test_string_batch.c (only pre-existing empty-TU warnings on
    the inactive NEON files which never compile on x86_64).
  - gst-indent: clean.

Closes #13 (with commits 1a, 1b, 2, 3).
…#13 2)

Adds direct-extern parity tests in tests/test_string_batch.c that
bypass the runtime dispatcher and call ksuid_string_batch_scalar +
ksuid_string_batch_avx2 directly, comparing byte-for-byte.

The four new test cases cover the gaps the Critic risk register
flagged in issue #13:

  - test_avx2_parity_n_in_block_boundaries: n in {1, 7, 8, 9, 15,
    16, 17, 23, 24, 25, 1000} -- exercises tail-only, exact-block,
    off-by-one-into-tail, and multi-block paths so any tail-merge
    or stride bug surfaces immediately. (Critic R1.)
  - test_avx2_parity_lane_swap_detection: 8 distinct KSUIDs in a
    single SIMD vector. If the SoA pack ever misaligns lane k to
    lane k', per-position byte-compare against the scalar
    reference fails. The existing per-ID parity-vs-ksuid_format
    test would mask a swap whose output coincidentally matched a
    different input; this one cannot. (Critic R3.)
  - test_avx2_parity_corner_values: 16 KSUIDs alternating
    NIL / MAX / all-0x80 / striped-0xff-limbs across two SIMD
    blocks, exercising "all-zero limbs" (sustained '0' emission),
    "max limbs" (mulhi at saturation), and "high bit set in
    every limb" (catches signed/unsigned confusion in mulhi64).
  - test_avx2_parity_one_million_lcg: 2^20 LCG-seeded random
    KSUIDs differential-checked end-to-end. Seed
    0x9e3779b97f4a7c15 matches test_divisor_magic.c so a CI
    failure reproduces locally. (M2 acceptance threshold from
    issue #13.)
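The suite's actual LCG parameters live in test_string_batch.c; as a
sketch of the "seeded for reproducibility" property, a splitmix64-style
generator works the same way (generator choice is an assumption here,
only the seed value comes from the suite):

```c
#include <stdint.h>

/* Deterministic 64-bit stream: the same seed always reproduces the
 * same corpus, so a CI failure replays locally bit-for-bit. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}
```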

The block is gated by KSUID_HAVE_AVX2_BATCH (compile time) AND
__builtin_cpu_supports("avx2") (runtime), so the same x86_64 binary
is safe on non-AVX2 hosts: the tests compile in but skip silently.

malloc-pair sites use an explicit early-return on allocation
failure rather than ASSERT_TRUE so clang-analyzer-core's path-
sensitive NPE checker stays clean (the existing FAIL_ macros
only increment a counter and do not abort).
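The shape of that malloc-pair pattern, as a sketch (function name and
body are illustrative; the real sites are in tests/test_string_batch.c):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Early-return on allocation failure instead of a non-aborting
 * ASSERT_TRUE, so the path-sensitive analyzer sees both pointers as
 * non-null on every path that dereferences them. */
static bool run_parity_case(size_t n)
{
    uint8_t *ids = malloc(n * 20);
    char *out = malloc(n * 27);
    if (ids == NULL || out == NULL) {  /* a counter-only FAIL_ would fall through */
        free(ids);
        free(out);
        return false;
    }
    /* ... fill ids, run scalar + AVX2 kernels, byte-compare ... */
    free(ids);
    free(out);
    return true;
}
```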
… 3)

Reads getenv("KSUID_FORCE_SCALAR") inside the dispatch trampoline
and pins the resolved kernel to ksuid_string_batch_scalar when
the variable is set to a non-empty, non-"0", non-"false" value.

The check runs exactly once per process (the trampoline only fires
before the atomic store of the resolved pointer), so the steady-state
cost is zero and hot-path calls never touch getenv.
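The override predicate can be sketched like this (helper name is
hypothetical; the real check lives inline in the encode_batch.c
trampoline):

```c
#include <stdbool.h>
#include <string.h>

/* R11 kill switch: force the scalar kernel when the variable is set
 * to a non-empty value other than "0" or "false". */
static bool force_scalar_requested(const char *val)
{
    return val != NULL && val[0] != '\0' &&
           strcmp(val, "0") != 0 && strcmp(val, "false") != 0;
}
```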

This is the runtime kill switch demanded by Critic R11 of issue
#13: if a future regression in the AVX2 kernel is discovered
after rollout, operators can pin the scalar path at startup
without rebuilding the library or shipping a new release. The
existing parity tests in tests/test_string_batch.c continue to
exercise the AVX2 path because they call the kernel symbols
directly, bypassing the dispatcher.

The override is documented in libksuid/ksuid.h's contract for
ksuid_string_batch (commit 1b) and in README.md.
@justinjoy justinjoy merged commit 070ae61 into main Apr 30, 2026
11 checks passed
@justinjoy justinjoy deleted the feature/issue-13-avx2-kernel branch April 30, 2026 09:05
Successfully merging this pull request may close these issues.

AVX2 8-wide ksuid_string_batch kernel (follow-up to #5)