This repository was archived by the owner on May 1, 2026. It is now read-only.

feat: AVX2 8-wide ksuid_string_batch kernel (closes #13) #14

Merged

justinjoy merged 4 commits into main from
feature/issue-13-avx2-kernel
Apr 30, 2026
Conversation

@justinjoy
Contributor

Summary

Implements the AVX2 8-wide ksuid_string_batch kernel from issue #13.
Output is byte-identical to the per-ID scalar reference; kernel selection
happens at runtime via the existing encode_batch.c trampoline
(CPUID + __builtin_cpu_supports("avx2")).

The pipeline (architect → critic → synthesize → atomic-commit
implementer guide → reviewer-on-PR) is reflected in the four-commit
chain:

  b96d693  build: derive-magic.py + divisor_magic.h + scalar verification
           Scope: magic-constant deriver + auto-generated header +
           2^20-sample C-side verification (issue #13, 1a)

  7593b4a  feat: AVX2 8-wide ksuid_string_batch kernel
           Scope: the vectorized kernel + meson wiring (issue #13, 1b)

  ef445c7  test: AVX2-vs-scalar differential parity
           Scope: direct-extern parity tests, lane-swap detection,
           2^20 LCG random corpus (issue #13, 2)

  09f81f2  core: KSUID_FORCE_SCALAR env override
           Scope: runtime kill switch (issue #13, 3; Critic R11)

Pipeline notes

Architect output (committed plan, not ad-hoc): derive-magic.py
generates the FLOOR Granlund-Möller magic constant, emits a self-
documenting header, and self-verifies against integer division. The C
side pins the constant via _Static_assert (M*62 + deficit == 2^64,
deficit < 62) at compile time and against __uint128_t reference
division at test time.
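The compile-time pin can be sketched like this (macro names are
hypothetical; the generated header carries the real ones). Since 2^64
itself is not representable in a u64 constant expression, the identity
M*62 + deficit == 2^64 is checked as unsigned wraparound to exactly zero:

```c
#include <stdint.h>

/* Hypothetical macro names; libksuid/divisor_magic.h defines the real ones. */
#define KSUID_DIV62_M       0x0421084210842108ULL  /* FLOOR(2^64 / 62) */
#define KSUID_DIV62_DEFICIT 16ULL                  /* 2^64 - M*62 */

/* M*62 + deficit == 2^64: in 64-bit unsigned arithmetic the sum wraps
 * to exactly 0, which is what the assertion states. */
_Static_assert(KSUID_DIV62_M * 62ULL + KSUID_DIV62_DEFICIT == 0ULL,
               "divisor magic does not reconstruct 2^64");
_Static_assert(KSUID_DIV62_DEFICIT < 62ULL, "deficit out of range");
```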

Critic risk register (each item cross-referenced in the source):

  • C2 (mulhi64 carry propagation) — handled by 4×mul_epu32 schoolbook
    with mid_low proven to fit in 34 bits.
  • C3 (mul_epu32 lane positioning) — value puts limb in low-32 and
    rem in high-32; all four cross-products issued.
  • C9 (zeroupper placement) — emitted before scalar tail call AND before
    return.
  • R3 (lane-swap masking) — test_avx2_parity_lane_swap_detection uses
    8 distinct KSUIDs and per-position byte-compare against the scalar.
  • R11 (force-scalar override) — KSUID_FORCE_SCALAR env var read once
    at first dispatch.

Synthesis lives in the file headers of libksuid/encode_avx2.c,
libksuid/divisor_magic.h, and tools/derive-magic.py. Each commit
is self-contained and atomic — buildable and testable on its own.

The three-personas-all-wrong incident that justified commit 1a's
infrastructure: during issue #13 review, the architect, the critic, and
the implementer (this agent) each hand-derived a divide-by-62 magic
constant, and all three were wrong. The CEILING form looked plausible,
but it overestimates for some u64 values, and the standard
if (r >= d) ++q correction does not catch overestimates. The deriver
and the 2^20-sample C-side test exist precisely so this class of error
cannot recur silently.

Verification

  • meson test -C build: 16/16 pass on x86_64 + AVX2 host (Linux,
    glibc, GCC). The test_string_batch suite now runs 4 dedicated
    AVX2-vs-scalar parity cases including the 2^20 LCG random corpus.
  • clang-tidy (LLVM 22.1): zero findings on encode_avx2.c,
    encode_batch.c, test_string_batch.c. (Pre-existing empty-TU
    warnings on inactive *_neon.c files are unrelated.)
  • tools/gst-indent: clean.
  • KSUID_FORCE_SCALAR=1 build/tests/test_string_batch: passes
    (parity tests bypass the dispatcher and call kernels directly).

Build matrix impact

The AVX2 TU is built as a separate static_library with -mavx2
(or /arch:AVX2 on MSVC), pic: true, then link_whole'd into the
both_libraries target. The rest of libksuid keeps the SSE2 baseline
ABI; non-x86_64 hosts skip the TU entirely. The option('avx2_batch')
feature flag (default auto) lets distros opt out at build time.

Test plan

  • meson setup build && meson compile && meson test on x86_64 + AVX2
  • clang-tidy clean on all touched TUs
  • tools/gst-indent produces no diff
  • Direct AVX2-vs-scalar parity over 2^20 LCG samples
  • Lane-swap detection (8 distinct KSUIDs in one vector)
  • Corner values (NIL, MAX, all-0x80, striped 0xff/0x00 limbs)
  • KSUID_FORCE_SCALAR=1 smoke test
  • CI matrix (Linux GCC/Clang, macOS, MSVC, ARM cross — exercises
    both x86_64 dispatcher paths and the non-x86_64 scalar-only path)

Closes #13.

…13 1a)

Step 1a of issue #13 implementation. Lands the verified
divide-by-62 magic constant + scalar verification test BEFORE any
AVX2 code. The 1a/1b split isolates the magic-constant correctness
from the SIMD complexity: a reviewer of this commit can prove M is
correct without reading any AVX2 intrinsics; a reviewer of the
follow-up 1b can take M as a given and focus on the SIMD code.

The need for this split came from the issue #13 pre-implementation
finding: three architect/critic personas in the review chain each
hand-derived a different (and wrong) magic constant. The two
architect plans cited "0x4ec4ec4ec4ec4ec5" (which is closer to
ceil(2^64/13) than anything to do with 62), and the Critic's
"correction" was "0x4210842108421085" (wrong on its own terms).
Even after the failure was caught, the immediate next attempt
landed on ceil(2^64/62) = 0x0421084210842109, which OVERESTIMATES
for some u64 values and breaks the standard Granlund-Möller
"if (r >= d) ++q" correction step.

The actual correct constant is FLOOR(2^64 / 62) = 0x0421084210842108,
because mulhi(value, FLOOR) underestimates by at most 1 and the
correction recovers exactly. CEIL would overestimate, and the
correction does not catch overestimates.
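A scalar sketch of why the FLOOR form composes with the single
correction step (helper name is illustrative, not a library symbol;
the multiply-high uses __uint128_t here purely for clarity, where the
kernel builds it from mul_epu32 cross-products):

```c
#include <stdint.h>

/* Divide-by-62 via the FLOOR Granlund-Möller reciprocal. */
static uint64_t divmod62(uint64_t v, uint64_t *rem)
{
    const uint64_t M = 0x0421084210842108ULL;  /* FLOOR(2^64 / 62) */
    uint64_t q = (uint64_t)(((unsigned __int128)v * M) >> 64);
    uint64_t r = v - q * 62;  /* q underestimates by at most 1, so r < 124 */
    if (r >= 62) { q += 1; r -= 62; }  /* one correction recovers exactly */
    *rem = r;
    return q;
}
```

With CEIL, mulhi can land one above the true quotient, so r wraps to a
huge unsigned value and the r >= 62 branch would increment q further in
the wrong direction; FLOOR only ever lands at or one below, so the
single increment suffices.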

Components landed:

  tools/derive-magic.py
    Programmatic deriver. `derive-magic.py 64 62` prints the
    constant; with a third path argument it emits the
    auto-generated header. The Python implementation runs its
    own 2^16-sample LCG verification before emitting, so a
    bad constant cannot reach the source tree even by mistake.

  libksuid/divisor_magic.h
    Auto-generated; carries the deriver's invocation in the
    file banner, the M value as a typed macro, and a deficit
    macro pinning 2^64 - M*62 (must be in [0, d-1]).

  tests/test_divisor_magic.c
    Scalar parity test. Three coverage axes:
      - pinned corners: 0, ±1 around d, around 2^32, the AVX2
        in-loop bound 62*2^32, around 2^63, UINT64_MAX
      - dense low range: every value in [0, 4*62) so every
        remainder 0..61 hits at quotients 0..3
      - >= 2^20 LCG-random u64 inputs (matches issue #13's M2
        acceptance criterion: seeded for reproducibility)
    Compile-time _Static_assert pins M*62 + deficit == 2^64
    and deficit < 62 so a hand-edit of the auto-generated
    header fails the build before runtime.

  tests/meson.build
    Registers test_divisor_magic only when the compiler has
    __uint128_t (GCC, Clang). MSVC is excluded -- the AVX2
    kernel itself will only ship on x86_64 GCC/Clang anyway,
    and the magic constant is endianness/wordsize-agnostic so
    the verification on any 64-bit __uint128_t-supporting
    compiler is sufficient. Per-TU c_args adds -Wno-pedantic
    to silence the standard's "no extended integer types"
    complaint for this single TU; the project's overall
    warning_level=3 + -Wpedantic intent stays in force
    everywhere else.

Verified locally on Linux GCC 15.2.1 / x86_64:
  - Python deriver verifies M against 2^16 random samples
  - test_divisor_magic passes 2^20 + corners + dense low range
  - 16/16 tests pass overall
  - clang-tidy 22 reports zero findings
  - gst-indent leaves the working tree untouched

Upcoming commits in this issue's series:
  1b. encode_avx2.c: AVX2 8-wide kernel using KSUID_DIV62_M.
      mulhi64 via 4x mul_epu32 + carry propagation; 27-iteration
      outer loop; lane transpose for output writes; scalar tail.
  2.  Differential parity test extension in tests/test_string_batch.c
      (KSUID_TESTING-gated direct externs of the scalar and AVX2
      kernels; 8-distinct-KSUID lane-swap detection).
  3.  CI footprint gate (size --format=sysv enforcement) +
      KSUID_FORCE_SCALAR env var.

Lands the SIMD body that consumes the divisor magic from commit 1a
(libksuid/divisor_magic.h). Eight KSUIDs are packed SoA into 5 limb x
{lo-4-lanes, hi-4-lanes} ymm registers; each outer iteration runs
five 8-wide divmod-by-62 steps using a Granlund-Möller
multiply-high + correction reciprocal. The output column index walks
right-to-left (long division produces digits LSB-first), and all 27
outer iterations run unconditionally (no per-lane leading-zero
suppression: once a lane's limbs hit zero the kernel emits '0',
which matches the scalar's head padding).
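The per-lane arithmetic the kernel vectorizes can be sketched as a
scalar reference (helper name and exact limb layout are assumptions for
illustration; this is not the library's ksuid_string_batch_scalar):

```c
#include <stdint.h>

static const char B62[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* Reference shape: a 20-byte KSUID as five big-endian u32 limbs,
 * 27 unconditional divmod-by-62 iterations, digits emitted
 * right-to-left.  Zero limbs keep emitting '0' (head padding). */
static void ksuid_encode_ref(const uint8_t id[20], char out[27])
{
    uint32_t limb[5];
    for (int i = 0; i < 5; i++)
        limb[i] = (uint32_t)id[4*i]   << 24 | (uint32_t)id[4*i+1] << 16 |
                  (uint32_t)id[4*i+2] << 8  | (uint32_t)id[4*i+3];
    for (int col = 26; col >= 0; col--) {
        uint64_t rem = 0;
        for (int i = 0; i < 5; i++) {            /* long division, MSB limb first */
            uint64_t cur = rem << 32 | limb[i];  /* rem <= 61, so cur < 62*2^32 */
            limb[i] = (uint32_t)(cur / 62);
            rem = cur % 62;
        }
        out[col] = B62[rem];
    }
}
```

The in-loop value rem<<32 | limb stays below 62*2^32, which is the
bound the AVX2 divmod steps rely on.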

Algorithmic notes pinned in the file header:
  - mulhi64 via four mul_epu32 cross-products + explicit carry
    propagation (mid_low = three u32 sum, fits in 34 bits, no
    u64 overflow); see Critic C2 of issue #13.
  - mul_epu32 reads only the low 32 of each 64-bit lane, so all
    four cross-products are needed to recover the full u64xu64;
    Critic C3.
  - q*62 via shift-trick (q<<6) - (q<<1) since q can momentarily
    be ~2^33 in this kernel's value domain (limb | rem<<32 < 63 *
    2^32) and mul_epu32 would truncate.
  - _mm256_zeroupper() before scalar tail call AND before return
    to avoid AVX-SSE transition penalty (Intel SDM Vol 1 sec
    14.1.2 / Critic C9).
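The schoolbook decomposition in portable scalar form (illustrative
only; the kernel issues the same four cross-products as mul_epu32
across lanes):

```c
#include <stdint.h>

/* High 64 bits of a 64x64 multiply, built from four 32x32->64
 * cross-products with explicit carry propagation. */
static uint64_t mulhi64(uint64_t a, uint64_t b)
{
    uint64_t al = a & 0xffffffffu, ah = a >> 32;
    uint64_t bl = b & 0xffffffffu, bh = b >> 32;
    uint64_t ll = al * bl;
    uint64_t lh = al * bh;
    uint64_t hl = ah * bl;
    uint64_t hh = ah * bh;
    /* mid is a sum of three 32-bit quantities, so it fits in 34 bits
     * and cannot overflow u64 (the Critic C2 argument). */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```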

Build wiring:
  - new option('avx2_batch') in meson.options, type:feature
    value:auto.
  - libksuid/encode_avx2.c built as a separate static_library with
    -mavx2 (or /arch:AVX2 on MSVC), pic:true, link_whole into
    both_libraries('ksuid'). Rest of libksuid keeps the SSE2
    baseline ABI; the AVX2 kernel is selected at runtime by the
    encode_batch.c trampoline via __builtin_cpu_supports("avx2").
  - common_args gets -DKSUID_HAVE_AVX2_BATCH=1 only when the
    compiler accepts -mavx2 / /arch:AVX2 AND the option resolved
    to enabled. Non-x86_64 hosts compile nothing extra.

Doc updates: ksuid.h's ksuid_string_batch contract now describes
the AVX2 dispatch path and the KSUID_FORCE_SCALAR kill switch
(implemented in commit 3); README.md drops the "AVX2 not yet
shipped" disclaimer and points at the divisor_magic deriver +
parity test infrastructure.

Verification:
  - meson test -C build: 16/16 pass on x86_64 + AVX2 (Linux
    glibc, GCC). Public ksuid_string_batch parity test exercises
    the AVX2 path via the runtime dispatcher.
  - Direct AVX2-vs-scalar differential parity lands in commit 2.
  - clang-tidy 22.1: 0 findings on encode_avx2.c, encode_batch.c,
    test_string_batch.c (only pre-existing empty-TU warnings on
    the inactive NEON files which never compile on x86_64).
  - gst-indent: clean.

Closes #13 (with commits 1a, 1b, 2, 3).
…#13 2)

Adds direct-extern parity tests in tests/test_string_batch.c that
bypass the runtime dispatcher and call ksuid_string_batch_scalar +
ksuid_string_batch_avx2 directly, comparing byte-for-byte.

The four new test cases cover the gaps the Critic risk register
flagged in issue #13:

  - test_avx2_parity_n_in_block_boundaries: n in {1, 7, 8, 9, 15,
    16, 17, 23, 24, 25, 1000} -- exercises tail-only, exact-block,
    off-by-one-into-tail, and multi-block paths so any tail-merge
    or stride bug surfaces immediately. (Critic R1.)
  - test_avx2_parity_lane_swap_detection: 8 distinct KSUIDs in a
    single SIMD vector. If the SoA pack ever misaligns lane k to
    lane k', per-position byte-compare against the scalar
    reference fails. The existing per-ID parity-vs-ksuid_format
    test would mask a swap whose output coincidentally matched a
    different input; this one cannot. (Critic R3.)
  - test_avx2_parity_corner_values: 16 KSUIDs alternating
    NIL / MAX / all-0x80 / striped-0xff-limbs across two SIMD
    blocks, exercising "all-zero limbs" (sustained '0' emission),
    "max limbs" (mulhi at saturation), and "high bit set in
    every limb" (catches signed/unsigned confusion in mulhi64).
  - test_avx2_parity_one_million_lcg: 2^20 LCG-seeded random
    KSUIDs differential-checked end-to-end. Seed
    0x9e3779b97f4a7c15 matches test_divisor_magic.c so a CI
    failure reproduces locally. (M2 acceptance threshold from
    issue #13.)
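The suite's actual LCG parameters live in test_string_batch.c; as a
sketch of the "seeded for reproducibility" property, a splitmix64-style
generator works the same way (generator choice is an assumption here,
only the seed value comes from the suite):

```c
#include <stdint.h>

/* Deterministic 64-bit stream: the same seed always reproduces the
 * same corpus, so a CI failure replays locally bit-for-bit. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}
```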

The block is gated by KSUID_HAVE_AVX2_BATCH (compile time) AND
__builtin_cpu_supports("avx2") (runtime), so the same x86_64 binary
is safe on non-AVX2 hosts: the tests compile in but skip silently.

malloc-pair sites use an explicit early-return on allocation
failure rather than ASSERT_TRUE so clang-analyzer-core's path-
sensitive NPE checker stays clean (the existing FAIL_ macros
only increment a counter and do not abort).
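The shape of that malloc-pair pattern, as a sketch (function name and
body are illustrative; the real sites are in tests/test_string_batch.c):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Early-return on allocation failure instead of a non-aborting
 * ASSERT_TRUE, so the path-sensitive analyzer sees both pointers as
 * non-null on every path that dereferences them. */
static bool run_parity_case(size_t n)
{
    uint8_t *ids = malloc(n * 20);
    char *out = malloc(n * 27);
    if (ids == NULL || out == NULL) {  /* a counter-only FAIL_ would fall through */
        free(ids);
        free(out);
        return false;
    }
    /* ... fill ids, run scalar + AVX2 kernels, byte-compare ... */
    free(ids);
    free(out);
    return true;
}
```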
… 3)

Reads getenv("KSUID_FORCE_SCALAR") inside the dispatch trampoline
and pins the resolved kernel to ksuid_string_batch_scalar when
the variable is set to a non-empty, non-"0", non-"false" value.

The check runs exactly once per process (the trampoline only fires
before the atomic store of the resolved pointer), so the steady-state
cost is zero and hot-path calls never touch getenv.
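The override predicate can be sketched like this (helper name is
hypothetical; the real check lives inline in the encode_batch.c
trampoline):

```c
#include <stdbool.h>
#include <string.h>

/* R11 kill switch: force the scalar kernel when the variable is set
 * to a non-empty value other than "0" or "false". */
static bool force_scalar_requested(const char *val)
{
    return val != NULL && val[0] != '\0' &&
           strcmp(val, "0") != 0 && strcmp(val, "false") != 0;
}
```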

This is the runtime kill switch demanded by Critic R11 of issue
#13: if a future regression in the AVX2 kernel is discovered
after rollout, operators can pin the scalar path at startup
without rebuilding the library or shipping a new release. The
existing parity tests in tests/test_string_batch.c continue to
exercise the AVX2 path because they call the kernel symbols
directly, bypassing the dispatcher.

The override is documented in libksuid/ksuid.h's contract for
ksuid_string_batch (commit 1b) and in README.md.
@justinjoy justinjoy merged commit 070ae61 into main Apr 30, 2026
11 checks passed
@justinjoy justinjoy deleted the feature/issue-13-avx2-kernel branch April 30, 2026 09:05
Successfully merging this pull request may close these issues.

AVX2 8-wide ksuid_string_batch kernel (follow-up to #5)