v0.11.0 — full SIMD specialisation parity (NEON + AVX2)
v0.11.0 closes the SIMD specialisation table on both architectures. Every production bit width that the scalar fallback was serving at ~7-9 GB/s now has a hand-tuned SIMD kernel on both AArch64 NEON and x86_64 AVX2.
Coverage delta vs v0.10.0
| Width | NEON v0.10 | NEON v0.11 | AVX2 v0.10 | AVX2 v0.11 |
|---|---|---|---|---|
| 1 | scalar | ✓ added | scalar | ✓ added |
| 4 | ✓ shipped | ✓ | scalar | ✓ added |
| 5 | scalar | ✓ added | scalar | ✓ added |
| 8 | ✓ shipped | ✓ | scalar | ✓ added |
| 12, 14-18 | ✓ | ✓ | ✓ | ✓ |
| 20 | scalar | ✓ added | scalar | ✓ added |
| 21 | scalar | ✓ added | scalar | ✓ added |
All new kernels landed in #64; the release PR is #65.
Per-width strategy
- bw=1 — broadcast each source byte to 8 lanes, AND with per-lane bit-mask, compare-eq → 0/1 outputs
- bw=4 — nibble extract (AND 0x0F + shift-right-4), interleave per parquet LSB-first packing, widen
- bw=5 — extract one u16 per lane via shuffle, variable-shift, mask, widen
- bw=8 — trivial byte-aligned widen (vmovl chain / _mm256_cvtepu8_epi32)
- bw=20 — mirrors bw=17/18: two 16-byte loads (offsets 0, 10), alternating shifts, mask 0x0F_FFFF
- bw=21 — like bw=20 but lo/hi halves use different shuffles + every lane has a distinct shift
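
For readers who want to check the lane arithmetic, the bw=1, bw=4, and bw=20 strategies above can be modelled in plain scalar Rust. This is only a sketch of the per-lane logic under the Parquet LSB-first packing, not the shipped NEON or AVX2 kernels; the function names are hypothetical.

```rust
/// Scalar model of bw=1: each source byte feeds 8 output lanes; lane `n`
/// ANDs the byte with the mask `1 << n` and compares the result to the mask.
pub fn unpack_bw1_model(src: &[u8]) -> Vec<u32> {
    let mut out = Vec::with_capacity(src.len() * 8);
    for &byte in src {
        for lane in 0..8 {
            let mask = 1u8 << lane; // LSB-first: lane 0 reads bit 0
            out.push(u32::from((byte & mask) == mask)); // true -> 1, false -> 0
        }
    }
    out
}

/// Scalar model of bw=4: the low nibble is the earlier value, the high
/// nibble the later one, and both are widened to u32.
pub fn unpack_bw4_model(src: &[u8]) -> Vec<u32> {
    let mut out = Vec::with_capacity(src.len() * 2);
    for &byte in src {
        out.push(u32::from(byte & 0x0F));
        out.push(u32::from(byte >> 4));
    }
    out
}

/// Scalar model of bw=20: value `i` starts at bit `20 * i`, so the in-byte
/// shift alternates between 0 and 4, and every value is masked to 20 bits.
pub fn unpack_bw20_model(src: &[u8]) -> Vec<u32> {
    const MASK: u32 = 0x0F_FFFF;
    let count = src.len() * 8 / 20;
    (0..count)
        .map(|i| {
            let bit = i * 20;
            let (byte, shift) = (bit / 8, bit % 8);
            // Gather up to 4 little-endian bytes; missing tail bytes read as 0.
            let mut word = 0u32;
            for k in 0..4 {
                if let Some(&b) = src.get(byte + k) {
                    word |= u32::from(b) << (8 * k);
                }
            }
            (word >> shift) & MASK
        })
        .collect()
}
```

Eight 20-bit values occupy exactly 20 bytes, which is consistent with the two loads at offsets 0 and 10 each covering four values.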
Widths 12, 14, 15, 16, 17, and 18 have had full NEON + AVX2 coverage since the Π.12 cycle. Widths still on the scalar const-generic path as of this release (bw=2, 3, 6, 7, 9, 10, 11, 13, 19, 22..32) get SIMD coverage on demand; v0.12.0 subsequently added bw=2 and bw=3.
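
As a rough illustration of what a scalar const-generic path can look like, the sketch below unpacks LSB-first bit-packed values with the bit width as a const parameter. The function name and signature are assumptions made for this example, not the crate's actual API.

```rust
/// Hypothetical const-generic scalar unpacker (illustrative only).
/// `BW` is the bit width in 1..=32. Returns the number of values written.
pub fn unpack_scalar<const BW: u32>(src: &[u8], out: &mut [u32]) -> usize {
    assert!((1..=32).contains(&BW), "bit width must be in 1..=32");
    let width = BW as usize;
    let mask: u64 = (1u64 << BW) - 1;
    // Only decode values that are fully present in `src` and fit in `out`.
    let count = core::cmp::min(out.len(), src.len() * 8 / width);
    for (i, slot) in out.iter_mut().take(count).enumerate() {
        let bit = i * width;
        let (byte, shift) = (bit / 8, bit % 8);
        // A value spans at most 5 bytes: 32 bits plus up to 7 bits of offset.
        let mut word = 0u64;
        for k in 0..5 {
            if let Some(&b) = src.get(byte + k) {
                word |= u64::from(b) << (8 * k);
            }
        }
        *slot = ((word >> shift) & mask) as u32;
    }
    count
}
```

With `BW = 3`, the bytes `0x88 0xC6 0xFA` decode to the values 0 through 7. Because `BW` is known at compile time, the mask and the per-value offsets are constants the compiler can fold, which is the usual motivation for a const-generic fallback.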
Full test suite (680 tests) green; CI ubuntu-latest exercises the AVX2 path.
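
Which path a given machine exercises depends on how kernels are selected at runtime. The sketch below shows one common way to do that with the standard library's feature-detection macros; it is an assumption about the general technique, not a description of this crate's dispatch code, and `best_path` is a hypothetical helper.

```rust
/// Hypothetical helper: reports which kernel family a runtime dispatcher
/// could pick on the current CPU, falling back to "scalar".
pub fn best_path() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    "scalar"
}
```

A test harness can log this value to confirm which path a CI runner actually covered.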