v0.11.0 — full SIMD specialisation parity (NEON + AVX2)

@ryan-evans-git released this 20 May 00:53
· 54 commits to main since this release
d4731b9

v0.11.0 closes the SIMD specialisation table on both architectures. Every production bit width that the scalar fallback was serving at ~7-9 GB/s now has a hand-tuned SIMD kernel on both AArch64 NEON and x86_64 AVX2.

Coverage delta vs v0.10.0

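| Width | NEON v0.10 | NEON v0.11 | AVX2 v0.10 | AVX2 v0.11 |
|-------|------------|------------|------------|------------|
| 1     | scalar     | ✓ added    | scalar     | ✓ added    |
| 4     | ✓ shipped  |            | scalar     | ✓ added    |
| 5     | scalar     | ✓ added    | scalar     | ✓ added    |
| 8     | ✓ shipped  |            | scalar     | ✓ added    |
| 12-18 |            |            |            |            |
| 20    | scalar     | ✓ added    | scalar     | ✓ added    |
| 21    | scalar     | ✓ added    | scalar     | ✓ added    |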

All new kernels landed in #64; the release PR is #65.

Per-width strategy

  • bw=1 — broadcast each source byte to 8 lanes, AND with per-lane bit-mask, compare-eq → 0/1 outputs
  • bw=4 — nibble extract (AND 0x0F + shift-right-4), interleave per parquet LSB-first packing, widen
  • bw=5 — extract one u16 per lane via shuffle, variable-shift, mask, widen
  • bw=8 — trivial byte-aligned widen (vmovl chain / _mm256_cvtepu8_epi32)
  • bw=20 — mirrors bw=17/18: two 16-byte loads (offsets 0, 10), alternating shifts, mask 0x0F_FFFF
  • bw=21 — like bw=20 but lo/hi halves use different shuffles + every lane has a distinct shift
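The strategies above can be modelled in portable scalar Rust to make the bit arithmetic concrete. This is a minimal sketch, not the kernels from #64: the function names (`unpack_bw1_model`, `unpack_bw4_model`, `lane_plan`) are hypothetical, and the only assumption is LSB-first packing with value `i` starting at bit `i * bw`.

```rust
/// Scalar model of the bw=1 idea: one source byte yields 8 outputs.
/// A SIMD kernel would broadcast the byte to 8 lanes, AND each lane with
/// its own single-bit mask, and turn the compare result into 0 or 1.
fn unpack_bw1_model(src: &[u8]) -> Vec<u32> {
    let mut out = Vec::with_capacity(src.len() * 8);
    for &byte in src {
        for lane in 0..8 {
            let mask = 1u8 << lane; // per-lane bit mask, LSB first
            out.push(u32::from((byte & mask) == mask));
        }
    }
    out
}

/// Scalar model of the bw=4 idea: split each byte into two nibbles.
/// With LSB-first packing the low nibble is the earlier value.
fn unpack_bw4_model(src: &[u8]) -> Vec<u32> {
    let mut out = Vec::with_capacity(src.len() * 2);
    for &byte in src {
        out.push(u32::from(byte & 0x0F)); // AND 0x0F
        out.push(u32::from(byte >> 4)); // shift-right-4
    }
    out
}

/// For one block of 8 values at bit width `bw`, return each lane's
/// (byte offset, right shift) inside the packed block.
fn lane_plan(bw: usize) -> [(usize, usize); 8] {
    let mut plan = [(0usize, 0usize); 8];
    for (lane, entry) in plan.iter_mut().enumerate() {
        let bit = lane * bw; // first bit of this lane's value
        *entry = (bit / 8, bit % 8);
    }
    plan
}
```

For `bw = 20`, `lane_plan` gives shifts 0, 4, 0, 4, 0, 4, 0, 4, and lane 4 starts at byte 10, which is consistent with a second load at offset 10 and alternating shifts. For `bw = 21` the shifts are 0, 5, 2, 7, 4, 1, 6, 3, so every lane needs its own shift. The byte offsets of lanes 4..7 relative to byte 10 (0, 3, 5, 8) also differ from those of lanes 0..3 (0, 2, 5, 7), which would explain why the two halves cannot share a shuffle.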

Widths 12, 14, 15, 16, 17, and 18 have had full NEON + AVX2 coverage since the Π.12 cycle. Widths still on the scalar const-generic path (bw=2, 3, 6, 7, 9, 10, 11, 13, 19, 22..32) get SIMD coverage on demand; v0.12.0 later added bw=2 and bw=3.
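For readers unfamiliar with the term, a scalar const-generic path can look roughly like the sketch below. It is an illustration only: `unpack_scalar` is a hypothetical name and signature, not this crate's API, and it assumes LSB-first packing into `u32` outputs. Making the width a const parameter lets the compiler specialise the mask and shift arithmetic per width.

```rust
/// Hypothetical scalar fallback: unpack `out.len()` values of `BW` bits
/// each from an LSB-first packed byte stream.
fn unpack_scalar<const BW: u32>(src: &[u8], out: &mut [u32]) {
    assert!(BW >= 1 && BW <= 32, "bit width must be in 1..=32");
    assert!(
        src.len() * 8 >= out.len() * BW as usize,
        "source too short for the requested number of values"
    );
    let mask: u64 = (1u64 << BW) - 1;
    for (i, slot) in out.iter_mut().enumerate() {
        let bit = i * BW as usize; // first bit of value i
        let (byte, shift) = (bit / 8, bit % 8);
        // Gather up to 5 bytes: 32 value bits plus a shift of at most 7
        // span at most 39 bits, which fits in 5 bytes.
        let mut window = 0u64;
        for (k, &b) in src[byte..].iter().take(5).enumerate() {
            window |= u64::from(b) << (8 * k);
        }
        *slot = ((window >> shift) & mask) as u32;
    }
}
```

As a usage example, the bytes `0x88, 0xC6, 0xFA` hold the values 0 through 7 packed at 3 bits each, so `unpack_scalar::<3>` over those bytes with an 8-element output fills it with `[0, 1, 2, 3, 4, 5, 6, 7]`.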

Full test suite (680 tests) is green; the ubuntu-latest CI job exercises the AVX2 path.