v0.13.0 — full SIMD specialisation table (NEON + AVX2, bw=1..=32)
v0.13.0 closes the SIMD specialisation coverage table on both architectures. After this release, the const-generic scalar fallback on the raw-indices path is reserved for bit_width == 0 and the partial tail that follows the full blocks; every full block runs through a hand-tuned intrinsic kernel on both AArch64 NEON and x86_64 AVX2. Fused-lookup has SIMD kernels for bw=1..=21 and still uses the scalar lookup for bw=22..32 (see Remaining gap).
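The split between full blocks and the partial tail can be sketched as follows. This is an illustration under stated assumptions, not the crate's dispatcher: the names `BLOCK`, `get_scalar`, `kernel_block`, and `unpack` are hypothetical, the block size of 8 values is inferred from the "one 256-bit load/store per block" note for bw=32, the bit order is assumed to be LSB-first, and `kernel_block` is a scalar stand-in where the real code would call a per-width NEON or AVX2 kernel.

```rust
/// Values per full block (assumption: 8 values per block).
const BLOCK: usize = 8;

/// Scalar reference: read value `i` of width `BW` from an LSB-first packed stream.
fn get_scalar<const BW: u32>(src: &[u8], i: usize) -> u32 {
    let mut v: u64 = 0;
    let first_bit = i * BW as usize;
    for k in 0..BW as usize {
        let b = first_bit + k;
        let bit = (src[b / 8] >> (b % 8)) & 1;
        v |= (bit as u64) << k;
    }
    v as u32
}

/// Stand-in for a hand-tuned SIMD kernel: decodes exactly one full block.
fn kernel_block<const BW: u32>(src: &[u8], out: &mut [u32; BLOCK]) {
    for i in 0..BLOCK {
        out[i] = get_scalar::<BW>(src, i);
    }
}

/// Decode `n` values: full blocks go through the kernel, the tail through scalar code.
pub fn unpack<const BW: u32>(src: &[u8], n: usize) -> Vec<u32> {
    let mut out = vec![0u32; n];
    if BW == 0 {
        // Zero-width values carry no bits: every value is 0.
        return out;
    }
    let full = n / BLOCK;
    let block_bytes = BW as usize; // 8 values * BW bits = BW bytes per block
    for b in 0..full {
        let mut tmp = [0u32; BLOCK];
        kernel_block::<BW>(&src[b * block_bytes..], &mut tmp);
        out[b * BLOCK..(b + 1) * BLOCK].copy_from_slice(&tmp);
    }
    // Partial tail: fewer than BLOCK values remain after the last full block.
    let tail_src = &src[full * block_bytes..];
    for i in 0..n % BLOCK {
        out[full * BLOCK + i] = get_scalar::<BW>(tail_src, i);
    }
    out
}
```

Because a block of 8 values occupies exactly `BW` bytes, every block starts on a byte boundary, so the tail can be decoded with the same scalar routine and lane index 0.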
Coverage
| Path | NEON | AVX2 |
|---|---|---|
| Raw-indices | 1..=32 | 1..=32 |
| Fused-lookup | 1..=21 | 1..=21 |
What landed
Four PRs on top of v0.12.0 added 48 new SIMD kernels:
- #69 (Phase 1) — fused-lookup parity for bw=1, 2, 3, 5, 20, 21 (12 kernels). The lookup path is the hot one for dict-encoded scans; these widths already had raw-indices SIMD but were falling through to the scalar const-generic lookup. Each width has a per-arch staging helper (mirroring the existing bw=12 callback pattern) and a wrapper with bounds-safe + bounds-checked paths.
- #70 (Phase 2) — raw-indices SIMD for bw=6, 7 (4 kernels). Targets dicts of 33–128 distinct values (status enums, country codes, low-cardinality byte-array columns).
- #71 (Phase 3) — raw-indices SIMD for bw=9, 10, 11, 13, 19 (10 kernels). Medium-dict workloads (256–512K distinct values). bw=19 uses two 16-byte loads because lane 7 starts at byte 16.
- #72 (Phase 4) — raw-indices SIMD for bw=22..32 (22 kernels). Macro-templated u32/u64 staging:
  - bw=22..25: u32 staging (max value + max bit_off fits in u32)
  - bw=26..31: u64 staging (33-bit value-spans overflow u32)
  - bw=32: byte-aligned trivial copy (one 256-bit load/store per block)
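The lane arithmetic behind the bw=19 note and the u32/u64 staging split can be checked with a small sketch. Everything here is hypothetical naming (`lane_byte`, `lane_bit_off`, `Staging`, `staging_for`, `extract`), and the rule is an assumption based on the notes above: 8 lanes per block, LSB-first packing, and a worst-case bit offset of 7 when deciding whether a value span fits in 32 bits. The `extract` helper always stages through a u64 for simplicity; it is a scalar illustration, not the macro-templated kernels.

```rust
/// Worst-case bit offset of a value inside its starting byte.
const MAX_BIT_OFF: u32 = 7;

/// Byte at which `lane` starts within a packed block of width `bw`.
pub const fn lane_byte(bw: u32, lane: u32) -> u32 {
    (lane * bw) / 8
}

/// Bit offset of `lane` inside its starting byte.
pub const fn lane_bit_off(bw: u32, lane: u32) -> u32 {
    (lane * bw) % 8
}

#[derive(Debug, PartialEq, Eq)]
pub enum Staging {
    U32,
    U64,
    Copy,
}

/// Pick a staging width from the worst-case span `bw + MAX_BIT_OFF`.
pub fn staging_for(bw: u32) -> Staging {
    match bw {
        32 => Staging::Copy, // byte-aligned: values are already whole u32s
        b if b + MAX_BIT_OFF <= 32 => Staging::U32,
        _ => Staging::U64,
    }
}

/// Scalar extraction of one lane, staged through a little-endian u64.
pub fn extract(src: &[u8], bw: u32, lane: u32) -> u32 {
    let byte = lane_byte(bw, lane) as usize;
    let off = lane_bit_off(bw, lane);
    // Copy up to 8 bytes starting at the lane's first byte; missing bytes stay 0.
    let tail = src.get(byte..).unwrap_or(&[]);
    let n = tail.len().min(8);
    let mut buf = [0u8; 8];
    buf[..n].copy_from_slice(&tail[..n]);
    let staged = u64::from_le_bytes(buf);
    let mask = (1u64 << bw) - 1; // valid for bw <= 32
    ((staged >> off) & mask) as u32
}
```

For bw=19, `lane_byte(19, 7)` is 16 (bit 133), so lane 7 lies entirely past the first 16-byte load. For bw=25 the worst-case span is 25 + 7 = 32 bits, while bw=26 gives 33, which is where this sketch switches to u64 staging.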
Remaining gap
Phase 4 is raw-indices only: fused-lookup for bw=22..32 still falls through to the scalar const-generic lookup. At 4M..4G dict entries the gather is memory-bound, so any lookup-path SIMD savings are expected to be negligible relative to cache behaviour. This is captured as a below-the-line follow-up in docs/plans/CURRENT.md, to be wired up if a real profile shows it matters.
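For readers unfamiliar with the fused path, a scalar const-generic lookup might look roughly like the sketch below. It is not the crate's implementation: `lookup_scalar`, its signature, and the `Err(index)` convention for out-of-range indices are assumptions made for illustration, as is LSB-first packing. It decodes each index and immediately gathers `dict[idx]` into the output, without materialising the index array.

```rust
/// Decode `out.len()` indices of width `BW` from `src` and write `dict[idx]`
/// for each one. Returns `Err(idx)` on the first index outside `dict`.
pub fn lookup_scalar<T: Copy, const BW: u32>(
    src: &[u8],
    dict: &[T],
    out: &mut [T],
) -> Result<(), usize> {
    let mask = (1u64 << BW) - 1; // valid for BW <= 32
    for (i, slot) in out.iter_mut().enumerate() {
        let bit = i * BW as usize;
        let byte = bit / 8;
        let off = (bit % 8) as u32;
        // Stage up to 8 bytes so that off + BW (at most 39 bits) always fits.
        let tail = src.get(byte..).unwrap_or(&[]);
        let n = tail.len().min(8);
        let mut buf = [0u8; 8];
        buf[..n].copy_from_slice(&tail[..n]);
        let idx = ((u64::from_le_bytes(buf) >> off) & mask) as usize;
        // The gather: for very large dictionaries this load dominates the cost.
        *slot = *dict.get(idx).ok_or(idx)?;
    }
    Ok(())
}
```

With a dictionary of millions of entries, each `dict.get(idx)` is likely to miss cache, which is the reason given above for leaving this path scalar at bw=22..32.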
Testing
735 tests pass on aarch64 (M-series). Linux CI exercises the AVX2 path. Each new kernel has bit-exact pack-helper oracles covering known patterns, partial-tail sizes, random inputs (with max-value-boundary checks at bw=32), and dispatcher routing.
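A pack-helper oracle of the kind described can be sketched as below. The names (`pack`, `unpack`, `round_trips`, `all_widths_round_trip`), the xorshift generator, and the chosen sizes are all hypothetical; in the real suite the decoder under test would be a SIMD kernel, whereas here a bit-by-bit `unpack` stands in for it. LSB-first packing is assumed.

```rust
/// Oracle: pack `values` at width `bw`, LSB-first, one bit at a time.
pub fn pack(values: &[u32], bw: u32) -> Vec<u8> {
    let total_bits = values.len() * bw as usize;
    let mut out = vec![0u8; (total_bits + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        for k in 0..bw as usize {
            if (v >> k) & 1 == 1 {
                let b = i * bw as usize + k;
                out[b / 8] |= 1u8 << (b % 8);
            }
        }
    }
    out
}

/// Stand-in for the decoder under test.
pub fn unpack(src: &[u8], bw: u32, n: usize) -> Vec<u32> {
    (0..n)
        .map(|i| {
            let mut v = 0u32;
            for k in 0..bw as usize {
                let b = i * bw as usize + k;
                v |= (((src[b / 8] >> (b % 8)) & 1) as u32) << k;
            }
            v
        })
        .collect()
}

/// Largest value representable in `bw` bits (bw in 1..=32).
fn max_value(bw: u32) -> u32 {
    if bw == 32 { u32::MAX } else { (1u32 << bw) - 1 }
}

/// Pack pseudo-random values (plus the max-value boundary) and compare bit-exactly.
pub fn round_trips(bw: u32, n: usize, seed: u64) -> bool {
    let max = max_value(bw);
    let mut x = seed | 1; // xorshift state must be non-zero
    let mut values: Vec<u32> = (0..n)
        .map(|_| {
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            (x >> 32) as u32 & max
        })
        .collect();
    if let Some(first) = values.first_mut() {
        *first = max; // always exercise the boundary value
    }
    unpack(&pack(&values, bw), bw, n) == values
}

/// Sweep every width over sizes on both sides of an 8-value block boundary.
pub fn all_widths_round_trip() -> bool {
    (1..=32u32).all(|bw| {
        [0usize, 1, 7, 8, 9, 17]
            .iter()
            .all(|&n| round_trips(bw, n, 0x9E37_79B9 + bw as u64))
    })
}
```

The sizes 7, 8, 9, and 17 are picked so that, under the assumed 8-value block, the sweep covers tail-only input, an exact block, and blocks followed by a partial tail.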
Crates
All five crates published to crates.io at 0.13.0:
- ematix-parquet-format
- ematix-parquet-io
- ematix-parquet-crypto
- ematix-parquet-codec
- ematix-parquet-async
No breaking API changes vs v0.12.0.