Skip to content

v0.13.0 — full SIMD specialisation table (NEON + AVX2, bw=1..=32)

Choose a tag to compare

@ryan-evans-git ryan-evans-git released this 20 May 13:12
· 46 commits to main since this release
9111d6b

v0.13.0 closes the SIMD specialisation coverage table on both architectures. After this release the const-generic scalar fallback is reserved for bit_width == 0 and the partial-tail path after each block — every full block runs through a hand-tuned intrinsic kernel on both AArch64 NEON and x86_64 AVX2.

Coverage

Path NEON AVX2
Raw-indices 1..=32 1..=32
Fused-lookup 1..=21 1..=21

What landed

Four PRs on top of v0.12.0 added 48 new SIMD kernels:

  • #69 (Phase 1) — fused-lookup parity for bw=1, 2, 3, 5, 20, 21 (12 kernels). The lookup path is the hot one for dict-encoded scans; these widths already had raw-indices SIMD but were falling through to the scalar const-generic lookup. Each width has a per-arch staging helper (mirroring the existing bw=12 callback pattern) and a wrapper with bounds-safe + bounds-checked paths.
  • #70 (Phase 2) — raw-indices SIMD for bw=6, 7 (4 kernels). Targets dicts of 33–128 distinct values (status enums, country codes, low-cardinality byte-array columns).
  • #71 (Phase 3) — raw-indices SIMD for bw=9, 10, 11, 13, 19 (10 kernels). Medium-dict workloads (256–512K distinct values). bw=19 uses two 16-byte loads because lane 7 starts at byte 16.
  • #72 (Phase 4) — raw-indices SIMD for bw=22..32 (22 kernels). Macro-templated u32/u64 staging:
    • bw=22..25: u32 staging (max value + max bit_off fits in u32)
    • bw=26..31: u64 staging (33-bit value-spans overflow u32)
    • bw=32: byte-aligned trivial copy — one 256-bit load/store per block

Remaining gap

Phase 4 is raw-indices only — fused-lookup for bw=22..32 falls through to the scalar const-generic lookup. At 4M..4G dict entries the gather is memory-bound, so the lookup-path SIMD savings are negligible relative to cache behaviour. Captured as a below-the-line follow-up in docs/plans/CURRENT.md; wire if a real profile shows it matters.

Testing

735 tests pass on aarch64 (M-series). Linux CI exercises the AVX2 path. Each new kernel has bit-exact pack-helper oracles covering known patterns, partial-tail sizes, random inputs (with max-value-boundary checks at bw=32), and dispatcher routing.

Crates

All five crates published to crates.io at 0.13.0:

  • ematix-parquet-format
  • ematix-parquet-io
  • ematix-parquet-crypto
  • ematix-parquet-codec
  • ematix-parquet-async

No breaking API changes vs v0.12.0.