v0.12.0 — buffer-reuse + small-bw SIMD coverage
Three perf wins on top of v0.11.0, addressing the largest remaining hot-spots from profiling a 22-query TPC-H bench.
What's new
- ParquetFile::read_range_into(&mut Vec<u8>, offset, length): buffer reuse on the hot path. Callers pre-allocate one chunk buffer per row group and reuse it across column reads, eliminating the per-call Vec alloc + zero-fill + drop-time madvise(MADV_DONTNEED) that accounted for 10–15% of CPU on scan-heavy workloads. The existing read_range delegates through a fresh Vec, so no breaking change.
- SIMD coverage for bw=2 / bw=3 raw-indices and bw=4 / bw=6 / bw=8 fused-lookup on both NEON and AVX2. Closes the small-bit-width gap the const-generic scalar path was filling at ~7–9 GB/s. bw=2 uses 4 parallel shift+mask streams interleaved via vzip/_mm_unpacklo_epi8 + _mm_unpacklo/hi_epi16; bw=3 uses per-lane TBL gather + variable right-shift; the fused-lookup kernels follow the bw=12/14 staging-callback pattern with a bounds-safe fast path when dict_size proves every unpacked index fits.
- decode_rle_dictionary_indices_into: append-only, zero-alloc variant of the existing Vec-returning function. Threaded through read_column_byte_array_dict_preserved_into so a ~50-data-page TPC-H lineitem chunk drops from 50 transient Vec<u32> allocations to zero.
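
For readers who want a mental model of the small-bit-width kernels, here is a scalar sketch of what the bw=2 unpack and a bw=4 fused lookup compute. It is not the crate's code: the function names unpack_bw2 and lookup_bw4 are hypothetical, the LSB-first field order is assumed from Parquet's RLE/bit-packed hybrid encoding, and the i64 dictionary type is chosen only for illustration. The real kernels do the same work with NEON/AVX2 shuffles rather than loops.

```rust
/// Scalar model of the bw=2 unpack: four shift+mask streams, then interleave.
/// Assumes LSB-first packing (first value in bits 0..2 of each byte).
fn unpack_bw2(packed: &[u8], out: &mut Vec<u32>) {
    // One "stream" per 2-bit field position; each covers every input byte.
    let field = |shift: u8| -> Vec<u32> {
        packed.iter().map(|&b| u32::from((b >> shift) & 0b11)).collect()
    };
    let (s0, s1, s2, s3) = (field(0), field(2), field(4), field(6));

    // Interleave the streams so values come out in their original order.
    // This is the step a SIMD kernel would do with zip/unpack instructions.
    out.reserve(packed.len() * 4);
    for i in 0..packed.len() {
        out.extend_from_slice(&[s0[i], s1[i], s2[i], s3[i]]);
    }
}

/// Scalar model of a fused unpack + dictionary lookup for bw=4.
/// Appends to `out`; returns None if an index falls outside `dict`
/// (values decoded before the bad index stay in `out`).
fn lookup_bw4(packed: &[u8], dict: &[i64], out: &mut Vec<i64>) -> Option<()> {
    out.reserve(packed.len() * 2);
    if dict.len() >= (1usize << 4) {
        // Fast path: a 4-bit index is at most 15, so with 16 or more
        // dictionary entries no index can be out of range. Safe Rust still
        // emits its own bounds check here, but it can never fire.
        for &b in packed {
            out.push(dict[usize::from(b & 0x0F)]);
            out.push(dict[usize::from(b >> 4)]);
        }
    } else {
        // Slow path: the dictionary is smaller than the index space,
        // so every index has to be validated individually.
        for &b in packed {
            out.push(*dict.get(usize::from(b & 0x0F))?);
            out.push(*dict.get(usize::from(b >> 4))?);
        }
    }
    Some(())
}
```

The dictionary-size test is made once per call rather than once per value, which is what makes the fast path cheap: the proof that every index fits depends only on the bit width and the dictionary length, not on the data.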
Also adds crates/ematix-parquet-io/examples/rg_count.rs — a small diagnostic that prints per-file rows + row-group counts.
No breaking API changes.
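
The buffer-reuse items and the compatibility claim both rest on the same pattern: an _into function that writes into a caller-owned Vec, plus the older allocating function kept as a thin wrapper. The sketch below shows that pattern on a hypothetical in-memory MemFile type. It is not the crate's ParquetFile; the argument types, the error handling, and the choice to clear the buffer before filling it are assumptions made for illustration.

```rust
use std::io;

/// Hypothetical in-memory stand-in for a file handle (not the crate's type).
struct MemFile {
    bytes: Vec<u8>,
}

impl MemFile {
    /// Fills `buf` with `length` bytes starting at `offset`.
    /// The caller's allocation is reused whenever its capacity suffices.
    fn read_range_into(
        &self,
        buf: &mut Vec<u8>,
        offset: usize,
        length: usize,
    ) -> io::Result<()> {
        let end = offset
            .checked_add(length)
            .filter(|&e| e <= self.bytes.len())
            .ok_or_else(|| {
                io::Error::new(io::ErrorKind::UnexpectedEof, "range past end of file")
            })?;
        // clear() drops the contents but keeps the capacity, so a buffer that
        // is already large enough is neither reallocated nor zero-filled.
        buf.clear();
        buf.extend_from_slice(&self.bytes[offset..end]);
        Ok(())
    }

    /// Allocating variant with the older shape. It delegates through a fresh
    /// Vec, so existing callers keep compiling and behaving as before.
    fn read_range(&self, offset: usize, length: usize) -> io::Result<Vec<u8>> {
        let mut buf = Vec::new();
        self.read_range_into(&mut buf, offset, length)?;
        Ok(buf)
    }
}
```

A caller on the hot path would create one buffer per row group, for example with Vec::with_capacity sized to the largest column chunk, and pass it to read_range_into for each column in turn. Callers that do not care about allocation can keep using read_range unchanged.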