Release v0.12.0 — buffer-reuse + small-bw SIMD coverage · ryan-evans-git/ematix-parquet

Three perf wins on top of v0.11.0, addressing the largest remaining hot-spots from profiling a 22-query TPC-H bench.

What's new

ParquetFile::read_range_into(&mut Vec<u8>, offset, length) — buffer reuse on the hot path. Callers pre-allocate one chunk buffer per row group and reuse it across column reads, eliminating the per-call Vec alloc + zero-fill + drop-time madvise(MADV_DONTNEED) that accounted for 10–15% of CPU on scan-heavy workloads. The existing read_range now delegates to it through a fresh Vec, so this is not a breaking change.
SIMD coverage for bw=2 / bw=3 raw-indices and bw=4 / bw=6 / bw=8 fused-lookup on both NEON and AVX2. Closes the small-bit-width gap the const-generic scalar path was filling at ~7–9 GB/s. bw=2 uses 4 parallel shift+mask streams interleaved via vzip / _mm_unpacklo_epi8 + _mm_unpacklo/hi_epi16; bw=3 uses per-lane TBL gather + variable right-shift; the fused-lookup kernels follow the bw=12/14 staging-callback pattern with a bounds-safe fast path when dict_size proves every unpacked index fits.
decode_rle_dictionary_indices_into — append-only, zero-alloc variant of the existing Vec-returning function. Threaded through read_column_byte_array_dict_preserved_into so a ~50-data-page TPC-H lineitem chunk drops from 50 transient Vec<u32> allocations to zero.
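
The two "_into" patterns above can be illustrated with a minimal, std-only sketch. It is not the crate's implementation: RangeReader, unpack_bw2_into, and scan_columns are hypothetical names invented for this example, and the signatures only approximate the ones listed. The bw=2 routine is a scalar reference for the layout the SIMD kernels decode (Parquet bit-packing stores values least-significant bits first), not the NEON/AVX2 code itself.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a Parquet file handle (assumption, not the crate's type).
struct RangeReader<R> {
    inner: R,
}

impl<R: Read + Seek> RangeReader<R> {
    /// Read `length` bytes at `offset` into `buf`, reusing its existing capacity.
    fn read_range_into(&mut self, buf: &mut Vec<u8>, offset: u64, length: usize) -> io::Result<()> {
        buf.clear(); // keeps the allocation, drops the old contents
        buf.reserve(length); // allocates only if the capacity is too small
        self.inner.seek(SeekFrom::Start(offset))?;
        let n = self.inner.by_ref().take(length as u64).read_to_end(buf)?;
        if n != length {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "short read"));
        }
        Ok(())
    }

    /// Allocating variant kept for compatibility: delegates through a fresh Vec.
    fn read_range(&mut self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        let mut buf = Vec::new();
        self.read_range_into(&mut buf, offset, length)?;
        Ok(buf)
    }
}

/// Scalar reference for bw=2 unpacking: four shift+mask streams per byte,
/// appended to `out` instead of returned in a new Vec.
fn unpack_bw2_into(packed: &[u8], out: &mut Vec<u32>) {
    out.reserve(packed.len() * 4);
    for &b in packed {
        for shift in [0u32, 2, 4, 6] {
            out.push(((b >> shift) & 0b11) as u32);
        }
    }
}

/// One buffer serves every (offset, length) range; returns the total bytes read.
fn scan_columns<R: Read + Seek>(
    reader: &mut RangeReader<R>,
    ranges: &[(u64, usize)],
) -> io::Result<usize> {
    let mut chunk = Vec::new(); // allocated once, grown at most to the largest range
    let mut total = 0;
    for &(offset, length) in ranges {
        reader.read_range_into(&mut chunk, offset, length)?;
        total += chunk.len();
    }
    Ok(total)
}
```

In this sketch the caller owns the buffer, so its lifetime can span a whole row group. The append-only decoder likewise lets many data pages accumulate into a single Vec<u32>.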

Also adds crates/ematix-parquet-io/examples/rg_count.rs — a small diagnostic that prints per-file rows + row-group counts.

No breaking API changes.