Releases: ryan-evans-git/ematix-parquet
v0.14.0 — compression hygiene + cross-column page-skip
Three changes land on top of v0.13.0, all in the codec / read façade. No SIMD or write-path changes.
Correctness: LZ4_RAW decompression
decompress_lz4_raw_into previously called lz4_flex::block::decompress with compressed.len() * 255 as the worst-case output cap, then re-copied the result into the caller's buffer. Per parquet page that was a ~190 MB scratch allocation plus a full copy. On a Q06-shape SF=10 lineitem re-encoded with LZ4_RAW we measured a 45× regression vs Snappy (3109 ms vs 70 ms) entirely from the allocation pattern.
Fix: decompress_lz4_raw_into_sized(body, uncompressed_size, out) pre-sizes out from the page header's declared length and decodes in place via lz4_flex::block::decompress_into. No scratch, no copy. The size-less variant now walks the LZ4 block tags once to compute the output length before allocating.
Bench: Q06 SF=10 LZ4_RAW closed from 3109 ms → 57.88 ms (now beats Polars 62 ms by 6.7%, Snappy by 18%).
Hardening: DoS caps on ZSTD / Gzip / Brotli
Previously these called read_to_end(out) on a Read adapter with no upper bound. A malicious 1 KB compressed page could expand to many GB and OOM a worker.
Three new variants (decompress_zstd_into_capped, decompress_gzip_into_capped, decompress_brotli_into_capped) take the declared uncompressed size, pre-reserve the buffer, and error out if the decompressed stream exceeds declared + 64 bytes.
The sync read-façade decompress_into dispatch now plumbs uncompressed_page_size through every call site (~33 in the codec, 8 in the async façade). LZ4_RAW becomes the sized variant; ZSTD / Gzip / Brotli become the capped variants. Snappy keeps its own embedded-length path.
Perf: cross-column page-skip (sync read.rs)
decode_chunk_row_masked_into, read_column_byte_array_masked_into, and the dict-preserved offsets variant previously decompressed every data page before checking the row mask's popcount for that page's range. Move the popcount check (via a new data_page_num_values(hdr) helper that reads num_values from the page header — no decompress needed) before data_page_view. Pages with zero mask bits in their row range skip the codec decompression entirely.
Wall-time at Q06 SF=10 selectivity is unchanged (every page has at least some matches at ~1% × ~32K rows/page), but per-thread CPU time for extprice masked decode drops 35%. The lever fires harder on sparser-filter workloads and partitioned data.
Dead-code removal
decompress_snappy_fast_into (~210 LOC of hand-rolled unsafe with a latent copy_back_ref bug, never dispatched in production, and a 4-5 GB/s microbench claim that didn't hold on real Q06 data — was 17% slower than the snap crate) retires. The companion test file and bench_snappy example are deleted.
Tests
Codec unit tests 8 → 27. New cases for Brotli / Gzip / LZ4_RAW: round-trip, empty input, garbage input, capped-rejects-oversize. Full workspace 719 tests green on macOS-14 + Ubuntu CI. Clippy + rustfmt clean.
🤖 Generated with Claude Code
v0.13.0 — full SIMD specialisation table (NEON + AVX2, bw=1..=32)
v0.13.0 closes the SIMD specialisation coverage table on both architectures. After this release the const-generic scalar fallback is reserved for bit_width == 0 and the partial-tail path after each block — every full block runs through a hand-tuned intrinsic kernel on both AArch64 NEON and x86_64 AVX2.
Coverage
| Path | NEON | AVX2 |
|---|---|---|
| Raw-indices | 1..=32 | 1..=32 |
| Fused-lookup | 1..=21 | 1..=21 |
What landed
Four PRs on top of v0.12.0 added 48 new SIMD kernels:
- #69 (Phase 1) — fused-lookup parity for bw=1, 2, 3, 5, 20, 21 (12 kernels). The lookup path is the hot one for dict-encoded scans; these widths already had raw-indices SIMD but were falling through to the scalar const-generic lookup. Each width has a per-arch staging helper (mirroring the existing bw=12 callback pattern) and a wrapper with bounds-safe + bounds-checked paths.
- #70 (Phase 2) — raw-indices SIMD for bw=6, 7 (4 kernels). Targets dicts of 33–128 distinct values (status enums, country codes, low-cardinality byte-array columns).
- #71 (Phase 3) — raw-indices SIMD for bw=9, 10, 11, 13, 19 (10 kernels). Medium-dict workloads (256–512K distinct values). bw=19 uses two 16-byte loads because lane 7 starts at byte 16.
- #72 (Phase 4) — raw-indices SIMD for bw=22..32 (22 kernels). Macro-templated u32/u64 staging:
- bw=22..25: u32 staging (max value + max bit_off fits in u32)
- bw=26..31: u64 staging (33-bit value-spans overflow u32)
- bw=32: byte-aligned trivial copy — one 256-bit load/store per block
Remaining gap
Phase 4 is raw-indices only — fused-lookup for bw=22..32 falls through to the scalar const-generic lookup. At 4M..4G dict entries the gather is memory-bound, so the lookup-path SIMD savings are negligible relative to cache behaviour. Captured as a below-the-line follow-up in docs/plans/CURRENT.md; wire if a real profile shows it matters.
Testing
735 tests pass on aarch64 (M-series). Linux CI exercises the AVX2 path. Each new kernel has bit-exact pack-helper oracles covering known patterns, partial-tail sizes, random inputs (with max-value-boundary checks at bw=32), and dispatcher routing.
Crates
All five crates published to crates.io at 0.13.0:
ematix-parquet-formatematix-parquet-ioematix-parquet-cryptoematix-parquet-codecematix-parquet-async
No breaking API changes vs v0.12.0.
v0.12.0 — buffer-reuse + small-bw SIMD coverage
Three perf wins on top of v0.11.0, addressing the largest remaining hot-spots from profiling a 22-query TPC-H bench.
What's new
-
ParquetFile::read_range_into(&mut Vec<u8>, offset, length)— buffer reuse on the hot path. Callers pre-allocate one chunk buffer per row group and reuse across column reads, eliminating the per-callVecalloc + zero-fill + drop-timemadvise(MADV_DONTNEED)that dominated 10–15% of CPU on scan-heavy workloads. The existingread_rangedelegates through a freshVec, so no breaking change. -
SIMD coverage for bw=2 / bw=3 raw-indices and bw=4 / bw=6 / bw=8 fused-lookup on both NEON and AVX2. Closes the small-bit-width gap the const-generic scalar path was filling at ~7–9 GB/s. bw=2 uses 4 parallel shift+mask streams interleaved via
vzip/_mm_unpacklo_epi8 + _mm_unpacklo/hi_epi16; bw=3 uses per-lane TBL gather + variable right-shift; the fused-lookup kernels follow the bw=12/14 staging-callback pattern with a bounds-safe fast path whendict_sizeproves every unpacked index fits. -
decode_rle_dictionary_indices_into— append-only, zero-alloc variant of the existingVec-returning function. Threaded throughread_column_byte_array_dict_preserved_intoso a ~50-data-page TPC-H lineitem chunk drops from 50 transientVec<u32>allocations to zero.
Also adds crates/ematix-parquet-io/examples/rg_count.rs — a small diagnostic that prints per-file rows + row-group counts.
No breaking API changes.
v0.11.0 — full SIMD specialisation parity (NEON + AVX2)
v0.11.0 closes the SIMD specialisation table on both architectures. Every production bit width that the scalar fallback was serving at ~7-9 GB/s now has a hand-tuned SIMD kernel on both AArch64 NEON and x86_64 AVX2.
Coverage delta vs v0.10.0
| Width | NEON v0.10 | NEON v0.11 | AVX2 v0.10 | AVX2 v0.11 |
|---|---|---|---|---|
| 1 | scalar | ✓ added | scalar | ✓ added |
| 4 | ✓ shipped | ✓ | scalar | ✓ added |
| 5 | scalar | ✓ added | scalar | ✓ added |
| 8 | ✓ shipped | ✓ | scalar | ✓ added |
| 12-18 | ✓ | ✓ | ✓ | ✓ |
| 20 | scalar | ✓ added | scalar | ✓ added |
| 21 | scalar | ✓ added | scalar | ✓ added |
All new kernels in #64; the release PR is #65.
Per-width strategy
- bw=1 — broadcast each source byte to 8 lanes, AND with per-lane bit-mask, compare-eq → 0/1 outputs
- bw=4 — nibble extract (AND 0x0F + shift-right-4), interleave per parquet LSB-first packing, widen
- bw=5 — extract one u16 per lane via shuffle, variable-shift, mask, widen
- bw=8 — trivial byte-aligned widen (
vmovlchain /_mm256_cvtepu8_epi32) - bw=20 — mirrors bw=17/18: two 16-byte loads (offsets 0, 10), alternating shifts, mask 0x0F_FFFF
- bw=21 — like bw=20 but lo/hi halves use different shuffles + every lane has a distinct shift
Widths 12, 14, 15, 16, 17, 18 already had full NEON + AVX2 coverage since the Π.12 cycle. Widths still on the scalar const-generic path (bw=2, 3, 6, 7, 9, 10, 11, 13, 19, 22..32) get SIMD coverage on-demand — v0.12.0 added bw=2 and bw=3.
Full test suite (680 tests) green; CI ubuntu-latest exercises the AVX2 path.
v0.10.0 — write-side polish + opportunistic deferred items
v0.10.0 bundles every item deferred under v0.9.x "Scope notes" plus the below-the-phase-line catch-all list, plus an I/O unblock that lets parallel decode scale honestly.
Write-side completeness
- Multi-column / multi-RG bloom writes (#54) — per-(RG, col) SBBFs in
write_table_with_blooms_to_path - Bloom on PLAIN (non-dict) write paths (#56) —
write_{i32,i64,f64,byte_array}_column_with_bloom_to_pathfamily - Multi-column dict writes (#57) — per-column opt-in via
dict_per_column: &[bool] - Per-column codec +
WriteOptions(#59) — newwrite_table_with_options_to_pathbundling row_group_size, page_version, default_codec, codec_per_column, dict_per_column, bloom_fpps. Different columns in the same row group can use different codecs
Decode coverage
- DELTA_BINARY_PACKED u64-output unpacker (#58) —
unpack_indices64_intocovers bit_widths 1..=64;decode_delta_i64no longer errors on streams with bit_width > 32 - BYTE_ARRAY batched/streaming decode API (#60) —
read_column_byte_array_batchesmirrors the scalar batched API forVec<Vec<u8>>
Perf
- pread-based unlocked I/O (#55) — no more
Mutex<File>on parallel workers; ~28% sequential decode speedup as a side effect - NEON
pldL1 prefetch hints in dict gather (#61) — behaviour-identical, perf insensitive to dict footprint up to 1 MB - NEON unpackers for bw=4 + bw=8 (#62)
Full test suite (663 tests) green at the new version.
v0.9.2 — bloom-filter writer end-to-end (i32/i64/f64/byte_array)
Highlights
Closes the bloom-filter story. v0.9.1 shipped the in-memory SplitBlockBloomFilterBuilder; v0.9.2 wires it through the codec write path so emitted Parquet files carry consultable bloom filters that downstream readers — including the upstream Rust Parquet reader — discover via ColumnMetaData and apply automatically.
Format crate
metadata_writer::encode_column_meta_datanow writesbloom_filter_offset(field 14, i64) andbloom_filter_length(field 15, i32) when set. Previously both panicked.
Codec writer entry points
| Function | Hash input |
|---|---|
write::write_i32_column_dict_with_bloom_to_path |
4-byte LE |
write::write_i64_column_dict_with_bloom_to_path |
8-byte LE |
write::write_f64_column_dict_with_bloom_to_path |
8-byte LE (raw bit pattern, per spec) |
write::write_byte_array_column_dict_with_bloom_to_path |
raw bytes (no length prefix, per spec) |
All take (path, name, values, codec, target_fpp); build an SBBF over the column's distinct values (sized via optimal_num_blocks), emit it inline with the column chunk.
Interop
5 round-trip tests, 4 of them parquet-rs cross-checks. Each writes a column under our writer, opens it with parquet-rs's SerializedFileReader::new_with_options + ReaderProperties::set_read_bloom_filter(true), loads the filter via RowGroupReader::get_column_bloom_filter, and confirms every distinct value reports present via bf.check(&v).
Constraints / deferred follow-ups
- Single-column primitive only. Multi-column / multi-row-group bloom writes are follow-ups.
- Dict-write paths only. Bloom on PLAIN write paths is a separate item.
Crates published
ematix-parquet-format0.9.2ematix-parquet-io0.9.2ematix-parquet-crypto0.9.2ematix-parquet-codec0.9.2ematix-parquet-async0.9.2
🤖 Generated with Claude Code
v0.9.1 — u8 dict + BYTE_ARRAY adaptive + bloom builder
Highlights
Three additive opportunistic features bundled as a patch on top of v0.9.0. No API breaks, no behaviour changes for existing callers.
u8 dict-indices reader for bw ≤ 8 columns
read::DictPreservedColumnU8 { dict_bytes, dict_offsets, indices: Vec<u8> }read::read_column_byte_array_dict_preserved_u8+..._u8_intodict::decode_rle_dictionary_indices_u8+..._u8_into
Saves 3 bytes/row on dict-encoded columns with ≤ 256 unique values (most TPC-H string columns: l_returnflag, l_linestatus, status enums, etc.). Unlocks Arrow DictionaryArray<UInt8, T> materialisation in ematix-flow with a 4× smaller indices buffer. Errors deterministically on (a) dict > 256 entries, (b) any data page with bw > 8, (c) no dictionary page — caller falls back to the u32 variant.
BYTE_ARRAY adaptive façade (closes the v0.8 gap)
read::read_column_byte_array_predicate_adaptive(file, rg, col, predicate, opts, telemetry)read::AdaptiveByteArrayChunkOutput+read::AdaptiveByteArrayOutputKind { Bitmap, Values { bytes, offsets } }
Same dispatch contract as the scalar adaptive entry points (i32/i64/f64) introduced in v0.8.0. Materialized output is Arrow-style (bytes, offsets) so consumers get the same shape as read_column_byte_array_offsets. Predicate is Fn(&[u8]) -> bool evaluated against dict entries (≤ dict.len() calls per chunk).
Split-Block Bloom Filter builder
bloom::SplitBlockBloomFilterBuilderwithinsert_hash/insert_bytes/into_bytesbloom::optimal_num_blocks(n, fpp) -> u32(rounds to next power of two)
Symmetric to the Π.6c decoder. Round-trips byte-stable through SplitBlockBloomFilter::from_bytes. Full writer-integration (emitting filters into a parquet file's body + setting ColumnMetaData.bloom_filter_offset) is a deferred follow-up that needs format-crate metadata-writer work.
Crates published
ematix-parquet-format0.9.1ematix-parquet-io0.9.1ematix-parquet-crypto0.9.1ematix-parquet-codec0.9.1ematix-parquet-async0.9.1
🤖 Generated with Claude Code
v0.9.0 — Π.15 (parallel multi-RG decode + NUMA awareness)
Highlights
Ships parallel multi-row-group decode end-to-end. Multi-RG files are embarrassingly parallel (each RG is independent); v0.9 lifts the threading + NUMA awareness into the codec so consumers don't have to roll it.
New surface
- New
parallelfeature onematix-parquet-codec(rayon optional; default builds stay rayon-free). parallel::read_columns_parallel(file, &targets, opts, decode_one)— decodes a slice of(row_group, column)targets concurrently. Generic over caller closure so the same primitive handles homogeneous + heterogeneous workloads. Output preserves input order.CancellationToken(AtomicBool,Arc-cloneable). Cooperative semantics — checked at target boundaries; cancelled targets surface newCodecError::Cancelled; in-flight decodes complete.ParallelDecodeOptions { pool, cancel }— optional caller-ownedArc<rayon::ThreadPool>+ cancellation handle.
Linux-only NUMA (parallel::numa under cfg(target_os = "linux"))
NumaTopology::detect()— via/sys/devices/system/node/node*/cpulist.pin_current_thread_to_node(&topology, node)— viasched_setaffinity.build_numa_pinned_pool(num_threads)— rayon pool with workers pinned round-robin to NUMA nodes.current_node()— viagetcpu(2)syscall.alloc_local_buffer(size)— 4 KiB-stride first-touch so the buffer lands on the calling thread's node. Combined with worker pinning, chunk bytes land on the right node — nolibnumaC dep needed.
Bench harness
examples/bench_parallel_scaling.rs — synthetic 50-RG Snappy-compressed i64 file, sweeps thread counts 1, 2, 4, 8, … capped at the host CPU count; reports speedup + efficiency vs sequential. On Linux also exercises the NUMA-pinned pool. Ready to drop into a multi-socket AWS box for the plan acceptance numbers.
Constraints
- NUMA module is
cfg(target_os = "linux")— portable callers stay onread_columns_parallel; NUMA-aware callers cfg-gate their own usage. - Multi-socket scaling validation (plan acceptance #1) is deferred to AWS infra. Local single-NUMA-node host hits a
ParquetFile.file: Mutex<File>serialization bottleneck at ~1.8× peak; the bench docstring documents it. Switching topread-based unlocked I/O is a separate optimisation. - Cancellation is at target boundaries only — not within a single (rg, col) decode.
Crates published
ematix-parquet-format0.9.0ematix-parquet-io0.9.0ematix-parquet-crypto0.9.0ematix-parquet-codec0.9.0ematix-parquet-async0.9.0
🤖 Generated with Claude Code
v0.8.0 — Π.14 (adaptive predicate-dispatch)
Highlights
Ships adaptive predicate-dispatch — the codec now probes the first N pages of a chunk with the fused kernel, measures selectivity, and decides per-chunk whether to emit a bitmap (fused, wins at low selectivity) or a values vector (materialised, wins at high selectivity).
New surface
ematix_parquet_codec::adaptivemodule —Dispatch,PageProbe,AdaptiveDictPredicate,AdaptiveDispatchOptions,AdaptiveChunkOutput<T>/AdaptiveOutputKind<T>,SelectivityProbe,popcount_bitmap_prefix,probe_page_fused,run_adaptive_dict_chunk<T: Copy>.ematix_parquet_codec::read::read_column_{i32,i64,f64}_predicate_adaptive— per-type façade entry points. Predicate is evaluated against dict entries (≤dict.len()invocations per chunk); returnsAdaptiveChunkOutput<T>.- Optional
Fn(SelectivityProbe)telemetry callback exposes the per-chunk dispatch decision.
Bench-derived threshold
DEFAULT_THRESHOLD = 0.10 from a 7-point sweep on TPC-H SF1 lineitem l_shipdate (aarch64, median of 51 release-mode iterations). The crossover where materialised first beats fused+gather for a values-consuming caller lands right at 10% selectivity. Full table baked into the AdaptiveDictPredicate::DEFAULT_THRESHOLD docstring as in-code reference for future retuning.
Constraints
- Adaptive entry points are dict-only. Chunks with PLAIN-encoded data pages return
InvalidInput; callers should fall back toread_column_*_masked_intofor those. - BYTE_ARRAY not covered (T: Copy doesn't hold for
Vec<u8>) — follow-up if requested. - Bitmap-consuming callers (filter chains, COUNT aggregator) should stay on the static
decode_rle_dictionary_predicate_bitmapentry point — fused always wins for that output shape, and the adaptive runner would add per-chunk dispatch overhead for no gain.
Test coverage
20 tests across the Π.14 surface: 4 unit + 4 happy-path oracle + 7 extended acceptance (widths bw ∈ {14, 16, 18}, probe-pages edges, custom threshold override, mid-chunk selectivity shift) + 5 façade integration tests via the codec's own dict writer.
Crates published
ematix-parquet-format0.8.0ematix-parquet-io0.8.0ematix-parquet-crypto0.8.0ematix-parquet-codec0.8.0ematix-parquet-async0.8.0
🤖 Generated with Claude Code
v0.7.0 — dict-preserving BYTE_ARRAY column reader
Highlights
- New API:
read_column_byte_array_dict_preserved(+_intovariant) returning(dict_bytes, dict_offsets, indices)from a single chunk-decode pass — no per-row materialisation.
Enables Arrow consumers to assemble DictionaryArray<UInt32, Utf8|Binary> directly so downstream operators (filter / group-by / join) can stay on dict codes rather than paying gather + hash at every operator boundary.
Behaviour
- Errors deterministically when the chunk has no
DictionaryPageor when any data page falls back toPLAINafter a dict (writer fell back — chunk can't be expressed as one dict + indices). Callers can react by falling back toread_column_byte_array_offsets. - Validates every index
< dict_lenonce in the cold path so downstream gather can stay branch-free.
Why
Pairs with parquet-rs's ArrowReaderOptions::with_schema(Dictionary(...)) path on consumers like ematix-flow's FastParquet, giving the Emat reader dict-preservation parity. Unblocks Σ.E3b downstream operators (DictGroupCountExec, DictFilterExec) to receive Arrow-surface dict columns from either reader path.
Test coverage
- New oracle: dict + Uncompressed roundtrip, dict + Snappy roundtrip, PLAIN-only column → error.
- Full
cargo testworkspace passes; no regressions on existing oracles.