Skip to content

Releases: ryan-evans-git/ematix-parquet

v0.14.0 — compression hygiene + cross-column page-skip

22 May 17:57
b2eca61

Choose a tag to compare

Three changes land on top of v0.13.0, all in the codec / read façade. No SIMD or write-path changes.

Correctness: LZ4_RAW decompression

decompress_lz4_raw_into previously called lz4_flex::block::decompress with compressed.len() * 255 as the worst-case output cap, then re-copied the result into the caller's buffer. Per parquet page that was a ~190 MB scratch allocation plus a full copy. On a Q06-shape SF=10 lineitem re-encoded with LZ4_RAW we measured a 45× regression vs Snappy (3109 ms vs 70 ms) entirely from the allocation pattern.

Fix: decompress_lz4_raw_into_sized(body, uncompressed_size, out) pre-sizes out from the page header's declared length and decodes in place via lz4_flex::block::decompress_into. No scratch, no copy. The size-less variant now walks the LZ4 block tags once to compute the output length before allocating.

Bench: Q06 SF=10 LZ4_RAW closed from 3109 ms → 57.88 ms (now beats Polars 62 ms by 6.7%, Snappy by 18%).

Hardening: DoS caps on ZSTD / Gzip / Brotli

Previously these called read_to_end(out) on a Read adapter with no upper bound. A malicious 1 KB compressed page could expand to many GB and OOM a worker.

Three new variants (decompress_zstd_into_capped, decompress_gzip_into_capped, decompress_brotli_into_capped) take the declared uncompressed size, pre-reserve the buffer, and error out if the decompressed stream exceeds declared + 64 bytes.

The sync read-façade decompress_into dispatch now plumbs uncompressed_page_size through every call site (~33 in the codec, 8 in the async façade). LZ4_RAW becomes the sized variant; ZSTD / Gzip / Brotli become the capped variants. Snappy keeps its own embedded-length path.

Perf: cross-column page-skip (sync read.rs)

decode_chunk_row_masked_into, read_column_byte_array_masked_into, and the dict-preserved offsets variant previously decompressed every data page before checking the row mask's popcount for that page's range. Move the popcount check (via a new data_page_num_values(hdr) helper that reads num_values from the page header — no decompress needed) before data_page_view. Pages with zero mask bits in their row range skip the codec decompression entirely.

Wall-time at Q06 SF=10 selectivity is unchanged (every page has at least some matches at ~1% × ~32K rows/page), but per-thread CPU time for extprice masked decode drops 35%. The lever fires harder on sparser-filter workloads and partitioned data.

Dead-code removal

decompress_snappy_fast_into (~210 LOC of hand-rolled unsafe with a latent copy_back_ref bug, never dispatched in production, and a 4-5 GB/s microbench claim that didn't hold on real Q06 data — was 17% slower than the snap crate) retires. The companion test file and bench_snappy example are deleted.

Tests

Codec unit tests 8 → 27. New cases for Brotli / Gzip / LZ4_RAW: round-trip, empty input, garbage input, capped-rejects-oversize. Full workspace 719 tests green on macOS-14 + Ubuntu CI. Clippy + rustfmt clean.

🤖 Generated with Claude Code

v0.13.0 — full SIMD specialisation table (NEON + AVX2, bw=1..=32)

20 May 13:12
9111d6b

Choose a tag to compare

v0.13.0 closes the SIMD specialisation coverage table on both architectures. After this release the const-generic scalar fallback is reserved for bit_width == 0 and the partial-tail path after each block — every full block runs through a hand-tuned intrinsic kernel on both AArch64 NEON and x86_64 AVX2.

Coverage

Path NEON AVX2
Raw-indices 1..=32 1..=32
Fused-lookup 1..=21 1..=21

What landed

Four PRs on top of v0.12.0 added 48 new SIMD kernels:

  • #69 (Phase 1) — fused-lookup parity for bw=1, 2, 3, 5, 20, 21 (12 kernels). The lookup path is the hot one for dict-encoded scans; these widths already had raw-indices SIMD but were falling through to the scalar const-generic lookup. Each width has a per-arch staging helper (mirroring the existing bw=12 callback pattern) and a wrapper with bounds-safe + bounds-checked paths.
  • #70 (Phase 2) — raw-indices SIMD for bw=6, 7 (4 kernels). Targets dicts of 33–128 distinct values (status enums, country codes, low-cardinality byte-array columns).
  • #71 (Phase 3) — raw-indices SIMD for bw=9, 10, 11, 13, 19 (10 kernels). Medium-dict workloads (256–512K distinct values). bw=19 uses two 16-byte loads because lane 7 starts at byte 16.
  • #72 (Phase 4) — raw-indices SIMD for bw=22..32 (22 kernels). Macro-templated u32/u64 staging:
    • bw=22..25: u32 staging (max value + max bit_off fits in u32)
    • bw=26..31: u64 staging (33-bit value-spans overflow u32)
    • bw=32: byte-aligned trivial copy — one 256-bit load/store per block

Remaining gap

Phase 4 is raw-indices only — fused-lookup for bw=22..32 falls through to the scalar const-generic lookup. At 4M..4G dict entries the gather is memory-bound, so the lookup-path SIMD savings are negligible relative to cache behaviour. Captured as a below-the-line follow-up in docs/plans/CURRENT.md; wire if a real profile shows it matters.

Testing

735 tests pass on aarch64 (M-series). Linux CI exercises the AVX2 path. Each new kernel has bit-exact pack-helper oracles covering known patterns, partial-tail sizes, random inputs (with max-value-boundary checks at bw=32), and dispatcher routing.

Crates

All five crates published to crates.io at 0.13.0:

  • ematix-parquet-format
  • ematix-parquet-io
  • ematix-parquet-crypto
  • ematix-parquet-codec
  • ematix-parquet-async

No breaking API changes vs v0.12.0.

v0.12.0 — buffer-reuse + small-bw SIMD coverage

20 May 00:54
5c00d0f

Choose a tag to compare

Three perf wins on top of v0.11.0, addressing the largest remaining hot-spots from profiling a 22-query TPC-H bench.

What's new

  • ParquetFile::read_range_into(&mut Vec<u8>, offset, length) — buffer reuse on the hot path. Callers pre-allocate one chunk buffer per row group and reuse across column reads, eliminating the per-call Vec alloc + zero-fill + drop-time madvise(MADV_DONTNEED) that dominated 10–15% of CPU on scan-heavy workloads. The existing read_range delegates through a fresh Vec, so no breaking change.

  • SIMD coverage for bw=2 / bw=3 raw-indices and bw=4 / bw=6 / bw=8 fused-lookup on both NEON and AVX2. Closes the small-bit-width gap the const-generic scalar path was filling at ~7–9 GB/s. bw=2 uses 4 parallel shift+mask streams interleaved via vzip / _mm_unpacklo_epi8 + _mm_unpacklo/hi_epi16; bw=3 uses per-lane TBL gather + variable right-shift; the fused-lookup kernels follow the bw=12/14 staging-callback pattern with a bounds-safe fast path when dict_size proves every unpacked index fits.

  • decode_rle_dictionary_indices_into — append-only, zero-alloc variant of the existing Vec-returning function. Threaded through read_column_byte_array_dict_preserved_into so a ~50-data-page TPC-H lineitem chunk drops from 50 transient Vec<u32> allocations to zero.

Also adds crates/ematix-parquet-io/examples/rg_count.rs — a small diagnostic that prints per-file rows + row-group counts.

No breaking API changes.

v0.11.0 — full SIMD specialisation parity (NEON + AVX2)

20 May 00:53
d4731b9

Choose a tag to compare

v0.11.0 closes the SIMD specialisation table on both architectures. Every production bit width that the scalar fallback was serving at ~7-9 GB/s now has a hand-tuned SIMD kernel on both AArch64 NEON and x86_64 AVX2.

Coverage delta vs v0.10.0

Width NEON v0.10 NEON v0.11 AVX2 v0.10 AVX2 v0.11
1 scalar ✓ added scalar ✓ added
4 ✓ shipped scalar ✓ added
5 scalar ✓ added scalar ✓ added
8 ✓ shipped scalar ✓ added
12-18
20 scalar ✓ added scalar ✓ added
21 scalar ✓ added scalar ✓ added

All new kernels in #64; the release PR is #65.

Per-width strategy

  • bw=1 — broadcast each source byte to 8 lanes, AND with per-lane bit-mask, compare-eq → 0/1 outputs
  • bw=4 — nibble extract (AND 0x0F + shift-right-4), interleave per parquet LSB-first packing, widen
  • bw=5 — extract one u16 per lane via shuffle, variable-shift, mask, widen
  • bw=8 — trivial byte-aligned widen (vmovl chain / _mm256_cvtepu8_epi32)
  • bw=20 — mirrors bw=17/18: two 16-byte loads (offsets 0, 10), alternating shifts, mask 0x0F_FFFF
  • bw=21 — like bw=20 but lo/hi halves use different shuffles + every lane has a distinct shift

Widths 12, 14, 15, 16, 17, 18 already had full NEON + AVX2 coverage since the Π.12 cycle. Widths still on the scalar const-generic path (bw=2, 3, 6, 7, 9, 10, 11, 13, 19, 22..32) get SIMD coverage on-demand — v0.12.0 added bw=2 and bw=3.

Full test suite (680 tests) green; CI ubuntu-latest exercises the AVX2 path.

v0.10.0 — write-side polish + opportunistic deferred items

20 May 00:53
30dfa31

Choose a tag to compare

v0.10.0 bundles every item deferred under v0.9.x "Scope notes" plus the below-the-phase-line catch-all list, plus an I/O unblock that lets parallel decode scale honestly.

Write-side completeness

  • Multi-column / multi-RG bloom writes (#54) — per-(RG, col) SBBFs in write_table_with_blooms_to_path
  • Bloom on PLAIN (non-dict) write paths (#56) — write_{i32,i64,f64,byte_array}_column_with_bloom_to_path family
  • Multi-column dict writes (#57) — per-column opt-in via dict_per_column: &[bool]
  • Per-column codec + WriteOptions (#59) — new write_table_with_options_to_path bundling row_group_size, page_version, default_codec, codec_per_column, dict_per_column, bloom_fpps. Different columns in the same row group can use different codecs

Decode coverage

  • DELTA_BINARY_PACKED u64-output unpacker (#58) — unpack_indices64_into covers bit_widths 1..=64; decode_delta_i64 no longer errors on streams with bit_width > 32
  • BYTE_ARRAY batched/streaming decode API (#60) — read_column_byte_array_batches mirrors the scalar batched API for Vec<Vec<u8>>

Perf

  • pread-based unlocked I/O (#55) — no more Mutex<File> on parallel workers; ~28% sequential decode speedup as a side effect
  • NEON pld L1 prefetch hints in dict gather (#61) — behaviour-identical, perf insensitive to dict footprint up to 1 MB
  • NEON unpackers for bw=4 + bw=8 (#62)

Full test suite (663 tests) green at the new version.

v0.9.2 — bloom-filter writer end-to-end (i32/i64/f64/byte_array)

17 May 14:03
2792b8f

Choose a tag to compare

Highlights

Closes the bloom-filter story. v0.9.1 shipped the in-memory SplitBlockBloomFilterBuilder; v0.9.2 wires it through the codec write path so emitted Parquet files carry consultable bloom filters that downstream readers — including the upstream Rust Parquet reader — discover via ColumnMetaData and apply automatically.

Format crate

  • metadata_writer::encode_column_meta_data now writes bloom_filter_offset (field 14, i64) and bloom_filter_length (field 15, i32) when set. Previously both panicked.

Codec writer entry points

Function Hash input
write::write_i32_column_dict_with_bloom_to_path 4-byte LE
write::write_i64_column_dict_with_bloom_to_path 8-byte LE
write::write_f64_column_dict_with_bloom_to_path 8-byte LE (raw bit pattern, per spec)
write::write_byte_array_column_dict_with_bloom_to_path raw bytes (no length prefix, per spec)

All take (path, name, values, codec, target_fpp); build an SBBF over the column's distinct values (sized via optimal_num_blocks), emit it inline with the column chunk.

Interop

5 round-trip tests, 4 of them parquet-rs cross-checks. Each writes a column under our writer, opens it with parquet-rs's SerializedFileReader::new_with_options + ReaderProperties::set_read_bloom_filter(true), loads the filter via RowGroupReader::get_column_bloom_filter, and confirms every distinct value reports present via bf.check(&v).

Constraints / deferred follow-ups

  • Single-column primitive only. Multi-column / multi-row-group bloom writes are follow-ups.
  • Dict-write paths only. Bloom on PLAIN write paths is a separate item.

Crates published

  • ematix-parquet-format 0.9.2
  • ematix-parquet-io 0.9.2
  • ematix-parquet-crypto 0.9.2
  • ematix-parquet-codec 0.9.2
  • ematix-parquet-async 0.9.2

🤖 Generated with Claude Code

v0.9.1 — u8 dict + BYTE_ARRAY adaptive + bloom builder

17 May 13:06
c6dafca

Choose a tag to compare

Highlights

Three additive opportunistic features bundled as a patch on top of v0.9.0. No API breaks, no behaviour changes for existing callers.

u8 dict-indices reader for bw ≤ 8 columns

  • read::DictPreservedColumnU8 { dict_bytes, dict_offsets, indices: Vec<u8> }
  • read::read_column_byte_array_dict_preserved_u8 + ..._u8_into
  • dict::decode_rle_dictionary_indices_u8 + ..._u8_into

Saves 3 bytes/row on dict-encoded columns with ≤ 256 unique values (most TPC-H string columns: l_returnflag, l_linestatus, status enums, etc.). Unlocks Arrow DictionaryArray<UInt8, T> materialisation in ematix-flow with a 4× smaller indices buffer. Errors deterministically on (a) dict > 256 entries, (b) any data page with bw > 8, (c) no dictionary page — caller falls back to the u32 variant.

BYTE_ARRAY adaptive façade (closes the v0.8 gap)

  • read::read_column_byte_array_predicate_adaptive(file, rg, col, predicate, opts, telemetry)
  • read::AdaptiveByteArrayChunkOutput + read::AdaptiveByteArrayOutputKind { Bitmap, Values { bytes, offsets } }

Same dispatch contract as the scalar adaptive entry points (i32/i64/f64) introduced in v0.8.0. Materialized output is Arrow-style (bytes, offsets) so consumers get the same shape as read_column_byte_array_offsets. Predicate is Fn(&[u8]) -> bool evaluated against dict entries (≤ dict.len() calls per chunk).

Split-Block Bloom Filter builder

  • bloom::SplitBlockBloomFilterBuilder with insert_hash / insert_bytes / into_bytes
  • bloom::optimal_num_blocks(n, fpp) -> u32 (rounds to next power of two)

Symmetric to the Π.6c decoder. Round-trips byte-stable through SplitBlockBloomFilter::from_bytes. Full writer-integration (emitting filters into a parquet file's body + setting ColumnMetaData.bloom_filter_offset) is a deferred follow-up that needs format-crate metadata-writer work.

Crates published

  • ematix-parquet-format 0.9.1
  • ematix-parquet-io 0.9.1
  • ematix-parquet-crypto 0.9.1
  • ematix-parquet-codec 0.9.1
  • ematix-parquet-async 0.9.1

🤖 Generated with Claude Code

v0.9.0 — Π.15 (parallel multi-RG decode + NUMA awareness)

17 May 11:44
4c20f9f

Choose a tag to compare

Highlights

Ships parallel multi-row-group decode end-to-end. Multi-RG files are embarrassingly parallel (each RG is independent); v0.9 lifts the threading + NUMA awareness into the codec so consumers don't have to roll it.

New surface

  • New parallel feature on ematix-parquet-codec (rayon optional; default builds stay rayon-free).
  • parallel::read_columns_parallel(file, &targets, opts, decode_one) — decodes a slice of (row_group, column) targets concurrently. Generic over caller closure so the same primitive handles homogeneous + heterogeneous workloads. Output preserves input order.
  • CancellationToken (AtomicBool, Arc-cloneable). Cooperative semantics — checked at target boundaries; cancelled targets surface new CodecError::Cancelled; in-flight decodes complete.
  • ParallelDecodeOptions { pool, cancel } — optional caller-owned Arc<rayon::ThreadPool> + cancellation handle.

Linux-only NUMA (parallel::numa under cfg(target_os = "linux"))

  • NumaTopology::detect() — via /sys/devices/system/node/node*/cpulist.
  • pin_current_thread_to_node(&topology, node) — via sched_setaffinity.
  • build_numa_pinned_pool(num_threads) — rayon pool with workers pinned round-robin to NUMA nodes.
  • current_node() — via getcpu(2) syscall.
  • alloc_local_buffer(size) — 4 KiB-stride first-touch so the buffer lands on the calling thread's node. Combined with worker pinning, chunk bytes land on the right node — no libnuma C dep needed.

Bench harness

examples/bench_parallel_scaling.rs — synthetic 50-RG Snappy-compressed i64 file, sweeps thread counts 1, 2, 4, 8, … capped at the host CPU count; reports speedup + efficiency vs sequential. On Linux also exercises the NUMA-pinned pool. Ready to drop into a multi-socket AWS box for the plan acceptance numbers.

Constraints

  • NUMA module is cfg(target_os = "linux") — portable callers stay on read_columns_parallel; NUMA-aware callers cfg-gate their own usage.
  • Multi-socket scaling validation (plan acceptance #1) is deferred to AWS infra. Local single-NUMA-node host hits a ParquetFile.file: Mutex<File> serialization bottleneck at ~1.8× peak; the bench docstring documents it. Switching to pread-based unlocked I/O is a separate optimisation.
  • Cancellation is at target boundaries only — not within a single (rg, col) decode.

Crates published

  • ematix-parquet-format 0.9.0
  • ematix-parquet-io 0.9.0
  • ematix-parquet-crypto 0.9.0
  • ematix-parquet-codec 0.9.0
  • ematix-parquet-async 0.9.0

🤖 Generated with Claude Code

v0.8.0 — Π.14 (adaptive predicate-dispatch)

16 May 18:59
bb2dac3

Choose a tag to compare

Highlights

Ships adaptive predicate-dispatch — the codec now probes the first N pages of a chunk with the fused kernel, measures selectivity, and decides per-chunk whether to emit a bitmap (fused, wins at low selectivity) or a values vector (materialised, wins at high selectivity).

New surface

  • ematix_parquet_codec::adaptive module — Dispatch, PageProbe, AdaptiveDictPredicate, AdaptiveDispatchOptions, AdaptiveChunkOutput<T> / AdaptiveOutputKind<T>, SelectivityProbe, popcount_bitmap_prefix, probe_page_fused, run_adaptive_dict_chunk<T: Copy>.
  • ematix_parquet_codec::read::read_column_{i32,i64,f64}_predicate_adaptive — per-type façade entry points. Predicate is evaluated against dict entries (≤ dict.len() invocations per chunk); returns AdaptiveChunkOutput<T>.
  • Optional Fn(SelectivityProbe) telemetry callback exposes the per-chunk dispatch decision.

Bench-derived threshold

DEFAULT_THRESHOLD = 0.10 from a 7-point sweep on TPC-H SF1 lineitem l_shipdate (aarch64, median of 51 release-mode iterations). The crossover where materialised first beats fused+gather for a values-consuming caller lands right at 10% selectivity. Full table baked into the AdaptiveDictPredicate::DEFAULT_THRESHOLD docstring as in-code reference for future retuning.

Constraints

  • Adaptive entry points are dict-only. Chunks with PLAIN-encoded data pages return InvalidInput; callers should fall back to read_column_*_masked_into for those.
  • BYTE_ARRAY not covered (T: Copy doesn't hold for Vec<u8>) — follow-up if requested.
  • Bitmap-consuming callers (filter chains, COUNT aggregator) should stay on the static decode_rle_dictionary_predicate_bitmap entry point — fused always wins for that output shape, and the adaptive runner would add per-chunk dispatch overhead for no gain.

Test coverage

20 tests across the Π.14 surface: 4 unit + 4 happy-path oracle + 7 extended acceptance (widths bw ∈ {14, 16, 18}, probe-pages edges, custom threshold override, mid-chunk selectivity shift) + 5 façade integration tests via the codec's own dict writer.

Crates published

  • ematix-parquet-format 0.8.0
  • ematix-parquet-io 0.8.0
  • ematix-parquet-crypto 0.8.0
  • ematix-parquet-codec 0.8.0
  • ematix-parquet-async 0.8.0

🤖 Generated with Claude Code

v0.7.0 — dict-preserving BYTE_ARRAY column reader

16 May 16:57
8801a7d

Choose a tag to compare

Highlights

  • New API: read_column_byte_array_dict_preserved (+ _into variant) returning (dict_bytes, dict_offsets, indices) from a single chunk-decode pass — no per-row materialisation.

Enables Arrow consumers to assemble DictionaryArray<UInt32, Utf8|Binary> directly so downstream operators (filter / group-by / join) can stay on dict codes rather than paying gather + hash at every operator boundary.

Behaviour

  • Errors deterministically when the chunk has no DictionaryPage or when any data page falls back to PLAIN after a dict (writer fell back — chunk can't be expressed as one dict + indices). Callers can react by falling back to read_column_byte_array_offsets.
  • Validates every index < dict_len once in the cold path so downstream gather can stay branch-free.

Why

Pairs with parquet-rs's ArrowReaderOptions::with_schema(Dictionary(...)) path on consumers like ematix-flow's FastParquet, giving the Emat reader dict-preservation parity. Unblocks Σ.E3b downstream operators (DictGroupCountExec, DictFilterExec) to receive Arrow-surface dict columns from either reader path.

Test coverage

  • New oracle: dict + Uncompressed roundtrip, dict + Snappy roundtrip, PLAIN-only column → error.
  • Full cargo test workspace passes; no regressions on existing oracles.