v0.9.0 — Π.15 (parallel multi-RG decode + NUMA awareness)
Highlights
Ships parallel multi-row-group (multi-RG) decode end-to-end. Multi-RG files are embarrassingly parallel because each row group decodes independently; v0.9 moves the threading and NUMA awareness into the codec so consumers don't have to roll their own.
New surface
- New `parallel` feature on `ematix-parquet-codec` (rayon optional; default builds stay rayon-free).
- `parallel::read_columns_parallel(file, &targets, opts, decode_one)`: decodes a slice of `(row_group, column)` targets concurrently. Generic over the caller's closure, so the same primitive handles homogeneous and heterogeneous workloads. Output preserves input order.
- `CancellationToken` (`AtomicBool`, `Arc`-cloneable). Cooperative semantics: checked at target boundaries; cancelled targets surface the new `CodecError::Cancelled`; in-flight decodes complete.
- `ParallelDecodeOptions { pool, cancel }`: optional caller-owned `Arc<rayon::ThreadPool>` plus cancellation handle.
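The semantics listed above (order-preserving output, cancellation checked only at target boundaries) can be illustrated with a minimal std-only sketch. This is not the crate's implementation: `CancelFlag`, `DecodeError`, and `decode_all` are hypothetical names, and scoped std threads stand in for the rayon pool.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a cancellation token: a shared flag, cheap to clone.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::Release);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::Acquire)
    }
}

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Cancelled,
}

/// Decode every (row_group, column) target on `workers` threads.
/// Slot `i` of the result always belongs to `targets[i]`.
pub fn decode_all<T, F>(
    targets: &[(usize, usize)],
    workers: usize,
    cancel: &CancelFlag,
    decode_one: F,
) -> Vec<Result<T, DecodeError>>
where
    T: Send,
    F: Fn(usize, usize) -> T + Sync,
{
    let next = AtomicUsize::new(0);
    let slots: Vec<Mutex<Option<Result<T, DecodeError>>>> =
        targets.iter().map(|_| Mutex::new(None)).collect();

    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                // Claim the next unprocessed target index.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= targets.len() {
                    break;
                }
                // Cooperative check at the target boundary only:
                // a decode that has already started runs to completion.
                let result = if cancel.is_cancelled() {
                    Err(DecodeError::Cancelled)
                } else {
                    let (rg, col) = targets[i];
                    Ok(decode_one(rg, col))
                };
                *slots[i].lock().unwrap() = Some(result);
            });
        }
    });

    slots
        .into_iter()
        .map(|slot| slot.into_inner().unwrap().expect("every slot is filled"))
        .collect()
}

fn main() {
    let targets = [(0, 0), (0, 1), (1, 0), (1, 1)];
    let cancel = CancelFlag::default();
    // The closure is a placeholder for a real column decode.
    let out = decode_all(&targets, 2, &cancel, |rg, col| rg * 10 + col);
    println!("{:?}", out);
}
```

Writing each result into a slot indexed by input position keeps the output order independent of which worker finishes first.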
Linux-only NUMA (`parallel::numa` under `cfg(target_os = "linux")`)
- `NumaTopology::detect()`: via `/sys/devices/system/node/node*/cpulist`.
- `pin_current_thread_to_node(&topology, node)`: via `sched_setaffinity`.
- `build_numa_pinned_pool(num_threads)`: rayon pool with workers pinned round-robin to NUMA nodes.
- `current_node()`: via the `getcpu(2)` syscall.
- `alloc_local_buffer(size)`: 4 KiB-stride first-touch so the buffer lands on the calling thread's node. Combined with worker pinning, chunk bytes land on the right node; no `libnuma` C dep needed.
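Three of the ideas above lend themselves to a short std-only sketch: parsing a sysfs `cpulist` string, assigning workers to nodes round-robin, and touching one byte per 4 KiB page so the kernel's default first-touch policy places pages on the calling thread's node. The function names below are hypothetical and are not the crate's API; the sketch assumes the plain `a-b,c` range format and a 4 KiB page size, and it leaves out the `sched_setaffinity` pinning step.

```rust
/// Assumed page size for the first-touch stride.
const PAGE: usize = 4096;

/// Parse a Linux cpulist string such as "0-3,8-11" into CPU ids.
/// Returns None on malformed input.
pub fn parse_cpulist(s: &str) -> Option<Vec<usize>> {
    let mut cpus: Vec<usize> = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo = lo.trim().parse::<usize>().ok()?;
                let hi = hi.trim().parse::<usize>().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.trim().parse::<usize>().ok()?),
        }
    }
    Some(cpus)
}

/// Round-robin assignment of a worker index to a NUMA node index.
pub fn node_for_worker(worker: usize, node_count: usize) -> usize {
    worker % node_count.max(1)
}

/// Allocate a zeroed buffer and write one byte per page from the calling thread.
/// Under Linux's default policy, a page is placed on the node of the thread
/// that first writes to it.
pub fn alloc_first_touch(size: usize) -> Vec<u8> {
    let mut buf = vec![0u8; size];
    let ptr = buf.as_mut_ptr();
    for offset in (0..size).step_by(PAGE) {
        // Volatile so the compiler cannot drop the store as redundant.
        // SAFETY: offset < size, so the write stays inside the allocation.
        unsafe { std::ptr::write_volatile(ptr.add(offset), 0) };
    }
    buf
}

fn main() {
    // On Linux, node0's CPU list lives in sysfs; elsewhere the read simply fails.
    match std::fs::read_to_string("/sys/devices/system/node/node0/cpulist") {
        Ok(s) => println!("node0 cpus: {:?}", parse_cpulist(&s)),
        Err(_) => println!("no NUMA sysfs on this host"),
    }
    let buf = alloc_first_touch(1 << 20);
    println!("touched {} bytes", buf.len());
}
```

First-touch only helps if the touching thread is already pinned to the intended node, which is why the release pairs the buffer helper with the pinned pool.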
Bench harness
examples/bench_parallel_scaling.rs — synthetic 50-RG Snappy-compressed i64 file, sweeps thread counts 1, 2, 4, 8, … capped at the host CPU count; reports speedup + efficiency vs sequential. On Linux also exercises the NUMA-pinned pool. Ready to drop into a multi-socket AWS box for the plan acceptance numbers.
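For readers unfamiliar with the reported metrics, the sketch below shows the usual definitions: speedup is sequential time divided by parallel time, and efficiency is speedup divided by thread count. It also shows one way to build a doubling thread sweep. The function names are hypothetical, the sweep assumes only powers of two up to the CPU count are included, and the durations in `main` are placeholders for the arithmetic, not measurements from the bench.

```rust
use std::time::Duration;

/// Thread counts 1, 2, 4, 8, ... that do not exceed `max_threads`.
pub fn thread_sweep(max_threads: usize) -> Vec<usize> {
    let cap = max_threads.max(1);
    let mut counts = Vec::new();
    let mut n: usize = 1;
    while n <= cap {
        counts.push(n);
        match n.checked_mul(2) {
            Some(next) => n = next,
            None => break,
        }
    }
    counts
}

/// Speedup relative to the sequential baseline.
pub fn speedup(sequential: Duration, parallel: Duration) -> f64 {
    sequential.as_secs_f64() / parallel.as_secs_f64()
}

/// Parallel efficiency: 1.0 means perfect linear scaling.
pub fn efficiency(speedup: f64, threads: usize) -> f64 {
    speedup / threads as f64
}

fn main() {
    let cpus = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("sweep: {:?}", thread_sweep(cpus));

    // Placeholder timings, chosen only to exercise the formulas.
    let s = speedup(Duration::from_millis(1000), Duration::from_millis(250));
    println!("speedup {:.2}, efficiency on 8 threads {:.2}", s, efficiency(s, 8));
}
```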
Constraints
- NUMA module is `cfg(target_os = "linux")`: portable callers stay on `read_columns_parallel`; NUMA-aware callers cfg-gate their own usage.
- Multi-socket scaling validation (plan acceptance #1) is deferred to AWS infra. The local single-NUMA-node host hits a `ParquetFile.file: Mutex<File>` serialization bottleneck at ~1.8× peak; the bench docstring documents it. Switching to `pread`-based unlocked I/O is a separate optimisation.
- Cancellation is at target boundaries only, not within a single (rg, col) decode.
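To illustrate why `pread`-style I/O would remove the `Mutex<File>` serialization point, here is a minimal Unix-only sketch using the standard library's `FileExt::read_exact_at`, which reads at an explicit offset without moving the shared file cursor. It shows the general technique only; it is not code from this release, and `read_range` and `read_ranges_parallel` are hypothetical helper names.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::os::unix::fs::FileExt; // positional reads (pread) on Unix
use std::thread;

/// Read `len` bytes at `offset` without touching the file cursor,
/// so concurrent callers need no lock around the File.
pub fn read_range(file: &File, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}

/// Read several (offset, len) ranges concurrently from one shared &File.
/// Results are returned in the same order as `ranges`.
pub fn read_ranges_parallel(
    file: &File,
    ranges: &[(u64, usize)],
) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|s| {
        // Spawn every reader first, then join in input order.
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(offset, len)| s.spawn(move || read_range(file, offset, len)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}

fn main() -> io::Result<()> {
    // Small scratch file standing in for column chunk bytes.
    let path = std::env::temp_dir().join("pread_sketch.bin");
    File::create(&path)?.write_all(b"0123456789abcdef")?;

    let file = File::open(&path)?;
    let chunks = read_ranges_parallel(&file, &[(0, 4), (10, 6), (4, 6)])?;
    for chunk in &chunks {
        println!("{}", String::from_utf8_lossy(chunk));
    }
    std::fs::remove_file(&path)?;
    Ok(())
}
```

Because each read carries its own offset, threads share the `File` by plain reference and never contend on a seek-then-read critical section.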
Crates published
- `ematix-parquet-format` 0.9.0
- `ematix-parquet-io` 0.9.0
- `ematix-parquet-crypto` 0.9.0
- `ematix-parquet-codec` 0.9.0
- `ematix-parquet-async` 0.9.0
🤖 Generated with Claude Code