
v0.9.0 — Π.15 (parallel multi-RG decode + NUMA awareness)


ryan-evans-git released this 17 May 11:44
4c20f9f

Highlights

Ships parallel multi-row-group decode end-to-end. Multi-RG files are embarrassingly parallel (each row group is independent); v0.9 lifts the threading and NUMA awareness into the codec so consumers don't have to roll their own.

New surface

  • New parallel feature on ematix-parquet-codec (rayon optional; default builds stay rayon-free).
  • parallel::read_columns_parallel(file, &targets, opts, decode_one) — decodes a slice of (row_group, column) targets concurrently. Generic over caller closure so the same primitive handles homogeneous + heterogeneous workloads. Output preserves input order.
  • CancellationToken (AtomicBool, Arc-cloneable). Cooperative semantics: the token is checked at target boundaries; targets that have not started when cancellation is observed surface the new CodecError::Cancelled, while in-flight decodes run to completion.
  • ParallelDecodeOptions { pool, cancel } — optional caller-owned Arc<rayon::ThreadPool> + cancellation handle.
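
The semantics listed above (order-preserving output, a caller-supplied closure, cancellation checked only at target boundaries) can be illustrated with a small standard-library sketch. This is not the crate's implementation or API: `CancelFlag`, `DecodeError`, and `decode_targets` are hypothetical names, plain scoped threads stand in for rayon, and the closure returns a bare value rather than a codec result.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a cancellation handle: a shared AtomicBool.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::Release);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::Acquire)
    }
}

/// Hypothetical error type; only the variant name mirrors the release notes.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Cancelled,
}

/// Decode `(row_group, column)` targets on `threads` workers.
/// Result `i` always corresponds to `targets[i]`, whichever worker ran it.
pub fn decode_targets<T, F>(
    targets: &[(usize, usize)],
    threads: usize,
    cancel: &CancelFlag,
    decode_one: F,
) -> Vec<Result<T, DecodeError>>
where
    T: Send,
    F: Fn(usize, usize) -> T + Sync,
{
    // Shared cursor: each worker claims the next unprocessed target index.
    let next = AtomicUsize::new(0);
    // One slot per target so results land at their input position.
    let slots: Vec<Mutex<Option<Result<T, DecodeError>>>> =
        targets.iter().map(|_| Mutex::new(None)).collect();

    thread::scope(|s| {
        for _ in 0..threads.max(1) {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= targets.len() {
                    break;
                }
                let (rg, col) = targets[i];
                // The flag is checked once per target, before decoding starts.
                // A decode that is already running is never interrupted.
                let out = if cancel.is_cancelled() {
                    Err(DecodeError::Cancelled)
                } else {
                    Ok(decode_one(rg, col))
                };
                *slots[i].lock().unwrap() = Some(out);
            });
        }
    });

    slots
        .into_iter()
        .map(|m| m.into_inner().unwrap().expect("every slot is filled"))
        .collect()
}
```

Because the closure is generic, the same function serves a homogeneous workload (every target decoded the same way) and a heterogeneous one (the closure branches on the column index).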

Linux-only NUMA (parallel::numa under cfg(target_os = "linux"))

  • NumaTopology::detect() — via /sys/devices/system/node/node*/cpulist.
  • pin_current_thread_to_node(&topology, node) — via sched_setaffinity.
  • build_numa_pinned_pool(num_threads) — rayon pool with workers pinned round-robin to NUMA nodes.
  • current_node() — via getcpu(2) syscall.
  • alloc_local_buffer(size) — 4 KiB-stride first-touch so the buffer lands on the calling thread's node. Combined with worker pinning, chunk bytes land on the right node — no libnuma C dep needed.
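
Two of the building blocks above are simple enough to sketch with the standard library alone: parsing a sysfs `cpulist` string and touching a buffer at a 4 KiB stride. The function names `parse_cpulist` and `alloc_first_touch` are hypothetical and are not the crate's API. The parser handles only the plain `N` and `N-M` forms, and the placement effect of first touch assumes Linux's default local allocation policy.

```rust
/// Assumed page size for the touch stride (4 KiB, as in the release notes).
pub const PAGE: usize = 4096;

/// Parse a cpulist string such as "0-3,8-11" into individual CPU ids.
/// Returns None on malformed input. An empty list (a node without CPUs)
/// parses to an empty vector.
pub fn parse_cpulist(s: &str) -> Option<Vec<usize>> {
    let mut cpus: Vec<usize> = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.trim().parse().ok()?),
        }
    }
    Some(cpus)
}

/// Allocate `size` bytes of capacity and write one byte per page, so the
/// kernel backs each page with physical memory while the calling thread
/// is running. The returned Vec has length 0 and capacity >= `size`.
pub fn alloc_first_touch(size: usize) -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(size);
    let ptr = buf.as_mut_ptr();
    let mut off = 0;
    while off < size {
        // SAFETY: off < size <= capacity, so the write stays inside the
        // allocation. A volatile write keeps the compiler from eliding it.
        unsafe { std::ptr::write_volatile(ptr.add(off), 0) };
        off += PAGE;
    }
    buf
}
```

The explicit touch loop matters because a zeroed allocation may be served from lazily mapped pages that are not physically placed until something writes to them.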

Bench harness

examples/bench_parallel_scaling.rs — synthetic 50-RG Snappy-compressed i64 file, sweeps thread counts 1, 2, 4, 8, … capped at the host CPU count; reports speedup + efficiency vs sequential. On Linux also exercises the NUMA-pinned pool. Ready to drop into a multi-socket AWS box for the plan acceptance numbers.
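
For readers interpreting the bench output, the sketch below shows the usual definitions: speedup is sequential time divided by parallel time, and efficiency is speedup divided by thread count. The function names are hypothetical, and the sweep assumes "capped" means powers of two that do not exceed the host CPU count; the actual bench may treat the cap differently.

```rust
use std::thread;

/// Host CPU count as reported by the standard library, falling back to 1.
pub fn host_cpus() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Thread counts 1, 2, 4, 8, ... that do not exceed `max_threads`.
pub fn thread_sweep(max_threads: usize) -> Vec<usize> {
    let mut counts = Vec::new();
    let mut n = 1;
    while n <= max_threads {
        counts.push(n);
        n *= 2;
    }
    counts
}

/// Speedup relative to the sequential run.
pub fn speedup(seq_secs: f64, par_secs: f64) -> f64 {
    seq_secs / par_secs
}

/// Parallel efficiency: speedup per thread, where 1.0 is perfect scaling.
pub fn efficiency(seq_secs: f64, par_secs: f64, threads: usize) -> f64 {
    speedup(seq_secs, par_secs) / threads as f64
}
```

A sweep for the current machine would be `thread_sweep(host_cpus())`.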

Constraints

  • NUMA module is cfg(target_os = "linux") — portable callers stay on read_columns_parallel; NUMA-aware callers cfg-gate their own usage.
  • Multi-socket scaling validation (plan acceptance #1) is deferred to AWS infra. On a local single-NUMA-node host, speedup peaks at ~1.8× because reads serialize on ParquetFile.file: Mutex<File>; the bench docstring documents this. Switching to pread-based unlocked I/O is a separate optimisation.
  • Cancellation is at target boundaries only — not within a single (rg, col) decode.
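
As background on the pread-based alternative mentioned above, the sketch below shows how positional reads let several threads share one `&File` without a lock, since each call carries its own offset and never moves a shared cursor. It is Unix-only, uses the standard library's `FileExt::read_exact_at`, and the `read_ranges` helper is hypothetical; it does not reflect how the crate will implement the change.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::thread;

/// Read each `(offset, len)` byte range on its own thread from a shared
/// `&File`. No Mutex is needed because positional reads do not touch the
/// file's seek position. Output order matches the order of `ranges`.
pub fn read_ranges(file: &File, ranges: &[(u64, usize)]) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|s| {
        // Spawn every reader first so the reads overlap in time.
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(offset, len)| {
                s.spawn(move || {
                    let mut buf = vec![0u8; len];
                    file.read_exact_at(&mut buf, offset)?;
                    Ok::<_, io::Error>(buf)
                })
            })
            .collect();
        // Join in input order; the first I/O error aborts the collection.
        handles
            .into_iter()
            .map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

One thread per range keeps the sketch short; a real reader would more likely issue these reads from an existing worker pool.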

Crates published

  • ematix-parquet-format 0.9.0
  • ematix-parquet-io 0.9.0
  • ematix-parquet-crypto 0.9.0
  • ematix-parquet-codec 0.9.0
  • ematix-parquet-async 0.9.0

🤖 Generated with Claude Code