parallel multi-file VCF ingest, SIMD genotype parsing#83

Merged
vineetver merged 1 commit into master from perf/simd-vcf-ingest
Apr 12, 2026

@vineetver vineetver commented Apr 12, 2026

Summary

  • Parallel multi-file VCF ingest via rayon. Files are distributed round-robin across N workers; each worker processes its chunk sequentially with shared buffers. N is bounded by the thread count.
  • memchr SIMD for genotype tab scanning (AVX2/SSE2 on x86_64; up to 32 bytes per vector comparison vs 1 byte per scalar loop iteration).
  • RecordContext struct replaces the 15-parameter process_record signature. Deduplicate ascii_uppercase_into. Fix a double read_sample_names call.
  • Validate sample consistency across all files before spawning workers.
  • Output trait now requires Send + Sync for thread safety.
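
For readers skimming the summary, the round-robin distribution can be sketched roughly as below. This is a minimal illustration, not the code in this PR: it uses std::thread::scope instead of rayon to stay dependency-free, and round_robin_chunks, process_file, and ingest_parallel are hypothetical names. The per-file work is a stand-in that only measures the path length.

```rust
use std::thread;

/// Distribute `files` round-robin across at most `max_workers` chunks.
/// Hypothetical helper; the PR's real function names are not shown here.
fn round_robin_chunks(files: &[String], max_workers: usize) -> Vec<Vec<String>> {
    // Never create more chunks than files, and always at least one.
    let n = max_workers.min(files.len()).max(1);
    let mut chunks = vec![Vec::new(); n];
    for (i, file) in files.iter().enumerate() {
        chunks[i % n].push(file.clone());
    }
    chunks
}

/// Stand-in for real per-file ingest: returns the path length as a fake count.
fn process_file(path: &str, scratch: &mut Vec<u8>) -> usize {
    scratch.clear(); // reuse the worker's buffer instead of reallocating
    scratch.extend_from_slice(path.as_bytes());
    scratch.len()
}

/// One thread per chunk; each thread walks its chunk sequentially.
fn ingest_parallel(files: &[String], max_workers: usize) -> usize {
    let chunks = round_robin_chunks(files, max_workers);
    thread::scope(|s| {
        // Spawn every worker first, then join, so the chunks run concurrently.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| {
                s.spawn(move || {
                    let mut scratch = Vec::new(); // one buffer per worker
                    chunk
                        .iter()
                        .map(|f| process_file(f, &mut scratch))
                        .sum::<usize>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}
```

With five files and two workers, the first worker gets files 0, 2, 4 and the second gets files 1, 3. Since each worker owns its buffer, the number of live buffers tracks the worker count rather than the file count, which is presumably what keeps memory bounded as the file count grows.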

Benchmarks (UKB chr22, 200K samples, 16 cores, 64G)

                  master (1 block)   this PR (1 block)   this PR (23 blocks)
wall              5m10s              3m04s               7m03s
CPU efficiency    10%                37%                 65%
peak RAM          31G                31G                 22G
variants          17,847             17,847              414,980

Sequential extrapolation for 23 blocks: ~97 min. Parallel: 7m03s, roughly a 14x speedup (97 min / 7.05 min ≈ 13.8).

Test plan

  • cargo test -- 270 tests pass
  • Single-file produces identical output to master
  • 23-block parallel: 414,980 variants, no OOM at 64G
  • Sample validation catches mismatched headers
  • Memory bounded: scales to 100+ files without OOM
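
The sample validation check in the test plan could look roughly like the sketch below. It assumes the sample names of each file have already been read from the header, and that "consistent" means identical names in identical order; both are assumptions, and validate_sample_consistency is a hypothetical name rather than the function added in this PR.

```rust
/// Check that every file lists the same samples, in the same order, as the
/// first file. `headers` pairs a file path with its sample names
/// (hypothetical input shape, assumed for illustration).
fn validate_sample_consistency(headers: &[(String, Vec<String>)]) -> Result<(), String> {
    let Some((first_path, expected)) = headers.first() else {
        return Ok(()); // nothing to compare
    };
    for (path, samples) in &headers[1..] {
        if samples.len() != expected.len() {
            return Err(format!(
                "{path}: {} samples, but {first_path} has {}",
                samples.len(),
                expected.len()
            ));
        }
        // Report the first position where the names diverge.
        if let Some(i) = samples.iter().zip(expected).position(|(a, b)| a != b) {
            return Err(format!(
                "{path}: sample {i} is {:?}, but {first_path} has {:?}",
                samples[i], expected[i]
            ));
        }
    }
    Ok(())
}
```

Running a check like this before any worker is spawned means a mismatched header fails fast, before any part files are written.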

Partially addresses #74.

Process multiple VCF files concurrently via rayon. Each file gets its
own worker with independent BGZF reader, per-chrom parquet writers, and
optional genotype writer. Workers write part_{id}.parquet per chromosome,
scan_and_register merges metadata after all workers join.

23 UKB chr22 blocks (163GB BGZF, 414K variants, 200K samples):
  sequential extrapolation: ~97 min
  parallel (16 cores, 64G): 9m28s — 10x speedup, 68% CPU efficiency

Also: memchr SIMD for genotype tab scanning (41% single-file speedup),
RecordContext replaces 15-parameter process_record, dedup ascii_uppercase_into,
validate sample consistency across files before spawning workers,
Output trait is Send+Sync for thread safety.
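
As a rough illustration of the tab scanning mentioned above, the sketch below splits the genotype columns of one VCF line by hopping from tab to tab. It is not the code from this commit: find_tab is a scalar, std-only stand-in for memchr::memchr(b'\t', haystack), which returns the same Option<usize> but uses vector instructions where available, and genotype_fields is a hypothetical name.

```rust
/// Scalar stand-in for `memchr::memchr(b'\t', haystack)`.
fn find_tab(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == b'\t')
}

/// Return the sample columns of one VCF data line as byte slices.
/// `first_sample_col` is 9 for a standard VCF that has a FORMAT column.
fn genotype_fields(line: &[u8], first_sample_col: usize) -> Vec<&[u8]> {
    let mut rest = line;
    // Skip the fixed columns that precede the first sample.
    for _ in 0..first_sample_col {
        match find_tab(rest) {
            Some(i) => rest = &rest[i + 1..],
            None => return Vec::new(), // line has no sample columns
        }
    }
    let mut fields = Vec::new();
    while let Some(i) = find_tab(rest) {
        fields.push(&rest[..i]);
        rest = &rest[i + 1..];
    }
    fields.push(rest); // the last field has no trailing tab
    fields
}
```

With 200K samples per line, nearly all of the scan time is spent in find_tab, so swapping in a vectorized search changes only that one function.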
@vineetver vineetver force-pushed the perf/simd-vcf-ingest branch from b73670f to 93ac488 Compare April 12, 2026 04:40
@vineetver vineetver merged commit 5c96066 into master Apr 12, 2026
3 checks passed
@vineetver vineetver deleted the perf/simd-vcf-ingest branch April 12, 2026 17:33