Summary
bcftools view -R sites.bed.gz target.vcf.gz is 350-400× slower than a sequential scan when the BED contains many densely-clustered single-base regions — a common pattern in PRS / ancestry pipelines (HGDP+1kGP, AADR 1240k, PGS Catalog). A production run on 84M such regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU before we killed it.
Reproducer
python3 gen_reproducer.py --regions 1000000 --prefix synth1m
bgzip -f synth1m.bed && tabix -p bed synth1m.bed.gz
bgzip -f synth1m.vcf && tabix -p vcf synth1m.vcf.gz
bcftools view -R synth1m.bed.gz synth1m.vcf.gz > /dev/null
gen_reproducer.py (N densely-clustered single-base positions, written to both BED and VCF)
import argparse, random
# Approx GRCh38 autosome lengths (Mbp) — used as relative weights only.
CHROM_LEN_MBP = [248,242,198,190,181,170,159,145,138,133,
135,133,114,107,101, 90, 83, 80, 58, 64, 46, 50]
def gen_positions(n_total, avg_spacing, seed=42):
rng = random.Random(seed)
rate = 1.0 / avg_spacing
total = sum(CHROM_LEN_MBP)
for chrom_i, mbp in enumerate(CHROM_LEN_MBP, start=1):
length, share = mbp * 1_000_000, n_total * mbp // total
pos = 0
for _ in range(share):
pos += max(1, int(rng.expovariate(rate)))
if pos >= length: break
yield str(chrom_i), pos
ap = argparse.ArgumentParser()
ap.add_argument("--regions", type=int, default=1_000_000)
ap.add_argument("--avg-spacing", type=int, default=10)
ap.add_argument("--prefix", default="synth")
args = ap.parse_args()
positions = list(gen_positions(args.regions, args.avg_spacing))
with open(f"{args.prefix}.bed", "w") as f:
for chrom, pos in positions:
f.write(f"{chrom}\t{pos-1}\t{pos}\n")
with open(f"{args.prefix}.vcf", "w") as f:
f.write("##fileformat=VCFv4.2\n")
for i, mbp in enumerate(CHROM_LEN_MBP, start=1):
f.write(f"##contig=<ID={i},length={mbp * 1_000_000}>\n")
f.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n")
for chrom, pos in positions:
f.write(f"{chrom}\t{pos}\t.\tA\tG\t100\tPASS\t.\tGT\t0/0\t0/0\t0/0\n")
Measurements (bcftools 1.23.1, htslib 1.23.1, macOS arm64)
N = 1,000,000 regions = 1,000,000 records:
| run |
real |
× baseline |
view -Ou vcf.gz > /dev/null (baseline) |
0.45 s |
1× |
view -R bed.gz (default --regions-overlap 1) |
156.6 s |
348× |
view -R bed.gz --regions-overlap 0 |
193.1 s |
429× |
view -R bed.gz --regions-overlap 2 |
239.7 s |
533× |
view -T bed.gz (streaming, --targets-overlap 0) |
1.34 s |
3.0× |
view -T bed.gz --targets-overlap 1 |
1.68 s |
3.7× |
view -T bed.gz --targets-overlap 2 |
1.08 s |
2.4× |
At N=5M: baseline 1.79 s, view -R default 705.5 s (394×). The 84M production case extrapolates to ≥ 12,000 s (~3.3 h) of iterator setup alone, before any meaningful work.
Two takeaways: --regions-overlap value is irrelevant to the bottleneck; -T (streaming) handles the identical workload at near-baseline speed.
Root cause (htslib synced_bcf_reader.c)
_readers_next_region() calls _reader_seek() per BED region, which invokes a fresh tbx_itr_queryi(). For N regions, N iterator setup+teardown cycles. With 84M dense 1bp regions, that's 84M iterator queries when ~22 (one per autosome) would suffice.
// synced_bcf_reader.c:595 — runs N times for N BED entries
static int _readers_next_region(bcf_srs_t *files) {
...
if ( bcf_sr_regions_next(files->regions)<0 ) return -1;
for (i=0; i<files->nreaders; i++)
_reader_seek(&files->readers[i], // ← fresh tbx_itr_queryi
files->regions->seq_names[files->regions->iseq],
files->regions->start, files->regions->end);
return 0;
}
Suggested fix
Filing as a feature request — semantics are correct, just impractically slow at this scale.
Preferred: region coalescing in _readers_next_region. If the next BED region starts within K bp of the current iterator position, keep streaming through the existing iterator and let the per-record overlap check at line 733 filter records in gaps. Collapses 84M iterator queries → ~22, generalizes to any dense-region workload. K tunable, default ~64 KB ≈ one BGZF block.
Alternative: opt-in --regions-stream flag forcing in-memory regidx + top-to-bottom VCF stream (semantically equivalent to -T but inheriting -R's default overlap mode). Lowest-risk; preserves current behavior.
The existing -T path is the correctness baseline (1.34 s vs 0.45 s at 1M — see table); a regression test built on the reproducer above is straightforward.
bcftools 1.23.1 / htslib 1.23.1 / macOS arm64; reproducible on Linux x86_64 per the production observation above.
Summary
bcftools view -R sites.bed.gz target.vcf.gzis 350-400× slower than a sequential scan when the BED contains many densely-clustered single-base regions — a common pattern in PRS / ancestry pipelines (HGDP+1kGP, AADR 1240k, PGS Catalog). A production run on 84M such regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU before we killed it.Reproducer
gen_reproducer.py(N densely-clustered single-base positions, written to both BED and VCF)Measurements (bcftools 1.23.1, htslib 1.23.1, macOS arm64)
N = 1,000,000 regions = 1,000,000 records:
view -Ou vcf.gz > /dev/null(baseline)view -R bed.gz(default--regions-overlap 1)view -R bed.gz --regions-overlap 0view -R bed.gz --regions-overlap 2view -T bed.gz(streaming,--targets-overlap 0)view -T bed.gz --targets-overlap 1view -T bed.gz --targets-overlap 2At N=5M: baseline 1.79 s,
view -Rdefault 705.5 s (394×). The 84M production case extrapolates to ≥ 12,000 s (~3.3 h) of iterator setup alone, before any meaningful work.Two takeaways:
--regions-overlapvalue is irrelevant to the bottleneck;-T(streaming) handles the identical workload at near-baseline speed.Root cause (htslib
synced_bcf_reader.c)_readers_next_region()calls_reader_seek()per BED region, which invokes a freshtbx_itr_queryi(). For N regions, N iterator setup+teardown cycles. With 84M dense 1bp regions, that's 84M iterator queries when ~22 (one per autosome) would suffice.Suggested fix
Filing as a feature request — semantics are correct, just impractically slow at this scale.
Preferred: region coalescing in
_readers_next_region. If the next BED region starts within K bp of the current iterator position, keep streaming through the existing iterator and let the per-record overlap check at line 733 filter records in gaps. Collapses 84M iterator queries → ~22, generalizes to any dense-region workload. K tunable, default ~64 KB ≈ one BGZF block.Alternative: opt-in
--regions-streamflag forcing in-memory regidx + top-to-bottom VCF stream (semantically equivalent to-Tbut inheriting-R's default overlap mode). Lowest-risk; preserves current behavior.The existing
-Tpath is the correctness baseline (1.34 s vs 0.45 s at 1M — see table); a regression test built on the reproducer above is straightforward.bcftools 1.23.1 / htslib 1.23.1 / macOS arm64; reproducible on Linux x86_64 per the production observation above.