Skip to content

view -R: pathological performance with large single-base region sets (84M regions, 11+ hours unfinished) #2557

@carstenerickson

Description

@carstenerickson

Summary

bcftools view -R sites.bed.gz target.vcf.gz is 350-400× slower than a sequential scan when the BED contains many densely-clustered single-base regions — a common pattern in PRS / ancestry pipelines (HGDP+1kGP, AADR 1240k, PGS Catalog). A production run on 84M such regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU before we killed it.

Reproducer

python3 gen_reproducer.py --regions 1000000 --prefix synth1m
bgzip -f synth1m.bed && tabix -p bed synth1m.bed.gz
bgzip -f synth1m.vcf && tabix -p vcf synth1m.vcf.gz
bcftools view -R synth1m.bed.gz synth1m.vcf.gz > /dev/null
gen_reproducer.py (N densely-clustered single-base positions, written to both BED and VCF)
import argparse, random

# Approx GRCh38 autosome lengths (Mbp) — used as relative weights only.
CHROM_LEN_MBP = [248,242,198,190,181,170,159,145,138,133,
                 135,133,114,107,101, 90, 83, 80, 58, 64, 46, 50]

def gen_positions(n_total, avg_spacing, seed=42):
    rng = random.Random(seed)
    rate = 1.0 / avg_spacing
    total = sum(CHROM_LEN_MBP)
    for chrom_i, mbp in enumerate(CHROM_LEN_MBP, start=1):
        length, share = mbp * 1_000_000, n_total * mbp // total
        pos = 0
        for _ in range(share):
            pos += max(1, int(rng.expovariate(rate)))
            if pos >= length: break
            yield str(chrom_i), pos

ap = argparse.ArgumentParser()
ap.add_argument("--regions", type=int, default=1_000_000)
ap.add_argument("--avg-spacing", type=int, default=10)
ap.add_argument("--prefix", default="synth")
args = ap.parse_args()
positions = list(gen_positions(args.regions, args.avg_spacing))

with open(f"{args.prefix}.bed", "w") as f:
    for chrom, pos in positions:
        f.write(f"{chrom}\t{pos-1}\t{pos}\n")

with open(f"{args.prefix}.vcf", "w") as f:
    f.write("##fileformat=VCFv4.2\n")
    for i, mbp in enumerate(CHROM_LEN_MBP, start=1):
        f.write(f"##contig=<ID={i},length={mbp * 1_000_000}>\n")
    f.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n")
    for chrom, pos in positions:
        f.write(f"{chrom}\t{pos}\t.\tA\tG\t100\tPASS\t.\tGT\t0/0\t0/0\t0/0\n")

Measurements (bcftools 1.23.1, htslib 1.23.1, macOS arm64)

N = 1,000,000 regions = 1,000,000 records:

run real × baseline
view -Ou vcf.gz > /dev/null (baseline) 0.45 s
view -R bed.gz (default --regions-overlap 1) 156.6 s 348×
view -R bed.gz --regions-overlap 0 193.1 s 429×
view -R bed.gz --regions-overlap 2 239.7 s 533×
view -T bed.gz (streaming, --targets-overlap 0) 1.34 s 3.0×
view -T bed.gz --targets-overlap 1 1.68 s 3.7×
view -T bed.gz --targets-overlap 2 1.08 s 2.4×

At N=5M: baseline 1.79 s, view -R default 705.5 s (394×). The 84M production case extrapolates to ≥ 12,000 s (~3.3 h) of iterator setup alone, before any meaningful work.

Two takeaways: --regions-overlap value is irrelevant to the bottleneck; -T (streaming) handles the identical workload at near-baseline speed.

Root cause (htslib synced_bcf_reader.c)

_readers_next_region() calls _reader_seek() per BED region, which invokes a fresh tbx_itr_queryi(). For N regions, N iterator setup+teardown cycles. With 84M dense 1bp regions, that's 84M iterator queries when ~22 (one per autosome) would suffice.

// synced_bcf_reader.c:595 — runs N times for N BED entries
static int _readers_next_region(bcf_srs_t *files) {
    ...
    if ( bcf_sr_regions_next(files->regions)<0 ) return -1;
    for (i=0; i<files->nreaders; i++)
        _reader_seek(&files->readers[i],            // ← fresh tbx_itr_queryi
                     files->regions->seq_names[files->regions->iseq],
                     files->regions->start, files->regions->end);
    return 0;
}

Suggested fix

Filing as a feature request — semantics are correct, just impractically slow at this scale.

Preferred: region coalescing in _readers_next_region. If the next BED region starts within K bp of the current iterator position, keep streaming through the existing iterator and let the per-record overlap check at line 733 filter records in gaps. Collapses 84M iterator queries → ~22, generalizes to any dense-region workload. K tunable, default ~64 KB ≈ one BGZF block.

Alternative: opt-in --regions-stream flag forcing in-memory regidx + top-to-bottom VCF stream (semantically equivalent to -T but inheriting -R's default overlap mode). Lowest-risk; preserves current behavior.

The existing -T path is the correctness baseline (1.34 s vs 0.45 s at 1M — see table); a regression test built on the reproducer above is straightforward.


bcftools 1.23.1 / htslib 1.23.1 / macOS arm64; reproducible on Linux x86_64 per the production observation above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions