schlein-lab/branch

BRANCH — Lossless Assembly Graph for Low-Frequency CNV Discovery

🌐 branch-assembler.com · 🔬 Companion viewer: VariantPaths (variantpaths.com)

Overview

BRANCH is a HiFi-read genome assembler designed to deliver state-of-the-art detection of low-frequency copy-number variants (CNVs). It produces a lossless, copy-number-aware (CN-aware) assembly graph in which a "branch" is a graph bifurcation, not a tumor clone. Every variant call carries variant allele frequency (VAF) evidence from three channels: supporting reads, in-silico PCR, and k-mer counts.

Status

The Phase 0 end-to-end pipeline runs on both HiFi and ONT R10.4.1 samples (for ONT, pass --read-tech ont; see docs/ont-support.md). Known gaps: unitig collapse is not yet final, RAM consumption is not fully deterministic, CPU utilisation is low (overlap is single-threaded), and per-contig chromosome projection is pending (see branch project below).

Pipeline

reader → graph_build → graph_compactor → graph_filter → assemble

Subcommands

  • branch assemble — reads → minimizer overlap → lossless graph → GFA + FASTA + PAF + BED.
  • branch analyze — mosdepth regions → copy-number inference with paralog awareness.
  • branch project (v0.4.3) — three-layer reference projection: linear (CHM13 + GRCh38 via minimap2), pangenome (HPRC v1.1 via minigraph / GraphAligner), and somatic delta vs. the nearest pangenome path. Map in, comparison out: every branch is matched against the standard human genome and a public collection of known DNA variation, the residual edit distance to the closest known sequence is reported, and brand-new changes are flagged. See docs/branch-project-design.md.
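
For readers unfamiliar with the "minimizer overlap" step named under branch assemble, the following is a minimal Python sketch of the general window-minimizer technique. It is not BRANCH's C++ implementation: the function names, the lexicographic ordering, and the values of k and w are assumptions chosen for illustration only.

```python
def minimizers(seq: str, k: int = 4, w: int = 3) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs: the smallest k-mer in each window of w consecutive k-mers.

    Illustrative only: uses plain lexicographic order and leftmost tie-breaking.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])  # leftmost minimum on ties
        hit = (start + offset, window[offset])
        if not picked or picked[-1] != hit:  # the same k-mer often wins adjacent windows
            picked.append(hit)
    return picked


def shared_minimizers(read_a: str, read_b: str, k: int = 4, w: int = 3) -> set[str]:
    """K-mers selected as minimizers in both reads: candidate overlap seeds."""
    return {m for _, m in minimizers(read_a, k, w)} & {m for _, m in minimizers(read_b, k, w)}


# Two hypothetical reads whose suffix/prefix overlap by 12 bases.
overlap = "ACGTTGCATGCA"
read_a = "TTTT" + overlap
read_b = overlap + "GGGG"
print(shared_minimizers(read_a, read_b))
```

Any two reads that share an exact stretch of at least k + w - 1 bases are guaranteed to share a minimizer, which is why minimizers are commonly used to find overlap candidates without comparing every k-mer.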

Output contract

  • BED: branch intervals (chrom, start, end, branch_id, VAF, CN).
  • Consensus FASTA per branch.
  • VAF evidence channels: supporting reads, primer-bracketed in-silico PCR amplicons, k-mer counts on read sequence.
  • Genome-wide repeat CN for main path and every branch, normalised against single-copy reference amplicons.
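
As an illustration of consuming the BED output described above, here is a hedged Python sketch that parses the six listed columns and keeps low-VAF branches. It assumes tab-separated fields in the order given (chrom, start, end, branch_id, VAF, CN), VAF expressed as a fraction between 0 and 1, and no header line. The class name, function names, 5% threshold, and example records are hypothetical and not part of BRANCH.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchInterval:
    chrom: str
    start: int      # 0-based start, per the usual BED convention
    end: int        # exclusive end
    branch_id: str
    vaf: float      # assumed to be a fraction in [0, 1]
    cn: float       # copy number; parsed as float in case it is fractional


def parse_branch_bed(text: str) -> list[BranchInterval]:
    """Parse BED text with the columns chrom, start, end, branch_id, VAF, CN."""
    records = []
    for line in text.splitlines():
        # Skip blank lines and the comment/track lines some BED writers emit.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, branch_id, vaf, cn = line.split("\t")[:6]
        records.append(
            BranchInterval(chrom, int(start), int(end), branch_id, float(vaf), float(cn))
        )
    return records


def low_frequency(records: list[BranchInterval], max_vaf: float = 0.05) -> list[BranchInterval]:
    """Keep branches whose VAF is at or below max_vaf (threshold is an arbitrary example)."""
    return [r for r in records if r.vaf <= max_vaf]


# Made-up records for demonstration only.
example = "chr1\t1000\t2500\tbr0001\t0.03\t3\nchr2\t500\t900\tbr0002\t0.41\t1\n"
for rec in low_frequency(parse_branch_bed(example)):
    print(rec.branch_id, rec.chrom, rec.end - rec.start, rec.vaf, rec.cn)
```

Check docs/graph-format-spec.md and the actual output files for the authoritative column types before relying on this layout.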

Visualizing BRANCH output

VariantPaths is the companion standalone viewer. It reads .vpf (topology + VAF + dbVar match) and .vpz (alt-path nucleotide sequences) files, both built from BRANCH outputs:

# 1. BRANCH produces bubble BED + GFA per sample
branch assemble  --fastq sample.fq  --out sample.gfa  --out-reads sample.gaf
branch analyze   --graph sample.gfa --reads sample.gaf --out-bed sample.bubbles.bed

# 2. Aggregate + classify across samples (dbVar overlap, recurrence, VAF)
python3 phase_d/scripts/11_branch_atlas.py \
    --inputs "sampleA.bubbles.bed:sampleA,sampleB.bubbles.bed:sampleB" \
    --out per_bubble_master.tsv

# 3. Convert to portable .vpf (topology) + .vpz (sequences)
#    schlein-lab/variantpaths/build_vpf.py
#    schlein-lab/variantpaths/build_vpz.py

# 4. Open in VariantPaths
variantpaths sample.vpf sample.vpz reference.fa
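
With many samples, the --inputs value in step 2 becomes tedious to type by hand. The sketch below builds it from a list of BED paths, assuming the format is exactly what the example shows: comma-separated path:label pairs. The helper name atlas_inputs and the rule for deriving labels from file names are hypothetical; the script may accept other forms.

```python
from pathlib import Path


def atlas_inputs(bed_paths: list[str], suffix: str = ".bubbles.bed") -> str:
    """Join BED paths into 'path:label,path:label', labelling each by its file name."""
    pairs = []
    for path in sorted(bed_paths):
        if ":" in path or "," in path:
            # These characters are the separators, so they cannot appear in a path here.
            raise ValueError(f"path contains a separator character: {path}")
        name = Path(path).name
        label = name[: -len(suffix)] if name.endswith(suffix) else Path(path).stem
        pairs.append(f"{path}:{label}")
    return ",".join(pairs)


print(atlas_inputs(["sampleB.bubbles.bed", "sampleA.bubbles.bed"]))
```

The returned string can be passed as the quoted argument to --inputs.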

See variantpaths.com for full feature docs.

Build

cmake -S . -B build
cmake --build build -j
ctest --test-dir build --output-on-failure

Requires C++20, CMake ≥ 3.20, zlib, and pthread. htslib, ksw2, and abPOA are vendored in third_party/. The CUDA backend is opt-in via -DBRANCH_BUILD_CUDA=ON.

Running on an HPC cluster

An sbatch-compatible driver lives in workflow/; adapt the partition, account, and paths to your own site. The pipeline is filesystem-agnostic: point --fastq / --bam at your reads and --out at a writable output directory.

Tech stack

  • C++20 core.
  • htslib (BAM/CRAM I/O), ksw2 (affine-gap alignment), abPOA (partial-order consensus).
  • CMake build; ASan + TSan required in CI.

Repository layout

  • src/ — core C++ sources.
  • docs/architecture.md — pipeline internals, graph data model, classification problem.
  • docs/branch-project-design.md — reference-projection subcommand design.
  • docs/cnv_roadmap.md — low-frequency CNV roadmap.
  • docs/graph-format-spec.md — on-disk graph format.
  • workflow/ — SLURM / Snakemake driver scaffold.
  • tests/ — unit + integration tests (GoogleTest).

Design principles

Lossless graph, CN-aware nodes, VAF-tagged edges, SV-first phasing, multi-allelic branches.

About

BRANCH — Breakpoint-Resolved Assembly of Non-diploid Copy-number Heterogeneity. Somatic mosaic-aware genome assembler for PacBio HiFi with vPCR-based CNV quantification.
