
BioGraph for Population Scale Genomics

Rob Flickenger edited this page Aug 9, 2021 · 2 revisions

Genomic sequencing continues to produce data at an accelerating pace. Global sequencing capacity has expanded by approximately one exabase per year since 2015, and is projected to approach one zettabase per year by 2025.

As your genomics project approaches population scale, you will face the unique challenge of comparing large genomic datasets in a comprehensive and cost-effective way. Traditional analysis pipelines begin to break down as projects grow from dozens, to hundreds, to thousands of whole-genome samples.

  • How can variant calls be normalized and meaningfully compared across thousands of samples?
  • How can a new sample be added to a population and genotyped against newly discovered variants without triggering a full reanalysis of the population (the N+1 problem)?
  • What happens when the project needs to change to the latest human reference, or use a population-specific reference?
  • How can structural variations be analyzed in a meaningful way at scale?

BioGraph was created to address these issues.

Efficient graph storage for read data

BioGraph is an efficient file format for storing unaligned read data as a graph structure that can be rapidly queried by sequence. This format is much smaller and faster to query than FASTQ, and can be used immediately without the decompression that slows down analysis of traditional formats (such as BAM, CRAM, or SRA).

BioGraph reads are indexed by sequence, rather than by position relative to a reference genome (as is the case with aligned BAM or CRAM). A nucleotide query takes constant time per base, so lookup time grows with the length of the query and not with the number of sequences stored in the BioGraph. This capability unlocks novel approaches to variant detection and validation unavailable in any other analysis tool, and allows BioGraph to provide industry-leading structural variant detection.
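To make the idea of a sequence-indexed lookup concrete, here is a minimal Python sketch of a toy FM-index style backward search over a small set of reads. It is an illustration under stated assumptions, not BioGraph's actual data structure or API: the names `build_index` and `count_occurrences` are hypothetical, and the index is built naively for readability. What it does show is the query property described above: each base of the query costs a fixed number of table lookups, however many reads were indexed.

```python
def build_index(reads):
    """Build a toy FM-index over a collection of reads (illustrative only)."""
    # Terminate every read with '$' so that no match can span two reads.
    text = "".join(read + "$" for read in reads)
    # Naive suffix array construction: fine for a demo, far too slow for real data.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in suffix_array]
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that sort strictly before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(text)


def count_occurrences(index, query):
    """Count how often `query` occurs in the indexed reads.

    The loop runs once per base of the query and does two table lookups
    per iteration, so the cost does not depend on the number of reads.
    """
    first, occ, n = index
    lo, hi = 0, n  # half-open range of suffixes matching the query suffix seen so far
    for c in reversed(query):
        if c == "$" or c not in first:
            return 0  # the base never appears in the reads (or is the separator)
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_index(["GATTACA", "TTACAGG", "ACGT"])
    for q in ("TTACA", "AC", "GGG"):
        print(q, count_occurrences(index, q))
```

The sketch trades memory for simplicity by storing a full occurrence table per base; a production index would use compressed rank structures, and nothing here should be read as a description of how BioGraph stores its data on disk.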

The biograph program converts NGS reads to the BioGraph format, discovers variants on those reads with respect to a reference, and performs quality score classification, genotyping, and a number of other analysis and utility functions. It is the primary interface for interacting with BioGraph files and producing analysis results.

BioGraph is cloud-ready

BioGraph was designed with high-volume pipelines in mind. Most steps of the pipeline are written in C++ for modern Linux systems, and all operations are highly parallelized, taking advantage of all available CPUs and memory by default.

After conversion, a BioGraph file is considerably smaller than the source FASTQ and roughly the same size as a CRAM file. Unlike traditional read formats, which must be decompressed on the fly before analysis, BioGraph files can be queried immediately.

While BioGraph runs well in traditional cluster environments, exceptional performance can be obtained in cloud environments (such as AWS, Google Cloud, or Azure), where processors, RAM, and scratch disk performance can be scaled as needed.


Next: Installation
