
Quick Start

Rob Flickenger edited this page Aug 9, 2021 · 1 revision

The biograph full_pipeline command converts reads to the BioGraph format, calls variants against a reference, calculates coverage, and assigns genotypes and quality scores using a machine learning model.

$ . bg7/bin/activate
(bg7)$ biograph reference --in hs37d5.fasta --refdir hs37d5/ 
...
(bg7)$ biograph full_pipeline --biograph my.bg --ref hs37d5/ \
  --reads /path/to/my_reads.bam \
  --model /path/to/biograph_model-6.0.5.ml \
  --tmp /path/to/large/tmp/
2020-09-10 16:02:01,944 [INFO] Running biograph full_pipeline
...
2020-09-10 17:38:53,981 [INFO] Finished full_pipeline
(bg7)$ 

Results of the analysis are saved inside the BioGraph in the analysis/ folder.

(bg7)$ ls my.bg/analysis/
results.vcf.gz  results.vcf.gz.tbi
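
As a quick sanity check on the output, the number of variant records in the results file can be counted with standard tools. This is a minimal sketch: count_variants is a hypothetical helper, not a BioGraph command, and it assumes only that the file is a gzip-compatible VCF whose header lines start with #.

```shell
# Count the variant records (non-header lines) in a gzipped VCF.
# count_variants is a hypothetical helper, not part of BioGraph.
count_variants() {
  # Header and column lines start with '#'; everything else is a record.
  gzip -dc "$1" | grep -vc '^#'
}

# Example, using the path from the listing above:
#   count_variants my.bg/analysis/results.vcf.gz
```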

Running full_pipeline requires a reads file (--reads) and a reference in BioGraph reference format (--ref). The full path to the BioGraph classifier model should also be provided with --model.

References

BioGraph uses a genetic reference to speed up the read conversion process, as well as for variant calling and genotyping. It requires a number of indices in addition to the FASTA file itself that are all kept in a single reference directory.

Conversion from a BWA-indexed FASTA to the BioGraph reference format only needs to be done once for each reference of interest (hs37d5, GRCh38, etc.). The conversion takes roughly half an hour and produces a reference directory of about 14 GB, which can be reused for all subsequent analyses.

For example, the following command will convert the file hs37d5.fasta and save the resulting BioGraph reference to the directory ./hs37d5/:

(bg7)$ biograph reference --in /path/to/hs37d5.fasta --refdir ./hs37d5/

Prebuilt references (including hs37d5 and GRCh38) may be downloaded from AWS S3 at s3://spiral-public/references/. The README.txt under each prefix includes information about each reference, including where the source FASTA was downloaded from.

Note that each BioGraph reference is a folder containing several files, not a single file.
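
Because a reference is a whole folder, a prebuilt reference has to be copied recursively. The sketch below uses the AWS CLI for this. It rests on several assumptions: the AWS CLI is installed, the bucket permits anonymous reads (hence --no-sign-request), and the prefix name you pass matches one that exists under s3://spiral-public/references/. fetch_reference is a hypothetical helper, not a BioGraph command.

```shell
# Sketch: fetch a prebuilt BioGraph reference folder from the public bucket.
# Assumptions: AWS CLI installed, anonymous reads allowed, prefix name exists.
# List the available prefixes first with:
#   aws s3 ls --no-sign-request s3://spiral-public/references/
fetch_reference() {
  name="$1"                       # reference prefix, e.g. hs37d5 (verify in the listing)
  dest="${2:-./$name/}"           # local reference directory
  src="s3://spiral-public/references/$name/"
  if [ -n "${DRY_RUN:-}" ]; then
    # Print the command instead of running it.
    echo aws s3 sync --no-sign-request "$src" "$dest"
  else
    aws s3 sync --no-sign-request "$src" "$dest"
  fi
}

# Example:
#   DRY_RUN=1 fetch_reference hs37d5    # show what would be run
#   fetch_reference hs37d5 ./hs37d5/    # download into ./hs37d5/
```

Setting DRY_RUN makes it easy to confirm the source and destination before starting a transfer of this size.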

Logging and QC

The logs from every BioGraph command are saved under the qc/ folder inside the BioGraph. Additional statistics and reports are saved there as JSON and HTML files.

(bg7)$ ls my.bg/qc/
create_log.txt                            timings.json
create_stats.json                         variants_log.txt
kmer_quality_report-BELOW_MIN_COUNT.html  variants_stats.json
kmer_quality_report.html

For more details about what is stored in the various files under the BioGraph directory, see What is Inside a BioGraph?
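
The JSON reports can be read more easily when pretty-printed. The following is a minimal sketch that assumes python3 is on the PATH; show_qc_json is a hypothetical helper, and nothing is assumed about the keys inside each file.

```shell
# Pretty-print every JSON report in a BioGraph qc/ folder.
# show_qc_json is a hypothetical helper; it assumes python3 is on the PATH.
show_qc_json() {
  for f in "$1"/*.json; do
    [ -e "$f" ] || continue       # skip if the glob matched nothing
    echo "== $f"
    python3 -m json.tool "$f"     # standard-library JSON formatter
  done
}

# Example:
#   show_qc_json my.bg/qc
```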

Keeping intermediary files

The intermediary files generated by each step in the pipeline are automatically removed at the end of the analysis. These files can be quite large and are generally not required. However, you may wish to keep some or all intermediaries for QC or other purposes.

The --keep vcf parameter will keep intermediary VCFs, --keep jl will keep all dataframes, and --keep all will keep everything.

(bg7)$ biograph full_pipeline --biograph my.bg --ref hs37d5/ \
  --reads /path/to/my_reads.bam \
  --model /path/to/biograph_model-6.0.5.ml \
  --keep all
...
(bg7)$ ls my.bg/analysis/
coverage.vcf.gz      df.jl             discovery.vcf.gz.tbi  results.vcf.gz
coverage.vcf.gz.tbi  discovery.vcf.gz  grm.jl                results.vcf.gz.tbi
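
In the listing above, every kept .vcf.gz file has a matching .tbi index. If intermediaries are later copied or moved, a short check can confirm that no index was left behind. This is only a sketch; missing_indexes is a hypothetical helper, not a BioGraph command.

```shell
# Print any .vcf.gz file in a directory that lacks its .tbi index.
# missing_indexes is a hypothetical helper, not part of BioGraph.
missing_indexes() {
  for f in "$1"/*.vcf.gz; do
    [ -e "$f" ] || continue       # skip if the glob matched nothing
    [ -e "$f.tbi" ] || echo "$f"  # report files without an index
  done
}

# Example (prints nothing when every VCF is indexed):
#   missing_indexes my.bg/analysis
```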

Each step in the BioGraph pipeline has several optional parameters. These are covered in detail in the next section, Customizing the BioGraph Pipeline.
