Rob Flickenger edited this page Aug 9, 2021 · 2 revisions

The biograph create command converts reads from BAM, SAM, CRAM, or FASTQ to BioGraph format. Converting a 30X human genome typically takes about an hour on a 64-core machine and produces a BioGraph of approximately 30 GB.

Essential parameters

  • --reads: input file to process. See the Supported input read formats discussion below.
  • --out: output BioGraph. It should end with .bg.
  • --ref: the BioGraph reference directory. If your input reads are in CRAM format, this reference must match the reference FASTA used to compress the CRAM. Otherwise, you can use any BioGraph reference to create a BioGraph from reads. No reference-specific information is stored in the BioGraph, but the closer the reference matches the input reads, the faster the create process will work.
  • --tmp: Specify the temporary storage location. This directory should be as large and fast as possible. See System Requirements for specific requirements. Best performance is achieved using NVMe or SATA SSD in a striped RAID configuration.
  • --max-mem: Maximum system memory to use, in gigabytes. The create process uses 48GB by default, but runs faster with more memory. --max-mem 64 works well for 30X datasets, but more is usually better. See Optimizing performance for more guidance on how to choose the best value for --max-mem.
  • --threads: number of concurrent threads. Default: one thread per available CPU
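The essential parameters above can be combined as in the following dry-run sketch. It assumes bash, and every path and value is a placeholder rather than a recommendation. The script only prints the command it assembles, so it can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Dry-run sketch: assemble a `biograph create` command from the essential
# parameters and print it rather than running it.
# Every path and value here is a placeholder, not a recommendation.
reads="my_reads.bam"
out="my.bg"
ref="my_ref/"
tmp="/raid/tmp/"
max_mem=64   # gigabytes
threads=""   # leave empty to keep the default of one thread per CPU

# The output name should end with .bg, so warn early if it does not.
case "$out" in
  *.bg) ;;
  *) echo "warning: --out should end with .bg (got: $out)" >&2 ;;
esac

cmd=(biograph create --reads "$reads" --out "$out" --ref "$ref"
     --tmp "$tmp" --max-mem "$max_mem")

# Add --threads only when a value was chosen explicitly.
if [ -n "$threads" ]; then
  cmd+=(--threads "$threads")
fi

# Print the assembled command; swap this line for "${cmd[@]}" to execute it.
echo "${cmd[*]}"
```

Keeping the arguments in an array preserves paths that contain spaces when the command is eventually executed.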

Additional parameters

  • --force: overwrite any existing BioGraph
  • --id: optional accession ID for this BioGraph. If unset, it is derived from the output BioGraph filename.

Supported input read formats

BioGraph has native support for BAM, SAM, CRAM, and FASTQ. FASTQ files may optionally be gzip compressed, and read pairs are supported either in interleaved format or as separate files.

Any number of files may be imported when running biograph create. This is useful for converting unmerged multilane FASTQ files, or in situations where reads have already been aligned and split into smaller pieces, such as by chromosome.

BAM/SAM

Simply supply the path to the BAM file to --reads:

(bg7)$ biograph create --reads my_reads.bam --ref my_ref/ --out my.bg --tmp /raid/tmp/

If your reads consist of more than one BAM or SAM file, use --reads multiple times:

(bg7)$ biograph create --ref my_ref/ --out my.bg --tmp /raid/tmp/ \
  --reads chr1.bam \
  --reads chr2.bam \
...
  --reads chrY.bam
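When there are many input files, typing each --reads option by hand is error prone. The sketch below, which assumes bash, builds the repeated options from a list of files. The file names are placeholders, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Placeholder file names; in practice a glob such as chr*.bam could supply them.
bams=(chr1.bam chr2.bam chrY.bam)

# Collect one "--reads <file>" pair per input file.
reads_args=()
for bam in "${bams[@]}"; do
  reads_args+=(--reads "$bam")
done

cmd=(biograph create --ref my_ref/ --out my.bg --tmp /raid/tmp/ "${reads_args[@]}")

# Print the assembled command; swap this line for "${cmd[@]}" to execute it.
echo "${cmd[*]}"
```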

CRAM

For reads in CRAM format, the reference specified must match the reference originally used to encode the CRAM. This does not tie the resulting BioGraph to the specified reference; it is only used to uncompress the CRAM and for efficiency in conversion to BioGraph. The resulting BioGraph may be analyzed with respect to any reference without the need for reconversion.

All other options work the same as with BAM files:

(bg7)$ biograph create --reads my.cram --out my.bg --ref hs37d5/ --tmp /raid/tmp/

If your input reads span multiple CRAM files, specify --reads multiple times. In this case, all input CRAM files must be compressed with the same reference:

(bg7)$ biograph create --out my.bg --ref hs37d5/ --tmp /raid/tmp/ \
  --reads chr1.cram \
  --reads chr2.cram \
...
  --reads chrX.cram

FASTQ

FASTQ may be optionally gzip compressed. If your input reads are in paired FASTQ format with pairs in separate files, use --pair to specify the second file:

(bg7)$ biograph create --reads reads_1.fq.gz --pair reads_2.fq.gz \
    --out my.bg --ref hs37d5/ --tmp /raid/tmp/

If your reads include multiple files per pair, specify the --reads and --pair options multiple times. Ensure that the order of the reads and pairs matches their mates:

(bg7)$ biograph create \
    --reads reads_a_1.fq.gz --pair reads_a_2.fq.gz \
    --reads reads_b_1.fq.gz --pair reads_b_2.fq.gz \
    --out my.bg --ref hs37d5/ --tmp /raid/tmp/

Or alternatively:

(bg7)$ biograph create \
    --reads reads_a_1.fq.gz --reads reads_b_1.fq.gz \
    --pair reads_a_2.fq.gz --pair reads_b_2.fq.gz \
    --out my.bg --ref hs37d5/ --tmp /raid/tmp/
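One way to keep reads and mates in matching order is to derive each mate's file name from the first file of the pair. The sketch below assumes bash and a `_1.fq.gz` / `_2.fq.gz` naming convention, which is an assumption about your files rather than a BioGraph requirement. It prints the resulting command instead of running it.

```shell
#!/usr/bin/env bash
# First-of-pair files (placeholder names following a _1/_2 convention).
r1_files=(reads_a_1.fq.gz reads_b_1.fq.gz)

pair_args=()
for r1 in "${r1_files[@]}"; do
  # Derive the mate's name by swapping the _1 suffix for _2.
  r2="${r1%_1.fq.gz}_2.fq.gz"
  pair_args+=(--reads "$r1" --pair "$r2")
done

cmd=(biograph create "${pair_args[@]}" --out my.bg --ref hs37d5/ --tmp /raid/tmp/)

# Print the assembled command; swap this line for "${cmd[@]}" to execute it.
echo "${cmd[*]}"
```

Because each --reads is emitted together with its --pair, the two lists cannot drift out of order.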

If your input reads are in interleaved FASTQ format, use the --interleaved option:

(bg7)$ biograph create --reads reads.fq.gz --out my.bg --ref hs37d5/ --interleaved

If a single FASTQ file is specified without the --interleaved option, the reads are assumed to be unpaired.

SRA and other formats

All other formats must be converted to one of the above formats before importing. For SRA format, this can be achieved on-the-fly with the sam-dump command from the SRA Toolkit. The resulting SAM or BAM file can be quite large but it need not be saved locally; it can be streamed directly to biograph create on STDIN.

Streaming Reads on STDIN

Depending on your computing architecture, it may be useful to store input reads in a location other than a local file system (such as an AWS S3 bucket, Azure blob storage, Google Cloud Storage, a URL endpoint, or some other object storage scheme). You can save time and storage space by streaming reads directly into BioGraph.

BioGraph can accept BAM, SAM, CRAM, or uncompressed FASTQ as input on STDIN by specifying --reads -. Streaming is supported for the create and full_pipeline commands. This allows for high-speed streaming from external storage without the need to save the reads to disk prior to processing:

(bg7)$ aws s3 cp s3://my-bucket/reads.bam - | biograph create --reads - ...

To stream SRA, convert it to SAM using the sam-dump -u command from the SRA Toolkit:

(bg7)$ sam-dump -u SRR3990320 | biograph create --reads - ...

Specify the input format for streaming

When streaming from STDIN, the input format is assumed to be BAM or CRAM. To stream unpaired FASTQ, use the --format fastq option for create. For interleaved paired FASTQ (where each read and its mate appear as alternating records in a single file), also include the --interleaved option:

(bg7)$ aws s3 cp s3://my-bucket/unpaired_reads.fastq - | \
  biograph create --reads - --format fastq ...

(bg7)$ aws s3 cp s3://my-bucket/paired_reads.fastq - | \
  biograph create --reads - --format fastq --interleaved ...

In either case the reads should be uncompressed. Pass compressed reads through a decompression program as needed:

(bg7)$ aws s3 cp s3://my-bucket/reads.fq.gz - | zcat | \
  biograph create --reads - --format fastq --interleaved ...

All of these options are similar for biograph full_pipeline, but they should be specified using the --create parameter. The previous example would be run through full_pipeline like this:

(bg7)$ aws s3 cp s3://my-bucket/reads.fq.gz - | zcat | \
  biograph full_pipeline --reads - --create "--format fastq --interleaved" ...
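The quotes around the --create value matter. The following bash sketch illustrates only the shell's word splitting and does not run BioGraph. The remaining options are omitted here, as in the examples above.

```shell
#!/usr/bin/env bash
# Quoting keeps the create options together as ONE argument to --create.
quoted=(biograph full_pipeline --reads - --create "--format fastq --interleaved")

# Without quotes the shell splits them into three separate arguments.
unquoted=(biograph full_pipeline --reads - --create --format fastq --interleaved)

echo "quoted:   ${#quoted[@]} arguments, last is '${quoted[5]}'"
echo "unquoted: ${#unquoted[@]} arguments, last is '${unquoted[7]}'"
```

In the unquoted form, the word following --create is just --format, so full_pipeline would likely not receive the create options as intended.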

It is not possible to stream non-interleaved paired FASTQ entirely on STDIN, because multiple files cannot be distinguished within a single Unix pipe. However, you can stream one non-interleaved FASTQ on STDIN if its mate file is available locally. Local gzip compressed FASTQ files are uncompressed automatically:

(bg7)$ aws s3 cp s3://my-bucket/reads_1.fq.gz - | zcat | \
  biograph create --reads - --format fastq \
  --pair /path/to/reads_2.fq.gz ...
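If neither mate file is available locally, one possible workaround is to interleave the two files yourself and stream the result as interleaved FASTQ. The sketch below is not part of BioGraph. The `interleave_fastq` helper is hypothetical, requires bash for process substitution, and assumes standard four-line FASTQ records with no tab characters. It is demonstrated here on two tiny made-up files.

```shell
#!/usr/bin/env bash
# Interleave two FASTQ files record by record.
# Assumes 4-line records and no tab characters inside any line.
interleave_fastq() {
  # Inner paste: fold each 4-line record onto one tab-separated line.
  # Outer paste: put record N of file 1 next to record N of file 2.
  # tr: unfold everything back into one line per field.
  paste <(paste - - - - < "$1") <(paste - - - - < "$2") | tr '\t' '\n'
}

# Tiny made-up inputs for demonstration only.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n' > "$tmpdir/reads_1.fq"
printf '@r1/2\nGGCC\n+\nIIII\n@r2/2\nAAAA\n+\nIIII\n' > "$tmpdir/reads_2.fq"

# Prints 16 lines: r1/1, r1/2, r2/1, r2/2, each as a 4-line record.
interleave_fastq "$tmpdir/reads_1.fq" "$tmpdir/reads_2.fq"
```

In practice, the two arguments could be process substitutions wrapping the same `aws s3 cp` and `zcat` pipelines shown above, with the function's output piped to biograph create using --reads - --format fastq --interleaved. This combination has not been verified against BioGraph itself.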

Advanced options

The options described so far cover the vast majority of use cases. There are many more options available for biograph create that tune the create algorithm or provide additional debugging.

These options generally should not be changed unless recommended by Spiral:

  • --cache: Cache as much as possible in RAM. Enabling --cache can help work around performance issues in some environments by forcing more data to stay in memory. See Optimizing Performance for details.
  • --debug: Turn on verbose logging
  • --keep-tmp: Retain the temp directory after completion
  • --stats: Create stats are normally saved to the qc/create_stats.json file inside the BioGraph. Use this option to save to a different location.
  • --tmp-encoding: Temporary files use fast gzip compression to minimize I/O and maximize performance. The default setting is gzip1. Changing this to null disables compression completely, which uses significantly more temporary space but can be faster on some systems. Changing it to gzip will use less temporary space, but consumes more CPU cycles.

Kmerization and Read Correction

Sequencing errors must be removed from reads before they are converted to the BioGraph format. For a discussion of the Read Correction process, see Understanding Read Correction.

You can choose a number of options to influence the kmerization and read correction steps. For most datasets, the default settings provide optimal results.

Kmerization Options:
  --min-kmer-count arg (=5)        The integer minimum kmer count. Reads with
                                   kmers less abundant than this will be
                                   corrected or dropped. (min 1)
  --kmer-size arg (=30)            The size of kmers to use for kmer
                                   generation.
  --trim-after-portion arg (=0.7)  Trim the end of reads until they pass read
                                   correction, down to a minimum of the given
                                   portion of the read length. 1 = no automatic
                                   trimming.

Read Correction Options:
  --max-corrections arg (=8)       Correct up to the specified number of bases.
  --min-good-run arg (=2)          Minimum number of good bases between
                                   corrections.
  --min-reads arg (=0.4)           Minimum fraction of reads that must survive
                                   read correction
  --warn-reads arg (=0.7)          Warn when this fraction of reads does not
                                   survive read correction
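To make the fractional options concrete, the sketch below works through the arithmetic for the defaults. The helper names are hypothetical, the 150-base read length and the one million read count are made-up inputs, and rounding to the nearest integer is an assumption, since the exact rounding BioGraph applies is not stated here.

```shell
#!/usr/bin/env bash
# Shortest length a read may be trimmed to under --trim-after-portion.
# Rounding to the nearest integer is an assumption, not documented behavior.
min_trimmed_length() {
  awk -v len="$1" -v portion="$2" 'BEGIN { printf "%d\n", len * portion + 0.5 }'
}

# Number of reads that must survive correction under --min-reads.
min_surviving_reads() {
  awk -v total="$1" -v frac="$2" 'BEGIN { printf "%d\n", total * frac + 0.5 }'
}

min_trimmed_length 150 0.7        # prints 105
min_surviving_reads 1000000 0.4   # prints 400000
```

With the default --trim-after-portion of 0.7, a 150-base read can be trimmed to about 105 bases. With the default --min-reads of 0.4, about 400,000 of every 1,000,000 input reads must survive read correction.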

Getting Help

You can see all available basic options for create by specifying --help:

(bg7)$ biograph create --help
create version 7.0.0

Usage: create [OPTIONS] --reads <file> --ref <refdir> --out <biograph> 
  [--pair <fastq pairs>] [...]

Convert reads to BioGraph format.

General Options:
  -h [ --help ]           Display this help message
  --help-all              Show help for advanced options
  --tmp arg               Basepath to temporary space. Defaults to a random 
                          directory under /tmp/
  --out arg               Output BioGraph name (.bg)
  --ref arg               Reference directory
  --reads arg             Input file to process (fastq, bam, cram. Use - 
                          for STDIN)
  --format arg (=auto)    Input file format when using STDIN (fastq, bam, cram)
  --interleaved           Input reads are interleaved (fastq only)
  --pair arg              Second input file containing read pairs (fastq only)
  --id arg                Optional accession ID for this sample
  -f [ --force ]          Overwrite existing BioGraph

  --max-mem arg           Maximum memory to use, in GiB (48)

Customizing the BioGraph Pipeline