create
The biograph create command converts reads from BAM, CRAM, or FASTQ to BioGraph format. Conversion of a 30x human dataset typically takes about an hour on a 64 core machine, and produces a BioGraph of approximately 30GB.
- --reads: Input file to process. See the Supported input read formats discussion below.
- --out: Output BioGraph. The name should end with .bg.
- --ref: The BioGraph reference directory. If your input reads are in CRAM format, this reference must match the reference FASTA used to compress the CRAM. Otherwise, you can use any BioGraph reference to create a BioGraph from reads. No reference-specific information is stored in the BioGraph, but the closer the reference matches the input reads, the faster the create process will work.
- --tmp: Specify the temporary storage location. This directory should be as large and fast as possible. See System Requirements for specific requirements. Best performance is achieved using NVMe or SATA SSD in a striped RAID configuration.
- --max-mem: Maximum system memory to use, in gigabytes. The create process uses 48GB by default, but runs faster with more memory. --max-mem 64 works well for 30X datasets, but more is usually better. See Optimizing performance for more guidance on how to choose the best value for --max-mem.
- --threads: Number of concurrent threads. Default: one thread per available CPU.
- --force: Overwrite any existing BioGraph.
- --id: Optional accession ID for this BioGraph. If unset, it is derived from the output BioGraph filename.
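The constraints on --out and --force listed above can be checked before launching a long conversion. The following is a minimal sketch and not part of BioGraph: the function name check_out and its optional "force" argument are hypothetical, and it only tests the two conditions described here (the .bg suffix and an output that already exists).

```shell
# Hypothetical pre-flight check; this is not a biograph command.
# usage: check_out OUT [force]
check_out() {
    case $1 in
        *.bg) ;;   # name ends with .bg, as recommended for --out
        *) echo "output should end with .bg: $1" >&2; return 1 ;;
    esac
    # Refuse an existing output unless the caller intends to pass --force.
    if [ -e "$1" ] && [ "${2:-}" != "force" ]; then
        echo "$1 already exists; rerun with --force to overwrite" >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the real command, for example:
(bg7)$ check_out my.bg && biograph create --reads my_reads.bam --ref my_ref/ --out my.bg --tmp /raid/tmp/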
BioGraph has native support for BAM, SAM, CRAM, and FASTQ. The FASTQ may be optionally gzip compressed, and read pairs are supported in interleaved format or as separate files.
Any number of files may be imported when running biograph create. This is useful for converting unmerged multilane FASTQ files, or in situations where reads have already been aligned and split into smaller pieces, such as by chromosome.
Simply supply the path to the BAM file to --reads:
(bg7)$ biograph create --reads my_reads.bam --ref my_ref/ --out my.bg --tmp /raid/tmp/
If your reads consist of more than one BAM or SAM file, use --reads multiple times:
(bg7)$ biograph create --ref my_ref/ --out my.bg --tmp /raid/tmp/ \
--reads chr1.bam \
--reads chr2.bam \
...
--reads chrY.bam
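When the input files follow a naming pattern, the repeated --reads options can be generated by the shell instead of typed by hand. This is a sketch under stated assumptions: biograph_create_many is a hypothetical wrapper, and the BIOGRAPH variable is an invented convenience that lets you substitute echo for a dry run. Only the --ref, --out, --tmp, and --reads options shown above are used.

```shell
# Hypothetical wrapper; not part of biograph.
# usage: biograph_create_many OUT REF TMP FILE...
# Set BIOGRAPH=echo to print the command instead of running it.
biograph_create_many() {
    [ "$#" -ge 4 ] || { echo "usage: biograph_create_many OUT REF TMP FILE..." >&2; return 2; }
    bcm_out=$1 bcm_ref=$2 bcm_tmp=$3
    shift 3
    bcm_n=$#                     # number of input files
    for bcm_f in "$@"; do        # the list is expanded once, before the loop body runs
        set -- "$@" --reads "$bcm_f"
    done
    shift "$bcm_n"               # drop the bare file names, keep the --reads pairs
    ${BIOGRAPH:-biograph} create --ref "$bcm_ref" --out "$bcm_out" --tmp "$bcm_tmp" "$@"
}
```

For example:
(bg7)$ biograph_create_many my.bg my_ref/ /raid/tmp/ chr*.bam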
For reads in CRAM format, the reference specified must match the reference originally used to encode the CRAM. This does not tie the resulting BioGraph to the specified reference; it is only used to uncompress the CRAM and for efficiency in conversion to BioGraph. The resulting BioGraph may be analyzed with respect to any reference without the need for reconversion.
All other options work the same as with BAM files:
(bg7)$ biograph create --reads my.cram --out my.bg --ref hs37d5/ --tmp /raid/tmp/
If your input reads span multiple CRAM files, specify --reads multiple times. In this case, all input CRAM files must be compressed with the same reference:
(bg7)$ biograph create --out my.bg --ref hs37d5/ --tmp /raid/tmp/ \
--reads chr1.cram \
--reads chr2.cram \
...
--reads chrX.cram
FASTQ may be optionally gzip compressed. If your input reads are in paired FASTQ format with pairs in separate files, use --pair to specify the second file:
(bg7)$ biograph create --reads reads_1.fq.gz --pair reads_2.fq.gz \
--out my.bg --ref hs37d5/ --tmp /raid/tmp/
If your reads include multiple files per pair, specify the --reads and --pair options multiple times. Ensure that the order of the reads and pairs matches their mates:
(bg7)$ biograph create \
--reads reads_a_1.fq.gz --pair reads_a_2.fq.gz \
--reads reads_b_1.fq.gz --pair reads_b_2.fq.gz \
--out my.bg --ref hs37d5/ --tmp /raid/tmp/
Or alternatively:
(bg7)$ biograph create \
--reads reads_a_1.fq.gz --reads reads_b_1.fq.gz \
--pair reads_a_2.fq.gz --pair reads_b_2.fq.gz \
--out my.bg --ref hs37d5/ --tmp /raid/tmp/
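Because the order of --reads and --pair must line up, it can help to derive each mate's file name from the first file and confirm it exists before starting. The sketch below assumes the _1.fq.gz / _2.fq.gz naming used in the examples above and file names without whitespace. The function paired_fastq_args is hypothetical and not part of BioGraph.

```shell
# Hypothetical helper; not part of biograph.
# usage: paired_fastq_args R1_FILE...
# Prints "--reads R1" and "--pair R2" tokens, one per line, in matching order.
paired_fastq_args() {
    for pfa_r1 in "$@"; do
        case $pfa_r1 in
            *_1.fq.gz) pfa_r2=${pfa_r1%_1.fq.gz}_2.fq.gz ;;   # derive the mate's name
            *) echo "unexpected file name: $pfa_r1" >&2; return 1 ;;
        esac
        [ -f "$pfa_r2" ] || { echo "missing mate: $pfa_r2" >&2; return 1; }
        printf '%s\n' --reads "$pfa_r1" --pair "$pfa_r2"
    done
}
```

The output can then be passed to the real command, which only runs if every mate was found:
(bg7)$ args=$(paired_fastq_args reads_*_1.fq.gz) && biograph create $args --out my.bg --ref hs37d5/ --tmp /raid/tmp/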
If your input reads are in interleaved FASTQ format, use the --interleaved option:
(bg7)$ biograph create --reads reads.fq.gz --out my.bg --ref hs37d5/ --interleaved
If a single FASTQ file is specified without the --interleaved option, the reads are assumed to be unpaired.
All other formats must be converted to one of the above formats before importing. For SRA format, this can be achieved on-the-fly with the sam-dump command from the SRA Toolkit. The resulting SAM or BAM file can be quite large, but it need not be saved locally; it can be streamed directly to biograph create on STDIN.
Depending on your computing architecture, it may be useful to store input reads in a location other than a local file system (such as an AWS S3 bucket, Azure blob storage, Google Cloud Storage, a URL endpoint, or some other object storage scheme). You can save time and storage space by streaming reads directly into BioGraph.
BioGraph can accept BAM, SAM, CRAM, or uncompressed FASTQ as input on STDIN by specifying --reads -. Streaming is supported for the create and full_pipeline commands. This allows for high-speed streaming from external storage without the need to save the reads to disk prior to processing:
(bg7)$ aws s3 cp s3://my-bucket/reads.bam - | biograph create --reads - ...
To stream SRA, convert it to SAM using the sam-dump -u command from the SRA Toolkit:
(bg7)$ sam-dump -u SRR3990320 | biograph create --reads - ...
When streaming from STDIN, the input format is assumed to be BAM or CRAM. To stream unpaired FASTQ, use the --format fastq option for create. For interleaved paired FASTQ (where each read is immediately followed by its mate in a single file), include the --interleaved option:
(bg7)$ aws s3 cp s3://my-bucket/unpaired_reads.fastq - | \
biograph create --reads - --format fastq ...
(bg7)$ aws s3 cp s3://my-bucket/paired_reads.fastq - | \
biograph create --reads - --format fastq --interleaved ...
In either case the reads should be uncompressed. Pass compressed reads through a decompression program as needed:
(bg7)$ aws s3 cp s3://my-bucket/reads.fq.gz - | zcat | \
biograph create --reads - --format fastq --interleaved ...
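When object names vary, the decompression step can be chosen from the file extension. This is a minimal sketch: decompressor_for is a hypothetical helper, and it uses gzip -dc (equivalent to zcat for gzip data) and bzip2 -dc, falling back to cat for input that is already uncompressed.

```shell
# Hypothetical helper; not part of biograph.
# usage: decompressor_for NAME
# Prints a command that writes the uncompressed stream to STDOUT.
decompressor_for() {
    case $1 in
        *.gz)  echo "gzip -dc" ;;
        *.bz2) echo "bzip2 -dc" ;;
        *)     echo "cat" ;;        # already uncompressed: pass through unchanged
    esac
}
```

For example:
(bg7)$ aws s3 cp s3://my-bucket/reads.fq.gz - | $(decompressor_for reads.fq.gz) | \
biograph create --reads - --format fastq --interleaved ...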
All of these options are similar for biograph full_pipeline, but they should be specified using the --create parameter. The previous example would be run through full_pipeline like this:
(bg7)$ aws s3 cp s3://my-bucket/reads.fq.gz - | zcat | \
biograph full_pipeline --reads - --create "--format fastq --interleaved" ...
It is not possible to stream non-interleaved paired FASTQ entirely on STDIN, since it is impossible to distinguish multiple files in a Unix pipe. But it is possible to stream one non-interleaved FASTQ on STDIN if its pair is available as a local file. Local gzip compressed FASTQ files are automatically uncompressed.
(bg7)$ aws s3 cp s3://my-bucket/reads_1.fq.gz - | zcat | \
biograph create --reads - --format fastq \
--pair /path/to/reads_2.fq.gz ...
The options described so far cover the vast majority of use cases. There are many more options available for biograph create that tune the create algorithm or provide additional debugging.
These options generally should not be changed unless recommended by Spiral:
- --cache: Cache as much as possible in RAM. Enabling --cache can help work around performance issues in some environments by forcing more data to stay in memory. See Optimizing Performance for details.
- --debug: Turn on verbose logging.
- --keep-tmp: Retain the temp directory after completion.
- --stats: Create stats are normally saved to the qc/create_stats.json file inside the BioGraph. Use this option to save to a different location.
- --tmp-encoding: Temporary files use fast gzip compression to minimize I/O and maximize performance. The default setting is gzip1. Changing this to null disables compression completely, which uses significantly more temporary space but can be faster on some systems. Changing it to gzip will use less temporary space, but consumes more CPU cycles.
Sequencing errors must be removed from reads before they are converted to the BioGraph format. For a discussion of the Read Correction process, see Understanding Read Correction.
You can choose a number of options to influence the kmerization and read correction steps. For most datasets, the default settings provide optimal results.
Kmerization Options:
  --min-kmer-count arg (=5)        The integer minimum kmer count. Reads with
                                   kmers less abundant than this will be
                                   corrected or dropped. (min 1)
  --kmer-size arg (=30)            The size of kmers to use for kmer
                                   generation.
  --trim-after-portion arg (=0.7)  Trim the end of reads until they pass read
                                   correction, down to a minimum of the given
                                   portion of the read length. 1 = no automatic
                                   trimming.

Read Correction Options:
  --max-corrections arg (=8)       Correct up to the specified number of bases.
  --min-good-run arg (=2)          Minimum number of good bases between
                                   corrections.
  --min-reads arg (=0.4)           Minimum fraction of reads that must survive
                                   read correction
  --warn-reads arg (=0.7)          Warn when this fraction of reads does not
                                   survive read correction
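To make the two read-survival thresholds concrete, the sketch below classifies a run from its read counts. It is an illustration only and not BioGraph's own logic: it assumes both --min-reads and --warn-reads are compared against the fraction of reads that survive correction, so with the defaults a run keeping fewer than 40% of its reads fails and one keeping fewer than 70% triggers a warning. The function survival_status is hypothetical.

```shell
# Illustration only; an assumed reading of --min-reads and --warn-reads.
# usage: survival_status TOTAL_READS SURVIVING_READS [MIN_READS] [WARN_READS]
survival_status() {
    awk -v total="$1" -v kept="$2" -v min="${3:-0.4}" -v warn="${4:-0.7}" 'BEGIN {
        if (total <= 0) { print "fail"; exit }   # no reads at all
        frac = kept / total                      # fraction of reads that survived
        if (frac < min)       print "fail"
        else if (frac < warn) print "warn"
        else                  print "ok"
    }'
}
```

For example, survival_status 1000 500 reports warn under the default thresholds, because 50% of the reads survive.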
You can see all available basic options for create by specifying --help.
(bg7)$ biograph create --help
create version 7.0.0
Usage: create [OPTIONS] --reads <file> --ref <refdir> --out <biograph>
[--pair <fastq pairs>] [...]
Convert reads to BioGraph format.
General Options:
  -h [ --help ]         Display this help message
  --help-all            Show help for advanced options
  --tmp arg             Basepath to temporary space. Defaults to a random
                        directory under /tmp/
  --out arg             Output BioGraph name (.bg)
  --ref arg             Reference directory
  --reads arg           Input file to process (fastq, bam, cram. Use -
                        for STDIN)
  --format arg (=auto)  Input file format when using STDIN (fastq, bam, cram)
  --interleaved         Input reads are interleaved (fastq only)
  --pair arg            Second input file containing read pairs (fastq only)
  --id arg              Optional accession ID for this sample
  -f [ --force ]        Overwrite existing BioGraph
  --max-mem arg         Maximum memory to use, in GiB (48)