-
Notifications
You must be signed in to change notification settings - Fork 10
coverage
The biograph coverage
command takes a BioGraph, reference, and the list of calls from the discovery step, and produces a VCF containing more accurate read coverage for each variant.
(bg7)$ biograph coverage -b my.bg -r my_ref/ -v discovery.vcf.gz --out coverage.vcf
As with the discovery
step, coverage
outputs unsorted, uncompressed VCF which must be sorted and compressed for subsequent steps. Omitting the --out
parameter will write the output to STDOUT for easy streaming:
(bg7)$ biograph coverage -b my.bg -r my_ref/ -v discovery.vcf.gz \
| vcf-sort | bgzip > coverage.vcf.gz && tabix coverage.vcf.gz
When running coverage on multisample VCFs, only the sample used for the BioGraph is output. The default sample name is the sample name of the BioGraph unless otherwise specified by the --sample
option.
(bg7)$ biograph coverage -b my.bg -r my_ref/ -v ajtrio.vcf.gz --sample HG002 \
--out HG002.vcf
-
--biograph
or-b
: the BioGraph -
--variants
or-v
: the discovery VCF -
--reference
or-r
: the reference (in BioGraph reference format) -
--output
or-o
: the output VCF file. If not specified, the VCF will be written to STDOUT. -
--sample
or-s
: select the sample column to use in multi-sample VCFs. This option can be omitted for single sample VCFs, or when the sample name of the BioGraph matches the desired sample column in the VCF.
Large contiguous regions of reference are automatically broken down into smaller chunks to maximize parallelization. A contig break can be made anywhere that is sufficiently free of variant calls. This behavior can be tuned using the --min-clearance
and --min-contig-size
options.
--min-clearance
specifies the minimum number of bases between variants where a break may be made. It must be at least double the --max-insert
size and will be automatically increased if the specified value is too small. Specify a value of 0 to disable microcontig generation.
--min-contig-size
specifies the smallest contig that will be created. While smaller microcontigs help parallelize work, they also create some additional overhead. Too many microcontigs can lead to underutilized worker threads, as more time must be spent on managing the work queue.
The default values work well for single sample VCFs up to merged VCF populations of about 1000. Consider decreasing either parameter if working with very large populations or extremely dense regions of variation.
The biograph coverage
command will attempt to calculate coverage for every variant in the discovery VCF. To constrain coverage calculation to a specific genomic region, use the --bed
option.
When running the full BioGraph pipeline, it is usually more efficient to specify a BED file for discovery. The resulting discovery VCF will only contain calls from the regions of interest, which are then processed by the coverage
step.
It can be useful to include a BED for coverage when analyzing a subset of a previous discovery run. The BED file should contain the regions you wish to include:
$ cat chr1.bed
chr1 10000 248946422
(bg7)$ biograph coverage -b my.bg -r grch38/ -v discovery.vcf.gz -o chr1.vcf \
--bed chr1.bed
By default, coverage
will consider read pairs separated by 200 to 1000 bases. The minimum and maximum insert sizes can be set by --min-insert
and --max-insert
respectively.
-
--passonly
: Process only VCF entries with a filter line of PASS. -
--max-coverage-paths
: Some genomic regions (such as highly repetitive regions) are too complex to traverse using short read data.coverage
will attempt to trace up to the number of paths specified here before giving up and issuing a warning. Up to 3000 possible coverage paths are traversed by default. Increase this number to attempt to try more paths. This will significantly increase the required processing time but may improve coverage calculation in repetitive regions. -
--max-reads-per-entry
: By default, every applicable read counts towards coverage. Setting this option will impose an upper bound on the number of reads counted. This can help reduce runtimes in areas with unusually high coverage (repetitive regions, mitochondria, etc.) where the count of reads beyond a minimum threshold adds no useful information. -
--phasing
and--max-coverage-alleles
(EXPERIMENTAL): Set--phasing=True
to enable phasing optimization. When enabled, coverage will only be calculated for up to the number of alleles specified by--max-coverage-alleles
, including reference. This can help reduce long runtimes when analyzing complex repetitive regions, with a slight reduction in sensitivity. -
--threads
: Use the specified number of threads. By default, one thread is allocated per available processor. -
--cache
: Attempt to cache as much as possible in RAM. This should be used only on systems with sufficient memory when the BioGraph is located on slow storage, such as NFS. -
--debug
: Enable verbose logging.
To see a list of all options, use the --help
switch:
(bg7)$ biograph coverage --help
usage: coverage [-h] -b BG -v VCF -r REF [-R REGIONS] [-s SAMPLE] [-o OUT]
[-d DF] [--strip-fmt] [-m MIN_INSERT] [-M MAX_INSERT]
[--min-clearance MIN_CLEARANCE]
[--min-contig-size MIN_CONTIG_SIZE] [-p] [--cache]
[-t THREADS] [--max-reads-per-entry MAX_READS_PER_ENTRY]
[--max-coverage-paths MAX_COVERAGE_PATHS] [--phasing]
[--debug] [--max-coverage-alleles MAX_COVERAGE_ALLELES]
[--filter-dup-align] [--ideal-insert IDEAL_INSERT]
[--placer-max-ambig PLACER_MAX_AMBIG]
Genotype events in a VCF
VCF must have one entry per chrom:start-end. Use `bcftools norm -m+any`
By default, check the BioGraph file for the sample to annotate.
If BG sample name is not in the vcf's sample fields, `--sample` must be specified.
Outputs all VCF records and the single SAMPLE column (i.e. other samples are removed)
optional arguments:
-h, --help show this help message and exit
-b BG, --biograph BG Merged BioGraph file containing individuals
-v VCF, --variants VCF
The vcf containing all samples
-r REF, --reference REF
Reference genome folder
-R REGIONS, --regions REGIONS, --bed REGIONS
Bed file of regions to process
-s SAMPLE, --sample SAMPLE
Sample in merged BioGraph to use (None)
-o OUT, --output OUT Annotated vcf file output (/dev/stdout)
-d DF, --dataframe DF
Output coverage DataFrame (coverage.df)
--strip-fmt Strip any existing FORMAT fields
-m MIN_INSERT, --min-insert MIN_INSERT
Minimum insert size to consider paired (200)
-M MAX_INSERT, --max-insert MAX_INSERT
Maximum insert size to consider paired (1000)
--min-clearance MIN_CLEARANCE
Find microcontig boundaries separated by at least this
many reference bases (0 to disable, default = 2000)
--min-contig-size MIN_CONTIG_SIZE
Minimum contig size to make when computing
microcontigs (default = 3000)
-p, --passonly Only attempt to genotype PASS variants
--cache Attempt to cache as much as possible in RAM
-t THREADS, --threads THREADS
Number of threads to use (48)
--max-reads-per-entry MAX_READS_PER_ENTRY
If nonzero, maximum number of reads per seqset entry
to use; additional reads are ignored (0)
--max-coverage-paths MAX_COVERAGE_PATHS
If nonzero, limit number of parallel coverage paths to
trace (3000)
--phasing Skip paths counterindicated by phasing information
--debug Verbose logging
--max-coverage-alleles MAX_COVERAGE_ALLELES
Maximum number of alleles through reference to
calculate coverage on, including the reference path.
Only used when phasing is enabled.
--filter-dup-align Count each read only once per local area
--ideal-insert IDEAL_INSERT
Place paired reads only once, optimizing for closest
to the given insert size.
--placer-max-ambig PLACER_MAX_AMBIG
Maximum number of ambiguous choices before the pair
placer will discard a read