Skip to content

coverage

Rob Flickenger edited this page Aug 9, 2021 · 1 revision

The biograph coverage command takes a BioGraph, reference, and the list of calls from the discovery step, and produces a VCF containing more accurate read coverage for each variant.

(bg7)$ biograph coverage -b my.bg -r my_ref/ -v discovery.vcf.gz --out coverage.vcf

As with the discovery step, coverage outputs unsorted, uncompressed VCF which must be sorted and compressed for subsequent steps. Omitting the --out parameter will write the output to STDOUT for easy streaming:

(bg7)$ biograph coverage -b my.bg -r my_ref/ -v discovery.vcf.gz \
  | vcf-sort | bgzip > coverage.vcf.gz && tabix coverage.vcf.gz

When running coverage on multisample VCFs, only the sample used for the BioGraph is output. The default sample name is the sample name of the BioGraph unless otherwise specified by the --sample option.

(bg7)$ biograph coverage -b my.bg -r my_ref/ -v ajtrio.vcf.gz --sample HG002 \
  --out HG002.vcf

Essential parameters

  • --biograph or -b: the BioGraph
  • --variants or -v: the discovery VCF
  • --reference or -r: the reference (in BioGraph reference format)
  • --output or -o: the output VCF file. If not specified, the VCF will be written to STDOUT.
  • --sample or -s: select the sample column to use in multi-sample VCFs. This option can be omitted for single sample VCFs, or when the sample name of the BioGraph matches the desired sample column in the VCF.

Maximizing performance with microcontigs

Large contiguous regions of reference are automatically broken down into smaller chunks to maximize parallelization. A contig break can be made anywhere that is sufficiently free of variant calls. This behavior can be tuned using the --min-clearance and --min-contig-size options.

--min-clearance specifies the minimum number of bases between variants where a break may be made. It must be at least double the --max-insert size and will be automatically increased if the specified value is too small. Specify a value of 0 to disable microcontig generation.

--min-contig-size specifies the smallest contig that will be created. While smaller microcontigs help parallelize work, they also create some additional overhead. Too many microcontigs can lead to underutilized worker threads, as more time must be spent on managing the work queue.

The default values work well for single sample VCFs up to merged VCF populations of about 1000. Consider decreasing either parameter if working with very large populations or extremely dense regions of variation.

Using BED regions for faster coverage

The biograph coverage command will attempt to calculate coverage for every variant in the discovery VCF. To constrain coverage calculation to a specific genomic region, use the --bed option.

When running the full BioGraph pipeline, it is usually more efficient to specify a BED file for discovery. The resulting discovery VCF will only contain calls from the regions of interest, which are then processed by the coverage step.

It can be useful to include a BED for coverage when analyzing a subset of a previous discovery run. The BED file should contain the regions you wish to include:

$ cat chr1.bed
chr1    10000    248946422
(bg7)$ biograph coverage -b my.bg -r grch38/ -v discovery.vcf.gz -o chr1.vcf \
  --bed chr1.bed

Min and max insert sizes

By default, coverage will consider read pairs separated by 200 to 1000 bases. The minimum and maximum insert sizes can be set by --min-insert and --max-insert respectively.

Additional options

  • --passonly: Process only VCF entries with a filter line of PASS.
  • --max-coverage-paths: Some genomic regions (such as highly repetitive regions) are too complex to traverse using short read data. coverage will attempt to trace up to the number of paths specified here before giving up and issuing a warning. Up to 3000 possible coverage paths are traversed by default. Increase this number to attempt to try more paths. This will significantly increase the required processing time but may improve coverage calculation in repetitive regions.
  • --max-reads-per-entry: By default, every applicable read counts towards coverage. Setting this option will impose an upper bound on the number of reads counted. This can help reduce runtimes in areas with unusually high coverage (repetitive regions, mitochondria, etc.) where the count of reads beyond a minimum threshold adds no useful information.
  • --phasing and --max-coverage-alleles (EXPERIMENTAL): Set --phasing=True to enable phasing optimization. When enabled, coverage will only be calculated for up to the number of alleles specified by --max-coverage-alleles, including reference. This can help reduce long runtimes when analyzing complex repetitive regions, with a slight reduction in sensitivity.
  • --threads: Use the specified number of threads. By default, one thread is allocated per available processor.
  • --cache: Attempt to cache as much as possible in RAM. This should be used only on systems with sufficient memory when the BioGraph is located on slow storage, such as NFS.
  • --debug: Enable verbose logging.

Getting help

To see a list of all options, use the --help switch:

(bg7)$ biograph coverage --help
usage: coverage [-h] -b BG -v VCF -r REF [-R REGIONS] [-s SAMPLE] [-o OUT]
                [-d DF] [--strip-fmt] [-m MIN_INSERT] [-M MAX_INSERT]
                [--min-clearance MIN_CLEARANCE]
                [--min-contig-size MIN_CONTIG_SIZE] [-p] [--cache]
                [-t THREADS] [--max-reads-per-entry MAX_READS_PER_ENTRY]
                [--max-coverage-paths MAX_COVERAGE_PATHS] [--phasing]
                [--debug] [--max-coverage-alleles MAX_COVERAGE_ALLELES]
                [--filter-dup-align] [--ideal-insert IDEAL_INSERT]
                [--placer-max-ambig PLACER_MAX_AMBIG]

Genotype events in a VCF

VCF must have one entry per chrom:start-end. Use `bcftools norm -m+any`
By default, check the BioGraph file for the sample to annotate.
If BG sample name is not in the vcf's sample fields, `--sample` must be specified.

Outputs all VCF records and the single SAMPLE column  (i.e. other samples are removed)

optional arguments:
  -h, --help            show this help message and exit
  -b BG, --biograph BG  Merged BioGraph file containing individuals
  -v VCF, --variants VCF
                        The vcf containing all samples
  -r REF, --reference REF
                        Reference genome folder
  -R REGIONS, --regions REGIONS, --bed REGIONS
                        Bed file of regions to process
  -s SAMPLE, --sample SAMPLE
                        Sample in merged BioGraph to use (None)
  -o OUT, --output OUT  Annotated vcf file output (/dev/stdout)
  -d DF, --dataframe DF
                        Output coverage DataFrame (coverage.df)
  --strip-fmt           Strip any existing FORMAT fields
  -m MIN_INSERT, --min-insert MIN_INSERT
                        Minimum insert size to consider paired (200)
  -M MAX_INSERT, --max-insert MAX_INSERT
                        Maximum insert size to consider paired (1000)
  --min-clearance MIN_CLEARANCE
                        Find microcontig boundaries separated by at least this
                        many reference bases (0 to disable, default = 2000)
  --min-contig-size MIN_CONTIG_SIZE
                        Minimum contig size to make when computing
                        microcontigs (default = 3000)
  -p, --passonly        Only attempt to genotype PASS variants
  --cache               Attempt to cache as much as possible in RAM
  -t THREADS, --threads THREADS
                        Number of threads to use (48)
  --max-reads-per-entry MAX_READS_PER_ENTRY
                        If nonzero, maximum number of reads per seqset entry
                        to use; additional reads are ignored (0)
  --max-coverage-paths MAX_COVERAGE_PATHS
                        If nonzero, limit number of parallel coverage paths to
                        trace (3000)
  --phasing             Skip paths counterindicated by phasing information
  --debug               Verbose logging
  --max-coverage-alleles MAX_COVERAGE_ALLELES
                        Maximum number of alleles through reference to
                        calculate coverage on, including the reference path.
                        Only used when phasing is enabled.
  --filter-dup-align    Count each read only once per local area
  --ideal-insert IDEAL_INSERT
                        Place paired reads only once, optimizing for closest
                        to the given insert size.
  --placer-max-ambig PLACER_MAX_AMBIG
                        Maximum number of ambiguous choices before the pair
                        placer will discard a read
Clone this wiki locally