consensus calling tool for cox Galaxy instance.

Description:

This is an implementation of an ensemble variant calling method. Specifically, it takes VCF files generated by different variant calling algorithms and merges them according to user-specified thresholds on variant (site) and genotype concordance. Depending on those thresholds, the resulting VCF can range from a strict consensus among the inputs to a union of all observations.
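
To make the range from strict consensus to union concrete, here is a minimal sketch of a site-level vote. It is an illustrative model only, not the tool's own code: the function name `sites_passing_threshold` and the representation of a site as a `(chrom, pos, ref, alt)` tuple are assumptions made for this example.

```python
from collections import Counter


def sites_passing_threshold(callsets, site_threshold):
    """Return sites reported by at least `site_threshold` call sets.

    `callsets` is a list of sets of (chrom, pos, ref, alt) tuples, one set
    per input VCF. Hypothetical model, not the tool's implementation.
    """
    # Count how many call sets report each site.
    counts = Counter(site for sites in callsets for site in sites)
    return {site for site, n in counts.items() if n >= site_threshold}


# Three hypothetical call sets produced by different callers.
a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
b = {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")}
c = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}

strict = sites_passing_threshold([a, b, c], 3)    # every input must agree
majority = sites_passing_threshold([a, b, c], 2)  # two of three suffice
union = sites_passing_threshold([a, b, c], 1)     # any single input suffices
```

With a threshold equal to the number of inputs the result is the intersection of the call sets, and with a threshold of 1 it is their union.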

Usage:

Test data is located in the data/ directory. The following command:

python ./consensus_tool/consensus_genotyper.py --genotype-threshold 3 --site-threshold 3 ./data/*vcf > test.output.vcf

takes the three test files in the data directory and generates a strict consensus of sites and genotypes (i.e. 3/3 files contain the variant site, and 3/3 files agree on the genotype for a sample at that site).

Some things to keep in mind:

  • Multi-sample VCF files are currently supported, and the output will contain only samples which are found in all input files.
  • Files must be sorted by physical position. This can be achieved using any VCF utility such as vcf-sort in vcftools. The caller works by iterating simultaneously across all input files until a matching variant record is found. If the input files are not all sorted the same way, it is unlikely that any overlapping sites will be found.
  • VCF files must be indexed with tabix. This also requires that they first be compressed with bgzip.
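
The sorting requirement is easier to see with a small sketch of lockstep iteration over several position-sorted inputs. This is a simplified model under stated assumptions, not the tool's actual reader: sites are reduced to `(chrom, pos)` tuples, chromosome names are assumed to sort identically in every input, and the helper names `tag` and `sites_by_support` are hypothetical.

```python
import heapq
from itertools import groupby


def tag(stream, idx):
    """Attach the index of the originating input to every site."""
    for site in stream:
        yield site, idx


def sites_by_support(streams):
    """Yield (site, [input indices]) while walking sorted streams together.

    Each stream must yield (chrom, pos) tuples in the same sorted order.
    """
    # heapq.merge lazily interleaves already-sorted inputs.
    merged = heapq.merge(*(tag(s, i) for i, s in enumerate(streams)))
    # Identical sites are adjacent after merging, so groupby collects them.
    for site, group in groupby(merged, key=lambda item: item[0]):
        yield site, [idx for _, idx in group]


# Three hypothetical, correctly sorted inputs.
sorted_inputs = [
    [("chr1", 100), ("chr1", 200)],
    [("chr1", 100), ("chr1", 300)],
    [("chr1", 100), ("chr1", 200)],
]
support = list(sites_by_support(sorted_inputs))

# The second input is out of order, so chr1:100 is never matched up.
unsorted_inputs = [
    [("chr1", 100)],
    [("chr1", 300), ("chr1", 100)],
]
broken = list(sites_by_support(unsorted_inputs))
```

In the sorted case, `chr1:100` is reported with all three inputs. In the unsorted case the same site appears twice, each time supported by a single input, which is why unsorted files yield few or no overlapping sites.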

Check out the wiki for a more detailed tutorial!

Options:

usage: consensus_genotyper.py [-h] [--site-threshold SITETHRESH]
                              [--genotype-threshold GENOTHRESH]
                              [--ignore-missing]
                              VCFS [VCFS ...]

Find sites and genotypes that agree among an arbitrary number of VCF files.

positional arguments:
  VCFS                  List of VCF files for input.

optional arguments:
  -h, --help            show this help message and exit
  --site-threshold SITETHRESH, -s SITETHRESH
                        Number of inputs which must agree for a site to be
                        included in the output.
  --genotype-threshold GENOTHRESH, -g GENOTHRESH
                        Number of inputs which must agree for a genotype to be
                        marked as non-missing.
  --ignore-missing, -m  Flag specifying how to handle missing genotypes in the
                        vote. If present, missing genotypes are excluded from
                        the genotype concordance vote unless all genotypes are
                        missing.
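
As a rough illustration of how the --genotype-threshold and --ignore-missing options described above could interact, here is a minimal sketch of a per-sample genotype vote. It is an interpretation of the help text, not the tool's source: the function `consensus_genotype`, the `./.` encoding for a missing call, and the choice to compare genotype strings verbatim (no normalization of phasing or allele order) are all assumptions.

```python
from collections import Counter

MISSING = "./."  # assumed encoding of a missing genotype


def consensus_genotype(genotypes, genotype_threshold, ignore_missing=False):
    """Return the agreed genotype, or MISSING if the threshold is not met.

    `genotypes` holds one genotype string per input file for one sample at
    one site. Hypothetical model of the vote, not the tool's implementation.
    """
    votes = list(genotypes)
    if ignore_missing:
        called = [g for g in votes if g != MISSING]
        # Fall back to the missing calls only when every call is missing.
        votes = called or votes
    if not votes:
        return MISSING
    # Ties are resolved in favor of the first genotype seen (an assumption).
    genotype, count = Counter(votes).most_common(1)[0]
    return genotype if count >= genotype_threshold else MISSING


unanimous = consensus_genotype(["0/1", "0/1", "0/1"], 3)
discordant = consensus_genotype(["0/1", "0/1", "1/1"], 3)
majority = consensus_genotype(["0/1", "0/1", "1/1"], 2)
with_missing = consensus_genotype(["0/1", "./.", "./."], 1)
without_missing = consensus_genotype(["0/1", "./.", "./."], 1, ignore_missing=True)
all_missing = consensus_genotype(["./.", "./."], 1, ignore_missing=True)
```

Under these assumptions, a sample whose calls do not reach the threshold is written as missing, and --ignore-missing lets a single called genotype win over missing calls from the other inputs.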