This is an implementation of an ensemble variant calling method. Specifically, it takes VCF files generated by various calling algorithms and merges them according to specified thresholds on variant and genotype concordance. Depending on the thresholds, the resulting VCF can range from a strict consensus among inputs to a union of all possible observations.
Test data is located in the data/ directory. The following command:
python ./consensus_tool/consensus_genotyper.py --genotype-threshold 3 --site-threshold 3 ./data/*vcf > test.output.vcf
will take the three test files in the data directory and generate a strict consensus of sites and genotypes (i.e., 3/3 files contain the variant site, and 3/3 files agree on the genotype for a sample at that site).
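To make the two thresholds concrete, here is a minimal sketch of the voting idea for a single sample. It is not the code in consensus_genotyper.py: the function name `consensus`, the dictionary layout, and the tie handling are all assumptions made for illustration.

```python
from collections import Counter

MISSING = "./."  # VCF notation for a missing diploid genotype


def consensus(calls_per_file, site_thresh, geno_thresh):
    """Merge per-file calls for one sample (hypothetical helper).

    calls_per_file: list of dicts mapping (chrom, pos) -> genotype string.
    Returns a dict mapping each retained site to its consensus genotype.
    """
    # Count how many input files report each site.
    site_counts = Counter(site for calls in calls_per_file for site in calls)

    merged = {}
    for site, n_files in sorted(site_counts.items()):
        if n_files < site_thresh:
            continue  # too few files contain this site
        # Tally the genotypes reported at this site.
        votes = Counter(calls[site] for calls in calls_per_file if site in calls)
        # Assumption: ties go to whichever genotype was seen first.
        genotype, n_votes = votes.most_common(1)[0]
        # Keep the genotype only if enough files agree on it.
        merged[site] = genotype if n_votes >= geno_thresh else MISSING
    return merged


file_a = {("chr1", 100): "0/1", ("chr1", 200): "1/1"}
file_b = {("chr1", 100): "0/1", ("chr1", 200): "1/1"}
file_c = {("chr1", 100): "0/1", ("chr1", 300): "0/1"}
inputs = [file_a, file_b, file_c]

strict = consensus(inputs, site_thresh=3, geno_thresh=3)    # 3/3 agreement
majority = consensus(inputs, site_thresh=2, geno_thresh=2)  # 2/3 agreement
union = consensus(inputs, site_thresh=1, geno_thresh=1)     # any observation
```

With these toy inputs, the strict setting keeps only the site at position 100, the majority setting adds position 200, and the union setting keeps all three sites.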
Some things to keep in mind:
- Multi-sample VCF files are currently supported, and the output will contain only samples which are found in all input files.
- Files must be sorted by physical position. This can be achieved using any VCF utility such as vcf-sort in vcftools. The caller works by iterating simultaneously across all input files until a matching variant record is found. If the input files are not all sorted in the same order, it is unlikely that any overlapping sites will be found.
- VCF files must be indexed with tabix, which in turn requires that they be compressed with bgzip.
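The sorting requirement follows from the lockstep iteration described above. The sketch below illustrates the idea with the standard library's `heapq.merge` standing in for the tool's own iteration code. The names `tagged` and `shared_sites` are hypothetical, and chromosomes are represented by an integer rank purely to keep the example short.

```python
import heapq
from itertools import groupby


def tagged(records, file_index):
    """Yield (site, file_index) pairs for one input."""
    for site in records:
        yield site, file_index


def shared_sites(inputs, site_thresh):
    """Walk all inputs in lockstep and yield sites seen in enough files.

    inputs: list of lists of (chrom_rank, pos) tuples, assumed sorted.
    """
    streams = [tagged(recs, i) for i, recs in enumerate(inputs)]
    # heapq.merge trusts that every stream is already sorted; it never looks back.
    merged = heapq.merge(*streams)
    # Identical sites from different files end up adjacent only if inputs are sorted.
    for site, group in groupby(merged, key=lambda item: item[0]):
        files = {idx for _, idx in group}
        if len(files) >= site_thresh:
            yield site, sorted(files)


a = [(1, 100), (1, 200), (2, 50)]
b = [(1, 100), (2, 50)]
c = [(1, 100), (1, 200)]

found_sorted = list(shared_sites([a, b, c], site_thresh=3))

# Same records, but file b is no longer sorted by position.
b_unsorted = [(2, 50), (1, 100)]
found_unsorted = list(shared_sites([a, b_unsorted, c], site_thresh=3))
```

In the sorted case the site at rank 1, position 100 is found in all three inputs. Once one input is out of order, its copy of that record arrives too late to be grouped with the others, so no site reaches the threshold.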
usage: consensus_genotyper.py [-h] [--site-threshold SITETHRESH]
                              [--genotype-threshold GENOTHRESH]
                              [--ignore-missing]
                              VCFS [VCFS ...]

Find sites and genotypes that agree among an arbitrary number of VCF files.

positional arguments:
  VCFS                  List of VCF files for input.

optional arguments:
  -h, --help            show this help message and exit
  --site-threshold SITETHRESH, -s SITETHRESH
                        Number of inputs which must agree for a site to be
                        included in the output.
  --genotype-threshold GENOTHRESH, -g GENOTHRESH
                        Number of inputs which must agree for a genotype to
                        be marked as non-missing.
  --ignore-missing, -m  Flag specifying how to handle missing genotypes in
                        the vote. If present, missing genotypes are excluded
                        from the genotype concordance vote unless all
                        genotypes are missing.

usage: consensus_genotyper.py [-h] VCFS [VCFS ...]

Find sites and genotypes which agree among an arbitrary number of VCF files.

positional arguments:
  VCFS        List of VCF files for input.

optional arguments:
  -h, --help  show this help message and exit
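The effect of --ignore-missing is easiest to see on a single genotype vote. The following is a rough sketch of one plausible reading of the help text, not the tool's actual implementation. The function `vote_genotype` is hypothetical, and it assumes the threshold is applied to the votes that remain after missing calls are dropped.

```python
from collections import Counter

MISSING = "./."  # VCF notation for a missing diploid genotype


def vote_genotype(calls, geno_thresh, ignore_missing=False):
    """Return a consensus genotype for one sample at one site (hypothetical)."""
    if ignore_missing:
        called = [g for g in calls if g != MISSING]
        # Missing calls drop out of the vote unless nothing else remains.
        calls = called or calls
    genotype, n_votes = Counter(calls).most_common(1)[0]
    # Report the winner only if enough inputs agree on it.
    return genotype if n_votes >= geno_thresh else MISSING


calls = ["0/1", "./.", "./."]

with_missing = vote_genotype(calls, geno_thresh=1)                          # "./." wins 2 to 1
without_missing = vote_genotype(calls, geno_thresh=1, ignore_missing=True)  # only "0/1" votes
all_missing = vote_genotype(["./.", "./."], geno_thresh=1, ignore_missing=True)
```

Under this reading, a lone called genotype can survive when the other inputs are missing, provided the genotype threshold is low enough to accept a single vote.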