HaploGrep is a tool for mtDNA haplogroup classification. We provide HaploGrep as a fast and free haplogroup classification web service and as a command-line tool. You can upload your mtDNA profiles aligned to rCRS or RSRS (beta) and receive mitochondrial haplogroups in return. FASTA, VCF and hsd input files are supported. As of August 18, 2021, HaploGrep and the updated HaploGrep2 have been cited over 920 times (Google Scholar).
Download and Install
curl -sL haplogrep.now.sh | bash
./haplogrep
If you want to use our web service, please click here.
The haplogroup classifications in Haplogrep are based on the revised tree by Dür et al, 2021, which is an update of the latest PhyloTree version 17 by van Oven, 2016 based on the work of van Oven & Kayser, 2009.
Additionally, please cite (1) Dür et al, 2021 if you use the latest Phylotree17_FU1 tree, (2) van Oven, 2016 if you use PhyloTree 17, or (3) van Oven & Kayser, 2009 if you use an older PhyloTree version.
Currently two tools are available.
- Classify assigns input profiles (hsd, fasta, VCF) to haplogroups.
- Distance calculates the distance between two haplogroups.
Run HaploGrep Classification with test data
# Download test data
wget https://github.com/seppinho/haplogrep-cmd/raw/master/test-data/vcf/HG00097.vcf.gz
# Run Haplogrep Classification
./haplogrep classify --in HG00097.vcf.gz --format vcf --out haplogroups.txt
Input File Formats
VCF or Fasta
The recommended input format is a single-sample/multi-sample VCF (*.vcf.gz or *.vcf).
For fasta input, bwa version 0.7.17 is used for alignment. For each input sequence, HaploGrep excludes positions from the tested range that are (1) not covered by the input fragment or (2) marked with an N in the sequence.
You can also specify your profiles in the original HaploGrep hsd format, which is a simple tab-delimited file format consisting of 4 columns (ID, Range, Haplogroup and Polymorphisms).
Sample1 1-16569 H100 263G 315.1C 750G 1041G 1438G 4769G 8860G 9410G 12358G 13656C 15326G 16189C 16192T 16519C
Sample2 1-16569 ? 73G 263G 315.1C 750G 1438G 3010A 3107C 4769G 5111T 8860G 10257T 12358G 15326G 16145A 16222T 16519C
For readability, the polymorphisms are also tab-delimited (so they occupy columns >= 4). An hsd example can be found here.
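If you generate or inspect hsd files programmatically, the column layout described above can be handled as in the following Python sketch. The names `HsdRecord`, `parse_hsd_line` and `format_hsd_line` are hypothetical helpers for illustration and are not part of HaploGrep.

```python
from typing import List, NamedTuple


class HsdRecord(NamedTuple):
    sample_id: str
    range_: str
    haplogroup: str
    polymorphisms: List[str]


def parse_hsd_line(line: str) -> HsdRecord:
    """Split one tab-delimited hsd line into ID, Range, Haplogroup, Polymorphisms."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-delimited columns, got {len(fields)}")
    # Columns 1-3 are fixed; every non-empty cell from column 4 on is a polymorphism.
    polymorphisms = [cell.strip() for cell in fields[3:] if cell.strip()]
    return HsdRecord(fields[0], fields[1], fields[2], polymorphisms)


def format_hsd_line(record: HsdRecord) -> str:
    """Join a record back into a single tab-delimited hsd line."""
    return "\t".join(
        [record.sample_id, record.range_, record.haplogroup, *record.polymorphisms]
    )


# Shortened version of the Sample2 line, with explicit tab characters.
line = "Sample2\t1-16569\t?\t73G\t263G\t315.1C"
record = parse_hsd_line(line)
print(record.sample_id, record.range_, record.haplogroup, record.polymorphisms)
```

Because the polymorphisms are spread over a variable number of columns, the sketch collects everything after the third column instead of expecting a fixed column count.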
||Please provide the input file name|
||Please provide the input format of your data - valid options are:
||Please provide an output name|
||By default HaploGrep expects that your data is aligned against rCRS (which is included in the human references hg19 and hg38). If your data is aligned against RSRS, add the
||To change the classification metric to Hamming Distance (
||For additional information on mtSNPs (e.g. found or remaining polymorphisms) please add the
||The used Phylotree version can be changed using the
||If you are using genotyping arrays, please add the
||Add this option to skip our rules that fix the mtDNA nomenclature for fasta import. Click here for further information. Applying the rules has been the default since v2.4.0|
||To export the best n hits for each sample add the
||Create a graph of all input samples by using the
This tool calculates the distance between two haplogroups.
||Input file must include 2 columns named "hg1" and "hg2" separated by ";"|
||Output location of distance file|
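A distance input file with the two required columns can be written with Python's standard `csv` module, as in the sketch below. The helper name `write_distance_input` and the haplogroup pairs are illustrative assumptions; only the header names "hg1" and "hg2" and the ";" separator come from the description above.

```python
import csv
import io


def write_distance_input(pairs, handle):
    """Write haplogroup pairs as a ';'-separated table with header hg1;hg2."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerow(["hg1", "hg2"])
    writer.writerows(pairs)


# Illustrative pairs; replace them with the haplogroups you want to compare.
buffer = io.StringIO()
write_distance_input([("H100", "H"), ("H1a", "U5b")], buffer)
distance_input = buffer.getvalue()
print(distance_input)
```

To create a file on disk instead of an in-memory string, pass a handle opened with `open(path, "w", newline="")`.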
mtDNA reference sequences
Several mtDNA references exist; HaploGrep supports rCRS and RSRS. Please check out our blog post to learn more about this topic.
If you are using HaploGrep for genotyping array data, please have a look at the --chip parameter above.
When using fasta as an input format, HaploGrep uses bwa mem to align the data. Because the mitochondrial phylogeny uses a 3′ alignment, a standard aligner designed for nuclear DNA often does not place indels correctly for haplogroup classification. In cases where haplogroup-defining indels are expected (e.g. missing 8281d-8289d), this can lead to a lower haplogroup quality. To adjust for that, we provide a set of currently 66 rules that can be applied prior to classification. The rules have been estimated based on 7,848 fasta files in 4 steps:
- Downloading Phylotree defining sequences from GenBank
- Aligning data with bwa mem
- Classifying the profiles using HaploGrep
- Comparing final fasta profiles with the Phylotree input profiles (remaining vs. not found) in a txt format (derived from parsing Phylotree).
For example, the subsequent rule changes input polymorphisms
309.1C 309.2C 315.1C.
Heteroplasmies (VCF only)
Heteroplasmies are often stored as heterozygous genotypes (0/1). If an AF tag (= Allele Frequency) is specified in the VCF file, we add variants with an AF > 0.90 to the input profile. Mutation Server is able to create a valid VCF including heteroplasmies starting from BAM or CRAM.
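To preview which variants would pass the AF > 0.90 threshold, a check along the following lines can help. This is a rough Python sketch and not HaploGrep's own code: it assumes that AF is stored as a per-sample FORMAT field (for example GT:AF), that only the first sample is inspected, and that the function name `variants_above_af` and the example records are made up for illustration.

```python
def variants_above_af(vcf_lines, threshold=0.90):
    """Return (pos, ref, alt, af) for the first sample where AF > threshold.

    Assumption for this sketch: AF is a single numeric FORMAT value per record.
    Records without a parsable AF (missing or multi-valued) are skipped.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 10:
            continue  # no FORMAT/sample columns
        sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
        try:
            af = float(sample["AF"])
        except (KeyError, ValueError):
            continue
        if af > threshold:
            kept.append((int(cols[1]), cols[3], cols[4], af))
    return kept


# Made-up records: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
example_vcf = [
    "##fileformat=VCFv4.2",
    "chrM\t73\t.\tA\tG\t.\tPASS\t.\tGT:AF\t0/1:0.95",
    "chrM\t263\t.\tA\tG\t.\tPASS\t.\tGT:AF\t0/1:0.90",
    "chrM\t3010\t.\tG\tA\t.\tPASS\t.\tGT:AF\t0/1:0.40",
]
kept = variants_above_af(example_vcf)
print(kept)
```

The comparison is strict, so a variant with an AF of exactly 0.90 is not kept, matching the "> 0.90" wording above.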
Please have a look at mitoverse to check for heteroplasmies and contamination in your NGS data.
Check out our blog regarding mtDNA topics.