A tool for genotyping Mycobacterium tuberculosis (Mtb) isolates from whole-genome sequencing (WGS) data.
Requirements: Python 3 with numpy, pandas and scikit-allel.
Inputs: VCF files (`<sample>.vcf.gz`) in one directory, one file per sample.
Output: `lineage.csv`, which has the form
sample_id | genotype | genotype_specific_snp | .. |
---|---|---|---|
sample_1 | L2.2.M4.1 | L2.2 (24/24), L2.2.M4.1 (21/21), L2.2.Modern (6/6), L2 (6/6), L2.2.M4 (3/3) | .. |
sample_2 | L2.2.M4 | L2.2 (24/24), L2.2.1 (10/10), L2.2.Modern (6/6), L2 (6/6), L2.2.M4 (3/3) | .. |
sample_3 | L1.2.2.2 | L1.2 (56/56), L1.2.2 (53/53), L1.2.2.2 (19/19), L1 (15/15) | .. |
.. | .. | .. | .. |
The column `genotype_specific_snp` lists the counts of genotype-specific SNPs found in the sample according to the default scheme (see below). For instance, `L2.2 (24/24)` means the sample has 24 of the 24 L2.2-specific SNPs in this scheme.
The most specific sublineage level with a high proportion of genotype-specific SNPs present is returned as the predicted genotype in the column `genotype`.
Other columns (optional) hold the results from specific schemes from published studies.
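As a rough illustration of how the `genotype_specific_snp` column can be read, the sketch below parses one cell into counts and keeps the genotypes with a high proportion of SNPs present. The 0.9 cutoff and the helper names (`parse_snp_counts`, `well_supported`, `deepest`) are hypothetical and not taken from `mtbtyper.py`. Counting dots in the name is only a crude stand-in for "most specific", since names such as `L2.2.Modern` do not encode their depth in the hierarchy.

```python
import re

# Matches one "GENOTYPE (found/total)" entry, e.g. "L2.2 (24/24)".
ENTRY = re.compile(r"([^,\s]+)\s*\((\d+)/(\d+)\)")


def parse_snp_counts(cell):
    """Turn a genotype_specific_snp cell into {genotype: (found, total)}."""
    return {
        name: (int(found), int(total))
        for name, found, total in ENTRY.findall(cell)
    }


def well_supported(counts, min_fraction=0.9):
    """Keep genotypes whose found/total ratio reaches min_fraction.

    The 0.9 default is an assumed cutoff, not the tool's actual value.
    """
    return [
        genotype
        for genotype, (found, total) in counts.items()
        if total > 0 and found / total >= min_fraction
    ]


def deepest(genotypes):
    """Crude proxy for 'most specific': the name with the most dot-separated levels."""
    return max(genotypes, key=lambda genotype: genotype.count("."))


# Cell copied from the sample_3 row of the example output table.
cell = "L1.2 (56/56), L1.2.2 (53/53), L1.2.2.2 (19/19), L1 (15/15)"
counts = parse_snp_counts(cell)
prediction = deepest(well_supported(counts))
print(prediction)  # L1.2.2.2
```

For sample_3 this agrees with the reported `genotype` value. It should not be expected to reproduce the tool's prediction in general, because the real hierarchy comes from the SNP scheme rather than from the genotype names.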
Usage: `mtbtyper.py vcf_dir [options]`
option | description |
---|---|
`-o`, `--out` | output directory (default: current working directory) |
`-f`, `--fout` | output file name (default: lineage.csv) |
`-e`, `--vcf_end` | ending pattern of VCF files (default: vcf.gz) |
`--all_schemes` | add predictions from all available SNP schemes (default: false) |
`--snpdb` | path to genotyping SNP schemes (default: snpdb) |
`--quiet` | suppress screen output (default: false) |
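To make the "one file per sample" convention and the `--vcf_end` ending pattern concrete, here is a minimal sketch of collecting input files from a directory. It assumes the sample name is the file name with `.<ending>` removed; how `mtbtyper.py` actually matches files is not shown here, and `find_vcfs` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def find_vcfs(vcf_dir, vcf_end="vcf.gz"):
    """Map sample name -> path for files named <sample>.<vcf_end> in vcf_dir."""
    suffix = "." + vcf_end
    return {
        path.name[: -len(suffix)]: path
        for path in sorted(Path(vcf_dir).iterdir())
        if path.is_file() and path.name.endswith(suffix)
    }


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_1.vcf.gz", "sample_2.vcf.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    samples = sorted(find_vcfs(tmp))

print(samples)  # ['sample_1', 'sample_2']
```

Files that do not end with the pattern, such as `notes.txt` above, are skipped.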
Example: `mtbtyper.py vcf -o lineage --all_schemes`

In this example, the input VCF files are placed in a directory called `vcf`. The program writes its output to `lineage/lineage.csv`, which includes all available schemes (see below).
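Once the output file exists, it can be loaded with standard tools. The sketch below uses Python's `csv` module and assumes a comma-separated file with a header row, in which cells that contain commas (such as `genotype_specific_snp`) are quoted. That layout is an assumption and should be checked against a real `lineage.csv`. `load_genotypes` is a hypothetical helper, and an in-memory string stands in for the file.

```python
import csv
import io


def load_genotypes(handle):
    """Map sample_id -> genotype from an open lineage.csv-style file."""
    reader = csv.DictReader(handle)
    return {row["sample_id"]: row["genotype"] for row in reader}


# Stand-in for open("lineage/lineage.csv", newline="");
# the row is copied from the example output table (sample_3).
example = io.StringIO(
    "sample_id,genotype,genotype_specific_snp\n"
    'sample_3,L1.2.2.2,"L1.2 (56/56), L1.2.2 (53/53), L1.2.2.2 (19/19), L1 (15/15)"\n'
)
genotypes = load_genotypes(example)
print(genotypes)  # {'sample_3': 'L1.2.2.2'}
```

Since pandas is already a requirement of the tool, `pandas.read_csv` would work equally well under the same assumption about the file layout.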
The default scheme combines the best scheme for each group of Mtb. It contains over 130 genotypes at different levels of the classification hierarchy; see e.g. Coll et al. (2014).
genotypes | source |
---|---|
L1 sublineages | Netikul et al. (2022) |
L2 sublineages | Thawornwattana et al. (2021) |
L4.5.1 | Mokrousov et al. (2017) |
L4.5.2, L4.5.3 | Ajawatanawong et al. (2019) |
Animal-adapted lineages | Lipworth et al. (2019) |
Animal-adapted lineages + L6 | Unpublished |
L8 | Napier et al. (2020) |
Other L1-L7 lineages | Coll et al. (2014) (diagnostic SNPs) |
Other schemes are based on individual published studies. Use the `--all_schemes` flag to also output SNP counts from these schemes.
scheme | description | reference |
---|---|---|
`coll2014` | L1-L7 from a global collection | Coll et al. (2014) |
`coll2014_diag` | Diagnostic subset of `coll2014` | Coll et al. (2014) |
`coll2014_barcode` | Barcoding subset of `coll2014_diag` | Coll et al. (2014), Table S3 |
`cr1` | L1-L4 from Chiang Rai, Thailand | Ajawatanawong et al. (2019) |
`freschi2021` | L1-L4; implemented in fast-lineage-caller | Freschi et al. (2021) |
`freschi2021_hierarchical` | Same as `freschi2021` but with different genotype names | Freschi et al. (2021) |
`l1` | L1 from a globally representative collection | Netikul et al. (2022), Table S6 |
`l1_barcode` | Barcoding subset of `l1` | Netikul et al. (2022), Table S6 |
`l2` | L2 from a globally representative collection | Thawornwattana et al. (2021), Table S7 |
`l2_barcode` | Barcoding subset of `l2` | Thawornwattana et al. (2021), Table S8 |
`lipworth2019` | L1-L6 and animal-adapted strains from a UK collection; implemented in snpit | Lipworth et al. (2019) |
`merker2015` | L2 from a global collection | Merker et al. (2015), Table S8 |
`napier2020` | Revised scheme of `coll2014` | Napier et al. (2020), Table S2 |
`napier2020_barcode` | Barcoding subset of `napier2020` | Napier et al. (2020), Table S3 |
`shitikov2017` | L2 from a global collection | Shitikov et al. (2017), Table S6 |