Metagenomics Index Correction
This repository contains scripts used to prepare, compare and analyse metagenomic classifications using custom index databases, either based on default NCBI or GTDB taxonomic systems. These scripts were used in the preparation of the Méric, Wick et al. (2019) manuscript entitled: Correcting index databases improves metagenomic studies.
Our custom index databases are hosted on figshare:
Table of contents
- Using custom indices
- Scripts for analysis/index creation used in the manuscript
Using custom indices
We used Centrifuge (version 1.0.4) in our manuscript because it uses less RAM than alternative tools. Centrifuge indices consist of a few *.cf files.
Please consult the Centrifuge manual for full instructions on its usage, but in brief the steps are:
- Download a pre-compiled index (from our figshare project).
- Run Centrifuge on a paired-end Illumina read set:
centrifuge -p 16 -x index_dir -1 *_1.fq.gz -2 *_2.fq.gz \
    -S classifications --report-file report.tsv
- If you need a paired-end read set to try this out, here is the SRS015782 faecal metagenome from the Human Microbiome Project that we used in the study.
- We found that Centrifuge took 10–45 minutes, depending on the size of the sample and the speed of the computer/cluster.
- Generate a Kraken-style summary report:
centrifuge-kreport -x index_dir classifications > kreport.txt
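The resulting kreport.txt follows the standard Kraken-style report layout: six tab-separated columns (percentage of reads, reads in clade, reads assigned directly, rank code, tax ID, and an indented name, with the indentation encoding tree depth). As a minimal sketch of how such a report could be consumed downstream (the column layout described is the standard one; the example row is illustrative):

```python
def parse_kreport(lines):
    """Parse Kraken-style report lines into a list of dicts.

    Assumes the standard six tab-separated columns: percentage of reads,
    reads in clade, reads assigned directly, rank code, tax ID, and an
    indented name (two spaces of indentation per tree level).
    """
    rows = []
    for line in lines:
        pct, clade_reads, direct_reads, rank, taxid, name = line.rstrip("\n").split("\t")
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade_reads),
            "direct_reads": int(direct_reads),
            "rank": rank,
            "taxid": int(taxid),
            "depth": (len(name) - len(name.lstrip(" "))) // 2,
            "name": name.strip(),
        })
    return rows

# Illustrative report row (not taken from a real run):
example = ["  5.21\t52100\t120\tS\t1280\t        Staphylococcus aureus"]
print(parse_kreport(example)[0]["name"])  # Staphylococcus aureus
```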
Kraken2 is an alternative metagenomics classification tool. It is faster than Centrifuge but may use more RAM. Its indices consist of a few
*.k2d files. Please consult the Kraken2 manual for usage instructions. As part of this work, we generated Kraken2 (and Kraken1) custom indices too, which can be downloaded from our figshare project.
Scripts for analysis/index creation used in the manuscript
To run these scripts you will need:
Custom classification indices using GTDB definitions
The tax_from_gtdb.py script generates NCBI-style taxonomy files from the GTDB definitions. It can also prepare the sequence files necessary for building a Centrifuge and/or Kraken2 database.
Usage – just make the taxonomy files:
tax_from_gtdb.py --gtdb taxonomy.tsv --nodes nodes.dmp --names names.dmp
Usage – prepare for building a Centrifuge database:
tax_from_gtdb.py --gtdb taxonomy.tsv --assemblies genomes --nodes ex.tree --names ex.name --conversion ex.conv --cat_fasta ex.fa
Usage – prepare for building a Kraken2 database:
tax_from_gtdb.py --gtdb taxonomy.tsv --assemblies genomes --nodes nodes.dmp --names names.dmp --kraken_dir kraken_genomes
--gtdb: taxonomy file obtained from the downloads section of the GTDB website. Required. GTDB provides separate taxonomy files for bacteria and archaea, but these can be concatenated together to make a single taxonomy file for this script.
--assemblies: a directory containing all the reference genome sequences in FASTA format (can be gzipped). Required if you are preparing for a Centrifuge/Kraken2 database build.
--nodes: filepath to save the nodes file (equivalent to NCBI's nodes.dmp).
--names: filepath to save the names file (equivalent to NCBI's names.dmp).
--conversion: filepath to save the conversion table for Centrifuge.
--cat_fasta: filepath to save the concatenated reference sequence file for Centrifuge.
--kraken_dir: a directory to save the Kraken2-ready reference FASTAs.
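The --gtdb input pairs each assembly accession with a seven-rank GTDB taxonomy string (d__;p__;c__;o__;f__;g__;s__ fields separated by semicolons). A minimal sketch of parsing one such line (the accession and names in the example are illustrative, not taken from a real release file):

```python
GTDB_RANKS = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
              "f": "family", "g": "genus", "s": "species"}

def parse_gtdb_line(line):
    """Split one GTDB taxonomy TSV line (accession<TAB>taxonomy string)
    into the accession and a rank -> name dict."""
    accession, tax_string = line.rstrip("\n").split("\t")
    ranks = {}
    for field in tax_string.split(";"):
        prefix, name = field.split("__", 1)
        ranks[GTDB_RANKS[prefix]] = name
    return accession, ranks

acc, ranks = parse_gtdb_line(
    "RS_GCF_000001234.1\td__Bacteria;p__Firmicutes;c__Bacilli;"
    "o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;"
    "s__Staphylococcus aureus")
print(acc, ranks["species"])
```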
Taxonomic rank counts from Centrifuge output
The count_classifications.py script breaks down read classifications by taxonomic rank, starting from the raw classification output of Centrifuge.
For reads with multiple classifications to different tax IDs, this script will find the LCA of the tax IDs to give a final classification. It also categorises each read based on the taxonomic rank of its final classification (unclassified, root, domain, phylum, class, order, family, genus or species).
The NCBI taxonomy contains many additional and in-between ranks (e.g. subspecies, superfamily). If the read's final classification falls into one of these then the script will give it the rank of the first ancestor with a standard rank. E.g. subspecies -> species, superfamily -> order.
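The LCA and rank-promotion logic described above can be sketched as follows, using a toy parent-pointer tree (this is a simplified illustration, not the script's actual implementation):

```python
STANDARD_RANKS = {"root", "domain", "phylum", "class", "order",
                  "family", "genus", "species"}

def lca(tax_ids, parent):
    """Lowest common ancestor of tax IDs in a parent-pointer tree
    (the root points to itself)."""
    paths = []
    for t in tax_ids:
        path = [t]
        while parent[path[-1]] != path[-1]:
            path.append(parent[path[-1]])
        paths.append(list(reversed(path)))  # root-to-node path
    result = paths[0][0]
    for nodes in zip(*paths):
        if len(set(nodes)) != 1:  # paths diverge here
            break
        result = nodes[0]
    return result

def standard_rank(tax_id, parent, rank):
    """Walk up the tree until a node with a standard rank is found,
    e.g. subspecies -> species."""
    while rank[tax_id] not in STANDARD_RANKS:
        tax_id = parent[tax_id]
    return rank[tax_id]

# Toy tree: 1=root, 2=genus, 3 and 4 are species, 5 is a subspecies of 3.
parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
rank = {1: "root", 2: "genus", 3: "species", 4: "species", 5: "subspecies"}
print(lca([3, 4], parent))             # 2 (the shared genus)
print(standard_rank(5, parent, rank))  # species
```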
count_classifications.py --centrifuge classifications --tree nodes.file --prefix out_prefix 1> read_set_summary 2> per_read_summary
--centrifuge: the output of Centrifuge in tab-delimited format
--tree: the taxonomy file used to build the Centrifuge index
--prefix: prefix of the tsv files this script will make
read_set_summary: a single-row table summarising the read set (to stdout)
per_read_summary: a one-row-per-read table describing each read (to stderr)
Custom dereplication thresholds for GTDB assemblies
"Dereplication" refers to the threshold-based selection of representative reference genomes for phylogenetically similar clusters. It was used in the GTDB study to provide a reduced, less redundant list of 28941 bacterial and archaeal reference genomes, each representative of a similarity cluster on a phylogenetic tree. By default, the "dereplicated" list available from the downloads section of the GTDB website (release 86; September 2018) contains 27372 bacterial and 1569 archaeal genomes. As explained in the GTDB publication, dereplication was performed according to various criteria, but mostly by defining two genomes as "replicates" when their Mash distance was ≤0.05 (approximately an ANI of 95%).
The dereplicate_assemblies.py script uses Mash-based clustering to dereplicate GTDB assemblies at a user-defined Mash distance threshold. This reduces the number of assemblies (making for faster and smaller Centrifuge index databases) while retaining a user-defined amount of genomic diversity.
dereplicate_assemblies.py --threads 32 --threshold 0.005 all_assemblies_dir derep_assemblies_dir bac_and_arc_taxonomy_r86.tsv
--threads: number of CPU threads
--threshold: Mash distance threshold. The GTDB_r86_46k index from the associated publication used 0.005.
all_assemblies_dir: directory containing all the input assemblies, typically all assemblies from GenBank/RefSeq
bac_and_arc_taxonomy_r86.tsv: GTDB taxonomy for all bacterial and archaeal genomes, from the GTDB website
derep_assemblies_dir: output path
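The core idea of threshold-based dereplication can be sketched with a greedy pass: keep an assembly only if it is further than the threshold from every representative kept so far. This is a simplified illustration, not the script's actual clustering logic, and the toy distance table stands in for real Mash output:

```python
def dereplicate(assemblies, distance, threshold):
    """Greedy dereplication: keep an assembly only if its distance to every
    already-kept representative exceeds the threshold.

    `distance` returns a Mash-style distance in [0, 1]; in the real script
    this would come from running Mash on the genome sketches.
    """
    representatives = []
    for asm in assemblies:
        if all(distance(asm, rep) > threshold for rep in representatives):
            representatives.append(asm)
    return representatives

# Toy precomputed pairwise distances standing in for Mash results.
dist_table = {frozenset({"A", "B"}): 0.002,  # A and B are near-replicates
              frozenset({"A", "C"}): 0.12,
              frozenset({"B", "C"}): 0.11}
dist = lambda x, y: dist_table[frozenset({x, y})]
print(dereplicate(["A", "B", "C"], dist, 0.005))  # ['A', 'C']
```

A stricter (larger) threshold collapses more genomes into each cluster, which is why 0.005 retains far more assemblies than GTDB's default 0.05.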
Counting N-heavy reads in FASTQ sequencing reads
The read_set_n_count.py script was written to deal with the fact that some HMP read sets are full of N-heavy reads. It takes one or more read files as input and outputs a table showing the number and proportion of reads that are mostly Ns.
read_set_n_count.py file.fastq > counts
file.fastq: read file to be counted
counts: a tab-delimited file with the following four columns: filename, number of reads, number of N reads, percentage of N reads.
file.fastq 44880075 44382676 98.892%
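The counting itself amounts to checking the N fraction of each sequence line in the FASTQ records. A minimal sketch, assuming plain four-line FASTQ records and a hypothetical 50% N cutoff (the script's exact cutoff and gzip handling are not shown here):

```python
def count_n_heavy(fastq_lines, n_fraction=0.5):
    """Count reads whose sequence is more than `n_fraction` N bases.

    Assumes plain four-line FASTQ records; the cutoff of 0.5 is an
    illustrative assumption.
    """
    total = n_heavy = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence line of each record
            seq = line.strip().upper()
            total += 1
            if seq and seq.count("N") / len(seq) > n_fraction:
                n_heavy += 1
    return total, n_heavy

reads = ["@r1", "ACGTACGT", "+", "IIIIIIII",
         "@r2", "NNNNNNAC", "+", "IIIIIIII"]
print(count_n_heavy(reads))  # (2, 1)
```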
Finding reads unclassified in one index but not another
The find_unclassified.py script compares Centrifuge classifications between two indices and identifies reads that are unclassified with one index but not the other. For the comparison to make sense, it must be run on the outputs of two classifications of the same sample with two different indices. This script was used in our analysis to reclassify, with the
nt database, reads that had remained unclassified with the GTDB_r86_46k index.
find_unclassified.py index1_centrifuge_output index2_centrifuge_output | grep -v unclassified > output
index1_centrifuge_output, index2_centrifuge_output: the two classification files to be compared
output: classification in the same format as the regular Centrifuge output (to stdout)
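The comparison can be sketched as follows, assuming the standard tab-separated Centrifuge output (header row, then readID/seqID/taxID/... columns, with tax ID 0 for unclassified reads); the read IDs and values below are illustrative:

```python
def newly_classified(first_lines, second_lines):
    """Return lines from the second Centrifuge output for reads that were
    unclassified (tax ID 0) in the first output but classified in the second.

    Assumes the standard tab-separated Centrifuge output with a header row.
    """
    unclassified_in_first = set()
    for line in first_lines[1:]:  # skip header
        fields = line.rstrip("\n").split("\t")
        if fields[2] == "0":
            unclassified_in_first.add(fields[0])
    out = []
    for line in second_lines[1:]:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in unclassified_in_first and fields[2] != "0":
            out.append(line)
    return out

header = ("readID\tseqID\ttaxID\tscore\t2ndBestScore\t"
          "hitLength\tqueryLength\tnumMatches")
first = [header,
         "r1\tunclassified\t0\t0\t0\t0\t100\t1",
         "r2\tseqX\t1280\t500\t0\t80\t100\t1"]
second = [header,
          "r1\tseqY\t562\t400\t0\t70\t100\t1",
          "r2\tunclassified\t0\t0\t0\t0\t100\t1"]
print(newly_classified(first, second))  # only r1's second-index line
```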