Metagenomics Index Correction

This repository contains scripts used to prepare, compare and analyse metagenomic classifications using custom index databases, based on either the default NCBI or the GTDB taxonomic system. These scripts were used in the preparation of the Méric, Wick et al. (2019) manuscript entitled: Correcting index databases improves metagenomic studies.

Our custom index databases are hosted on our figshare project.

Using custom indices


We used Centrifuge (version 1.0.4) in our manuscript because it uses less RAM than alternative tools. Centrifuge indices consist of a few files named *.cf.

Please consult the Centrifuge manual for full instructions on its usage, but in brief the steps are:

  1. Download a pre-compiled index (from our figshare project).
  2. Run Centrifuge on a paired-end Illumina read set:
    • centrifuge -p 16 -x index_dir -1 *_1.fq.gz -2 *_2.fq.gz -S classifications --report-file report.tsv
    • If you need a paired-end read set to try this out, the SRS015782 faecal metagenome from the Human Microbiome Project that we used in the study is a suitable test set.
    • We found that Centrifuge took 10–45 minutes, depending on the size of the sample and the speed of the computer/cluster.
  3. Generate a Kraken-style summary report:
    • centrifuge-kreport -x index_dir classifications > kreport.txt


Kraken2 is an alternative metagenomic classification tool. It is faster than Centrifuge but may use more RAM. Its indices consist of a few *.k2d files. Please consult the Kraken2 manual for usage instructions. As part of this work, we also generated custom Kraken2 (and Kraken1) indices, which can be downloaded from our figshare project.

Scripts for analysis/index creation used in the manuscript


To run these scripts you will need:

  • Python 3.5 (or later)
  • Biopython
  • Mash (installed so it can be run from the command line)

Custom classification indices using GTDB definitions

The script generates NCBI-style taxonomy files using the GTDB definitions. It can also prepare the sequence files necessary for building a Centrifuge and/or Kraken2 database.

Usage – just make the taxonomy files: --gtdb taxonomy.tsv --nodes nodes.dmp --names names.dmp

Usage – prepare for building a Centrifuge database: --gtdb taxonomy.tsv --assemblies genomes --nodes ex.tree --names --conversion ex.conv --cat_fasta ex.fa

Usage – prepare for building a Kraken2 database: --gtdb taxonomy.tsv --assemblies genomes --nodes nodes.dmp --names names.dmp --kraken_dir kraken_genomes


  • --gtdb: taxonomy file obtained from the downloads section of the GTDB website. Required. GTDB provides separate taxonomy files for bacteria and archaea, but these can be concatenated together to make a single taxonomy file for this script.
  • --genomes: a directory containing all the reference genomes sequences in FASTA format (can be gzipped). Required if you are preparing for a Centrifuge/Kraken2 database build.
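
The conversion this script performs can be sketched as below. This is a minimal illustration of the general idea, not the repository's actual script: each GTDB taxonomy string (e.g. `d__Bacteria;p__...;s__...`) is walked rank by rank, assigning a tax ID to every name and recording its parent, which yields NCBI-style nodes.dmp/names.dmp records.

```python
# Minimal sketch (not the repository's script): build NCBI-style
# nodes.dmp/names.dmp records from GTDB taxonomy strings.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

def build_taxonomy(gtdb_lines):
    """Map each GTDB taxon name to a tax ID and record parent links."""
    nodes = {1: (1, "root")}   # tax_id -> (parent_id, rank)
    names = {1: "root"}        # tax_id -> name
    ids = {}                   # GTDB taxon string -> tax_id
    next_id = 2
    for line in gtdb_lines:
        accession, taxonomy = line.rstrip("\n").split("\t")
        parent = 1
        for rank, taxon in zip(RANKS, taxonomy.split(";")):
            if taxon not in ids:
                ids[taxon] = next_id
                nodes[next_id] = (parent, rank)
                names[next_id] = taxon[3:]   # strip the "d__" style prefix
                next_id += 1
            parent = ids[taxon]
    return nodes, names

def write_dmp(nodes, names):
    """Render records in the pipe-delimited .dmp style NCBI tools expect."""
    node_lines = [f"{t}\t|\t{p}\t|\t{r}\t|" for t, (p, r) in sorted(nodes.items())]
    name_lines = [f"{t}\t|\t{n}\t|\t\t|\tscientific name\t|" for t, n in sorted(names.items())]
    return node_lines, name_lines
```

Every genome accession then maps to the tax ID of its species-level taxon, which is what the conversion table and concatenated FASTA need.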


Taxonomic rank counts from Centrifuge output

The script breaks down read classifications from raw Centrifuge output into different taxonomic ranks.

For reads with multiple classifications to different tax IDs, this script finds the LCA of the tax IDs to give a final classification. It also categorises each read based on the taxonomic rank of its final classification (unclassified, root, domain, phylum, class, order, family, genus or species).

The NCBI taxonomy contains many additional and in-between ranks (e.g. subspecies, superfamily). If the read's final classification falls into one of these then the script will give it the rank of the first ancestor with a standard rank. E.g. subspecies -> species, superfamily -> order.

Usage: --centrifuge classifications --tree nodes.file --prefix out_prefix 1> read_set_summary 2> per_read_summary


  • --centrifuge: the output of Centrifuge in tab-delimited format
  • --tree: the taxonomy file used to build the Centrifuge index


  • --prefix: prefix of the tsv files this script will make
  • read_set_summary: a single-row table summarising the read set (to stdout)
  • per_read_summary: a one-row-per-read table describing each read (to stderr)
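
The LCA and rank-promotion logic described above can be sketched as follows. This is a minimal illustration, not the repository's script; `parents` and `ranks` are assumed dictionaries keyed by tax ID, as they would be parsed from the taxonomy file (with the root as its own parent).

```python
# Sketch (assumed data layout): parent and rank maps keyed by tax ID.
STANDARD_RANKS = {"root", "domain", "phylum", "class", "order",
                  "family", "genus", "species"}

def lineage(tax_id, parents):
    """Path from a tax ID up to the root (root is its own parent)."""
    path = [tax_id]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path

def lca(tax_ids, parents):
    """Lowest common ancestor of several tax IDs."""
    paths = [lineage(t, parents) for t in tax_ids]
    common = set(paths[0]).intersection(*paths[1:])
    # the first node on the root-ward path that all lineages share
    return next(t for t in paths[0] if t in common)

def standard_rank(tax_id, parents, ranks):
    """Promote a non-standard rank (e.g. subspecies) to the first
    ancestor with a standard rank (e.g. species)."""
    for t in lineage(tax_id, parents):
        if ranks[t] in STANDARD_RANKS:
            return ranks[t]
    return "root"
```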

Custom dereplication thresholds for GTDB assemblies

"Dereplication" refers to the threshold-based selection of representative reference genomes for phylogenetically-similar clusters, and has been used in the GTDB study to provide a reduced, less-redundant list of 28941 bacterial and archaeal reference genomes that are representative of similarity clusters on a phylogenetic tree. By default, the "dereplicated" list proposed on the downloads section of the GTDB website contains (release 86; September 2018) 27372 bacterial and 1569 archaeal genomes. As explained in the GTDB publication, the dereplication was made according to various criteria, but mostly defining two genomes as being "replicates" when their Mash distance was ≤0.05 (∼ANI of 95%).

Here, our script uses Mash-based clustering to dereplicate GTDB assemblies using a user-defined Mash threshold. This allows for a reduction in the number of assemblies (making for faster and smaller Centrifuge index databases), while still retaining a variable (user-defined) amount of genomic diversity.

Usage: --threads 32 --threshold 0.005 all_assemblies_dir derep_assemblies_dir bac_and_arc_taxonomy_r86.tsv


  • --threads: number of CPU threads
  • --threshold: Mash distance threshold. The GTDB_r86_46k index from the associated publication used 0.005.


  • all_assemblies_dir: path to the directory where all assemblies are located, typically all assemblies from GenBank/RefSeq.
  • bac_and_arc_taxonomy_r86.tsv: GTDB taxonomy for all bacterial genomes from the GTDB website


  • derep_assemblies_dir: output path
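
The clustering step can be sketched as a greedy pass: keep an assembly as a representative only if it is farther than the threshold from every representative kept so far. This is an illustration of the general idea, not the repository's script; `dist` stands in for a Mash distance lookup (e.g. parsed from `mash dist` output), and the iteration order (e.g. by assembly quality) affects which genome represents each cluster.

```python
def dereplicate(assemblies, dist, threshold):
    """Greedy dereplication: keep an assembly only if its distance to
    every representative kept so far exceeds `threshold`.
    `dist(a, b)` stands in for a Mash distance lookup."""
    representatives = []
    for asm in assemblies:
        if all(dist(asm, rep) > threshold for rep in representatives):
            representatives.append(asm)
    return representatives
```

With a threshold of 0.005, two assemblies closer than that Mash distance collapse into one representative, so the output directory holds a subset of the input assemblies.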

Counting N-heavy reads in FASTQ sequencing reads

This simple script was written because some HMP read sets are full of N-heavy reads. It takes one or more read files as input and outputs a table showing the number of reads that are mostly N.

Usage: file.fastq > counts


  • file.fastq: read file to be counted


  • counts: a tab-delimited file with the following four columns: filename, number of reads, number of N reads, percentage of N reads.
    e.g. file.fastq 44880075 44382676 98.892%
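
The counting logic can be sketched as below. This is a minimal illustration, not the actual script; the 50% cutoff for calling a read "mostly N" is an assumption.

```python
def count_n_heavy(fastq_lines, frac=0.5):
    """Count reads whose sequence is mostly N.
    A read counts as an 'N read' when more than `frac` of its bases
    are N (the script's exact cutoff may differ)."""
    total = n_reads = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:                  # sequence line of each FASTQ record
            seq = line.strip().upper()
            total += 1
            if seq and seq.count("N") / len(seq) > frac:
                n_reads += 1
    pct = 100.0 * n_reads / total if total else 0.0
    return total, n_reads, pct
```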

Finding reads unclassified in one index but not another

The script compares Centrifuge classifications between two indices and identifies reads that are unclassified using one index but not the other. To be meaningful, it must be run on the outcomes of two classifications of the same sample made with two different indices. In our analysis, this script was used to reclassify, using the nt database, reads that had remained unclassified with the GTDB_r86_46k index.

Usage: index1_centrifuge_output index2_centrifuge_output | grep -v unclassified > output


  • index1_centrifuge_output and index2_centrifuge_output: the two classification files to be compared


  • output: classification in the same format as the regular Centrifuge output (to stdout)
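
The comparison can be sketched as follows. This is an illustration, not the repository's script; it assumes Centrifuge's tab-delimited output, with the read ID in column 1 and the literal string "unclassified" in column 2 for unclassified reads.

```python
def reclassified_reads(index1_lines, index2_lines):
    """Yield index-2 classification lines for reads that were
    unclassified under index 1. Assumes Centrifuge's tab-delimited
    output format (read ID in column 1, 'unclassified' in column 2
    for unclassified reads)."""
    unclassified = {
        line.split("\t")[0]
        for line in index1_lines
        if line.split("\t")[1] == "unclassified"
    }
    for line in index2_lines:
        if line.split("\t")[0] in unclassified:
            yield line
```

Reads that remain unclassified under both indices are still emitted here, which is why the usage line above pipes the output through `grep -v unclassified`.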


License: GNU General Public License, version 3

