
08. Taxonomic Classification

Krista Ternus edited this page Nov 30, 2020 · 6 revisions


Workflow Overview

This workflow performs taxonomic classification on filtered Illumina paired-end reads with Mash version 2.2, Sourmash version 2.1.0, Kaiju version 1.6.1, Kraken2 version 2.0.8-beta, Bracken version 2.2, KrakenUniq version 0.5.8, and MTSv version 1.0.6, and visualizes Kaiju results with KronaTools version 2.7. It has been tested to run offline following execution of the Read Filtering Workflow. Kaiju can also be run on assembled contigs, which are output by the Assembly Workflow.

Required Files

If you have not already done so, activate your metscale environment and perform the Offline Setup before running the taxonomic classification workflow:

[user@localhost ~]$ conda activate metscale

(metscale)[user@localhost ~]$ cd metscale/workflows

(metscale)[user@localhost workflows]$ python download_offline_files.py --workflow taxonomic_classification  

Singularity Images

In the metscale/container_images/ directory, you should see the following Singularity images that were created when running the taxonomic_classification or all flags during the Offline Setup:

| Tool | Singularity Image | Size |
| --- | --- | --- |
| Mash | mash_2.2--h3d38be6_0.sif | 44 MB |
| Sourmash | sourmash_2.1.0--py27he1b5a44_0.sif | 271 MB |
| Kaiju | kaiju_1.6.1--pl5.22.0_0.sif | 37 MB |
| Krona | krona_2.7--pl5.22.0_1.sif | 21 MB |
| Kraken2 | kraken2_2.0.8_beta--pl526h6bb024c_0.sif | 327 MB |
| Bracken | bracken_2.2--py27h2d50403_1.sif | 262 MB |
| KrakenUniq | krakenuniq_0.5.8--pl526he860b03_0.sif | 54 MB |
| MTSv | mtsv_1.0.6--py36hc9558a2_1.sif | 675 MB |
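
If you want to confirm that all of the images are present before proceeding, you can loop over the expected names from the table above. This small helper is a sketch for convenience only, not part of MetScale:

```shell
# List any expected Singularity images missing from a directory
# (image names taken from the table above). Not part of MetScale.
check_images() {
    dir=${1:-metscale/container_images}
    for img in mash_2.2--h3d38be6_0.sif sourmash_2.1.0--py27he1b5a44_0.sif \
               kaiju_1.6.1--pl5.22.0_0.sif krona_2.7--pl5.22.0_1.sif \
               kraken2_2.0.8_beta--pl526h6bb024c_0.sif bracken_2.2--py27h2d50403_1.sif \
               krakenuniq_0.5.8--pl526he860b03_0.sif mtsv_1.0.6--py36hc9558a2_1.sif; do
        [ -f "$dir/$img" ] || echo "missing: $img"
    done
}

# Example:
# check_images metscale/container_images
```

If nothing is printed, all eight images are in place.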

Databases for Taxonomic Classification

You should also have the following databases downloaded in the metscale/workflows/data directory:

| Tool | Downloaded Database | Description |
| --- | --- | --- |
| Mash | https://obj.umiacs.umd.edu/diamond/RefSeq_10K_Sketches.msh.gz | Microbial signatures from NCBI RefSeq, as used in the original Mash publication |
| Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k21-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 21 |
| Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 31 |
| Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k51-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 51 |
| Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k21-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 21 |
| Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k31-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 31 |
| Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k51-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 51 |
| Kaiju | http://kaiju.binf.ku.dk/database/kaiju_index_nr_euk.tgz | Protein sequences from NCBI nr, including bacteria, archaea, viruses, & fungi |
| Kraken2 | ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz | Kraken 2 database comprised of RefSeq bacterial, archaeal, and viral genomes, as well as the GRCh38 human genome |
| Bracken | ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz | Contains three Bracken database files within this sub-directory - database{100, 150, 200}mers.kmer_distrib |
| KrakenUniq | https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_8GB.tgz | Kraken 1 database comprised of RefSeq bacterial, archaeal, and viral genomes |
| MTSv | https://rcdata.nau.edu/fofanov_lab/Compressed_MTSV_database_files/complete_genome.tar.gz | Directory and sub-directory that contains the complete reference database files for MTSv |

| Tool | Downloaded Database | Size | MD5 Checksum |
| --- | --- | --- | --- |
| Mash | RefSeq_10K_Sketches.msh.gz | 6.3 GB | 01de4f52d6fe509e146fdeacee0fb8d1 * |
| Sourmash | microbe-genbank-sbt-k21-2017.05.09.tar.gz | 4.2 GB | d9ed08309ff56a558dcabf4ef1af2c4e ** |
| Sourmash | microbe-genbank-sbt-k31-2017.05.09.tar.gz | 4.2 GB | 759b18b7dcb4f42ff04592d94db3685e ** |
| Sourmash | microbe-genbank-sbt-k51-2017.05.09.tar.gz | 4.2 GB | 2dbe310b7f5a396e76ceba155ff5d424 ** |
| Sourmash | microbe-refseq-sbt-k21-2017.05.09.tar.gz | 3.6 GB | bf594377fe644a741a7cbd78e0b791ac ** |
| Sourmash | microbe-refseq-sbt-k31-2017.05.09.tar.gz | 3.6 GB | 8bec1b006779fe4100a58dcf23c9d92f ** |
| Sourmash | microbe-refseq-sbt-k51-2017.05.09.tar.gz | 3.6 GB | eb4e5e1210403baab4f0fbe86ecc8917 ** |
| Kaiju | kaiju_index_nr_euk.tgz | 28 GB | 89d4dab3c6fd23432d99357e0c4687de * |
| KrakenUniq | minikraken_20171019_8GB.tgz | 5.7 GB | 5d626afb34d87925990889d27e411c29 * |
| Kraken2 and Bracken | minikraken_8GB_202003.tgz | 5.6 GB | 51f9cd572b0630fdaf4b5595c672b5b9 * |
| MTSv | complete_genome.tar.gz | 39 GB | a2dcdbb77ab38e4b474eefb6dddadab1 * |

*Uncompressed during offline setup

**Uncompressed during tool execution

If you are missing any of these files, or if you suspect any of these databases are corrupted (e.g., by interruptions during the offline download), you should re-run the appropriate setup command, per the instructions in the Offline Setup. Downloading these databases can take a while, and an interrupted download is a common reason for the taxonomic classification workflow not running properly. You can check whether the MD5 checksums of your downloaded databases match the expected MD5 checksums in the table above. Here's an example of how to run the md5sum command:

$ cd metscale/workflows/data
$ md5sum kaiju_index_nr_euk.tgz
89d4dab3c6fd23432d99357e0c4687de  kaiju_index_nr_euk.tgz
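
To make repeated checks less error-prone, the comparison can be wrapped in a small helper. This is a convenience sketch, not part of MetScale; the file name and checksum in the usage comment come from the table above:

```shell
# check_md5 FILE EXPECTED_MD5 -- print OK/MISMATCH, return nonzero on mismatch.
# A small convenience wrapper (not part of MetScale).
check_md5() {
    actual=$(md5sum "$1" | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (expected $2, got $actual)" >&2
        return 1
    fi
}

# Example, using the expected checksum from the table above:
# cd metscale/workflows/data
# check_md5 kaiju_index_nr_euk.tgz 89d4dab3c6fd23432d99357e0c4687de
```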

If the databases were uncompressed and used in a previous workflow, they will subsequently appear as follows in the metscale/workflows/data directory (note that some of these are hidden directories):

| Tool | Uncompressed Database | Size | MD5 Checksum |
| --- | --- | --- | --- |
| Mash | RefSeq_10K_Sketches.msh | 6.8 GB | f89c78f3dfbed846dc48727642a97c3a |
| Sourmash | .sbt.genbank-k21/ | 18 GB | uncompressed hidden directory |
| Sourmash | .sbt.genbank-k31/ | 18 GB | uncompressed hidden directory |
| Sourmash | .sbt.genbank-k51/ | 18 GB | uncompressed hidden directory |
| Sourmash | genbank-k21.sbt.json | 27 MB | b22f0e51672019c04423d8afb03175a0 |
| Sourmash | genbank-k31.sbt.json | 27 MB | 62c475f0f8357a7219f15f466814e9d3 |
| Sourmash | genbank-k51.sbt.json | 27 MB | 690f9de5f7100f2ecae3ef6aab0a39e0 |
| Sourmash | .sbt.refseq-k21/ | 15 GB | uncompressed hidden directory |
| Sourmash | .sbt.refseq-k31/ | 15 GB | uncompressed hidden directory |
| Sourmash | .sbt.refseq-k51/ | 15 GB | uncompressed hidden directory |
| Sourmash | refseq-k21.sbt.json | 25 MB | 8d33669b8618cb3e76e5cd8e48a814c1 |
| Sourmash | refseq-k31.sbt.json | 25 MB | 5347dc7fca3779ae47f2fcfd4f50b15c |
| Sourmash | refseq-k51.sbt.json | 25 MB | 46a5638f5fa07afe4811a58e9f7c7ded |
| Kaiju | kaiju_db_nr_euk.fmi | 48 GB | f9e30d8a03925a07ff19c32bcf05df3d |
| Kaiju | names.dmp | 134 MB | 2e3e40199d42a529802a02f118fb033b |
| Kaiju | nodes.dmp | 104 MB | 8c4e4d3acdec1a0a837ab71d53c06828 |
| Kraken2 | minikraken_8GB_20200312/hash.k2d | 7.5 GB | 1e7612c690b728ecefecb8d4a182f58c |
| Kraken2 | minikraken_8GB_20200312/opts.k2d | 56 B | 0107fefa85081a8ad341e75c8ee300b2 |
| Kraken2 | minikraken_8GB_20200312/seqid2taxid.map | 2.8 MB | 4be51cd7214ff22bbd6339821b08b06b |
| Kraken2 | minikraken_8GB_20200312/taxo.k2d | 2.2 MB | b99be394ad419e64c8c55aea14b238f7 |
| Bracken | minikraken_8GB_20200312/database50mers.kmer_distrib | 2.9 MB | 5a46e3b8e4585d5f58f2f10879d28272 |
| Bracken | minikraken_8GB_20200312/database100mers.kmer_distrib | 2.9 MB | 3a737368f629627995026dc1b40711b1 |
| Bracken | minikraken_8GB_20200312/database150mers.kmer_distrib | 2.6 MB | d1eb27e527b36469cd19aef62d491ebe |
| Bracken | minikraken_8GB_20200312/database200mers.kmer_distrib | 2.4 MB | 38b3320b6701a11ab79f88ea3ea01b77 |
| Bracken | minikraken_8GB_20200312/database250mers.kmer_distrib | 2.2 MB | 603413ad165b860e5fe5af954402268a |
| KrakenUniq | minikraken_20171019_8GB/database.idx | 513 MB | 3e7cf2135ef4945b92ed02dad96c8f65 |
| KrakenUniq | minikraken_20171019_8GB/database.kdb | 7.6 GB | 7ab07372f6fdb7b78322b7802f72d6bc |
| KrakenUniq | minikraken_20171019_8GB/database.kdb.counts | 178 KB | c9791458a3bbe97cc8db0b22016894d1 |
| KrakenUniq | minikraken_20171019_8GB/taxDB | 77 MB | 90132b2ad1e02ca4e8fc50cdab905c67 |
| KrakenUniq | minikraken_20171019_8GB/taxonomy/names.dmp | 139 MB | 107ed68d9a24fc47f2e46d4c53206ca8 |
| KrakenUniq | minikraken_20171019_8GB/taxonomy/nodes.dmp | 108 MB | dc6023b70a83f5f01eb2a2dd9d883a07 |
| MTSv | Oct-28-2019/artifacts/complete_genome.fas | 121 GB | aad15187a99c26fc234725932f411574 |
| MTSv | Oct-28-2019/artifacts/complete_genome_ff.txt | 5.8 MB | 6b515496a94240ce27b343b03b8dd66b |
| MTSv | Oct-28-2019/artifacts/complete_genome.json | 17 KB | variable |
| MTSv | Oct-28-2019/artifacts/complete_genome.p | 157 MB | 7ee3e27583f1d027a36ef92a723ed91f |
| MTSv | Oct-28-2019/artifacts/nucl_gb.accession2taxid.gz | 1.8 GB | 93dcf9204dfe383aca117f4b978c4ed9 |
| MTSv | Oct-28-2019/artifacts/nucl_wgs.accession2taxid.gz | 3.2 GB | d954bf090d05eb39778b8dd18364762f |
| MTSv | Oct-28-2019/artifacts/pdb.accession2taxid.gz | 3.2 MB | 64de81d3ad2cc772b8978a85634deefc |
| MTSv | Oct-28-2019/artifacts/prot.accession2taxid.gz | 5.4 MB | 7d197b60ca8408b860916be7c13f3380 |
| MTSv | Oct-28-2019/artifacts/taxdump.tar.gz | 48 MB | c9c83e0f08c66d9f2e7eee65194bb519 |
| MTSv | Oct-28-2019/artifacts/tree.index | 133 MB | b8779584df7c78947ae79c08b9d60aa0 |

Input Files

The taxonomic classification workflow uses input files generated from prior workflows in the MetScale pipeline. The input files for each of the taxonomic classification tools are described below, and all of these files should be located in the metscale/workflows/data directory.

Sourmash Gather Input Files

If the Comparison Workflow was previously run, the taxonomic classification workflow will use the previously generated MinHash signatures as input files for sourmash gather. These are the files that you should see in the metscale/workflows/data directory after generating signatures for the example SRR606249_subset10 dataset in the Comparison Workflow:

| File Name | File Size | Signature Source |
| --- | --- | --- |
| SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig | 1.1 MB | Signatures from conservatively filtered reads (trim2) |
| SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig | 883 KB | Signatures from aggressively filtered reads (trim30) |

Note that running the Comparison Workflow before taxonomic classification is not required. If the comparison workflow was not run first, the signatures will be generated from the filtered reads when the taxonomic classification workflow is executed. Each signature file includes three k-mer lengths (k=21, k=31, and k=51), corresponding to the pre-computed databases used with sourmash gather in the taxonomic classification workflow.

Kaiju, Kraken2, Bracken, and KrakenUniq Input Files

Below are the input files for the lowest common ancestor (LCA) read-based assignment tools. You should see these in the metscale/workflows/data directory after running the example SRR606249_subset10 dataset through the read filtering workflow:

| File Name | File Size |
| --- | --- |
| SRR606249_subset10_1_reads_trim2_1.fq.gz | 381 MB |
| SRR606249_subset10_1_reads_trim2_2.fq.gz | 374 MB |
| SRR606249_subset10_1_reads_trim30_1.fq.gz | 365 MB |
| SRR606249_subset10_1_reads_trim30_2.fq.gz | 359 MB |
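
As a quick sanity check on these inputs, the number of reads in a gzipped FASTQ can be derived from its line count (four lines per record). This helper is a sketch for convenience only, not part of MetScale:

```shell
# count_reads FILE.fq.gz -- number of records in a gzipped FASTQ
# (FASTQ stores 4 lines per read). Not part of MetScale.
count_reads() {
    echo $(( $(zcat "$1" | wc -l) / 4 ))
}

# Example:
# count_reads SRR606249_subset10_1_reads_trim2_1.fq.gz
```

Paired files (the _1 and _2 files for the same trim level) should report the same read count.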

Mash and MTSv Input Files

If the complete Read Filtering Workflow was previously run, the taxonomic classification workflow will use the previously generated trimmed, interleaved, and zipped file(s) as inputs for Mash dist, Mash screen, and MTSv. You should see these interleaved files in the metscale/workflows/data directory after generating them for the example SRR606249_subset10 dataset in the Read Filtering Workflow:

| File Name | File Size | Description |
| --- | --- | --- |
| SRR606249_subset10_1_reads_trim2_interleaved_reads.fq.gz | 688 MB | Interleaved and zipped file from conservatively filtered reads (trim2) |
| SRR606249_subset10_1_reads_trim30_interleaved_reads.fq.gz | 579 MB | Interleaved and zipped file from aggressively filtered reads (trim30) |

Note that running the Read Filtering Workflow before taxonomic classification is not required. If the full read filtering workflow was not run first, the interleaved files will be generated when the taxonomic classification workflow is executed. If you already have interleaved, zipped reads, you can also enter the workflow directly with those files (see Different Workflow Entry Points for how to timestamp files when entering the workflow with your interleaved *.fq.gz files).

Workflow Execution

Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.

After the config file is ready, be sure to specify the Singularity bind path from the metscale/workflows directory before running the taxonomic classification workflow.

cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"  

Execution of the taxonomic classification workflow can then be performed with the following command:

snakemake --use-singularity {rules} {other options}

The following rules are available for execution in the taxonomic classification workflow. These rules and their parameters are listed under "workflows" in the metscale/workflows/config/default_workflowconfig.settings default config file:

| Tool | Rule | Description |
| --- | --- | --- |
| Mash | tax_class_mash_screen_workflow | Mash screens a database to estimate how well each genome is contained in a metagenome |
| Mash | tax_class_mash_dist_workflow | Mash estimates the distance between two isolate genomes to assess similarity |
| Sourmash | tax_class_signatures_workflow | Sourmash computes the MinHash signatures of filtered reads |
| Sourmash | tax_class_gather_workflow | Sourmash gather finds the best match for each genome within a metagenome |
| Kaiju | tax_class_kaijureport_workflow | With trimmed reads as inputs, Kaiju performs taxonomic classifications and reports all organisms identified |
| Kaiju | tax_class_kaijureport_contigs_workflow | With contigs as inputs, Kaiju performs taxonomic classifications and reports all organisms identified |
| Kraken2 | tax_class_kraken2_workflow | Executes Kraken2 on trimmed reads for taxonomic classification |
| Bracken | tax_class_bracken_workflow | Executes Bracken for re-estimation of Kraken2 taxonomic classifications at specific taxon level(s) |
| KrakenUniq | tax_class_krakenuniq_workflow | Executes KrakenUniq on trimmed reads for taxonomic classification |
| MTSv | tax_class_create_mtsv_db_workflow | Installs and configures the downloaded MTSv database. This only needs to be executed the first time you run MTSv |
| MTSv | tax_class_run_mtsv_workflow | Copies over the MTSv .cfg file, updates it with data specified in workflow parameters, and runs MTSv with the parameters given to it. This command runs through MTSv init and analyses, which include MTSv readprep, binning, analysis, reporting, and summary |

Mash Execution

This command will execute mash screen on interleaved, filtered reads from metagenomes:

snakemake --use-singularity tax_class_mash_screen_workflow

This command will execute mash dist on interleaved, filtered reads from isolate genomes:

snakemake --use-singularity tax_class_mash_dist_workflow

If interleaved files have not been generated before running Mash, they will be generated by the read filtering workflow when these commands are run.

Sourmash Execution

NOTE: The tax_class_signatures_workflow rule calls the comparison workflow's snakefile, which generates the MinHash signatures. If the comparison workflow was already run, snakemake will recognize that the signature file already exists and will not re-execute. Because of this, if you wish to change parameters for the tax_class_signatures_workflow rule, you must also update those parameters in the comparison workflow to ensure that the intended signature file is generated.

To execute sourmash, you can run all rules independently or in tandem by specifying multiple sourmash rules in one run. The following command will run all of the available rules for sourmash:

snakemake --use-singularity tax_class_signatures_workflow tax_class_gather_workflow

Kraken2 Execution

The following command will run Kraken2 on your trimmed read files:

snakemake --use-singularity tax_class_kraken2_workflow

Bracken Execution

The following command will run Bracken on your Kraken2 output files. Because Bracken depends on Kraken2 output, you can also call this rule by itself as a terminal rule that runs both Kraken2 and Bracken:

snakemake --use-singularity tax_class_bracken_workflow

KrakenUniq Execution

The following command will run KrakenUniq on your trimmed read files:

snakemake --use-singularity tax_class_krakenuniq_workflow

Kaiju Execution

The following command will run Kaiju on trimmed reads:

snakemake --use-singularity tax_class_kaijureport_workflow

The following command will run kaiju on assembled contigs, which may be helpful in improving accuracy for this amino acid-based classifier:

snakemake --use-singularity tax_class_kaijureport_contigs_workflow

If users are interested in exploring how Kaiju's performance could be improved with different parameter and filtering options, a number of additional rules are available for Kaiju:

| Additional Kaiju Rule | Description |
| --- | --- |
| tax_class_kaijureport_filtered_workflow | With filtered reads as inputs, Kaiju summarizes genera with >=1% of the total reads |
| tax_class_kaijureport_filtered_contigs_workflow | With contigs as inputs, Kaiju summarizes genera with >=1% of the total reads |
| tax_class_kaijureport_filteredclass_workflow | Kaiju summarizes genera with >=1% of all classified reads |
| tax_class_kaijureport_filteredclass_contigs_workflow | With contigs as inputs, Kaiju summarizes genera with >=1% of all classified reads |
| tax_class_add_taxonnames_workflow | Kaiju adds taxon names to its report |
| tax_class_add_taxonnames_contigs_workflow | Kaiju adds taxon names to its report generated from contig file(s) |
| tax_class_kaiju_species_summary_workflow | Kaiju provides a species-level summary of the organisms identified |
| tax_class_kaiju_species_summary_contigs_workflow | Kaiju provides a species-level summary of the organisms identified from contig file(s) |
| tax_class_visualize_krona_kaijureport_workflow | Krona plots are generated from all Kaiju genus-level results |
| tax_class_visualize_krona_kaijureport_contigs_workflow | Krona plots are generated from all Kaiju genus-level results generated from contig file(s) |
| tax_class_visualize_krona_kaijureport_filtered_workflow | Krona plots are generated from Kaiju genera with >=1% of the total reads |
| tax_class_visualize_krona_kaijureport_filtered_contigs_workflow | Krona plots are generated from Kaiju genera with >=1% of the total reads generated from contig file(s) |
| tax_class_visualize_krona_kaijureport_filteredclass_workflow | Krona plots are generated from Kaiju genera with >=1% of all classified reads |
| tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow | Krona plots are generated from Kaiju genera with >=1% of all classified reads generated from contig file(s) |
| tax_class_visualize_krona_species_summary_workflow | Krona plots are generated from the Kaiju species summary report |
| tax_class_visualize_krona_species_summary_contigs_workflow | Krona plots are generated from the Kaiju species summary report generated from contig file(s) |

The following command will execute all available rules for Kaiju on trimmed reads:

snakemake --use-singularity tax_class_add_taxonnames_workflow tax_class_kaiju_species_summary_workflow tax_class_visualize_krona_kaijureport_filtered_workflow tax_class_visualize_krona_kaijureport_filteredclass_workflow tax_class_visualize_krona_species_summary_workflow

The following command will execute all available rules for Kaiju and Krona, but using contigs as inputs instead of reads:

snakemake --use-singularity tax_class_add_taxonnames_contigs_workflow tax_class_visualize_krona_kaijureport_contigs_workflow tax_class_visualize_krona_kaijureport_filtered_contigs_workflow tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow tax_class_visualize_krona_species_summary_contigs_workflow

MTSv Execution

To execute MTSv, you can run both rules independently or in tandem by specifying multiple MTSv rules in one run. The following command will run all of the available rules for MTSv:

snakemake --use-singularity tax_class_create_mtsv_db_workflow tax_class_run_mtsv_workflow

NOTE: The tax_class_create_mtsv_db_workflow only needs to be executed the first time MTSv is run, and after that all analyses can be run by executing tax_class_run_mtsv_workflow.

Additional options for snakemake can be found in the snakemake documentation.

To change or specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture.

Output

After successful execution of the taxonomic classification workflow, you will find all of its outputs in the metscale/workflows/data/ directory. You should see the files below after running the example dataset with each of the taxonomic classification rules.

Mash Output

| Mash Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_mash_screen_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_screen.tab | List of genomes contained within the metagenome |
| tax_class_mash_screen_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_screen.sorted.tab | List of genomes contained within the metagenome, sorted to show best hits at the top |
| tax_class_mash_dist_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_distances.tab | List of genomes matching to the query |
| tax_class_mash_dist_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_distances.sorted.tab | List of genomes matching to the query, sorted to show best hits at the top |

Sourmash Gather Output

| Sourmash Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_signatures_workflow | {sample}_1_reads_trim{quality threshold}_scaled{scaled value}.k{k values}.sig | MinHash sketches of filtered reads, generated and saved in a signature file by sourmash compute |
| tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_unassigned.csv | Unassigned hashes in a single signature generated by sourmash gather |
| tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_matches.csv | Signatures of matches generated by sourmash gather |
| tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_output.csv | Taxonomic classifications generated by sourmash gather |

Although a few different files are generated by sourmash gather, only the *gather_output.csv file is used for human interpretation of the results. The following columns are expected in the primary sourmash gather output file, *gather_output.csv (see the sourmash documentation for more information):

| Name | Description |
| --- | --- |
| intersect_bp | Approximate number of base pairs shared between the query metagenome and the matched reference genome |
| f_orig_query | Fraction of the original query metagenome shared with the reference genome match |
| f_match | Fraction of the reference genome match shared with the query metagenome |
| f_unique_to_query | Fraction of the query metagenome that uniquely matches to the displayed reference genome. Note that matches are subtracted from the query metagenome as they are found, so f_unique_to_query is iteratively calculated on the remaining metagenome after the prior matches are subtracted. Moving from the top row to the bottom row, the matches are reported in the order that they were identified in *gather_output.csv |
| f_unique_weighted | f_unique_to_query weighted by abundance |
| average_abund | Average abundance of the match |
| name | Name of the match in the database |
| filename | Name of the database Sequence Bloom Tree (SBT) used |
| md5 | A checksum value to verify data integrity |
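
To skim these columns at the command line, the CSV can be aligned with `column -t`. This is a sketch only: the file name below is illustrative, and fields whose values contain embedded commas (as some match names do) will mis-align:

```shell
# Align the comma-separated gather output into readable columns for review.
# Caveat: match names containing commas will mis-align; file name is illustrative.
f=SRR606249_subset10_1_reads_trim2_k51.gather_output.csv
if [ -f "$f" ]; then
    column -s, -t "$f" | head -n 20
fi
```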

Kraken2 Output

| Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_classified_{database}_confidence{threshold}_{read_direction}.fq | Classified reads from Kraken2 output |
| tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_unclassified_{database}_confidence{threshold}_{read_direction}.fq | Unclassified reads from Kraken2 output |
| tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{database}_confidence{threshold}.report | Report to summarize all Kraken2 read-based taxon classifications |
| tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{database}_confidence{threshold}.out | File generated to describe each of the Kraken2 classified (C) and unclassified (U) reads |

Bracken Output

| Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_bracken_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{kraken2 database}_confidence{threshold}_bracken.report | TSV report of Kraken2 results that are input into Bracken for re-classification |
| tax_class_bracken_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{kraken2 database}_confidence{threshold}_bracken_db-{bracken database}_r-{read_length}_l-{level}_t-{threshold} | TSV file of Bracken re-classified reads from Kraken2 at a specific taxon level |

KrakenUniq Output

| Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_classified | FASTA file of all classified reads identified by KrakenUniq |
| tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_unclassified | FASTA file of all unclassified reads identified by KrakenUniq |
| tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_report | Report to summarize all KrakenUniq read-based taxon classifications |
| tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_out | Text file generated to describe all KrakenUniq classified (C) and unclassified (U) reads |

Kaiju Output

| Kaiju Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_kaijureport_workflow or tax_class_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju.out | Standard Kaiju output file |
| tax_class_kaijureport_workflow or tax_class_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus.summary | Genus-level summary of all Kaiju results |
| tax_class_kaijureport_filtered_workflow or tax_class_kaijureport_filtered_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_filtered1_total.summary | Kaiju genus-level summary with no genera <1% of total reads |
| tax_class_kaijureport_filteredclass_workflow or tax_class_kaijureport_filteredclass_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_filtered1_classified.summary | Kaiju genus-level summary with no genera <1% of all classified reads |
| tax_class_add_taxonnames_workflow or tax_class_add_taxonnames_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_names.out | Kaiju output file with taxon names appended to it |
| tax_class_kaiju_species_summary_workflow or tax_class_kaiju_species_summary_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_out_species.summary | Species-level summary of all Kaiju results |

The following columns are expected in the primary Kaiju output file, *.out (see the Kaiju documentation for more information):

| Name | Description |
| --- | --- |
| Column 1 | Either C or U, indicating whether the read is classified or unclassified |
| Column 2 | Name of the read |
| Column 3 | NCBI taxon identifier of the assigned taxon |
| Column 4 | The length or score of the best match used for classification |
| Column 5 | The taxon identifiers of all database sequences with the best match |
| Column 6 | The accession numbers of all database sequences with the best match |
| Column 7 | Matching fragment sequence(s) |
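
For a quick look at classification rates, the C/U flags in column 1 can be tallied with awk. The sketch below generates three illustrative sample lines rather than reading real Kaiju output; point the awk command at your own *.kaiju.out file instead:

```shell
# Tally classified (C) vs. unclassified (U) reads from the first column of a
# Kaiju *.out file. The three printf lines are illustrative sample data only.
printf 'C\tread1\t562\nU\tread2\t0\nC\tread3\t1280\n' > example.kaiju.out
awk '{n[$1]++} END {printf "classified: %d\nunclassified: %d\n", n["C"]+0, n["U"]+0}' example.kaiju.out
# prints:
# classified: 2
# unclassified: 1
```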

Krona Output

Three rules are available to generate Krona plots from the genus-level summaries of the Kaiju results, and one rule generates a Krona plot from the species-level summary.

| Kaiju/Krona Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_visualize_krona_kaijureport_workflow or tax_class_visualize_krona_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona.html | Krona plot of all Kaiju genus-level results |
| tax_class_visualize_krona_kaijureport_filtered_workflow or tax_class_visualize_krona_kaijureport_filtered_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona_filtered1_total.html | Krona plot of genus-level Kaiju results after removing genera with <1% of total reads |
| tax_class_visualize_krona_kaijureport_filteredclass_workflow or tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona_filtered1_classified.html | Krona plot of genus-level Kaiju results after removing genera with <1% of all classified reads |
| tax_class_visualize_krona_species_summary_workflow or tax_class_visualize_krona_species_summary_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_species_krona.html | Krona plot of all species-level Kaiju results |

MTSv Output

| Rule | Output File | Output Description |
| --- | --- | --- |
| tax_class_create_mtsv_db_workflow | Oct-28-2019/ | Unzips and formats the MTSv database in this subdirectory |
| tax_class_run_mtsv_workflow | <your_sample_name>_MTSV/ | A directory with your sample name as the prefix is generated and contains all the output files of MTSv analyses. Subdirectories containing these files include: Analysis/, Binning/, Logs/, Params/, QueryFastas/, Reports/, and Summary/ |

MTSv outputs include *.html files with summary statistics (Reports/binning_report.html, Reports/readprep_report.html, and Reports/summary_report.html). A final taxonomic summary table for each sample is written to Summary/summary.csv.

Additional Information

Command Line Equivalents

To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools.

Mash Commands

The following commands are equivalent to those used by the workflow to generate Mash distance taxonomic classification outputs with trim2:

mash sketch -m 2 SRR606249_subset10_trim2_interleaved_reads.fq
mash dist refseq.genomes.k21s1000.msh SRR606249_subset10_trim2_interleaved_reads.fq.msh > SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.tab
sort -gk3 SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.tab > SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.sorted.tab

The following commands are equivalent to those used by the workflow to generate Mash distance taxonomic classification outputs with trim30:

mash sketch -m 2 SRR606249_subset10_trim30_interleaved_reads.fq
mash dist RefSeq88n.msh SRR606249_subset10_trim30_interleaved_reads.fq.msh > SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.tab
sort -gk3 SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.tab > SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.sorted.tab

The following commands are equivalent to those used by the workflow to generate Mash screen taxonomic classification outputs with trim2:

mash screen -p 4 RefSeq88n.msh SRR606249_subset10_trim2_interleaved_reads.fq > SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.tab
sort -gr SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.tab > SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.sorted.tab

The following commands are equivalent to those used by the workflow to generate Mash screen taxonomic classification outputs with trim30:

mash screen -p 4 refseq.genomes.k21s1000.msh SRR606249_subset10_trim30_interleaved_reads.fq > SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.tab
sort -gr SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.tab > SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.sorted.tab

Sourmash Commands

The *.sig file generated by sourmash compute is used as the input file for sourmash compare and sourmash gather. If it was not previously generated in the Comparison Workflow, the signature file will be generated by the tax_class_signatures_workflow rule.
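
If the signature needs to be generated manually, a command along the following lines (sourmash 2.x syntax; the file names are illustrative, not produced by this exact command) computes a scaled MinHash signature at all three k-mer lengths used by the workflow:

```shell
# Illustrative sketch: compute a scaled (1/10000) MinHash signature
# at k = 21, 31, and 51 from interleaved filtered reads.
sourmash compute --scaled 10000 -k 21,31,51 \
    -o SRR606249_subset10_trim2_scaled10000.k21_31_51.sig \
    SRR606249_subset10_trim2_interleaved_reads.fq
```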

The MinHash signatures of reads filtered with a quality score threshold of 2 are assigned to reference genomes with the equivalent of the following sourmash gather commands (one command for each value of k: 21, 31, and 51):

sourmash gather -k 51 SRR606249_subset10_trim2_scaled10000.k51.sig genbank-k51.sbt.json refseq-k51.sbt.json -o SRR606249_subset10_trim2_k51.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k51.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k51.gather_matches.csv
sourmash gather -k 31 SRR606249_subset10_trim2_scaled10000.k31.sig genbank-k31.sbt.json refseq-k31.sbt.json -o SRR606249_subset10_trim2_k31.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k31.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k31.gather_matches.csv
sourmash gather -k 21 SRR606249_subset10_trim2_scaled10000.k21.sig genbank-k21.sbt.json refseq-k21.sbt.json -o SRR606249_subset10_trim2_k21.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k21.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k21.gather_matches.csv

The MinHash signatures of reads filtered with a quality score threshold of 30 are assigned to reference genomes with the equivalent of the following sourmash gather commands (one command for each value of k: 21, 31, and 51):

sourmash gather -k 51 SRR606249_subset10_trim30_scaled10000.k51.sig genbank-k51.sbt.json refseq-k51.sbt.json -o SRR606249_subset10_trim30_k51.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k51.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k51.gather_matches.csv
sourmash gather -k 31 SRR606249_subset10_trim30_scaled10000.k31.sig genbank-k31.sbt.json refseq-k31.sbt.json -o SRR606249_subset10_trim30_k31.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k31.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k31.gather_matches.csv
sourmash gather -k 21 SRR606249_subset10_trim30_scaled10000.k21.sig genbank-k21.sbt.json refseq-k21.sbt.json -o SRR606249_subset10_trim30_k21.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k21.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k21.gather_matches.csv

Kraken2 Commands

The equivalents of the following commands generate taxonomic classification outputs with Kraken2 for trim2 and trim30:

kraken2 --db minikraken_8GB_20200312/ --paired SRR606249_subset10_1_reads_trim2_1.fq.gz SRR606249_subset10_1_reads_trim2_2.fq.gz --use-names --confidence 0 --report SRR606249_subset10_trim2_kraken2.report --unclassified-out SRR606249_subset10_trim2_kraken2_unclassified#.fq.gz --classified-out SRR606249_subset10_trim2_kraken2_classified#.fq.gz --output SRR606249_subset10_trim2_kraken2.out.txt
kraken2 --db minikraken_8GB_20200312/ --paired SRR606249_subset10_1_reads_trim30_1.fq.gz SRR606249_subset10_1_reads_trim30_2.fq.gz --use-names --confidence 0 --report SRR606249_subset10_trim30_kraken2.report --unclassified-out SRR606249_subset10_trim30_kraken2_unclassified#.fq.gz --classified-out SRR606249_subset10_trim30_kraken2_classified#.fq.gz --output SRR606249_subset10_trim30_kraken2.out.txt

Bracken Commands

The equivalents of the following commands generate Bracken species-level taxonomic classification outputs using the Kraken2 reports previously generated with trim2 and trim30:

bracken -d minikraken_8GB_20200312/ -i SRR606249_subset10_1_reads_trim2_kraken2_minikraken_8GB_20200312_confidence0.report -r 100 -l S -t 0 -o SRR606249_subset10_1_reads_trim2_kraken2_minikraken_8GB_20200312_confidence0_bracken_db-default_r-100_l-S_t-0
bracken -d minikraken_8GB_20200312/ -i SRR606249_subset10_1_reads_trim30_kraken2_minikraken_8GB_20200312_confidence0.report -r 100 -l S -t 0 -o SRR606249_subset10_1_reads_trim30_kraken2_minikraken_8GB_20200312_confidence0_bracken_db-default_r-100_l-S_t-0

KrakenUniq Commands

The equivalents of the following commands generate taxonomic classification outputs with KrakenUniq for trim2 and trim30:

krakenuniq --paired SRR606249_subset10_1_reads_trim2_1.fq.gz SRR606249_subset10_1_reads_trim2_2.fq.gz --db minikraken_20171019_8GB/ --hll-precision 12 --report-file SRR606249_subset10_trim2_krakenuniq_report --unclassified-out SRR606249_subset10_trim2_krakenuniq_unclassified --classified-out SRR606249_subset10_trim2_krakenuniq_classified --output SRR606249_subset10_trim2_krakenuniq_out
krakenuniq --paired SRR606249_subset10_1_reads_trim30_1.fq.gz SRR606249_subset10_1_reads_trim30_2.fq.gz --db minikraken_20171019_8GB/ --hll-precision 12 --report-file SRR606249_subset10_trim30_krakenuniq_report --unclassified-out SRR606249_subset10_trim30_krakenuniq_unclassified --classified-out SRR606249_subset10_trim30_krakenuniq_classified --output SRR606249_subset10_trim30_krakenuniq_out

Kaiju Commands

The equivalents of the following commands will run Kaiju and generate its standard report for all of the organisms identified (one command is listed below for the trim2 filtered reads and one for trim30):

kaiju -x -v -t nodes.dmp -f kaiju_db_nr_euk.fmi -i SRR606249_subset10_trim2_1.fq.gz -j SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2.kaiju.out -z 4
kaiju -x -v -t nodes.dmp -f kaiju_db_nr_euk.fmi -i SRR606249_subset10_trim30_1.fq.gz -j SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30.kaiju.out -z 4

The standard kaiju report includes NCBI taxonomy IDs but not taxon names. The equivalents of the following commands append the corresponding taxon names to the last column of kaiju's output file with addTaxonNames:

addTaxonNames -t nodes.dmp -n names.dmp -u -p -i SRR606249_subset10_trim2.kaiju.out -o SRR606249_subset10_trim2.kaiju_names.out
addTaxonNames -t nodes.dmp -n names.dmp -u -p -i SRR606249_subset10_trim30.kaiju.out -o SRR606249_subset10_trim30.kaiju_names.out

A summary report for a specified taxon rank can be created from the kaiju output with kaijuReport. These summary reports can be directly input into krona with ktImportText.

The equivalents of the following commands generate genus-level summaries with kaijuReport for all organisms identified in the standard kaiju report:

kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -o SRR606249_subset10_trim2.kaiju_genus.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -o SRR606249_subset10_trim30.kaiju_genus.summary

The equivalents of the following commands generate species-level summaries with kaijuReport for all organisms identified in the standard kaiju report:

kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r species -o SRR606249_subset10_trim2.kaiju_out_species.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r species -o SRR606249_subset10_trim30.kaiju_out_species.summary

The equivalents of the following commands generate genus-level summaries with kaijuReport for all genera comprising at least 1% of the total reads:

kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -m 1 -o SRR606249_subset10_trim2.kaiju_genus_filtered1_total.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -m 1 -o SRR606249_subset10_trim30.kaiju_genus_filtered1_total.summary

The equivalents of the following commands generate genus-level summaries with kaijuReport for genera comprising at least 1% of all classified reads:

kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -m 1 -u -o SRR606249_subset10_trim2.kaiju_genus_filtered1_classified.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -m 1 -u -o SRR606249_subset10_trim30.kaiju_genus_filtered1_classified.summary

The equivalents of the following commands are used to generate krona visualizations from the kaijuReport genus-level summaries with ktImportText (three different krona plots for trim2 and three for trim30):

ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona.html SRR606249_subset10_trim2.kaiju_genus.summary
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona_filtered1_total.html SRR606249_subset10_trim2.kaiju_genus_filtered1_total.summary
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona_filtered1_classified.html SRR606249_subset10_trim2.kaiju_genus_filtered1_classified.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona.html SRR606249_subset10_trim30.kaiju_genus.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona_filtered1_total.html SRR606249_subset10_trim30.kaiju_genus_filtered1_total.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona_filtered1_classified.html SRR606249_subset10_trim30.kaiju_genus_filtered1_classified.summary

MTSv Commands

For command line equivalents for executing MTSv analyses, including Readprep, Binning, Analysis, and Summary, please see the Binning and Analysis Quick Start Guide on the MTSv GitHub page.

Expected Output Files for the Example Dataset

Below is a more detailed description of the output files expected in the metscale/workflows/data/ directory after the taxonomic classification workflow has been successfully run.

Using the filtered reads generated by the Read Filtering Workflow:

File Name File Size
SRR606249_subset10_1_reads_trim2_1.fq.gz 381 MB
SRR606249_subset10_1_reads_trim2_2.fq.gz 374 MB
SRR606249_subset10_1_reads_trim30_1.fq.gz 327 MB
SRR606249_subset10_1_reads_trim30_2.fq.gz 313 MB

Example Mash Output

File Name File Size
SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_screen.tab 6.7 MB
SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_screen.sorted.tab 6.7 MB
SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_distances.tab 11 MB
SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_distances.sorted.tab 11 MB

Example Sourmash Gather Output

The following files are produced by sourmash after computing the signatures of filtered reads with the tax_class_signatures_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig 1.1 MB
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig 883 KB

The following files are produced by sourmash after running sourmash gather with the tax_class_gather_workflow rule (this rule produces three files for each value of k and trim quality threshold):

File Name File Size
SRR606249_subset10_1_reads_trim2_k21.gather_matches.csv 2.2 MB
SRR606249_subset10_1_reads_trim2_k21.gather_output.csv 19 KB
SRR606249_subset10_1_reads_trim2_k21.gather_unassigned.csv 52 KB
SRR606249_subset10_1_reads_trim30_k21.gather_matches.csv 2.2 MB
SRR606249_subset10_1_reads_trim30_k21.gather_output.csv 19 KB
SRR606249_subset10_1_reads_trim30_k21.gather_unassigned.csv 11 KB
SRR606249_subset10_1_reads_trim2_k31.gather_matches.csv 2.2 MB
SRR606249_subset10_1_reads_trim2_k31.gather_output.csv 19 KB
SRR606249_subset10_1_reads_trim2_k31.gather_unassigned.csv 58 KB
SRR606249_subset10_1_reads_trim30_k31.gather_matches.csv 2.2 MB
SRR606249_subset10_1_reads_trim30_k31.gather_output.csv 19 KB
SRR606249_subset10_1_reads_trim30_k31.gather_unassigned.csv 12 KB
SRR606249_subset10_1_reads_trim2_k51.gather_matches.csv 2.2 MB
SRR606249_subset10_1_reads_trim2_k51.gather_output.csv 20 KB
SRR606249_subset10_1_reads_trim2_k51.gather_unassigned.csv 57 KB
SRR606249_subset10_1_reads_trim30_k51.gather_matches.csv 2.2 MB
SRR606249_subset10_1_reads_trim30_k51.gather_output.csv 20 KB
SRR606249_subset10_1_reads_trim30_k51.gather_unassigned.csv 14 KB

The f_match value reported by sourmash in the *gather_output.csv file is the fraction of each matching reference genome contained within the query metagenome. The reference genome matches are listed by their GenBank IDs, which can be connected to NCBI TaxIDs and species names.
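
As a quick way to inspect the strongest matches, the *gather_output.csv file can be sorted by its f_match column with standard shell tools. The snippet below is a toy sketch using fabricated data: the assumed column layout (f_match in column 3, name in column 5) mirrors sourmash 2.x output, but it should be checked against the header of your own file.

```shell
# Toy stand-in for a *gather_output.csv file (fabricated values).
printf 'intersect_bp,f_orig_query,f_match,f_unique_to_query,name\n' > toy_gather.csv
printf '10000,0.10,0.25,0.10,GCA_000001.1 Escherichia coli\n' >> toy_gather.csv
printf '20000,0.20,0.90,0.20,GCA_000002.1 Shewanella baltica\n' >> toy_gather.csv

# Skip the header, sort numerically (descending) by f_match, and print
# the name of the best match.
tail -n +2 toy_gather.csv | sort -t, -k3 -gr | head -n 1 | cut -d, -f5
```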

For the example dataset, there was no loss of sensitivity from increasing the k-mer length or trimming quality threshold: all 62 species were detected in all scenarios. Although different k-mer lengths and read trimming parameters produced similar sourmash gather f_match results, the smallest fraction of reference genomes was typically detected in the k=51 trim30 dataset. Increasing specificity with longer k-mer lengths and more aggressive trimming tended to lower the fraction of each true positive reference genome detected; however, in some cases the increased specificity of longer k-mer lengths resulted in detecting more of the reference genome than shorter k-mer lengths (e.g., Thermotoga sp. RQ2).

Although sourmash gather performs best at the species level, it will report matches below the species level. Two of the species in the Shakya subset 10 example dataset had strain variants, and no false positive strains were detected for these organisms (i.e., Methanococcus maripaludis strain C5, Methanococcus maripaludis strain S2, Shewanella baltica strain OS185, Shewanella baltica strain OS223). False positives for other organisms detected at the species level tended to have smaller f_match values, which suggests that there may be a threshold of genome coverage required for a high confidence species call to be made. Alternatively, these could be low abundance, real contaminants within the sample that was sequenced.

Overall, k-mer length had a larger impact on sourmash results than the trimming quality threshold. The smallest fraction of false positive genomes was detected in the k=51 datasets. The k=51 and trim30 parameters are recommended as defaults if users only want to run sourmash with one trimming quality threshold and value of k.

Example Kaiju Output

The following *.out files are produced by kaiju to report the taxonomic classifications of reads after running the tax_class_kaijureport_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2.kaiju.out 628 MB
SRR606249_subset10_1_reads_trim30.kaiju.out 592 MB

The tax_class_add_taxonnames_workflow rule will append taxon names to the *.out files to produce the following files:

File Name File Size
SRR606249_subset10_1_reads_trim2.kaiju_names.out 1.3 GB
SRR606249_subset10_1_reads_trim30.kaiju_names.out 1.2 GB

Kaiju can produce summary reports at different taxon levels, which can also be input into krona for visualizations of the summarized data. The following species-level summaries can be generated for all kaiju results with the tax_class_kaiju_species_summary_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2.kaiju_out_species.summary 241 KB
SRR606249_subset10_1_reads_trim30.kaiju_out_species.summary 204 KB

Kaiju detected (i.e., assigned at least one read to) >5,000 species in the SRR606249_subset10_1_reads_trim2 dataset and >4,000 species in the SRR606249_subset10_1_reads_trim30 dataset. Although a thresholding approach with the example dataset would not have achieved 100% precision and recall with kaiju as cleanly as with sourmash, requiring a minimum number of reads for a high confidence kaiju species detection would substantially reduce the number of false positives. In the trim30 dataset, a threshold of >6,000 assigned reads would have identified 57 true positives (out of an expected 62 species) and one false positive (Proteiniclasticum ruminis = 48,586 reads), which in this case could be a real species contaminant in the sample. It would also have produced five false negatives (Enterococcus faecalis = 5,079 reads; Chloroflexus aurantiacus = 4,807 reads; Bordetella bronchiseptica = 3,309 reads; Fusobacterium nucleatum = 3,156 reads; Thermoanaerobacter pseudethanolicus = 93 reads). Further benchmarking work with additional datasets will be important to determine how best to apply interpretation guidelines to kaiju results.
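
A read-count threshold like the one described above can be applied to a kaijuReport summary with a one-line awk filter. The example below uses fabricated data, and it assumes the summary is tab-separated with the read count in column 2; verify that layout against your own kaijuReport output before using it.

```shell
# Toy stand-in for a *.kaiju_out_species.summary file (fabricated values):
# percent of reads, read count, taxon name.
printf '12.0\t48586\tProteiniclasticum ruminis\n' > toy_species.summary
printf '1.2\t5079\tEnterococcus faecalis\n' >> toy_species.summary
printf '0.02\t93\tThermoanaerobacter pseudethanolicus\n' >> toy_species.summary

# Keep only species with more than 6,000 assigned reads.
awk -F'\t' '$2 > 6000' toy_species.summary
```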

The following genus-level summary files are produced by kaiju for all filtered reads after running the tax_class_kaijureport_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2.kaiju_genus.summary 54 KB
SRR606249_subset10_1_reads_trim30.kaiju_genus.summary 47 KB

The *.kaiju_genus.summary files can be input into krona with the tax_class_visualize_krona_kaijureport_workflow rule to produce the following krona plots of all genus-level results:

File Name File Size
SRR606249_subset10_1_reads_trim2.kaiju_genus_krona.html 371 KB
SRR606249_subset10_1_reads_trim30.kaiju_genus_krona.html 353 KB

Because kaiju detects a high number of organisms, removing those detected at low abundance makes the krona plots easier to read. The tax_class_kaijureport_filtered_workflow rule will produce the following genus-level summary reports that exclude genera with <1% of total reads, and krona will produce corresponding visualizations with the tax_class_visualize_krona_kaijureport_filtered_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2.kaiju_genus_filtered1_total.summary 1.3 KB
SRR606249_subset10_1_reads_trim30.kaiju_genus_filtered1_total.summary 1.3 KB
SRR606249_subset10_1_reads_trim2.kaiju_genus_krona_filtered1_total.html 227 KB
SRR606249_subset10_1_reads_trim30.kaiju_genus_krona_filtered1_total.html 227 KB

The tax_class_kaijureport_filteredclass_workflow rule will produce the following genus-level summary reports that exclude genera with <1% of all classified reads, and krona will produce corresponding visualizations with the tax_class_visualize_krona_kaijureport_filteredclass_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2.kaiju_genus_filtered1_classified.summary 1.5 KB
SRR606249_subset10_1_reads_trim30.kaiju_genus_filtered1_classified.summary 1.5 KB
SRR606249_subset10_1_reads_trim2.kaiju_genus_krona_filtered1_classified.html 227 KB
SRR606249_subset10_1_reads_trim30.kaiju_genus_krona_filtered1_classified.html 227 KB

Example Kraken2 Output

The following files are produced by Kraken2 with the tax_class_kraken2_workflow rule with trim30:

File Name File Size
SRR606249_subset10_1_reads_trim30_kraken2_unclassified_minikraken2_v2_8GB_201904_UPDATE_confidence0_1.fq 113 MB
SRR606249_subset10_1_reads_trim30_kraken2_unclassified_minikraken2_v2_8GB_201904_UPDATE_confidence0_2.fq 111 MB
SRR606249_subset10_1_reads_trim30_kraken2_minikraken2_v2_8GB_201904_UPDATE_confidence0.report 192 KB
SRR606249_subset10_1_reads_trim30_kraken2_minikraken2_v2_8GB_201904_UPDATE_confidence0.out 755 MB

Example KrakenUniq Output

The following files are produced by KrakenUniq with the tax_class_krakenuniq_workflow rule with trim30:

File Name File Size
SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_classified 849 MB
SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_unclassified 166 MB
SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_report 167 KB
SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_out 502 MB

Example MTSv Output

The following files are produced by MTSv with the tax_class_run_mtsv_workflow rule with trim2 and trim30:

File Name File Size
SRR606249_subset10_1_reads_trim2_MTSV/Summary/summary.csv 44 KB
SRR606249_subset10_1_reads_trim30_MTSV/Summary/summary.csv 36 KB