08. Taxonomic Classification
This workflow performs taxonomic classification on filtered Illumina paired-end reads with Mash version 2.2, Sourmash version 2.1.0, Kaiju version 1.6.1, Kraken2 version 2.0.8-beta, Bracken version 2.2, KrakenUniq version 0.5.8, and MTSv version 1.0.6, and visualizes Kaiju results with KronaTools version 2.7. It has been tested to run offline following execution of the Read Filtering Workflow. Kaiju can also be run on assembled contigs, which are output by the Assembly Workflow.
If you have not already, you will need to activate your metscale environment and perform the Offline Setup before running the taxonomic classification workflow:
```
[user@localhost ~]$ conda activate metscale
(metscale)[user@localhost ~]$ cd metscale/workflows
(metscale)[user@localhost workflows]$ python download_offline_files.py --workflow taxonomic_classification
```
In the metscale/container_images/ directory, you should see the following Singularity images that were created when running the taxonomic_classification or all flags during the Offline Setup:
Tool | Singularity Image | Size |
---|---|---|
Mash | mash_2.2--h3d38be6_0.sif | 44 MB |
Sourmash | sourmash_2.1.0--py27he1b5a44_0.sif | 271 MB |
Kaiju | kaiju_1.6.1--pl5.22.0_0.sif | 37 MB |
Krona | krona_2.7--pl5.22.0_1.sif | 21 MB |
Kraken2 | kraken2_2.0.8_beta--pl526h6bb024c_0.sif | 327 MB |
Bracken | bracken_2.2--py27h2d50403_1.sif | 262 MB |
KrakenUniq | krakenuniq_0.5.8--pl526he860b03_0.sif | 54 MB |
MTSv | mtsv_1.0.6--py36hc9558a2_1.sif | 675 MB |
You should also have the following databases downloaded in the metscale/workflows/data directory:
Tool | Downloaded Database | Description |
---|---|---|
Mash | https://obj.umiacs.umd.edu/diamond/RefSeq_10K_Sketches.msh.gz | Microbial signatures from NCBI RefSeq, as used in the original Mash publication |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k21-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 21 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 31 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k51-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 51 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k21-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 21 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k31-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 31 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k51-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 51 |
Kaiju | http://kaiju.binf.ku.dk/database/kaiju_index_nr_euk.tgz | Protein sequences from NCBI nr, including bacteria, archaea, viruses, & fungi |
Kraken2 | ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz | Kraken 2 database comprised of RefSeq bacterial, archaeal, and viral genomes, as well as the GRCh38 human genome |
Bracken | ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz | Contains three Bracken database files within this sub-directory: database{100, 150, 200}mers.kmer_distrib |
KrakenUniq | https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_8GB.tgz | Kraken 1 database comprised of RefSeq bacterial, archaeal, and viral genomes |
MTSv | https://rcdata.nau.edu/fofanov_lab/Compressed_MTSV_database_files/complete_genome.tar.gz | Directory and sub-directory that contain the complete reference database files for MTSv |
Tool | Downloaded Database | Size | MD5 Checksum |
---|---|---|---|
Mash | RefSeq_10K_Sketches.msh.gz | 6.3 GB | 01de4f52d6fe509e146fdeacee0fb8d1 * |
Sourmash | microbe-genbank-sbt-k21-2017.05.09.tar.gz | 4.2 GB | d9ed08309ff56a558dcabf4ef1af2c4e ** |
Sourmash | microbe-genbank-sbt-k31-2017.05.09.tar.gz | 4.2 GB | 759b18b7dcb4f42ff04592d94db3685e ** |
Sourmash | microbe-genbank-sbt-k51-2017.05.09.tar.gz | 4.2 GB | 2dbe310b7f5a396e76ceba155ff5d424 ** |
Sourmash | microbe-refseq-sbt-k21-2017.05.09.tar.gz | 3.6 GB | bf594377fe644a741a7cbd78e0b791ac ** |
Sourmash | microbe-refseq-sbt-k31-2017.05.09.tar.gz | 3.6 GB | 8bec1b006779fe4100a58dcf23c9d92f ** |
Sourmash | microbe-refseq-sbt-k51-2017.05.09.tar.gz | 3.6 GB | eb4e5e1210403baab4f0fbe86ecc8917 ** |
Kaiju | kaiju_index_nr_euk.tgz | 28 GB | 89d4dab3c6fd23432d99357e0c4687de * |
KrakenUniq | minikraken_20171019_8GB.tgz | 5.7 GB | 5d626afb34d87925990889d27e411c29 * |
Kraken2 and Bracken | minikraken_8GB_202003.tgz | 5.6 GB | 51f9cd572b0630fdaf4b5595c672b5b9 * |
MTSv | complete_genome.tar.gz | 39 GB | a2dcdbb77ab38e4b474eefb6dddadab1 * |
*Uncompressed during offline setup
**Uncompressed during tool execution
If you are missing any of these files, or if you suspect any of these databases are corrupted (e.g., from interruptions during the offline download), you should re-run the appropriate setup command, following the instructions in the Offline Setup. These databases can take a while to download, and an interrupted download is a common reason for the taxonomic classification workflow failing to run properly. You can check whether the MD5 checksums of your downloaded databases match the expected MD5 checksums in the table above. Here is an example of how to run the md5sum command:
```
$ cd metscale/workflows/data
$ md5sum kaiju_index_nr_euk.tgz
89d4dab3c6fd23432d99357e0c4687de  kaiju_index_nr_euk.tgz
```
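If md5sum is not available on your system, the same check can be scripted in Python. This is a minimal sketch; the helper name md5_of_file is ours for illustration and is not part of MetScale:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 in 1 MB chunks, so multi-GB databases
    never have to fit in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned hex digest against the table above flags a corrupted download before the workflow fails mid-run.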
If the databases were uncompressed and used in a previous workflow, they will appear like this in the metscale/workflows/data directory in the future (note that some of these are hidden directories):
Tool | Uncompressed Database | Size | MD5 Checksum |
---|---|---|---|
Mash | RefSeq_10K_Sketches.msh | 6.8 GB | f89c78f3dfbed846dc48727642a97c3a |
Sourmash | .sbt.genbank-k21/ | 18 GB | uncompressed hidden directory |
Sourmash | .sbt.genbank-k31/ | 18 GB | uncompressed hidden directory |
Sourmash | .sbt.genbank-k51/ | 18 GB | uncompressed hidden directory |
Sourmash | genbank-k21.sbt.json | 27 MB | b22f0e51672019c04423d8afb03175a0 |
Sourmash | genbank-k31.sbt.json | 27 MB | 62c475f0f8357a7219f15f466814e9d3 |
Sourmash | genbank-k51.sbt.json | 27 MB | 690f9de5f7100f2ecae3ef6aab0a39e0 |
Sourmash | .sbt.refseq-k21/ | 15 GB | uncompressed hidden directory |
Sourmash | .sbt.refseq-k31/ | 15 GB | uncompressed hidden directory |
Sourmash | .sbt.refseq-k51/ | 15 GB | uncompressed hidden directory |
Sourmash | refseq-k21.sbt.json | 25 MB | 8d33669b8618cb3e76e5cd8e48a814c1 |
Sourmash | refseq-k31.sbt.json | 25 MB | 5347dc7fca3779ae47f2fcfd4f50b15c |
Sourmash | refseq-k51.sbt.json | 25 MB | 46a5638f5fa07afe4811a58e9f7c7ded |
Kaiju | kaiju_db_nr_euk.fmi | 48 GB | f9e30d8a03925a07ff19c32bcf05df3d |
Kaiju | names.dmp | 134 MB | 2e3e40199d42a529802a02f118fb033b |
Kaiju | nodes.dmp | 104 MB | 8c4e4d3acdec1a0a837ab71d53c06828 |
Kraken2 | minikraken_8GB_20200312/hash.k2d | 7.5 GB | 1e7612c690b728ecefecb8d4a182f58c |
Kraken2 | minikraken_8GB_20200312/opts.k2d | 56 B | 0107fefa85081a8ad341e75c8ee300b2 |
Kraken2 | minikraken_8GB_20200312/seqid2taxid.map | 2.8 MB | 4be51cd7214ff22bbd6339821b08b06b |
Kraken2 | minikraken_8GB_20200312/taxo.k2d | 2.2 MB | b99be394ad419e64c8c55aea14b238f7 |
Bracken | minikraken_8GB_20200312/database50mers.kmer_distrib | 2.9 MB | 5a46e3b8e4585d5f58f2f10879d28272 |
Bracken | minikraken_8GB_20200312/database100mers.kmer_distrib | 2.9 MB | 3a737368f629627995026dc1b40711b1 |
Bracken | minikraken_8GB_20200312/database150mers.kmer_distrib | 2.6 MB | d1eb27e527b36469cd19aef62d491ebe |
Bracken | minikraken_8GB_20200312/database200mers.kmer_distrib | 2.4 MB | 38b3320b6701a11ab79f88ea3ea01b77 |
Bracken | minikraken_8GB_20200312/database250mers.kmer_distrib | 2.2 MB | 603413ad165b860e5fe5af954402268a |
KrakenUniq | minikraken_20171019_8GB/database.idx | 513 MB | 3e7cf2135ef4945b92ed02dad96c8f65 |
KrakenUniq | minikraken_20171019_8GB/database.kdb | 7.6 GB | 7ab07372f6fdb7b78322b7802f72d6bc |
KrakenUniq | minikraken_20171019_8GB/database.kdb.counts | 178 KB | c9791458a3bbe97cc8db0b22016894d1 |
KrakenUniq | minikraken_20171019_8GB/taxDB | 77 MB | 90132b2ad1e02ca4e8fc50cdab905c67 |
KrakenUniq | minikraken_20171019_8GB/taxonomy/names.dmp | 139 MB | 107ed68d9a24fc47f2e46d4c53206ca8 |
KrakenUniq | minikraken_20171019_8GB/taxonomy/nodes.dmp | 108 MB | dc6023b70a83f5f01eb2a2dd9d883a07 |
MTSv | Oct-28-2019/artifacts/complete_genome.fas | 121 GB | aad15187a99c26fc234725932f411574 |
MTSv | Oct-28-2019/artifacts/complete_genome_ff.txt | 5.8 MB | 6b515496a94240ce27b343b03b8dd66b |
MTSv | Oct-28-2019/artifacts/complete_genome.json | 17 KB | variable |
MTSv | Oct-28-2019/artifacts/complete_genome.p | 157 MB | 7ee3e27583f1d027a36ef92a723ed91f |
MTSv | Oct-28-2019/artifacts/nucl_gb.accession2taxid.gz | 1.8 GB | 93dcf9204dfe383aca117f4b978c4ed9 |
MTSv | Oct-28-2019/artifacts/nucl_wgs.accession2taxid.gz | 3.2 GB | d954bf090d05eb39778b8dd18364762f |
MTSv | Oct-28-2019/artifacts/pdb.accession2taxid.gz | 3.2 MB | 64de81d3ad2cc772b8978a85634deefc |
MTSv | Oct-28-2019/artifacts/prot.accession2taxid.gz | 5.4 MB | 7d197b60ca8408b860916be7c13f3380 |
MTSv | Oct-28-2019/artifacts/taxdump.tar.gz | 48 MB | c9c83e0f08c66d9f2e7eee65194bb519 |
MTSv | Oct-28-2019/artifacts/tree.index | 133 MB | b8779584df7c78947ae79c08b9d60aa0 |
The taxonomic classification workflow uses input files generated from prior workflows in the MetScale pipeline. The input files for each of the taxonomic classification tools are described below, and all of these files should be located in the metscale/workflows/data directory.
If the Comparison Workflow was previously run, the taxonomic classification workflow will use the previously generated MinHash signatures as input files for sourmash gather. These are the files that you should see in the metscale/workflows/data directory after generating signatures for the example SRR606249_subset10 dataset in the Comparison Workflow:
File Name | File Size | Signature Source |
---|---|---|
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig | 1.1 MB | Signatures from conservatively filtered reads (trim2) |
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig | 883 KB | Signatures from aggressively filtered reads (trim30) |
Note that running the Comparison Workflow before taxonomic classification is not required. If the comparison workflow was not run first, the signatures will be generated from the filtered reads when the taxonomic classification workflow is executed. Each signature file includes three k-mer lengths (k=21, k=31, k=51), corresponding to the pre-computed databases used with sourmash gather in the taxonomic classification workflow.
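The scaled10000 part of the signature file names refers to sourmash's "scaled" MinHash: only k-mer hashes below 2**64 / scaled are kept, so each sketch retains roughly one in every 10,000 distinct k-mer hashes regardless of input size. A toy sketch of the idea follows; it uses MD5 as a stand-in for the MurmurHash function sourmash actually uses, so the retained hashes will differ from a real signature:

```python
import hashlib

def kmer_hashes(seq, k):
    """Map each k-mer to a 64-bit integer (MD5 stands in here for the
    hash function sourmash actually uses)."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].encode()
        yield int.from_bytes(hashlib.md5(kmer).digest()[:8], "big")

def scaled_minhash(seq, k=21, scaled=10000):
    """Keep only hashes below 2**64 // scaled: roughly 1/scaled of all
    distinct k-mers survive, no matter how large the input is."""
    threshold = 2**64 // scaled
    return {h for h in kmer_hashes(seq, k) if h < threshold}
```

Because the cutoff is a fixed fraction of the hash space, sketches of different datasets remain directly comparable, which is what lets sourmash gather match a metagenome signature against the pre-computed database signatures.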
Below are the input files for the lowest common ancestor (LCA) read-based assignment tools. You should see these in the metscale/workflows/data directory after running the example SRR606249_subset10 dataset through the read filtering workflow:
File Name | File Size |
---|---|
SRR606249_subset10_1_reads_trim2_1.fq.gz | 381 MB |
SRR606249_subset10_1_reads_trim2_2.fq.gz | 374 MB |
SRR606249_subset10_1_reads_trim30_1.fq.gz | 365 MB |
SRR606249_subset10_1_reads_trim30_2.fq.gz | 359 MB |
If the complete Read Filtering Workflow was previously run, the taxonomic classification workflow will use the previously generated trimmed, interleaved, and zipped file(s) as input files for Mash dist, Mash screen, and MTSv. You should see these interleaved files in the metscale/workflows/data directory after generating them for the example SRR606249_subset10 dataset in the Read Filtering Workflow:
File Name | File Size | File Source |
---|---|---|
SRR606249_subset10_1_reads_trim2_interleaved_reads.fq.gz | 688 MB | Interleaved and zipped file from conservatively filtered reads (trim2) |
SRR606249_subset10_1_reads_trim30_interleaved_reads.fq.gz | 579 MB | Interleaved and zipped file from aggressively filtered reads (trim30) |
Note that running the Read Filtering Workflow before taxonomic classification is not required. If the full read filtering workflow was not run first, the interleaved files will be generated when the taxonomic classification workflow is executed. If you already have interleaved and zipped reads, you can also execute MTSv by jumping into the workflow (see Different Workflow Entry Points for time-stamping files to jump into the workflow with your interleaved *fq.gz files).
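Conceptually, interleaving just alternates four-line FASTQ records from the _1 and _2 files. A minimal sketch of the operation, for readers unfamiliar with the format (the helper name is illustrative; MetScale performs this step inside the read filtering workflow, not with this code):

```python
import gzip
from itertools import islice

def interleave_fastq(r1_path, r2_path, out_path):
    """Write records pair-by-pair (R1 record, then its mate from R2),
    assuming both inputs list mates in the same order."""
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2, \
         gzip.open(out_path, "wt") as out:
        while True:
            rec1 = list(islice(r1, 4))  # one FASTQ record = 4 lines
            rec2 = list(islice(r2, 4))
            if len(rec1) < 4 or len(rec2) < 4:
                break
            out.writelines(rec1 + rec2)
```

Keeping mates adjacent in one stream is what lets tools like Mash and MTSv consume paired-end data from a single input file.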
Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.
After the config file is ready, be sure to set the Singularity bind path from the metscale/workflows directory before running the taxonomic classification workflow:
```
cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"
```
Execution of the taxonomic classification workflow can then be performed with the following command:
```
snakemake --use-singularity {rules} {other options}
```
The following rules are available for execution in the taxonomic classification workflow (in the workflow diagram, yellow stars indicate the terminal rules). These rules and their parameters are listed under "workflows" in the metscale/workflows/config/default_workflowconfig.settings default config file:
Tool | Rule | Description |
---|---|---|
Mash | tax_class_mash_screen_workflow | Mash screens a database to estimate how well each genome is contained in a metagenome |
Mash | tax_class_mash_dist_workflow | Mash estimates the distance between two isolate genomes to assess similarity |
Sourmash | tax_class_signatures_workflow | Sourmash computes the MinHash signatures of filtered reads |
Sourmash | tax_class_gather_workflow | Sourmash gather finds the best match for each genome within a metagenome |
Kaiju | tax_class_kaijureport_workflow | With trimmed reads as inputs, Kaiju performs taxonomic classifications and reports all organisms identified |
Kaiju | tax_class_kaijureport_contigs_workflow | With contigs as inputs, Kaiju performs taxonomic classifications and reports all organisms identified |
Kraken2 | tax_class_kraken2_workflow | Executes Kraken2 on trimmed reads for taxonomic classification |
Bracken | tax_class_bracken_workflow | Executes Bracken for re-estimation of Kraken2 taxonomic classifications at specific taxon level(s) |
KrakenUniq | tax_class_krakenuniq_workflow | Executes KrakenUniq on trimmed reads for taxonomic classification |
MTSv | tax_class_create_mtsv_db_workflow | Installs and configures the downloaded MTSv database; this only needs to be executed the first time you run MTSv |
MTSv | tax_class_run_mtsv_workflow | Copies over the MTSv .cfg file, updates it with data specified in workflow parameters, and runs MTSv with the given parameters. This runs through MTSv init and analyses, including readprep, binning, analysis, reporting, and summary |
This command will execute mash screen on interleaved, filtered reads from metagenomes:
```
snakemake --use-singularity tax_class_mash_screen_workflow
```
This command will execute mash dist on interleaved, filtered reads from isolate genomes:
```
snakemake --use-singularity tax_class_mash_dist_workflow
```
If the interleaved files have not been generated before running Mash, they will be generated by the read filtering workflow when these commands are run.
NOTE: The tax_class_signatures_workflow rule runs by calling the comparison workflow's snakefile, which generates MinHash signatures. If the comparison workflow was already run, snakemake will recognize that the signature file already exists and will not re-execute. Because of this, if you wish to change parameters for the tax_class_signatures_workflow rule, you must also update those parameters in the comparison workflow to ensure that the intended signature file is generated.
To execute sourmash, you can run all rules independently or in tandem by specifying multiple sourmash rules in one run. The following command will run all of the available rules for sourmash:
```
snakemake --use-singularity tax_class_signatures_workflow tax_class_gather_workflow
```
The following command will run Kraken2 on your trimmed read files:
```
snakemake --use-singularity tax_class_kraken2_workflow
```
The following command will run Bracken on your Kraken2 output files. Alternatively, you can call this rule by itself: as a terminal rule, it will execute both Kraken2 and Bracken:
```
snakemake --use-singularity tax_class_bracken_workflow
```
The following command will run KrakenUniq on your trimmed read files:
```
snakemake --use-singularity tax_class_krakenuniq_workflow
```
The following command will run Kaiju on trimmed reads:
```
snakemake --use-singularity tax_class_kaijureport_workflow
```
The following command will run Kaiju on assembled contigs, which may help improve accuracy for this amino acid-based classifier:
```
snakemake --use-singularity tax_class_kaijureport_contigs_workflow
```
If users are interested in exploring how Kaiju's performance could be improved with different parameter and filtering options, a number of additional rules are available for Kaiju:
Additional Kaiju Rule | Description |
---|---|
tax_class_kaijureport_filtered_workflow | With filtered reads as inputs, Kaiju summarizes genera with >=1% of the total reads |
tax_class_kaijureport_filtered_contigs_workflow | With contigs as inputs, Kaiju summarizes genera with >=1% of the total reads |
tax_class_kaijureport_filteredclass_workflow | Kaiju summarizes genera with >=1% of all classified reads |
tax_class_kaijureport_filteredclass_contigs_workflow | With contigs as inputs, Kaiju summarizes genera with >=1% of all classified reads |
tax_class_add_taxonnames_workflow | Kaiju adds taxon names to its report |
tax_class_add_taxonnames_contigs_workflow | Kaiju adds taxon names to its report generated from contig file(s) |
tax_class_kaiju_species_summary_workflow | Kaiju provides a species-level summary of the organisms identified |
tax_class_kaiju_species_summary_contigs_workflow | Kaiju provides a species-level summary of the organisms identified from contig file(s) |
tax_class_visualize_krona_kaijureport_workflow | Krona plots are generated from all Kaiju genus-level results |
tax_class_visualize_krona_kaijureport_contigs_workflow | Krona plots are generated from all Kaiju genus-level results generated from contig file(s) |
tax_class_visualize_krona_kaijureport_filtered_workflow | Krona plots are generated from Kaiju genera with >=1% of the total reads |
tax_class_visualize_krona_kaijureport_filtered_contigs_workflow | Krona plots are generated from Kaiju genera with >=1% of the total reads, generated from contig file(s) |
tax_class_visualize_krona_kaijureport_filteredclass_workflow | Krona plots are generated from Kaiju genera with >=1% of all classified reads |
tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow | Krona plots are generated from Kaiju genera with >=1% of all classified reads, generated from contig file(s) |
tax_class_visualize_krona_species_summary_workflow | Krona plots are generated from the Kaiju species summary report |
tax_class_visualize_krona_species_summary_contigs_workflow | Krona plots are generated from the Kaiju species summary report generated from contig file(s) |
The following command will execute all available rules for Kaiju on trimmed reads:
```
snakemake --use-singularity tax_class_add_taxonnames_workflow tax_class_kaiju_species_summary_workflow tax_class_visualize_krona_kaijureport_filtered_workflow tax_class_visualize_krona_kaijureport_filteredclass_workflow tax_class_visualize_krona_species_summary_workflow
```
The following command will execute all available rules for Kaiju and Krona, but using contigs as inputs instead of reads:
```
snakemake --use-singularity tax_class_add_taxonnames_contigs_workflow tax_class_visualize_krona_kaijureport_contigs_workflow tax_class_visualize_krona_kaijureport_filtered_contigs_workflow tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow tax_class_visualize_krona_species_summary_contigs_workflow
```
To execute MTSv, you can run both rules independently or in tandem by specifying multiple MTSv rules in one run. The following command will run all of the available rules for MTSv:
```
snakemake --use-singularity tax_class_create_mtsv_db_workflow tax_class_run_mtsv_workflow
```
NOTE: The tax_class_create_mtsv_db_workflow rule only needs to be executed the first time MTSv is run; after that, all analyses can be run by executing tax_class_run_mtsv_workflow.
Additional options for snakemake can be found in the snakemake documentation.
To change or specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture.
After successful execution of the taxonomic classification workflow, you will find all of its outputs in the metscale/workflows/data/ directory. You should see the files below after running the example dataset with each of the taxonomic classification rules.
Mash Rule | Output File | Output Description |
---|---|---|
tax_class_mash_screen_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_screen.tab | List of genomes contained within the metagenome |
tax_class_mash_screen_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_screen.sorted.tab | List of genomes contained within the metagenome, sorted to show best hits at the top |
tax_class_mash_dist_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_distances.tab | List of genomes matching the query |
tax_class_mash_dist_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_distances.sorted.tab | List of genomes matching the query, sorted to show best hits at the top |
Sourmash Rule | Output File | Output Description |
---|---|---|
tax_class_signatures_workflow | {sample}_1_reads_trim{quality threshold}_scaled{scaled value}.k{k values}.sig | MinHash sketches of filtered reads, generated and saved in a signature file by sourmash compute |
tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_unassigned.csv | Unassigned hashes in a single signature generated by sourmash gather |
tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_matches.csv | Signatures of matches generated by sourmash gather |
tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_output.csv | Taxonomic classifications generated by sourmash gather |
Although a few different files are generated by sourmash gather, only the *gather_output.csv file is used for human interpretation of the results. The following columns are expected in the primary sourmash gather output file, *gather_output.csv (see the sourmash documentation for more information):
Name | Description |
---|---|
intersect_bp | Approximate number of base pairs shared between the query metagenome and the matched reference genome |
f_orig_query | Fraction of the original query metagenome shared with the reference genome match |
f_match | Fraction of the reference genome match shared with the query metagenome |
f_unique_to_query | Fraction of the query metagenome that uniquely matches the displayed reference genome. Matches are subtracted from the query metagenome as they are found, so f_unique_to_query is iteratively calculated on the remaining metagenome after the prior matches are subtracted; moving from the top row to the bottom row, the matches are reported in the order they were identified in *gather_output.csv |
f_unique_weighted | f_unique_to_query weighted by abundance |
average_abund | Average abundance of the match |
name | Name of the match in the database |
filename | Name of the database Sequence Bloom Tree (SBT) used |
md5 | A checksum value to verify data integrity |
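The iterative subtraction behind f_unique_to_query can be illustrated with a toy greedy loop over hash sets. This is a conceptual sketch of gather's greedy assignment, not sourmash's implementation, and the function and genome names are invented:

```python
def greedy_gather(query_hashes, reference_sketches):
    """Repeatedly pick the reference sharing the most hashes with what
    remains of the query, report its unique fraction, then subtract it."""
    remaining = set(query_hashes)
    total = len(remaining)
    results = []
    while remaining:
        name, ref = max(reference_sketches.items(),
                        key=lambda item: len(remaining & item[1]))
        hit = remaining & ref
        if not hit:
            break  # nothing left matches any reference
        results.append((name, len(hit) / total))  # f_unique_to_query
        remaining -= hit
    return results
```

Because each match is removed before the next is sought, the reported fractions cover non-overlapping portions of the metagenome and sum to at most 1, in the order the matches were identified.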
Rule | Output File | Output Description |
---|---|---|
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_classified_{database}_confidence{threshold}_{read_direction}.fq | Classified reads from Kraken2 output |
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_unclassified_{database}_confidence{threshold}_{read_direction}.fq | Unclassified reads from Kraken2 output |
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{database}_confidence{threshold}.report | Report to summarize all Kraken2 read-based taxon classifications |
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{database}_confidence{threshold}.out | File generated to describe each of the Kraken2 classified (C) and unclassified (U) reads |
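For downstream scripting, the tab-separated .report file can be parsed directly. A minimal sketch, assuming the standard six-column Kraken2 report layout (percent of reads, clade read count, direct read count, rank code, NCBI taxid, indented name); the helper name is ours:

```python
def parse_kraken2_report(lines):
    """Turn Kraken2 report lines into dicts; leading spaces on the name
    column encode depth in the taxonomy tree (two spaces per level)."""
    rows = []
    for line in lines:
        pct, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank,
            "taxid": int(taxid),
            "depth": (len(name) - len(name.lstrip())) // 2,
            "name": name.strip(),
        })
    return rows
```

Filtering the resulting rows by rank code (e.g., "S" for species) is a quick way to pull a species table out of the report.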
Rule | Output File | Output Description |
---|---|---|
tax_class_bracken_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{kraken2 database}_confidence{threshold}_bracken.report | TSV report of Kraken2 results that are input into Bracken for re-classification |
tax_class_bracken_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{kraken2 database}_confidence{threshold}_bracken_db-{bracken database}_r-{read_length}_l-{level}_t-{threshold} | TSV file of Bracken re-classified reads from Kraken2 at a specific taxon level |
Rule | Output File | Output Description |
---|---|---|
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_classified | FASTA file of all classified reads identified by KrakenUniq |
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_unclassified | FASTA file of all unclassified reads identified by KrakenUniq |
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_report | Report to summarize all KrakenUniq read-based taxon classifications |
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_out | Text file generated to describe all KrakenUniq classified (C) and unclassified (U) reads |
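The hll in these file names refers to KrakenUniq's distinguishing feature: it estimates the number of distinct k-mers covered per taxon with a HyperLogLog counter, which helps separate genuine hits from reads piling up on one short reference region. The quantity being estimated is simply the following (shown here as exact counting; KrakenUniq approximates it to save memory, and this helper is ours for illustration):

```python
def distinct_kmer_count(reads, k=31):
    """Count distinct k-mers across a taxon's assigned reads; a low count
    despite many reads suggests coverage of only a small reference region."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)
```

A taxon with thousands of assigned reads but very few distinct k-mers is more likely a contaminant or a conserved-region artifact than a genuinely present organism.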
Kaiju Rule | Output File | Output Description |
---|---|---|
tax_class_kaijureport_workflow or tax_class_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju.out | Standard Kaiju output file |
tax_class_kaijureport_workflow or tax_class_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus.summary | Genus-level summary of all Kaiju results |
tax_class_kaijureport_filtered_workflow or tax_class_kaijureport_filtered_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_filtered1_total.summary | Kaiju genus-level summary with no genera <1% of total reads |
tax_class_kaijureport_filteredclass_workflow or tax_class_kaijureport_filteredclass_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_filtered1_classified.summary | Kaiju genus-level summary with no genera <1% of all classified reads |
tax_class_add_taxonnames_workflow or tax_class_add_taxonnames_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_names.out | Kaiju output file with taxon names appended to it |
tax_class_kaiju_species_summary_workflow or tax_class_kaiju_species_summary_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_out_species.summary | Species-level summary of all Kaiju results |
The following columns are expected in the primary Kaiju output file, *.out (see the Kaiju documentation for more information):
Name | Description |
---|---|
Column 1 | Either C or U, indicating whether the read is classified or unclassified |
Column 2 | Name of the read |
Column 3 | NCBI taxon identifier of the assigned taxon |
Column 4 | The length or score of the best match used for classification |
Column 5 | The taxon identifiers of all database sequences with the best match |
Column 6 | The accession numbers of all database sequences with the best match |
Column 7 | Matching fragment sequence(s) |
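The columns above make the file easy to tally with a few lines of scripting. For example, a sketch that counts classified vs. unclassified reads and reads per taxon, assuming the tab-separated layout described above (the helper name is ours):

```python
from collections import Counter

def summarize_kaiju_out(lines):
    """Count classified (C) vs unclassified (U) reads, and reads per
    NCBI taxon ID among the classified ones (columns 1-3 of *.out)."""
    status, taxa = Counter(), Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        status[fields[0]] += 1
        if fields[0] == "C":
            taxa[fields[2]] += 1
    return status, taxa
```

The classified fraction from such a tally is a quick sanity check on database coverage before digging into the genus- and species-level summaries.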
Three rules are available to generate Krona plots from the genus-level summaries of the Kaiju results, and one rule will generate a Krona plot from the species-level summary.
Kaiju/Krona Rule | Output File | Output Description |
---|---|---|
tax_class_visualize_krona_kaijureport_workflow or tax_class_visualize_krona_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona.html | Krona plot of all Kaiju genus-level results |
tax_class_visualize_krona_kaijureport_filtered_workflow or tax_class_visualize_krona_kaijureport_filtered_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona_filtered1_total.html | Krona plot of genus-level Kaiju results after removing genera with <1% of total reads |
tax_class_visualize_krona_kaijureport_filteredclass_workflow or tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona_filtered1_classified.html | Krona plot of genus-level Kaiju results after removing genera with <1% of all classified reads |
tax_class_visualize_krona_species_summary_workflow or tax_class_visualize_krona_species_summary_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_species_krona.html | Krona plot of all species-level Kaiju results |
Rule | Output File | Output Description |
---|---|---|
tax_class_create_mtsv_db_workflow | Oct-28-2019/ | Unzips and formats the MTSv database in this subdirectory |
tax_class_run_mtsv_workflow | <your_sample_name>_MTSV/ | A directory with your sample name as the prefix is generated and contains all the output files of MTSv analyses, in the subdirectories Analysis/, Binning/, Logs/, Params/, QueryFastas/, Reports/, and Summary/ |
MTSv outputs include *html files with summary statistics (Report/binning_report.html, Report/readprep_report.html, and Report/summary_report.html). A final taxonomic summary table for each sample is written to Summary/summary.csv.
To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools.
Mash distance taxonomic classification outputs for trim2 are generated with the equivalent of the following commands:

```
mash sketch -m 2 SRR606249_subset10_trim2_interleaved_reads.fq
mash dist refseq.genomes.k21s1000.msh SRR606249_subset10_trim2_interleaved_reads.fq.msh > SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.tab
sort -gk3 SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.tab > SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.sorted.tab
```
Mash distance taxonomic classification outputs for trim30 are generated with the equivalent of the following commands:

```
mash sketch -m 2 SRR606249_subset10_trim30_interleaved_reads.fq
mash dist RefSeq88n.msh SRR606249_subset10_trim30_interleaved_reads.fq.msh > SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.tab
sort -gk3 SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.tab > SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.sorted.tab
```
Mash screen taxonomic classification outputs for trim2 are generated with the equivalent of the following commands:

```
mash screen -p 4 RefSeq88n.msh SRR606249_subset10_trim2_interleaved_reads.fq > SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.tab
sort -gr SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.tab > SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.sorted.tab
```
The equivalents of the following commands generate Mash screen taxonomic classification outputs with trim30:
mash screen -p 4 refseq.genomes.k21s1000.msh SRR606249_subset10_trim30_interleaved_reads.fq > SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.tab
sort -gr SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.tab > SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.sorted.tab
This *.sig file generated by `sourmash compute` is used as the input file for `sourmash compare` and `sourmash gather`. If it was not previously generated in the Comparison Workflow, the signature file will be generated by the `tax_class_signatures_workflow` rule.
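If you want to regenerate the signature outside the workflow, a command along the following lines should work with the sourmash 2.x CLI (the parameter values mirror the workflow's file-naming scheme; treat this as a sketch rather than the rule's exact invocation). It is echoed here as a dry run; drop the echo to execute it, which requires sourmash:

```shell
# Dry run: prints the sourmash compute command that would build the
# multi-k signature used by sourmash compare and sourmash gather.
echo sourmash compute --scaled 10000 -k 21,31,51 \
    -o SRR606249_subset10_trim2_scaled10000.k21_31_51.sig \
    SRR606249_subset10_trim2_interleaved_reads.fq
```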
The MinHash signatures of reads filtered with a quality score threshold of 2 are assigned to reference genomes with the equivalent of the following sourmash gather commands (one command for each value of k: 21, 31, and 51):
sourmash gather -k 51 SRR606249_subset10_trim2_scaled10000.k51.sig genbank-k51.sbt.json refseq-k51.sbt.json -o SRR606249_subset10_trim2_k51.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k51.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k51.gather_matches.csv
sourmash gather -k 31 SRR606249_subset10_trim2_scaled10000.k31.sig genbank-k31.sbt.json refseq-k31.sbt.json -o SRR606249_subset10_trim2_k31.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k31.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k31.gather_matches.csv
sourmash gather -k 21 SRR606249_subset10_trim2_scaled10000.k21.sig genbank-k21.sbt.json refseq-k21.sbt.json -o SRR606249_subset10_trim2_k21.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k21.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k21.gather_matches.csv
The MinHash signatures of reads filtered with a quality score threshold of 30 are assigned to reference genomes with the equivalent of the following sourmash gather commands (one command for each value of k: 21, 31, and 51):
sourmash gather -k 51 SRR606249_subset10_trim30_scaled10000.k51.sig genbank-k51.sbt.json refseq-k51.sbt.json -o SRR606249_subset10_trim30_k51.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k51.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k51.gather_matches.csv
sourmash gather -k 31 SRR606249_subset10_trim30_scaled10000.k31.sig genbank-k31.sbt.json refseq-k31.sbt.json -o SRR606249_subset10_trim30_k31.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k31.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k31.gather_matches.csv
sourmash gather -k 21 SRR606249_subset10_trim30_scaled10000.k21.sig genbank-k21.sbt.json refseq-k21.sbt.json -o SRR606249_subset10_trim30_k21.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k21.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k21.gather_matches.csv
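Since the three gather commands for a given trim level differ only in the k-mer length, they can be generated with a loop. A sketch, echoed as a dry run so nothing is executed (drop the echo to run the real commands, which require sourmash and the SBT databases):

```shell
sample=SRR606249_subset10_trim2   # or SRR606249_subset10_trim30

# Dry run: print one sourmash gather command per k-mer length.
for k in 21 31 51; do
  echo sourmash gather -k "$k" "${sample}_scaled10000.k${k}.sig" \
    "genbank-k${k}.sbt.json" "refseq-k${k}.sbt.json" \
    -o "${sample}_k${k}.gather_output.csv" \
    --output-unassigned "${sample}_k${k}.gather_unassigned.csv" \
    --save-matches "${sample}_k${k}.gather_matches.csv"
done
```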
The equivalents of the following commands generate taxonomic classification outputs with Kraken2 for trim2 and trim30:
kraken2 --db minikraken_8GB_20200312/ --paired SRR606249_subset10_1_reads_trim2_1.fq.gz SRR606249_subset10_1_reads_trim2_2.fq.gz --use-names --confidence 0 --report SRR606249_subset10_trim2_kraken2.report --unclassified-out SRR606249_subset10_trim2_kraken2_unclassified#.fq.gz --classified-out SRR606249_subset10_trim2_kraken2_classified#.fq.gz --output SRR606249_subset10_trim2_kraken2.out.txt
kraken2 --db minikraken_8GB_20200312/ --paired SRR606249_subset10_1_reads_trim30_1.fq.gz SRR606249_subset10_1_reads_trim30_2.fq.gz --use-names --confidence 0 --report SRR606249_subset10_trim30_kraken2.report --unclassified-out SRR606249_subset10_trim30_kraken2_unclassified#.fq.gz --classified-out SRR606249_subset10_trim30_kraken2_classified#.fq.gz --output SRR606249_subset10_trim30_kraken2.out.txt
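The `--report` file is the easiest place to inspect the results: Kraken2 reports are tab-separated with the columns percent of reads in the clade, clade read count, direct read count, rank code, NCBI taxid, and scientific name. A small sketch on made-up data showing how to pull out the most abundant species-level calls (the organisms and counts below are invented for illustration):

```shell
# Build a toy three-line Kraken2 report (tab-separated fields:
# percent, clade reads, direct reads, rank code, taxid, name).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  45.10 45100 0 S 562  'Escherichia coli' \
  30.25 30250 0 S 1280 'Staphylococcus aureus' \
  5.00  5000  0 G 561  'Escherichia' > toy_kraken2.report

# Keep species-level rows (rank code S) and list read count + name,
# most abundant first:
awk -F'\t' '$4 == "S" {print $2 "\t" $6}' toy_kraken2.report | sort -k1,1nr
# 45100   Escherichia coli
# 30250   Staphylococcus aureus
```

The same one-liner applied to the real `*_kraken2.report` files gives a quick ranked species list without opening the full per-read output.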
The equivalents of the following commands generate Bracken species-level taxonomic classification outputs using the Kraken2 reports previously generated with trim2 and trim30:
bracken -d minikraken_8GB_20200312/ -i SRR606249_subset10_1_reads_trim2_kraken2_minikraken_8GB_20200312_confidence0.report -r 100 -l S -t 0 -o SRR606249_subset10_1_reads_trim2_kraken2_minikraken_8GB_20200312_confidence0_bracken_db-default_r-100_l-S_t-0
bracken -d minikraken_8GB_20200312/ -i SRR606249_subset10_1_reads_trim30_kraken2_minikraken_8GB_20200312_confidence0.report -r 100 -l S -t 0 -o SRR606249_subset10_1_reads_trim30_kraken2_minikraken_8GB_20200312_confidence0_bracken_db-default_r-100_l-S_t-0
The equivalents of the following commands generate taxonomic classification outputs with KrakenUniq for trim2 and trim30:
krakenuniq --paired SRR606249_subset10_1_reads_trim2_1.fq.gz SRR606249_subset10_1_reads_trim2_2.fq.gz --db minikraken_20171019_8GB/ --hll-precision 12 --report-file SRR606249_subset10_trim2_krakenuniq_report --unclassified-out SRR606249_subset10_trim2_krakenuniq_unclassified --classified-out SRR606249_subset10_trim2_krakenuniq_classified --output SRR606249_subset10_trim2_krakenuniq_out
krakenuniq --paired SRR606249_subset10_1_reads_trim30_1.fq.gz SRR606249_subset10_1_reads_trim30_2.fq.gz --db minikraken_20171019_8GB/ --hll-precision 12 --report-file SRR606249_subset10_trim30_krakenuniq_report --unclassified-out SRR606249_subset10_trim30_krakenuniq_unclassified --classified-out SRR606249_subset10_trim30_krakenuniq_classified --output SRR606249_subset10_trim30_krakenuniq_out
The equivalents of the following commands will run Kaiju and generate its standard report for all of the organisms identified (one command is listed below for the trim2 filtered reads and one for trim30):
kaiju -x -v -t nodes.dmp -f kaiju_db_nr_euk.fmi -i SRR606249_subset10_trim2_1.fq.gz -j SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2.kaiju.out -z 4
kaiju -x -v -t nodes.dmp -f kaiju_db_nr_euk.fmi -i SRR606249_subset10_trim30_1.fq.gz -j SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30.kaiju.out -z 4
The standard kaiju report includes NCBI taxonomy IDs, but it does not include the taxon names. The equivalents of the following commands are used to append the corresponding taxon names to the last column in Kaiju's output file with `addTaxonNames`:
addTaxonNames -t nodes.dmp -n names.dmp -u -p -i SRR606249_subset10_trim2.kaiju.out -o SRR606249_subset10_trim2.kaiju_names.out
addTaxonNames -t nodes.dmp -n names.dmp -u -p -i SRR606249_subset10_trim30.kaiju.out -o SRR606249_subset10_trim30.kaiju_names.out
A summary report for a specified taxon rank can be created from the kaiju output with `kaijuReport`. These summary reports can be directly input into krona with `ktImportText`.
The equivalents of the following commands generate genus-level summaries with `kaijuReport` for all organisms identified in the standard kaiju report:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -o SRR606249_subset10_trim2.kaiju_genus.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -o SRR606249_subset10_trim30.kaiju_genus.summary
The equivalents of the following commands generate species-level summaries with `kaijuReport` for all organisms identified in the standard kaiju report:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r species -o SRR606249_subset10_trim2.kaiju_out_species.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r species -o SRR606249_subset10_trim30.kaiju_out_species.summary
The equivalents of the following commands generate genus-level summaries with `kaijuReport` for all genera comprising at least 1% of the total reads:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -m 1 -o SRR606249_subset10_trim2.kaiju_genus_filtered1_total.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -m 1 -o SRR606249_subset10_trim30.kaiju_genus_filtered1_total.summary
The equivalents of the following commands generate genus-level summaries with `kaijuReport` for genera comprising at least 1% of all classified reads:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -m 1 -u -o SRR606249_subset10_trim2.kaiju_genus_filtered1_classified.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -m 1 -u -o SRR606249_subset10_trim30.kaiju_genus_filtered1_classified.summary
The equivalents of the following commands are used to generate krona visualizations from the `kaijuReport` genus-level summaries with `ktImportText` (three different krona plots for trim2 and three for trim30):
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona.html SRR606249_subset10_trim2.kaiju_genus.summary
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona_filtered1_total.html SRR606249_subset10_trim2.kaiju_genus_filtered1_total.summary
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona_filtered1_classified.html SRR606249_subset10_trim2.kaiju_genus_filtered1_classified.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona.html SRR606249_subset10_trim30.kaiju_genus.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona_filtered1_total.html SRR606249_subset10_trim30.kaiju_genus_filtered1_total.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona_filtered1_classified.html SRR606249_subset10_trim30.kaiju_genus_filtered1_classified.summary
For the command-line equivalents used to execute the MTSv analyses (Readprep, Binning, Analysis, and Summary), please see the Binning and Analysis Quick Start Guide on the MTSv GitHub page.
Below is a more detailed description of the output files expected in the `metscale/workflows/data/` directory after the taxonomic classification workflow has been successfully run.
Using the filtered reads generated by the Read Filtering Workflow:
File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_1.fq.gz` | 381 MB |
`SRR606249_subset10_1_reads_trim2_2.fq.gz` | 374 MB |
`SRR606249_subset10_1_reads_trim30_1.fq.gz` | 327 MB |
`SRR606249_subset10_1_reads_trim30_2.fq.gz` | 313 MB |
The following Mash screen and Mash distance files are produced for the trim30 filtered reads:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_screen.tab` | 6.7 MB |
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_screen.sorted.tab` | 6.7 MB |
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_distances.tab` | 11 MB |
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_distances.sorted.tab` | 11 MB |
The following files are produced by sourmash after computing the signatures of filtered reads with the `tax_class_signatures_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig` | 1.1 MB |
`SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig` | 883 KB |
The following files are produced by sourmash after running sourmash gather with the `tax_class_gather_workflow` rule (this rule produces three files for each value of k and trim quality threshold):

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_k21.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim2_k21.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim2_k21.gather_unassigned.csv` | 52 KB |
`SRR606249_subset10_1_reads_trim30_k21.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim30_k21.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim30_k21.gather_unassigned.csv` | 11 KB |
`SRR606249_subset10_1_reads_trim2_k31.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim2_k31.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim2_k31.gather_unassigned.csv` | 58 KB |
`SRR606249_subset10_1_reads_trim30_k31.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim30_k31.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim30_k31.gather_unassigned.csv` | 12 KB |
`SRR606249_subset10_1_reads_trim2_k51.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim2_k51.gather_output.csv` | 20 KB |
`SRR606249_subset10_1_reads_trim2_k51.gather_unassigned.csv` | 57 KB |
`SRR606249_subset10_1_reads_trim30_k51.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim30_k51.gather_output.csv` | 20 KB |
`SRR606249_subset10_1_reads_trim30_k51.gather_unassigned.csv` | 14 KB |
The `f_match` value reported by sourmash in the *.gather_output.csv files is the fraction of each matched reference genome that is contained within the query metagenome. Matched reference genomes are listed by their GenBank IDs, which can be connected to NCBI TaxIDs and species names.
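The f_match column can be pulled out by name from the CSV header so the best-covered reference genomes sort to the top. A sketch on a made-up two-row file (real gather CSVs have more columns, possibly in a different order, and quoted fields can contain commas, so a naive comma split is only a rough tool):

```shell
# Toy gather_output.csv with a header row naming the columns.
printf '%s\n' 'intersect_bp,f_orig_query,f_match,name' \
  '500000,0.12,0.98,ref_A' \
  '20000,0.01,0.07,ref_B' > toy_gather_output.csv

# Find the f_match column from the header, then print each match's name
# (assumed to be the last column here) with its f_match, best covered first:
awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "f_match") c = i; next }
         { print $NF, $c }' toy_gather_output.csv | sort -k2,2nr
# ref_A 0.98
# ref_B 0.07
```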
For the example dataset, there was no loss of sensitivity in increasing the k-mer length or trimming quality threshold - all 62 species were detected in all scenarios. Although different k-mer lengths and read trimming parameters produced similar sourmash gather f_match
results, the smallest fraction of reference genomes was typically detected in the k=51 trim30 dataset. Increasing specificity with longer k-mer lengths and more aggressive trimming tended to lower the fraction of each true positive reference genome detected; however, in some cases the increased specificity of longer k-mer lengths resulted in detecting more of the reference genome than shorter k-mer lengths (e.g., Thermotoga sp. RQ2).
Although sourmash gather performs best at the species level, it will report matches below the species level. Two of the species in the Shakya subset 10 example dataset had strain variants, and no false positive strains were detected for these organisms (i.e., Methanococcus maripaludis strain C5, Methanococcus maripaludis strain S2, Shewanella baltica strain OS185, Shewanella baltica strain OS223). False positives for other organisms detected at the species level tended to have smaller f_match
values, which suggests that there may be a threshold of genome coverage required for a high confidence species call to be made. Alternatively, these could be low abundance, real contaminants within the sample that was sequenced.
Overall, k-mer length had a larger impact on sourmash results than the trimming quality threshold. The smallest fraction of false positive genomes was detected in the k=51 datasets. The k=51 and trim30 parameters are recommended as defaults if users only want to run sourmash with one trimming quality threshold and value of k.
The following *.out files are produced by kaiju to report the taxonomic classifications of reads after running the `tax_class_kaijureport_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju.out` | 628 MB |
`SRR606249_subset10_1_reads_trim30.kaiju.out` | 592 MB |
The `tax_class_add_taxonnames_workflow` rule will append taxon names to the *.out files to produce the following files:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_names.out` | 1.3 GB |
`SRR606249_subset10_1_reads_trim30.kaiju_names.out` | 1.2 GB |
Kaiju can produce summary reports at different taxon levels, which can also be input into krona for visualizations of the summarized data. The following species-level summaries can be generated for all kaiju results with the `tax_class_kaiju_species_summary_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_out_species.summary` | 241 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_out_species.summary` | 204 KB |
Kaiju detected (i.e., at least one read was assigned) >5,000 species in the SRR606249_subset10_1_reads_trim2 dataset and >4,000 species in the SRR606249_subset10_1_reads_trim30 dataset. Although a thresholding approach with the example dataset would not have cleanly achieved 100% precision and recall with kaiju as with sourmash, requiring a minimum number of reads for a high confidence kaiju species detection would substantially reduce the number of false positives. In the trim30 dataset, a threshold of >6,000 assigned reads would have identified 57 true positives (out of an expected 62 species) and one false positive (Proteiniclasticum ruminis = 48,586 reads), which in this case could be a real species contaminant in the sample, and five false negatives (Enterococcus faecalis = 5,079 reads; Chloroflexus aurantiacus = 4,807 reads; Bordetella bronchiseptica = 3,309 reads; Fusobacterium nucleatum = 3,156 reads; Thermoanaerobacter pseudethanolicus = 93 reads). Further benchmarking work with additional datasets will be important to determine how to best apply interpretation guidelines to kaiju results.
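As a sketch of such a read-count threshold, assuming the summary's tab-separated percent / read count / taxon name layout (the rows below are made up for illustration; real kaijuReport files may carry header or footer lines that need skipping):

```shell
# Toy kaiju species summary: percent, assigned reads, taxon name.
printf '%s\t%s\t%s\n' \
  12.500 48586 'Proteiniclasticum ruminis' \
  1.300  5079  'Enterococcus faecalis' \
  0.020  93    'Thermoanaerobacter pseudethanolicus' > toy_species.summary

# Keep only species with more than 6,000 assigned reads:
awk -F'\t' '$2 > 6000 {print $3}' toy_species.summary
# Proteiniclasticum ruminis
```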
The following genus-level summary files are produced by kaiju for all filtered reads after running the `tax_class_kaijureport_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus.summary` | 54 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus.summary` | 47 KB |
The *.kaiju_genus.summary files can be input into krona with the `tax_class_visualize_krona_kaijureport_workflow` rule to produce the following krona plots of all genus-level results:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus_krona.html` | 371 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_krona.html` | 353 KB |
Because of the high number of organisms detected by kaiju, removing the organisms detected at low abundance makes the krona plots easier to read. The `tax_class_kaijureport_filtered_workflow` rule will produce the following genus-level summary reports that exclude genera with <1% of total reads, and krona will produce corresponding visualizations with the `tax_class_visualize_krona_kaijureport_filtered_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus_filtered1_total.summary` | 1.3 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_filtered1_total.summary` | 1.3 KB |
`SRR606249_subset10_1_reads_trim2.kaiju_genus_krona_filtered1_total.html` | 227 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_krona_filtered1_total.html` | 227 KB |
The `tax_class_kaijureport_filteredclass_workflow` rule will produce the following genus-level summary reports that exclude genera with <1% of all classified reads, and krona will produce corresponding visualizations with the `tax_class_visualize_krona_kaijureport_filteredclass_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus_filtered1_classified.summary` | 1.5 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_filtered1_classified.summary` | 1.5 KB |
`SRR606249_subset10_1_reads_trim2.kaiju_genus_krona_filtered1_classified.html` | 227 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_krona_filtered1_classified.html` | 227 KB |
The following files are produced by Kraken2 with the `tax_class_kraken2_workflow` rule with trim30:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim30_kraken2_unclassified_minikraken2_v2_8GB_201904_UPDATE_confidence0_1.fq` | 113 MB |
`SRR606249_subset10_1_reads_trim30_kraken2_unclassified_minikraken2_v2_8GB_201904_UPDATE_confidence0_2.fq` | 111 MB |
`SRR606249_subset10_1_reads_trim30_kraken2_minikraken2_v2_8GB_201904_UPDATE_confidence0.report` | 192 KB |
`SRR606249_subset10_1_reads_trim30_kraken2_minikraken2_v2_8GB_201904_UPDATE_confidence0.out` | 755 MB |
The following files are produced by KrakenUniq with the `tax_class_krakenuniq_workflow` rule with trim30:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_classified` | 849 MB |
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_unclassified` | 166 MB |
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_report` | 167 KB |
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_out` | 502 MB |
The following files are produced by MTSv with the `tax_class_run_mtsv_workflow` rule with trim2 and trim30:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_MTSV/Summary/summary.csv` | 44 KB |
`SRR606249_subset10_1_reads_trim30_MTSV/Summary/summary.csv` | 36 KB |