08. Taxonomic Classification
This workflow performs taxonomic classification on filtered Illumina paired-end reads with Mash version 2.2, Sourmash version 2.1.0, Kaiju version 1.6.1, Kraken2 version 2.0.8-beta, Bracken version 2.2, KrakenUniq version 0.5.8, and MTSv version 1.0.6, and visualizes Kaiju results with KronaTools version 2.7. It has been tested to run offline following execution of the Read Filtering Workflow. Kaiju can also be run on assembled contigs, which are output by the Assembly Workflow.
If you have not already, you will need to activate your metscale environment and perform the Offline Setup before running the taxonomic classification workflow:
```
[user@localhost ~]$ conda activate metscale
(metscale)[user@localhost ~]$ cd metscale/workflows
(metscale)[user@localhost workflows]$ python download_offline_files.py --workflow taxonomic_classification
```
In the metscale/container_images/ directory, you should see the following Singularity images that were created when running the taxonomic_classification or all flags during the Offline Setup:
Tool | Singularity Image | Size |
---|---|---|
Mash | mash_2.2--h3d38be6_0.sif | 44 MB |
Sourmash | sourmash_2.1.0--py27he1b5a44_0.sif | 271 MB |
Kaiju | kaiju_1.6.1--pl5.22.0_0.sif | 37 MB |
Krona | krona_2.7--pl5.22.0_1.sif | 21 MB |
Kraken2 | kraken2_2.0.8_beta--pl526h6bb024c_0.sif | 327 MB |
Bracken | bracken_2.2--py27h2d50403_1.sif | 262 MB |
KrakenUniq | krakenuniq_0.5.8--pl526he860b03_0.sif | 54 MB |
MTSv | mtsv_1.0.6--py36hc9558a2_1.sif | 675 MB |
You should also have the following databases downloaded in the metscale/workflows/data directory:
Tool | Downloaded Database | Description |
---|---|---|
Mash | https://obj.umiacs.umd.edu/diamond/RefSeq_10K_Sketches.msh.gz | Microbial signatures from NCBI RefSeq, as used in the original Mash publication |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k21-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 21 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 31 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k51-2017.05.09.tar.gz | Microbial genome signatures from NCBI GenBank, prepared with a k-mer length of 51 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k21-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 21 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k31-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 31 |
Sourmash | s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-refseq-sbt-k51-2017.05.09.tar.gz | Microbial genome signatures from NCBI RefSeq, prepared with a k-mer length of 51 |
Kaiju | http://kaiju.binf.ku.dk/database/kaiju_index_nr_euk.tgz | Protein sequences from NCBI nr, including bacteria, archaea, viruses, & fungi |
Kraken2 | ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz | Kraken 2 database comprised of RefSeq bacterial, archaeal, and viral genomes, as well as the GRCh38 human genome |
Bracken | ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz | Contains three Bracken database files within this sub-directory: database{100, 150, 200}mers.kmer_distrib |
KrakenUniq | https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_8GB.tgz | Kraken 1 database comprised of RefSeq bacterial, archaeal, and viral genomes |
MTSv | https://rcdata.nau.edu/fofanov_lab/Compressed_MTSV_database_files/complete_genome.tar.gz | Directory and sub-directory that contain the complete reference database files for MTSv |
Tool | Downloaded Database | Size | MD5 Checksum |
---|---|---|---|
Mash | RefSeq_10K_Sketches.msh.gz | 6.3 GB | 01de4f52d6fe509e146fdeacee0fb8d1 * |
Sourmash | microbe-genbank-sbt-k21-2017.05.09.tar.gz | 4.2 GB | d9ed08309ff56a558dcabf4ef1af2c4e ** |
Sourmash | microbe-genbank-sbt-k31-2017.05.09.tar.gz | 4.2 GB | 759b18b7dcb4f42ff04592d94db3685e ** |
Sourmash | microbe-genbank-sbt-k51-2017.05.09.tar.gz | 4.2 GB | 2dbe310b7f5a396e76ceba155ff5d424 ** |
Sourmash | microbe-refseq-sbt-k21-2017.05.09.tar.gz | 3.6 GB | bf594377fe644a741a7cbd78e0b791ac ** |
Sourmash | microbe-refseq-sbt-k31-2017.05.09.tar.gz | 3.6 GB | 8bec1b006779fe4100a58dcf23c9d92f ** |
Sourmash | microbe-refseq-sbt-k51-2017.05.09.tar.gz | 3.6 GB | eb4e5e1210403baab4f0fbe86ecc8917 ** |
Kaiju | kaiju_index_nr_euk.tgz | 28 GB | 89d4dab3c6fd23432d99357e0c4687de * |
KrakenUniq | minikraken_20171019_8GB.tgz | 5.7 GB | 5d626afb34d87925990889d27e411c29 * |
Kraken2 and Bracken | minikraken_8GB_202003.tgz | 5.6 GB | 51f9cd572b0630fdaf4b5595c672b5b9 * |
MTSv | complete_genome.tar.gz | 39 GB | a2dcdbb77ab38e4b474eefb6dddadab1 * |
*Uncompressed during offline setup
**Uncompressed during tool execution
If you are missing any of these files, or if you suspect any of these databases are corrupted (e.g., from interruptions during the offline download), you should re-run the appropriate setup command, following the instructions in the Offline Setup. These databases can take a while to download, and an interrupted download is a common reason for the taxonomic classification workflow failing to run properly. You can check whether the MD5 checksums of your downloaded databases match the expected MD5 checksums in the table above. Here is an example of how to run the md5sum command:
```
$ cd metscale/workflows/data
$ md5sum kaiju_index_nr_euk.tgz
89d4dab3c6fd23432d99357e0c4687de  kaiju_index_nr_euk.tgz
```
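If md5sum is not available on your system, the same check can be scripted in Python. This is a minimal sketch; the helper name md5_of_file is ours for illustration and is not part of MetScale:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 in 1 MB chunks, so multi-GB databases
    never have to fit in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned hex digest against the table above flags a corrupted download before the workflow fails mid-run.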
If the databases were uncompressed and used in a previous workflow, they will appear like this in the metscale/workflows/data directory in the future (note that some of these are hidden directories):
Tool | Uncompressed Database | Size | MD5 Checksum |
---|---|---|---|
Mash | RefSeq_10K_Sketches.msh | 6.8 GB | f89c78f3dfbed846dc48727642a97c3a |
Sourmash | .sbt.genbank-k21/ | 18 GB | uncompressed hidden directory |
Sourmash | .sbt.genbank-k31/ | 18 GB | uncompressed hidden directory |
Sourmash | .sbt.genbank-k51/ | 18 GB | uncompressed hidden directory |
Sourmash | genbank-k21.sbt.json | 27 MB | b22f0e51672019c04423d8afb03175a0 |
Sourmash | genbank-k31.sbt.json | 27 MB | 62c475f0f8357a7219f15f466814e9d3 |
Sourmash | genbank-k51.sbt.json | 27 MB | 690f9de5f7100f2ecae3ef6aab0a39e0 |
Sourmash | .sbt.refseq-k21/ | 15 GB | uncompressed hidden directory |
Sourmash | .sbt.refseq-k31/ | 15 GB | uncompressed hidden directory |
Sourmash | .sbt.refseq-k51/ | 15 GB | uncompressed hidden directory |
Sourmash | refseq-k21.sbt.json | 25 MB | 8d33669b8618cb3e76e5cd8e48a814c1 |
Sourmash | refseq-k31.sbt.json | 25 MB | 5347dc7fca3779ae47f2fcfd4f50b15c |
Sourmash | refseq-k51.sbt.json | 25 MB | 46a5638f5fa07afe4811a58e9f7c7ded |
Kaiju | kaiju_db_nr_euk.fmi | 48 GB | f9e30d8a03925a07ff19c32bcf05df3d |
Kaiju | names.dmp | 134 MB | 2e3e40199d42a529802a02f118fb033b |
Kaiju | nodes.dmp | 104 MB | 8c4e4d3acdec1a0a837ab71d53c06828 |
Kraken2 | minikraken_8GB_20200312/hash.k2d | 7.5 GB | 1e7612c690b728ecefecb8d4a182f58c |
Kraken2 | minikraken_8GB_20200312/opts.k2d | 56 B | 0107fefa85081a8ad341e75c8ee300b2 |
Kraken2 | minikraken_8GB_20200312/seqid2taxid.map | 2.8 MB | 4be51cd7214ff22bbd6339821b08b06b |
Kraken2 | minikraken_8GB_20200312/taxo.k2d | 2.2 MB | b99be394ad419e64c8c55aea14b238f7 |
Bracken | minikraken_8GB_20200312/database50mers.kmer_distrib | 2.9 MB | 5a46e3b8e4585d5f58f2f10879d28272 |
Bracken | minikraken_8GB_20200312/database100mers.kmer_distrib | 2.9 MB | 3a737368f629627995026dc1b40711b1 |
Bracken | minikraken_8GB_20200312/database150mers.kmer_distrib | 2.6 MB | d1eb27e527b36469cd19aef62d491ebe |
Bracken | minikraken_8GB_20200312/database200mers.kmer_distrib | 2.4 MB | 38b3320b6701a11ab79f88ea3ea01b77 |
Bracken | minikraken_8GB_20200312/database250mers.kmer_distrib | 2.2 MB | 603413ad165b860e5fe5af954402268a |
KrakenUniq | minikraken_20171019_8GB/database.idx | 513 MB | 3e7cf2135ef4945b92ed02dad96c8f65 |
KrakenUniq | minikraken_20171019_8GB/database.kdb | 7.6 GB | 7ab07372f6fdb7b78322b7802f72d6bc |
KrakenUniq | minikraken_20171019_8GB/database.kdb.counts | 178 KB | c9791458a3bbe97cc8db0b22016894d1 |
KrakenUniq | minikraken_20171019_8GB/taxDB | 77 MB | 90132b2ad1e02ca4e8fc50cdab905c67 |
KrakenUniq | minikraken_20171019_8GB/taxonomy/names.dmp | 139 MB | 107ed68d9a24fc47f2e46d4c53206ca8 |
KrakenUniq | minikraken_20171019_8GB/taxonomy/nodes.dmp | 108 MB | dc6023b70a83f5f01eb2a2dd9d883a07 |
MTSv | Oct-28-2019/artifacts/complete_genome.fas | 121 GB | aad15187a99c26fc234725932f411574 |
MTSv | Oct-28-2019/artifacts/complete_genome_ff.txt | 5.8 MB | 6b515496a94240ce27b343b03b8dd66b |
MTSv | Oct-28-2019/artifacts/complete_genome.json | 17 KB | variable |
MTSv | Oct-28-2019/artifacts/complete_genome.p | 157 MB | 7ee3e27583f1d027a36ef92a723ed91f |
MTSv | Oct-28-2019/artifacts/nucl_gb.accession2taxid.gz | 1.8 GB | 93dcf9204dfe383aca117f4b978c4ed9 |
MTSv | Oct-28-2019/artifacts/nucl_wgs.accession2taxid.gz | 3.2 GB | d954bf090d05eb39778b8dd18364762f |
MTSv | Oct-28-2019/artifacts/pdb.accession2taxid.gz | 3.2 MB | 64de81d3ad2cc772b8978a85634deefc |
MTSv | Oct-28-2019/artifacts/prot.accession2taxid.gz | 5.4 MB | 7d197b60ca8408b860916be7c13f3380 |
MTSv | Oct-28-2019/artifacts/taxdump.tar.gz | 48 MB | c9c83e0f08c66d9f2e7eee65194bb519 |
MTSv | Oct-28-2019/artifacts/tree.index | 133 MB | b8779584df7c78947ae79c08b9d60aa0 |
The taxonomic classification workflow uses input files generated from prior workflows in the MetScale pipeline. The input files for each of the taxonomic classification tools are described below, and all of these files should be located in the metscale/workflows/data directory.
If the Comparison Workflow was previously run, the taxonomic classification workflow will use the previously generated MinHash signatures as input files for sourmash gather. These are the files that you should see in the metscale/workflows/data directory after generating signatures for the example SRR606249_subset10 dataset in the Comparison Workflow:
File Name | File Size | Signature Source |
---|---|---|
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig | 1.1 MB | Signatures from conservatively filtered reads (trim2) |
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig | 883 KB | Signatures from aggressively filtered reads (trim30) |
Note that running the Comparison Workflow before taxonomic classification is not required. If the comparison workflow was not run first, the signatures will be generated from the filtered reads when the taxonomic classification workflow is executed. Each signature file includes three k-mer lengths (k=21, k=31, k=51), corresponding to the pre-computed databases used with sourmash gather in the taxonomic classification workflow.
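The scaled10000 part of the signature file names refers to sourmash's "scaled" MinHash: only k-mer hashes below 2**64 / scaled are kept, so each sketch retains roughly one in every 10,000 distinct k-mer hashes regardless of input size. A toy sketch of the idea follows; it uses MD5 as a stand-in for the MurmurHash function sourmash actually uses, so the retained hashes will differ from a real signature:

```python
import hashlib

def kmer_hashes(seq, k):
    """Map each k-mer to a 64-bit integer (MD5 stands in here for the
    hash function sourmash actually uses)."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].encode()
        yield int.from_bytes(hashlib.md5(kmer).digest()[:8], "big")

def scaled_minhash(seq, k=21, scaled=10000):
    """Keep only hashes below 2**64 // scaled: roughly 1/scaled of all
    distinct k-mers survive, no matter how large the input is."""
    threshold = 2**64 // scaled
    return {h for h in kmer_hashes(seq, k) if h < threshold}
```

Because the cutoff is a fixed fraction of the hash space, sketches of different datasets remain directly comparable, which is what lets sourmash gather match a metagenome signature against the pre-computed database signatures.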
Below are the input files for the lowest common ancestor (LCA) read-based assignment tools. You should see these in the metscale/workflows/data directory after running the example SRR606249_subset10 dataset through the read filtering workflow:
File Name | File Size |
---|---|
SRR606249_subset10_1_reads_trim2_1.fq.gz | 381 MB |
SRR606249_subset10_1_reads_trim2_2.fq.gz | 374 MB |
SRR606249_subset10_1_reads_trim30_1.fq.gz | 365 MB |
SRR606249_subset10_1_reads_trim30_2.fq.gz | 359 MB |
If the complete Read Filtering Workflow was previously run, the taxonomic classification workflow will use the previously generated trimmed, interleaved, and zipped file(s) as input files for Mash dist, Mash screen, and MTSv. You should see these interleaved files in the metscale/workflows/data directory after generating them for the example SRR606249_subset10 dataset in the Read Filtering Workflow:
File Name | File Size | File Source |
---|---|---|
SRR606249_subset10_1_reads_trim2_interleaved_reads.fq.gz | 688 MB | Interleaved and zipped file from conservatively filtered reads (trim2) |
SRR606249_subset10_1_reads_trim30_interleaved_reads.fq.gz | 579 MB | Interleaved and zipped file from aggressively filtered reads (trim30) |
Note that running the Read Filtering Workflow before taxonomic classification is not required. If the full read filtering workflow was not run first, the interleaved files will be generated when the taxonomic classification workflow is executed. If you already have interleaved and zipped reads, you can also execute MTSv by jumping into the workflow (see Different Workflow Entry Points for time-stamping files to jump into the workflow with your interleaved *fq.gz files).
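Conceptually, interleaving just alternates four-line FASTQ records from the _1 and _2 files. A minimal sketch of the operation, for readers unfamiliar with the format (the helper name is illustrative; MetScale performs this step inside the read filtering workflow, not with this code):

```python
import gzip
from itertools import islice

def interleave_fastq(r1_path, r2_path, out_path):
    """Write records pair-by-pair (R1 record, then its mate from R2),
    assuming both inputs list mates in the same order."""
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2, \
         gzip.open(out_path, "wt") as out:
        while True:
            rec1 = list(islice(r1, 4))  # one FASTQ record = 4 lines
            rec2 = list(islice(r2, 4))
            if len(rec1) < 4 or len(rec2) < 4:
                break
            out.writelines(rec1 + rec2)
```

Keeping mates adjacent in one stream is what lets tools like Mash and MTSv consume paired-end data from a single input file.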
Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.
After the config file is ready, be sure to set the Singularity bind path from the metscale/workflows directory before running the taxonomic classification workflow:
```
cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"
```
Execution of the taxonomic classification workflow can then be performed with the following command:
```
snakemake --use-singularity {rules} {other options}
```
The following rules are available for execution in the taxonomic classification workflow (in the workflow diagram, yellow stars indicate the terminal rules). These rules and their parameters are listed under "workflows" in the metscale/workflows/config/default_workflowconfig.settings default config file:
Tool | Rule | Description |
---|---|---|
Mash | tax_class_mash_screen_workflow | Mash screens a database to estimate how well each genome is contained in a metagenome |
Mash | tax_class_mash_dist_workflow | Mash estimates the distance between two isolate genomes to assess similarity |
Sourmash | tax_class_signatures_workflow | Sourmash computes the MinHash signatures of filtered reads |
Sourmash | tax_class_gather_workflow | Sourmash gather finds the best match for each genome within a metagenome |
Kaiju | tax_class_kaijureport_workflow | With trimmed reads as inputs, Kaiju performs taxonomic classifications and reports all organisms identified |
Kaiju | tax_class_kaijureport_contigs_workflow | With contigs as inputs, Kaiju performs taxonomic classifications and reports all organisms identified |
Kraken2 | tax_class_kraken2_workflow | Executes Kraken2 on trimmed reads for taxonomic classification |
Bracken | tax_class_bracken_workflow | Executes Bracken for re-estimation of Kraken2 taxonomic classifications at specific taxon level(s) |
KrakenUniq | tax_class_krakenuniq_workflow | Executes KrakenUniq on trimmed reads for taxonomic classification |
MTSv | tax_class_create_mtsv_db_workflow | Installs and configures the downloaded MTSv database; this only needs to be executed the first time you run MTSv |
MTSv | tax_class_run_mtsv_workflow | Copies over the MTSv .cfg file, updates it with data specified in workflow parameters, and runs MTSv with the given parameters. This runs through MTSv init and analyses, including readprep, binning, analysis, reporting, and summary |
This command will execute mash screen on interleaved, filtered reads from metagenomes:
```
snakemake --use-singularity tax_class_mash_screen_workflow
```
This command will execute mash dist on interleaved, filtered reads from isolate genomes:
```
snakemake --use-singularity tax_class_mash_dist_workflow
```
If the interleaved files have not been generated before running Mash, they will be generated by the read filtering workflow when these commands are run.
NOTE: The tax_class_signatures_workflow rule runs by calling the comparison workflow's snakefile, which generates MinHash signatures. If the comparison workflow was already run, snakemake will recognize that the signature file already exists and will not re-execute. Because of this, if you wish to change parameters for the tax_class_signatures_workflow rule, you must also update those parameters in the comparison workflow to ensure that the intended signature file is generated.
To execute sourmash, you can run all rules independently or in tandem by specifying multiple sourmash rules in one run. The following command will run all of the available rules for sourmash:
```
snakemake --use-singularity tax_class_signatures_workflow tax_class_gather_workflow
```
The following command will run Kraken2 on your trimmed read files:
```
snakemake --use-singularity tax_class_kraken2_workflow
```
The following command will run Bracken on your Kraken2 output files. Alternatively, you can call this rule by itself: as a terminal rule, it will execute both Kraken2 and Bracken:
```
snakemake --use-singularity tax_class_bracken_workflow
```
The following command will run KrakenUniq on your trimmed read files:
```
snakemake --use-singularity tax_class_krakenuniq_workflow
```
The following command will run Kaiju on trimmed reads:
```
snakemake --use-singularity tax_class_kaijureport_workflow
```
The following command will run Kaiju on assembled contigs, which may help improve accuracy for this amino acid-based classifier:
```
snakemake --use-singularity tax_class_kaijureport_contigs_workflow
```
If users are interested in exploring how Kaiju's performance could be improved with different parameter and filtering options, a number of additional rules are available for Kaiju:
Additional Kaiju Rule | Description |
---|---|
tax_class_kaijureport_filtered_workflow | With filtered reads as inputs, Kaiju summarizes genera with >=1% of the total reads |
tax_class_kaijureport_filtered_contigs_workflow | With contigs as inputs, Kaiju summarizes genera with >=1% of the total reads |
tax_class_kaijureport_filteredclass_workflow | Kaiju summarizes genera with >=1% of all classified reads |
tax_class_kaijureport_filteredclass_contigs_workflow | With contigs as inputs, Kaiju summarizes genera with >=1% of all classified reads |
tax_class_add_taxonnames_workflow | Kaiju adds taxon names to its report |
tax_class_add_taxonnames_contigs_workflow | Kaiju adds taxon names to its report generated from contig file(s) |
tax_class_kaiju_species_summary_workflow | Kaiju provides a species-level summary of the organisms identified |
tax_class_kaiju_species_summary_contigs_workflow | Kaiju provides a species-level summary of the organisms identified from contig file(s) |
tax_class_visualize_krona_kaijureport_workflow | Krona plots are generated from all Kaiju genus-level results |
tax_class_visualize_krona_kaijureport_contigs_workflow | Krona plots are generated from all Kaiju genus-level results generated from contig file(s) |
tax_class_visualize_krona_kaijureport_filtered_workflow | Krona plots are generated from Kaiju genera with >=1% of the total reads |
tax_class_visualize_krona_kaijureport_filtered_contigs_workflow | Krona plots are generated from Kaiju genera with >=1% of the total reads, generated from contig file(s) |
tax_class_visualize_krona_kaijureport_filteredclass_workflow | Krona plots are generated from Kaiju genera with >=1% of all classified reads |
tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow | Krona plots are generated from Kaiju genera with >=1% of all classified reads, generated from contig file(s) |
tax_class_visualize_krona_species_summary_workflow | Krona plots are generated from the Kaiju species summary report |
tax_class_visualize_krona_species_summary_contigs_workflow | Krona plots are generated from the Kaiju species summary report generated from contig file(s) |
The following command will execute all available rules for Kaiju on trimmed reads:
```
snakemake --use-singularity tax_class_add_taxonnames_workflow tax_class_kaiju_species_summary_workflow tax_class_visualize_krona_kaijureport_filtered_workflow tax_class_visualize_krona_kaijureport_filteredclass_workflow tax_class_visualize_krona_species_summary_workflow
```
The following command will execute all available rules for Kaiju and Krona, but using contigs as inputs instead of reads:
```
snakemake --use-singularity tax_class_add_taxonnames_contigs_workflow tax_class_visualize_krona_kaijureport_contigs_workflow tax_class_visualize_krona_kaijureport_filtered_contigs_workflow tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow tax_class_visualize_krona_species_summary_contigs_workflow
```
To execute MTSv, you can run both rules independently or in tandem by specifying multiple MTSv rules in one run. The following command will run all of the available rules for MTSv:
```
snakemake --use-singularity tax_class_create_mtsv_db_workflow tax_class_run_mtsv_workflow
```
NOTE: The tax_class_create_mtsv_db_workflow rule only needs to be executed the first time MTSv is run; after that, all analyses can be run by executing tax_class_run_mtsv_workflow.
Additional options for snakemake can be found in the snakemake documentation.
To change or specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture.
After successful execution of the taxonomic classification workflow, you will find all of its outputs in the metscale/workflows/data/ directory. You should see the files below after running the example dataset with each of the taxonomic classification rules.
Mash Rule | Output File | Output Description |
---|---|---|
tax_class_mash_screen_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_screen.tab | List of genomes contained within the metagenome |
tax_class_mash_screen_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_screen.sorted.tab | List of genomes contained within the metagenome, sorted to show best hits at the top |
tax_class_mash_dist_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_distances.tab | List of genomes matching the query |
tax_class_mash_dist_workflow | {sample}_1_reads_trim{quality threshold}_{database}_mash_distances.sorted.tab | List of genomes matching the query, sorted to show best hits at the top |
Sourmash Rule | Output File | Output Description |
---|---|---|
tax_class_signatures_workflow | {sample}_1_reads_trim{quality threshold}_scaled{scaled value}.k{k values}.sig | MinHash sketches of filtered reads, generated and saved in a signature file by sourmash compute |
tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_unassigned.csv | Unassigned hashes in a single signature generated by sourmash gather |
tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_matches.csv | Signatures of matches generated by sourmash gather |
tax_class_gather_workflow | {sample}_1_reads_trim{quality threshold}_k{k value}.gather_output.csv | Taxonomic classifications generated by sourmash gather |
Although a few different files are generated by sourmash gather, only the *gather_output.csv file is used for human interpretation of the results. The following columns are expected in the primary sourmash gather output file, *gather_output.csv (see the sourmash documentation for more information):
Name | Description |
---|---|
intersect_bp | Approximate number of base pairs shared between the query metagenome and the matched reference genome |
f_orig_query | Fraction of the original query metagenome shared with the reference genome match |
f_match | Fraction of the reference genome match shared with the query metagenome |
f_unique_to_query | Fraction of the query metagenome that uniquely matches the displayed reference genome. Matches are subtracted from the query metagenome as they are found, so f_unique_to_query is iteratively calculated on the remaining metagenome after the prior matches are subtracted; moving from the top row to the bottom row, the matches are reported in the order they were identified in *gather_output.csv |
f_unique_weighted | f_unique_to_query weighted by abundance |
average_abund | Average abundance of the match |
name | Name of the match in the database |
filename | Name of the database Sequence Bloom Tree (SBT) used |
md5 | A checksum value to verify data integrity |
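The iterative subtraction behind f_unique_to_query can be illustrated with a toy greedy loop over hash sets. This is a conceptual sketch of gather's greedy assignment, not sourmash's implementation, and the function and genome names are invented:

```python
def greedy_gather(query_hashes, reference_sketches):
    """Repeatedly pick the reference sharing the most hashes with what
    remains of the query, report its unique fraction, then subtract it."""
    remaining = set(query_hashes)
    total = len(remaining)
    results = []
    while remaining:
        name, ref = max(reference_sketches.items(),
                        key=lambda item: len(remaining & item[1]))
        hit = remaining & ref
        if not hit:
            break  # nothing left matches any reference
        results.append((name, len(hit) / total))  # f_unique_to_query
        remaining -= hit
    return results
```

Because each match is removed before the next is sought, the reported fractions cover non-overlapping portions of the metagenome and sum to at most 1, in the order the matches were identified.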
Rule | Output File | Output Description |
---|---|---|
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_classified_{database}_confidence{threshold}_{read_direction}.fq | Classified reads from Kraken2 output |
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_unclassified_{database}_confidence{threshold}_{read_direction}.fq | Unclassified reads from Kraken2 output |
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{database}_confidence{threshold}.report | Report to summarize all Kraken2 read-based taxon classifications |
tax_class_kraken2_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{database}_confidence{threshold}.out | File generated to describe each of the Kraken2 classified (C) and unclassified (U) reads |
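For downstream scripting, the tab-separated .report file can be parsed directly. A minimal sketch, assuming the standard six-column Kraken2 report layout (percent of reads, clade read count, direct read count, rank code, NCBI taxid, indented name); the helper name is ours:

```python
def parse_kraken2_report(lines):
    """Turn Kraken2 report lines into dicts; leading spaces on the name
    column encode depth in the taxonomy tree (two spaces per level)."""
    rows = []
    for line in lines:
        pct, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank,
            "taxid": int(taxid),
            "depth": (len(name) - len(name.lstrip())) // 2,
            "name": name.strip(),
        })
    return rows
```

Filtering the resulting rows by rank code (e.g., "S" for species) is a quick way to pull a species table out of the report.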
Rule | Output File | Output Description |
---|---|---|
tax_class_bracken_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{kraken2 database}_confidence{threshold}_bracken.report | TSV report of Kraken2 results that are input into Bracken for re-classification |
tax_class_bracken_workflow | {sample}_1_reads_trim{quality threshold}_kraken2_{kraken2 database}_confidence{threshold}_bracken_db-{bracken database}_r-{read_length}_l-{level}_t-{threshold} | TSV file of Bracken re-classified reads from Kraken2 at a specific taxon level |
Rule | Output File | Output Description |
---|---|---|
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_classified | FASTA file of all classified reads identified by KrakenUniq |
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_unclassified | FASTA file of all unclassified reads identified by KrakenUniq |
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_report | Report to summarize all KrakenUniq read-based taxon classifications |
tax_class_krakenuniq_workflow | {sample}_1_reads_trim{quality threshold}_krakenuniq_{database}_hll{threshold}_out | Text file generated to describe all KrakenUniq classified (C) and unclassified (U) reads |
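The hll in these file names refers to KrakenUniq's distinguishing feature: it estimates the number of distinct k-mers covered per taxon with a HyperLogLog counter, which helps separate genuine hits from reads piling up on one short reference region. The quantity being estimated is simply the following (shown here as exact counting; KrakenUniq approximates it to save memory, and this helper is ours for illustration):

```python
def distinct_kmer_count(reads, k=31):
    """Count distinct k-mers across a taxon's assigned reads; a low count
    despite many reads suggests coverage of only a small reference region."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)
```

A taxon with thousands of assigned reads but very few distinct k-mers is more likely a contaminant or a conserved-region artifact than a genuinely present organism.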
Kaiju Rule | Output File | Output Description |
---|---|---|
tax_class_kaijureport_workflow or tax_class_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju.out | Standard Kaiju output file |
tax_class_kaijureport_workflow or tax_class_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus.summary | Genus-level summary of all Kaiju results |
tax_class_kaijureport_filtered_workflow or tax_class_kaijureport_filtered_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_filtered1_total.summary | Kaiju genus-level summary with no genera <1% of total reads |
tax_class_kaijureport_filteredclass_workflow or tax_class_kaijureport_filteredclass_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_filtered1_classified.summary | Kaiju genus-level summary with no genera <1% of all classified reads |
tax_class_add_taxonnames_workflow or tax_class_add_taxonnames_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_names.out | Kaiju output file with taxon names appended to it |
tax_class_kaiju_species_summary_workflow or tax_class_kaiju_species_summary_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_out_species.summary | Species-level summary of all Kaiju results |
The following columns are expected in the primary Kaiju output file, *.out (see the Kaiju documentation for more information):
Name | Description |
---|---|
Column 1 | Either C or U, indicating whether the read is classified or unclassified |
Column 2 | Name of the read |
Column 3 | NCBI taxon identifier of the assigned taxon |
Column 4 | The length or score of the best match used for classification |
Column 5 | The taxon identifiers of all database sequences with the best match |
Column 6 | The accession numbers of all database sequences with the best match |
Column 7 | Matching fragment sequence(s) |
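The columns above make the file easy to tally with a few lines of scripting. For example, a sketch that counts classified vs. unclassified reads and reads per taxon, assuming the tab-separated layout described above (the helper name is ours):

```python
from collections import Counter

def summarize_kaiju_out(lines):
    """Count classified (C) vs unclassified (U) reads, and reads per
    NCBI taxon ID among the classified ones (columns 1-3 of *.out)."""
    status, taxa = Counter(), Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        status[fields[0]] += 1
        if fields[0] == "C":
            taxa[fields[2]] += 1
    return status, taxa
```

The classified fraction from such a tally is a quick sanity check on database coverage before digging into the genus- and species-level summaries.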
Three rules are available to generate Krona plots from the genus-level summaries of the Kaiju results, and one rule will generate a Krona plot from the species-level summary.
Kaiju/Krona Rule | Output File | Output Description |
---|---|---|
tax_class_visualize_krona_kaijureport_workflow or tax_class_visualize_krona_kaijureport_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona.html | Krona plot of all Kaiju genus-level results |
tax_class_visualize_krona_kaijureport_filtered_workflow or tax_class_visualize_krona_kaijureport_filtered_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona_filtered1_total.html | Krona plot of genus-level Kaiju results after removing genera with <1% of total reads |
tax_class_visualize_krona_kaijureport_filteredclass_workflow or tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_genus_krona_filtered1_classified.html | Krona plot of genus-level Kaiju results after removing genera with <1% of all classified reads |
tax_class_visualize_krona_species_summary_workflow or tax_class_visualize_krona_species_summary_contigs_workflow | {sample}_1_reads_trim{quality threshold}_{assembler}.kaiju_species_krona.html | Krona plot of all species-level Kaiju results |
Rule | Output File | Output Description |
---|---|---|
tax_class_create_mtsv_db_workflow | Oct-28-2019/ | Unzips and formats the MTSv database in this subdirectory |
tax_class_run_mtsv_workflow | <your_sample_name>_MTSV/ | A directory with your sample name as the prefix is generated and contains all the output files of MTSv analyses, in the subdirectories Analysis/, Binning/, Logs/, Params/, QueryFastas/, Reports/, and Summary/ |
MTSv outputs include *html files with summary statistics (Report/binning_report.html, Report/readprep_report.html, and Report/summary_report.html). A final taxonomic summary table for each sample is written to Summary/summary.csv.
To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools.
Mash distance taxonomic classification outputs for trim2 are generated with the equivalent of the following commands:

```
mash sketch -m 2 SRR606249_subset10_trim2_interleaved_reads.fq
mash dist refseq.genomes.k21s1000.msh SRR606249_subset10_trim2_interleaved_reads.fq.msh > SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.tab
sort -gk3 SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.tab > SRR606249_subset10_trim2_refseq.genomes.k21s1000_mash_distances.sorted.tab
```
Mash distance taxonomic classification outputs for trim30 are generated with the equivalent of the following commands:

```
mash sketch -m 2 SRR606249_subset10_trim30_interleaved_reads.fq
mash dist RefSeq88n.msh SRR606249_subset10_trim30_interleaved_reads.fq.msh > SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.tab
sort -gk3 SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.tab > SRR606249_subset10_trim30_refseq.genomes.RefSeq88n_mash_distances.sorted.tab
```
Mash screen taxonomic classification outputs for trim2 are generated with the equivalent of the following commands:

```
mash screen -p 4 RefSeq88n.msh SRR606249_subset10_trim2_interleaved_reads.fq > SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.tab
sort -gr SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.tab > SRR606249_subset10_trim2_interleaved_reads.RefSeq88n.screen.sorted.tab
```
The equivalents of the following commands generate Mash screen taxonomic classification outputs with trim30:
mash screen -p 4 refseq.genomes.k21s1000.msh SRR606249_subset10_trim30_interleaved_reads.fq > SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.tab
sort -gr SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.tab > SRR606249_subset10_trim30_interleaved_reads.k21s1000.screen.sorted.tab
This *.sig file generated by `sourmash compute` is used as the input file for `sourmash compare` and `sourmash gather`. If it was not previously generated in the Comparison Workflow, the signature file will be generated by the `tax_class_signatures_workflow` rule.
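If you want to regenerate the signature outside the workflow, a command along the following lines should work with the sourmash 2.x CLI (the parameter values mirror the workflow's file-naming scheme; treat this as a sketch rather than the rule's exact invocation). It is echoed here as a dry run; drop the echo to execute it, which requires sourmash:

```shell
# Dry run: prints the sourmash compute command that would build the
# multi-k signature used by sourmash compare and sourmash gather.
echo sourmash compute --scaled 10000 -k 21,31,51 \
    -o SRR606249_subset10_trim2_scaled10000.k21_31_51.sig \
    SRR606249_subset10_trim2_interleaved_reads.fq
```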
The MinHash signatures of reads filtered with a quality score threshold of 2 are assigned to reference genomes with the equivalent of the following sourmash gather commands (one command for each value of k: 21, 31, and 51):
sourmash gather -k 51 SRR606249_subset10_trim2_scaled10000.k51.sig genbank-k51.sbt.json refseq-k51.sbt.json -o SRR606249_subset10_trim2_k51.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k51.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k51.gather_matches.csv
sourmash gather -k 31 SRR606249_subset10_trim2_scaled10000.k31.sig genbank-k31.sbt.json refseq-k31.sbt.json -o SRR606249_subset10_trim2_k31.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k31.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k31.gather_matches.csv
sourmash gather -k 21 SRR606249_subset10_trim2_scaled10000.k21.sig genbank-k21.sbt.json refseq-k21.sbt.json -o SRR606249_subset10_trim2_k21.gather_output.csv --output-unassigned SRR606249_subset10_trim2_k21.gather_unassigned.csv --save-matches SRR606249_subset10_trim2_k21.gather_matches.csv
The MinHash signatures of reads filtered with a quality score threshold of 30 are assigned to reference genomes with the equivalent of the following sourmash gather commands (one command for each value of k: 21, 31, and 51):
sourmash gather -k 51 SRR606249_subset10_trim30_scaled10000.k51.sig genbank-k51.sbt.json refseq-k51.sbt.json -o SRR606249_subset10_trim30_k51.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k51.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k51.gather_matches.csv
sourmash gather -k 31 SRR606249_subset10_trim30_scaled10000.k31.sig genbank-k31.sbt.json refseq-k31.sbt.json -o SRR606249_subset10_trim30_k31.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k31.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k31.gather_matches.csv
sourmash gather -k 21 SRR606249_subset10_trim30_scaled10000.k21.sig genbank-k21.sbt.json refseq-k21.sbt.json -o SRR606249_subset10_trim30_k21.gather_output.csv --output-unassigned SRR606249_subset10_trim30_k21.gather_unassigned.csv --save-matches SRR606249_subset10_trim30_k21.gather_matches.csv
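Since the three gather commands for a given trim level differ only in the k-mer length, they can be generated with a loop. A sketch, echoed as a dry run so nothing is executed (drop the echo to run the real commands, which require sourmash and the SBT databases):

```shell
sample=SRR606249_subset10_trim2   # or SRR606249_subset10_trim30

# Dry run: print one sourmash gather command per k-mer length.
for k in 21 31 51; do
  echo sourmash gather -k "$k" "${sample}_scaled10000.k${k}.sig" \
    "genbank-k${k}.sbt.json" "refseq-k${k}.sbt.json" \
    -o "${sample}_k${k}.gather_output.csv" \
    --output-unassigned "${sample}_k${k}.gather_unassigned.csv" \
    --save-matches "${sample}_k${k}.gather_matches.csv"
done
```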
The equivalents of the following commands generate taxonomic classification outputs with Kraken2 for trim2 and trim30:
kraken2 --db minikraken_8GB_20200312/ --paired SRR606249_subset10_1_reads_trim2_1.fq.gz SRR606249_subset10_1_reads_trim2_2.fq.gz --use-names --confidence 0 --report SRR606249_subset10_trim2_kraken2.report --unclassified-out SRR606249_subset10_trim2_kraken2_unclassified#.fq.gz --classified-out SRR606249_subset10_trim2_kraken2_classified#.fq.gz --output SRR606249_subset10_trim2_kraken2.out.txt
kraken2 --db minikraken_8GB_20200312/ --paired SRR606249_subset10_1_reads_trim30_1.fq.gz SRR606249_subset10_1_reads_trim30_2.fq.gz --use-names --confidence 0 --report SRR606249_subset10_trim30_kraken2.report --unclassified-out SRR606249_subset10_trim30_kraken2_unclassified#.fq.gz --classified-out SRR606249_subset10_trim30_kraken2_classified#.fq.gz --output SRR606249_subset10_trim30_kraken2.out.txt
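The `--report` file is the easiest place to inspect the results: Kraken2 reports are tab-separated with the columns percent of reads in the clade, clade read count, direct read count, rank code, NCBI taxid, and scientific name. A small sketch on made-up data showing how to pull out the most abundant species-level calls (the organisms and counts below are invented for illustration):

```shell
# Build a toy three-line Kraken2 report (tab-separated fields:
# percent, clade reads, direct reads, rank code, taxid, name).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  45.10 45100 0 S 562  'Escherichia coli' \
  30.25 30250 0 S 1280 'Staphylococcus aureus' \
  5.00  5000  0 G 561  'Escherichia' > toy_kraken2.report

# Keep species-level rows (rank code S) and list read count + name,
# most abundant first:
awk -F'\t' '$4 == "S" {print $2 "\t" $6}' toy_kraken2.report | sort -k1,1nr
# 45100   Escherichia coli
# 30250   Staphylococcus aureus
```

The same one-liner applied to the real `*_kraken2.report` files gives a quick ranked species list without opening the full per-read output.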
The equivalents of the following commands generate Bracken species-level taxonomic classification outputs using the Kraken2 reports previously generated with trim2 and trim30:
bracken -d minikraken_8GB_20200312/ -i SRR606249_subset10_1_reads_trim2_kraken2_minikraken_8GB_20200312_confidence0.report -r 100 -l S -t 0 -o SRR606249_subset10_1_reads_trim2_kraken2_minikraken_8GB_20200312_confidence0_bracken_db-default_r-100_l-S_t-0
bracken -d minikraken_8GB_20200312/ -i SRR606249_subset10_1_reads_trim30_kraken2_minikraken_8GB_20200312_confidence0.report -r 100 -l S -t 0 -o SRR606249_subset10_1_reads_trim30_kraken2_minikraken_8GB_20200312_confidence0_bracken_db-default_r-100_l-S_t-0
The equivalents of the following commands generate taxonomic classification outputs with KrakenUniq for trim2 and trim30:
krakenuniq --paired SRR606249_subset10_1_reads_trim2_1.fq.gz SRR606249_subset10_1_reads_trim2_2.fq.gz --db minikraken_20171019_8GB/ --hll-precision 12 --report-file SRR606249_subset10_trim2_krakenuniq_report --unclassified-out SRR606249_subset10_trim2_krakenuniq_unclassified --classified-out SRR606249_subset10_trim2_krakenuniq_classified --output SRR606249_subset10_trim2_krakenuniq_out
krakenuniq --paired SRR606249_subset10_1_reads_trim30_1.fq.gz SRR606249_subset10_1_reads_trim30_2.fq.gz --db minikraken_20171019_8GB/ --hll-precision 12 --report-file SRR606249_subset10_trim30_krakenuniq_report --unclassified-out SRR606249_subset10_trim30_krakenuniq_unclassified --classified-out SRR606249_subset10_trim30_krakenuniq_classified --output SRR606249_subset10_trim30_krakenuniq_out
The equivalents of the following commands will run Kaiju and generate its standard report for all of the organisms identified (one command is listed below for the trim2 filtered reads and one for trim30):
kaiju -x -v -t nodes.dmp -f kaiju_db_nr_euk.fmi -i SRR606249_subset10_trim2_1.fq.gz -j SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2.kaiju.out -z 4
kaiju -x -v -t nodes.dmp -f kaiju_db_nr_euk.fmi -i SRR606249_subset10_trim30_1.fq.gz -j SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30.kaiju.out -z 4
The standard kaiju report includes NCBI taxonomy IDs, but it does not include the taxon names. The equivalents of the following commands are used to append the corresponding taxon names to the last column in Kaiju's output file with `addTaxonNames`:
addTaxonNames -t nodes.dmp -n names.dmp -u -p -i SRR606249_subset10_trim2.kaiju.out -o SRR606249_subset10_trim2.kaiju_names.out
addTaxonNames -t nodes.dmp -n names.dmp -u -p -i SRR606249_subset10_trim30.kaiju.out -o SRR606249_subset10_trim30.kaiju_names.out
A summary report for a specified taxon rank can be created from the kaiju output with `kaijuReport`. These summary reports can be directly input into krona with `ktImportText`.
The equivalents of the following commands generate genus-level summaries with `kaijuReport` for all organisms identified in the standard kaiju report:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -o SRR606249_subset10_trim2.kaiju_genus.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -o SRR606249_subset10_trim30.kaiju_genus.summary
The equivalents of the following commands generate species-level summaries with `kaijuReport` for all organisms identified in the standard kaiju report:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r species -o SRR606249_subset10_trim2.kaiju_out_species.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r species -o SRR606249_subset10_trim30.kaiju_out_species.summary
The equivalents of the following commands generate genus-level summaries with `kaijuReport` for all genera comprising at least 1% of the total reads:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -m 1 -o SRR606249_subset10_trim2.kaiju_genus_filtered1_total.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -m 1 -o SRR606249_subset10_trim30.kaiju_genus_filtered1_total.summary
The equivalents of the following commands generate genus-level summaries with `kaijuReport` for genera comprising at least 1% of all classified reads:
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim2.kaiju.out -r genus -m 1 -u -o SRR606249_subset10_trim2.kaiju_genus_filtered1_classified.summary
kaijuReport -v -t nodes.dmp -n names.dmp -i SRR606249_subset10_trim30.kaiju.out -r genus -m 1 -u -o SRR606249_subset10_trim30.kaiju_genus_filtered1_classified.summary
The equivalents of the following commands are used to generate krona visualizations from the `kaijuReport` genus-level summaries with `ktImportText` (three different krona plots for trim2 and three for trim30):
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona.html SRR606249_subset10_trim2.kaiju_genus.summary
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona_filtered1_total.html SRR606249_subset10_trim2.kaiju_genus_filtered1_total.summary
ktImportText -o SRR606249_subset10_trim2.kaiju_genus_krona_filtered1_classified.html SRR606249_subset10_trim2.kaiju_genus_filtered1_classified.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona.html SRR606249_subset10_trim30.kaiju_genus.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona_filtered1_total.html SRR606249_subset10_trim30.kaiju_genus_filtered1_total.summary
ktImportText -o SRR606249_subset10_trim30.kaiju_genus_krona_filtered1_classified.html SRR606249_subset10_trim30.kaiju_genus_filtered1_classified.summary
For the command-line equivalents used to execute the MTSv analyses (Readprep, Binning, Analysis, and Summary), please see the Binning and Analysis Quick Start Guide on the MTSv GitHub page.
Below is a more detailed description of the output files expected in the `metscale/workflows/data/` directory after the taxonomic classification workflow has been successfully run.
Using the filtered reads generated by the Read Filtering Workflow:
File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_1.fq.gz` | 381 MB |
`SRR606249_subset10_1_reads_trim2_2.fq.gz` | 374 MB |
`SRR606249_subset10_1_reads_trim30_1.fq.gz` | 327 MB |
`SRR606249_subset10_1_reads_trim30_2.fq.gz` | 313 MB |
The following Mash screen and Mash distance files are produced for the trim30 filtered reads:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_screen.tab` | 6.7 MB |
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_screen.sorted.tab` | 6.7 MB |
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_distances.tab` | 11 MB |
`SRR606249_subset10_1_reads_trim30_RefSeq_10K_Sketches.msh_mash_distances.sorted.tab` | 11 MB |
The following files are produced by sourmash after computing the signatures of filtered reads with the `tax_class_signatures_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig` | 1.1 MB |
`SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig` | 883 KB |
The following files are produced by sourmash after running sourmash gather with the `tax_class_gather_workflow` rule (this rule produces three files for each value of k and trim quality threshold):

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_k21.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim2_k21.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim2_k21.gather_unassigned.csv` | 52 KB |
`SRR606249_subset10_1_reads_trim30_k21.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim30_k21.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim30_k21.gather_unassigned.csv` | 11 KB |
`SRR606249_subset10_1_reads_trim2_k31.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim2_k31.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim2_k31.gather_unassigned.csv` | 58 KB |
`SRR606249_subset10_1_reads_trim30_k31.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim30_k31.gather_output.csv` | 19 KB |
`SRR606249_subset10_1_reads_trim30_k31.gather_unassigned.csv` | 12 KB |
`SRR606249_subset10_1_reads_trim2_k51.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim2_k51.gather_output.csv` | 20 KB |
`SRR606249_subset10_1_reads_trim2_k51.gather_unassigned.csv` | 57 KB |
`SRR606249_subset10_1_reads_trim30_k51.gather_matches.csv` | 2.2 MB |
`SRR606249_subset10_1_reads_trim30_k51.gather_output.csv` | 20 KB |
`SRR606249_subset10_1_reads_trim30_k51.gather_unassigned.csv` | 14 KB |
The `f_match` value reported by sourmash in the *.gather_output.csv files is the fraction of each matched reference genome that is contained within the query metagenome. Matched reference genomes are listed by their GenBank IDs, which can be connected to NCBI TaxIDs and species names.
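The f_match column can be pulled out by name from the CSV header so the best-covered reference genomes sort to the top. A sketch on a made-up two-row file (real gather CSVs have more columns, possibly in a different order, and quoted fields can contain commas, so a naive comma split is only a rough tool):

```shell
# Toy gather_output.csv with a header row naming the columns.
printf '%s\n' 'intersect_bp,f_orig_query,f_match,name' \
  '500000,0.12,0.98,ref_A' \
  '20000,0.01,0.07,ref_B' > toy_gather_output.csv

# Find the f_match column from the header, then print each match's name
# (assumed to be the last column here) with its f_match, best covered first:
awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "f_match") c = i; next }
         { print $NF, $c }' toy_gather_output.csv | sort -k2,2nr
# ref_A 0.98
# ref_B 0.07
```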
For the example dataset, there was no loss of sensitivity in increasing the k-mer length or trimming quality threshold - all 62 species were detected in all scenarios. Although different k-mer lengths and read trimming parameters produced similar sourmash gather f_match
results, the smallest fraction of reference genomes was typically detected in the k=51 trim30 dataset. Increasing specificity with longer k-mer lengths and more aggressive trimming tended to lower the fraction of each true positive reference genome detected; however, in some cases the increased specificity of longer k-mer lengths resulted in detecting more of the reference genome than shorter k-mer lengths (e.g., Thermotoga sp. RQ2).
Although sourmash gather performs best at the species level, it will report matches below the species level. Two of the species in the Shakya subset 10 example dataset had strain variants, and no false positive strains were detected for these organisms (i.e., Methanococcus maripaludis strain C5, Methanococcus maripaludis strain S2, Shewanella baltica strain OS185, Shewanella baltica strain OS223). False positives for other organisms detected at the species level tended to have smaller f_match
values, which suggests that there may be a threshold of genome coverage required for a high confidence species call to be made. Alternatively, these could be low abundance, real contaminants within the sample that was sequenced.
Overall, k-mer length had a larger impact on sourmash results than the trimming quality threshold. The smallest fraction of false positive genomes was detected in the k=51 datasets. The k=51 and trim30 parameters are recommended as defaults if users only want to run sourmash with one trimming quality threshold and value of k.
The following *.out files are produced by kaiju to report the taxonomic classifications of reads after running the `tax_class_kaijureport_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju.out` | 628 MB |
`SRR606249_subset10_1_reads_trim30.kaiju.out` | 592 MB |
The `tax_class_add_taxonnames_workflow` rule will append taxon names to the *.out files to produce the following files:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_names.out` | 1.3 GB |
`SRR606249_subset10_1_reads_trim30.kaiju_names.out` | 1.2 GB |
Kaiju can produce summary reports at different taxon levels, which can also be input into krona for visualizations of the summarized data. The following species-level summaries can be generated for all kaiju results with the `tax_class_kaiju_species_summary_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_out_species.summary` | 241 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_out_species.summary` | 204 KB |
Kaiju detected (i.e., at least one read was assigned) >5,000 species in the SRR606249_subset10_1_reads_trim2 dataset and >4,000 species in the SRR606249_subset10_1_reads_trim30 dataset. Although a thresholding approach with the example dataset would not have cleanly achieved 100% precision and recall with kaiju as with sourmash, requiring a minimum number of reads for a high confidence kaiju species detection would substantially reduce the number of false positives. In the trim30 dataset, a threshold of >6,000 assigned reads would have identified 57 true positives (out of an expected 62 species) and one false positive (Proteiniclasticum ruminis = 48,586 reads), which in this case could be a real species contaminant in the sample, and five false negatives (Enterococcus faecalis = 5,079 reads; Chloroflexus aurantiacus = 4,807 reads; Bordetella bronchiseptica = 3,309 reads; Fusobacterium nucleatum = 3,156 reads; Thermoanaerobacter pseudethanolicus = 93 reads). Further benchmarking work with additional datasets will be important to determine how to best apply interpretation guidelines to kaiju results.
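As a sketch of such a read-count threshold, assuming the summary's tab-separated percent / read count / taxon name layout (the rows below are made up for illustration; real kaijuReport files may carry header or footer lines that need skipping):

```shell
# Toy kaiju species summary: percent, assigned reads, taxon name.
printf '%s\t%s\t%s\n' \
  12.500 48586 'Proteiniclasticum ruminis' \
  1.300  5079  'Enterococcus faecalis' \
  0.020  93    'Thermoanaerobacter pseudethanolicus' > toy_species.summary

# Keep only species with more than 6,000 assigned reads:
awk -F'\t' '$2 > 6000 {print $3}' toy_species.summary
# Proteiniclasticum ruminis
```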
The following genus-level summary files are produced by kaiju for all filtered reads after running the `tax_class_kaijureport_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus.summary` | 54 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus.summary` | 47 KB |
The *.kaiju_genus.summary files can be input into krona with the `tax_class_visualize_krona_kaijureport_workflow` rule to produce the following krona plots of all genus-level results:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus_krona.html` | 371 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_krona.html` | 353 KB |
Because of the high number of organisms detected by kaiju, removing the organisms detected at low abundance makes the krona plots easier to read. The `tax_class_kaijureport_filtered_workflow` rule will produce the following genus-level summary reports that exclude genera with <1% of total reads, and krona will produce corresponding visualizations with the `tax_class_visualize_krona_kaijureport_filtered_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus_filtered1_total.summary` | 1.3 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_filtered1_total.summary` | 1.3 KB |
`SRR606249_subset10_1_reads_trim2.kaiju_genus_krona_filtered1_total.html` | 227 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_krona_filtered1_total.html` | 227 KB |
The `tax_class_kaijureport_filteredclass_workflow` rule will produce the following genus-level summary reports that exclude genera with <1% of all classified reads, and krona will produce corresponding visualizations with the `tax_class_visualize_krona_kaijureport_filteredclass_workflow` rule:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2.kaiju_genus_filtered1_classified.summary` | 1.5 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_filtered1_classified.summary` | 1.5 KB |
`SRR606249_subset10_1_reads_trim2.kaiju_genus_krona_filtered1_classified.html` | 227 KB |
`SRR606249_subset10_1_reads_trim30.kaiju_genus_krona_filtered1_classified.html` | 227 KB |
The following files are produced by Kraken2 with the `tax_class_kraken2_workflow` rule with trim30:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim30_kraken2_unclassified_minikraken2_v2_8GB_201904_UPDATE_confidence0_1.fq` | 113 MB |
`SRR606249_subset10_1_reads_trim30_kraken2_unclassified_minikraken2_v2_8GB_201904_UPDATE_confidence0_2.fq` | 111 MB |
`SRR606249_subset10_1_reads_trim30_kraken2_minikraken2_v2_8GB_201904_UPDATE_confidence0.report` | 192 KB |
`SRR606249_subset10_1_reads_trim30_kraken2_minikraken2_v2_8GB_201904_UPDATE_confidence0.out` | 755 MB |
The following files are produced by KrakenUniq with the `tax_class_krakenuniq_workflow` rule with trim30:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_classified` | 849 MB |
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_unclassified` | 166 MB |
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_report` | 167 KB |
`SRR606249_subset10_1_reads_trim30_krakenuniq_minikraken_20171019_8GB_hll12_out` | 502 MB |
The following files are produced by MTSv with the `tax_class_run_mtsv_workflow` rule with trim2 and trim30:

File Name | File Size |
---|---|
`SRR606249_subset10_1_reads_trim2_MTSV/Summary/summary.csv` | 44 KB |
`SRR606249_subset10_1_reads_trim30_MTSV/Summary/summary.csv` | 36 KB |