09. Functional Inference
This workflow will perform functional inference (e.g., gene and plasmid detection, microbial pathway analysis) on filtered Illumina paired-end reads with SRST2 version 0.2.0 and with HUMAnN2 version 2.8.1, or on assembled contigs with prokka version 1.14.5 and ABRicate version 0.5. Note that HUMAnN2 will simultaneously run and generate taxonomic outputs for MetaPhlAn2. The functional inference workflow has been tested to run offline after the execution of the Read Filtering Workflow and/or Assembly Workflow. Default HUMAnN2 databases are used within the functional inference workflow, but users have the option of selecting from various pre-compiled or custom databases for SRST2, prokka, and ABRicate.
SRST2, prokka, and ABRicate are intended to be used for the presence/absence detection of genes or plasmids within metagenomes. HUMAnN2 was designed to report the organisms, gene families, and pathways identified within a microbial community. While this can be helpful in understanding the functional content of a metagenome, detecting whole or partial functional elements is not equivalent to definitively detecting antimicrobial resistance, virulence, or other metabolic phenotypes, since phenotype determinations are often more complex than the presence or absence of a single gene or plasmid. Ultimately, the strength of the effect of a single gene or plasmid is variable and dependent on the specific phenotype being evaluated.
Certain features of the functional inference tools are better suited for analyzing isolates with high sequence coverage (e.g., gene variant/allele detection with indel or SNP calling, MLST), and these are not recommended for use with complex metagenomes, where many organisms were likely sequenced at relatively low coverage with similar gene/plasmid content. This makes it very difficult to resolve point mutations with high confidence, but it is easier to detect the presence or absence of large genome segments (e.g., genes, plasmids) in complex metagenomes. The functional inference workflow is currently configured to report the detection of whole or partial genes and plasmids because functional sequence fragments are expected to be found in complex metagenomes. Lower coverage thresholds have not been set to exclude information from the final reports (e.g., genes will still be reported if they are detected below 90% coverage).
If you have not already, you will need to activate your metscale environment and perform the Offline Setup before running the functional inference workflow:
[user@localhost ~]$ conda activate metscale
(metscale)[user@localhost ~]$ cd metscale/workflows
(metscale)[user@localhost workflows]$ python download_offline_files.py --workflow functional_inference
In the `metscale/container_images/` directory, you should see the following Singularity images that were created when running the `functional_inference` or `all` flags during the Offline Setup:

| Singularity Image | Size |
|---|---|
| `prokka_1.14.5-1.sif` | 1.4 GB |
| `srst2_0.2.0--py27_2.sif` | 330 MB |
| `abricate_latest.sif` | 1.3 GB |
| `humann2_2.8.1--py27_0.sif` | 425 MB |
Each tool in the functional inference workflow has its own database(s), which are configured in specific formats as needed to execute the individual tools. Bioinformatics tool authors and container developers may choose to package databases inside their container image, or they may keep the databases outside of the container image. The relatively large taxonomic classification databases are located outside of the taxonomic container images, but the functional inference workflow has some databases inside and some outside of containers.
The SRST2 databases are not included in the SRST2 container image, so they need to be downloaded separately and will appear as individual files in the `metscale/workflows/data` directory. The databases for SRST2 were downloaded from the SRST2 GitHub during the Offline Setup. The SRST2 GitHub is a good resource for learning more about how these databases were compiled. SRST2 maps reads to the reference gene sequences within the specified database and reports the percent coverage for the detected genes.

You should find the following SRST2 databases in the `metscale/workflows/data` directory:
| Database Name | Description |
|---|---|
| `ARGannot.r3.fasta` | The authors of SRST2 recommend using the Antibiotic Resistance Gene-ANNOTation database as the default for detecting resistance genes, so this is currently set as the default database for SRST2 in the config file. A change to this file can specify the use of a non-default SRST2 database. |
| `ResFinder.fasta` | The authors of SRST2 recommend ResFinder as a secondary option for detecting resistance genes. |
| `PlasmidFinder.fasta` | A PlasmidFinder database is available from the SRST2 authors for plasmid characterization. It identifies plasmids in totally or partially sequenced bacterial isolates. |
| `EcOH.fasta` | The authors of SRST2 also provide the EcOH database for identifying O and H types in E. coli. This is intended for use with E. coli isolate sequences. |
| Database Name | Size | MD5 Checksum |
|---|---|---|
| `ARGannot.r3.fasta` | 1.8 MB | `9d01664f4361bb32bea8d23ab1fd7f82` |
| `EcOH.fasta` | 0.8 MB | `1854d799650599f036f6d9ff8e1816d7` |
| `PlasmidFinder.fasta` | 0.2 MB | `8b302925c92b8df5580dda6e8dfee3cf` |
| `ResFinder.fasta` | 1.8 MB | `657c10d0b8981cee34d0a9b91513b8a7` |
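Since these databases are downloaded as plain files, you can confirm they arrived intact by recomputing their MD5 checksums against the table above. A minimal sketch (the `check_databases` helper and its data-directory argument are illustrative, not part of MetScale):

```python
# Sketch: verify downloaded SRST2 database files against their published
# MD5 checksums. File names and hashes come from the table above;
# point data_dir at your metscale/workflows/data location.
import hashlib
from pathlib import Path

EXPECTED = {
    "ARGannot.r3.fasta": "9d01664f4361bb32bea8d23ab1fd7f82",
    "EcOH.fasta": "1854d799650599f036f6d9ff8e1816d7",
    "PlasmidFinder.fasta": "8b302925c92b8df5580dda6e8dfee3cf",
    "ResFinder.fasta": "657c10d0b8981cee34d0a9b91513b8a7",
}

def md5sum(path, chunk=1 << 20):
    """Stream the file so large databases are not loaded fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def check_databases(data_dir):
    """Return {filename: True/False, or None if the file is missing}."""
    results = {}
    for name, expected in EXPECTED.items():
        p = Path(data_dir) / name
        results[name] = md5sum(p) == expected if p.exists() else None
    return results
```

A `None` result means the file is missing and the Offline Setup should be re-run; `False` means the download is corrupt or outdated.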
To run SRST2 with your own custom database, you first need to copy your database (.fasta file) to the `metscale/workflows/data/` directory. Then, indicate your database name in your config file under `functional_with_srst2_workflow`. By default, MetScale will run the ARGannot.r3.fasta and ResFinder.fasta databases with SRST2.
"functional_with_srst2_workflow" : {
    # these parameters determine srst2 input
    "qual" : ["2","30"],
    # srst2 db's - {ARGannot.r3.fasta, ResFinder.fasta, PlasmidFinder.fasta, EcOH.fasta} or external DBs
    "db" : ["ARGannot.r3.fasta", "ResFinder.fasta"]
},
NOTE: To successfully generate a custom gene database, follow the exact rules and scripts provided by the SRST2 authors (see the "Generating SRST2-compatible clustered database from raw sequences" section at https://github.com/katholt/srst2). We recommend running your data through the SRST2 database-generating script `csv_to_gene_db.py`, per your use case. A custom database that is improperly formatted may produce inaccurate outputs without warning.
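As a quick pre-flight check before pointing the workflow at a custom database, you can verify that each FASTA header follows the four-field `clusterID__gene__allele__seqID` convention that SRST2 expects (an optional free-text annotation may follow after a space). This hypothetical helper is a lightweight sanity check only, not a substitute for `csv_to_gene_db.py`:

```python
# Sketch: flag FASTA headers that do not follow the SRST2 four-field
# convention (clusterID__gene__allele__seqID, optionally followed by
# a space and an annotation). Illustrative helper, not part of SRST2.
import re

HEADER_RE = re.compile(r"^>\S+?__\S+?__\S+?__\S+(\s.*)?$")

def invalid_headers(fasta_lines):
    """Return the header lines that do not match the SRST2 convention."""
    return [ln for ln in fasta_lines
            if ln.startswith(">") and not HEADER_RE.match(ln.rstrip())]
```

An empty result does not guarantee a valid database; it only confirms the header layout looks right.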
Unlike SRST2, the Singularity images for prokka and ABRicate are preloaded with default databases. Prokka does not use databases to predict microbial genes, but it does use databases to annotate translated proteins from its predicted genes.
Default prokka databases are used within the functional inference workflow, so there are not currently any options to change prokka databases in the config file. Prokka will identify and translate prokaryotic open reading frames from assembled contigs with Prodigal without the use of reference databases, and then compare the translated proteins to databases with functional information.
To view the default databases available within the prokka Singularity image, you can run the following command from inside the `metscale/workflows/` directory:
singularity exec -B ${PWD}/:/tmp ../container_images/prokka_1.14.5-1.sif prokka --listdb
The following prokka databases are available:
| Database | Description |
|---|---|
| Kingdoms: Archaea, Bacteria, Mitochondria, Viruses | Kingdoms available for annotations |
| Genera: Enterococcus, Escherichia, Staphylococcus | Genera available for genus-specific annotations |
| HMMs: HAMAP | HAMAP annotation of protein sequences |
| CMs: Archaea, Bacteria, Viruses | Infernal "covariance models" (CMs) for RNA homology searches |
"Kingdoms" and "Genera" represent prokka flags that can be used to limit annotations to specific taxa. These options are not used in the functional inference workflow because we wanted all taxa to be annotated, which is prokka's default behavior.
NOTE: Prokka is run in the functional inference workflow in a way that eliminates any tRNA genes from the final outputs (`--notrna`), and it runs RNAmmer (`--rnammer`) to improve the accuracy of rRNA gene predictions. The `--metagenome` and `--eval 1e-06` flags are also used with prokka to optimize it for metagenome assemblies.
Any assembly can be run through the prokka workflows, with the ability to specify your own custom database for execution. To use a custom database instead of the internal/default databases, copy it to your `workflows/data/` directory. You must then change the `"input_db"` parameter in your `default_workflowparams.settings` file from the default prokka database (`"/NGStools/prokka/db/kingdom/Bacteria/IS"`) to the name of the custom database that you placed in `workflows/data/`. This parameter is found under each prokka + assembler rule in your `default_workflowparams.settings` file:
"functional_inference" : {
    # params for functional inference workflow
    "prokka_with_megahit" : {
        "outdir_pattern" : "{sample}_trim{qual}_megahit.prokka_annotation",
        "input_pattern" : "{sample}_trim{qual}.megahit.contigs.fa",
        "prefix_pattern" : "{sample}_trim{qual}_megahit",
        "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
        "threads" : 16
    },
    "prokka_metatrans_with_rnaspades" : {
        "outdir_pattern" : "{sample}_trim{qual}_metatrans_rnaspades.prokka_annotation",
        "input_pattern" : "{sample}_trim{qual}.rnaspades.transcripts.fasta",
        "prefix_pattern" : "{sample}_trim{qual}_metatrans_rnaspades",
        "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
        "threads" : 16
    },
    "prokka_trans_with_rnaspades" : {
        "outdir_pattern" : "{sample}_trim{qual}_trans_rnaspades.prokka_annotation",
        "input_pattern" : "{sample}_trim{qual}.rnaspades.transcripts.fasta",
        "prefix_pattern" : "{sample}_trim{qual}_trans_rnaspades",
        "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
        "threads" : 16
    },
    "prokka_with_metaspades" : {
        "outdir_pattern" : "{sample}_trim{qual}_metaspades.prokka_annotation",
        "input_pattern" : "{sample}_trim{qual}.metaspades.contigs.fa",
        "prefix_pattern" : "{sample}_trim{qual}_metaspades",
        "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
        "threads" : 16
    },
    "prokka_with_spades" : {
        "outdir_pattern" : "{sample}_trim{qual}_spades.prokka_annotation",
        "input_pattern" : "{sample}_trim{qual}.spades.contigs.fa",
        "prefix_pattern" : "{sample}_trim{qual}_spades",
        "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
        "threads" : 16
    },
You can also specify the location of your custom database by adding the following lines to your `my_custom_config` and `illumina_custom_config` files, after your `"read_filtering"` section and before your `"workflows"` section:
{
    "read_filtering" : {
        "read_patterns" : {
            "pre_trimming_pattern" : "{sample}_{direction}_reads",
            "pre_trimming_glob_pattern" : "*_1_reads.fq.gz",
            "reverse_pe_pattern_search" : "1_",
            "reverse_pe_pattern_replace" : "2_",
        },
        "quality_trimming" : {
            "sample_file_ext" : ".fq.gz"
        },
    },
    "functional_inference" : {
        # params for functional inference workflow external database(s)
        "prokka_with_megahit" : {
            "input_db" : "my_custom_database.fasta",
        },
        "prokka_with_metaspades" : {
            "input_db" : "my_custom_database.fasta",
        },
        "prokka_with_spades" : {
            "input_db" : "my_custom_database.fasta",
        },
    },
    "workflows" : {
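Conceptually, the custom config is merged over the defaults so that only the parameters you specify change while everything else keeps its default value. A generic sketch of that override behavior (MetScale's actual config loader may differ in details):

```python
# Sketch: how a custom config overrides the defaults. This is a generic
# recursive dict merge for illustration -- MetScale's own loader may differ.
def merge_config(default, custom):
    """Return default updated with custom; nested dicts merge key-by-key."""
    merged = dict(default)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

default = {"functional_inference": {
    "prokka_with_megahit": {
        "input_db": "/NGStools/prokka/db/kingdom/Bacteria/IS",
        "threads": 16}}}
custom = {"functional_inference": {
    "prokka_with_megahit": {"input_db": "my_custom_database.fasta"}}}

merged = merge_config(default, custom)
# input_db now comes from the custom file; threads keeps its default of 16.
```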
ABRicate will align assembled contigs to reference gene sequences within a specified database. The following databases are available:
| Database | Source |
|---|---|
| argannot | Antibiotic Resistance Gene-ANNOTation |
| card | The Comprehensive Antibiotic Resistance Database |
| ncbibetalactamase | Bacterial Antimicrobial Resistance Reference Gene Database |
| plasmidfinder | PlasmidFinder |
| resfinder | ResFinder |
| vfdb | Virulence Factors Database |
To view the databases available within the ABRicate Singularity image, the following command can be run from inside the `metscale/workflows/` directory:
singularity exec -B ${PWD}/:/tmp ../container_images/abricate_latest.sif abricate --list
You should see the following database information:
| Database | Number of Sequences | Date |
|---|---|---|
| argannot | 1749 sequences | Mar 17, 2017 |
| card | 2117 sequences | Mar 18, 2017 |
| ncbibetalactamase | 1557 sequences | Mar 17, 2017 |
| plasmidfinder | 263 sequences | Mar 19, 2017 |
| resfinder | 2130 sequences | Mar 17, 2017 |
| vfdb | 2597 sequences | Mar 17, 2017 |
The ABRicate database can be changed in the config file. The author of ABRicate did not have strong recommendations for which database to select as a default, so `card` and `vfdb` were set as its default databases in the functional inference workflow.
"functional_abricate_with_megahit_workflow" : {
# these parameters determine which contig params
# from megahit to run abricate on
"qual" : ["2","30"],
# abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
# replace "customDBabricate" with custom/external DB name
"db" : ["card", "vfdb"],
#"db" : ["customDBabricate"],
},
"functional_abricate_with_metaspades_workflow" : {
# these parameters determine which contig params
# from metaspades to run abricate on
"qual" : ["2","30"],
# abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
# replace "customDBabricate" with custom/external DB name
"db" : ["card", "vfdb"],
#"db" : ["customDBabricate"],
},
"functional_abricate_with_spades_workflow" : {
# these parameters determine which contig params
# from spades to run abricate on
"qual" : ["2","30"],
# abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
# replace "customDBabricate" with custom/external DB name
"db" : ["card", "vfdb"],
#"db" : ["customDBabricate"],
},
NOTE: Although some SRST2 and ABRicate databases have similar names and were derived from the same original sources, they are not equivalent. Nomenclature is different between the two, downloads from the original sources took place at different times, and manual gene annotations and additions were performed by tool authors to different degrees. They are both good tools, but results will vary because of underlying database differences.
If you are missing any of these files, you should re-run the appropriate setup command, as per instructions in the Offline Setup.
The MEGAHIT, metaSPAdes, and SPAdes ABRicate workflows all have the ability to specify a custom database for execution. When running the internal/default database(s), the parameter for the database directory (i.e., `db_dir`) is set to `/opt/abricate/db` in the `default_workflowparams.settings` file. If you want to use your own custom/external database, you will need to copy it to the `metscale/workflows/data/` directory, and then change the `db_dir` parameter from the default database directory to `sing_dir`. This can be done by commenting out the default line(s) shown below in `default_workflowparams.settings` and activating the `"db_dir" : "sing_dir"` lines.
"abricate_with_metaspades" : {
"output_pattern" : "{sample}_trim{qual}_metaspades.abricate_{db}.csv",
"input_pattern" : "{sample}_trim{qual}.metaspades.contigs.fa",
# set by default to use internal DB's
"db_dir" : "/opt/abricate/db/"
#use external dirs
#"db_dir" : "sing_dir"
},
"abricate_with_megahit" : {
"output_pattern" : "{sample}_trim{qual}_megahit.abricate_{db}.csv",
"input_pattern" : "{sample}_trim{qual}.megahit.contigs.fa",
# set by default to use internal DB's
"db_dir" : "/opt/abricate/db/"
#use external dirs
#"db_dir" : "sing_dir"
},
"abricate_with_spades" : {
"output_pattern" : "{sample}_trim{qual}_spades.abricate_{db}.csv",
"input_pattern" : "{sample}_trim{qual}.spades.contigs.fa",
# set by default to use internal DB's
"db_dir" : "/opt/abricate/db/"
#use external dirs
#"db_dir" : "sing_dir"
},
You can also specify the location of your custom database by adding the following lines to your `my_custom_config` and `illumina_custom_config` file(s) before the `"workflows"` sections:
},
"functional_inference" : {
# params for functional inference workflow external database(s)
"abricate_with_metaspades" : {
# set by default to use internal DB's
#"db_dir" : "/opt/abricate/db/"
#use external dirs
"db_dir" : "sing_dir"
},
"abricate_with_megahit" : {
# set by default to use internal DB's
#"db_dir" : "/opt/abricate/db/"
#use external dirs
"db_dir" : "sing_dir"
},
"abricate_with_spades" : {
# set by default to use internal DB's
#"db_dir" : "/opt/abricate/db/"
#use external dirs
"db_dir" : "sing_dir"
},
"workflows" : {
Finally, remember to update the name of the database you would like these workflows to use for both rules under the `"workflows"` section in your default or custom config file:
"functional_abricate_with_megahit_workflow" : {
# these parameters determine which contig params
# from megahit to run abricate on
"qual" : ["2","30"],
# abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
# replace "customDBabricate" with custom/external DB name
#"db" : ["card", "vfdb"],
"db" : ["customDBabricate"],
},
"functional_abricate_with_metaspades_workflow" : {
# these parameters determine which contig params
# from metaspades to run abricate on
"qual" : ["2","30"],
# abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
# replace "customDBabricate" with custom/external DB name
#"db" : ["card", "vfdb"],
"db" : ["customDBabricate"],
},
"functional_abricate_with_spades_workflow" : {
# these parameters determine which contig params
# from spades to run abricate on
"qual" : ["2","30"],
# abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
# replace "customDBabricate" with custom/external DB name
#"db" : ["card", "vfdb"],
"db" : ["customDBabricate"],
},
NOTE: Because of the different database directory paths required, internal and external databases CANNOT be executed at the same time. Remember, if you are planning to run both functional_abricate_with_megahit_workflow
and functional_abricate_with_metaspades_workflow
, you must update the parameters for both of these rules in your default or custom config file.
The HUMAnN2 databases are not included in the HUMAnN2 container image, so they need to be downloaded separately and will appear as individual files in the `metscale/workflows/data` directory. The databases for HUMAnN2 were downloaded from the HUMAnN2 GitHub during the Offline Setup. The HUMAnN2 GitHub is a good resource for learning more about how these databases were compiled. HUMAnN2 uses nucleotide mapping and translated search to provide organism-specific gene and pathway abundance profiles from a single metagenome.
You should find the following HUMAnN2 databases in the `metscale/workflows/data` directory:

| Database Name | Description |
|---|---|
| uniref90 | The authors of HUMAnN2 recommend using the UniRef90 database as the default for gene family annotation, so this is currently set as the default database in the config file. A change to this file can specify the use of a non-default HUMAnN2 database. |
| chocophlan_plus_viral | The authors of HUMAnN2 recommend ChocoPhlAn as a pangenome search database. This is a proprietary database developed by the authors of HUMAnN2. |
| mpa_v20_m200.tar | This is the marker metadata file provided with the MetaPhlAn package. |
| mpa_v20_m200.md5 | This is an MD5 hash for the marker metadata file provided with the MetaPhlAn package. |
| Database Name | Size | MD5 Checksum |
|---|---|---|
| uniref90/uniref90_annotated.1.1.dmnd | 11 GB | 55fa7b368f74cfdeb9245ab9412056fe * |
| full_chocophlan_plus_viral.v0.1.1.tar.gz | 5.4 GB | d4b5116d0310e967bfc27aa9e0f6c1b6 * |
| mpa_v20_m200.tar | 242 MB | 3dfb2b312757b800873fb5199c75bc38 |
| mpa_v20_m200.fna.bz2 | 204 MB | ac9ceee100a6c90d9710dd058e364337 |
| mpa_v20_m200.md5 | 51 B | a2043a9d7ea30118c6fca2236afebfb7 |
| mpa_v20_m200.1.bt2 | 291 MB | f08efc1eeb761aec46ff60c13534428c |
| mpa_v20_m200.2.bt2 | 170 MB | be48db657617f164a09906382f43241d |
| mpa_v20_m200.3.bt2 | 9 MB | 1e32afab87500364f434d9df647d08cd |
| mpa_v20_m200.4.bt2 | 170 MB | 26ff7ae43c503568bab9e4eb39724c53 |
| mpa_v20_m200.pkl | 39 MB | febf3a7080fe8119c2a0bb7bf03e69cf |
| mpa_v20_m200.rev.1.bt2 | 291 MB | 3dc8780bb61ce336127a6a07440e1ea8 |
| mpa_v20_m200.rev.2.bt2 | 170 MB | 63f2a73a54379d2c00833e4e75bd3e68 |
\*Uncompressed during offline setup to `uniref90/` and `chocophlan_plus_viral/`
The functional inference workflow currently uses the Illumina paired-end filtered reads (i.e., outputs from the Read Filtering Workflow) and assembled contigs (i.e., outputs from the Assembly Workflow) as its inputs. These files should be located in the `metscale/workflows/data` directory.

The following prokka and ABRicate input files should be located in the `metscale/workflows/data` directory after running the example SRR606249_subset10 dataset through the read filtering and assembly workflows:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2.megahit.contigs.fa | 127 MB |
| SRR606249_subset10_1_reads_trim30.megahit.contigs.fa | 115 MB |
| SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa | 153 MB |
| SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa | 142 MB |
| SRR606249_subset10_1_reads_trim2.rnaspades.transcripts.fasta | 171 MB |
| SRR606249_subset10_1_reads_trim30.rnaspades.transcripts.fasta | 156 MB |
| SRR606249_subset10_1_reads_trim2_k21_33_55.spades.contigs.fa | 150 MB |
| SRR606249_subset10_1_reads_trim30_k21_33_55.spades.contigs.fa | 139 MB |
If these files look good to go, then you may proceed to run the rest of the workflow offline.
The following SRST2 and HUMAnN2 input files should be located in the `metscale/workflows/data` directory after running the example SRR606249_subset10 dataset through the read filtering workflow:
| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2_1.fq.gz | 365 MB |
| SRR606249_subset10_1_reads_trim2_2.fq.gz | 359 MB |
| SRR606249_subset10_1_reads_trim30_1.fq.gz | 313 MB |
| SRR606249_subset10_1_reads_trim30_2.fq.gz | 300 MB |
The workflow is executed based on the samples and workflow parameters specified in the config file. If you wish to change these parameters and/or run your own samples through the functional inference workflow, you can directly edit the `default_workflowconfig.settings` file or create a custom config file in the `metscale/workflows/config/` directory that will override default files/parameters. For more information on how to do this, see Getting Started.

To specify a different default database to run with ABRicate or SRST2, parameters can be edited in the config file. The default database parameter for each ABRicate or SRST2 rule is listed after `"db"` in the config file, with the available options commented above each of the default databases.
Be sure to specify the singularity bindpath before running the functional inference workflow:
cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"
Execution of the functional inference workflow can then be performed with the following command:
snakemake --use-singularity {rules} {other options}
The following rules are available for execution in the functional inference workflow. These rules and their parameters are also listed under "workflows" in the config file:
| Rule | Description |
|---|---|
| functional_with_srst2_workflow | SRST2 runs on the quality filtered reads |
| functional_prokka_with_megahit_workflow | Prokka runs on the contigs generated by MEGAHIT |
| functional_prokka_with_metaspades_workflow | Prokka runs on the contigs generated by metaSPAdes |
| functional_prokka_with_spades_workflow | Prokka runs on the isolate contigs generated by SPAdes |
| functional_prokka_transcriptome_with_rnaspades_workflow | Prokka runs on the transcripts generated by rnaSPAdes |
| functional_prokka_metatranscriptome_with_rnaspades_workflow | Prokka runs on the metatranscripts generated by rnaSPAdes |
| functional_abricate_with_megahit_workflow | ABRicate runs on the contigs generated by MEGAHIT |
| functional_abricate_with_metaspades_workflow | ABRicate runs on the contigs generated by metaSPAdes |
| functional_abricate_with_spades_workflow | ABRicate runs on the contigs generated by SPAdes |
| functional_humann2_workflow | HUMAnN2 and MetaPhlAn2 run on the quality filtered reads |
SRST2 can be executed on all the quality trimmed, paired-end reads from the example SRR606249_subset10 dataset with the default config file and the following command:
snakemake --use-singularity functional_with_srst2_workflow
To execute prokka, each rule can be run independently, or they can be run in tandem by specifying multiple rules in one command. The following command will run all the available rules for prokka with the default config file for metagenomic assemblies:
snakemake --use-singularity functional_prokka_with_megahit_workflow functional_prokka_with_metaspades_workflow
Prokka can also be run on isolate assemblies:
snakemake --use-singularity functional_prokka_with_spades_workflow
Prokka can also be run on transcriptome assemblies:
snakemake --use-singularity functional_prokka_transcriptome_with_rnaspades_workflow
Prokka can also be run on metatranscriptome assemblies:
snakemake --use-singularity functional_prokka_metatranscriptome_with_rnaspades_workflow
Similarly, the ABRicate rules can be run independently or in tandem. The following command will run all available rules for ABRicate with the default config file:
snakemake --use-singularity functional_abricate_with_megahit_workflow functional_abricate_with_metaspades_workflow
ABRicate can also be run on isolate assemblies:
snakemake --use-singularity functional_abricate_with_spades_workflow
HUMAnN2 (with MetaPhlAn2) can be executed on all the quality trimmed, paired-end reads from the example SRR606249_subset10 dataset with the default config file and the following command:
snakemake --use-singularity functional_humann2_workflow
Additional options for snakemake can be found in the snakemake documentation.
To change or specify your own parameters, see Workflow Architecture for more information.
After successful execution of the SRST2 portion of the functional inference workflow, bowtie2 (`*.bt2`) and SAMtools (`*.fai`) will create the following indexes from the reference database sequences in the `metscale/workflows/data/` directory:
| Reference Database Index | Size |
|---|---|
| ARGannot.r3.fasta.1.bt2 | 4.7 MB |
| ARGannot.r3.fasta.2.bt2 | 403 KB |
| ARGannot.r3.fasta.3.bt2 | 17 KB |
| ARGannot.r3.fasta.4.bt2 | 403 KB |
| ARGannot.r3.fasta.fai | 86 KB |
| ARGannot.r3.fasta.rev.1.bt2 | 4.7 MB |
| ARGannot.r3.fasta.rev.2.bt2 | 403 KB |
SRST2 maps reads to each of the reference sequences in the database provided (e.g., ARGannot.r3.fasta) and reports genes or plasmids when their coverage (i.e., the percent of the sequence length that was covered by read mapping) reaches above a certain level. That coverage level has been set to 0% in the functional inference workflow, so all partially detected genes or plasmids will be reported in the SRST2 output files.
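Because the workflow reports everything down to 0% coverage, you may want to post-filter the fullgenes report yourself. A minimal sketch assuming the tab-separated fullgenes columns (Sample, DB, gene, coverage, ...); the 90% cutoff is only an example, not a recommendation:

```python
# Sketch: post-filter an SRST2 fullgenes report by percent coverage,
# since the workflow itself reports hits down to 0% coverage.
# Assumes a tab-separated file with a 'coverage' column.
import csv

def filter_fullgenes(path, min_coverage=90.0):
    """Yield report rows (as dicts) whose 'coverage' meets the threshold."""
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if float(row["coverage"]) >= min_coverage:
                yield row
```

Keeping the filter as a separate step preserves the complete report on disk while letting you tighten or relax the threshold per analysis.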
The following SRST2 output files should be present in the `metscale/workflows/data/` directory:

| Output File | Description |
|---|---|
| {sample}_1_reads_trim{qual}_{database}.srst2__fullgenes__{database}__results.txt | A detailed report with one row per gene per sample, which includes coverage information |
| {sample}_1_reads_trim{qual}_{database}.srst2__genes__{database}__results.txt | A summary report of genes detected by sample |
| {sample}_1_reads_trim{qual}_{database}.srst2.log | SRST2 log file |
| {sample}_1_reads_trim{qual}_{database}.srst2__{sample}_trim{qual}.{database}.sorted.bam | Bowtie2 alignment of reads to the database |
| {sample}_1_reads_trim{qual}_{database}.srst2__{sample}_trim{qual}.{database}.pileup | SAMtools pileup of the alignment |
The following columns are present in the `{sample}_1_reads_trim{qual}_{database}.srst2__fullgenes__{database}__results.txt` report:
Column Name | Description |
---|---|
Sample | Name of the sample being analyzed |
DB | Name of the SRST2 database being used |
gene | Gene name within the database being used |
allele | Allele name within the database being used |
coverage | Percent of the gene length that was covered by reads in the sample |
depth | Mean read depth across the length of all reported alleles |
diffs | Details of any mismatches against the reported allele (holes = sections of the sequence not covered by the alignment due to truncations or large indels) |
uncertainty | Provides more detail when depth was too low to confidently call an allele* |
divergence | Percent divergence from the reference sequence |
length | Length of the reference sequence |
maxMAF | The highest minor allele frequency (MAF) of variants in the alignment |
clusterid | Cluster ID of reference sequence |
seqid | Sequence ID of reference sequence |
annotation | Additional gene annotations, including the NCBI accession number** |
*See SRST2 documentation for more information.
**For ARGannot.r3.fasta, the additional information includes the gene name, the class of antibiotics, the accession number, the gene location, and the gene size in base pairs.
If SRST2 is run with ARGannot.r3.fasta as the database, the "gene" column in the above fullgenes report will contain the gene name followed by "_" and one of the following abbreviations that describes the relevant antibiotic class:
Abbreviation | Antibiotic Class |
---|---|
AGly | Aminoglycosides |
Bla | Beta-lactamases |
Fos | Fosfomycin |
Flq | Fluoroquinolones |
Gly | Glycopeptides |
MLS | Macrolide-Lincosamide-Streptogramin |
Phe | Phenicols |
Rif | Rifampicin |
Sul | Sulfonamides |
Tet | Tetracyclines |
Tmt | Trimethoprim |
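These suffixes can be mapped back to readable class names when summarizing a fullgenes report. A small sketch using the table above (the example gene name `TetA_Tet` is illustrative, not taken from a real report):

```python
# Sketch: map the class suffix in an ARGannot-style gene name
# (e.g. the hypothetical "TetA_Tet") to a human-readable antibiotic
# class, using the abbreviation table above.
CLASSES = {
    "AGly": "Aminoglycosides", "Bla": "Beta-lactamases", "Fos": "Fosfomycin",
    "Flq": "Fluoroquinolones", "Gly": "Glycopeptides",
    "MLS": "Macrolide-Lincosamide-Streptogramin", "Phe": "Phenicols",
    "Rif": "Rifampicin", "Sul": "Sulfonamides", "Tet": "Tetracyclines",
    "Tmt": "Trimethoprim",
}

def antibiotic_class(gene_field):
    """Split 'name_Abbrev' on the last underscore and look up the class.

    Returns None when the field has no recognizable suffix."""
    name, _, abbrev = gene_field.rpartition("_")
    return CLASSES.get(abbrev) if name else None
```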
After successful execution of the prokka portion of the functional inference workflow, the following sub-directories will be created in `metscale/workflows/data/` with the outputs from processing the assembled contigs and transcripts:

| Prokka Rule | Output Directory |
|---|---|
| functional_prokka_with_megahit_workflow | {sample}_1_reads_trim{quality threshold}_megahit.prokka_annotation |
| functional_prokka_with_metaspades_workflow | {sample}_1_reads_trim{quality threshold}_metaspades.prokka_annotation |
| functional_prokka_transcriptome_with_rnaspades_workflow | {sample}_1_reads_trim{quality threshold}_trans_rnaspades.prokka_annotation |
| functional_prokka_metatranscriptome_with_rnaspades_workflow | {sample}_1_reads_trim{quality threshold}_metatrans_rnaspades.prokka_annotation |
Within each of these sub-directories you will find the results reported by prokka in multiple file formats (i.e., .gff, .gbk, .fna, .faa, .ffn, .sqn, .fsa, .tbl, .err, .log, .txt, .tsv). The *.tsv output file contains a list of all features identified, including the following columns:
Column Name | Description |
---|---|
locus_tag | Locus name assigned by prokka |
ftype | Feature type (e.g., ORF/CDS, rRNA) |
length_bp | Length of the gene in base pairs |
gene | Name of the gene it matched, if available |
EC_number | Number assigned by the Enzyme Commission (EC) to its function, if available |
COG | Clusters of Orthologous Groups (COG), which link proteins to categories of known functions, if available |
product | Putative annotation for the open reading frame (ORF) |
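For a quick overview of a prokka run, you can tally the `ftype` column of the *.tsv output. A minimal sketch assuming the tab-separated columns listed above:

```python
# Sketch: tally feature types in a prokka *.tsv output file.
# Assumes tab-separated columns including 'ftype' (e.g. CDS, rRNA).
import csv
from collections import Counter

def feature_counts(tsv_path):
    """Return a Counter over the 'ftype' column, e.g. CDS, rRNA, tmRNA."""
    with open(tsv_path, newline="") as fh:
        return Counter(row["ftype"] for row in csv.DictReader(fh, delimiter="\t"))
```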
NOTE: While prokka directories have unique names, prokka output files are currently generated with one file name prefix. This will be fixed in the future, so samples will be uniquely named within the output directories.
ABRicate will align contigs assembled by MEGAHIT or metaSPAdes to gene sequences within the specified reference database. After successful execution of the ABRicate portion of the functional inference workflow, the following outputs will be generated in the `metscale/workflows/data/` directory:

| Rule | Output File |
|---|---|
| functional_abricate_with_megahit_workflow | {sample}_1_reads_trim{quality threshold}_megahit.abricate_{database}.csv |
| functional_abricate_with_metaspades_workflow | {sample}_1_reads_trim{quality threshold}_metaspades.abricate_{database}.csv |
The ABRicate *.csv output file includes the following columns of information (see the ABRicate documentation for more information):
Column Name | Description |
---|---|
#FILE | Name of file being analyzed |
SEQUENCE | Name of sequence |
START | Start coordinate in the sequence |
END | End coordinate in the sequence |
GENE | Gene name |
COVERAGE | Proportion of the gene in the sequence |
COVERAGE_MAP | A visual representation of gene coverage |
GAPS | Gaps found between the subject and the query |
%COVERAGE | Proportion of gene covered |
%IDENTITY | Proportion of exact nucleotide matches |
DATABASE | Database the sequence came from |
ACCESSION | NCBI accession number for the sequence source |
All detected genes will be reported in the output file, so users should evaluate metrics such as %COVERAGE and %IDENTITY when interpreting the results.
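Because low-coverage hits are not excluded, a simple post-filter on the *.csv report can help. The sketch below keeps only hits at or above 90% coverage and 90% identity; the two rows and the file name `abricate_example.csv` are invented for illustration, so substitute a real *.abricate_{database}.csv output:

```shell
# Made-up ABRicate CSV with the column layout described above.
cat > abricate_example.csv <<'EOF'
#FILE,SEQUENCE,START,END,GENE,COVERAGE,COVERAGE_MAP,GAPS,%COVERAGE,%IDENTITY,DATABASE,ACCESSION
contigs.fa,k99_1,100,1015,tetA,1-916/916,===============,0,100.00,99.89,card,AB123456
contigs.fa,k99_7,5,400,blaTEM,1-396/861,=======........,0,46.00,98.75,card,AB654321
EOF

# Columns 9 and 10 are %COVERAGE and %IDENTITY; keep the header plus
# any row where both are at least 90.
awk -F',' 'NR == 1 || ($9 + 0 >= 90 && $10 + 0 >= 90)' \
    abricate_example.csv > abricate_filtered.csv
cat abricate_filtered.csv
```

The 90% threshold here is only an example; the appropriate cutoff depends on the database and the question being asked of the metagenome.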
After successful execution of the HUMAnN2 portion of the functional inference workflow, bowtie2 will create the following indexes (*.bt2) from the reference database sequences in the metscale/workflows/data/ directory:
Reference Database Index | Size |
---|---|
mpa_v20_m200.1.bt2 | 291 MB |
mpa_v20_m200.2.bt2 | 170 MB |
mpa_v20_m200.3.bt2 | 9.0 MB |
mpa_v20_m200.4.bt2 | 170 MB |
mpa_v20_m200.rev.1.bt2 | 291 MB |
mpa_v20_m200.rev.2.bt2 | 170 MB |
HUMAnN2 execution uses nucleotide mapping and translated search to provide organism-specific gene and pathway abundance profiles from a single metagenome. [MetaPhlAn2](https://huttenhower.sph.harvard.edu/metaphlan) is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data.
The following HUMAnN2 and MetaPhlAn2 output files should be present in the metscale/workflows/data/ directory:
Output File | Description |
---|---|
{sample}_1_reads_trim{qual}_interleaved_reads_genefamilies.tsv | Lists the abundance of each gene family in the community in RPK (Reads Per Kilobase) units. |
{sample}_1_reads_trim{qual}_interleaved_reads_pathabundance.tsv | Lists the abundance of each pathway in the community, also in RPK units as described for gene families. |
{sample}_1_reads_trim{qual}_interleaved_reads_pathcoverage.tsv | Lists the coverage of each pathway in the community. |
{sample}_1_reads_trim{qual}_interleaved_reads_metaphlan_bowtie2.txt | Contains the intermediate bowtie2 mapping results against unique sequence markers. |
{sample}_1_reads_trim{qual}_interleaved_reads_metaphlan_bugs_list.tsv | Contains the final computed organism abundances. |
*See the HUMAnN2 documentation and MetaPhlAn2 documentation for more information.
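For instance, species-level relative abundances can be pulled out of the *_metaphlan_bugs_list.tsv output by filtering on the clade name in the first column. The two organisms and abundances below are hypothetical example rows, not real workflow results:

```shell
# Made-up MetaPhlAn2 bugs list: clade name, then relative abundance (%).
printf '#SampleID\tMetaphlan2_Analysis\n' > bugs_list_example.tsv
printf 'k__Bacteria\t100.0\n' >> bugs_list_example.tsv
printf 'k__Bacteria|p__Proteobacteria\t60.0\n' >> bugs_list_example.tsv
printf 'k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli\t35.0\n' >> bugs_list_example.tsv
printf 'k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Lactococcus|s__Lactococcus_lactis\t25.0\n' >> bugs_list_example.tsv

# Keep clades resolved to species level ("s__") but not strain level ("t__"),
# then report just the species name and its relative abundance.
grep 's__' bugs_list_example.tsv | grep -v 't__' \
    | awk -F'\t' '{ n = split($1, a, "\\|"); print a[n], $2 }' > species.txt
cat species.txt
```

Because each clade line repeats the full lineage, taking the last pipe-delimited field gives the most specific taxon name on that line.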
To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools.
SRST2 is run on reads filtered with a quality score threshold of 2 with the ARGannot_r3.fasta database using the equivalent of this command:
srst2 --input_pe SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --output SRR606249_subset10_trim2 --log --gene_db ARGannot_r3.fasta --min_coverage 0
SRST2 is run on reads filtered with a quality score threshold of 30 with the ARGannot_r3.fasta database using the equivalent of this command:
srst2 --input_pe SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --output SRR606249_subset10_trim30 --log --gene_db ARGannot_r3.fasta --min_coverage 0
Prokka is run on MEGAHIT contigs with the equivalents of these commands:
prokka SRR606249_subset10_trim2.megahit.contigs.fa --metagenome --evalue 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim2.megahit --prefix SRR606249_subset10_trim2.megahit
prokka SRR606249_subset10_trim30.megahit.contigs.fa --metagenome --evalue 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim30.megahit --prefix SRR606249_subset10_trim30.megahit
Prokka is run on metaSPAdes contigs with the equivalents of these commands:
prokka SRR606249_subset10_trim2.metaspades.contigs.fa --metagenome --evalue 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim2.metaspades --prefix SRR606249_subset10_trim2.metaspades
prokka SRR606249_subset10_trim30.metaspades.contigs.fa --metagenome --evalue 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim30.metaspades --prefix SRR606249_subset10_trim30.metaspades
Prokka is run on rnaSPAdes transcripts with the equivalents of these commands:
prokka SRR606249_subset10_trim2.rnaspades.transcripts.fasta --metagenome --evalue 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim2_trans_rnaspades --prefix SRR606249_subset10_trim2_trans_rnaspades
prokka SRR606249_subset10_trim30.rnaspades.transcripts.fasta --metagenome --evalue 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim30_trans_rnaspades --prefix SRR606249_subset10_trim30_trans_rnaspades
Prokka is run on rnaSPAdes metatranscripts with the equivalents of these commands:
prokka SRR606249_subset10_trim2.rnaspades.transcripts.fasta --metagenome --evalue 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim2_metatrans_rnaspades --prefix SRR606249_subset10_trim2_metatrans_rnaspades
prokka SRR606249_subset10_trim30.rnaspades.transcripts.fasta --metagenome --evalue 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim30_metatrans_rnaspades --prefix SRR606249_subset10_trim30_metatrans_rnaspades
ABRicate is run on MEGAHIT contigs with the CARD database using the equivalent of these commands:
abricate --csv --db card SRR606249_subset10_trim2.megahit.contigs.fa > SRR606249_subset10_trim2_megahit.abricate_card.csv
abricate --csv --db card SRR606249_subset10_trim30.megahit.contigs.fa > SRR606249_subset10_trim30_megahit.abricate_card.csv
ABRicate is run on metaSPAdes contigs with the CARD database using the equivalent of these commands:
abricate --csv --db card SRR606249_subset10_trim2.metaspades.contigs.fa > SRR606249_subset10_trim2_metaspades.abricate_card.csv
abricate --csv --db card SRR606249_subset10_trim30.metaspades.contigs.fa > SRR606249_subset10_trim30_metaspades.abricate_card.csv
HUMAnN2 is run on reads filtered with a quality score threshold of 2 with the UniRef90 and ChocoPhlAn databases using the equivalent of this command (HUMAnN2 accepts a single input file, so the trimmed paired-end reads are first interleaved):
humann2 --input SRR606249_subset10_1_reads_trim2_interleaved_reads.fq.gz --output SRR606249_subset10_trim2
HUMAnN2 is run on reads filtered with a quality score threshold of 30 with the UniRef90 and ChocoPhlAn databases using the equivalent of this command (again on a single interleaved input file):
humann2 --input SRR606249_subset10_1_reads_trim30_interleaved_reads.fq.gz --output SRR606249_subset10_trim30