Functional Inference

Workflow Overview

This workflow will perform functional inference (e.g., gene and plasmid detection, microbial pathway analysis) on filtered Illumina paired-end reads with SRST2 version 0.2.0 and with HUMAnN2 version 2.8.1, or on assembled contigs with prokka version 1.14.5 and ABRicate version 0.5. Note that HUMAnN2 will simultaneously run and generate taxonomic outputs for MetaPhlAn2. The functional inference workflow has been tested to run offline after the execution of the Read Filtering Workflow and/or Assembly Workflow. Default HUMAnN2 databases are used within the functional inference workflow, but users have the option of selecting from various pre-compiled or custom databases for SRST2, prokka, and ABRicate.

SRST2, prokka, and ABRicate are intended to be used for the presence/absence detection of genes or plasmids within metagenomes. HUMAnN2 was designed to report the organisms, gene families, and pathways identified within a microbial community. While this can be helpful in understanding the functional content of a metagenome, detecting whole or partial functional elements is not equivalent to definitively detecting antimicrobial resistance, virulence, or other metabolic phenotypes, since phenotype determinations are often more complex than the presence or absence of a single gene or plasmid. Ultimately, the strength of the effect of a single gene or plasmid is variable and dependent on the specific phenotype being evaluated.

Certain features of the functional inference tools are better suited for analyzing isolates with high sequence coverage (e.g., gene variant/allele detection with indel or SNP calling, MLST), and these are not recommended for use with complex metagenomes, where many organisms are likely to have been sequenced at relatively low coverage and to share similar gene/plasmid content. This makes it very difficult to resolve point mutations with high confidence, although it remains feasible to detect the presence or absence of large genome segments (e.g., genes, plasmids) in complex metagenomes. The functional inference workflow is currently configured to report the detection of whole or partial genes and plasmids because functional sequence fragments are expected to be found in complex metagenomes. Lower coverage thresholds have not been set to exclude information from the final reports (e.g., genes will still be reported if they are detected below 90% coverage).

Required Files

If you have not already, you will need to activate your metscale environment and perform the Offline Setup before running the functional inference workflow:

[user@localhost ~]$ conda activate metscale

(metscale)[user@localhost ~]$ cd metscale/workflows

(metscale)[user@localhost workflows]$ python download_offline_files.py --workflow functional_inference  

Singularity Images

In the metscale/container_images/ directory, you should see the following Singularity images that were created when running the functional_inference or all flags during the Offline Setup:

| Singularity Image | Size |
|---|---|
| prokka_1.14.5-1.sif | 1.4 GB |
| srst2_0.2.0--py27_2.sif | 330 MB |
| abricate_latest.sif | 1.3 GB |
| humann2_2.8.1--py27_0.sif | 425 MB |
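
A quick way to confirm the images are in place is to list them (the command below assumes you are in the metscale/workflows/ directory):

ls -lh ../container_images/*.sif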

Databases for Functional Inference

Each tool in the functional inference workflow has its own database(s), which are configured in specific formats as needed to execute the individual tools. Bioinformatics tool authors and container developers may choose to package databases inside their container image, or they may keep the databases outside of the container image. The relatively large taxonomic classification databases are located outside of the taxonomic container images, but the functional inference workflow has some databases inside and some outside of containers.

SRST2 Databases

The SRST2 databases are not included in the SRST2 container image, so they need to be downloaded separately and will appear as individual files in the metscale/workflows/data directory. The databases for SRST2 were downloaded from the SRST2 GitHub during the Offline Setup. The SRST2 GitHub is a good resource for learning more about how these databases were compiled. SRST2 maps reads to the reference gene sequences within the specified database and reports the percent coverage for the detected genes.

You should find the following SRST2 databases in the metscale/workflows/data directory:

| Database Name | Description |
|---|---|
| ARGannot.r3.fasta | The authors of SRST2 recommend using the Antibiotic Resistance Gene-ANNOTation database as the default for detecting resistance genes, so this is currently set as the default database for SRST2 in the config file. A change to this file can specify the use of a non-default SRST2 database. |
| ResFinder.fasta | The authors of SRST2 recommend ResFinder as a secondary option to detect resistance genes. |
| PlasmidFinder.fasta | A PlasmidFinder database is available from the SRST2 authors for plasmid characterization. It identifies plasmids in totally or partially sequenced bacterial isolates. |
| EcOH.fasta | The authors of SRST2 also provide the EcOH database for identifying O and H types in E. coli. This is intended for use with E. coli isolate sequences. |

| Database Name | Size | MD5 Checksum |
|---|---|---|
| ARGannot.r3.fasta | 1.8 MB | 9d01664f4361bb32bea8d23ab1fd7f82 |
| EcOH.fasta | 0.8 MB | 1854d799650599f036f6d9ff8e1816d7 |
| PlasmidFinder.fasta | 0.2 MB | 8b302925c92b8df5580dda6e8dfee3cf |
| ResFinder.fasta | 1.8 MB | 657c10d0b8981cee34d0a9b91513b8a7 |
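
If you want to verify these downloads, a minimal sketch (assuming a Linux environment with md5sum on the PATH) that computes hashes to compare against the MD5 Checksum column above:

cd metscale/workflows/data
md5sum ARGannot.r3.fasta EcOH.fasta PlasmidFinder.fasta ResFinder.fasta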

Running SRST2 with a Custom Database

To run SRST2 with your own custom database, first copy your database (.fasta file) to the metscale/workflows/data/ directory. Then indicate your database name in your config file under functional_with_srst2_workflow. By default, MetScale will run the ARGannot.r3.fasta and ResFinder.fasta databases with SRST2.

        "functional_with_srst2_workflow" : {
            # these parameters determine srst2 input
            "qual"      : ["2","30"],
            #srst2 db's - {ARGannot.r3.fasta, ResFinder.fasta, PlasmidFinder.fasta, EcOH.fasta} or external DBs
            "db"        : ["ARGannot.r3.fasta", "ResFinder.fasta"]

NOTE: It is recommended to follow the exact rules and scripts provided by the SRST2 authors (see the "Generating SRST2-compatible clustered database from raw sequences" section in https://github.com/katholt/srst2) to successfully generate a custom gene database. We recommend running your data through the SRST2 database-generating script csv_to_gene_db.py, per your use case. A custom database that is improperly formatted may produce inaccurate outputs without warning.
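
For reference, the SRST2 README describes the clustered-database header format that csv_to_gene_db.py produces: >[clusterUniqueIdentifier]__[clusterSymbol]__[alleleSymbol]__[alleleUniqueIdentifier], optionally followed by an annotation. A correctly formatted entry would look like the following (the gene, allele, and sequence shown here are hypothetical):

>0__blaTEM__blaTEM-1__0 hypothetical example entry
ATGAGTATTCAACATTTCCGTGTCGCCCTTATTCC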

Prokka Databases

Unlike SRST2, the Singularity images for prokka and ABRicate are preloaded with default databases. Prokka does not use databases to predict microbial genes, but it does use databases to annotate translated proteins from its predicted genes.

Default prokka databases are used within the functional inference workflow, so there are not currently any options to change prokka databases in the config file. Prokka will identify and translate prokaryotic open reading frames from assembled contigs with Prodigal without the use of reference databases, and then compare the translated proteins to databases with functional information.

To view the default databases available within the prokka Singularity image, you can run the following command from inside the metscale/workflows/ directory:

singularity exec -B ${PWD}/:/tmp ../container_images/prokka_1.14.5-1.sif prokka --listdb

The following prokka databases are available:

| Database | Description |
|---|---|
| Kingdoms: Archaea, Bacteria, Mitochondria, Viruses | Kingdoms available for annotations |
| Genera: Enterococcus, Escherichia, Staphylococcus | Genera available for genus-specific annotations |
| HMMs: HAMAP | HAMAP annotation of protein sequences |
| CMs: Archaea, Bacteria, Viruses | Infernal "covariance models" (CMs) for RNA homology searches |

"Kingdoms" and "Genera" represent prokka flags that could be used to limit annotations to specific taxons. These are not options used in the functional inference workflow because we wanted all taxons to be annotated, which is prokka's default behavior.

NOTE: Prokka is run in the functional inference workflow in a way that eliminates any tRNA genes from the final outputs (--notrna), and it will run RNAmmer (--rnammer) to improve the accuracy of rRNA gene predictions. The --metagenome and --eval 1e-06 flags are also being run with prokka to optimize its use with metagenome assemblies.

Running Prokka with a Custom Database

Any assembly can be run through the Prokka workflows with your own custom database. To use a custom database instead of the internal/default one, it must first be copied to your /workflows/data/ directory. You must then change the "input_db" parameter in your default_workflowparams.settings file from the default Prokka database "/NGStools/prokka/db/kingdom/Bacteria/IS" to the name of the custom database that you placed in workflows/data/. This parameter is found under each prokka + assembler rule in your default_workflowparams.settings file:

    "functional_inference" : {
        # params for functional inference workflow
        "prokka_with_megahit" : {  
            "outdir_pattern" : "{sample}_trim{qual}_megahit.prokka_annotation",
            "input_pattern"  : "{sample}_trim{qual}.megahit.contigs.fa",
            "prefix_pattern" : "{sample}_trim{qual}_megahit",
            "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
            "threads" : 16
        },
        "prokka_metatrans_with_rnaspades" : {
            "outdir_pattern" : "{sample}_trim{qual}_metatrans_rnaspades.prokka_annotation",
            "input_pattern"  : "{sample}_trim{qual}.rnaspades.transcripts.fasta",
            "prefix_pattern" : "{sample}_trim{qual}_metatrans_rnaspades",
            "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
            "threads" : 16
        },
        "prokka_trans_with_rnaspades" : {
            "outdir_pattern" : "{sample}_trim{qual}_trans_rnaspades.prokka_annotation",
            "input_pattern"  : "{sample}_trim{qual}.rnaspades.transcripts.fasta",
            "prefix_pattern" : "{sample}_trim{qual}_trans_rnaspades",
            "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
            "threads" : 16
        },
        "prokka_with_metaspades" : {
            "outdir_pattern" : "{sample}_trim{qual}_metaspades.prokka_annotation",
            "input_pattern"  : "{sample}_trim{qual}.metaspades.contigs.fa",
            "prefix_pattern" : "{sample}_trim{qual}_metaspades",
            "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
            "threads" : 16
        },
        "prokka_with_spades" : {
            "outdir_pattern" : "{sample}_trim{qual}_spades.prokka_annotation",
            "input_pattern"  : "{sample}_trim{qual}.spades.contigs.fa",
            "prefix_pattern" : "{sample}_trim{qual}_spades",
            "input_db" : "/NGStools/prokka/db/kingdom/Bacteria/IS",
            "threads" : 16

You can also specify the location of your custom database by adding the following lines to your my_custom_config and illumina_custom_config files after your "read_filtering" section and before your "workflows" section:

{
   "read_filtering" : {
        "read_patterns" : {
            "pre_trimming_pattern"  : "{sample}_{direction}_reads",
            "pre_trimming_glob_pattern"  : "*_1_reads.fq.gz",
            "reverse_pe_pattern_search"  : "1_",
            "reverse_pe_pattern_replace" : "2_",
        },
        "quality_trimming" : {
            "sample_file_ext" : ".fq.gz"
        },
    },
    "functional_inference" : {
        # params for functional inference workflow external database(s)
        "prokka_with_megahit" : {  
            "input_db" : "my_custom_database.fasta",
        },
        "prokka_with_metaspades" : {
            "input_db" : "my_custom_database.fasta",
        },
        "prokka_with_spades" : {
            "input_db" : "my_custom_database.fasta",
        },
    },
    "workflows" : {

ABRicate Databases

ABRicate will align assembled contigs to reference gene sequences within a specified database. The following databases are available:

| Database | Source |
|---|---|
| argannot | Antibiotic Resistance Gene-ANNOTation |
| card | The Comprehensive Antibiotic Resistance Database |
| ncbibetalactamase | Bacterial Antimicrobial Resistance Reference Gene Database |
| plasmidfinder | PlasmidFinder |
| resfinder | ResFinder |
| vfdb | Virulence Factors Database |

To view the databases available within the ABRicate Singularity image, the following command can be run from inside the metscale/workflows/ directory:

singularity exec -B ${PWD}/:/tmp ../container_images/abricate_latest.sif abricate --list

You should see the following database information:

| Database | Number of Sequences | Date |
|---|---|---|
| argannot | 1749 | Mar 17, 2017 |
| card | 2117 | Mar 18, 2017 |
| ncbibetalactamase | 1557 | Mar 17, 2017 |
| plasmidfinder | 263 | Mar 19, 2017 |
| resfinder | 2130 | Mar 17, 2017 |
| vfdb | 2597 | Mar 17, 2017 |

The ABRicate database can be changed in the config file. The author of ABRicate did not have strong recommendations for which database to select as a default, so card and vfdb were set as its default databases in the functional inference workflow.

        "functional_abricate_with_megahit_workflow" : {
            # these parameters determine which contig params
            # from megahit to run abricate on
            "qual"      : ["2","30"],
            # abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
            # replace "customDBabricate" with custom/external DB name 
            "db"        : ["card", "vfdb"],
            #"db"        : ["customDBabricate"],
        },
        "functional_abricate_with_metaspades_workflow" : {
            # these parameters determine which contig params
            # from metaspades to run abricate on
            "qual"      : ["2","30"],
            # abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
            # replace "customDBabricate" with custom/external DB name
            "db"        : ["card", "vfdb"],
            #"db"        : ["customDBabricate"],
        },
        "functional_abricate_with_spades_workflow" : {
            # these parameters determine which contig params
            # from spades to run abricate on
            "qual"      : ["2","30"],
            # abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
            # replace "customDBabricate" with custom/external DB name
            "db"        : ["card", "vfdb"],
            #"db"        : ["customDBabricate"],
        },

NOTE: Although some SRST2 and ABRicate databases have similar names and were derived from the same original sources, they are not equivalent. Nomenclature is different between the two, downloads from the original sources took place at different times, and manual gene annotations and additions were performed by tool authors to different degrees. They are both good tools, but results will vary because of underlying database differences.

If you are missing any of these files, you should re-run the appropriate setup command, as per instructions in the Offline Setup.

Running ABRicate with a Custom Database

The MEGAHIT, metaSPAdes, and SPAdes ABRicate workflows all have the ability to specify a custom database for execution. When running the internal/default database(s), the parameter for the database directory (i.e., db_dir) is set to /opt/abricate/db in the default_workflowparams.settings file. If you want to use your own custom/external database, you will need to copy the custom database to the metscale/workflows/data/ directory, and then update the parameter specifying the location of the default database directory (i.e., db_dir) to sing_dir. This can be done by commenting out the "db_dir" : "/opt/abricate/db/" line(s) shown below in default_workflowparams.settings and activating the "db_dir" : "sing_dir" lines; a sketch of preparing the custom database directory itself follows the config block below:

        "abricate_with_metaspades" : {
            "output_pattern" : "{sample}_trim{qual}_metaspades.abricate_{db}.csv",
            "input_pattern" : "{sample}_trim{qual}.metaspades.contigs.fa",
            # set by default to use internal DB's
            "db_dir" : "/opt/abricate/db/"
            #use external dirs
            #"db_dir" : "sing_dir"
        },
        "abricate_with_megahit" : {
            "output_pattern" : "{sample}_trim{qual}_megahit.abricate_{db}.csv",
            "input_pattern"  : "{sample}_trim{qual}.megahit.contigs.fa",
            # set by default to use internal DB's
            "db_dir" : "/opt/abricate/db/"
            #use external dirs
            #"db_dir" : "sing_dir"
        },
        "abricate_with_spades" : {
            "output_pattern" : "{sample}_trim{qual}_spades.abricate_{db}.csv",
            "input_pattern"  : "{sample}_trim{qual}.spades.contigs.fa",
            # set by default to use internal DB's
            "db_dir" : "/opt/abricate/db/"
            #use external dirs
            #"db_dir" : "sing_dir"
        },
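
Note that ABRicate expects each database to be a directory (named to match the "db" value in the config file) containing a FASTA file named sequences that has been indexed with makeblastdb, per the ABRicate README. A rough sketch of preparing such a directory under metscale/workflows/data/ (the database name customDBabricate and input file my_genes.fasta are illustrative, and BLAST+ is assumed to be installed):

mkdir -p metscale/workflows/data/customDBabricate
cp my_genes.fasta metscale/workflows/data/customDBabricate/sequences
cd metscale/workflows/data/customDBabricate
makeblastdb -in sequences -title customDBabricate -dbtype nucl -hash_index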

You can also specify the location of your custom database by adding the following lines to your my_custom_config and illumina_custom_config file(s) before the "workflows" section:

    },
    "functional_inference" : {

        # params for functional inference workflow external database(s)
        "abricate_with_metaspades" : {
            # set by default to use internal DB's
            #"db_dir" : "/opt/abricate/db/"
            #use external dirs
            "db_dir" : "sing_dir"
        },
        "abricate_with_megahit" : {
            # set by default to use internal DB's
            #"db_dir" : "/opt/abricate/db/"
            #use external dirs
            "db_dir" : "sing_dir"
        },
        "abricate_with_spades" : {
            # set by default to use internal DB's
            #"db_dir" : "/opt/abricate/db/"
            #use external dirs
            "db_dir" : "sing_dir"
    },
    "workflows" : { 

Finally, remember to update the name of the database you would like these workflows to use for each rule under the "workflows" section in your default or custom config file:

        "functional_abricate_with_megahit_workflow" : {
            # these parameters determine which contig params
            # from megahit to run abricate on
            "qual"      : ["2","30"],
            # abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
            # replace "customDBabricate" with custom/external DB name 
            #"db"        : ["card", "vfdb"],
            "db"        : ["customDBabricate"],
        },
        "functional_abricate_with_metaspades_workflow" : {
            # these parameters determine which contig params
            # from metaspades to run abricate on
            "qual"      : ["2","30"],
            # abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
            # replace "customDBabricate" with custom/external DB name
            #"db"        : ["card", "vfdb"],
            "db"        : ["customDBabricate"],
        },
        "functional_abricate_with_spades_workflow" : {
            # these parameters determine which contig params
            # from spades to run abricate on
            "qual"      : ["2","30"],
            # abricate internal DBs - card, vfdb, argannot, resfinder, ncbibetalactamase, plasmidfinder
            # replace "customDBabricate" with custom/external DB name
            #"db"        : ["card", "vfdb"],
            "db"        : ["customDBabricate"],
        },

NOTE: Because of the different database directory paths required, internal and external databases CANNOT be executed at the same time. Remember, if you are planning to run both functional_abricate_with_megahit_workflow and functional_abricate_with_metaspades_workflow, you must update the parameters for both of these rules in your default or custom config file.

HUMAnN2 Databases

The HUMAnN2 databases are not included in the HUMAnN2 container image, so they need to be downloaded separately and will appear as individual files in the metscale/workflows/data directory. The databases for HUMAnN2 were downloaded from the HUMAnN2 GitHub during the Offline Setup. The HUMAnN2 GitHub is a good resource for learning more about how these databases were compiled. HUMAnN2 uses nucleotide mapping and translated search to provide organism-specific gene and pathway abundance profiles from a single metagenome.

You should find the following HUMAnN2 databases in the metscale/workflows/data directory:

| Database Name | Description |
|---|---|
| uniref90 | The authors of HUMAnN2 recommend using the UniRef90 database as the default for gene family annotation, so this is currently set as the default database in the config file. A change to this file can specify the use of a non-default HUMAnN2 database. |
| chocophlan_plus_viral | The authors of HUMAnN2 recommend ChocoPhlAn as the pangenome search database. This is a proprietary database developed by the authors of HUMAnN2. |
| mpa_v20_m200.tar | This is the marker metadata file provided with the MetaPhlAn package. |
| mpa_v20_m200.md5 | This is an md5 hash for the marker metadata file provided with the MetaPhlAn package. |

| Database Name | Size | MD5 Checksum |
|---|---|---|
| uniref90/uniref90_annotated.1.1.dmnd | 11 GB | 55fa7b368f74cfdeb9245ab9412056fe* |
| full_chocophlan_plus_viral.v0.1.1.tar.gz | 5.4 GB | d4b5116d0310e967bfc27aa9e0f6c1b6* |
| mpa_v20_m200.tar | 242 MB | 3dfb2b312757b800873fb5199c75bc38 |
| mpa_v20_m200.fna.bz2 | 204 MB | ac9ceee100a6c90d9710dd058e364337 |
| mpa_v20_m200.md5 | 51 B | a2043a9d7ea30118c6fca2236afebfb7 |
| mpa_v20_m200.1.bt2 | 291 MB | f08efc1eeb761aec46ff60c13534428c |
| mpa_v20_m200.2.bt2 | 170 MB | be48db657617f164a09906382f43241d |
| mpa_v20_m200.3.bt2 | 9 MB | 1e32afab87500364f434d9df647d08cd |
| mpa_v20_m200.4.bt2 | 170 MB | 26ff7ae43c503568bab9e4eb39724c53 |
| mpa_v20_m200.pkl | 39 MB | febf3a7080fe8119c2a0bb7bf03e69cf |
| mpa_v20_m200.rev.1.bt2 | 291 MB | 3dc8780bb61ce336127a6a07440e1ea8 |
| mpa_v20_m200.rev.2.bt2 | 170 MB | 63f2a73a54379d2c00833e4e75bd3e68 |

*Uncompressed during offline setup to uniref90/ and chocophlan_plus_viral/

Input Files

The functional inference workflow currently uses the Illumina paired-end filtered reads (i.e., outputs from the Read Filtering Workflow) and assembled contigs (i.e., outputs from the Assembly Workflow) as its inputs. These files should be located in the metscale/workflows/data directory.

Prokka and ABRicate Input Files

The following prokka and ABRicate input files should be located in the metscale/workflows/data directory after running the example SRR606249_subset10 dataset through the read filtering and assembly workflows:

| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2.megahit.contigs.fa | 127 MB |
| SRR606249_subset10_1_reads_trim30.megahit.contigs.fa | 115 MB |
| SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa | 153 MB |
| SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa | 142 MB |
| SRR606249_subset10_1_reads_trim2.rnaspades.transcripts.fasta | 171 MB |
| SRR606249_subset10_1_reads_trim30.rnaspades.transcripts.fasta | 156 MB |
| SRR606249_subset10_1_reads_trim2_k21_33_55.spades.contigs.fa | 150 MB |
| SRR606249_subset10_1_reads_trim30_k21_33_55.spades.contigs.fa | 139 MB |

If these files are present, you may proceed to run the rest of the workflow offline.

SRST2 and HUMAnN2 Input Files

The following SRST2 and HUMAnN2 input files should be located in the metscale/workflows/data directory after running the example SRR606249_subset10 dataset through the read filtering workflow:

| File Name | File Size |
|---|---|
| SRR606249_subset10_1_reads_trim2_1.fq.gz | 365 MB |
| SRR606249_subset10_1_reads_trim2_2.fq.gz | 359 MB |
| SRR606249_subset10_1_reads_trim30_1.fq.gz | 313 MB |
| SRR606249_subset10_1_reads_trim30_2.fq.gz | 300 MB |

Workflow Execution

The workflow is executed based on the samples and workflow parameters which are specified in the config file. If you wish to change these parameters and/or run your own samples through the functional inference workflow, you can directly edit the default_workflowconfig.settings file or create a custom config file in the metscale/workflows/config/ directory that will override default files/parameters. For more information on how to do this, see Getting Started.

To specify a different default database to run with ABRicate or SRST2, parameters can be edited in the config file. The default database parameter for each ABRicate or SRST2 rule is listed after "db" in the config file with the available options commented above each of the default databases.

Be sure to specify the singularity bindpath before running the functional inference workflow:

cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"  

Execution of the functional inference workflow can then be performed with the following command:

snakemake --use-singularity {rules} {other options}
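
Before launching a long run, it can be helpful to preview the jobs a rule will trigger without executing them, using snakemake's dry-run flag:

snakemake --use-singularity -n functional_with_srst2_workflow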

The following rules are available for execution in the functional inference workflow. These rules and their parameters are also listed under "workflows" in the config file:

| Rule | Description |
|---|---|
| functional_with_srst2_workflow | SRST2 runs on the quality filtered reads |
| functional_prokka_with_megahit_workflow | Prokka runs on the contigs generated by MEGAHIT |
| functional_prokka_with_metaspades_workflow | Prokka runs on the contigs generated by metaSPAdes |
| functional_prokka_with_spades_workflow | Prokka runs on the isolate contigs generated by SPAdes |
| functional_prokka_transcriptome_with_rnaspades_workflow | Prokka runs on the transcripts generated by rnaSPAdes |
| functional_prokka_metatranscriptome_with_rnaspades_workflow | Prokka runs on the metatranscripts generated by rnaSPAdes |
| functional_abricate_with_megahit_workflow | ABRicate runs on the contigs generated by MEGAHIT |
| functional_abricate_with_metaspades_workflow | ABRicate runs on the contigs generated by metaSPAdes |
| functional_abricate_with_spades_workflow | ABRicate runs on the contigs generated by SPAdes |
| functional_humann2_workflow | HUMAnN2 and MetaPhlAn2 run on the quality filtered reads |

SRST2 can be executed on all the quality trimmed, paired-end reads from the example SRR606249_subset10 dataset with the default config file and the following command:

snakemake --use-singularity functional_with_srst2_workflow

To execute prokka, each rule can be run independently, or they can be run in tandem by specifying multiple rules in one command. The following command will run all the available rules for prokka with the default config file for metagenomic assemblies:

snakemake --use-singularity functional_prokka_with_megahit_workflow functional_prokka_with_metaspades_workflow  

Prokka can also be run on isolate assemblies:

snakemake --use-singularity functional_prokka_with_spades_workflow  

Prokka can also be run on transcriptome assemblies:

snakemake --use-singularity functional_prokka_transcriptome_with_rnaspades_workflow

Prokka can also be run on metatranscriptome assemblies:

snakemake --use-singularity functional_prokka_metatranscriptome_with_rnaspades_workflow

Similarly, the ABRicate rules can be run independently or in tandem. The following command will run all available rules for ABRicate with the default config file:

snakemake --use-singularity functional_abricate_with_megahit_workflow functional_abricate_with_metaspades_workflow 

ABRicate can also be run on isolate assemblies:

snakemake --use-singularity functional_abricate_with_spades_workflow  

HUMAnN2 (with MetaPhlAn2) can be executed on all the quality trimmed, paired-end reads from the example SRR606249_subset10 dataset with the default config file and the following command:

snakemake --use-singularity functional_humann2_workflow

Additional options for snakemake can be found in the snakemake documentation.

To change or specify your own parameters, see Workflow Architecture for more information.

Output

SRST2 Output

After successful execution of the SRST2 portion of the functional inference workflow, bowtie2 (*.bt2) and SAMtools (*.fai) will create the following indexes from the reference database sequences in the metscale/workflows/data/ directory:

| Reference Database Index | Size |
|---|---|
| ARGannot.r3.fasta.1.bt2 | 4.7 MB |
| ARGannot.r3.fasta.2.bt2 | 403 KB |
| ARGannot.r3.fasta.3.bt2 | 17 KB |
| ARGannot.r3.fasta.4.bt2 | 403 KB |
| ARGannot.r3.fasta.fai | 86 KB |
| ARGannot.r3.fasta.rev.1.bt2 | 4.7 MB |
| ARGannot.r3.fasta.rev.2.bt2 | 403 KB |

SRST2 maps reads to each of the reference sequences in the database provided (e.g., ARGannot.r3.fasta) and reports genes or plasmids when their coverage (i.e., the percent of the sequence length that was covered by read mapping) reaches above a certain level. That coverage level has been set to 0% in the functional inference workflow, so all partially detected genes or plasmids will be reported in the SRST2 output files.

The following SRST2 output files should be present in the metscale/workflows/data/ directory:

| Output File | Description |
|---|---|
| {sample}_1_reads_trim{qual}_{database}.srst2__fullgenes__{database}__results.txt | A detailed report with one row per gene per sample, which includes coverage information |
| {sample}_1_reads_trim{qual}_{database}.srst2__genes__{database}__results.txt | A summary report of genes detected by sample |
| {sample}_1_reads_trim{qual}_{database}.srst2.log | SRST2 log file |
| {sample}_1_reads_trim{qual}_{database}.srst2__{sample}_trim{qual}.{database}.sorted.bam | Bowtie2 alignment of reads to the database |
| {sample}_1_reads_trim{qual}_{database}.srst2__{sample}_trim{qual}.{database}.pileup | SAMtools pileup of the alignment |

The following columns are present in the {sample}_1_reads_trim{qual}_{database}.srst2__fullgenes__{database}__results.txt report:

| Column Name | Description |
|---|---|
| Sample | Name of the sample being analyzed |
| DB | Name of the SRST2 database being used |
| gene | Gene name within the database being used |
| allele | Allele name within the database being used |
| coverage | Percent of the gene length that was covered by reads in the sample |
| depth | Mean read depth across the length of all reported alleles |
| diffs | Details of any mismatches against the reported allele (holes = sections of the sequence not covered by the alignment due to truncations or large indels) |
| uncertainty | Provides more detail when depth was too low to confidently call an allele* |
| divergence | Percent divergence from the reference sequence |
| length | Length of the reference sequence |
| maxMAF | The highest minor allele frequency (MAF) of variants in the alignment |
| clusterid | Cluster ID of the reference sequence |
| seqid | Sequence ID of the reference sequence |
| annotation | Additional gene annotations, including the NCBI accession number** |

*See SRST2 documentation for more information.

**For ARGannot.r3.fasta, the additional information includes the gene name, the class of antibiotics, the accession number, the gene location, and the gene size in base pairs.
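
Because the minimum coverage threshold is set to 0%, the fullgenes report will include many partial hits. If you want to apply your own coverage filter downstream, a minimal sketch with awk, assuming the tab-delimited column order described above (coverage is the fifth column); the file name is illustrative:

awk -F'\t' 'NR==1 || $5 >= 90' SRR606249_subset10_1_reads_trim30_ARGannot.r3.fasta.srst2__fullgenes__ARGannot.r3.fasta__results.txt

This keeps the header row plus only those genes detected at 90% coverage or higher.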

If SRST2 is run with ARGannot.r3.fasta as the database, the "gene" column in the above fullgenes report will contain the gene name followed by "_" and one of the following abbreviations, which describes the relevant antibiotic class:

| Abbreviation | Antibiotic Class |
|---|---|
| AGly | Aminoglycosides |
| Bla | Beta-lactamases |
| Fos | Fosfomycin |
| Flq | Fluoroquinolones |
| Gly | Glycopeptides |
| MLS | Macrolide-Lincosamide-Streptogramin |
| Phe | Phenicols |
| Rif | Rifampicin |
| Sul | Sulfonamides |
| Tet | Tetracyclines |
| Tmt | Trimethoprim |
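
Given that naming convention, a rough tally of detected genes per antibiotic class can be produced by stripping everything up to the final underscore in the gene column (column 3) of the fullgenes report; the file name is illustrative:

tail -n +2 SRR606249_subset10_1_reads_trim30_ARGannot.r3.fasta.srst2__fullgenes__ARGannot.r3.fasta__results.txt | cut -f3 | sed 's/.*_//' | sort | uniq -c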

Prokka Output

After successful execution of the prokka portion of the functional inference workflow, the following sub-directories will be created in the metscale/workflows/data/ directory with the outputs from processing the assembled contigs and transcripts:

| Prokka Rule | Output Directory |
|---|---|
| functional_prokka_with_megahit_workflow | {sample}_1_reads_trim{quality threshold}_megahit.prokka_annotation |
| functional_prokka_with_metaspades_workflow | {sample}_1_reads_trim{quality threshold}_metaspades.prokka_annotation |
| functional_prokka_transcriptome_with_rnaspades_workflow | {sample}_1_reads_trim{quality threshold}_trans_rnaspades.prokka_annotation |
| functional_prokka_metatranscriptome_with_rnaspades_workflow | {sample}_1_reads_trim{quality threshold}_metatrans_rnaspades.prokka_annotation |

Within each of these sub-directories you will find the results reported by prokka in multiple file formats (i.e., .gff, .gbk, .fna, .faa, .ffn, .sqn, .fsa, .tbl, .err, .log, .txt, .tsv). The *.tsv output file contains a list of all features identified, including the following columns:

| Column Name | Description |
|---|---|
| locus_tag | Locus name assigned by prokka |
| ftype | Feature type (e.g., ORF/CDS, rRNA) |
| length_bp | Length of the gene in base pairs |
| gene | Name of the gene it matched, if available |
| EC_number | Number assigned by the Enzyme Commission (EC) to its function, if available |
| COG | Clusters of Orthologous Groups (COG), which link proteins to categories of known functions, if available |
| product | Putative annotation for the open reading frame (ORF) |

NOTE: While prokka directories have unique names, prokka output files are currently generated with one file name prefix. This will be fixed in the future, so samples will be uniquely named within the output directories.
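
As a quick sanity check on an annotation run, you can tally the feature types prokka reported from within one of the output directories listed above (the directory name below is illustrative, and the *.tsv glob assumes one .tsv report per directory):

cd SRR606249_subset10_1_reads_trim2_megahit.prokka_annotation
cut -f2 *.tsv | sort | uniq -c | sort -rn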

ABRicate Output

ABRicate will align contigs assembled by MEGAHIT or metaSPAdes to gene sequences within the specified reference database. After successful execution of the ABRicate portion of the functional inference workflow, the following outputs will be generated in the metscale/workflows/data/ directory:

| Rule | Output File |
|---|---|
| functional_abricate_with_megahit_workflow | {sample}_1_reads_trim{quality threshold}_megahit.abricate_{database}.csv |
| functional_abricate_with_metaspades_workflow | {sample}_1_reads_trim{quality threshold}_metaspades.abricate_{database}.csv |

The ABRicate *csv output file includes the following columns of information (see ABRicate documentation for more information):

| Column Name | Description |
|---|---|
| #FILE | Name of the file being analyzed |
| SEQUENCE | Name of the sequence |
| START | Start coordinate in the sequence |
| END | End coordinate in the sequence |
| GENE | Gene name |
| COVERAGE | Proportion of the gene in the sequence |
| COVERAGE_MAP | A visual representation of gene coverage |
| GAPS | Gaps found between the subject and the query |
| %COVERAGE | Proportion of the gene covered |
| %IDENTITY | Proportion of exact nucleotide matches |
| DATABASE | Database the sequence came from |
| ACCESSION | NCBI accession number for the sequence source |

All detected genes will be reported in the output file, but users can evaluate metrics such as %coverage and %identity when interpreting the results.
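
For example, to restrict a report to higher-confidence hits, you could filter on those two columns; a minimal sketch with awk, assuming the comma-separated column order described above (%COVERAGE and %IDENTITY are columns 9 and 10) and an illustrative file name:

awk -F',' 'NR==1 || ($9 >= 90 && $10 >= 90)' SRR606249_subset10_1_reads_trim30_megahit.abricate_card.csv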

HUMAnN2 Output

After successful execution of the HUMAnN2 portion of the functional inference workflow, bowtie2 (*.bt2) will create the following indexes from the reference database sequences in the metscale/workflows/data/ directory:

| Reference Database Index | Size |
|---|---|
| mpa_v20_m200.1.bt2 | 291 MB |
| mpa_v20_m200.2.bt2 | 170 MB |
| mpa_v20_m200.3.bt2 | 9.0 MB |
| mpa_v20_m200.4.bt2 | 170 MB |
| mpa_v20_m200.rev.1.bt2 | 291 MB |
| mpa_v20_m200.rev.2.bt2 | 170 MB |

HUMAnN2 execution uses nucleotide mapping and translated search to provide organism-specific gene and pathway abundance profiles from a single metagenome. [MetaPhlAn2](https://huttenhower.sph.harvard.edu/metaphlan) is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes, and Viruses) from metagenomic shotgun sequencing data.

The following HUMAnN2 and MetaPhlAn2 output files should be present in the metscale/workflows/data/ directory:

| Output File | Description |
|---|---|
| {sample}_1_reads_trim{qual}_interleaved_reads_genefamilies.tsv | This file lists the abundances of each gene family in the community in RPK (Reads Per Kilobase) units. |
| {sample}_1_reads_trim{qual}_interleaved_reads_pathabundance.tsv | This file lists the abundances of each pathway in the community, also in RPK units as described for gene families. |
| {sample}_1_reads_trim{qual}_interleaved_reads_pathcoverage.tsv | This file lists the coverage of each pathway in the community. |
| {sample}_1_reads_trim{qual}_interleaved_reads_metaphlan_bowtie2.txt | This file contains the intermediate mapping results to unique sequence markers. |
| {sample}_1_reads_trim{qual}_interleaved_reads_metaphlan_bugs_list.tsv | This file contains the final computed organism abundances. |

*See the HUMAnN2 documentation and MetaPhlAn2 documentation for more information.
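
HUMAnN2 also ships utility scripts for post-processing these tables. As one example (assuming the HUMAnN2 utilities are available in your environment, e.g., via singularity exec on the HUMAnN2 image), gene family abundances can be renormalized from RPK to relative abundance with humann2_renorm_table; the file names are illustrative:

humann2_renorm_table --input SRR606249_subset10_1_reads_trim30_interleaved_reads_genefamilies.tsv --output SRR606249_subset10_1_reads_trim30_interleaved_reads_genefamilies_relab.tsv --units relab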

Additional Information

Command Line Equivalents

To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools.

SRST2 is run on reads filtered with a quality score threshold of 2 with the ARGannot.r3.fasta database using the equivalent of this command:

srst2 --input_pe SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --output SRR606249_subset10_trim2 --log --gene_db ARGannot.r3.fasta --min_coverage 0

SRST2 is run on reads filtered with a quality score threshold of 30 with the ARGannot.r3.fasta database using the equivalent of this command:

srst2 --input_pe SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --output SRR606249_subset10_trim30 --log --gene_db ARGannot.r3.fasta --min_coverage 0

Prokka is run on MEGAHIT contigs with the equivalents of these commands:

prokka SRR606249_subset10_trim2.megahit.contigs.fa --metagenome --eval 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim2.megahit --prefix SRR606249_subset10_trim2.megahit
prokka SRR606249_subset10_trim30.megahit.contigs.fa --metagenome --eval 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim30.megahit --prefix SRR606249_subset10_trim30.megahit

Prokka is run on metaSPAdes contigs with the equivalents of these commands:

prokka SRR606249_subset10_trim2.metaspades.contigs.fa --metagenome --eval 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim2.metaspades --prefix SRR606249_subset10_trim2.metaspades
prokka SRR606249_subset10_trim30.metaspades.contigs.fa --metagenome --eval 1e-06 --notrna --rnammer --outdir SRR606249_subset10_trim30.metaspades --prefix SRR606249_subset10_trim30.metaspades

Prokka is run on rnaSPAdes transcripts with the equivalents of these commands:

prokka SRR606249_subset10_trim2.rnaspades.transcripts.fasta --metagenome --eval 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim2_trans_rnaspades --prefix SRR606249_subset10_trim2_trans_rnaspades
prokka SRR606249_subset10_trim30.rnaspades.transcripts.fasta --metagenome --eval 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim30_trans_rnaspades --prefix SRR606249_subset10_trim30_trans_rnaspades

Prokka is run on rnaSPAdes metatranscripts with the equivalents of these commands:

prokka SRR606249_subset10_trim2.rnaspades.transcripts.fasta --metagenome --eval 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim2_metatrans_rnaspades --prefix SRR606249_subset10_trim2_metatrans_rnaspades
prokka SRR606249_subset10_trim30.rnaspades.transcripts.fasta --metagenome --eval 1e-06 --notrna --rnammer --rawproduct --compliant --outdir SRR606249_subset10_trim30_metatrans_rnaspades --prefix SRR606249_subset10_trim30_metatrans_rnaspades

ABRicate is run on MEGAHIT contigs with the CARD database using the equivalent of these commands:

abricate --csv --db card SRR606249_subset10_trim2.megahit.contigs.fa > SRR606249_subset10_trim2_megahit.abricate_card.csv
abricate --csv --db card SRR606249_subset10_trim30.megahit.contigs.fa > SRR606249_subset10_trim30_megahit.abricate_card.csv

ABRicate is run on metaSPAdes contigs with the CARD database using the equivalent of these commands:

abricate --csv --db card SRR606249_subset10_trim2.metaspades.contigs.fa > SRR606249_subset10_trim2_metaspades.abricate_card.csv
abricate --csv --db card SRR606249_subset10_trim30.metaspades.contigs.fa > SRR606249_subset10_trim30_metaspades.abricate_card.csv

HUMAnN2 is run on reads filtered with a quality score threshold of 2 with the UniRef90 and ChocoPhlAn databases using the equivalent of this command:

humann2 --input SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --output SRR606249_subset10_trim2

HUMAnN2 is run on reads filtered with a quality score threshold of 30 with the UniRef90 and ChocoPhlAn databases using the equivalent of this command:

humann2 --input SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --output SRR606249_subset10_trim30