
11. Workflow Architecture


Overview

The MetScale workflows have been developed to run Singularity containers offline via snakemake. This wiki page provides more detail on the architecture of these workflows and the approach that was taken to develop them.

Singularity Containers

Singularity containers have the advantage of making bioinformatics tool installation and analyses more reproducible because each tool and its operational dependencies are packaged together within one Singularity image. Singularity images (with the extension *.sif) are downloaded for use in our workflows and saved in the metscale/container_images/ directory. These images are then run as containers, which contain minimal operating systems and communicate with the host operating system in a specified way. The Singularity bind path in our workflows is set to the metscale/workflows/data directory, which maps the metscale/workflows/data directory on the user's host file system to a directory within the containers. This is how the containers find and write files on the host system, and why all files needed by and produced by the Singularity containers are located in the metscale/workflows/data directory of the user's host file system.
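As a rough illustration of how a bind path maps a host directory into a container (the image name some_tool.sif and the exact paths here are hypothetical; the MetScale setup handles this mapping for the workflows themselves):

singularity exec -B ${PWD}/metscale/workflows/data:/data some_tool.sif ls /data

Any files the tool writes under /data inside the container then appear in metscale/workflows/data on the host.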

Container images usually originate from bioinformatics tool developers submitting their tools to repositories like bioconda, BioContainers, or Docker Hub. Container images are typically released with version numbers, similar to how developers release new versions of their bioinformatics tools. Most Singularity images used within our workflows are derived from quay.io/biocontainers, where images are automatically created when developers submit their tools to bioconda. A couple of Singularity images used within our workflows are derived from Docker Hub; these were cases where the BioContainers version did not pass our initial testing.

Before a new version of an image is integrated into our workflows, it is first independently tested outside of the metscale environment as a Singularity container with the following steps (note that our team uses Singularity version 2.6.1 when running these steps to test individual containers):

  1. Research the tool commands to be used and any databases needed, which are typically described in the tool's documentation
  2. Find the version of the image you wish to test, which is typically created by the tool developer with a version number (e.g., https://quay.io/repository/biocontainers/srst2?tab=tags)
  3. Locally download the database(s) or other files needed, if they are not already included in the container (e.g., quality filtered Illumina paired-end reads to use as SRST2 input files, ARGannot_r3.fasta for the SRST2 database)
  4. Pull the Singularity image in the same directory as the other database(s) or files needed:
singularity pull docker://quay.io/biocontainers/srst2:0.2.0--py27_2
  5. In the same directory, run the Singularity container with the commands researched in step 1:
singularity exec -B ${PWD}/:/tmp srst2_0.2.0--py27_2.sif srst2 --input_pe /tmp/SRR606249_subset10_trim30_1.fq.gz /tmp/SRR606249_subset10_trim30_2.fq.gz --output /tmp/test.out --log --gene_db /tmp/ARGannot_r3.fasta --min_coverage 0

In the above example, the Singularity container was run from the same directory where the image, database, and input files were saved. During testing we tend to execute the Singularity container from the same directory where the *.sif is saved, but if the input files were saved in a subdirectory called "input" on the host file system, then the path to those input files would be listed in the command after /tmp/:

singularity exec -B ${PWD}/:/tmp srst2_0.2.0--py27_2.sif srst2 --input_pe /tmp/input/SRR606249_subset10_trim30_1.fq.gz /tmp/input/SRR606249_subset10_trim30_2.fq.gz --output /tmp/test.out --log --gene_db /tmp/ARGannot_r3.fasta --min_coverage 0

These commands for testing individual Singularity containers are run outside of the metag environment where the metagenomics workflows are executed, and the bind path in the above example was set to /tmp with -B ${PWD}/:/tmp. This is different from metscale/workflows/data, which is where the bind path is set when executing the metagenomics workflows. To check where your bind path is set, you can run this command:

echo $SINGULARITY_BINDPATH
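If the variable is empty or does not point where you expect, it can be set manually before launching the workflows (a sketch, assuming you are in the top-level metscale directory; your environment setup may already do this for you):

export SINGULARITY_BINDPATH=${PWD}/workflows/data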

Another option for testing an individual container is to run specific tool commands from inside the container, which can be done by running commands like this:

singularity shell srst2_0.2.0--py27_2.sif
srst2 --input_pe SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --output test.out --log --gene_db ARGannot_r3.fasta --min_coverage 0

If the container runs as expected and produces results consistent with what was scientifically expected, then it can be integrated into our metagenomics workflows. This applies to adding new containers to the workflows, as well as updating versions of existing containers to capture new updates that a developer has made to their tool. To see a list of the Singularity images and other required files and databases that are currently used in our metagenomics workflows, see the files listed in offline_downloads.json.
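To check which images have already been downloaded locally, you can simply list the image directory (paths assume the default layout described above, run from the directory containing metscale):

ls metscale/container_images/*.sif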

Snakemake Default Setting Files

Snakemake is executed according to specifications in the following default setting files, which are located in the metscale/workflows/config/ directory:

  1. metscale/workflows/config/default_workflowparams.settings - This file specifies the container versions to use (corresponding to container images downloaded during the offline setup), the parameters to run with each container during workflow execution, parameters to run with snakemake rules in each workflow, and the number of threads to use. CPU, Threads, and Cores: By default, each of the workflows is set to run with a specific number of threads for optimal computation. These are specified under "threads" in the default_workflowparams.settings file in the metscale/workflows/config/ directory. It is not recommended to adjust these parameters without a specific need to do so. If you wish to execute snakemake with additional cores, there is a --cores flag built into snakemake that can be run in a command followed by an integer (e.g., 16). The following is an example of how to run this with the read_filtering_pretrim_workflow and the default config file:
snakemake --use-singularity read_filtering_pretrim_workflow --cores 16
  2. metscale/workflows/config/default_workflowconfig.settings - This file specifies which samples to run and how the workflows should be executed (a.k.a. the "config file"). The config file indicates the specific values that snakemake should search for in output file names. Custom config files will override the corresponding default settings in the default_workflowconfig.settings file. Alternatively, sample names and settings can be changed by directly editing the default_workflowconfig.settings default config file. To indicate the use of a custom config file, use the --configfile flag to direct snakemake to the exact location of the custom file (a dry-run variant of this command is sketched just after this list):
snakemake --use-singularity --configfile=config/my_custom_config.json read_filtering_pretrim_workflow --cores 16
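To preview which rules and output files a given configuration will trigger without actually running anything, snakemake's dry-run flag can be added to either of the commands above (a sketch; config/my_custom_config.json is the hypothetical custom config from the example):

snakemake --use-singularity --configfile=config/my_custom_config.json read_filtering_pretrim_workflow --cores 16 --dry-run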

For more information on how to modify and/or create config files, see Getting Started.

Snakefiles

While the snakemake default setting files are important to operating the workflows, it is the Snakefiles that specify how rules are used in the execution of each workflow.

Our rules have been named with the workflow name at the beginning (e.g., read_filtering, assembly, comparison, taxclass, functional), "workflow" at the end, and intervening words that describe the action that the rule is taking (e.g., the read_filtering_multiqc_workflow rule runs MultiQC with outputs from the read filtering workflow). Rules within the snakefiles break each workflow down into smaller steps, and snakemake determines the dependencies between rules based on matching file names. Please see the "Workflow Execution" sections of the wiki pages for more information about the rules that can be executed within each workflow.
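To see the full set of rule names defined in the Snakefiles on your system, snakemake can list them directly (a sketch, assuming the command is run from the directory containing the Snakefile):

snakemake --list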

Workflow Progression: The rules of these workflows are designed to be run in a semi-progressive fashion, since the outputs of a number of rules become the inputs of subsequent rules and workflows. For example, read filtering and adapter trimming are typically performed before taxonomic classification. If you were to execute the taxonomic classification workflow without first running the read filtering workflow, snakemake would recognize that the correct "trimmed" files are not present and would first execute the read filtering workflow to gather the input data it needs to run the taxonomic classification workflow. If the proper Singularity images were not available for the read filtering workflow to be run, an error message would appear and the run would fail.

Another example of this would be executing the functional inference workflow before the assembly workflow. If the following command were run before the assembly workflow:

snakemake --use-singularity functional_abricate_with_megahit_workflow 

Then snakemake would recognize the missing MEGAHIT contigs needed for input in the ABRicate container and automatically kick off the assembly_megahit_workflow rule:

snakemake --use-singularity assembly_megahit_workflow

It would successfully complete these steps if the Singularity images for ABRicate and MEGAHIT were both available.

Just as different workflows were created to build on each other, rules within the same workflow were created to build on each other too. If the read_filtering_multiqc_workflow rule, which aggregates all the FastQC outputs into a single report, is executed before the read_filtering_posttrim_workflow rule, which trims the raw reads and generates the FastQC files, snakemake will recognize the missing files and call back to the read_filtering_posttrim_workflow rule to generate the FastQC files before executing the read_filtering_multiqc_workflow rule.
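For example, running the MultiQC rule on its own (flags mirror the earlier command examples) will prompt snakemake to schedule read_filtering_posttrim_workflow first if the FastQC outputs it needs are missing:

snakemake --use-singularity read_filtering_multiqc_workflow --cores 16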

NOTE: The default behavior of snakemake is to rerun intermediate rules if the result of a terminal rule is older than the result of an intermediate rule (e.g., if contigs have a creation date that is older than trimmed reads with the same sample name, then snakemake will rerun assembly rules with the newer trimmed reads to generate new contigs before running any subsequent rules that use contigs as input). This is meant to ensure that the newest data is being used, rather than allowing the user to run an old assembly that is out of date. If this default snakemake behavior interferes with how users would like to operate the workflows, please let us know and we can modify the default functionality.
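To see what snakemake intends to rerun and why before it happens, a dry run that prints the reason for each scheduled job can help (a sketch; flag availability may vary with your snakemake version):

snakemake --use-singularity --dry-run --reason assembly_megahit_workflow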

File Naming Patterns

Expected patterns for file names are defined in the metscale/workflows/config/default_workflowparams.settings file and followed within each Snakefile for each workflow. As the workflows are currently configured, files will only be recognized at different points in the workflows if they adhere to the following file naming patterns (the file naming patterns below do not apply to raw Illumina data files with the naming convention {Sample}_S*_L*_R{1 or 2}.fastq.gz):

| File Description | Pattern Name | File Pattern |
| --- | --- | --- |
| Paired-end raw reads | pre_trimming_pattern | {sample}_{direction}_reads.fq.gz |
| Pre-trimmed reads | pre_trimming_glob_pattern | _1_reads.fq.gz |
| Sample file suffix | sample_file_ext | .fq.gz |
| Paired-end reads after quality filtering | post_trimming_pattern | {sample}*_trim{qual}_{direction}.fq.gz |
| FastQC *.zip results | fastqc_suffix | {sample}*_trim{qual}_{direction}_fastqc.zip |
| Interleaved reads | interleave_output_pattern | {sample}*_trim{qual}_interleaved_reads.fq.gz |
| Interleaved reads after subsampling | subsample_output_pattern | {sample}*_trim{qual}_subset_interleaved_reads.fq.gz |
| Paired-end reads after subsampling | split_interleaved_output_pattern | {sample}*_trim{qual}_subset{percent}_{direction}.fq.gz |
| Contigs | assembly_pattern | {sample}*_trim{qual}.{assembler}.contigs.fa |
| QUAST results | quast_pattern | {sample}*_trim{qual}.{assembler}_quast/ |

*{sample} is appended with the _1_reads suffix for snakemake to recognize the file input type

Some pattern values have a limited number of options:

  • {direction} = _1 for forward or _2 for reverse
  • {assembler} = megahit or metaspades

The file naming patterns were created to be compatible with default requirements for tools within the workflows (e.g., the assumed suffixes for SRST2 input files are _1 for forward reads and _2 for reverse reads, which by default must be listed at the end of the file name in order for SRST2 to recognize the files as valid inputs).

The file naming patterns are also used to define the patterns that snakemake searches for and builds on throughout the workflows. For example, the following entry within a custom config searches for all input files that match the {sample}*_trim{qual}.{assembler}.contigs.fa pattern with values of {sample} = SRR606249_subset10, {qual} = 2, {qual} = 30, {assembler} = megahit, and {assembler} = metaspades:

        "assembly_quast_workflow" : {
            "assembler" : ["megahit","metaspades"],
            "sample"    : ["SRR606249_subset10"],
            "qual"      : ["2","30"],
         }

The {sample} value was originally introduced in the pre_trimming_pattern, the {qual} value was originally introduced in the post_trimming_pattern, and the {assembler} value was originally introduced in the assembly_pattern. When the assembly_quast_workflow rule is run, it looks for the values of those parameters in specific places within the expected output file naming patterns.
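As a concrete illustration, with the config values above and the assembly_pattern from the table (the * in the pattern can match additional characters, so these exact names assume none are present), snakemake would look for contig files named along the lines of:

SRR606249_subset10_trim2.megahit.contigs.fa
SRR606249_subset10_trim30.megahit.contigs.fa
SRR606249_subset10_trim2.metaspades.contigs.fa
SRR606249_subset10_trim30.metaspades.contigs.fa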