Snakemake pipeline for shotgun data QC

A workflow for assessing the quality of many shotgun sequence samples simultaneously using Snakemake and MultiQC.

It modifies some code from jlanga's snakemake khmer/trinity workflow.

Installation and usage

Currently, this tool depends on a recent version of Snakemake and runs:

Trimmomatic (v. 0.36)
FastQC (≥ v. 0.11)
MultiQC
Bowtie2

The current version also has some provisions for running HUMAnN2 to look at functional profiles from the metagenomes.

Snakemake reads the file config.yaml for several variables, including location of relevant executables and parameters for Trimmomatic execution.

Forward and reverse reads currently have to be directly specified per sample in the config.yaml file. For files produced by the UCSD IGM facility, I have included a Jupyter Notebook to create this file from the sample manifest Excel spreadsheet provided to IGM.
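As a rough illustration of what "directly specified per sample" could look like, here is a minimal Python sketch that builds a per-sample mapping of forward and reverse read paths. The key names (samples, forward, reverse), the filename suffixes, and the sample names are all assumptions made for this example; check the template config.yaml for the structure the Snakefile actually expects.

```python
import json
from pathlib import PurePosixPath


def build_sample_config(read_dir, samples,
                        fwd_suffix="_R1.fastq.gz", rev_suffix="_R2.fastq.gz"):
    """Map each sample name to explicit forward/reverse read paths.

    The 'samples'/'forward'/'reverse' keys and the suffixes are illustrative
    assumptions, not the schema used by the real config.yaml.
    """
    read_dir = PurePosixPath(read_dir)
    config = {"samples": {}}
    for name in samples:
        config["samples"][name] = {
            "forward": str(read_dir / f"{name}{fwd_suffix}"),
            "reverse": str(read_dir / f"{name}{rev_suffix}"),
        }
    return config


# Hypothetical sample names; in practice these would come from the manifest.
cfg = build_sample_config("example/reads", ["sample1", "sample2"])
print(json.dumps(cfg, indent=2))
```

The printed structure is JSON, which most YAML parsers will also read, so it can serve as a starting point for a hand-edited config file.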

If executed directly using snakemake --snakefile Snakefile --configfile config.yaml, the entire workflow will be run locally. Alternatively, the workflow can be executed in a Torque cluster environment with bash launch.sh [dir] --snakefile Snakefile --configfile config.yaml. This will parse cluster.json for job submission parameters and run each rule instance as a separate cluster job.

Usage notes

Snakemake requires that inputs and outputs be specified explicitly to construct the workflow graph. In practice, however, I have found that some samples in the sequencing manifest do not show up in the demultiplexed sequence files, or do not yield all possible output files from Trimmomatic (for example, if no R1 reads survive trimming). For this reason, I have been invoking this workflow with the --keep-going flag, which lets Snakemake continue with jobs that do not depend on a failed job rather than stopping at the first failure.
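A complementary check, separate from anything the workflow itself does, is to look for missing or empty read files before launching. The sketch below assumes the per-sample mapping has already been loaded into a Python dict of the form {sample: {"forward": path, "reverse": path}}; that layout and the function name find_missing_reads are hypothetical.

```python
from pathlib import Path


def find_missing_reads(samples):
    """Return {sample: [paths]} for read files that are absent or empty.

    `samples` is assumed to look like {name: {"forward": path, "reverse": path}};
    this layout is an illustration, not the real config.yaml schema.
    """
    missing = {}
    for name, reads in samples.items():
        absent = [
            path for path in reads.values()
            if not Path(path).is_file() or Path(path).stat().st_size == 0
        ]
        if absent:
            missing[name] = absent
    return missing


# Hypothetical paths; the result depends on what exists on disk.
report = find_missing_reads({
    "sample1": {
        "forward": "example/reads/sample1_R1.fastq.gz",
        "reverse": "example/reads/sample1_R2.fastq.gz",
    },
})
for sample, paths in report.items():
    print(sample, "is missing:", ", ".join(paths))
```

Samples flagged here are the ones likely to produce failed jobs, which is where --keep-going comes into play.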

Currently, the test data are completely artificial, very short sequences. So that these survive trimming, the template config.yaml file has a very permissive length filter for Trimmomatic (MINLEN:3). For real data, a better default is MINLEN:32.
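To make the effect of the length filter concrete, here is a small Python sketch of the rule that Trimmomatic's MINLEN step applies: reads shorter than the threshold are dropped. It only models that one rule; the toy reads are made up and the function name apply_minlen is hypothetical.

```python
def apply_minlen(reads, minlen):
    """Keep reads whose length is at least `minlen` (the MINLEN rule only)."""
    return [read for read in reads if len(read) >= minlen]


# Made-up reads of length 5, 3, and 2, standing in for very short test data.
toy_reads = ["ACGTA", "ACG", "AC"]

print(apply_minlen(toy_reads, 3))   # ['ACGTA', 'ACG']
print(apply_minlen(toy_reads, 32))  # []
```

With MINLEN:32, none of the toy reads would survive, which is why the template uses MINLEN:3 for the artificial test data.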

Finally, note that disk access-intensive steps are set to run on a temporary directory to allow execution on local scratch space in a cluster environment. This variable is called TMP_DIR_ROOT in the config.yaml, and should be set to the local scratch directory to enable this behavior.
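The general pattern behind this setting can be sketched in Python: do the heavy I/O in a throwaway directory under the scratch root, then copy only the final result back. This is an illustration of the idea, not the code in the Snakefile; the function name run_in_scratch and the work callback are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_scratch(tmp_dir_root, output_path, work):
    """Run `work(scratch_dir)` under tmp_dir_root and copy its result out.

    `work` must return the path of the file it produced inside the scratch
    directory. The scratch directory is deleted when the block exits.
    """
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=tmp_dir_root) as scratch:
        result = work(Path(scratch))
        shutil.copy(result, output_path)
    return output_path


def example_step(scratch_dir):
    # Stand-in for a disk-intensive tool writing into local scratch space.
    out_file = scratch_dir / "intermediate.txt"
    out_file.write_text("trimmed reads would go here\n")
    return out_file


# tempfile.gettempdir() stands in for the value of TMP_DIR_ROOT here.
final = run_in_scratch(tempfile.gettempdir(),
                       Path(tempfile.mkdtemp()) / "result.txt",
                       example_step)
print("result copied to", final)
```

Pointing the scratch root at node-local storage keeps intermediate files off the shared filesystem, which is the behavior TMP_DIR_ROOT is meant to enable.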

How to run

On Barnacle, we want to avoid running compute-intensive jobs on the login node. That's what happens if we just run the included Snakefile without any additional information about how to access the cluster.

local execution (DON'T DO THIS):

snakemake --configfile config_Run1.yaml

Instead, I've provided a launch.sh script that is set up with some defaults chosen to improve execution on our cluster. Here's how you run it:

cluster execution (DO THIS):

bash launch.sh ./ --configfile config_Run1.yaml

Here's what's going on behind the scenes in launch.sh to invoke the Snakemake workflow:

snakemake -j 16 \
--local-cores 4 \
-w 90 \
--max-jobs-per-second 8 \
--cluster-config cluster.json \
--cluster "qsub -k eo -m n -l nodes=1:ppn={cluster.n} -l mem={cluster.mem}gb -l walltime={cluster.time}" \
--directory "$@"

Let's go through what each of these parameters does.

-j 16: Runs no more than 16 jobs concurrently. If you have 96 samples that each need to get FastQC'd, it will only run 16 of these jobs at a time.

--local-cores 4: For rules specified as local rules (like linking files), which run on the submitting host rather than as cluster jobs, uses at most 4 CPUs at a time.

-w 90: Short for --latency-wait 90. Waits for at most 90 seconds after a job executes for the output files to be available. This has to do with tolerating latency on the filesystem: sometimes a file is created by a job but isn't immediately visible to the Snakemake process that's scheduling things.

--max-jobs-per-second 8: Limits the rate at which Snakemake is sending jobs to the cluster.

--cluster-config cluster.json: Looks in the current directory for a file called cluster.json that contains information about how many resources to request from the cluster for each rule type.

--cluster "qsub -k eo [...]": This tells Snakemake how to send a job to the cluster scheduler, and how to request the specific resources defined in the cluster.json file.

--directory "$@": This passes all the input provided after bash launch.sh as further input to Snakemake. Because it comes right after the --directory flag, it's going to expect the first element of that input to be the path to the working directory where Snakemake should execute.
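To tie the last few parameters together, here is a Python sketch of how the {cluster.n}, {cluster.mem}, and {cluster.time} placeholders can be thought of as being filled from cluster.json, and how the arguments given to launch.sh end up after --directory. It imitates the behavior rather than reproducing Snakemake's internals. The resource values and the rule name fastqc are made up, and the merging of a __default__ entry with per-rule entries is an assumption about how the file is laid out.

```python
import shlex
from types import SimpleNamespace

# The qsub template from launch.sh, with its {cluster.*} placeholders.
CLUSTER_TEMPLATE = (
    "qsub -k eo -m n -l nodes=1:ppn={cluster.n} "
    "-l mem={cluster.mem}gb -l walltime={cluster.time}"
)

# Hypothetical cluster.json contents; real values will differ.
EXAMPLE_CLUSTER_CONFIG = {
    "__default__": {"n": 1, "mem": 4, "time": "01:00:00"},
    "fastqc": {"n": 4, "mem": 8},
}


def render_submit_command(rule_name, cluster_config, template=CLUSTER_TEMPLATE):
    """Fill the template with defaults overridden by rule-specific settings."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule_name, {}))
    return template.format(cluster=SimpleNamespace(**settings))


def build_snakemake_argv(launch_args):
    """Mimic launch.sh: fixed options, then every argument the user passed."""
    return [
        "snakemake", "-j", "16", "--local-cores", "4", "-w", "90",
        "--max-jobs-per-second", "8", "--cluster-config", "cluster.json",
        "--cluster", CLUSTER_TEMPLATE, "--directory", *launch_args,
    ]


print(render_submit_command("fastqc", EXAMPLE_CLUSTER_CONFIG))
print(shlex.join(build_snakemake_argv(["./", "--configfile", "config_Run1.yaml"])))
```

Because the user's arguments are appended directly after --directory, the first one (./ in the example) is consumed as the working directory and the rest are read as ordinary Snakemake options.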

