An RNA-seq pipeline for the UMich Bioinformatics Core
The official repository is located at https://github.com/umich-brcf-bioinf/Watermelon
- Overview
- Getting Started - Installation
- Getting Started - Conda Environment
- Walkthrough - Alignment and QC Example
- Walkthrough - Differential Expression Example
- Further Reading
Watermelon is an RNAseq pipeline that produces alignments, QC data, feature counts, and differential expression results. Watermelon uses snakemake, which in turn wraps calls to the various bioinformatic tools in the pipeline.
There are two main steps:
- Running waterlemon_init: Creates template config.yaml file
- Running the pipeline: Using the snakemake utility, an RNA-seq workflow (either alignment + QC workflow or differential expression workflow) is evaluated accordingly.
UMich Bioinformatics Core and Advanced Genomics Core can use the installation located at:
/nfs/turbo/umms-brcfpipeline/pipelines/Watermelon/
This is kept up to date with the master branch of this repo, so installation is already complete
Alternatively, you can clone this repository
# To install in your home directory
cd ~/
git clone https://github.com/umich-brcf-bioinf/Watermelon
In order to have access to watermelon_init (which will set up an example config and prepare for the pipeline to be run), you must make sure that the watermelon location is in your PATH environment variable. To achieve this one-time, execute the following. For permanent effect, place the same command in your ~/.bash_profile or ~/.bashrc
# If you're UMich BfxCore or Advanced Genomics Core:
PATH=$PATH:/nfs/turbo/umms-brcfpipeline/pipelines/Watermelon/
# Or if you've cloned to your home directory
PATH=$PATH:/home/$USER/Watermelon/
The watermelon conda environment has all of the required software to start the pipeline. This environment is unlikely to change frequently, and so will only need to be rebuilt at a maximum of once per release. After the environment is built, it can be activated whenever you want to run the pipeline, and later deactivated when it is no longer needed. Note: If you don't already have anaconda3/miniconda3, then I'd recommend installing miniconda3 - Link to conda installation instructions
There is a .yaml file in this repo which can be used to build the watermelon conda environment
# Navigate to Watermelon (BfxCore/MAGC)
cd /nfs/turbo/umms-brcfpipeline/pipelines/Watermelon/
# Alternatively, if you cloned it to your home dir, navigate there
cd ~/Watermelon
# Create the conda environment from the .yaml file
conda env create -f envs/watermelon.yaml
After the conda environment is created, you can activate it with
conda activate watermelon
This environment provides the basic software requirements to run watermelon_init and snakemake. It can be deactivated later with
conda deactivate
This repository contains a set of simulated paired-end read data which will be used for this example. The provided example configuration file and samplesheet used here are set up to run this on this example without any modification, for illustrative purposes. For more examples, see the "Further Reading" section below.
The first step is to run watermelon_init, which will set up the analysis in the project directory where you invoke it.
For this example, you will need to provide:
- genome build
- e.g. GRCh38
- Available genomes
GRCh38, GRCh37, GRCm38, NCBIM37, Rnor_6.0, GRCz11, BDGP6, WBcel235
- These are ENSEMBL reference IDs. Table of equivalent UCSC & ENSEMBL IDs
- There are also the special genome build options
Other
&TestData
- The link to alternative reference example below outlines using
Other
for customized references - The current example uses the
TestData
genome build, which is a human chr22 reference included with this repository.
- The link to alternative reference example below outlines using
- project ID
- e.g. given
20190821
, a_20190821
suffix will be applied to config name and analysis results dir
- e.g. given
- config type
- Either
align_qc
ordiffex
. Will produce a different config file depending on the given type
- Either
- path to at least one sample directory with fastq files
- e.g.
/path/to/Watermelon/data/sim_reads_human_chr22
- Path given on command line must point to a directory containing fastqs. Watermelon_init will attempt to create a shell glob pattern for each of the samples and place this into the generated samplesheet
- Fastq files must have
_R1
or_R2
in their filenames to distinguish between read 1 and read 2. Good filename endings would be_R1.fastq.gz
and_R2.fastq.gz
- e.g.
Here's an example of running waterlemon_init (Note: replace /path/to/Watermelon in the following code block with valid paths)
# Start a screen session (for persistence over ssh):
screen -S watermelon_20190821
# Activate the conda environment:
conda activate watermelon
# Create a project directory & navigate there
mkdir ~/watermelon_example_project
cd ~/watermelon_example_project
# Now run watermelon_init
watermelon_init.py --genome_build TestData --project_id 20190821 --type align_qc --input_run_dirs /path/to/Watermelon/data/sim_reads_human_chr22
Now is a good time to review the output from watermelon_init. It generates the following:
- config_20190821.yaml : example configuration file for the pipeline
- samplesheet.csv : A CSV samplesheet (auto-generated from the input_run_dirs contents) containing a
sample
column and arun_dir
column to be used by the pipeline - Watermelon : Directory containing a copy of the pipeline code
Now you can perform a dry-run, which will validate that the config is compatible with the workflow.
# Singularity must be available to snakemake, for environment management under the hood
module load singularity
# Dry-run to validate the config and check the execution plan:
snakemake --dryrun --printshellcmds --configfile config_20190821.yaml --snakefile Watermelon/align_qc.smk
You should still be in the project directory, and ready to run the pipeline.
To run on bfx-comp5/6 (notice the profile):
snakemake --configfile config_20190821.yaml --snakefile Watermelon/align_qc.smk --profile Watermelon/config/profile-comp5-6
Similarly, to run the pipeline on the GreatLakes compute cluster:
# Singularity must be available to snakemake, for environment management under the hood
module load singularity
snakemake --configfile config_20190821.yaml --snakefile Watermelon/align_qc.smk --profile Watermelon/config/profile-greatlakes
A differential expression workflow starts out the same, but differs in execution:
- Use watermelon_init to create a config file for the differential expression analysis
- Inspect the generated config file, modify differential expression details based on experimental design
- Run the deseq2_analysis.R script using the WAT_diffex singularity image, which provides a standardized compute environment
When running watermelon_init, as before you'll provide a genome build, a project ID, and a config type. Since it will be the diffex
type, we will also give it a count matrix and a samplesheet. I will take the same samplesheet that was generated earlier, and add an additional column treatment
with half of the samples labeled control
and the other half drug
. In this example, the config generated will be compatible with the given samples. In a real use-case scenario, the config file should be treated as a template, and edited so that it matches the project plan and dataset that it applies to.
Before running watermelon_init, I'll grab the count matrix (excluding annotation columns).
cut -f1,5- analysis_20190821/deliverables/counts/gene_expected_count.annot.txt > counts_ready.txt
Count matrix details:
- Text file of tab-separated values
- Row for each gene, matching the IDs of the GTF used in genome_build
- Column for each sample with count values, with column names matching the samples in the samplesheet
Note: It is important that the count matrix only contains the above information. If annotation columns are present (such as description), these must be removed before running the pipeline
Running watermelon_init for diffex
type:
watermelon_init.py --genome_build TestData --project_id 20190821d --type diffex --sample_sheet samplesheet.csv --count_matrix counts_ready.txt
Notes: It will prompt about overwriting Watermelon. In this example, it's inconsequential. In real use-cases, use your judgement and move/rename things if overwrite isn't desired. Also, you might've noticed the slightly altered project_id in this example (suffixed with d) to prevent overwriting the existing config and/or mixing the results from the workflows.
After running this, you'll have:
- config_20190821d.yaml : a
diffex
type configuration file for the differential expression workflow. In this example, it needs no modification to work with the differential expression workflow. In a real use-case scenario, you will modify this config file according to the project plan. - Watermelon : Directory containing a copy of the pipeline code
For this example we should modify our samplesheet that we auto-generated earlier, so that it matches our example analysis. In a real example, you'd likely have your samplesheet already set up with phenotypes and conditions, and you would then modify the config to match. Here, our example analysis compares two treatment
s, control
and drug
. So let's add a column to our samplesheet with this information. We could remove or exclude the input_glob column, since it's not used in the diffex analysis, but we'll leave it in place now because that's slightly easier.
sample,input_glob,treatment
sample_01,...,control
sample_02,...,control
sample_03,...,control
sample_04,...,control
sample_05,...,drug
sample_06,...,drug
sample_07,...,drug
sample_08,...,drug
Now we can run the deseq2_analysis.R script. Since we're using a singularity image, we'll first set an environment variable that will define a read-only bind mount for the annotation data.
# Define the read-only bind mount to our reference data
export SINGULARITY_BIND="/nfs/turbo/umms-brcfpipeline/references:/nfs/turbo/umms-brcfpipeline/references:ro"
# Run the deseq2_analysis.R script, providing environment with the singularity image
singularity exec docker://umichbfxcore/wat_diffex:0.7.0 Rscript Watermelon/scripts/deseq2_analysis.R --threads 8 --configfile config_20190821d.yaml --markdownfile Watermelon/report/report_diffex.Rmd 2>&1 | tee deseq2_$(date +%FT%H%M%S).log
When this runs, by default it will run the DESeq2 analysis and save the outputs, including a .Rdata
file, and then it will knit a report using these outputs.
Note: If it is desired to run just the analysis portion of the script or just the reporting portion of the script,
deseq2_analysis.R
has command line flags to allow this behavior,--no_knit
and--no_analysis
, respectively.
Note: It is also possible to use the singularity image with an RStudio session, similar to how it's described on our single cell analysis environment page, if an interactive session is desired.
Next we will edit (if desired) and finalize the report. Inspect report_draft.html
, and make any desired edits to the report_draft.md
that you need. After you're satisfied with the edits, knit this report_draft.md
into our final report.
# Run the deseq2_analysis.R script again, this time with
# our report_draft.md as the --markdownfile
# and with the flag --report_finalize
#
singularity exec docker://umichbfxcore/wat_diffex:0.7.0 Rscript Watermelon/scripts/deseq2_analysis.R --configfile config_20190821d.yaml --markdownfile analysis_20190821d/report/report_draft.md --report_finalize
Finally, we will move the deliverables into their desired location. Within the DESeq2 analysis script, we've tracked all deliverables and created a file - deliverables_list.txt
- that can be used directly with rsync
to transfer results to a desired location.
rsync --files-from deliverables_list.txt . analysis_20190821d/deliverables
Files can be delivered anywhere - just replace final argument with your destination (no trailing slash!)
# rsync --files-from deliverables_list.txt . <dest>
Note: The
deliverables_list.txt
file has/./
that are inserted into the filepaths by the analysis script when it is written. This is a syntax recognized and used by rsync, where directories are re-created after the/./
but not before, allowing the desired output structure to be placed anywhere.
- Alignment & QC Report generation
- Troubleshooting
- Example - Multiple sequencing runs
- Example - Alternative references
- Example - Generating an annotation TSV file
- Example - Advanced options for differential expression analysis
- Notes - Generating or updating the references
- Notes - Git hooks for automated tests
- Snakemake documentation