This is an example Snakemake workflow that I put together for the Noble lab in late 2021. It analyses proteomics data from four different beers, using data from "The 2020 ABRF Beer Study: beer proteomics at the global scale" (MassIVE accession MSV000088080).
It specifically performs the following steps:
- Downloads the four mass spectrometry data files from MassIVE using ppx.
- Downloads an appropriate beer FASTA file consisting of the yeast, barley, wheat, and hops verified UniProt proteomes.
- Converts the raw mass spectrometry data files to an open format (mzML) using ThermoRawFileParser.
- Searches each of the data files against the beer FASTA file using Comet.
- Refines the search results with mokapot using a joint model.
- Creates a plot showing the number of PSMs, peptides, and proteins detected in each beer.
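The final step boils down to counting confident detections in each mokapot result table. Here is a minimal sketch of that counting logic; the 1% q-value cutoff and the `mokapot q-value` column name follow mokapot's defaults, and the table below is a synthetic stand-in, not real search results:

```python
import pandas as pd

def count_detections(results: pd.DataFrame, threshold: float = 0.01) -> int:
    """Count rows accepted at the given mokapot q-value threshold."""
    return int((results["mokapot q-value"] <= threshold).sum())

# Synthetic stand-in for one beer's mokapot PSM table:
psms = pd.DataFrame({"mokapot q-value": [0.001, 0.009, 0.02, 0.5]})
print(count_detections(psms))  # 2 PSMs pass at 1% FDR
```

The same function applied to the peptide and protein tables yields the other two bar heights in the figure.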
This repository includes a conda environment that is compatible with macOS and Linux systems. First, you'll need a working conda installation; if you need to install one, I recommend Miniconda. You'll also need git to clone this repository, which can be installed using conda:
conda install git
With conda installed, you should first clone this repository:
git clone https://github.com/wfondrie/snakemake-beer-proteomics.git
Then enter it:
cd snakemake-beer-proteomics
Create the conda environment:
conda env create --prefix ./envs -f environment.yaml
Activate the conda environment:
conda activate ./envs
To run this workflow on your local machine using all available cores:
snakemake --cores all
When you run this workflow for the first time, Snakemake organizes the jobs into the directed acyclic graph (DAG) below. During execution, independent jobs run in parallel, while dependent jobs wait for their inputs to become available.
To run this workflow on the Noble lab SGE cluster:
snakemake --cores all --profile sge --use-conda
Note that you should ideally encapsulate this command in its own job, rather than running it on the head node.
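The repository's `job.sh` provides one way to do this. A submission script along these lines would work; note that the job name, wall-clock limit, and log paths below are illustrative placeholders, not the actual contents of `job.sh`:

```shell
#!/bin/bash
#$ -N beer-proteomics     # job name (placeholder)
#$ -cwd                   # run from the current directory
#$ -l h_rt=24:00:00       # wall-clock limit (placeholder)
#$ -o logs/job.out
#$ -e logs/job.err

# Activate the project environment, then launch Snakemake, which
# submits the individual workflow jobs to the cluster itself:
conda activate ./envs
snakemake --cores all --profile sge --use-conda
```

Submitting this script (e.g. with `qsub job.sh`) keeps the long-running Snakemake coordinator process off the head node.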
Once the workflow has completed, you should find that it created results/figures/detections.png. The figure should look like this:
This is an overview of how this repository is organized after the workflow has been executed.
snakemake-beer-proteomics
|- Snakefile # The instructions for Snakemake
|
|- data # The downloaded data.
| |- raw # The Thermo raw files.
| |- mzML # The mzML files
| `- fasta # The FASTA files.
|
|- results # Results from Comet, mokapot, and the final figure.
| |- comet # The comet results.
| |- mokapot # The mokapot results.
| `- figures # The final figure.
|
|- scripts # The scripts used during the analysis.
| `- make_figure.py # The script to create the final figure.
|
|- profiles # Profiles for cluster jobs.
| `- sge # A basic SGE profile, tailored for UWGS.
| `- config.yaml # The configuration file that tells snakemake how to
| # submit jobs to the cluster and what resources we
| # can specify.
|
|- params # Parameter files.
| `- comet.params # The Comet search parameters.
|
|- static # Static assets for things like the README
| `- dag.png # The DAG for this workflow.
|
|- logs               # Log files from the various steps of the pipeline.
|- envs               # The installed conda environment.
|- job.sh # An example SGE job script to run the workflow.
|- README.md # This file.
`- LICENSE # MIT.