Snakemake Methylation Pipeline

A standardized, reproducible pipeline to process WGBS bisulfite & EM-seq data. This goes from .fastq to methylation calls (via biscuit) and includes extensive QC and plotting, using a Snakemake pipeline.

At a high level, this pipeline reproducibly:

Builds a reference genome
Trims & (minimally) filters reads
Aligns & calls methylation using biscuit
Flags non-converted reads
Generates standardized outputs & QC including
- FastQC
- fastp
- Biscuit QC
- samtools stats
- MethylDackel mbias plots
- Goleft covplots
- epibed/epiread files
Runs multiqc across entire projects

Getting Started

Installation

Clone this repo: git clone https://github.com/semenko/serpent-methylation-pipeline.git
Install mamba: conda install -c conda-forge mamba or mambaforge
Install snakemake: mamba install -c bioconda -c conda-forge snakemake
(Optionally) Test install the dependencies: mamba env create -n test_env -f serpent-methylation-pipeline/workflow/envs/env.yaml

Test Run

$ nice snakemake --cores 50 --use-conda --printshellcmds --rerun-incomplete --rerun-triggers mtime --keep-going --dry-run

After removing the --dry-run flag, this will download reference genomes and build indices.

Data Definition

Expected Output

Raw data files from data are processed and analyzed by this snakemake workflow. Within each project directory, the output is (roughly) structured as:

SAMPLE_01/                  # e.g. melanoma / crc / healthy, etc.
│   SAMPLE.bam              # The final alignment file 
|   SAMPLE.bam.bai          #   (and its index)
|── biscuit_qc/             # biscuit QC.sh text files
|── epibeds/                # epibed files (bgzip-compressed & tabix-indexed)
├── fastp/                  # fastp statistics & logs
├── fastqc/                 # fastqc graphs 
├── goleft/                 # goleft coverage plots
├── logs/                   # runlogs from each pipeline component
├── methyldackel/           # mbias plots
├── raw/
│   ├── ...fastq.gz         # Raw reads
|   ├── ...md5.txt          # Checksums and validation
├── samtools/               # samtools statistics
SAMPLE_02/
...
...
multiqc/                    # A project-level multiqc stats across all data

Note each project also has a _subsampled directory with identical structure, which is the result of the pipeline run on only 10M reads/sample.

Production Runs

Pipeline Details

This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.

Reference Genome

I chose GRCh38, with these specifics:

No patches
Includes the hs38d1 decoy
Includes Alt chromosomes
Applies the U2AF1 masking file
Applies the Encode DAC exclusion

You can see a good explanation of the rationale for some of these components at this NCBI explainer.

Requirements

All software requirements are specified in env.yaml.

Most are relatively common, but a few are semi-unique:

biscuit (for alignment)
NEB's mark-nonconverted-reads (to mark partially converted reads)

biscuit was chosen after a comparison with bwa-meth and bismark — its latest version was the most flexible with extremely well annotated .bams (some critical tags are missing from bwa-meth for identifying read level methylation, and would require patching MethylDackel to extract data).

I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.

Trimming Approach

I chose a relatively conservative approach to trimming -- which is needed due to end-repair bias, adaptase bias, and more.

For EMseq, I trim 10 bp everywhere, after personal QC and offline discussions with NEB. See my notes here.

For BSseq, I trim 15 bp 5' R2, and 10 bp everywhere else due to adaptase bias.

For all reads, I set --trim_poly_g (due to two color bias) and set a --length_required (minimum read length) of 10 bp.

No Quality Filtering

Notably I do NOT do quality filtering here (I set --disable_quality_filtering), and save this for downstream analyses as desired.

I experimented with more stringent quality filtering early on, and found it had little yield / performance benefit.

Background & Inspiration

I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:

TrimGalore's RRBS guide
The Babraham WGBS/RRBS tutorials

For similar pipelines and inspiration, see:

NEB's EM-seq pipeline
Felix Krueger's Nextflow WGBS Pipeline
The Snakepipes WGBS pipeline

Pipeline Graph

Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png)

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.github		.github
config		config
data		data
workflow		workflow
.gitignore		.gitignore
.snakemake-workflow-catalog.yml		.snakemake-workflow-catalog.yml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github

.github

config

config

data

data

workflow

workflow

.gitignore

.gitignore

.snakemake-workflow-catalog.yml

.snakemake-workflow-catalog.yml

LICENSE

LICENSE

README.md

README.md

Repository files navigation

Snakemake Methylation Pipeline

Getting Started

Installation

Test Run

Data Definition

Expected Output

Production Runs

Pipeline Details

Reference Genome

Requirements

Trimming Approach

No Quality Filtering

Background & Inspiration

Pipeline Graph

About

Languages

License

semenko/serpent-methylation-pipeline

Folders and files

Latest commit

History

Repository files navigation

Snakemake Methylation Pipeline

Getting Started

Installation

Test Run

Data Definition

Expected Output

Production Runs

Pipeline Details

Reference Genome

Requirements

Trimming Approach

No Quality Filtering

Background & Inspiration

Pipeline Graph

About

Topics

Resources

License

Stars

Watchers

Forks

Languages