# <b>Module 4. nf-meripseq - an integrated Nextflow pipeline</b>
--------------------------------------------

## Overview
In the first three modules, you learned how to perform MeRIP-seq data analysis step-by-step, running each part of the workflow manually. In this module, we transition to an automated, scalable, and reproducible pipeline — <b>nf-meripseq</b>. Built using <a href="https://www.nextflow.io/docs/latest/index.html">Nextflow</a> and inspired by <a href ="https://nf-co.re/">nf-core</a> best practices, this pipeline integrates the full workflow:
<b>Quality Control → Alignment → Peak Calling → Differential Analysis → Reporting </b>— all managed via a single command.

Advantages of using a Nextflow pipeline (nf-meripseq):
- **Automation**: Run all steps in one workflow, reducing manual error.
- **Reproducibility**: Ensure consistent results across computing environments.
- **Scalability**: Handle large datasets on local machines, HPC clusters, or cloud platforms.
- **Community best practices**: Designed around nf-core best practices, which set strict standards for quality and testing.

## Learning Objectives
By the end of this module, you will:
+ Understand the benefits of using a Nextflow pipeline for MeRIP-seq analysis.
+ Learn to set up the environment using Nextflow and Singularity (or an alternative such as Docker or Conda).
+ Run the nf-meripseq pipeline on a test dataset.
+ Explore the output structure, results, and QC reports.

## Prerequisites
This module assumes that you have completed Modules 1–3 of this tutorial (covering MeRIP-seq theory, alignment, and peak calling manually) and have a basic understanding of command-line operations.

## Get Started
In this section, you will set up the necessary environment and successfully run the nf-meripseq pipeline on a test dataset.

### 1. 🛠️ Install necessary packages
- <b>Nextflow</b>: A workflow manager designed for scalable and reproducible scientific workflows.
- <b>Singularity</b>: A container engine that allows us to run pre-built software environments without installing packages manually.

We’ll use <code>conda</code> to install them:

In [None]:
! conda install bioconda::nextflow conda-forge::singularity conda-forge::tree -y --quiet
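
Before moving on, it can help to confirm that the installed tools are visible on your `PATH`. The snippet below is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of nf-meripseq.

```python
import shutil


def missing_tools(tools=("nextflow", "singularity", "tree")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Not found on PATH: {', '.join(missing)}")
else:
    print("All tools found")
```

If a tool is reported as missing, re-run the install cell or check that the correct conda environment is active.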

### 2. Prepare the Dataset
We use the same dataset from previous modules. This ensures consistency when comparing manual vs. pipeline-based results.

In [None]:
# copy the data from s3 bucket to example_dataset directory
! aws s3 cp s3://ovarian-cancer-example-fastqs/ example_dataset --recursive
# decompress the sequence reads files
! tar -zxvf example_dataset/fastqs.tar.gz -C example_dataset

Make sure that after extraction:
- FASTQ files are located in <code>example_dataset/fastqs</code>
- Reference genome <code>.fasta</code> and <code>.gtf</code> files are located in <code>example_dataset</code>
- A <code>samplesheet.csv</code> file is present with metadata describing each sample (see below).
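
The checklist above can also be verified programmatically. This is a minimal sketch, assuming the layout described in the list (a `fastqs` subdirectory, and the reference files and `samplesheet.csv` directly inside `example_dataset`); the function name `check_dataset` is hypothetical.

```python
from pathlib import Path


def check_dataset(root):
    """Return a list of problems found under `root` (empty if all checks pass)."""
    root = Path(root)
    problems = []

    # FASTQ files are expected in <root>/fastqs with a gzipped extension.
    fastq_dir = root / "fastqs"
    fastqs = []
    if fastq_dir.is_dir():
        fastqs = [p for p in fastq_dir.iterdir()
                  if p.name.endswith((".fastq.gz", ".fq.gz"))]
    if not fastqs:
        problems.append(f"no .fastq.gz/.fq.gz files in {fastq_dir}")

    # Reference files are expected directly under <root>.
    for pattern in ("*.fasta", "*.gtf"):
        if not any(root.glob(pattern)):
            problems.append(f"no {pattern} file in {root}")

    if not (root / "samplesheet.csv").is_file():
        problems.append(f"samplesheet.csv not found in {root}")

    return problems


for problem in check_dataset("example_dataset"):
    print("WARNING:", problem)
```

No output means every item on the checklist was found.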

#### About samplesheet ####

The <code>samplesheet.csv</code> file is passed to the pipeline with <code>--input</code> and describes the samples to be analyzed. It should be a comma-separated file with the columns below:
- <b>sample</b>: Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample.
- <b>fastq1</b>: Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”.
- <b>fastq2</b> (optional): Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension “.fastq.gz” or “.fq.gz”.
- <b>replicate</b>: Integer representing the replicate number. This will be identical for re-sequenced libraries. Values must run from 1 to \<number of replicates\>.
- <b>control</b>: The control column should be the sample identifier for the (input) controls for any given m6A-IP sample. It sets the corresponding (input) control for each of the samples in the table.
- <b>group</b>: Experimental group of the sample. Required so that downstream consensus peak merging and differential peak calling are performed separately for each group.
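
The column rules above can be sketched as a small validator. This is an illustration only, not the pipeline's own validation logic: the sample names and paths in the example are hypothetical, and it assumes that input (control) rows leave the `control` column empty.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "fastq1", "replicate", "control", "group")  # fastq2 is optional
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def validate_samplesheet(handle):
    """Return a list of error messages for a samplesheet read from `handle`."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    rows = list(reader)
    samples = {(row["sample"] or "").strip() for row in rows}
    errors = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col in ("fastq1", "fastq2"):
            path = (row.get(col) or "").strip()
            if not path:
                if col == "fastq1":
                    errors.append(f"line {line_no}: fastq1 is empty")
            elif not path.endswith(FASTQ_SUFFIXES):
                errors.append(f"line {line_no}: {col} must end in .fastq.gz or .fq.gz")

        replicate = (row["replicate"] or "").strip()
        if not replicate.isdecimal() or int(replicate) < 1:
            errors.append(f"line {line_no}: replicate must be an integer >= 1")

        # Assumption: an empty control marks the row as an input sample.
        control = (row["control"] or "").strip()
        if control and control not in samples:
            errors.append(f"line {line_no}: control '{control}' is not a listed sample")
    return errors


# Hypothetical two-row samplesheet: one m6A-IP sample and its input control.
example = io.StringIO(
    "sample,fastq1,fastq2,replicate,control,group\n"
    "tumor_ip,/data/tumor_ip_R1.fastq.gz,,1,tumor_input,tumor\n"
    "tumor_input,/data/tumor_input_R1.fastq.gz,,1,,tumor\n"
)
print(validate_samplesheet(example) or "samplesheet looks consistent")
```

To check a real file, pass an open file handle, for example `validate_samplesheet(open("example_dataset/samplesheet.csv", newline=""))`.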

### 3. Run the <code>nf-meripseq</code> pipeline
Run the integrated pipeline using the following command:
```
nextflow run nf-meripseq -profile singularity \
    --input example_dataset/samplesheet.csv \
    --gtf example_dataset/gencode.v46.pri.chr11.1.5M.gtf \
    --fasta example_dataset/chr11_1.5M.fasta \
    --genome hg38 \
    --read_length 37 \
    --outdir="Tutorial_4" \
    --contrast "omental_tumor_vs_normal_Fallopian_tube" \
    -c add.config \
    -resume 
``` 
#### What does this command do? 
The command <code>nextflow run nf-meripseq</code> starts the MeRIP-seq analysis pipeline using Nextflow. It reads the workflow defined in the **nf-meripseq** directory (or repo), loads configuration files, and executes the steps for processing MeRIP-seq data. You can customize the run with the following options:
<table>
  <tr>
    <th style="width: 150px">Parameter</th>
    <th>Description</th>
  </tr>
  <tr>
      <td><code>-profile</code></td>
      <td>Specifies the compute profile to use. In this case, it enables the <b>Singularity</b> container engine. Profiles are helpful for applying predefined configuration settings tailored for local machines, HPC clusters, or cloud platforms. </td>
  </tr>
  <tr>
    <td><code>--input</code></td>
    <td>Path to the <code>samplesheet.csv</code>. This file contains sample metadata, including conditions, replicates, and control mappings.</td>
  </tr>
  <tr>
    <td><code>--gtf/--fasta</code></td>
    <td>Paths to the reference annotation (<code>.gtf</code>) and reference genome sequence (<code>.fasta</code>). These are <b>recommended</b> for accurate alignment and feature annotation.</td>
  </tr>
    <tr>
      <td><code>--genome</code></td>
      <td>If <code>--gtf/--fasta</code> are not provided, <code>--genome</code> can be used to pull reference files from <i>iGenomes</i>. Additionally, <code>--genome</code> defines the genome-specific R annotation package used by exomePeak2 for GC bias correction and peak calling. If it is not provided, the GC bias correction step is skipped.</td>
  </tr>
    <tr>
      <td><code>--read_length</code></td>
      <td>Specifies the read length of the sequencing data. This is important for estimating effective genome size in tools like MACS3 (used for peak calling). Default: 150 (bp)</td>
  </tr>
    <tr>
      <td><code>--outdir</code></td>
      <td>Path to the output directory where all results and reports will be stored. Default: <code>results/</code></td>
  </tr>
    <tr>
      <td><code>--contrast</code> </td>
      <td>Defines the <b>condition contrast</b> for differential peak calling. Format: <code>groupA_vs_groupB</code>, where both groups must be present in the group column of the samplesheet. Alternatively, the path to a file containing multiple contrast strings can be provided. </td>
  </tr>
    <tr>
      <td><code>-c</code> </td>
      <td>Loads an additional custom configuration file (e.g., <code>add.config</code>) to override or extend default settings. Useful for specifying custom paths, containers, or resource allocations. (optional)</td>
  </tr>
    <tr>
      <td><code>-resume</code> </td>
      <td>Tells Nextflow to resume a previous run from where it left off. This is highly recommended to avoid reprocessing completed steps when rerunning the pipeline. (optional)</td>
  </tr>
</table>
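
The <code>groupA_vs_groupB</code> format described for <code>--contrast</code> can be checked before launching a run. The sketch below is an assumption about how such a string could be split (on the first <code>_vs_</code>) and compared with the samplesheet groups; it is not taken from the pipeline source, and `parse_contrast` is a hypothetical helper.

```python
def parse_contrast(contrast, groups):
    """Split 'groupA_vs_groupB' and check both names against `groups`.

    Returns (group_a, group_b); raises ValueError if the string is malformed
    or a group is unknown.
    """
    group_a, separator, group_b = contrast.partition("_vs_")
    if not separator or not group_a or not group_b:
        raise ValueError(f"contrast '{contrast}' is not of the form groupA_vs_groupB")
    unknown = [g for g in (group_a, group_b) if g not in groups]
    if unknown:
        raise ValueError(f"group(s) not in samplesheet: {', '.join(unknown)}")
    return group_a, group_b


# Group names taken from the contrast used in this tutorial.
groups = {"omental_tumor", "normal_Fallopian_tube"}
print(parse_contrast("omental_tumor_vs_normal_Fallopian_tube", groups))
```

Because the split happens on the first `_vs_`, a group name that itself contains `_vs_` would be ambiguous under this sketch.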



In [36]:
! nextflow run nf-meripseq -profile singularity \
    --input example_dataset/samplesheet.csv \
    --gtf example_dataset/gencode.v46.pri.chr11.1.5M.gtf \
    --fasta example_dataset/chr11_1.5M.fasta \
    --genome hg38 \
    --read_length 37 \
    --outdir="Tutorial_4" \
    --contrast "omental_tumor_vs_normal_Fallopian_tube" \
    -c add.config \
    -resume 


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 24.10.6[m
[K
Launching[35m `nf-meripseq/main.nf` [0;2m[[0;1;36mfervent_kirch[0;2m] DSL2 - [36mrevision: [0;36m1063b9507c[m
[K
[1mInput/output options[0m
  [0;34minput                : [0;32mexample_dataset/samplesheet.csv[0m
  [0;34mcontrast             : [0;32momental_tumor_vs_normal_Fallopian_tube[0m
  [0;34mread_length          : [0;32m37[0m
  [0;34moutdir               : [0;32mTutorial_4[0m
  [0;34msave_reference       : [0;32mtrue[0m

[1mReference genome options[0m
  [0;34mgenome               : [0;32mhg38[0m
  [0;34mfasta                : [0;32mexample_dataset/chr11_1.5M.fasta[0m
  [0;34mgtf                  : [0;32mexample_dataset/gencode.v46.pri.chr11.1.5M.gtf[0m

[1mQuality Control[0m
  [0;34msave_trimmed         : [0;32mtrue[0m

[1mAlignment options[0m
  [0;34mextra_star_align_args: [0;32mnull[0m

[1mPeak calling options[0m
  [0;34mnarrow_peak          : [0;32m

<div style="border: 1px solid #9ec5fe; padding: 0px; border-radius: 4px;">
  <div style="background-color: #cfe2ff; padding: 5px;">
    <i class="fas fa-file-alt" style="color: #052c65;margin-right: 5px;"></i><a style="color: #052c65"><b>Notes</b>: Resources and <code>add.config</code> file</a>
  </div>
  <p style="margin-left: 5px;">
The <code>add.config</code> file provided here is mainly used for resource control within the Nextflow pipeline. Since we are running this on SageMaker notebooks, which have limited compute capacity, it is important to explicitly restrict the resources used by each process so that they do not exceed what the system has available. For example, if our VM type is <code>m3.xlarge</code>, which offers <b>4 vCPUs</b> and <b>15 GB</b> of RAM, we can configure the following resource limits in the <code>add.config</code> file: <pre>      process {
          resourceLimits = [
            cpus: 4,
            memory: 12.GB,
            time: 24.h
          ]
        }  </pre>
  </p>
</div>
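
If you run on a different instance type, the same block can be generated for other values. This is a minimal sketch that only produces the text of the configuration shown above; the helper `resource_limits_config` and the 3 GB memory reserve (mirroring the 15 GB to 12 GB example) are assumptions, not part of the pipeline.

```python
def resource_limits_config(total_cpus, total_memory_gb, reserve_gb=3, hours=24):
    """Return a Nextflow config block that caps per-process resources.

    `reserve_gb` is memory left free for the notebook and operating system.
    """
    memory_gb = max(1, total_memory_gb - reserve_gb)
    return (
        "process {\n"
        "    resourceLimits = [\n"
        f"        cpus: {total_cpus},\n"
        f"        memory: {memory_gb}.GB,\n"
        f"        time: {hours}.h\n"
        "    ]\n"
        "}\n"
    )


# Values for the 4 vCPU / 15 GB example described above.
print(resource_limits_config(total_cpus=4, total_memory_gb=15))
```

The printed text can be saved as `add.config` and passed to Nextflow with `-c add.config`.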

### 4. Output Overview 

After running the <code>nf-meripseq</code> pipeline, results are organized under the specified output directory (<code>Tutorial_4/</code>). Here's a breakdown of the directory structure and what each component contains:

In [None]:
! tree -d Tutorial_4

#### Summary of Key Outputs:
- <code>star_rsem_index/</code>: STAR/RSEM reference index files (if built during run)
- <code>fastqc/</code>: Raw read quality reports for initial quality assessment.
- <code>sorted_alignment/</code>: Aligned reads (IP and input samples), useful for downstream visualization or reanalysis.
- <code>bigwig/</code>: Normalized signal tracks for viewing in IGV or UCSC Genome Browser.
- <code>input_rnaseq/</code>: Expression-level data for input RNA, generated via STAR + RSEM (optional but useful for interpretation).
- <code>peak_calling/</code>: Raw peak calls from exomePeak2 and MACS3, provided per sample, along with consensus peaks per group.
- <code>differential_peaks/</code>: Results from exomePeak2 differential peak calling, organized by condition contrast.
- <code>multiqc/</code>: Consolidated HTML and plot reports summarizing QC, alignment, and quantification.
- <code>pipeline_info/</code>: Runtime metadata including logs and software versions, important for reproducibility.
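
As a complement to `tree`, the listing above can be compared against what is actually on disk. The sketch below assumes the directories named in the summary sit directly under the output directory; the real layout may nest some of them, and `star_rsem_index/` is only present if the index was built during the run.

```python
from pathlib import Path

# Directory names taken from the summary list above.
EXPECTED_DIRS = (
    "star_rsem_index", "fastqc", "sorted_alignment", "bigwig", "input_rnaseq",
    "peak_calling", "differential_peaks", "multiqc", "pipeline_info",
)


def summarize_outputs(outdir):
    """Map each expected directory name to True if it exists under `outdir`."""
    outdir = Path(outdir)
    return {name: (outdir / name).is_dir() for name in EXPECTED_DIRS}


for name, present in summarize_outputs("Tutorial_4").items():
    status = "found  " if present else "missing"
    print(f"{status}  {name}/")
```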


## Conclusion
In this module, you:
- Shifted from manual to automated MeRIP-seq analysis.
- Learned how to use Nextflow with Singularity.
- Successfully ran the nf-meripseq workflow.
- Explored how outputs are structured and what each result represents.

You're now equipped to run reproducible, large-scale MeRIP-seq analyses!

## 🧹 Clean up
Remember to shut down your VM/notebook and delete any resources you no longer need. <br><br>