# <b>Module 1 - MeRIP-seq Data Preprocessing</b>
--------------------------------------------

## Overview
This module will guide you through the acquisition and preprocessing of MeRIP-seq data, including essential steps such as quality control, adapter trimming, reference genome preparation, read alignment, peak calling, and motif discovery. The focus is on understanding how to process raw MeRIP-seq data to prepare it for downstream analysis.

## Learning Objectives
+ Explore an example MeRIP-seq dataset and its experiment design
+ Understand the core steps involved in preprocessing MeRIP-seq data
    - quality control, alignment, m6A peak identification, and annotation.
    - get familiar with the bioinformatics tools and important parameters for MeRIP-seq analysis
+ Data Visualization using IGV
    - understand the role of file formats (e.g., BAM, BED, BigWig) in genomic data visualization
    - explore and interpret alignment and m6A peak data for biological insights

## Prerequisites
- APIs that should be enabled: Amazon S3 (the example dataset is stored there), Amazon SageMaker
- Cloud platform account roles that must be assigned: SageMaker Execution Role, S3 Access Role
- Submodule 0: introduction to MeRIP-seq basics

## Outline
- **Getting started**
    1. Installing packages
    2. Setting up directory structures
    3. Downloading the Example Dataset
- **Step-by-Step Data Preprocessing**
    1. Quality Control 
    2. Read alignment
    3. Peak Calling and annotation
    4. Motif Discovery
- **Data Visualization with IGV**
    1. View Alignments (BAM, bigWig)
    2. View identified m6A Peaks (BED, bedGraph)
- **Input sample RNAseq preprocessing**
    1. Generate gene count tables using <code>featureCounts</code>
    2. Generate gene count tables using <code>RSEM</code>

---
## **1. Getting started**
### 1.1 Installing packages
[**mamba**](https://mamba.readthedocs.io/en/latest/user_guide/mamba.html) is a re-implementation of the **conda** package manager in C++. It is designed to address some of the performance limitations of conda, offering faster environment creation, package installation, and dependency resolution. Mamba is fully compatible with the conda ecosystem, meaning it can be used as a drop-in replacement for conda, providing access to the same package repositories.

In [None]:
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

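Install the tools for this tutorial using <code>mamba</code> (make sure <code>$HOME/mambaforge/bin</code> is on your <code>PATH</code> so that the <code>mamba</code> command can be found):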

In [None]:
# Install necessary packages using mamba
! mamba install -y -c conda-forge -c bioconda \
    fastqc \
    multiqc \
    trim-galore \
    star \
    bedtools \
    samtools \
    deeptools \
    ucsc-bigwigmerge \
    macs2 \
    meme 

Install **additional** packages and tools:
- a Python package <code>igv-notebook</code> for embedding IGV (Integrative Genomics Viewer) in an IPython notebook.

In [None]:
! pip install igv-notebook

### 1.2 Setting up directory structures
Establishing input and output directories

In [None]:
! mkdir -p Tutorial_1
! mkdir -p Tutorial_1/fastqc
! mkdir -p Tutorial_1/trimmed
! mkdir -p Tutorial_1/ref_genome
! mkdir -p Tutorial_1/macs2
! mkdir -p Tutorial_1/meme
! mkdir -p Tutorial_1/igv
! mkdir -p Tutorial_1/FeatureCounts
! mkdir -p Tutorial_1/RSEM

### 1.3 Downloading/Preparing the example dataset
#### About the dataset
The dataset used in this tutorial is derived from **GSE119168**, which was originally published as part of the study using the **RADAR** pipeline for MeRIP-seq. The RADAR pipeline is a computational framework designed to identify m6A-modified regions in RNA, specifically focusing on high-throughput MeRIP-seq data analysis, and we will use it for downstream analysis in the next tutorial. The dataset includes six omental tumor tissues and seven normal fallopian tube tissues. Both input and m6A immunoprecipitation libraries were sequenced on the NextSeq 500 platform in PE37 mode (paired-end, 37 bp).

For this tutorial, a subset of the original data has been selected, focusing specifically on chromosome 11 (chr11:1-1,000,000). This region was chosen to include the **HRAS** gene, which is a key gene implicated in cancer progression and is regulated by m6A methylation. The HRAS gene is highlighted in another study ([Pan, Yongbo, et al.  PNAS (2023)](https://www.pnas.org/doi/abs/10.1073/pnas.2302291120)), which demonstrates how m6A modifications on HRAS RNA influence tumor progression, making it a biologically relevant region for studying m6A methylation.

The dataset provides a small, manageable region of the genome for tutorial purposes, enabling users to quickly process the data and explore m6A peak calling and motif discovery in the context of a known cancer-associated gene. 

This example dataset is stored in an AWS S3 bucket: `s3://ovarian-cancer-example-fastqs`

In [None]:
# copy the data from s3 bucket to Tutorial_1 directory
! aws s3 cp s3://ovarian-cancer-example-fastqs/ Tutorial_1 --recursive
# decompress the sequence reads files
! tar -zxvf Tutorial_1/fastqs.tar.gz -C Tutorial_1

---
## **2. Step-by-Step Data Preprocessing**
### 2.1 Quality control

#### Step 1. FastQC 
[**FastQC**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a simple tool for running quality control checks on raw sequence data coming from high-throughput sequencing pipelines. It provides a modular set of analyses that give a quick impression of whether your data has any problems you should be aware of before doing any further analysis. You can find examples of "good" and "bad" sequencing data on the [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) website, in the section "Example Reports".

Run FastQC on the sequence files; the FastQC reports will be saved in `Tutorial_1/fastqc`. The `for` loop iterates through all the fastq.gz files in the `Tutorial_1/fastqs` directory and runs `fastqc` on each of them separately.

In [None]:
! for file in Tutorial_1/fastqs/*.gz; do fastqc -q -o Tutorial_1/fastqc "${file}"; done

#### MultiQC (optional)
__[MultiQC](https://multiqc.info/)__ can aggregate results from bioinformatics analyses across many samples into a single report. In our case, it reads the FastQC reports and generates a compiled report for all the analyzed FASTQ files; the report will be saved in `Tutorial_1/multiqc`. We can also view the report in .html format:

In [None]:
# Run multiqc to summarize all the fastqc reports
! multiqc -f -p  Tutorial_1/fastqc -o Tutorial_1/multiqc

# View multiqc report
from IPython.display import IFrame
IFrame(src='Tutorial_1/multiqc/multiqc_report.html', width=1200, height=400)

#### Step 2. Adapter Trimming and Quality Filtering
Adapter sequences should be removed from reads because they interfere with downstream analyses, such as alignment of reads to a reference. In the FastQC report, the adapter content plot shows the percentage of reads (y-axis) that have an adapter starting at a particular position along the read (x-axis). If the library fragments are shorter than the read length, a high proportion of reads containing adapters will be observed (right).

<img src="images/1-adapter_content.png" width="400" />

In this tutorial, we use __[Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)__ for adapter trimming and quality filtering. Use the `--fastqc` flag to run FastQC again to check the trimming results, and the `-j` flag to set the number of cores used for trimming. The command below trims adapters from paired-end reads, saves the results in the `Tutorial_1/trimmed` directory, and re-runs FastQC on the trimmed data. **Note**: Don't forget to check out the FastQC reports after adapter trimming.

In [None]:
! trim_galore -j 4 --paired --illumina -o Tutorial_1/trimmed --fastqc Tutorial_1/fastqs/*.gz
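With `--paired`, Trim Galore expects the files on the command line in R1/R2 order for each sample. The sketch below shows one way to make that pairing explicit in Python before building the command. It assumes the file names contain `_R1` and `_R2` (as suggested by the `_R1_val_1.fq` outputs used later); the example file names are hypothetical, and the command is only printed, not executed.

```python
from pathlib import Path


def pair_fastqs(paths):
    """Group FASTQ paths into (R1, R2) tuples by their shared sample prefix."""
    r1_by_sample, r2_by_sample = {}, {}
    for p in map(Path, paths):
        name = p.name
        if "_R1" in name:
            r1_by_sample[name.split("_R1")[0]] = p
        elif "_R2" in name:
            r2_by_sample[name.split("_R2")[0]] = p
    # Every sample must have exactly one R1 and one R2 file
    unpaired = set(r1_by_sample) ^ set(r2_by_sample)
    if unpaired:
        raise ValueError(f"Unpaired samples: {sorted(unpaired)}")
    return [(r1_by_sample[s], r2_by_sample[s]) for s in sorted(r1_by_sample)]


# Hypothetical file names, for illustration only
example_files = [
    "Tutorial_1/fastqs/sampleB_R2.fastq.gz",
    "Tutorial_1/fastqs/sampleA_R1.fastq.gz",
    "Tutorial_1/fastqs/sampleB_R1.fastq.gz",
    "Tutorial_1/fastqs/sampleA_R2.fastq.gz",
]

for r1, r2 in pair_fastqs(example_files):
    # Same flags as the trim_galore call above, one sample pair per command
    cmd = ["trim_galore", "-j", "4", "--paired", "--illumina",
           "-o", "Tutorial_1/trimmed", "--fastqc", str(r1), str(r2)]
    print(" ".join(cmd))
```

Running one command per pair makes a missing mate fail loudly instead of silently shifting the pairing of every following file.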

### 2.2 Read alignment
To determine where on the human genome our reads originated from, we will align our reads to the reference genome using **STAR** (Spliced Transcripts Alignment to a Reference). [STAR](https://github.com/alexdobin/STAR?tab=readme-ov-file) is an aligner designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments. More details about how to use STAR can be found [here](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html).
#### Step 1. Reference Genome Preparation
For this tutorial, the reads originate from a small subsection of chromosome 11, so we use only a small region of human chr11 (**chr11:1-1.5M**) as the reference genome. This subset of the reference genome is included with the example dataset downloaded in the earlier steps. For a full-scale alignment of a complete dataset, however, the entire human genome should be downloaded and indexed. The latest FASTA and GTF annotation files for the human genome can be obtained from [Gencode](https://www.gencodegenes.org/), with the FASTA file size being approximately 800 MB. The genome must be indexed before running the alignment.

In [None]:
# Generate Reference Genome Before using STAR
! STAR --runThreadN 4 --runMode genomeGenerate \
    --genomeDir Tutorial_1/ref_genome \
    --genomeFastaFiles Tutorial_1/chr11_1.5M.fasta \
    --sjdbGTFfile Tutorial_1/gencode.v46.pri.chr11.1.5M.gtf \
    --genomeSAindexNbases 9
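The value `--genomeSAindexNbases 9` follows the rule of thumb in the STAR manual for small genomes: `min(14, log2(GenomeLength)/2 - 1)`. A minimal sketch of that calculation is shown below; the function name is ours and the full-genome length is an approximate figure used only for illustration.

```python
import math


def suggested_sa_index_nbases(genome_length: int) -> int:
    """Rule of thumb from the STAR manual: min(14, log2(GenomeLength)/2 - 1), rounded down."""
    return min(14, int(math.log2(genome_length) / 2 - 1))


# The 1.5 Mb chr11 subset used in this tutorial
print(suggested_sa_index_nbases(1_500_000))

# Roughly the size of a full human genome (approximate, for illustration)
print(suggested_sa_index_nbases(3_100_000_000))
```

For the 1.5 Mb reference this gives 9, which matches the command above; for a full human genome the default of 14 applies and the flag can be omitted.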


<div style="border: 1px solid #ffe69c; padding: 0px; border-radius: 4px;">
  <div style="background-color: #fff3cd; padding: 5px; font-weight: bold;">
    <i class="fas fa-exclamation-triangle" style="color: #664d03;margin-right: 5px;"></i><a style="color: #664d03">Note</a>
  </div>
  <p style="margin-left: 5px;">
If we were building an index for the full-scale human genome (THIS WILL TAKE HOURS), we could use a command similar to the one below:
  </p>
  <p style="background:#EEEEEE;color:black"><code>! STAR --runThreadN 4 --runMode genomeGenerate --genomeDir ref/genome --genomeFastaFiles ref/GRCh38.primary_assembly.genome.fa --sjdbGTFfile ref/gencode.v46.primary_assembly.annotation.gtf </code></p>
</div>

Once this job has finished successfully, the genome directory (`Tutorial_1/ref_genome`) should contain the following files:
- chrLength.txt    
- chrNameLength.txt    
- chrName.txt    
- chrStart.txt
- exonGeTrInfo.tab
- exonInfo.tab
- geneInfo.tab
- Genome
- genomeParameters.txt
- SA
- SAindex
- sjdbInfo.txt
- sjdbList.fromGTF.out.tab
- sjdbList.out.tab
- transcriptInfo.tab

#### Step 2. Aligning reads
After the genome indices are generated, we can perform the read alignment using STAR. The cell below decompresses the trimmed fastq.gz files before alignment (alternatively, STAR can read compressed files directly with `--readFilesCommand zcat`). The output files will be written to the `Tutorial_1/STAR` directory.
<div style="border: 1px solid #659078; padding: 0px; border-radius: 4px;">
  <div style="background-color: #d4edda; padding: 5px; font-weight: bold;">
    <i class="fas fa-lightbulb" style="color: #0e4628;margin-right: 5px;"></i>
      <span style="color: #0e4628">Tips - <code>%%bash</code> vs. <code>!</code> </span>
  </div>
  <p style="margin-left: 5px;margin-right: 5px;">
     In Jupyter notebooks, <code>%%bash</code> is a <b>cell</b> magic that lets you run multiple Bash commands in a single cell, maintaining a consistent environment where variables persist across commands. In contrast, <code>!</code> is a <b>line</b> magic for executing a single shell command, but each command runs in a separate session, so variables or settings do not persist. Use <code>%%bash</code> for multi-step Bash processes and <code>!</code> for quick, isolated commands.
  </p>
</div>



In [None]:
%%bash

# Decompress the trimmed fastq.gz files so STAR can read them
for i in Tutorial_1/trimmed/*.gz; do
    gzip -d "$i"
done

# Make sure the output directory exists before STAR writes to it
mkdir -p Tutorial_1/STAR

for i in Tutorial_1/trimmed/*_R1_val_1.fq; do
    base_name=$(basename "$i" _R1_val_1.fq)

    # Align each read pair; the R2 file name is derived from the R1 file name
    STAR --runThreadN 4 \
         --genomeDir Tutorial_1/ref_genome \
         --readFilesIn "$i" "${i/_R1_val_1.fq/_R2_val_2.fq}" \
         --outFileNamePrefix Tutorial_1/STAR/"$base_name" \
         --outSAMtype BAM SortedByCoordinate \
         --quantMode TranscriptomeSAM GeneCounts
done

#### MultiQC (optional)
Again, we can use MultiQC to generate a report summarizing the alignments.

In [None]:
# Run multiqc to summarize all the alignment reports
! multiqc -f -p  Tutorial_1/STAR -o Tutorial_1/multiqc

# View multiqc report
from IPython.display import IFrame
IFrame(src='Tutorial_1/multiqc/multiqc_report.html', width=1200, height=400)

### 2.3 Peak Calling and annotation using MACS2
[**MACS2**](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html) (Model-based Analysis of ChIP-Seq) is a widely used tool for identifying peaks in alignment data such as **ChIP-Seq**, where it locates regions of DNA bound by proteins like transcription factors or histones. It compares treatment (e.g., IP) and control (e.g., input) samples and models read counts with a Poisson distribution whose background rate is estimated locally, which lets it identify enriched regions while adjusting for local biases and differences in sequencing depth. MACS2 is designed specifically for DNA-based sequencing data, making it suitable for ChIP-Seq, ATAC-Seq, and similar assays.

While MACS2 can technically be used on **MeRIP-Seq** or other RNA-Seq data, it is not optimized for the unique characteristics of RNA data, which often include transcriptome complexity, splice variants, and the influence of gene expression levels. Therefore, in the next tutorial, we will introduce tools specifically designed for RNA-based enrichment sequencing, such as exomePeak2 and MeTPeak, which handle the complexities of RNA-Seq data more effectively in methylation or modification studies.

#### Step 1. Merge alignment files (BAM) into the MACS2 directory
To call peaks for each group, we first pool the alignments of all samples that belong to the same group. We use <code>samtools merge</code> to combine the read alignments from the same group (tumor vs. normal, input vs. m6A-IP) and write the merged BAM files to `Tutorial_1/macs2`.

In [None]:
# Combine BAM files for each group
! samtools merge -f Tutorial_1/macs2/tumor_input.bam $(ls Tutorial_1/STAR/*ByCoord.out.bam | grep -E '3558|3559|3560|3561|3562|3563')
! samtools merge -f Tutorial_1/macs2/normal_input.bam $(ls Tutorial_1/STAR/*ByCoord.out.bam | grep -E '3564|3565|3566|3567|3568|3569|3570')
! samtools merge -f Tutorial_1/macs2/tumor_m6AIP.bam $(ls Tutorial_1/STAR/*ByCoord.out.bam | grep -E '3571|3572|3573|3574|3575|3576')
! samtools merge -f Tutorial_1/macs2/normal_m6AIP.bam $(ls Tutorial_1/STAR/*ByCoord.out.bam | grep -E '3577|3578|3579|3580|3581|3582|3583')

#### Step 2. Peak Calling
Finding peaks is one of the central goals of a MeRIP-Seq experiment, and the same basic principles apply to other types of sequencing such as ChIP-Seq and DNase-Seq. The basic idea is to identify regions in the genome where we find more sequencing reads than we would expect to see by chance.
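To make "more reads than expected by chance" concrete, the sketch below scores a single window with a Poisson upper-tail probability, using the depth-scaled input count as the expected background. This is a simplified illustration of the idea, not MACS2's actual implementation (which estimates a dynamic local background, among other refinements). All counts and library sizes are made-up numbers.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """Upper-tail probability P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):   # accumulate P(X <= k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_pvalue(ip_count, input_count, ip_total, input_total):
    """P-value for seeing >= ip_count IP reads given the depth-scaled input count."""
    expected = input_count * (ip_total / input_total)
    return poisson_sf(ip_count, expected)


# Hypothetical window: 30 IP reads vs 10 input reads, equal library sizes
print(window_pvalue(ip_count=30, input_count=10,
                    ip_total=1_000_000, input_total=1_000_000))
```

A small p-value means the IP signal in that window is unlikely under the background rate implied by the input library.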

In [None]:
# Call peaks using MACS2
! macs2 callpeak -t Tutorial_1/macs2/tumor_m6AIP.bam -c Tutorial_1/macs2/tumor_input.bam -f BAM -g hs -n tumor_peaks --outdir Tutorial_1/macs2 --nomodel
! macs2 callpeak -t Tutorial_1/macs2/normal_m6AIP.bam -c Tutorial_1/macs2/normal_input.bam -f BAM -g hs -n normal_peaks --outdir Tutorial_1/macs2 --nomodel

In [None]:
# Merge peaks across samples
! cat Tutorial_1/macs2/tumor_peaks_peaks.narrowPeak Tutorial_1/macs2/normal_peaks_peaks.narrowPeak | sort -k1,1 -k2,2n > Tutorial_1/macs2/sorted_peaks.txt
! bedtools merge -i Tutorial_1/macs2/sorted_peaks.txt > Tutorial_1/macs2/merged_peaks.bed
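The merge step above collapses overlapping or directly adjacent peaks from the two groups into single intervals. A minimal Python sketch of the same idea, using hypothetical coordinates, might look like this:

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Intervals are sorted by chromosome and start first, mirroring the
    `sort -k1,1 -k2,2n` step that precedes `bedtools merge`.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps (or touches) the previous interval: extend it
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(iv) for iv in merged]


# Hypothetical peaks from two groups on chr11
peaks = [
    ("chr11", 150, 300),
    ("chr11", 100, 200),
    ("chr11", 300, 350),
    ("chr11", 500, 600),
]
print(merge_intervals(peaks))
```

The first three intervals collapse into one region because each one overlaps or touches the previous one, while the last stays separate.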

In [None]:
# Annotate peaks using bedtools: report each peak (-wa) together with the overlapping GTF feature (-wb)
! bedtools intersect -wa -wb -a Tutorial_1/macs2/merged_peaks.bed -b Tutorial_1/gencode.v46.pri.chr11.1.5M.gtf > Tutorial_1/macs2/annotated_merged_peaks.bed

### 2.4 Motif Discovery
Motifs are biologically significant nucleic acid sequence patterns that RNA methylation-related enzymes recognize and bind to regulate gene expression. The **RRACH** motif is a well-known consensus sequence associated with m6A (N6-methyladenosine) modifications in RNA (R = A or G, H = A, C or U), where the adenosine serves as the methylation site. This motif is commonly found in **3' UTRs** and near **stop codons**, playing a crucial role in post-transcriptional gene regulation by influencing RNA stability, splicing, and translation efficiency. Highly conserved across species, RRACH is a primary target for m6A methylation and dynamically regulated in response to cellular conditions. 

<img src="images/1-RRACH-motif.png" width="600" />

In **MeRIP-seq** studies, RRACH motifs are often enriched within m6A peaks, making them key markers for understanding the functional impact of m6A modifications. To identify whether m6A peaks contain the RRACH motif, tools like **MEME** can be used for motif discovery in peak regions, providing insights into credible methylation sites.
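As a quick complement to de novo motif discovery, the RRACH consensus can also be searched for directly. The sketch below scans a sequence with a regular expression. It accepts `T` as well as `U` because sequences extracted from a genome FASTA use the DNA alphabet. The example sequence is made up, and only the given strand is scanned.

```python
import re

# R = A or G, H = A, C or U (T is also accepted for DNA-alphabet sequences).
# The lookahead allows overlapping matches to be reported.
RRACH = re.compile(r"(?=([AG][AG]AC[ACTU]))", re.IGNORECASE)


def find_rrach(seq):
    """Return (0-based start, matched 5-mer) for every RRACH occurrence in seq."""
    return [(m.start(), m.group(1).upper()) for m in RRACH.finditer(seq)]


# Hypothetical peak sequence
print(find_rrach("TTGGACTAAACAGG"))  # [(2, 'GGACT'), (7, 'AAACA')]
```

The methylated adenosine is the third base of each match, so its position is `start + 2`.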

[**MEME** (Multiple EM for Motif Elicitation)](https://meme-suite.org/meme/doc/meme.html?man_type=web) is a widely used tool for motif discovery in DNA, RNA, and protein sequences. It discovers novel, ungapped motifs (recurring, fixed-length patterns) that may represent binding sites, functional domains, or regulatory elements by applying statistical models to find motifs enriched in the input sequences. MEME is particularly useful in genomics and bioinformatics for uncovering sequence motifs without prior knowledge of their patterns.

In [None]:
# Extract sequences around peaks
! bedtools getfasta -fi Tutorial_1/chr11_1.5M.fasta -bed Tutorial_1/macs2/merged_peaks.bed -fo Tutorial_1/macs2/peaks.fa

# Filter out sequences shorter than 6 characters
! awk 'BEGIN {RS=">"; ORS=""} length($2) >= 6 {print ">"$0}' Tutorial_1/macs2/peaks.fa > Tutorial_1/macs2/filtered_peaks.fa

# Run MEME for motif discovery on filtered sequences with RNA flag
! meme Tutorial_1/macs2/filtered_peaks.fa -oc Tutorial_1/meme -rna -mod anr -nmotifs 5 -minw 5 -maxw 7


#### Visualization of identified motifs

In [None]:
from IPython.display import IFrame
IFrame('Tutorial_1/meme/meme.html', width=800, height=400)

---
## **3. Data Visualization with IGV**
The Integrative Genomics Viewer (IGV) is an interactive tool for the visual exploration of genomic data. It supports flexible integration of all the common types of genomic data and metadata. IGV supports many different file formats, such as .bam, .bed, GFF/GTF, and .fasta. For a full list of the file formats supported by IGV, please visit https://software.broadinstitute.org/software/igv/FileFormats.

IGV can be downloaded as a desktop application, and it also has a JavaScript version (igv.js) that can be embedded in web apps. The igv-notebook package we use in this tutorial is a Python package that wraps igv.js for embedding it in an IPython notebook.

#### Basic usage
+ Select reference genome - IGV hosts dozens of genomes and you can load other genomes too
+ Load data tracks
+ Navigate
    - Zoom in/out - from whole genome view to base pair resolution
    - Scroll/pan - view neighboring regions
    - Jump to locus - enter coordinates or name
    
#### Install igv-notebook
The Python package <code>igv-notebook</code> needs to be installed with pip: <code>pip install igv-notebook</code>. It should already be installed in the installation steps in this module.

#### Initialize IGV
Create a browser "b" showing the human reference genome hg38 at a locus on chromosome 11. You can change the settings in the browser interactively. The output should look like this:

<img src="images/1-igv1.png" width="800" />

In [None]:
import igv_notebook
igv_notebook.init()
b = igv_notebook.Browser(
    {
        "genome": "hg38",
        "locus": "chr11:523,498-543,502"
    }
)

After initialization, you can start to **load data tracks** into the IGV browser. IGV displays data in horizontal rows called **tracks**. Typically, each track represents one sample or experiment. Track names are listed in the far-left panel. Legibility of the names depends on the height of the tracks, i.e., the smaller the track, the less legible the name. There are different types of tracks (different file formats) that IGV can display:  
- **Data tracks** display numeric values, such as the methylation levels in our tutorial
- **Feature tracks** identify genomic features. For an example, see the Refseq Genes track, which IGV loads when you select a genome.
- **Alignment tracks** display read alignments

### 3.1 View Alignments
#### 1. BAM files and indexing  

IGV requires that both SAM and BAM files be **sorted** by position and **indexed**, and that the index files follow a specific naming convention. Specifically, a BAM index file should be named by appending `.bai` to the BAM file name. 

To view the alignment files generated by STAR in this tutorial, we sort the alignment files (.bam) by coordinate and then generate the index (.bai) files. The .bam files are in the `Tutorial_1/STAR` directory.

In [None]:
%%bash
# Loop through all .bam files in the STAR directory
for bam_file in Tutorial_1/STAR/*Aligned.sortedByCoord.out.bam; do
    base_name=$(basename "$bam_file" Aligned.sortedByCoord.out.bam)
    #base_name=$(basename "$bam_file" Aligned.toTranscriptome.out.bam)
    # Sort the BAM file
    samtools sort "$bam_file" -o "Tutorial_1/igv/${base_name}.sorted.bam"
    
    # Index the sorted BAM file
    samtools index "Tutorial_1/igv/${base_name}.sorted.bam"
done

#### 2. Alignment coverage and bigWig files

<div style="border: 1px solid #9ec5fe; padding: 0px; border-radius: 4px;">
  <div style="background-color: #cfe2ff; padding: 5px;">
    <i class="fas fa-file-alt" style="color: #052c65;margin-right: 5px;"></i><a style="color: #052c65">Notes: <code><b>bigWig</b></code> format</a>
  </div>
  <p style="margin-left: 5px;">
The <b>bigWig</b> format is an indexed binary format useful for displaying dense, continuous data in genome browsers such as the UCSC Genome Browser and IGV. It removes the need to load the much larger BAM files for visualisation, which is slower and can cause memory issues. The coverage values stored in a bigWig file can also be normalised so that coverage can be compared across multiple samples, which is not possible with BAM files. The bigWig format is also supported by various bioinformatics software for downstream processing such as meta-profile plotting.
    </p>
</div>
<img src="images/1-bigWig.png" width="600" />

Image source: [deepTools documentation](https://deeptools.readthedocs.io/en/latest/content/tools/bamCoverage.html)

In [None]:
%%bash
# Loop through all BAM files and convert them to BigWig
for bam_file in Tutorial_1/igv/*.bam; do
    # Define the output BigWig filename
    bw_file="${bam_file%.sorted.bam}.bw"  # Replace the .sorted.bam suffix with .bw

    # Run bamCoverage with normalization by CPM and a bin size of 10 bp
    bamCoverage -b "$bam_file" -o "$bw_file" --normalizeUsing CPM --binSize 10
done
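To illustrate what CPM normalization with a fixed bin size means, the sketch below bins read start positions and scales the counts to counts per million mapped reads. It is a simplification for intuition only: `bamCoverage` works on full read alignments rather than start positions, and all numbers here are made up.

```python
def cpm_per_bin(read_starts, region_length, bin_size, total_mapped_reads):
    """Count reads per bin and scale to counts per million mapped reads (CPM)."""
    n_bins = -(-region_length // bin_size)   # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // bin_size] += 1
    scale = 1_000_000 / total_mapped_reads
    return [c * scale for c in counts]


# Hypothetical example: 4 reads in a 30 bp region, 10 bp bins, 2 million mapped reads
print(cpm_per_bin([0, 5, 12, 25], region_length=30, bin_size=10,
                  total_mapped_reads=2_000_000))  # [1.0, 0.5, 0.5]
```

Because every sample is divided by its own library size, the resulting tracks can be compared across samples with different sequencing depths.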

#### 3. Load and view the alignment (BAM) and coverage (bigWig) files in igv-notebook
Start a new browser "b2", load into the browser: 
+ reference genome
+ the alignment .bam and its index file .bai
+ bigWig coverage file 

In this example, only data from one sample (SRR7763558, an input library from a tumor sample) is used to show the visualization of the alignment and its coverage.

In [None]:
import igv_notebook
igv_notebook.init()
b2 = igv_notebook.Browser(
    {
        "genome": "hg38",
        "locus": "chr11:531,000-536,000"
    }
)
b2.load_track({
    "name": ".bam",
    "path": "Tutorial_1/igv/subset_SRR7763558.sorted.bam",
    "indexPath": "Tutorial_1/igv/subset_SRR7763558.sorted.bam.bai",
    "format": "bam",
    "type": "alignment",
    "height": 100
})
b2.load_track({
    "name": ".bigWig",
    "path": "Tutorial_1/igv/subset_SRR7763558.bw",
    "format": "bigWig"
})

### 3.2 View identified m6A peaks
#### BED format (<code>.bed</code> files)
<div style="border: 1px solid #9ec5fe; padding: 0px; border-radius: 4px;">
  <div style="background-color: #cfe2ff; padding: 5px;">
    <i class="fas fa-file-alt" style="color: #052c65;margin-right: 5px;"></i><a style="color: #052c65">Notes: <code><b>BED</b></code> format</a>
  </div>
  <p style="margin-left: 5px;">
The <b>BED</b> (Browser Extensible Data) file format is a simple, tab-delimited text format widely used to represent genomic regions. In the context of m6A peak analysis, a <code>.bed</code> file contains the coordinates of identified m6A-modified regions (peaks) in RNA. Each row in the file typically represents one peak and includes three required fields and nine additional optional fields:
<ul>
    <li><b>Chromosome</b> (Required): The chromosome where the peak is located.</li>
    <li><b>Start and End Positions</b> (Required): Coordinates defining the range of the peak.</li>
    <li><b>Name</b> (Optional): Identifier for the peak, which may include peak number or other labels.</li>
    <li><b>Score</b> (Optional): A numerical value often representing peak significance, such as enrichment level or p-value.</li>
</ul>
    </p>
</div>
The BED format is compatible with many genomic tools and browsers, making it easy to visualize m6A peaks across the genome and integrate them with other datasets for downstream analysis. Here is how to extract the relevant columns and create .bed files for the peaks identified using MACS2:

In [None]:
# Keep the first six columns of the MACS2 narrowPeak output to create BED files for IGV
! grep -v '^#'  Tutorial_1/macs2/tumor_peaks_peaks.narrowPeak | awk '!/^#/ {print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6}' > Tutorial_1/igv/tumor-peaks.bed
! grep -v '^#'  Tutorial_1/macs2/normal_peaks_peaks.narrowPeak | awk '!/^#/ {print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6}' > Tutorial_1/igv/normal-peaks.bed
! head Tutorial_1/igv/normal-peaks.bed
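The narrowPeak files written by MACS2 are BED6+4 files: the first six columns are standard BED fields, followed by signal value, p-value, q-value, and summit offset. The sketch below parses one line and keeps the BED6 part, mirroring the `awk` command above. The example line is made up for illustration.

```python
NARROWPEAK_FIELDS = [
    "chrom", "start", "end", "name", "score", "strand",
    "signalValue", "pValue", "qValue", "summit_offset",
]


def parse_narrowpeak(line):
    """Map the ten tab-separated narrowPeak columns to named fields."""
    return dict(zip(NARROWPEAK_FIELDS, line.rstrip("\n").split("\t")))


def narrowpeak_to_bed6(line):
    """Keep only the first six (standard BED) columns of a narrowPeak line."""
    return "\t".join(line.rstrip("\n").split("\t")[:6])


# Hypothetical narrowPeak line
example = "chr11\t532000\t532400\ttumor_peaks_peak_1\t153\t.\t8.2\t18.6\t15.3\t210"
print(narrowpeak_to_bed6(example))
print(parse_narrowpeak(example)["qValue"])
```

In MACS2 output the p-value and q-value columns are reported as -log10 values, so larger numbers indicate stronger peaks.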

#### .bedGraph files
<div style="border: 1px solid #9ec5fe; padding: 0px; border-radius: 4px;">
  <div style="background-color: #cfe2ff; padding: 5px;">
    <i class="fas fa-file-alt" style="color: #052c65;margin-right: 5px;"></i><a style="color: #052c65">Notes: <b><code>.bedGraph</code></b> format</a>
  </div>
  <p style="margin-left: 5px;">
    While <code>.bed</code> files are ideal for representing discrete regions like peaks, <code>.bedgraph</code> files provide more granular information about the distribution and intensity of modifications across larger genomic spans.
  </p>
  <p style="margin-left: 5px;">
The <code>bedGraph</code> format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. It follows the definition in <b>four</b> column BED format:
<table><tbody><tr><td>chromA</td><td>chromStartA</td><td>chromEndA</td><td>dataValueA</td></tr>
<tr><td>chromB</td><td>chromStartB</td><td>chromEndB</td><td>dataValueB</td></tr></tbody></table>
  </p>
</div>
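To show how continuous per-bin values map onto the four-column bedGraph layout, the sketch below converts a list of bin values into bedGraph lines, collapsing runs of equal values into a single interval. The chromosome name, bin size, and values are hypothetical.

```python
def bins_to_bedgraph(chrom, values, bin_size):
    """Convert per-bin values to bedGraph lines, merging adjacent bins with equal values."""
    lines = []
    run_start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the list or when the value changes
        if i == len(values) or values[i] != values[run_start]:
            start, end = run_start * bin_size, i * bin_size
            lines.append(f"{chrom}\t{start}\t{end}\t{values[run_start]}")
            run_start = i
    return lines


# Hypothetical coverage values for six 10 bp bins
for line in bins_to_bedgraph("chr11", [0.0, 0.0, 1.5, 1.5, 1.5, 0.5], bin_size=10):
    print(line)
```

Coordinates are zero-based and half-open, as in BED, so consecutive intervals share a boundary without overlapping.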

<p>Merge <code>.bigWig</code> files into <code>.bedGraph</code> files using <code>bigWigMerge</code>. These merged files can then be visualized in IGV as a single track, simplifying the comparison between input and IP (immunoprecipitated) samples.</p>

<div style="border: 1px solid #659078; padding: 0px; border-radius: 4px;">
  <div style="background-color: #d4edda; padding: 5px; font-weight: bold;">
    <i class="fas fa-lightbulb" style="color: #0e4628;margin-right: 5px;"></i>
      <span style="color: #0e4628">Tips - Changing Directory in Jupyter Notebooks </span>
  </div>
  <p style="margin-left: 5px;margin-right: 5px;">
     When you use the <code><b>!cd</b></code> command in a Jupyter notebook (or any notebook with a shell interface), the command only changes the directory within the subshell it runs in. This change does not persist, even for the next <code>!</code> command in the same cell. Each <code>!</code> command runs in its own separate subshell. If you'd like to change the current working directory and execute multiple commands in the same shell, you can use <code><b>%%bash</b></code> to include all the commands in the same subshell. However, this will not affect the directory in the next cell; the working directory will revert to the original one after the cell finishes running.
  </p>
</div>

In [None]:
%%bash
cd Tutorial_1/igv
bigWigMerge subset_SRR7763558.bw subset_SRR7763559.bw subset_SRR7763560.bw subset_SRR7763561.bw subset_SRR7763562.bw subset_SRR7763563.bw tumor-input.bg
bigWigMerge subset_SRR7763564.bw subset_SRR7763565.bw subset_SRR7763566.bw subset_SRR7763567.bw subset_SRR7763568.bw subset_SRR7763569.bw subset_SRR7763570.bw normal-input.bg
bigWigMerge subset_SRR7763571.bw subset_SRR7763572.bw subset_SRR7763573.bw subset_SRR7763574.bw subset_SRR7763575.bw subset_SRR7763576.bw tumor-m6A-IP.bg
bigWigMerge subset_SRR7763577.bw subset_SRR7763578.bw subset_SRR7763579.bw subset_SRR7763580.bw subset_SRR7763581.bw subset_SRR7763582.bw subset_SRR7763583.bw normal-m6A-IP.bg

#### Load .bed and .bedGraph tracks in an IGV browser
<img src="images/1-igv2.png" width="700"/>

In [None]:
import igv_notebook
igv_notebook.init()
b3 = igv_notebook.Browser(
    {
        "genome": "hg38",
        "locus": "chr11:531,000-536,000"
    }
)
b3.load_track({
    "name": "Tumor input",
    "path": "Tutorial_1/igv/tumor-input.bg",
    "format": "bedGraph",
    "color": "blue"
})
b3.load_track({
    "name": "Tumor m6A-IP",
    "path": "Tutorial_1/igv/tumor-m6A-IP.bg",
    "format": "bedGraph",
    "color": "red"
})
b3.load_track({
    "name": "Normal input",
    "path": "Tutorial_1/igv/normal-input.bg",
    "format": "bedGraph",
    "color": "blue"
})
b3.load_track({
    "name": "Normal m6A-IP",
    "path": "Tutorial_1/igv/normal-m6A-IP.bg",
    "format": "bedGraph",
    "color": "red"
})
b3.load_track({
    "name": "tumor-peaks.bed",
    "path": "Tutorial_1/igv/tumor-peaks.bed",
    "format": "bed",
    "color": "black",
    "height": 40
})
b3.load_track({
    "name": "normal-peaks.bed",
    "path": "Tutorial_1/igv/normal-peaks.bed",
    "format": "bed",
    "color": "darkgreen",
    "height": 40
})


---
## **4. RNA-seq processing using Input samples**

In [None]:
# Install featureCounts (part of the subread package) and RSEM
! mamba install -y -c bioconda subread
! mamba install -y -c bioconda rsem

### Generate gene count table
#### Method 1. featureCounts
**<code>featureCounts</code>** is a high-performance tool for counting reads from RNA-seq or other sequencing data that align to specific genomic features, such as genes or exons. It efficiently processes large BAM or SAM files by using multi-threading, making it suitable for high-throughput datasets. To run <code>featureCounts</code>, users provide a GTF or GFF annotation file that defines the features of interest and specify the BAM files containing aligned reads. For example, the command <code>featureCounts -T 4 -a annotation.gtf -o gene_counts.txt sample1.bam sample2.bam</code> will count reads from <code>sample1.bam</code> and <code>sample2.bam</code> mapped to features in <code>annotation.gtf</code>, using 4 threads. The result is an output table (gene_counts.txt) with genes (or other features) as rows and samples as columns, ready for downstream differential expression analysis. <code>featureCounts</code> supports both single-end and paired-end data, as well as strand-specific options, making it a versatile tool in RNA-seq analysis workflows.

In [None]:
# list all the input alignment result bam files
! awk '$2 ~ /input/ {print "Tutorial_1/STAR/subset_" $4 "Aligned.sortedByCoord.out.bam"}' Tutorial_1/meta.txt
# using featurecounts to get the gene count table
! awk '$2 ~ /input/ {print "Tutorial_1/STAR/subset_" $4 "Aligned.sortedByCoord.out.bam"}' Tutorial_1/meta.txt | xargs featureCounts -T 4 -a Tutorial_1/gencode.v46.pri.chr11.1.5M.gtf -o Tutorial_1/FeatureCounts/gene_counts.txt 

#### Method 2. RSEM
**<code>RSEM</code>** (RNA-Seq by Expectation-Maximization) is a popular tool for quantifying gene and transcript expression levels from RNA-seq data. RSEM uses probabilistic models to assign reads to transcripts, which helps accurately estimate transcript abundance, even when reads map to multiple transcripts (multi-mapping reads). To start, <code>RSEM</code> requires a transcriptome reference prepared from a GTF file and reference genome, after which it aligns reads (or works with existing BAM files) to calculate transcript and gene-level counts, TPMs, and FPKMs. For instance, running <code>rsem-calculate-expression --paired-end sample_R1.fastq sample_R2.fastq reference output_prefix</code> quantifies expression levels for the paired-end data in <code>sample_R1.fastq</code> and <code>sample_R2.fastq</code>, producing output files like <code>output_prefix.genes.results</code> and <code>output_prefix.isoforms.results</code>. RSEM’s accurate handling of transcript-level information makes it ideal for studies requiring insights into isoform-specific expression or alternative splicing.
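For intuition about the TPM and FPKM values reported by RSEM, the sketch below computes both from a set of counts and effective lengths using the standard definitions. It is not RSEM's code; RSEM additionally estimates the expected counts with its EM algorithm. The counts and lengths are made-up numbers.

```python
def tpm_and_fpkm(counts, effective_lengths):
    """Compute TPM and FPKM from fragment counts and effective lengths (in bases).

    Assumes every effective length is positive and at least one count is non-zero.
    """
    # Fragments per base for each gene
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    rate_sum = sum(rates)
    total_fragments = sum(counts)

    # TPM: each gene's share of the per-base rate, scaled to one million
    tpm = [r / rate_sum * 1e6 for r in rates]
    # FPKM: fragments per kilobase of transcript per million fragments
    fpkm = [c / (l / 1e3) / (total_fragments / 1e6)
            for c, l in zip(counts, effective_lengths)]
    return tpm, fpkm


# Hypothetical counts and effective lengths for three genes
tpm, fpkm = tpm_and_fpkm([100, 200, 700], [1000, 2000, 1000])
print([round(x, 1) for x in tpm])
print([round(x, 1) for x in fpkm])
```

TPM values always sum to one million within a sample, which makes them easier to compare between samples than FPKM.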

In [None]:
%%bash

# Define the RSEM reference and output directory
RSEM_REF="Tutorial_1/RSEM/rsem_reference"
OUTPUT_DIR="Tutorial_1/RSEM"

# Step 1: Prepare the RSEM reference genome
rsem-prepare-reference --gtf Tutorial_1/gencode.v46.pri.chr11.1.5M.gtf Tutorial_1/chr11_1.5M.fasta $RSEM_REF

# Step 2: Run RSEM on the transcriptome-aligned BAM file of each input sample
awk '$2 ~ /input/ {print "Tutorial_1/STAR/subset_" $4 "Aligned.toTranscriptome.out.bam"}' Tutorial_1/meta.txt |
while read -r bam_file; do
    # Extract the sample name without the suffix
    sample_name=$(basename "$bam_file" "Aligned.toTranscriptome.out.bam")
    
    # Keep only reads flagged as paired (SAM flag 0x1) to avoid issues with unpaired reads.
    # Note: -f 1 does not test the "properly paired" flag (0x2), despite the output file name.
    filtered_bam="${bam_file%.bam}_properlyPaired.bam"
    samtools view -b -f 1 "$bam_file" > "$filtered_bam"
    
    # Run RSEM using the filtered BAM file (-q suppresses progress messages)
    rsem-calculate-expression --paired-end --bam --no-bam-output -p 4 \
        "$filtered_bam" "$RSEM_REF" "$OUTPUT_DIR/$sample_name" -q
    
    # Optional Cleanup: Remove the filtered BAM file to save space
    rm "$filtered_bam"  
done
echo "Done"

The <code>Python</code> script below merges the per-sample RSEM outputs into three combined tables: expected counts, TPM, and FPKM. It loads each <code>*.genes.results</code> file, extracts the relevant column, and saves each metric as a separate genes-by-samples table.

In [None]:
import pandas as pd
import glob

# Define directory with RSEM output files
output_dir = "Tutorial_1/RSEM"

# Get all RSEM gene-level result files (sorted so the column order is reproducible)
count_files = sorted(glob.glob(output_dir + "/*.genes.results"))

# Initialize empty lists to store dataframes for each metric
expected_count_dfs = []
tpm_dfs = []
fpkm_dfs = []

# Loop through each RSEM output file
for file in count_files:
    # Read the file
    df = pd.read_csv(file, sep='\t', index_col=0)
    
    # Extract sample name from the filename
    sample_name = file.split('/')[-1].split('.genes.results')[0]
    
    # Extract and rename the columns for each metric
    expected_count_df = df[['expected_count']].copy()
    expected_count_df.columns = [sample_name]  # Rename column to sample name
    expected_count_dfs.append(expected_count_df)

    tpm_df = df[['TPM']].copy()
    tpm_df.columns = [sample_name]
    tpm_dfs.append(tpm_df)

    fpkm_df = df[['FPKM']].copy()
    fpkm_df.columns = [sample_name]
    fpkm_dfs.append(fpkm_df)

# Concatenate dataframes for each metric along gene_id (index)
combined_expected_counts = pd.concat(expected_count_dfs, axis=1)
combined_tpm = pd.concat(tpm_dfs, axis=1)
combined_fpkm = pd.concat(fpkm_dfs, axis=1)

# Save combined tables to files
combined_expected_counts.to_csv("Tutorial_1/RSEM/combined_expected_counts.txt", sep='\t')
combined_tpm.to_csv("Tutorial_1/RSEM/combined_tpm.txt", sep='\t')
combined_fpkm.to_csv("Tutorial_1/RSEM/combined_fpkm.txt", sep='\t')
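
The TPM and FPKM tables are two scalings of the same length-normalized abundance: within one sample, TPM equals FPKM divided by the sum of FPKM over all genes, times one million. A minimal sketch of that relationship follows. The helper name <code>fpkm_to_tpm</code> and the values are made up for illustration.

```python
import pandas as pd

def fpkm_to_tpm(fpkm):
    """Rescale each sample (column) of an FPKM table so that it sums to one million."""
    # A column that sums to zero would yield NaN values
    return fpkm / fpkm.sum(axis=0) * 1e6

# Made-up FPKM values for three genes in two samples (illustrative only)
fpkm = pd.DataFrame(
    {"sampleA": [10.0, 30.0, 60.0], "sampleB": [5.0, 5.0, 10.0]},
    index=["geneX", "geneY", "geneZ"],
)
tpm = fpkm_to_tpm(fpkm)
print(tpm.round(1))
```

Applied to <code>combined_fpkm</code>, the result should agree with <code>combined_tpm</code> up to the rounding of the values RSEM reports, which makes this a quick sanity check on the merged tables.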

## Conclusion
In this module, we covered the following key concepts and workflows:
+ **MeRIP-seq Data Preprocessing**: Downloading the dataset, setting up directories, quality control with FastQC, adapter trimming, and aligning reads using STAR.
+ **Peak Calling and Annotation**: Using HOMER for peak calling and motif discovery, helping us identify regions enriched in m6A modifications.
+ **Data Visualization**: Using IGV for exploring alignment files, coverage, and peaks, providing a comprehensive view of the MeRIP-seq data.
+ **BigWig and BAM Manipulations**: Converting alignment files and coverage tracks into formats suitable for visualization, merging replicates for more streamlined analysis.
+ **Processing input samples as RNA-seq**: The input samples, which serve as controls in the MeRIP-seq experiment, can also be used to measure gene expression profiles across conditions.

By following these steps, you have worked through the complete MeRIP-seq data preprocessing pipeline and can apply similar workflows to your own datasets for robust RNA methylation analysis.

## Clean up
When you’re finished with this tutorial, you can:
- **Delete the Notebook Instance**: This will prevent any unexpected charges. Note that if you delete the notebook, you’ll need to restart from the beginning if you want to rerun the tutorial.
- **Remove Output Files and Stop the Notebook**: Delete any output files if they aren’t needed, and then stop the notebook to avoid charges. You can re-run the tutorial later to regenerate these files if needed.
- **Stop the Notebook and Keep Output Files**: Stop the notebook instance but keep the output files, especially if you’re moving on to Tutorial 2, which may require these outputs.
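
Stopping the notebook can also be scripted. The following is a minimal sketch built on the SageMaker client calls <code>describe_notebook_instance</code> and <code>stop_notebook_instance</code> from <code>boto3</code>. The helper name and the instance name in the usage note are hypothetical, and the caller is assumed to have permission for both calls.

```python
def stop_notebook_if_running(sagemaker_client, notebook_name):
    """Stop a SageMaker notebook instance if it is currently in service.

    Returns True if a stop request was sent, False otherwise.
    """
    info = sagemaker_client.describe_notebook_instance(
        NotebookInstanceName=notebook_name
    )
    # Only an instance that is "InService" can be stopped
    if info["NotebookInstanceStatus"] != "InService":
        return False
    sagemaker_client.stop_notebook_instance(NotebookInstanceName=notebook_name)
    return True
```

For example, <code>stop_notebook_if_running(boto3.client("sagemaker"), "my-merip-notebook")</code> would stop an instance with that (hypothetical) name. Output files stored on the instance volume are kept when the instance is stopped.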

<div style="border: 1px solid #659078; padding: 0px; border-radius: 4px;">
  <div style="background-color: #d4edda; padding: 5px;">
    <i class="fas fa-lightbulb" style="color: #0e4628;margin-right: 5px;"></i><a style="color: #0e4628"><b>Tips</b>: </a>
  </div>
  <p style="margin-left: 5px;">
Add a lifecycle configuration to automatically shut down the notebook after a set period of inactivity to manage costs more effectively.
  </p>