# AMIA 2016 Annual Symposium Workshop (WG13)

## Automated and Scalable Cloud-based RNA-Seq Data Analysis, Part I


Riyue Bao, Ph.D. 
Center for Research Informatics,
The University of Chicago.
November 13, 2016

***

## Objective

* Learn the best-practice RNAseq analysis pipeline
* Learn commonly used bioinformatics tools
* Practice the automated, scalable pipeline
* Explore the quality metrics and input/output of the RNAseq pipeline
* Visualize result files and quality plots

***

## Workflow

<img src='assets/RNAseq.workflow.png' title='RNAseq workflow' width='1000' height='1000'>

***

## Dataset

The test datasets used in this workshop are from
Fog et al. 2015. Loss of PRDM11 promotes MYC-driven lymphomagenesis. Blood, 125(8):1272-81.
<http://www.bloodjournal.org/content/125/8/1272.long?sso-checked=true>

***

## Pipeline

For this workshop, the machine you are using has everything pre-compiled and pre-installed. It is ready for analysis.

In the future, if you'd like to use the pipeline on your own machines, download the analysis pipeline from [GitHub](https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq) and follow the installation instructions.

```{bash}
git clone https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq.git
```

Detailed documentation of the pipeline is available in the GitHub [README](https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq) and [wiki](https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq/wiki).

***

## Run pipeline

### 1. Open terminal from Jupyter Notebook

Click the [New] button at the top of the notebook, then select [Terminal] from the dropdown menu.

### 2. Launch pipeline (takes ~ 5 minutes)

```{bash}
##-- commands 
pwd
cd dev/rnaseq/CRI-Workshop-Nov2016-RNAseq/pipeline/test/
ls -alt
./Build_RNAseq.DLBC.sh &
jobs
##-- running ... ~ 5 minutes
ls -alt
```
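The launch command above sends the pipeline to the background with `&`. As a minimal sketch (not part of the pipeline itself), the same pattern can be wrapped so that the output is captured in a log file and the shell waits for the job to finish. The helper name `run_and_wait` and the log file name in the usage comment are hypothetical.

```bash
# Hypothetical helper (not part of the pipeline): run a command in the
# background, send stdout and stderr to a log file, then block until it
# finishes and return the command's exit status.
run_and_wait() {
    local log_file="$1"
    shift
    "$@" > "$log_file" 2>&1 &
    local pid=$!
    echo "started PID $pid, logging to $log_file" >&2
    wait "$pid"
}

# Assumed usage from the pipeline test directory:
# run_and_wait Build_RNAseq.DLBC.log ./Build_RNAseq.DLBC.sh
```

Because `wait` returns the exit status of the job it waited on, a non-zero status from `run_and_wait` indicates that the launched script failed, and the log file is the first place to look.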

***

## How to do RNAseq analysis?

* Steps 1 - 5 : Automated pipeline (Run_RNAseq.bds)
* Steps 6 - 7 : Interactive R & Bioconductor (Notebook III)

*Unless pointed out otherwise, all commands shown apply to the test datasets only. Do not use them directly on other datasets. Refer to the pipeline documentation for instructions on how to set up the pipeline for your own projects.*

### 1. Quality assessment of raw sequencing reads: FastQC
* **Goal**
    * Check if the reads are of high quality
    * Check if any preprocessing step is required (e.g. base trimming, adapter clipping, read filtering)
* **Method**
    * [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), *version 0.11.5*
    * Scan raw sequencing reads and produce QC reports for evaluation

<blockquote>
```
fastqc --extract -o $out.dir -t 2 --nogroup $r1.fastq.gz
```
</blockquote>

<img src='assets/Figure1.png' title='Figure1' width='800' height='800'>
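With `--extract`, FastQC unpacks each report into a folder that includes a tab-separated `summary.txt` (status, module name, file name). The sketch below lists only the modules flagged `WARN` or `FAIL`, which can help decide whether preprocessing is needed. The function name `fastqc_flagged` and the example path are assumptions for illustration.

```bash
# List FastQC modules flagged WARN or FAIL in an extracted summary.txt.
# Each line of summary.txt is tab-separated: status, module name, file name.
fastqc_flagged() {
    local summary_file="$1"
    awk -F '\t' '$1 == "WARN" || $1 == "FAIL" { print $1 ": " $2 }' "$summary_file"
}

# Assumed usage (the path depends on where --extract wrote the report):
# fastqc_flagged path/to/sample_fastqc/summary.txt
```

An empty result means every module passed. A flagged module is a prompt to inspect the corresponding plot, not an automatic reason to trim or filter.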

### 2. (optional) Preprocessing: Trimmomatic

* **Goal**
    * Clean up reads for improved alignment rate and accuracy (for the next step)
* **Method**
    * [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic), *version 0.36*
    * **Clip adapters**
    * **Trim leading/trailing low quality or N bases**
    * Trim reads to a specific length
    * Filter out reads of low average quality / of specific length
    * Convert base quality scores between Phred33 and Phred64 FastQ format
* **Avoid over-trimming in RNAseq!**
    * Williams et al. 2016. Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinformatics. [DOI: 10.1186/s12859-016-0956-2](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0956-2)

<blockquote>
```
java -Xmx4G -jar trimmomatic-0.36.jar SE -threads 4 -phred33 $r1.fastq.gz $r1.trim.fastq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:36
```
</blockquote>

<img src='assets/Figure2.png' title='Figure2' width='800' height='800'>
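One simple way to check for over-trimming is to compare read counts before and after Trimmomatic. The sketch below assumes gzipped FASTQ files with exactly four lines per read. The function names `count_reads` and `percent_surviving` are hypothetical helpers, not part of the pipeline.

```bash
# Count reads in a gzipped FASTQ file. Assumes 4 lines per read
# (no multi-line sequence records).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Percentage of reads that survived trimming, to one decimal place.
# Arguments: raw FASTQ (gzipped), trimmed FASTQ (gzipped).
percent_surviving() {
    local before after
    before="$(count_reads "$1")"
    after="$(count_reads "$2")"
    awk -v b="$before" -v a="$after" \
        'BEGIN { if (b > 0) printf "%.1f\n", 100 * a / b; else print "NA" }'
}

# Assumed usage with the file naming from the command above:
# percent_surviving "$r1.fastq.gz" "$r1.trim.fastq.gz"
```

Trimmomatic also prints its own survival summary when it runs. This helper is useful mainly when only the input and output files are at hand.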

### 3. Map reads to reference genome (GRCh38): STAR

### 4. (optional) Collect RNAseq metrics & coverage: Picardtools, bedtools, RSeQC

### 5. Quantify transcript abundance: featureCounts

### 6. Identify differentially expressed genes (DEGs) between conditions: DESeq2

### 7. Identify biological processes and pathways enriched in genes of interest: clusterProfiler