# AMIA 2016 Annual Symposium Workshop (WG13)

### Mining Large-scale Cancer Genomics Data Using Cloud-based Bioinformatics Approaches (RNAseq)

Riyue Bao, Ph.D. 
Center for Research Informatics,
The University of Chicago.
8:45 AM - 10:00 AM, November 13, 2016

***

## Objective

* **Part I: Introduction to RNAseq technology, clinical application and analysis pipeline**
    * Learn the background and clinical application of RNAseq
    * Learn the best-practice analysis protocol
* **Part II: Practice how to perform downstream analysis of RNAseq data (hands-on)**
    * Detect differentially expressed genes (DEGs) between conditions
    * Identify pathways / network enriched in genes of interest
    * Generate high-quality publication figures (Principal Component Analysis (PCA), heatmap, pathway, etc.)
* **Part III: Practice how to associate RNAseq with clinical data (hands-on)**
    * The Cancer Genome Atlas (TCGA) 
    * NCI Genomics Data Commons (GDC) hosts the new harmonized TCGA data

***

## Dataset

* **Data to run RNAseq analysis**
    * Two groups (*PRDM11* KO vs WT, U2932 cells), 6 samples
    * Aim to identify DEGs / pathways altered between KO and WT groups
    * Fog et al. 2015. [Loss of PRDM11 promotes MYC-driven lymphomagenesis](http://www.bloodjournal.org/content/125/8/1272.long?sso-checked=true). Blood 125(8):1272-81

* **Data to associate RNAseq with clinical outcome**
    * TCGA ovarian cancer, 379 primary tumors
    * Aim to use gene expression to group patients into subtypes and detect survival differences
    * The Cancer Genome Atlas Research Network. 2011. [Integrated genomic analyses of ovarian carcinoma](http://www.nature.com/nature/journal/v474/n7353/full/nature10166.html). Nature 474, 609–615

***

## Workflow

<img src='notebook_ext/ipynb_data/assets/RNAseq.workflow.v2.png', title = 'RNAseq workflow', width = 900, height = 900>

***

## Part I: Introduction to RNAseq technology, clinical application and analysis pipeline

### 1.1 What is RNAseq

* **High-throughput sequencing of RNA to profile, identify or assemble transcripts**
    * Detect gene / isoform expression changes between conditions
    * Identify novel splice sites / exons, mutations, fusion genes, etc.
    * Available for all species (reference genome is optional): reference genome-guided alignment or *de novo* assembly
    * Transcriptome-wide approach for quantitative measurements and gene discovery without prior knowledge
<img src='notebook_ext/ipynb_data/assets/Figure14.png', title = 'A typical RNAseq experiment', width = 800, height = 400>
* **Experimental Design** ([considerations and limitations](https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq/tree/master/notebook_ext/ipynb_data/assets/Figure13.png))
<img src='notebook_ext/ipynb_data/assets/Figure15.png', title = 'Biological replicates', width = 400, height = 400>

### 1.2 Clinical applications 

### 1.3 How to perform RNAseq analysis

The best-practice analysis protocol consists of 8 major steps.

### 1.3.1 - 1.3.2 Quality assessment and preprocessing of raw sequencing reads

* **Raw sequencing reads** are stored in FastQ format (e.g. `KO01.fastq.gz`), where each read is represented by 4 lines
<img src='notebook_ext/ipynb_data/assets/Figure10.png', title = 'Sequencing reads in FastQ format', width = 600, height = 90>
* QC produces reports that help you evaluate if a sequencing run is successful and if reads are of high quality ([example](https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq/tree/master/notebook_ext/ipynb_data/assets/multiqc_report.html))
* *Optional* Preprocess reads to improve mapping rate and accuracy (in the next step, 1.3.3)
    * Trim low-quality bases, clip adapters, etc.
    * Avoid overtrimming in RNAseq! ([Williams et al. 2016](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0956-2))
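
The 4-line FastQ layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the workshop pipeline: it assumes Phred+33 quality encoding, and the helper names (`read_fastq`, `phred_scores`, `mean_quality`) and the two toy reads are hypothetical, introduced only for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # read identifier (line 1, without the leading '@')
    sequence: str  # base calls (line 2)
    quality: str   # per-base quality string (line 4)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per 4-line block of a FastQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # line 3 is the '+' separator and is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FastQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Phred scores (Phred+33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(record: FastqRecord) -> float:
    """Average Phred score of one read (0.0 for an empty read)."""
    scores = phred_scores(record.quality)
    return sum(scores) / len(scores) if scores else 0.0


# Toy input with two made-up reads, standing in for a real FastQ file
toy = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII##\n"
)
records = list(read_fastq(toy))
for rec in records:
    print(rec.name, round(mean_quality(rec), 1))
```

To try this on a compressed file such as `KO01.fastq.gz`, the handle could be opened with `gzip.open(path, "rt")` from the standard library. Dedicated QC tools report far more than a mean score, so this is only meant to make the file format concrete.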

### 1.3.3 - 1.3.4 Map reads to reference genome and quantification of transcript abundance
* Read mapping identifies the location in the genome where a sequencing read comes from
* Accurate mapping is key to reliable quantification and DEG identification
* RNAseq reads can span exon-exon junctions, so use a **splice-aware aligner** (e.g. [STAR](https://github.com/alexdobin/STAR))
<img src='notebook_ext/ipynb_data/assets/Figure12.png', title = 'RNAseq mapping result', width = 600, height = 400>
* **Different aligners may generate very different results** ([Engström et al. 2013](http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2722.html))
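
To make the idea of quantification concrete, here is a toy sketch that counts mapped reads per gene by interval overlap and scales the counts to counts per million (CPM). It is an illustration under strong simplifying assumptions (one chromosome, no strand, no spliced alignments), not the method used by the workshop pipeline. The gene coordinates, read coordinates, and function names are all hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, single chromosome

# Hypothetical gene models and read alignments, for illustration only
genes: Dict[str, Interval] = {"geneA": (100, 200), "geneB": (300, 450)}
reads: List[Interval] = [
    (110, 160),  # inside geneA
    (150, 199),  # inside geneA
    (310, 360),  # inside geneB
    (190, 310),  # touches both genes: ambiguous
    (500, 550),  # overlaps no gene: intergenic
]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_reads(genes: Dict[str, Interval], reads: List[Interval]) -> Dict[str, int]:
    """Count reads that overlap exactly one gene; skip ambiguous and unassigned reads."""
    counts = {name: 0 for name in genes}
    for read in reads:
        hits = [name for name, span in genes.items() if overlaps(read, span)]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw counts by the total number of assigned reads."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: c * 1e6 / total for name, c in counts.items()}


counts = count_reads(genes, reads)
cpm = counts_per_million(counts)
for name in genes:
    print(name, counts[name], round(cpm[name], 1))
```

How ambiguous reads are handled is a design choice. Here they are simply dropped, which keeps the example short but discards information that real quantification tools may treat differently.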

### RNAseq metrics
* Collect metrics to evaluate RNA sample quality and identify potential problems ([example](https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq/tree/master/notebook_ext/ipynb_data/assets/multiqc_report.html))
    * **Is there high-level genomic DNA contamination?**
    * Is the RNA highly degraded?
    * Was ribosomal RNA successfully depleted during library prep?
    * How do reads distribute on the genome? (exons, introns, intergenic, etc.)
    * Is the strandedness of read alignment consistent with library type? (non-stranded or forward/reverse strand-specific)
    * Does the target gene knocked out in KO samples have reduced expression as expected?
<blockquote>
Which sample (S1-4) has the most severe contamination from genomic DNA?   
Hint: higher percentage of intergenic reads indicates more severe DNA contamination in RNA samples
</blockquote>
<img src='notebook_ext/ipynb_data/assets/Figure5.png', title = 'Figure5', width = 600, height = 600>
*For answers to other questions, refer to the extended version of notebooks* (`notebook_ext/02.Run_RNAseq.tutorial.ipynb`)
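
The hint above can be expressed as a simple calculation: compute the fraction of reads that fall in intergenic regions for each sample and rank the samples by it. The sketch below uses made-up counts and sample names (they are not read off the figure), and the function name `intergenic_fraction` is hypothetical.

```python
from typing import Dict

# Hypothetical read counts per genomic feature class (NOT taken from the figure)
toy_counts: Dict[str, Dict[str, int]] = {
    "sampleA": {"exonic": 800, "intronic": 150, "intergenic": 50},
    "sampleB": {"exonic": 500, "intronic": 200, "intergenic": 300},
    "sampleC": {"exonic": 700, "intronic": 200, "intergenic": 100},
}


def intergenic_fraction(counts: Dict[str, int]) -> float:
    """Fraction of a sample's assigned reads that fall in intergenic regions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return counts.get("intergenic", 0) / total


fractions = {sample: intergenic_fraction(c) for sample, c in toy_counts.items()}

# Highest intergenic fraction first: the most likely gDNA-contaminated sample
ranked = sorted(fractions, key=fractions.get, reverse=True)
for sample in ranked:
    print(f"{sample}\t{fractions[sample]:.1%}")
```

A high intergenic fraction is a warning sign rather than proof of contamination, so it is best read alongside the other metrics listed above.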

* Steps 1-5, from raw sequencing reads to transcript quantification, are **automated through a BigDataScript pipeline**.
* DEG and pathway analysis (steps 6 to 7) can be automated, but it is recommended to **perform the analysis interactively to better interpret the results**.

*Due to time constraints, we will not run the pipeline in this workshop. Since it is automated, participants are encouraged to practice it after the workshop by following the instructions on [GitHub](https://github.com/riyuebao/CRI-Workshop-Nov2016-RNAseq)*

## Part II: Practice the downstream analysis of RNAseq data

## Part III: Practice how to associate RNAseq with clinical data