# 0 How to use this tutorial
All hyperlinks are [clickable](img/IMG1.jpg "What happened to the curious cat?"), text in <span title="This tutorial was written in Jupyter (iPython notebook) and rendered on Github."><b>bold</b></span> has hovertext with additional information (move your mouse over the text to read the messages), and all code should be run in a Unix shell (Bash) terminal. Code will appear as indented blocks of text like this:
```
### These are comments within a block of code
echo "You can copy and paste this code in a Bash terminal on your computer"
```

# 1 Introduction
<span title="The process of identifying the order of the individual components in a chain of molecules, e.g. nucleotides in RNA/DNA or amino acids in proteins."><b>Sequencing</b></span> plays an important role in diverse fields including \[but not limited to\] forensics, evolutionary biology, pharmacology, clinical pathology, oncology, and even anthropology. Nucleic acid sequencing is not exactly a *new* technology; scientists have been using various methods of nucleotide sequencing since the 1970s. Nearly half a century later, the cost of sequencing has dropped dramatically thanks to numerous discoveries and innovations.

|<span title="Source: https://www.genome.gov/images/content/costpergenome2015_4.jpg">The decreasing cost of sequencing</span>|<span title="Source: 'The cancer genome' by Michael R. Stratton, Peter J. Campbell, and P. Andrew Futreal. doi:10.1038/nature07943">Evolution of sequencing technology</span>|
|:-------:|:------:|
|<span title="Source: https://www.genome.gov/images/content/costpergenome2015_4.jpg"><img src='img/IMG2.jpg' width='400'></span>| <span title="Source: 'The cancer genome' by Michael R. Stratton, Peter J. Campbell, and P. Andrew Futreal. doi:10.1038/nature07943"><img src=img/IMG4.jpg width='400'></span>|

<center>[Click here](http://www.sciencedirect.com/science/article/pii/S0888754315300410) to learn more about the history of sequencing technology.</center>

Have you ever worked with sequencing data before? If your answer was '*Yes*', what were you trying to find? How was the data organized? Where did it come from? If your answer was '*No*', then let me welcome you to the era of !

## 1.1 Get to know your data

In this tutorial we are going to be working with RNA sequencing data from the model organism *Saccharomyces cerevisiae*. As *S. cerevisiae* is a very well-studied organism, <span title="This isn't always a good idea; depending on your research objectives, you may need to consider which criteria were used to decide which features go in the reference annotations and what was discarded."><b>we will assume</b></span> that the reference genome & assembly are mostly correct and complete.

What kind of organism is *S. cerevisiae*? *S. cerevisiae* is a budding yeast: a <span title="'Simple' is a relative term..."><b>simple</b></span> eukaryote with 16 chromosomes and <span title="According to yeastgenome.org, there are 5155 'validated' ORFs... but according to the NCBI, there are 6350 genes. Take note: annotations can vary between sources and releases."><b>~6000 genes</b></span>. Yeast are among the easiest organisms to grow and study in a laboratory setting. There are dozens of well-documented experimental techniques to manipulate the genomes of yeast, some of which you can even <span title="Try doing that with a chimpanzee... On second thought, please don't try that."><b>[do at home!](http://www.the-odin.com/diy-yeast-crispr-kit/)</b></span>

|<span title="Cell cycle of budding yeast. Source: https://voer.edu.vn/file/3025">Cell cycle</span>|<span title="S. cerevisiae dividing. Source: http://www.csb.ethz.ch/research/experimental-yeast-biology.html">Your favorite sequence can be tagged with a fluorescent reporter!</span>|
|:-------:|:-------:|
|<span title="Cell cycle of budding yeast. Source: https://voer.edu.vn/file/3025"><img src="img/IMG6.jpg" width='350'></span>|<span title="S. cerevisiae dividing. Source: http://www.csb.ethz.ch/research/experimental-yeast-biology.html"><img src="img/IMG5.png" width='400'> </span>|

We grew our sample in a rich medium, meaning our yeast were happily growing before we extracted and purified their RNA. Our RNA sequencing (RNAseq) data is stored in a pair of FASTQ files. These files contain millions of reads which were sequenced on the <span title="Additional information about our data: it is 50bp, paired-end, and stranded."><b>Illumina HiSeq2500 platform</b></span>, a second-generation sequencer. The main innovation of second-generation sequencing [AKA next-generation sequencing, or NGS] was to split the task of identifying each base into a massively parallel process.

<span title="Source: 'Next Generation Sequencing in Aquatic Models' by Yuan Lu, Yingjia Shen, Wesley Warren and Ronald Walter. DOI: 10.5772/61657"><img src=img/IMG3.png width=500></span>

If you haven't already, download the RNAseq data we will be working with, two FASTQ files called *s_cerevisiae_chrX_read1.fastq.gz* and *s_cerevisiae_chrX_read2.fastq.gz*, from [this link](https://github.com/willblev/assembly_workshop_MA_2016). Typically an RNAseq run on an NGS platform will generate several gigabytes of data; our files are <span title="100MB vs 2.3GB"><b>significantly smaller</b></span> because we are starting with only those reads which map to chromosome 10.
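
Each read in a FASTQ file takes up four lines (header, sequence, a `+` separator, and a quality string), so the number of reads is the line count divided by four. The sketch below is only an illustration: it builds a tiny gzipped FASTQ file with three invented reads (the name `toy_reads.fastq.gz` is made up) and counts them without writing a decompressed copy to disk. The same pipeline should work on the real *.fastq.gz* files.

```shell
### Build a toy gzipped FASTQ file with three invented reads (illustration only)
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'TTGCATTGCA' '+' 'IIIIIHHHHH' \
  '@read3' 'GGCCAAGGCC' '+' '#####IIIII' \
  | gzip > "$workdir/toy_reads.fastq.gz"

### Look at the first read (4 lines) without decompressing the file on disk
gzip -dc "$workdir/toy_reads.fastq.gz" | head -n 4

### Each read occupies 4 lines, so reads = lines / 4
n_lines=$(gzip -dc "$workdir/toy_reads.fastq.gz" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "Number of reads: $n_reads"
```

For paired-end data, running this on both files is a quick sanity check: read1 and read2 should contain the same number of reads.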

## 1.2 Prerequisites
First you need to download and <span title="You will need to download and install lots of different programs throughout your careers as bioinformaticians. As each program author is different, the installation methods will vary. HINT: Most packages contain a file called README or INSTALL; this is always a good place to start."><b>install</b></span> the following programs:
* [FastQC 0.11.5 for Linux](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip) or [FastQC 0.11.5 for OSX (Mac)](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.dmg)
* [Trimmomatic 0.36](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.36.zip) (you'll need Java as well, but most systems have Java already installed)
* [Bowtie2 2.2.9 for Linux](https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/bowtie2-2.2.9-linux-x86_64.zip/download) or [Bowtie2 2.2.9 for OSX (Mac)](https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/bowtie2-2.2.9-macos-x86_64.zip/download)
* [SAMtools 1.3.1](https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2)
* [Cufflinks 2.2.1 for Linux](http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz)  or [Cufflinks 2.2.1 for OSX (Mac)](http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.OSX_x86_64.tar.gz)  

After you have downloaded, decompressed, and installed all of these programs, you may need to tell your computer where to find the executable files. When you try to run a program from the command line, your terminal looks for executable files in the directories listed in the \$PATH environment variable. To add the directory containing your new program to the \$PATH environment variable, you can use the following code:
```
### for example, if the executable file 'fastqc' is installed
### in the directory '/home/username/Download/FastQC-0.11.5'

export PATH=/home/username/Download/FastQC-0.11.5:$PATH
```
<br>
You could also add this line to your ```~/.bash_profile``` if you wanted to store this path permanently. 
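
To check that a change to \$PATH had the intended effect, you can ask the shell where it finds a program. The sketch below is a self-contained demonstration using a dummy program; the name `my_tool` and the temporary directory are invented for illustration. With the real tools you would run something like `command -v fastqc` instead.

```shell
### Create a throwaway directory containing a dummy executable called 'my_tool'
### (both the directory and the program are invented for this demonstration)
tool_dir=$(mktemp -d)
printf '#!/bin/sh\necho "my_tool is running"\n' > "$tool_dir/my_tool"
chmod +x "$tool_dir/my_tool"

### Prepend the directory to PATH, just like in the FastQC example above
export PATH="$tool_dir:$PATH"

### 'command -v' prints the full path of the executable the shell will run
found_at=$(command -v my_tool)
echo "my_tool found at: $found_at"

### The program can now be called by name from any directory
my_tool
```

If `command -v` prints nothing, the directory is not on your \$PATH (or the file is not executable).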


# 2 Create a working directory and download reference files
One of the most important things you will learn as a researcher is to organize and archive your work as you go, leaving notes and comments about your thought process and results. Use meaningful names for files, using version control where necessary, and create directory structures that make sense for your data. As this tutorial is about RNA sequencing of *S. cerevisiae*, we will call our working directory "S_cerevisiae_RNAseq_tutorial_2016". Create this new directory on your computer wherever you please, then navigate to it.
```
### Make a new directory and navigate to it
mkdir S_cerevisiae_RNAseq_tutorial_2016
cd S_cerevisiae_RNAseq_tutorial_2016
```
Instead of using a browser, as you did for the programs you just installed, you can download the reference genome and annotations directly from a terminal window.
```
### Download the reference genome and annotations for Saccharomyces cerevisiae (S288c)
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz 
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz
```
As these files are compressed, <span title="Some bioinformatic tools are designed to work with compressed files and thus do not require this step."><b>we have to decompress them</b></span>.
```
### Unzip the two files we just downloaded
gunzip GCF_000146045.2_R64_genomic*.gz
```
Now let's rename these files so that they are easier to work with (and it will be easier to remember what they are). 
```
### Rename the files with names that are easier to remember
mv GCF_000146045.2_R64_genomic.fna s_cerevisiae.fasta
mv GCF_000146045.2_R64_genomic.gff s_cerevisiae.gff
```
We also need to download some files 

## 2.1 Create new files containing only chromosome 10 (ChrX)
Due to time constraints, we're not going to do our analysis using the whole genome; instead, we will work exclusively with chromosome 10 (ChrX). The FASTA file *s_cerevisiae.fasta* contains the sequences of all 16 chromosomes (plus the mitochondrial genome), but we only need chromosome 10. <span title="A slow way to do this is to open the file s_cerevisiae.fasta in a text editor and search for 'NC_001142.9' which is the ID that corresponds to chromosome 10. Delete everything except for chromosome 10 then save the file as 's_cerevisiae_chrX.fasta'."><b>How would you make a new file called *s_cerevisiae_chrX.fasta* that only contains chromosome 10?</b></span> Here is a table which will help you figure out which header+sequence to keep:

| ID in s_cerevisiae.fasta | Chromosome |
|---------------------|------------|
| NC_001133.9         | I          |
| NC_001134.8         | II         |
| NC_001135.5         | III        |
| NC_001136.10        | IV         |
| NC_001137.3         | V          |
| NC_001138.5         | VI         |
| NC_001139.9         | VII        |
| NC_001140.6         | VIII       |
| NC_001141.2         | IX         |
| NC_001142.9         | X          |
| NC_001143.9         | XI         |
| NC_001144.5         | XII        |
| NC_001145.3         | XIII       |
| NC_001146.8         | XIV        |
| NC_001147.6         | XV         |
| NC_001148.4         | XVI        |
| NC_001224.1         | Mito       |

Try to figure it out on your own, but here is a <span title="One way to make s_cerevisiae_chrX.fasta is to trim off the beginning and end using head and tail:
head -n last_line_of_chrX s_cerevisiae.fasta | tail -n number_of_lines_in_chrX > s_cerevisiae_chrX.fasta"><b>hint if you get stuck</b></span>.
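
If you would rather not count line numbers, another possible approach (one of several, and not necessarily the one intended by the hint) is to let `awk` switch printing on and off at each header line. The sketch below runs on a toy FASTA file; the IDs `seqA`, `seqB`, and `seqC` and the file names are invented for illustration.

```shell
### Build a toy multi-FASTA file with three invented records
workdir=$(mktemp -d)
cat <<EOF > "$workdir/toy.fasta"
>seqA first toy record
ACGTACGT
ACGT
>seqB second toy record
TTTTGGGG
CC
>seqC third toy record
GATTACA
EOF

### At every header line, switch 'keep' on only if the first word is the wanted ID;
### every line is printed while 'keep' is on, so we get the header plus its sequence
wanted="seqB"
awk -v id=">$wanted" '/^>/ { keep = ($1 == id) } keep' \
  "$workdir/toy.fasta" > "$workdir/toy_seqB.fasta"

cat "$workdir/toy_seqB.fasta"
```

On the real data, setting `wanted` to `NC_001142.9` and pointing `awk` at *s_cerevisiae.fasta* should produce the chromosome 10 file, assuming each header starts with `>` followed directly by the ID.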

We have to do the same thing with the annotations; we only want features which appear on chromosome X. How would you go about generating a new file, *s_cerevisiae_chrX.gff*, <span title="Remember, chromosome X is called 'NC_001142.9' in the annotations. Look at the GFF file- the first column of every line is the chromosome."><b> with only the features from chromosome 10?</b></span> Here's a <span title="We could use grep to find all the features on chromosome 10; the ^ symbol tells grep to look for a match at the beginning of each line."><b>hint if you need it</b></span>.
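
As a rough sketch of the `grep` idea from the hint, the example below filters a toy annotation file by its first column. The sequence IDs (`seqA`, `seqB`, `seqAB`), the features, and the file names are all invented; only the nine-column, tab-separated layout mirrors a real GFF file.

```shell
### Build a toy tab-separated annotation file with invented features
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  'seqA'  'toy' 'gene' '10' '90' '.' '+' '.' 'ID=gene1' \
  'seqB'  'toy' 'gene' '5'  '50' '.' '-' '.' 'ID=gene2' \
  'seqB'  'toy' 'exon' '5'  '50' '.' '-' '.' 'ID=exon1' \
  'seqAB' 'toy' 'gene' '1'  '20' '.' '+' '.' 'ID=gene3' \
  > "$workdir/toy.gff"

### Keep only the lines whose first column is exactly 'seqA'.
### ^ anchors the match at the start of the line; the tab after the ID
### stops 'seqA' from also matching 'seqAB'
tab=$(printf '\t')
grep "^seqA${tab}" "$workdir/toy.gff" > "$workdir/toy_seqA.gff"

cat "$workdir/toy_seqA.gff"
```

Anchoring on the ID plus the following tab also leaves out the comment lines (starting with `#`) found at the top of real GFF files.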

Open the files and check to make sure that you did everything correctly. Run the following code and you should get the same results:
```
### check the size of the chromosome X annotations and of the full genome FASTA file
wc s_cerevisiae_chrX.gff s_cerevisiae.fasta
```
<br>
```
    1623    29194   469934 s_cerevisiae_chrX.gff
  151990   152108 12310392 s_cerevisiae.fasta
  153613   181302 12780326 total
```

# 3 Trimming RNAseq reads with *Trimmomatic*

NGS typically results in <span title="Our samples had approximately 25 million read pairs."><b>tens of millions of reads</b></span>. A proportion of these reads will contain artifacts or low-quality bases which we would like to remove before starting our analyses. *Trimmomatic* performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters is supplied on the command line. Our sample of *S. cerevisiae* was prepared with the TruSeq RNA library preparation protocol using adapter number 11, so we will search through our reads (the FASTQ files) and remove this adapter.

Instead of downloading a file and editing it, we can simply create a file containing the one adapter sequence by running the following code:
```
### This code will create a file called illumina_adapter.fasta which contains
### the sequence of TruSeq adapter number 11
cat <<EOF > illumina_adapter.fasta
>TruSeq_Adapter_Index_11
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG
EOF
```


We need to create a directory to store the output of *Trimmomatic*:
```
### First we will create a new directory within our working directory to store the trimmed fastq files
mkdir trimmomatic
```
We can then run *Trimmomatic* on our RNAseq raw data. We use several parameters to tell the program how we want to filter and trim our data:
* ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read.
* LEADING: Cut bases off the start of a read if they are below a threshold quality.
* TRAILING: Cut bases off the end of a read if they are below a threshold quality.
* SLIDINGWINDOW: Perform sliding-window trimming, cutting once the average quality within the window falls below a threshold.
* MINLEN: Drop the read if it is shorter than a specified length.
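
The quality thresholds above are Phred scores. In the Phred+33 encoding (the `-phred33` option), each character of a read's quality string stands for one base, and its score is the character's ASCII code minus 33. The sketch below decodes an invented quality string to show what those numbers look like; it is only an illustration of the encoding, not how *Trimmomatic* itself is implemented.

```shell
### An invented 8-character quality string in Phred+33 encoding (illustration only)
quality='IIII5555'

### Decode each character: Phred score = ASCII code - 33
total=0
scores=""
i=1
while [ "$i" -le "${#quality}" ]; do
  char=$(printf '%s' "$quality" | cut -c "$i")   # the i-th character of the string
  ascii=$(printf '%d' "'$char")                  # a leading quote makes printf print the character code
  score=$(( ascii - 33 ))
  scores="$scores $score"
  total=$(( total + score ))
  i=$(( i + 1 ))
done

### Integer division is enough for a rough look at the average
mean=$(( total / ${#quality} ))
echo "Scores:$scores"
echo "Mean quality: $mean"
```

Here `I` decodes to 40 and `5` to 20. With a setting such as `TRAILING:30`, the four low-scoring bases at the end of this toy read would be expected to be trimmed away.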

```
### Running Trimmomatic on our RNAseq data. If Java can't locate the .jar file,
### give the full path to it (the directory where you unzipped Trimmomatic)
java -jar trimmomatic-0.36.jar \
PE -phred33 s_cerevisiae_chrX_read1.fastq.gz s_cerevisiae_chrX_read2.fastq.gz \
trimmomatic/s_cerevisiae_chrX_read1_paired.fastq trimmomatic/s_cerevisiae_chrX_read1_unpaired.fastq \
trimmomatic/s_cerevisiae_chrX_read2_paired.fastq trimmomatic/s_cerevisiae_chrX_read2_unpaired.fastq \
ILLUMINACLIP:illumina_adapter.fasta:2:30:10:2:true LEADING:30 TRAILING:30 SLIDINGWINDOW:4:15 MINLEN:35
```



# 4 Quality control with *FastQC*
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. It generates an HTML file with graphs that allow you to quickly spot artifacts or other problems with your raw data.

Again, we first create a directory to contain the output of *FastQC*:
```
### We create a new directory which will contain the results of our analysis
mkdir fastqc
```
And then run the analysis:
```
### Run fastqc on the pair of fastq files that trimmomatic produced [using 3 threads to speed things up]
fastqc trimmomatic/s_cerevisiae_chrX_read1_paired.fastq -o fastqc -t 3
fastqc trimmomatic/s_cerevisiae_chrX_read2_paired.fastq -o fastqc -t 3
```

## 4.1 Analysis of *FastQC* results
Open the files s_cerevisiae_chrX_read1_paired_fastqc.html and s_cerevisiae_chrX_read2_paired_fastqc.html in your browser. Does our data have any warnings or failures? Is the data ready for analysis? You can see [the description of all the different modules here](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/).
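
Besides the HTML report, *FastQC* writes a *.zip* archive for each input file; it contains a plain-text `summary.txt` with one tab-separated line per module (status, module name, file name). If you unzip the archive, you can scan the results from the terminal. The sketch below works on a toy file laid out the same way; the statuses and the file name `toy_reads.fastq` are invented and are not results from our data.

```shell
### Create a toy file laid out like a FastQC summary.txt: status, module, file name
### (the statuses below are invented, not results from our data)
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  'PASS' 'Basic Statistics'            'toy_reads.fastq' \
  'PASS' 'Per base sequence quality'   'toy_reads.fastq' \
  'WARN' 'Per base sequence content'   'toy_reads.fastq' \
  'FAIL' 'Sequence Duplication Levels' 'toy_reads.fastq' \
  > "$workdir/summary.txt"

### Count how many modules fall into each status
cut -f1 "$workdir/summary.txt" | sort | uniq -c

### List only the modules that did not pass
flagged=$(awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }' "$workdir/summary.txt")
echo "$flagged"
```

A warning or failure does not automatically mean the data is unusable. Some modules routinely flag RNAseq data, so check the module descriptions before deciding.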

# 5 Map RNAseq reads to the reference genome with *Bowtie2*

After we check to make sure that our RNAseq reads are usable, we can map them to our reference genome. In our case, we will be mapping only to a smaller part of the genome, chromosome X.

## 5.1 Build an index
First we must create a *bowtie2* index using chromosome 10 so that the millions of reads can be mapped efficiently to the reference sequence.

```
### Building a bowtie2 index with our .fasta file. It will create several files with the *.bt2 extension
bowtie2-build  s_cerevisiae_chrX.fasta  s_cerevisiae_chrX
```


## 5.2 Map the reads with *Bowtie2*
Now that we have built the index, we need to tell *bowtie2* where to find these index files:
```
### create an environment variable so that bowtie2 knows where the index files are
export BOWTIE2_INDEXES='.'
```
Then we can run *bowtie2* to map our reads to the reference, which will result in a .SAM file:
```
### Run bowtie2 to map our reads to the reference genome
bowtie2 -x s_cerevisiae_chrX -p 2 \
-1 trimmomatic/s_cerevisiae_chrX_read1_paired.fastq \
-2 trimmomatic/s_cerevisiae_chrX_read2_paired.fastq \
-S 's_cerevisiae_chrX.sam'
```
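
A SAM file is plain text: header lines start with `@`, and every other line is one tab-separated alignment record whose second column is a numeric FLAG. The 0x4 bit of that FLAG (value 4) marks a read as unmapped, which makes it possible to get a rough mapped/unmapped count with `awk` alone. The sketch below uses a toy SAM file with four invented records; `samtools flagstat` gives a fuller report on real data.

```shell
### Build a toy SAM file: one header line plus four invented alignment records
workdir=$(mktemp -d)
{
  printf '@SQ\tSN:toy_chr\tLN:1000\n'
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    'r1' '99'  'toy_chr' '100' '42' '8M' '=' '200' '108'  'ACGTACGT' 'IIIIIIII' \
    'r1' '147' 'toy_chr' '200' '42' '8M' '=' '100' '-108' 'TTGCATTG' 'IIIIIIII' \
    'r2' '77'  '*'       '0'   '0'  '*'  '*' '0'   '0'    'GGCCAAGG' 'IIIIIIII' \
    'r2' '141' '*'       '0'   '0'  '*'  '*' '0'   '0'    'CCTTGGAA' 'IIIIIIII'
} > "$workdir/toy.sam"

### Skip header lines; int(FLAG / 4) % 2 extracts the 'unmapped' bit of the FLAG
mapping_summary=$(awk -F'\t' '!/^@/ {
  if (int($2 / 4) % 2 == 1) unmapped++; else mapped++
} END { print "mapped:", mapped + 0, "unmapped:", unmapped + 0 }' "$workdir/toy.sam")
echo "$mapping_summary"
```

In this toy file the pair `r1` is mapped and the pair `r2` is not, so the counts come out as two and two.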

To be able to use our mapped reads with *cufflinks*, we need to sort them by genomic position. We use the *samtools* suite of tools to convert the .SAM file into a .BAM file and sort it.
```
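### convert the .sam file into an uncompressed .bam stream and sort it by position;
### samtools 1.3 and later need the output file given with -o (-T sets the prefix for temporary files)
samtools view -Su s_cerevisiae_chrX.sam | samtools sort -T s_cerevisiae_chrX_tmp -o s_cerevisiae_chrX_sorted.bam -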
```



# 6 Assemble the mapped reads into a transcriptome with *Cufflinks*

*Cufflinks* takes aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. It has several functions including the assembly of transcripts, estimating their abundances, and testing for differential expression and regulation. <span title="Source: 'Identification of novel transcripts in annotated genomes using RNA-Seq' by Adam Roberts, Harold Pimentel, Cole Trapnell, and Lior Pachter. DOI: 10.1093/bioinformatics/btr355"><img src='img/IMG7.jpg' width='750'></span>

Create a directory to store the files generated by cufflinks:
```
mkdir cufflinks_output
```
Instead of simply running cufflinks, this time we're going to define several variables and then pass cufflinks the variable names instead of writing out the name of the files when we call cufflinks. This will give the same final result, but it can be useful when you are writing scripts which you want to recycle. If you want to re-run the code with a different file, it is easier to edit a script if you define all your variables at the beginning. 
```
DIR_OF_OUTPUT_FILES="cufflinks_output"
REFERENCE_GENOME="s_cerevisiae_chrX.fasta"
REFERENCE_ANNOTATION="s_cerevisiae_chrX.gff"
READ_1="trimmomatic/s_cerevisiae_chrX_read1_paired.fastq"
READ_2="trimmomatic/s_cerevisiae_chrX_read2_paired.fastq"
BAM_FILE="s_cerevisiae_chrX_sorted.bam"

cufflinks $BAM_FILE -p 4 \
--frag-bias-correct $REFERENCE_GENOME \
-g $REFERENCE_ANNOTATION \
--max-bundle-frags 999999999 \
--max-bundle-length 4000000 \
--library-type fr-firststrand \
-o $DIR_OF_OUTPUT_FILES 
```
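
To take the idea of reusable scripts one step further, variables can be given default values that are overridden by arguments. The sketch below is a hypothetical wrapper: the function name `build_cufflinks_command` and the alternative file names are invented for illustration. It only prints the command it would run, so it is safe to try without *cufflinks* installed.

```shell
### Hypothetical wrapper (the function name and alternative file names are illustrative).
### The first argument overrides the BAM file and the second the output directory;
### when an argument is missing, the default after ':-' is used instead
build_cufflinks_command() {
  bam_file="${1:-s_cerevisiae_chrX_sorted.bam}"
  out_dir="${2:-cufflinks_output}"
  ### 'echo' only prints the command so it can be inspected; remove it to run for real
  echo cufflinks "$bam_file" -p 4 -o "$out_dir"
}

### Called without arguments, the defaults from this tutorial are used
default_cmd=$(build_cufflinks_command)
echo "$default_cmd"

### Called with arguments, the same code is reused for a different sample
custom_cmd=$(build_cufflinks_command another_sample.bam another_output)
echo "$custom_cmd"
```

Printing a command before running it is a cheap way to catch typos in file names, which are easy to miss once they are hidden behind variables.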