# Gene expression

## 0. Pick a partner

We'll work in pairs for both practical and pedagogical reasons:

1. Your compute jobs will go through more quickly, so you'll actually be able to finish the exercise.
2. [Studies](http://files.software-carpentry.org/training-course/2013/08/p34-porter.pdf) have shown that students are more engaged, learn faster, and retain the knowledge better when learning to program in pairs.

We'll take a short poll to assign partners (this will be your partner for the homework too).

## 1. Log on to TSCC

Mac/Linux:

    ssh ucsd-train##@tscc.sdsc.edu

Windows: Use Putty

## 2. Install the programs we'll need

Some very nice people have made [bioconda](https://bioconda.github.io/), so you can install bioinformatics programs using the Anaconda installer we already installed! The programs are:

- `star` - Aligning
- `samtools` - we'll need to use this to sort, binarize, and index the alignment file after aligning
- `subread` - contains `featureCounts`, which we'll use to quantify gene expression after sorting and binarizing
- `kallisto` - alternative method to both map and quantify gene expression at once

This won't take long. The `--yes` is so `conda` won't bother you by asking "are you SURE you want to install this?"

```
$ conda install --yes --channel bioconda star samtools subread kallisto 
```
    

## 3. Link to your scratch directory if you haven't already

The "scratch" directory can be thought of as ["scratch paper"](http://www.nytimes.com/2010/12/05/magazine/05FOB-onlanguage-t.html?_r=0) for data you want to store "just for now." These folders have VERY fast read/write (10-100x faster than the `/home` directory), but files that are 60+ days old get purged (deleted) every so often.

Your scratch directory is located in `/oasis/tscc/scratch/$USER/` but that's annoying to type out and inconvenient to remember. So, let's make a **soft link** from `$HOME/scratch` to this directory. Since we're using the dollar-sign variable notation, you can directly copy the command below.

```
ln -s /oasis/tscc/scratch/$USER $HOME/scratch
```

Now you should see something like this in your home directory:

```
[ucsd-train01@tscc-login2 ~]$ ls -lh
total 271M
drwxr-xr-x 16 ucsd-train01 biom262-group   17 Jan 24 21:17 anaconda3
-rw-r--r--  1 ucsd-train01 biom262-group 271M Dec  8 11:24 Anaconda3-2.4.1-Linux-x86_64.sh
-rw-r--r--  1 ucsd-train01 biom262-group   49 Jan 20 20:35 asdf
drwxr-xr-x  2 ucsd-train01 biom262-group    3 Jan 10 22:14 bin
drwxr-xr-x  3 ucsd-train01 biom262-group    3 Jan  7 07:39 code
-rw-r--r--  1 ucsd-train01 biom262-group  155 Jan 20 20:28 ls1.txt
-rw-r--r--  1 ucsd-train01 biom262-group   64 Jan 20 20:31 ls_nonexistent_folder.txt
-rw-r--r--  1 ucsd-train01 biom262-group  121 Jan 20 20:12 ls.txt
drwxr-xr-x  2 ucsd-train01 biom262-group    2 Jan  7 06:05 notebooks
lrwxrwxrwx  1 ucsd-train01 biom262-group   32 Jan 24 23:09 scratch -> /oasis/tscc/scratch/ucsd-train01
-rw-------  1 ucsd-train01 biom262-group   48 Jan 22 08:15 stderrtest.e4194600
-rw-------  1 ucsd-train01 biom262-group   70 Jan 22 08:15 stderrtest.o4194600
-rw-r--r--  1 ucsd-train01 biom262-group  112 Jan 20 22:07 stderrtest.sh
-rw-r--r--  1 ucsd-train01 biom262-group   25 Jan  7 19:32 test_script.sh
-rw-------  1 ucsd-train01 biom262-group    0 Jan  7 19:32 test_script.sh.e3962194
-rw-------  1 ucsd-train01 biom262-group   42 Jan  7 19:32 test_script.sh.o3962194
```
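
Because old scratch files get purged, it can help to check which of your files are approaching that age. Below is a minimal Python sketch, assuming the purge is based on file modification time (the actual TSCC policy may use a different criterion). `files_older_than` and `PURGE_AGE_DAYS` are names made up for this example.

```python
import time
from pathlib import Path

PURGE_AGE_DAYS = 60  # threshold quoted above; the real purge policy may differ


def files_older_than(root, days=PURGE_AGE_DAYS, now=None):
    """Return files under `root` last modified more than `days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * 24 * 60 * 60
    old = []
    for path in Path(root).rglob("*"):
        # is_file() is False for directories and for dangling soft links
        if path.is_file() and path.stat().st_mtime < cutoff:
            old.append(path)
    return sorted(old)
```

Calling `files_older_than(Path.home() / "scratch")` would list candidates to copy somewhere permanent before they disappear.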

## 4. Update your local `biom262-2016` repository

Go to your `biom262-2016` folder on TSCC.

```
$ cd ~/code/biom262-2016
```

Check what branch you are on:

```
[ucsd-train01@tscc-login2 biom262-2016]$ git branch -v
  master cc0da4c [ahead 28] add nfkb tf file
* week01 7c25c24 merged updates from upstream/master
```

The asterisk "`*`" indicates your current branch. Change to `master` with:

```
$ git checkout master
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 28 commits.
```
If it says you're ahead by some number of commits, it's totally fine.

Make sure the `biom262/biom262-2016` repository is set as "upstream":

```
[ucsd-train01@tscc-login2 biom262-2016]$ git remote -v
origin  https://github.com/olgabot/biom262-2016.git (fetch)
origin  https://github.com/olgabot/biom262-2016.git (push)
upstream        https://github.com/biom262/biom262-2016.git (fetch)
upstream        https://github.com/biom262/biom262-2016.git (push)
```

Now get the latest update with:

```
$ git pull upstream master
```

You may get prompted to write a `.MERGE_MSG`. What it says is fine - save and exit.

If you get merge conflicts, use "`git checkout upstream/master -- filename`" (replacing "`filename`" with the actual name of the file but keeping the two dashes `--`) to accept the upstream version.

## 5. Start a Jupyter notebook in the background and navigate to this folder

### Start a Jupyter notebook

To avoid port conflicts, try not to use common or easy port numbers like 4000 or 1234; pick more of a mix.

#### On TSCC

    jupyter notebook --no-browser --port #### &

#### On your laptop (Mac/Linux)

    ssh -NL ####:localhost:#### ucsd-train##@tscc-login#.sdsc.edu

### In the Jupyter notebook, navigate to ***this*** notebook

It's probably in `~/code/biom262-2016/weeks/week04`


## 6. FASTQ files

The current standard of sequencing files is [`fastq`](https://en.wikipedia.org/wiki/FASTQ_format), which is derived from the `fasta` format from your homework. Check out an example `fastq` format below which shows 3 entries:

```
$ zcat 250662/SRR578548_1.fastq.gz | head -n 12
@SRR578548.1 C0G7VACXX120322:8:2306:19628:74804 length=101
TAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGACCAAACCTTTC
+SRR578548.1 C0G7VACXX120322:8:2306:19628:74804 length=101
@88DD?DDFAA<C4:C>C@IE<C@F3+CF9CHHB9CDF@F<DF<*0)?(0?<4BC=FCEFIIIC=A)=CEE>?CA7?EE@?;?B@BBC;;5=9=?5?BA>A
@SRR578548.2 C0G7VACXX120322:8:2203:18331:38524 length=101
CCTTAAATTTTAAGTAAATGTTTAAAGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGACCAAACCT
+SRR578548.2 C0G7VACXX120322:8:2203:18331:38524 length=101
CCCFFFFFHHHHHJHIJJJJIJJIJJJJHJJJJJJJJJJJFHGIJJJJJIGIJJJJJJIIHIIIIJJJJJJJJHHHHHHHFFFFFFFEDDDDDDDDDDDDD
@SRR578548.3 C0G7VACXX120322:8:2206:17405:142266 length=101
CCTTAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGACCAAACCT
+SRR578548.3 C0G7VACXX120322:8:2206:17405:142266 length=101
CCCDFFFFGHHFFIHHIGJGJIIJIJJJGIHIJJFJJJIGHHIIGGGHJJ?BHHIJJJHH=FGIIIJJIJJGHHHHHHGFFFFFDDEEDD?BDDBDDDDDC
```

Notice that each read takes up 4 lines, where

```
@unique_identifier length=101
[genomic sequence]
+unique_identifier length=101
[ASCII-encoded PHRED quality score]
```
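
The four-line structure above can be read in Python by taking lines in groups of four. This is a minimal sketch, assuming every record is exactly four lines (no wrapped sequences). `read_fastq` and `open_fastq` are helper names made up for this example, and the two reads are invented so the code runs without the course data.

```python
import gzip
import io
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from an open text handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of file
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], sequence, quality


# Tiny made-up example with two reads
example = io.StringIO(
    "@read1 length=8\nACGTACGT\n+read1 length=8\nIIIIHHHH\n"
    "@read2 length=8\nTTTTCCCC\n+read2 length=8\nCCCFFFFF\n"
)
records = list(read_fastq(example))
print(records[0])
```

For a real file, `read_fastq(open_fastq("250662/SRR578548_1.fastq.gz"))` would iterate over the reads without decompressing the file to disk first.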

### Exercise 1

Read about the FASTQ format and PHRED quality scores in the link above. 

**Q: What probability of an incorrect base call does "A" correspond to for Illumina 1.8+?**

Hint 0: Remember logging numbers and using exponentiation?

Hint 1: you can use two asterisks "`**`" to exponentiate numbers in Python:

In [None]:
5**2

Hint 2: You can get the "ordinal number" of a character using the "`ord`" function in Python. Notice how the character "I" has the number 73 - that's the literal number that the character "I" is stored under in the computer. Subtracting the offset of 33 (the number for "!") gives a quality score of 40 in the Illumina 1.8+ scoring.

In [None]:
ord('I')

In [None]:
ord('!')

Use Python to write code to find the probability of an incorrect base call corresponding to the letter "A"

In [3]:
10**(-(ord("A") - ord("!"))/10)

0.000630957344480193

In [4]:
# checks that the last created output is equal to the following number
assert _ == 0.000630957344480193
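
The same calculation can be applied to a whole quality string rather than one character. This is a minimal sketch assuming the Phred+33 offset used by Illumina 1.8+; `error_probabilities` and `PHRED_OFFSET` are names made up for this example.

```python
PHRED_OFFSET = 33  # ASCII offset for the Sanger / Illumina 1.8+ encoding


def error_probabilities(quality, offset=PHRED_OFFSET):
    """Convert an ASCII-encoded quality string to per-base error probabilities."""
    return [10 ** (-(ord(char) - offset) / 10) for char in quality]


probs = error_probabilities("CCCFFFFFHHHHHJ")
print(max(probs))  # error probability of the least reliable base
print(sum(probs))  # expected number of wrong bases in this stretch
```

Summing the per-base probabilities gives a rough expected number of miscalled bases in a read, which is one way to compare the quality of two reads.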

## 7. STAR aligner program

Double-check that you can use the STAR aligner by typing `STAR -h`. This has a HUGE output so we'll just look at the top with `head`.

```
[ucsd-train01@tscc-0-5 ~]$ STAR -h | head
Usage: STAR  [options]... --genomeDir REFERENCE   --readFilesIn R1.fq R2.fq
Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2015

### versions
versionSTAR             020201
    int>0: STAR release numeric ID. Please do not change this value!
versionGenome           020101 020200
    int>0: oldest value of the Genome version compatible with this STAR release. Please do not change this value!

### Parameter Files
```

### Exercise 2: Read through the STAR parameters

Use `less -S` to scan through the STAR parameters. Remember the keyboard shortcuts:

* [Space] - advances to next page
* b - previous page
* q - exit
* h - help
* Arrow keys - advance up/down one row or left/right one column

**Q: What flag would you use to output the unmapped reads as a separate FASTQ file?** (1 sentence)

--outReadsUnmapped Fastx

## 8. GENCODE Gene annotations

### Exercise 3: Understanding annotations

We'll be using the [GENCODE M8](http://www.gencodegenes.org/mouse_releases/8.html) mouse annotation for this exercise. Read about the [GENCODE GTF Format](http://www.gencodegenes.org/data_format.html). Specifically, we'll be using the "basic" gene annotation and not the "comprehensive" one. Read about [tags](http://www.gencodegenes.org/gencode_tags.html).

**Q: When would you want to use the "comprehensive" annotation, and when would you want to use the "basic" annotation?** (2-3 sentences)

YOUR ANSWER HERE

## 9. Set up a project

For this project, which we will call `shalek2013`, create the following directories:

```
mkdir $HOME/projects
mkdir $HOME/projects/shalek2013
mkdir $HOME/projects/shalek2013/processing_scripts
```

Make directories in scratch:

```
mkdir $HOME/scratch/shalek2013
mkdir $HOME/scratch/shalek2013/processed_data
```

### Exercise 4: Create soft links

Now use `ln -s filename newplace` to create **soft links** (aka "shortcuts" or "pointers") from your `~/projects/shalek2013` directory to the raw data and to the `processed_data` folder in your `~/scratch/shalek2013/` directory:

```
ln -s /projects/ps-yeolab/biom262-2016/seqdata/shalek2013 $HOME/projects/shalek2013/raw_data
ln -s $HOME/scratch/shalek2013/processed_data $HOME/projects/shalek2013/processed_data
```

Note that the link name is `processed_data` with an underscore. With a space (`processed data`), `ln` treats `data` as a separate argument and the link isn't created as intended.

If you're seeing black and red links, you've done something wrong. Remove the `~/projects/shalek2013` directory (the `-r` is needed because it's a directory), recreate it and `processing_scripts` with `mkdir`, and start over.

```
rm -rf ~/projects/shalek2013
```

Then make sure your `~/projects/shalek2013` directory looks like this:

```
[ucsd-train01@tscc-0-62 ~]$ cd ~/projects/shalek2013
[ucsd-train01@tscc-0-62 shalek2013]$ ls -lha
total 10K
drwxr-xr-x 3 ucsd-train01 biom262-group  5 Jan 25 16:05 .
drwxr-xr-x 3 ucsd-train01 biom262-group  3 Jan 25 16:05 ..
lrwxrwxrwx 1 ucsd-train01 biom262-group 53 Jan 25 16:05 processed_data -> /home/ucsd-train01/scratch/shalek2013/processed_data/
drwxr-xr-x 2 ucsd-train01 biom262-group  2 Jan 25 16:05 processing_scripts
lrwxrwxrwx 1 ucsd-train01 biom262-group 47 Jan 25 16:05 raw_data -> /projects/ps-yeolab/biom262-2016/seqdata/shalek2013
```



## 10. Align reads with `STAR`


Importantly, the help output of "`STAR -h`" points you to the manual at the end:

```
[ucsd-train01@tscc-0-5 ~]$ STAR -h | tail
    string: 2-pass mapping mode.
                            None        ... 1-pass mapping
                            Basic       ... basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly

twopass1readsN              -1
    int: number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step.

For more details see:
<https://github.com/alexdobin/STAR>
<https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>
```

We've prepared a subset of the data from Shalek 2013 which contains only the data from chromosome 11 (both *Irgm1* and *Irf7* from the paper are on chromosome 11). We'll use the default parameters for mapping except for:

* `--runThreadN` parameter which we'll set to 16, as we'll request 16 processors.
* `--genomeDir /projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/star/` - How to specify the genome we are using
* `--outSAMtype BAM SortedByCoordinate` - Optional: makes STAR write a coordinate-sorted BAM binary file instead of a SAM file, which would save us 3 extra steps later. The command in Exercise 5 leaves it out, so there STAR writes a SAM file that we convert with `samtools` in section 13.
* `--readFilesIn $HOME/projects/shalek2013/raw_data/S10_R1.fastq.gz $HOME/projects/shalek2013/raw_data/S10_R2.fastq.gz` - the sequencing read files we'll use, referencing our soft links of the files.
* `--outFileNamePrefix $HOME/projects/shalek2013/processed_data/S10.` - Specify output directory and beginning of the filename (the dot is important - this way the output will be `$HOME/projects/shalek2013/processed_data/S10.Aligned.out.sam` rather than `$HOME/projects/shalek2013/processed_data/S10Aligned.out.sam`, and I like the separation you get with the ".")
* `--readFilesCommand zcat` - because our FASTQ files are compressed using `gzip` and the `zcat` command decompresses them to standard out (could also say `--readFilesCommand gunzip -c`)

### Exercise 5: Submit a script to run STAR on "S10"

[Write a PBS script](http://www.sdsc.edu/support/user_guides/tscc-quick-start.html), put it in "`~/projects/shalek2013/processing_scripts`" and call it "`s10_align.sh`" to run the STAR aligner. Request one node and 16 processors with "`-l nodes=1:ppn=16`" and request the time to be 30 minutes. Below is the alignment command to put in your script:

```
STAR --runThreadN 16 \
    --genomeDir \
    /projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/star/ \
    --readFilesIn \
    $HOME/projects/shalek2013/raw_data/S10_R1.fastq.gz \
    $HOME/projects/shalek2013/raw_data/S10_R2.fastq.gz \
    --outFileNamePrefix $HOME/projects/shalek2013/processed_data/S10. \
    --readFilesCommand zcat
```
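
If you end up writing several submitter scripts, the header can be generated rather than retyped. This is a minimal Python sketch that only builds the script text. `pbs_script` is a helper name made up for this example. It uses the standard PBS/Torque directives `-N`, `-l`, `-o` and `-e`, and leaves out any queue or account options TSCC may require (see the quick-start guide linked above).

```python
def pbs_script(name, command, nodes=1, ppn=16, walltime="00:30:00"):
    """Return the text of a minimal PBS/Torque submitter script."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {name}",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        f"#PBS -l walltime={walltime}",
        f"#PBS -o {name}.sh.out",  # where stdout goes
        f"#PBS -e {name}.sh.err",  # where stderr goes
        "",
        command,
        "",
    ]
    return "\n".join(lines)


# Placeholder command; paste the STAR command from above in its place
script = pbs_script("s10_align", 'echo "alignment command goes here"')
print(script)
```

Writing `script` to `~/projects/shalek2013/processing_scripts/s10_align.sh` and submitting it with `qsub` would then follow the same steps as a hand-written script.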


Remember you can use `qdel` to kill jobs you want to stop:

```
[ucsd-train01@tscc-login2 processing_scripts]$ qdel 4251332
```


## 11. Directly quasi-map and count features with `kallisto`

Now we'll use `kallisto` to directly quantify gene expression. We've already created an `index` for you, so use that. The parameters we'll be using are:

* `--index /projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/gencode.vM8.pc_transcripts.kallisto` - kallisto index created from **protein-coding** transcripts from GENCODE
* `--output-dir ~/projects/shalek2013/processed_data/S10_kallisto` - The specific folder where we'll be outputting the kallisto results

### Exercise 6: read `kallisto` help

`kallisto`, like `git`, is the name of the main program and you have to specify the sub-program to run. When you say `kallisto -h`, it tells you about these sub-programs:

```
[ucsd-train01@tscc-login2 ~]$ kallisto -h
Error: invalid command -h
kallisto 0.42.4

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    h5dump        Converts HDF5-formatted results to plaintext
    version       Prints version information

Running kallisto <CMD> without arguments prints usage information for <CMD>

```


Today, we'll be using `kallisto quant` on paired-end reads. **Q: How would you use `kallisto quant` for single-end reads?** You'll need to check the [manual](https://pachterlab.github.io/kallisto/manual.html) as well as the help documentation.

--single --fragment-length=200.1

### Exercise 7: Submit a kallisto script

We're running `kallisto quant` on paired-end data. **Write a submitter script in your `~/projects/shalek2013/processing_scripts` directory called `s10_kallisto.sh` with the following command**. Request 8 processors (threads) with `-l nodes=1:ppn=8` and 30 minutes.

```
kallisto quant \
    --index \
    /projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/gencode.vM8.pc_transcripts.kallisto \
    --threads 8 --output-dir \
    ~/projects/shalek2013/processed_data/S10_kallisto \
    [S10 R1 fastq file] [S10 R2 fastq file]
```




## 12. Look at the output of `STAR`

At this point, the `STAR` job should be done and we can take a look at the output. Your script should have created something like this in your `~/projects/shalek2013/processed_data` directory:

```
[ucsd-train01@tscc-login2 ~]$ cd ~/projects/shalek2013/processed_data
[ucsd-train01@tscc-login2 processed_data]$ ll
total 32
-rw-r--r-- 1 ucsd-train01 biom262-group 31881871283 Jan 26 07:15 S10.Aligned.out.sam
-rw-r--r-- 1 ucsd-train01 biom262-group        1870 Jan 26 07:16 S10.Log.final.out
-rw-r--r-- 1 ucsd-train01 biom262-group       19784 Jan 26 07:16 S10.Log.out
-rw-r--r-- 1 ucsd-train01 biom262-group        1426 Jan 26 07:16 S10.Log.progress.out
-rw-r--r-- 1 ucsd-train01 biom262-group     1008726 Jan 26 07:16 S10.SJ.out.tab
```

### Exercise 8: Look at the outputs of STAR

Read about the [SAM](https://samtools.github.io/hts-specs/SAMv1.pdf) format and check out [this](https://broadinstitute.github.io/picard/explain-flags.html) helpful website for explaining SAM flags.

1. Which file would you use to get the percentage of mapped reads from your data?
2. Which file would you send to someone when the alignment didn't go as expected and they asked for the parameters you used?

To view the SAM file, you'll want to use `samtools`. Viewing directly with `head` gives you the "header" of the SAM file which has a bunch of parameters about the genome and the program that was used to align:

```
[ucsd-train01@tscc-login2 processed_data]$ head S10.Aligned.out.sam
@HD     VN:1.4
@SQ     SN:chr1 LN:195471971
@SQ     SN:chr2 LN:182113224
@SQ     SN:chr3 LN:160039680
@SQ     SN:chr4 LN:156508116
@SQ     SN:chr5 LN:151834684
@SQ     SN:chr6 LN:149736546
@SQ     SN:chr7 LN:145441459
@SQ     SN:chr8 LN:129401213
@SQ     SN:chr9 LN:124595110
```

Instead, use `samtools view` and pipe that output to `head`:

```
[ucsd-train01@tscc-login2 processed_data]$ samtools view S10.Aligned.out.sam | head
SRR578577.310219        163     chr1    24613435        3       101M    =       24613625        291     TGAAGCTTGGAGGATGGTGAAGTAAAGTCCTAGTATAATGGTAATTAGTAGGGCTTGATTTATGTGGTTTCGTTTACCTTCTATAAGGCTATGATGAGCTC     @@@DFFFFHHDHHGGHI:FHGIGI<HHIHIIIJHGIGGIIJ?DEHHGIGABDGHIIHDHHIGJJHIICFH@DGIHCHHHFDFFFFFEEDEEEDDEDEDDDD   NH:i:2  HI:i:1  AS:i:200        nM:i:0
SRR578577.310219        83      chr1    24613625        3       101M    =       24613435        -291    CAGCAGCCTCCTAGATCATGTGTTGGTACGAGGCTAGAATGATAGAACGCTCAGAAGAATCCTGCAAAGAAAAATACTTCCGAGACGATGAATAGAATTAT     DDDDBDDDCCDDEDDDCEDDDFFCB?;HFGIIHIHGCIGHFJIIGGFGHHFBIGJJIIGDIIHFFIJIGJJIIHFBIIFJJJJJJJIJHHFHHFFFFF@C@   NH:i:2  HI:i:1  AS:i:200        nM:i:0
SRR578577.310219        355     chrM    8853    3       101M    =       9043    291     ATAATTCTATTCATCGTCTCGGAAGTATTTTTCTTTGCAGGATTCTTCTGAGCGTTCTATCATTCTAGCCTCGTACCAACACATGATCTAGGAGGCTGCTG     @C@FFFFFHHFHHJIJJJJJJJFIIBFHIIJJGIJIFFHIIDGIIJJGIBFHHGFGGIIJFHGICGHIHIIGFH;?BCFFDDDECDDDEDDCCDDDBDDDD   NH:i:2  HI:i:2  AS:i:200        nM:i:0
SRR578577.310219        403     chrM    9043    3       101M    =       8853    -291    GAGCTCATCATAGCCTTATAGAAGGTAAACGAAACCACATAAATCAAGCCCTACTAATTACCATTATACTAGGACTTTACTTCACCATCCTCCAAGCTTCA     DDDDEDEDDEEEDEEFFFFFDFHHHCHIGD@HFCIIHJJGIHHDHIIHGDBAGIGHHED?JIIGGIGHJIIIHIHH<IGIGHF:IHGGHHDHHFFFFD@@@   NH:i:2  HI:i:2  AS:i:200        nM:i:0
SRR578577.310220        99      chr1    24613513        3       101M    =       24613625        213     TTCTATAAGGCTATGATGAGCTCATGTAATTGAAACACCTGATGCTAGAAGTACTGAAGTATTAAGTAGTGGGACTTCTAGAGGGTTAAGTGGTGAAATTC     CCCFFFFFHHHHHJJJJJJJJIJJIIJIIJJJJJJJIJJJJJJJJJJJJJJGHIJJJGHGGHIJJJGIIIIJJIJEHIJIIHHHHDDFFFDEECCEDDCDD   NH:i:2  HI:i:1  AS:i:198        nM:i:1
SRR578577.310220        147     chr1    24613625        3       101M    =       24613513        -213    CAGCAGCCTCCTAGATCATGTGTTGGTACGAGGCTAGAATGATAGAACGTTCAGAAGAATCCTGCAAAGAAAAATACTTCCGAGACGATGAATAGAATTAT     DCDDDDDCDDDDEEEECEDFFFFHHHFGJJIHGGIIIIJJJJJJIHHIIHJIJJJIJJIIJIJIJJIIIJJIIEC<IHBJJJIGHJJJHHHHHFFFFFCCC   NH:i:2  HI:i:1  AS:i:198        nM:i:1
SRR578577.310220        419     chrM    8853    3       101M    =       8965    213     ATAATTCTATTCATCGTCTCGGAAGTATTTTTCTTTGCAGGATTCTTCTGAACGTTCTATCATTCTAGCCTCGTACCAACACATGATCTAGGAGGCTGCTG     CCCFFFFFHHHHHJJJHGIJJJBHI<CEIIJJIIIJJIJIJIIJJIJJJIJHIIHHIJJJJJJIIIIGGHIJJGFHHHFFFFDECEEEEDDDDCDDDDDCD   NH:i:2  HI:i:2  AS:i:198        nM:i:1
SRR578577.310220        339     chrM    8965    3       101M    =       8853    -213    GAATTTCACCACTTAACCCTCTAGAAGTCCCACTACTTAATACTTCAGTACTTCTAGCATCAGGTGTTTCAATTACATGAGCTCATCATAGCCTTATAGAA     DDCDDECCEEDFFFDDHHHHIIJIHEJIJJIIIIGJJJIHGGHGJJJIHGJJJJJJJJJJJJJJIJJJJJJJIIJIIJJIJJJJJJJJHHHHHFFFFFCCC   NH:i:2  HI:i:2  AS:i:198        nM:i:1
SRR578577.310221        99      chr1    24613349        3       101M    =       24613625        377     ACATGGAGTCCATGGAATCCAGTATCCATGAAGAATGTAGAACCATAGATACCATCTGAAATGGAGAATGACGTTTCAAAGTATTCTGAAGCTTGGAGGAT     @@@D7DDDHHADHB9?FBF<C?C:FHAHEEHGGGGIGEHFEHFGGBFGGFHIHIIEGIAH:B==DGIGAD=@D@;==A=AE7?B>A;@@BACCCCB3;55;   NH:i:2  HI:i:1  AS:i:194        nM:i:3
SRR578577.310221        147     chr1    24613625        3       101M    =       24613349        -377    CAGCAGCCGCCTAGATCATGTGTTGGTACGAGGCTAGAATGATAGAACGCTCAGAAGAATCCTGCAAAGAAAAATACTTCCGAGACGATGAATAGAATTAT     ##########CA@55;;@9A=;73;.(E6GGHEDG@@F:FAHCIGB@@<?9*?0?>DDDDAC9B9IGG@IGCA+2<@F:GBC;A23F@?DFD?BABB17?@   NH:i:2  HI:i:1  AS:i:194        nM:i:3
```

1. S10.Log.final.out
2. S10.Log.out
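
The second column of each alignment line above (163, 83, 355, ...) is the SAM flag, a sum of bits defined in the SAM specification. The Picard website linked above decodes them interactively. The same decoding can be sketched in Python. The bit values below come from the SAM specification, while the short names and `decode_flag` are made up for this example.

```python
# (bit, short name) pairs following the FLAG table in the SAM specification
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


for flag in (163, 83, 355, 403):
    print(flag, decode_flag(flag))
```

Flags 355 and 403 include the `secondary` bit, which fits with those reads mapping to two places (`NH:i:2`) in the output above.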

## 13. Read help for sort, index, make BAM using `samtools`

Most programs downstream of alignment prefer `bam` over `sam` as their input format because it's more compressed, and because a `bam` file can be indexed so you can quickly jump to a particular region instead of having to read the entire file to get to the few lines that you need.

To do this, we'll use the `samtools` that we installed (note that we also could have used `module load biotools`, but the goal was to show you how these tools can be installed as well as used).

### Exercise 9: Read help functions of `samtools view` and `sort`

Read through the documentation of `samtools view` and `samtools sort`. Despite the name, most `samtools` functions operate on BAM files - the name is a historical artefact of the fact that SAM was invented first.

1. Which flags do you use with `samtools view` to indicate that the input is SAM, not BAM?
2. Which flags do you use with `samtools view` to output BAM, not SAM?
3. Which flags do you use with `samtools sort` to indicate how many threads/processors to use?

1. -S
2. -b
3. -@ 4

### Exercise 10:  Write a script to sort and index bam file

Write a TSCC submitter script that views, sorts, and indexes the SAM file so that at the end you have a sorted, indexed bam file. **Request 1 hour of time.**

The backslash "`\`" indicates that the command will continue on the next line. It doesn't matter to the computer, but it's for human readability.

**Use at most 8 threads/processors for `samtools sort`** so your job gets scheduled quickly.

#### Note: `samtools sort` must output to a file otherwise your `.out` file will have the bam in it

```
samtools view [flags from exercise 9.1, 9.2] $HOME/projects/shalek2013/processed_data/S10.Aligned.out.sam \
    > $HOME/projects/shalek2013/processed_data/S10.Aligned.out.bam
samtools sort [flags from exercise 9.3] $HOME/projects/shalek2013/processed_data/S10.Aligned.out.bam \
    > $HOME/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam
samtools index $HOME/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam
```

## 14. Look at the output of `kallisto`
The `kallisto` quantification should be done now.

Your script created a stderr file and a stdout file, which should look like this:

```
[ucsd-train01@tscc-login2 ~]$ cd ~/projects/shalek2013/processing_scripts
[ucsd-train01@tscc-login2 processing_scripts]$ cat S10_kallisto.sh.out 
Nodes:        tscc-0-25
[ucsd-train01@tscc-login2 processing_scripts]$ cat S10_kallisto.sh.err 

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 56,504
[index] number of k-mers: 67,545,152
[index] number of equivalence classes: 163,664
[quant] running in paired-end mode
[quant] will process pair 1: /home/ucsd-train01/scratch/shalek2013/raw_data/S10_R1.fastq.gz
                             /home/ucsd-train01/scratch/shalek2013/raw_data/S10_R2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 58,872,578 reads, 33,014,674 reads pseudoaligned
[quant] estimated average fragment length: 242.629
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,136 rounds
```

Now let's look at the actual output folder. There are three files here.

```
[ucsd-train01@tscc-login2 ~]$ cd ~/projects/shalek2013/processed_data
[ucsd-train01@tscc-login2 processed_data]$ ll S10_kallisto/
total 10792
-rw-r--r-- 1 ucsd-train01 biom262-group 2331756 Jan 26 03:16 abundance.h5
-rw-r--r-- 1 ucsd-train01 biom262-group 8710545 Jan 26 03:16 abundance.tsv
-rw-r--r-- 1 ucsd-train01 biom262-group     500 Jan 26 03:16 run_info.json
```

* `abundance.h5` - binary file stored in [hdf5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) format which is a compressed way to store data (notice it's 1/4 the size of `abundance.tsv`)
* `abundance.tsv` - basic tab-separated text file of the quantifications
* `run_info.json` - [javascript object notation](https://en.wikipedia.org/wiki/JSON) (JSON) format file describing the parameters of the `kallisto` run


### Exercise 11: Reading `kallisto` output

Use `head` and `tail` to look at the tops and bottoms of each of the output files. Which files would you use for reading the quantification into R or other programs? What would you use the `json` file for, vs. the `tsv` vs. the `h5` file? (Hint: The `h5` file is for use with their own analysis software, [`sleuth`](http://pachterlab.github.io/sleuth/))

YOUR ANSWER HERE
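
Once you've looked at the files by hand, the text table can also be loaded in Python. This is a minimal sketch assuming the usual `abundance.tsv` columns (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`). `read_abundance` is a helper name made up for this example, and the two rows are invented so the code runs without the course data.

```python
import csv
import io


def read_abundance(handle):
    """Return {target_id: tpm} from an open kallisto abundance.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


# Made-up rows in the abundance.tsv column layout
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "transcript_a\t1000\t800\t40\t250000\n"
    "transcript_b\t2000\t1800\t270\t750000\n"
)
tpm = read_abundance(example)
expressed = [target for target, value in tpm.items() if value > 0]
print(len(expressed), "of", len(tpm), "targets have non-zero TPM")
```

For the real output, pass `open("S10_kallisto/abundance.tsv")` instead of the in-memory example.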


## 15. Quantify gene expression with `featureCounts`

### Exercise 12: Read `featureCounts` help

Read the output of `featureCounts -h`. **Q: Which flags do you need to use to (1) do strand-specific counting, (2) count only pairs where both reads were successfully mapped, and (3) use only the primary alignment of a multi-mapped read?**

1. -s (takes a value: `-s 1` for stranded, `-s 2` for reversely stranded)
2. -B
3. --primary

### Exercise 13: Run `featureCounts`

To speed things up, we've created a subset of the Gencode annotation that's only chr11. Make sure to use that one. Now run the following command, replacing the flags and filename from the exercises as indicated. We'll use the annotation stored in `/projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/gencode.vM8.basic.annotation.gtf`.

Write a submitter script called `s10_featurecounts.sh` and put it in `~/projects/shalek2013/processing_scripts`. Besides the flags you found, we'll also use:

* `-T 8` - specifies to use 8 threads. (in the [Subread](http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf) documentation)
* `-a /projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/gencode.vM8.basic.annotation.gtf` - Genome annotation from GENCODE.
* `-o ~/projects/shalek2013/processed_data/s10_featureCounts.txt` - Output file

Here's the actual command to run. Request 10 minutes and 8 processors.

```
featureCounts -T 8 \
    -s 1 -B --primary \
    -a /projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/gencode.vM8.basic.annotation.gtf \
    -o ~/projects/shalek2013/processed_data/s10_featureCounts.txt \
    [sorted, indexed bam file from exercise 10]
```

#### Psssst... compare your `bam` and `sam` file sizes:

```
[ucsd-train01@tscc-login2 processing_scripts]$ cd ~/projects/shalek2013/processed_data
[ucsd-train01@tscc-login2 processed_data]$ ls -lha S10*
-rw-r--r-- 1 ucsd-train01 biom262-group 6.9G Feb  2 11:37 S10.Aligned.out.bam
-rw-r--r-- 1 ucsd-train01 biom262-group  30G Jan 28 10:20 S10.Aligned.out.sam
-rw-r--r-- 1 ucsd-train01 biom262-group 4.0G Feb  2 12:02 S10.Aligned.out.sorted.bam
-rw-r--r-- 1 ucsd-train01 biom262-group 2.6M Feb  2 12:04 S10.Aligned.out.sorted.bam.bai
-rw-r--r-- 1 ucsd-train01 biom262-group 1.9K Jan 28 10:20 S10.Log.final.out
-rw-r--r-- 1 ucsd-train01 biom262-group  20K Jan 28 10:20 S10.Log.out
-rw-r--r-- 1 ucsd-train01 biom262-group 1.3K Jan 28 10:20 S10.Log.progress.out
-rw-r--r-- 1 ucsd-train01 biom262-group 986K Jan 28 10:20 S10.SJ.out.tab

S10_kallisto:
total 11M
drwxr-xr-x 2 ucsd-train01 biom262-group 4.0K Jan 28 10:43 .
drwxr-xr-x 4 ucsd-train01 biom262-group 4.0K Feb  2 12:07 ..
-rw-r--r-- 1 ucsd-train01 biom262-group 2.3M Jan 28 10:43 abundance.h5
-rw-r--r-- 1 ucsd-train01 biom262-group 8.4M Jan 28 10:43 abundance.tsv
-rw-r--r-- 1 ucsd-train01 biom262-group  514 Jan 28 10:43 run_info.json
```

30G (sam) vs 6.9G (bam) vs 4.0G (sorted.bam) is a huge space savings!! The sorted one is even smaller because similar sequences end up next to each other, so it compresses better.

## 16. Read `featureCounts` output

There should be two files created from the run of `featureCounts`:

```
[ucsd-train01@tscc-login2 ~]$ ls -lha ~/projects/shalek2013/processed_data/s10_featureCounts.txt*
-rw-r--r-- 1 ucsd-train01 biom262-group 7.7M Feb  2 12:07 /home/ucsd-train01/projects/shalek2013/processed_data/s10_featureCounts.txt
-rw-r--r-- 1 ucsd-train01 biom262-group  369 Feb  2 12:07 /home/ucsd-train01/projects/shalek2013/processed_data/s10_featureCounts.txt.summary
```

### Exercise 14: Compare `featureCounts` and `kallisto` output

**Q: How does `s10_featureCounts.txt` differ from `S10_kallisto/abundance.tsv` in the information it shows?**

YOUR ANSWER HERE
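
To put a `featureCounts` table on a scale comparable with a TPM column, the counts can be divided by feature length and rescaled to sum to one million. This is a minimal sketch assuming the usual `featureCounts` layout: a `#` comment line, then a header with `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length` and one count column per BAM file. `read_featurecounts` and `counts_to_tpm` are names made up for this example, and the two genes are invented.

```python
import csv
import io


def read_featurecounts(handle):
    """Return {gene_id: (length, count)} from a single-sample featureCounts table."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    length_col = header.index("Length")
    genes = {}
    for fields in reader:
        # with a single BAM file, its count is the last column
        genes[fields[0]] = (int(fields[length_col]), int(fields[-1]))
    return genes


def counts_to_tpm(genes):
    """Length-normalize counts and scale them to sum to one million."""
    rates = {gene: count / length for gene, (length, count) in genes.items()}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in rates}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Made-up table in the featureCounts layout
example = io.StringIO(
    "# Program:featureCounts (made-up example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tS10.bam\n"
    "gene_a\tchr11\t100\t1099\t+\t1000\t100\n"
    "gene_b\tchr11\t5000\t6999\t-\t2000\t100\n"
)
genes = read_featurecounts(example)
gene_tpm = counts_to_tpm(genes)
print(gene_tpm)
```

Both genes have the same count here, but the shorter one gets twice the TPM because the same number of reads covers half the length.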

## 17. Additional reading

* [SAM Format specification](https://samtools.github.io/hts-specs/SAMv1.pdf)