# Repeating Kallisto and DESeq2

In my previous [DESeq2 analyses](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/notebooks/2016-11-11-oly-gonad-OA-part4-differential-expression.ipynb), I did not find any significantly differentially expressed genes in my pairwise comparisons. This could be because the fragment length standard deviation I supplied to [`kallisto quant`](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/notebooks/2016-11-04-oly-gonad-OA-part3-kallisto.ipynb) was too high. I will now rerun `kallisto quant` with a lower standard deviation and see whether that affects my DESeq2 results.

## Rerun `kallisto quant`

### 1. Create a database

Similar to running a blastx, I first need to create a database (a kallisto index). I will use the same index I created in my previous attempt with `kallisto`. The code I used is as follows:

1. Define the program, `kallisto index`
2. `-i` indicates the name for the new index
3. FASTA file to be used to create the index

In [None]:
!/Applications/kallisto/kallisto index \
-i /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/kallisto-index-OlyO-v6 \
/Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/OlyO_v6_transcriptome.fa

I can now use this index to run four separate commands, one per library, to quantify my reads against the larger transcriptome. The code is as follows:

1. Define the program, `kallisto quant`
2. `-i` indicates which index to use
3. `-o` tells the program where to write the output
4. `--single` allows me to process single-end reads
5. `-l` estimated average fragment length from [FastQC output](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/notebooks/2016-10-19-oly-gonad-OA-part-1-FASTQC-results.ipynb)
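6. `-s` estimated standard deviation of fragment length. I changed this from 20% to 10%, entered as `.10`. Note that kallisto reads this value in the same units as `-l`, which is why the log output below reports `sd = 0.1`.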
7. fastq file to be used
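The seven pieces listed above can also be assembled programmatically, which avoids retyping the same long paths four times. This is only a minimal sketch: `KALLISTO`, `PROJECT`, `SAMPLES`, and `quant_command` are hypothetical names I introduce for illustration, and `PROJECT` is a placeholder to be pointed at the real repository folder. The block only builds and prints the commands; it does not run kallisto.

```python
from pathlib import Path

# Hypothetical placeholders: adjust to the real kallisto binary and repository folder.
KALLISTO = "/Applications/kallisto/kallisto"
PROJECT = Path("yaaminiv-fish546-2016")

# Output folder name -> FASTQ file, taken from the four runs in this notebook.
SAMPLES = {
    "11-29-kallisto-female-106": "filtered_106A_Female_Mix_GATCAG_L004_R1.fastq",
    "11-29-kallisto-male-106": "filtered_106A_Male_Mix_TAGCTT_L004_R1.fastq",
    "11-29-kallisto-female-108": "filtered_108A_Female_Mix_GGCTAC_L004_R1.fastq",
    "11-29-kallisto-male-108": "filtered_108A_Male_Mix_AGTCAA_L004_R1.fastq",
}


def quant_command(out_name, fastq, mean_len=76, sd=0.10):
    """Return the argument list for one single-end `kallisto quant` run."""
    return [
        KALLISTO, "quant",
        "-i", str(PROJECT / "data" / "kallisto-index-OlyO-v6"),  # index to use
        "-o", str(PROJECT / "analyses" / out_name),              # output folder
        "--single",                                              # single-end reads
        "-l", str(mean_len),                                     # mean fragment length
        "-s", str(sd),                                           # fragment length sd
        str(PROJECT / "data" / fastq),                           # reads to quantify
    ]


commands = [quant_command(out, fq) for out, fq in SAMPLES.items()]
for cmd in commands:
    print(" ".join(cmd))
    # To actually run a command: subprocess.run(cmd, check=True)
```

Because `sd` is a parameter, a different value can be passed without editing the command itself. For example, if the intent is a standard deviation equal to 10% of the 76-base mean, that would be `sd=7.6` rather than `0.10`.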

### 2. **filtered_106A_Female_Mix_GATCAG_L004_R1.fastq**

In [2]:
!/Applications/kallisto/kallisto quant \
-i /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/kallisto-index-OlyO-v6 \
-o /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/analyses/11-29-kallisto-female-106 \
--single \
-l 76 \
-s .10 \
/Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_106A_Female_Mix_GATCAG_L004_R1.fastq


[quant] fragment length distribution is truncated gaussian with mean = 76, sd = 0.1
[index] k-mer length: 31
[index] number of targets: 148,557
[index] number of k-mers: 74,111,966
[index] number of equivalence classes: 349,214
[quant] running in single-end mode
[quant] will process file 1: /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_106A_Female_Mix_GATCAG_L004_R1.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 39,823,239 reads, 34,794,164 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,194 rounds



[`kallisto quant` analysis results](https://github.com/yaaminiv/yaaminiv-fish546-2016/tree/master/analyses/11-29-kallisto-female-106)

### 3. **filtered_106A_Male_Mix_TAGCTT_L004_R1.fastq**

In [3]:
!/Applications/kallisto/kallisto quant \
-i /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/kallisto-index-OlyO-v6 \
-o /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/analyses/11-29-kallisto-male-106 \
--single \
-l 76 \
-s .10 \
/Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_106A_Male_Mix_TAGCTT_L004_R1.fastq


[quant] fragment length distribution is truncated gaussian with mean = 76, sd = 0.1
[index] k-mer length: 31
[index] number of targets: 148,557
[index] number of k-mers: 74,111,966
[index] number of equivalence classes: 349,214
[quant] running in single-end mode
[quant] will process file 1: /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_106A_Male_Mix_TAGCTT_L004_R1.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 59,446,949 reads, 54,059,394 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,141 rounds



[`kallisto quant` analysis results](https://github.com/yaaminiv/yaaminiv-fish546-2016/tree/master/analyses/11-29-kallisto-male-106)

### 4. **filtered_108A_Female_Mix_GGCTAC_L004_R1.fastq**

In [4]:
!/Applications/kallisto/kallisto quant \
-i /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/kallisto-index-OlyO-v6 \
-o /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/analyses/11-29-kallisto-female-108 \
--single \
-l 76 \
-s .10 \
/Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_108A_Female_Mix_GGCTAC_L004_R1.fastq


[quant] fragment length distribution is truncated gaussian with mean = 76, sd = 0.1
[index] k-mer length: 31
[index] number of targets: 148,557
[index] number of k-mers: 74,111,966
[index] number of equivalence classes: 349,214
[quant] running in single-end mode
[quant] will process file 1: /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_108A_Female_Mix_GGCTAC_L004_R1.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 45,936,627 reads, 41,710,716 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,171 rounds



[`kallisto quant` analysis results](https://github.com/yaaminiv/yaaminiv-fish546-2016/tree/master/analyses/11-29-kallisto-female-108)

### 5. **filtered_108A_Male_Mix_AGTCAA_L004_R1.fastq**

In [5]:
!/Applications/kallisto/kallisto quant \
-i /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/kallisto-index-OlyO-v6 \
-o /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/analyses/11-29-kallisto-male-108 \
--single \
-l 76 \
-s .10 \
/Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_108A_Male_Mix_AGTCAA_L004_R1.fastq


[quant] fragment length distribution is truncated gaussian with mean = 76, sd = 0.1
[index] k-mer length: 31
[index] number of targets: 148,557
[index] number of k-mers: 74,111,966
[index] number of equivalence classes: 349,214
[quant] running in single-end mode
[quant] will process file 1: /Users/yaaminivenkataraman/Documents/School/Year1/FISH-546/yaaminiv-fish546-2016/data/filtered_108A_Male_Mix_AGTCAA_L004_R1.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 55,791,565 reads, 50,931,304 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,285 rounds



[`kallisto quant` analysis results](https://github.com/yaaminiv/yaaminiv-fish546-2016/tree/master/analyses/11-29-kallisto-male-108)
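As a quick sanity check on the four runs, the `processed` and `pseudoaligned` read counts from the logs above can be turned into pseudoalignment rates. This is a small sketch of that arithmetic; the sample labels and the `pseudoalignment_rate` helper are names I made up for illustration, and the numbers are copied directly from the log output.

```python
# (reads processed, reads pseudoaligned), copied from the four kallisto logs above.
LOG_COUNTS = {
    "female-106": (39_823_239, 34_794_164),
    "male-106": (59_446_949, 54_059_394),
    "female-108": (45_936_627, 41_710_716),
    "male-108": (55_791_565, 50_931_304),
}


def pseudoalignment_rate(processed, pseudoaligned):
    """Fraction of processed reads that kallisto pseudoaligned."""
    return pseudoaligned / processed


rates = {
    sample: pseudoalignment_rate(processed, pseudoaligned)
    for sample, (processed, pseudoaligned) in LOG_COUNTS.items()
}

for sample, rate in rates.items():
    print(f"{sample}: {rate:.1%}")
```

All four libraries come out between about 87% and 91%, so the lack of differential expression is unlikely to be explained by reads failing to map to the transcriptome.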

## Rerun DESeq2

Now that I have my revised count data, I can use [DESeq2](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/tutorials/DESeq2-tutorial/2016-10-26-DESeq2-Tutorial-Part-2.ipynb) to analyze differential expression. The process I used was adapted from [this notebook](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/notebooks/2016-11-11-oly-gonad-OA-part4-differential-expression.ipynb).

### 1. Reformat `kallisto quant` output files

#### Convert to a `.txt` file

See [previous notebook](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/notebooks/2016-11-11-oly-gonad-OA-part4-differential-expression.ipynb). 

#### Merge count data files

See [previous notebook](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/notebooks/2016-11-11-oly-gonad-OA-part4-differential-expression.ipynb).
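The actual reformatting and merging steps are in the previous notebook. For readers who do not want to follow the link, here is a rough sketch of one way the merge could be done. It is not necessarily how I did it there. It assumes each `kallisto quant` output folder contains the standard `abundance.tsv` file with `target_id` and `est_counts` columns, and it rounds the estimated counts because DESeq2 expects integers. The function names and the example folder labels are hypothetical.

```python
import csv
from pathlib import Path


def read_est_counts(abundance_path):
    """Map target_id -> rounded est_counts from one kallisto abundance.tsv."""
    with open(abundance_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["target_id"]: round(float(row["est_counts"])) for row in reader}


def merge_counts(sample_dirs, out_path):
    """Write a tab-delimited table: one row per transcript, one column per sample.

    sample_dirs maps a column label to a kallisto output folder.
    Returns the number of transcripts written.
    """
    per_sample = {
        name: read_est_counts(Path(folder) / "abundance.tsv")
        for name, folder in sample_dirs.items()
    }
    names = list(per_sample)
    # All runs used the same index, so the first sample's targets cover every sample.
    targets = list(per_sample[names[0]])
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["target_id"] + names)
        for target in targets:
            writer.writerow([target] + [per_sample[n].get(target, 0) for n in names])
    return len(targets)


# Example call (hypothetical labels and relative paths):
# merge_counts(
#     {"Female_106": "analyses/11-29-kallisto-female-106",
#      "Male_106": "analyses/11-29-kallisto-male-106"},
#     "merged-count-data.txt",
# )
```

Keeping the sample labels as column headers means the merged file can be read into R with the transcript IDs as row names.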

### 2. Use DESeq2 in R

I am now ready to complete my analyses in R.

The first thing I did was use DESeq2 to identify differentially expressed genes across all treatments, using the same two p-value cutoffs as in my previous DESeq2 analysis. I still found only a small number of genes flagged as differentially expressed across all treatments at a cutoff of 0.05.

p-value = 0.05

[Count data](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/data/2016-11-29-count-data/2016-11-29-oly-gonad-oa-count-data.txt)

[R Script](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/2016-11-29-alltreatments-DESeq2.R)

[List of differentially expressed genes](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/alltreatments_DEG.tab)

![all treatments](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/analyses/11-29-oly-oa-gonad-DESeq2/alltreatments.png)

p-value = 0.5

[R Script (p = 0.5)](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/2016-11-29-alltreatments-DESeq2.R)

[List of differentially expressed genes (p = 0.5)](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/alltreatments-p0.5-_DEG.tab)

![all treatments (p = 0.5)](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/analyses/11-29-oly-oa-gonad-DESeq2/alltreatments-p0.5-.png)

For all of my pairwise comparisons, I once again found no significantly differentially expressed genes.

#### Control vs. Ocean Acidification conditions

**Female_106 vs. Female_108**

[Count data](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/data/2016-11-29-count-data/2016-11-29-oly-gonad-oa-count-data-female106-female108.txt)

[R Script](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/2016-11-29-female106-female108-DESeq2.R)

![female-106 vs. female-108](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/analyses/11-29-oly-oa-gonad-DESeq2/female106-female108.png)

**Male_106 vs. Male_108**

[Count data](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/data/2016-11-29-count-data/2016-11-29-oly-gonad-oa-count-data-male106-male108.txt)

[R Script](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/2016-11-29-male106-male108-DESeq2.R)

![male-106 vs. male-108](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/analyses/11-29-oly-oa-gonad-DESeq2/male106-male108.png)

#### Female vs. Male Gonads

**Female_106 vs. Male_106**

[Count data](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/data/2016-11-29-count-data/2016-11-29-oly-gonad-oa-count-data-female106-male106.txt)

[R Script](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/2016-11-29-female106-male106-DESeq2.R)

![female-106 vs. male-106](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/analyses/11-29-oly-oa-gonad-DESeq2/female106-male106.png)

**Female_108 vs. Male_108**

[Count data](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/data/2016-11-29-count-data/2016-11-29-oly-gonad-oa-count-data-female108-male108.txt)

[R Script](https://github.com/yaaminiv/yaaminiv-fish546-2016/blob/master/analyses/11-29-oly-oa-gonad-DESeq2/2016-11-29-female108-male108-DESeq2.R)

![female-108 vs. male-108](https://raw.githubusercontent.com/yaaminiv/yaaminiv-fish546-2016/master/analyses/11-29-oly-oa-gonad-DESeq2/female108-male108.png)

Once again, when I examined the adjusted p-values (p-adj) for each comparison, I found the following:

- Female_106 vs. Female_108: No p-adj less than 0.9996995
- Male_106 vs. Male_108: No p-adj less than 1
- Female_106 vs. Male_106: No p-adj less than 0.9996995
- Female_108 vs. Male_108: No p-adj less than 1
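To see why every p-adj can end up so close to 1, it helps to look at how the adjustment works. DESeq2 reports Benjamini-Hochberg adjusted p-values by default. The sketch below is a plain Python version of that procedure, not DESeq2's own code, and the example p-values are made up for illustration rather than taken from my data.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for illustration only; not from this dataset.
example = [0.001, 0.01, 0.02, 0.5]
print(benjamini_hochberg(example))
```

Each raw p-value is multiplied by the number of tests divided by its rank, so with tens of thousands of transcripts the smallest raw p-value has to be very small before its adjusted value drops below a cutoff like 0.05. DESeq2 also filters out low-count genes before adjusting, so the real number of tests is smaller than the full transcriptome.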

Now I'm really stumped! My next step will either be trying a pairwise comparison between Female_106 and Male_108 to see if that's really where the differences are, or rerunning `kallisto quant` with an extreme standard deviation value of 0.

Update 2016-11-30: After talking to my adviser, I learned that performing pairwise comparisons using DESeq2 is not scientifically valuable! I will proceed with my analyses using the `kallisto quant` files generated here with a standard deviation of 10%.
