# Interpreting the results

## Introduction


The main objective of this part of the tutorial is to use simple Unix commands to get a list of significantly differentially expressed genes. Using this gene list and the quantitative information from our analysis we can then start to make biological inferences about our dataset.

Using the R script (`sleuth.R`), we printed out a file of results describing the differentially expressed genes in our dataset. This file is called **`kallisto.results`**. 

The file contains several columns, of which the most important are:

  * **Column 1**: target_id (gene id)  
  * **Column 2**: description (a more informative description of the gene than its id)  
  * **Column 3**: pval (p value)    
  * **Column 4**: qval (p value corrected for multiple hypothesis testing)  
  * **Column 5**: b (fold change estimate; sleuth's effect size, analogous to a log fold change)  

With a little Linux magic we can get the list of differentially expressed genes with only the columns of interest as above.
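One way to confirm which column number each field occupies is to print the header line with one field per row. The sketch below runs on a small made-up temporary file (the gene name and values are hypothetical, not real results) so that it is self-contained, and it assumes the first line of the file is a header.

```shell
# Create a tiny, made-up, tab-separated stand-in for kallisto.results
toy=$(mktemp)
printf 'target_id\tdescription\tpval\tqval\tb\n' > "$toy"
printf 'gene1\thypothetical protein A\t0.0001\t0.001\t2.5\n' >> "$toy"

# Print each field of the first line alongside its column number
header_cols=$(awk -F "\t" 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i }' "$toy")
printf '%s\n' "$header_cols"

# Remove the temporary file
rm -f "$toy"
```

To try this on the real data, replace `"$toy"` in the `awk` command with `kallisto.results`.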

***

## Exercise 6

**Make sure you are in the `data` directory with the tutorial files.**

In [None]:
cd data

To get the genes which are more highly expressed in our SBP samples, we must first filter our results. There are two columns we want to filter our data on: **b** (column 5) and **qval** (column 4). The **b** column gives the direction and size of the change in expression, and the **qval** column tells us whether that change is significant.

The following command will get those genes which have an adjusted p value (qval) less than 0.01 and a positive fold change (b > 0). These genes are more highly expressed in the SBP samples.

In [None]:
awk -F "\t" '$4 < 0.01 && $5 > 0' kallisto.results | cut -f1,2,3,4,5 | head 

We used `awk` to filter the gene list and print only the lines which met our search criteria (qval < 0.01, b > 0). The `-F` option tells `awk` which delimiter separates the columns; in this case it is a tab, written as the escape sequence `"\t"`. We then used `cut` to print only columns 1-5, although the same selection can be made within the `awk` command. Finally, we used `head` to show the first 10 lines of the output.
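As a minimal sketch of selecting the columns inside `awk` rather than piping to `cut`, the example below filters and prints columns 1-5 in a single step. It runs on a small made-up temporary file (hypothetical gene names and values) with an extra sixth column, so the effect of the selection is visible. Setting `OFS` keeps the output tab-separated.

```shell
# Create a tiny, made-up, tab-separated stand-in for kallisto.results
toy=$(mktemp)
printf 'target_id\tdescription\tpval\tqval\tb\textra\n' > "$toy"
printf 'gene1\thypothetical protein A\t0.0001\t0.001\t2.5\tx\n' >> "$toy"
printf 'gene2\thypothetical protein B\t0.2\t0.5\t1.2\tx\n' >> "$toy"
printf 'gene3\thypothetical protein C\t0.0002\t0.002\t-1.8\tx\n' >> "$toy"

# Keep rows with qval < 0.01 and b > 0, printing only columns 1-5
filtered=$(awk -F "\t" 'BEGIN { OFS = "\t" } $4 < 0.01 && $5 > 0 { print $1, $2, $3, $4, $5 }' "$toy")
printf '%s\n' "$filtered"

# Remove the temporary file
rm -f "$toy"
```

Only the `gene1` row passes both conditions here: `gene2` has too large a qval and `gene3` has a negative b. On the real data, replace `"$toy"` in the `awk` command with `kallisto.results`.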

Alternatively, we can look for the genes which are more highly expressed in the MT samples.

In [None]:
awk -F "\t" '$4 < 0.01 && $5 < 0' kallisto.results | cut -f1,2,3,4,5 | head

If you want to read more about the work behind this dataset, it is published in:

> **Vector transmission regulates immune control of _Plasmodium_ virulence**  
> Philip J. Spence, William Jarra, Prisca Lévy, Adam J. Reid, Lia Chappell, Thibaut Brugat, Mandy Sanders, Matthew Berriman and Jean Langhorne  
> _Nature. 2013 Jun 13; 498(7453): 228–231 doi:[10.1038/nature12231](https://www.nature.com/articles/nature12231)_

***

## Questions

### Q1: How many genes are more highly expressed in the SBP samples?
_Hint: try replacing `head` in the earlier command with another Unix command to count the number of lines_

### Q2: How many genes are more highly expressed in the MT samples?
_Hint: try replacing `head` in the earlier command with another Unix command to count the number of lines_

### Q3: Do you notice any particular genes that come up in the analysis?
_Hint: you want to count the number of times each description occurs using `awk`, `sort` and `uniq`_

***

## What's next?

You can head back to **[identifying differentially expressed genes with Sleuth](sleuth-de.ipynb)** or continue on to **[key aspects of differential expression analysis](key-aspects.ipynb)**.