# Group Task 1 Answers

Given an RNA-Seq experiment with a knocked out gene, you were asked to answer the following questions:

  * What influence does it have?   
  * What is the name of the knock out gene?
  * How did you determine those?  

***

## Dataset

You were given the following files in order to conduct the analysis:

  * Wild type sample reads in FASTQ format  
    `WT[replicate]_[1|2].fastq.gz`  


  * Knockout sample reads in FASTQ format  
    `KO[replicate]_[1|2].fastq.gz`
  
  
  * _P. berghei_ genome in FASTA format  
    `PbANKA_v3.fasta`
    
    
  * _P. berghei_ transcripts in FASTA format  
    `Pb.CDS.fasta`
    

  * _P. berghei_ annotations in GFF format  
    `PbANKA_v3.gff3.gz`
    
    
  * _P. berghei_ gene descriptions in TSV format  
    `Pb.names.txt`
      
    
  * R script to run sleuth  
    `sleuth.R`

### Important questions

**Can you summarise the experimental design?**

The experimental design should explain what each sample represents, i.e. the conditions that were applied and how many replicates there were.  In this experiment, there are two conditions, wild type (WT) and knock out (KO), each of which has three biological replicates.

| Sample name | Condition | Replicate |
| :-: | :-: | :-: |
| WT1 | wild type | 1 |
| WT2 | wild type | 2 |
| WT3 | wild type | 3 |
| KO1 | knock out | 1 |
| KO2 | knock out | 2 |
| KO3 | knock out | 3 |
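
The design can also be read straight off the FASTQ file names. The following is a minimal sketch, assuming the `[condition][replicate]_[1|2].fastq.gz` naming pattern listed in the Dataset section; the empty placeholder files and the temporary directory are stand-ins for the real reads, not part of the course data.

```shell
# Sketch: derive sample, condition and replicate from FASTQ file names.
# Empty placeholder files stand in for the real reads (assumption).
workdir=$(mktemp -d)
for s in WT1 WT2 WT3 KO1 KO2 KO3; do
    touch "$workdir/${s}_1.fastq.gz" "$workdir/${s}_2.fastq.gz"
done

design=$(
    for fname in "$workdir"/*_1.fastq.gz; do
        sample=$(basename "$fname" _1.fastq.gz)   # e.g. WT1
        condition=${sample%[0-9]}                 # strip the replicate digit, e.g. WT
        replicate=${sample#"$condition"}          # what is left, e.g. 1
        printf '%s\t%s\t%s\n' "$sample" "$condition" "$replicate"
    done
)

printf 'sample\tcondition\treplicate\n%s\n' "$design"
rm -r "$workdir"
```

In the real project directory, the same loop could be run over `*_1.fastq.gz` directly.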

***

## Aligning sample reads to the genome

**First, we need to move into the directory containing the data.**

In [3]:
cd /home/manager/course_data/group_projects/RNASeq_1

**Then, we build our HISAT2 index for the genome.**

In [3]:
hisat2-build PbANKA_v3.fasta PbANKA_v3_hisat2.idx

Settings:
  Output files: "PbANKA_v3_hisat2.idx.*.ht2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  PbANKA_v3.fasta
Reading reference sizes
  Time reading reference sizes: 00:00:01
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 3521364 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 3521364 --dcv 1024
Constructing suffix-array e

**Next, we can use a for loop to align all of our sample files to the genome.**  
_Be patient, this will take a while!_

In [4]:
for fname in *_1.fastq.gz
do
    # Get sample name from file name
    sample=`echo "$fname" | cut -d'_' -f1`

    # Align sample to genome
    echo "Aligning sample..."${sample}
    hisat2 --max-intronlen 10000 -x PbANKA_v3_hisat2.idx \
    -1 ${sample}_1.fastq.gz -2 ${sample}_2.fastq.gz \
    -S ${sample}.sam

    # Convert SAM to sorted BAM
    echo "Converting sample SAM to sorted BAM..."${sample}
    samtools view -b ${sample}.sam | \
    samtools sort -o ${sample}.sorted.bam

    # Index sorted BAM
    echo "Indexing sample BAM..."${sample}
    samtools index ${sample}.sorted.bam
done

Aligning sample...KO1
2160158 reads; of these:
  2160158 (100.00%) were paired; of these:
    100329 (4.64%) aligned concordantly 0 times
    1918950 (88.83%) aligned concordantly exactly 1 time
    140879 (6.52%) aligned concordantly >1 times
    ----
    100329 pairs aligned concordantly 0 times; of these:
      5843 (5.82%) aligned discordantly 1 time
    ----
    94486 pairs aligned 0 times concordantly or discordantly; of these:
      188972 mates make up the pairs; of these:
        123409 (65.31%) aligned 0 times
        56566 (29.93%) aligned exactly 1 time
        8997 (4.76%) aligned >1 times
97.14% overall alignment rate
Converting sample SAM to sorted BAM...KO1
[bam_sort_core] merging from 1 files and 1 in-memory blocks...
Indexing sample BAM...KO1
Aligning sample...KO2
2978598 reads; of these:
  2978598 (100.00%) were paired; of these:
    394393 (13.24%) aligned concordantly 0 times
    2413270 (81.02%) aligned concordantly exactly 1 time
    170935 (5.74%) aligned concor

### Important questions

**What is the overall alignment rate for each of the samples?**

It is important to look at the overall alignment rate to the genome as this can give an idea of whether there are any issues with the experiment (e.g. contamination).

| Sample name | Alignment rate|
| :-: | :-: |
| WT1 | 97.63% |
| WT2 | 83.60% |
| WT3 | 97.83% |
| KO1 | 97.14% |
| KO2 | 88.88% |
| KO3 | 97.12% |
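
For paired-end data, the overall alignment rate appears to be the fraction of individual mates that aligned at all. The sketch below checks that reading against the KO1 summary shown earlier: 2160158 pairs give 4320316 mates, of which 123409 aligned 0 times. The three summary lines are copied (abridged) from the KO1 output; storing them in a shell variable is an assumption made only to keep the example self-contained.

```shell
# KO1 summary lines copied from the HISAT2 output above (abridged).
summary='2160158 reads; of these:
        123409 (65.31%) aligned 0 times
97.14% overall alignment rate'

# Total pairs is the first number on the first line.
pairs=$(printf '%s\n' "$summary" | awk 'NR == 1 {print $1}')

# Unaligned mates: the "aligned 0 times" line that is not about concordance.
unaligned=$(printf '%s\n' "$summary" |
    awk '/aligned 0 times$/ && !/concordantly/ {print $1}')

# Each pair contributes two mates.
rate=$(awk -v p="$pairs" -v u="$unaligned" \
    'BEGIN {printf "%.2f", 100 * (1 - u / (2 * p))}')

# Rate as reported by HISAT2, with the percent sign removed.
reported=$(printf '%s\n' "$summary" |
    awk '/overall alignment rate/ {sub("%", "", $1); print $1}')

echo "computed ${rate}% vs reported ${reported}%"
```

To apply this to every sample, the summary (which HISAT2 writes to standard error) would first need to be saved to a file per sample, for example by adding a redirect such as `2> ${sample}.hisat2.log` to the alignment loop; that log file name is hypothetical.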

***

## Visualising the genome alignments 

Before you can use IGV to visualise the genome, you must first index the genome using `samtools faidx`.

**Index the genome.**

In [6]:
samtools faidx PbANKA_v3.fasta

Once that has finished, you need to load your genome (`PbANKA_v3.fasta`), annotation (`PbANKA_v3.gff3.gz`) and sorted alignment files (`[sample].sorted.bam`) into IGV.

**Start IGV.**  

In [7]:
igv.sh &

[1] 22359


**Load the genome file `Genomes -> Load Genome from File`**  

**Load the annotation (gff) file `File -> Load from File`**   

**Load the sorted sample BAM files `File -> Load from File`** 

**Make sure to set the alignment tracks to "squished" and to view reads as "paired".**

**Type 'PBANKA_KO' in the search box and click 'Go'.**

![images/PBANKA_KO_IGV.png](images/PBANKA_KO_IGV.png)

### Important questions

**Where in the genome is PBANKA_KO located?**

You can get the co-ordinates of PBANKA_KO from the annotation file (`PbANKA_v3.gff3.gz`) using `grep` (first uncompressing the file with `gunzip`) or `zgrep`.

In [9]:
zgrep PBANKA_KO PbANKA_v3.gff3.gz

PbANKA_14_v3	chado	gene	1416412	1423431	.	+	.	ID=PBANKA_KO;Name=PBANKA_KO
PbANKA_14_v3	chado	mRNA	1416412	1423431	.	+	.	ID=PBANKA_KO.1;Parent=PBANKA_KO
PbANKA_14_v3	chado	CDS	1416412	1423431	.	+	0	ID=PBANKA_KO.1:exon:1;Parent=PBANKA_KO.1
PbANKA_14_v3    chado   polypeptide     1416412 1423431 .       +       .       ID=PBANKA_KO.1:pep;Dbxref=RMgm:PBANKA_KO;Derives_from=PBANKA_KO.1


PBANKA_KO is located on the forward strand of **PbANKA_14_v3** between **1416412** and **1423431**.
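
As a small sketch, the same information can be pulled out of the `gene` line with `awk`, which also gives the gene length and a `chromosome:start-end` string that can be pasted into the IGV search box. The GFF line is copied from the `zgrep` output above; holding it in a shell variable is only for illustration.

```shell
# Gene line copied from the zgrep output above (tab-separated GFF columns).
gff_line=$(printf 'PbANKA_14_v3\tchado\tgene\t1416412\t1423431\t.\t+\t.\tID=PBANKA_KO;Name=PBANKA_KO')

location=$(printf '%s\n' "$gff_line" | awk -F'\t' '$3 == "gene" {
    # GFF coordinates are 1-based and inclusive, hence the +1 for the length.
    printf "%s:%d-%d (%s strand, %d bp)", $1, $4, $5, $7, $5 - $4 + 1
}')

echo "$location"
```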

**How many exons does PBANKA_KO have?**

In IGV, you can see that PBANKA_KO has **one exon**.  

![images/PBANKA_KO_IGV_exon.png](images/PBANKA_KO_IGV_exon.png)

This can be confirmed by looking for the PBANKA_KO CDS annotations in the GFF file. 

In [10]:
zgrep "CDS.*PBANKA_KO" PbANKA_v3.gff3.gz

PbANKA_14_v3	chado	CDS	1416412	1423431	.	+	0	ID=PBANKA_KO.1:exon:1;Parent=PBANKA_KO.1
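
A regular expression over the whole line works here, but matching on the GFF columns is a little stricter. The sketch below counts CDS features whose parent is `PBANKA_KO.1`, using a temporary file that holds the three feature lines from the earlier `zgrep` output; the temporary file is a stand-in for the real annotation.

```shell
# Stand-in annotation: gene, mRNA and CDS lines copied from the output above.
gff=$(mktemp)
printf 'PbANKA_14_v3\tchado\tgene\t1416412\t1423431\t.\t+\t.\tID=PBANKA_KO;Name=PBANKA_KO\n' > "$gff"
printf 'PbANKA_14_v3\tchado\tmRNA\t1416412\t1423431\t.\t+\t.\tID=PBANKA_KO.1;Parent=PBANKA_KO\n' >> "$gff"
printf 'PbANKA_14_v3\tchado\tCDS\t1416412\t1423431\t.\t+\t0\tID=PBANKA_KO.1:exon:1;Parent=PBANKA_KO.1\n' >> "$gff"

# Column 3 is the feature type; column 9 holds the attributes.
n_cds=$(awk -F'\t' '
    $3 == "CDS" && $9 ~ /Parent=PBANKA_KO\.1(;|$)/ {n++}
    END {print n + 0}
' "$gff")

echo "PBANKA_KO CDS features: $n_cds"
rm "$gff"
```

On the real, compressed annotation the input could be supplied with `gunzip -c PbANKA_v3.gff3.gz | awk ...` instead of a file name.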


**Do you notice anything unusual about any of the alignments to PBANKA_KO? Should there be any reads mapping to this knocked out gene?**

Reads are mapping to PBANKA_KO in the KO3 sample, which should not happen if the gene has been fully knocked out.  This suggests that KO3 may be an incomplete knock out.

![images/PBANKA_KO_IGV_coverage.png](images/PBANKA_KO_IGV_coverage.png)

***

## Aligning sample reads to the transcriptome

**First, we need to build a Kallisto index of our transcriptome (`Pb.CDS.fasta`).**

In [11]:
kallisto index -i Pb.CDS.kallisto Pb.CDS.fasta


As with the genome alignments, we can run the transcriptome alignments for all the samples using a loop.

**Now, align your samples to the transcriptome using Kallisto.**

In [12]:
for fname in *_1.fastq.gz
do
    # Get sample name from file name
    sample=`echo "$fname" | cut -d'_' -f1`
    
    # Quantify transcript expression in sample
    echo "kallisto quantification for sample..."${sample}
    kallisto quant -i Pb.CDS.kallisto -o ${sample} -b 100 \
    ${sample}_1.fastq.gz ${sample}_2.fastq.gz
done

kallisto quantification for sample...KO1

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 5,077
[index] number of k-mers: 9,949,889
[index] number of equivalence classes: 7,297
[quant] running in paired-end mode
[quant] will process pair 1: KO1_1.fastq.gz
                             KO1_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 2,160,158 reads, 1,760,542 reads pseudoaligned
[quant] estimated average fragment length: 270.873
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 286 rounds
[bstrp] running EM for the bootstrap: 100

kallisto quantification for sample...KO2

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 5,077
[index] number of k-mers: 9,949,889
[index] number of equivalence classes: 7,297
[quant] running in paired-end mode
[quant] will proc
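
The proportion of reads that pseudoaligned to the transcriptome can be worked out from the `[quant] processed ...` line. The sketch below does this for KO1, using the line copied from the output above; holding it in a shell variable is only for illustration.

```shell
# Line copied from the KO1 kallisto output above.
line='[quant] processed 2,160,158 reads, 1,760,542 reads pseudoaligned'

# Remove the thousands separators, then divide pseudoaligned by processed.
# After tr, field 3 is the processed count and field 5 the pseudoaligned count.
pseudo_rate=$(printf '%s\n' "$line" | tr -d ',' |
    awk '{printf "%.2f", 100 * $5 / $3}')

echo "KO1 pseudoalignment rate: ${pseudo_rate}%"
```

This rate is lower than the KO1 genome alignment rate (97.14%), which is what one would expect when reads are only compared against coding sequences.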

***

## Run DE analysis in sleuth

To identify differentially expressed genes you can use the R package, sleuth.  

**Run the sleuth R script (`sleuth.R`).**

In [13]:
Rscript sleuth.R

Loading required package: methods
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In file(file, "rt") :
  cannot open file 'hiseq_info.txt': No such file or directory
Execution halted



This will give you the error:

    cannot open file 'hiseq_info.txt': No such file or directory
    
**Take a look at the R script.**

In [14]:
cat sleuth.R

library("sleuth")
s2c <- read.table("hiseq_info.txt", header = TRUE, stringsAsFactors=FALSE)

sample_id <- c("WT1", "WT2","WT3", "KO1", "KO2", "KO3")
kal_dirs <- sapply(sample_id, function(id) file.path(".", id))
s2c <- dplyr::select(s2c, sample = sample, condition)
s2c <- dplyr::mutate(s2c, path = kal_dirs)

t2g<-read.table("Pb.names.txt", header=T, sep="\t")

so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g)
so <- sleuth_fit(so)
so <- sleuth_fit(so, ~1, 'reduced')
#so <- sleuth_lrt(so, 'reduced', 'full')
so <- sleuth_wt(so, 'conditionWT', 'full')

results_table <- sleuth_results(so, 'conditionWT', test_type = 'wt')

write.table(results_table, file="kallisto.results", quote=FALSE, sep="\t", row.names=FALSE)

sleuth_live(so)


Look at the second line:

    s2c <- read.table("hiseq_info.txt", header = TRUE, stringsAsFactors=FALSE)
    
The script is looking for a file called `hiseq_info.txt`.  

**Let's see if the file is there.**

In [15]:
ls hiseq_info.txt

hiseq_info.txt


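Nope (the listing above comes from a later run, after the file had been copied in; on a first run, `ls` reports that the file does not exist).  Well...we did warn you that some files might be missing!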

**What does the `hiseq_info.txt` file contain? Let's look at the one we used in the practical.**

In [16]:
cat /home/manager/pathogen-informatics-training/Notebooks/RNA-Seq/data/hiseq_info.txt

sample	condition
MT1	MT
MT2	MT
SBP1	SBP
SBP2	SBP
SBP3	SBP


So, this file shows which condition was applied to each sample.

**Copy this file into the same directory as your sleuth R script.**

In [9]:
cp /home/manager/pathogen-informatics-training/Notebooks/RNA-Seq/data/hiseq_info.txt .

**Let's check that worked.**

In [11]:
cat hiseq_info.txt

sample	condition
MT1	MT
MT2	MT
SBP1	SBP
SBP2	SBP
SBP3	SBP


Good, now let's update this file so it contains our sample names.

**You can edit the file manually by typing the following `nano` command in your terminal.**

In [7]:
nano /home/manager/course_data/group_projects/RNASeq_1/hiseq_info.txt



_Hint: be careful which order you put the samples in._

**Alternatively you can make the edits using `sed`.**

In [12]:
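# Rename the conditions and samples: MT -> WT, SBP -> KO
sed -i -e 's/MT/WT/g' hiseq_info.txt
sed -i -e 's/SBP/KO/g' hiseq_info.txt
# Add the missing third wild type replicate after WT2
# (-i -e rather than -ie, which GNU sed reads as "back up with suffix e")
sed -i -e '/^WT2/a WT3\tWT' hiseq_info.txt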

**And check that it's worked.**

In [13]:
cat hiseq_info.txt

sample	condition
WT1	WT
WT2	WT
WT3	WT
KO1	KO
KO2	KO
KO3	KO
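
The order matters because `sleuth.R` attaches the kallisto output directories by position (`dplyr::mutate(s2c, path = kal_dirs)`), so the rows must follow the `sample_id` vector in the script. As a minimal sketch, the file could also be generated from that vector and its order checked; the temporary file below is a stand-in for `hiseq_info.txt`.

```shell
# Sample order copied from the sample_id vector in sleuth.R.
samples="WT1 WT2 WT3 KO1 KO2 KO3"
info=$(mktemp)   # temporary stand-in for hiseq_info.txt

# Header, then one row per sample; the condition is the name minus its digit.
printf 'sample\tcondition\n' > "$info"
for s in $samples; do
    printf '%s\t%s\n' "$s" "${s%[0-9]}" >> "$info"
done

# Check that the sample column is in the order sleuth.R expects.
file_order=$(awk -F'\t' 'NR > 1 {printf "%s%s", sep, $1; sep = " "}' "$info")
if [ "$file_order" = "$samples" ]; then order_ok=yes; else order_ok=no; fi

cat "$info"
echo "order matches sleuth.R: $order_ok"
n_rows=$(awk 'END {print NR}' "$info")
rm "$info"
```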


**Now try running the sleuth R script again.**

In [17]:
Rscript sleuth.R

Loading required package: methods
reading in kallisto results
dropping unused factor levels
......
normalizing est_counts
4811 targets passed the filter
normalizing tpm
merging in metadata
summarizing bootstraps
......
fitting measurement error models
shrinkage estimation
11 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
These are the target ids with NA values: PBANKA_0000301, PBANKA_0001101, PBANKA_0700541, PBANKA_1040521, PBANKA_1241400, PBANKA_1418000, PBANKA_1465700, PBANKA_API00390, PBANKA_0707700, PBANKA_0941800, PBANKA_1122700
computing variance of betas
fitting measurement error models
shrinkage estimation
2 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
The LOESS fit will be repeated using exact 

**Click on the link returned: [http://127.0.0.1:42427](http://127.0.0.1:42427).**

### Important questions

Can you summarise the data processing (i.e. the number of reads processed and the proportion of reads mapping to the genome and transcriptome)? 

Looking at the PCA plot (`Maps -> PCA`) do you think the samples form tight, distinct clusters based on the condition (WT or KO) that was applied? 

Was PBANKA_KO differentially expressed? 

Is there anything unusual about the PBANKA_KO expression levels in any of the samples?

How many genes are more highly expressed in the KO than in the WT?

How many genes are more highly expressed in the WT than in the KO?

***

## GO term enrichment analysis


Extract the IDs of the differentially expressed genes.

In [2]:
cd /home/manager/course_data/group_projects/RNASeq_1
awk -F'\t' '$4 < 0.01 && $5 > 0  {print $1}' kallisto.results > kallisto.WT.sig.genes
awk -F'\t' '$4 < 0.01 && $5 < 0  {print $1}' kallisto.results > kallisto.KO.sig.genes
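
The two `awk` commands rely on column 4 holding the q-value and column 5 the effect size `b`; because the script tests the `conditionWT` coefficient, a positive `b` means higher expression in WT. The sketch below shows the filter on a tiny made-up table so the logic is easy to follow. The column layout, gene IDs and values are all hypothetical, so it is worth checking the header of your own `kallisto.results` (e.g. with `head -n 1`) before trusting the column numbers.

```shell
# Hypothetical stand-in for kallisto.results.
# Assumed columns: target_id, description, pval, qval, b
results=$(mktemp)
printf 'target_id\tdescription\tpval\tqval\tb\n' > "$results"
printf 'geneA\texample\t0.0001\t0.001\t2.5\n'   >> "$results"
printf 'geneB\texample\t0.0002\t0.002\t-1.8\n'  >> "$results"
printf 'geneC\texample\t0.4\t0.6\t0.3\n'        >> "$results"

# NR > 1 skips the header; q-value < 0.01 and the sign of b pick the direction.
wt_up=$(awk -F'\t' 'NR > 1 && $4 < 0.01 && $5 > 0 {print $1}' "$results")
ko_up=$(awk -F'\t' 'NR > 1 && $4 < 0.01 && $5 < 0 {print $1}' "$results")

echo "higher in WT: $wt_up"
echo "higher in KO: $ko_up"
rm "$results"
```

Counting the lines in each output file (for example with `wc -l`) then answers how many genes are more highly expressed in each condition.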

Go to [http://www.plasmodb.org](http://www.plasmodb.org).

  * My Strategies -> New
  * Annotation, curation and identifiers -> Gene IDs
  * Upload file of DE gene IDs from kallisto output
  * Analyse results (blue button) -> GO enrichment


### Important questions

What are the top 5 enriched GO terms (biological processes and molecular functions) of genes with higher expression in the WT samples?

What are the top 5 enriched GO terms (biological processes and molecular functions) of genes with higher expression in the KO samples?

***

## What is PBANKA_KO?

So, for this group task, we removed the real name of PBANKA_KO from all of the files we gave you. How mean! To get the real name of PBANKA_KO, we need the real genome annotation file.

**Download the real annotation file.**

In [3]:
wget ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3/CURRENT/Pberghei.gff3.gz

--2018-12-06 12:08:20--  ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3/CURRENT/Pberghei.gff3.gz
           => ‘Pberghei.gff3.gz’
Resolving ftp.sanger.ac.uk (ftp.sanger.ac.uk)... 193.62.203.115
Connecting to ftp.sanger.ac.uk (ftp.sanger.ac.uk)|193.62.203.115|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/project/pathogens/gff3/CURRENT ... done.
==> SIZE Pberghei.gff3.gz ... 9151369
==> PASV ... done.    ==> RETR Pberghei.gff3.gz ... done.
Length: 9151369 (8.7M) (unauthoritative)


2018-12-06 12:08:21 (12.8 MB/s) - ‘Pberghei.gff3.gz’ saved [9151369]



Now, earlier on, you will have jotted down the location of PBANKA_KO in the genome.

**Search for a gene with the same co-ordinates as PBANKA_KO in the real annotation file.**

In [6]:
zgrep "PbANKA_14_v3.*gene.*1416412.*1423431" Pberghei.gff3.gz

PbANKA_14_v3	chado	gene	1416412	1423431	.	+	.	ID=PBANKA_1437500;Name=AP2-G;previous_systematic_id=PB000234.02.0,PBANKA_143750;synonym=ApiAP2


Looks like PBANKA_KO is really **PBANKA_1437500**, better known as **AP2-G**, in disguise!
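
As a sketch of a stricter lookup, the coordinates can be matched column by column and the `ID` and `Name` attributes extracted with `awk`. The GFF line is copied from the `zgrep` output above; holding it in a shell variable is only for illustration.

```shell
# Gene line copied from the real annotation output above.
gff_line=$(printf 'PbANKA_14_v3\tchado\tgene\t1416412\t1423431\t.\t+\t.\tID=PBANKA_1437500;Name=AP2-G;previous_systematic_id=PB000234.02.0,PBANKA_143750;synonym=ApiAP2')

gene=$(printf '%s\n' "$gff_line" | awk -F'\t' '
    # Match chromosome, feature type, start and end exactly.
    $1 == "PbANKA_14_v3" && $3 == "gene" && $4 == 1416412 && $5 == 1423431 {
        # Column 9 is a semicolon-separated list of key=value attributes.
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            if (attrs[i] ~ /^ID=/)   id   = substr(attrs[i], 4)
            if (attrs[i] ~ /^Name=/) name = substr(attrs[i], 6)
        }
        print id, name
    }')

echo "$gene"
```

On the downloaded file, the input could be supplied with `gunzip -c Pberghei.gff3.gz | awk ...`.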