# Welcome to this year's R/Jupyter part of the Applied Bioinformatics exam

The dataset for this year's exam is a featureCounts output table from an experiment comparing an Ecuadorian papaya genotype with a genotype from the Philippines.  
The following cells will perform the installation and general differential expression analysis for you.  
Your task is to filter the results according to the cutoffs specified in the related questions on ecampus.

# Run these commands before you start the exam on ecampus!

In [None]:
install.packages("reshape2", verbose = TRUE)
install.packages("ggplot2", verbose = TRUE)
install.packages("statmod", verbose = TRUE)
install.packages("gplots", verbose = TRUE)
# Install BiocManager only if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager", verbose = TRUE)
BiocManager::install("limma")
BiocManager::install("edgeR")

In [None]:
library(reshape2)
library(limma)
library(edgeR)
library(ggplot2)
library(gplots)
library(statmod)

# Run these initial steps to set up the differential expression analysis as preparation for the questions

In [None]:
counts <- "https://raw.githubusercontent.com/tgstoecker/teaching/master/AppliedBioinformatics/jote/papaya_genotypes.txt"
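# Read the featureCounts table; the first column (gene IDs) becomes the row names
fc_res <- read.table(counts, header = TRUE, row.names = 1)
# Strip the file-name suffix and prefix from the sample columns (fixed = TRUE matches the text literally, not as a regex)
colnames(fc_res) <- sub(".dedup.bam", "", colnames(fc_res), fixed = TRUE)
colnames(fc_res) <- sub("removed_duplicates_alignments.", "", colnames(fc_res), fixed = TRUE)
# Group labels, given in the same order as the sample columns
group = c("filipino", "filipino", "filipino", "ecuadorian", "ecuadorian", "ecuadorian")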

*Samples W4, W5 and W6 are three replicates of the papaya genotype from Ecuador.  
Samples W1, W2 and W3 are three replicates of the papaya genotype from the Philippines.*
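The group labels only make sense if they line up with the sample columns, so a quick check of the pairing can be useful. The following is a minimal sketch that assumes the cleaned-up column names are `W1` to `W6` in that order; the `samples` vector is a hypothetical stand-in, and in the notebook you would use `colnames(fc_res)[6:11]` instead.

```r
# Hypothetical cleaned-up sample names, assumed to be in this column order
samples <- c("W1", "W2", "W3", "W4", "W5", "W6")
group <- c("filipino", "filipino", "filipino", "ecuadorian", "ecuadorian", "ecuadorian")

# Pair every sample with its group label to check the assignment by eye
sample_info <- data.frame(sample = samples, group = group)
print(sample_info)

# Each genotype should have exactly three replicates
print(table(sample_info$group))
```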

In [None]:
# create a DGE list object - the core of using edgeR
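# For our purposes, the DGEList object should contain the matrix/data frame of raw counts, the group/treatment info and the gene names
# Columns 6 to 11 hold the sample counts; the first five columns are the featureCounts annotation (Chr, Start, End, Strand, Length)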
dge = DGEList(counts = fc_res[, 6:11], group = group, genes = rownames(fc_res))

In [None]:
design <- model.matrix(~0+group)
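To see what this design matrix contains, it can help to print it. The sketch below uses only base R and repeats the `group` vector from above. Because `~0+group` drops the intercept, each genotype gets its own column, and the columns are named after the factor levels in alphabetical order.

```r
group <- c("filipino", "filipino", "filipino", "ecuadorian", "ecuadorian", "ecuadorian")

# No intercept: one indicator column per genotype
design <- model.matrix(~0 + group)

# Column names are "group" followed by the level name, sorted alphabetically
print(colnames(design))

# Each row is one sample; a 1 marks the genotype the sample belongs to
print(design)
```

These column names are the ones that have to be used when the comparison between the genotypes is specified.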

In [None]:
keep <- filterByExpr(dge, design)
dge_filtered <- dge[keep, , keep.lib.sizes=FALSE]

In [None]:
dge_normalized <- calcNormFactors(dge_filtered, method = "TMM")

In [None]:
# All three dispersion estimates (common, trended and tagwise) can be obtained from the estimateDisp function in one command:
dge_disp <- estimateDisp(dge_normalized, design, robust=TRUE)

In [None]:
# The quasi-likelihood (QL) dispersions are estimated with the glmQLFit function:
fit <- glmQLFit(dge_disp, design, robust=TRUE)

In [None]:
# To define the comparison of interest, we can use limma's convenient makeContrasts function:
FvsE <- makeContrasts(groupfilipino-groupecuadorian, levels=design)

# In subsequent results, a positive log2-fold-change (logFC) will indicate a gene up-regulated in the filipino papaya genotype compared to the ecuadorian papaya genotype,
# whereas a negative logFC will indicate a gene more highly expressed in the ecuadorian genotype.

# edgeR offers two main kinds of tests: QL F-tests and likelihood ratio tests (LRT).
# We will use the former, as they perform stricter error control by accounting for the uncertainty in dispersion estimation:
res <- glmQLFTest(fit, contrast=FvsE)
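The sign convention of the contrast can be illustrated with a small base R sketch. The contrast `groupfilipino-groupecuadorian` corresponds to a weight of +1 on the filipino column and -1 on the ecuadorian column. The per-group values below are made-up log2 expression levels for a single hypothetical gene, not output from the fit above.

```r
# Contrast weights, named after the design matrix columns
contrast <- c(groupecuadorian = -1, groupfilipino = 1)

# Made-up log2 expression levels of one hypothetical gene in each genotype
log2_expr <- c(groupecuadorian = 3, groupfilipino = 5)

# The contrast is the weighted sum: filipino minus ecuadorian
logFC <- sum(contrast * log2_expr[names(contrast)])
print(logFC)

# A positive value means higher expression in the filipino genotype
print(if (logFC > 0) "up in filipino" else "up in ecuadorian")
```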

# Use the created "res" object to answer the "Applying thresholds" question on ecampus:
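One possible way to apply thresholds is sketched below on a toy table. In the notebook, `topTags(res, n = Inf)$table` should return the full result table, including `logFC` and `FDR` columns; here a small made-up data frame `tt` stands in for it so that the example runs on its own. The cutoffs `lfc_cutoff` and `fdr_cutoff` are placeholders, not the values from the ecampus question, and whether a cutoff is inclusive depends on how the question is worded.

```r
# Toy stand-in for topTags(res, n = Inf)$table; all values are made up
tt <- data.frame(
  genes = paste0("gene", 1:6),
  logFC = c(2.5, -3.1, 0.4, 1.2, -0.8, -2.2),
  FDR   = c(0.001, 0.0005, 0.20, 0.03, 0.04, 0.30)
)

# Placeholder cutoffs; replace them with the ones given in the question
lfc_cutoff <- 1
fdr_cutoff <- 0.05

# Keep genes that pass both the fold-change and the FDR cutoff
sig <- tt[abs(tt$logFC) >= lfc_cutoff & tt$FDR <= fdr_cutoff, ]

# Split by direction: positive logFC is up in filipino, negative is up in ecuadorian
up_filipino   <- sig[sig$logFC > 0, ]
up_ecuadorian <- sig[sig$logFC < 0, ]

print(nrow(up_filipino))
print(nrow(up_ecuadorian))
```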