# 2018-11-22 Redefining maximally varying genes
So far I have used a naïve criterion to define the genes with maximal variation. This criterion is not suitable for a serious analysis, and it is particularly bad at judging which genes have good variation patterns *across* groups. As a result, genes selected as candidates for maximal variation in the untreated cells turn out to be stably repressed in the treated group, which gives rise to spurious correlation patterns driven by outliers. Here I want to go back and get rid of these problems by defining a more robust and sound criterion for deciding whether a gene enters the clustering analysis.

## Negative binomial distribution

First, let's start by actually assessing the distribution of counts in our data set.

In [None]:
library(Rfast)
library(ggplot2)

In [None]:
# load the data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
merged <- read.table(sprintf('%s/exprMatrix.csv', matrices.dir),
                     header = TRUE, row.names = 1,
                     sep = "\t", check.names = FALSE)

# load sample sheet
sampleSheet <- read.table(sprintf('%s/samplesheet.csv', matrices.dir),
                          header = TRUE,
                          row.names = 1)

# remove dead cells
sampleSheet <- sampleSheet[sampleSheet$status != "dead", ]

In [None]:
# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, sep = "\t",
                        row.names = 1, stringsAsFactors = FALSE)
gene.data <- subset(gene.data, rownames(gene.data) %in% rownames(merged))

Now let's plot the distribution of read counts.

In [None]:
options(repr.plot.width = 4, repr.plot.height = 4)
hist(as.matrix(log(1+merged)), prob = TRUE)

Let's look at how different this is from a Poisson distribution.

In [None]:
# group the cell types together as factors
groups <- factor(sampleSheet$label, levels = c("Jurkat", "J-Lat+DMSO", "J-Lat+SAHA"))
table(groups)

In [None]:
# calculate and plot the coverage per gene in the various groups
nGenes <- nrow(merged)
coverage <- colSums(merged)/nGenes
ord <- order(groups)
options(repr.plot.width = 10, repr.plot.height = 4)
bar.positions <- barplot(coverage[ord], col=groups[ord],
                        xaxt='n', ylab="Coverage per gene")

In [None]:
# simple normalization method
counts.norm <- t(t(merged)/coverage)
top.genes <- tail(order(rowSums(counts.norm)), 10)
expression <- log2(counts.norm[top.genes, ] + 1) # add a pseudocount of 1
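
The `t(t(merged)/coverage)` idiom above divides each column (cell) by its own coverage. A minimal sketch on a made-up 3×2 matrix, using only base R, suggests that it is equivalent to `sweep()` over the columns. The `toy` objects and their values are purely illustrative and are not part of the data set.

```r
# toy count matrix: 3 genes (rows) x 2 cells (columns); values are made up
toy <- matrix(c(10, 20, 30,
                 1,  2,  3),
              nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

# coverage per gene for each cell, defined as in the cell above
toy.coverage <- colSums(toy) / nrow(toy)

# divide each column by its coverage: transpose, divide row-wise, transpose back
toy.norm <- t(t(toy) / toy.coverage)

# the same operation written with sweep() over the columns (MARGIN = 2)
toy.norm.sweep <- sweep(toy, 2, toy.coverage, "/")

# after this normalization every cell has a mean count of 1 per gene
colMeans(toy.norm)
```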

In [None]:
merged.mean <- colMeans(merged)
excess.var <- colVars(as.matrix(merged)) - merged.mean
excess.var[excess.var < 0] <- NA
overdispersion <- excess.var / merged.mean^2

# plot
options(repr.plot.width = 4, repr.plot.height = 4)
hist(log2(overdispersion),main="Variance of read counts is higher than Poisson")

This histogram shows that the overdispersion is positive for most samples, so that the negative binomial is indeed a more adequate representation of the data set.
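
To make the quantity plotted above more concrete: for a negative binomial with mean `mu` and dispersion `alpha`, the variance is `mu + alpha * mu^2`, so `(variance - mean) / mean^2` is a moment estimate of `alpha`, and it should be close to zero for Poisson counts. The sketch below checks this on simulated counts. The seed, sample size, and parameter values are arbitrary assumptions, and `moment.overdispersion` is a hypothetical helper that is not used elsewhere in this notebook.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n <- 10000      # number of simulated counts (arbitrary)
mu <- 5         # mean count (arbitrary)
alpha <- 0.5    # true overdispersion of the negative binomial (arbitrary)

# rnbinom() is parametrized by size = 1 / alpha, so that var = mu + alpha * mu^2
x.nb <- rnbinom(n, size = 1 / alpha, mu = mu)
x.pois <- rpois(n, lambda = mu)

# hypothetical helper: moment estimate of the overdispersion,
# same formula as in the cell above
moment.overdispersion <- function(x) (var(x) - mean(x)) / mean(x)^2

moment.overdispersion(x.nb)    # should be close to alpha
moment.overdispersion(x.pois)  # should be close to 0
```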

## Differential expression analysis

Now we're ready to do the differential expression analysis.

In [None]:
library(DESeq)

In [None]:
# we need to prepare the data because the DESeq package does not accept non-integer
# values of the counts
merged.int <- as.data.frame(lapply(merged, as.integer))
rownames(merged.int) <- rownames(merged)

In [None]:
# this is the basic data structure that DESeq understands
cds <- newCountDataSet(merged.int, groups)

In [None]:
# here we estimate the size factors of the libraries (cells), which are linearly
# correlated to the coverage of the libraries but are estimated using a different,
# more robust, method
cds <- estimateSizeFactors(cds)

In [None]:
plot(sizeFactors(cds),colSums(merged.int)/nGenes)

We can see here that there is a good correlation between the size factors and the coverages. It is also very evident that there are two groups in this chart, corresponding clearly to the two batches.
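
As far as I understand, `estimateSizeFactors` follows a median-of-ratios scheme: each cell's counts are divided by the per-gene geometric mean across cells, and the size factor of the cell is the median of those ratios. The following is a rough base-R sketch of that idea and not the package's actual code. The function name `sketch.size.factors` and the toy matrix are hypothetical.

```r
# hypothetical helper: median-of-ratios size factors for a genes x cells matrix
sketch.size.factors <- function(counts) {
  # per-gene geometric mean across cells, computed on the log scale
  log.geo.means <- rowMeans(log(counts))
  # genes with a zero count in any cell have a geometric mean of 0
  # (log = -Inf) and are left out
  usable <- is.finite(log.geo.means)
  # per cell: median over the usable genes of the log ratio to the geometric mean
  apply(counts, 2, function(cell) {
    exp(median(log(cell[usable]) - log.geo.means[usable]))
  })
}

# toy matrix (made-up values): on the first three genes, cellB has exactly
# twice the counts of cellA; the fourth gene has a zero and is ignored
toy.counts <- cbind(cellA = c(10, 20, 30, 0),
                    cellB = c(20, 40, 60, 5))
toy.sf <- sketch.size.factors(toy.counts)
toy.sf
```

Because the estimate is a median over genes, a handful of very highly expressed genes has less influence on it than on the plain coverage, which is why the two quantities correlate without being identical.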

The next step is to estimate the dispersions from the data set that we have.

In [None]:
# the first method uses only the per-gene dispersion estimates, without
# sharing information across genes
cds <- estimateDispersions(cds, sharingMode="gene-est-only")

In [None]:
# the second method estimates the dispersions separately for each condition,
# using a local fit of dispersion versus mean
cds.pooled <- estimateDispersions(cds, method="per-condition", fitType="local")

Let's see how the dispersions relate to the normalized counts.

In [None]:
options(repr.plot.width = 3, repr.plot.height = 3)
plotDispEsts(cds.pooled,name="Jurkat")

In [None]:
plotDispEsts(cds.pooled,name="J-Lat+DMSO")

In [None]:
plotDispEsts(cds.pooled,name="J-Lat+SAHA") 

In [None]:
# here we do the differential expression analysis for the two groups of cells
de.test <- nbinomTest(cds, "J-Lat+DMSO", "J-Lat+SAHA")
de.test.pooled <- nbinomTest(cds.pooled, "J-Lat+DMSO", "J-Lat+SAHA")

In [None]:
# this function filters and sorts the results of the differential
# expression analysis
find.significant.genes <- function(de.result, alpha = 0.05) {

  # keep significant genes based on FDR-adjusted p-values and drop non-finite
  # fold changes; which() also discards rows where the comparison is NA
  keep <- which(de.result$padj < alpha &
                is.finite(de.result$log2FoldChange))
  filtered <- de.result[keep, ]

  # order by p-value and return all the columns
  filtered[order(filtered$pval), ]
}

In [None]:
# perform the filtering and sorting here
de.genes <- find.significant.genes(de.test)
de.genes.pooled <- find.significant.genes(de.test.pooled)
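
The `padj` column returned by `nbinomTest` holds Benjamini-Hochberg adjusted p-values. A small sketch with made-up p-values, assuming only base R's `p.adjust`, shows how the adjustment interacts with the `alpha` threshold, and why missing values need some care when filtering. The `toy.de` data frame is hypothetical.

```r
# made-up raw p-values for five hypothetical genes; one is missing
toy.de <- data.frame(id = paste0("gene", 1:5),
                     pval = c(0.001, 0.012, 0.04, 0.2, NA))

# Benjamini-Hochberg adjustment; NA p-values stay NA and are not
# counted among the tests
toy.de$padj <- p.adjust(toy.de$pval, method = "BH")

# a bare logical index turns NA comparisons into all-NA rows ...
n.logical <- nrow(toy.de[toy.de$padj < 0.05, ])

# ... whereas which() drops them
n.which <- nrow(toy.de[which(toy.de$padj < 0.05), ])

c(logical = n.logical, which = n.which)
```

Note that the third gene has a raw p-value below 0.05 but no longer passes the threshold after adjustment.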

In [None]:
# now attach the gene symbols to the data frames that we obtained
de.genes$symbol <- gene.data[de.genes$id, "gene_symbol"]
de.genes.pooled$symbol <- gene.data[de.genes.pooled$id, "gene_symbol"]

In [None]:
de.genes

In [None]:
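# now create a data frame with the genes in both selections and the HIV reporter,
# and let's see what happens; unique() avoids duplicated rows if the reporter
# is itself among the selected genes
selected.ids <- unique(c(de.genes$id, de.genes.pooled$id, "FILIONG01"))
X <- as.data.frame(t(merged[selected.ids, ]))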

In [None]:
# select the top candidates
top.genes <- head(de.genes,n=15)$id
top.genes.pooled <- head(de.genes.pooled,n=15)$id

In [None]:
# let's do a simple scatter plot of some of the most significant genes versus
# the expression of the GFP reporter
i <- 2
ggplot(X, aes_string(top.genes.pooled[i], "FILIONG01")) + geom_point() + 
labs(x = gene.data[top.genes.pooled[i], "gene_symbol"])

## Differential expression across responders versus non-responders
Now that we have an idea of how to perform the analysis, and its results make sense, let's go back to the question of whether there are any expression signatures that distinguish responders from non-responders.

In [None]:
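# let's select the treated cells; which() skips any cell that is missing from
# the sample sheet (e.g. the dead cells removed above)
treated.cells <- which(sampleSheet[colnames(merged), "status"] == "treated")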
treated <- merged[, treated.cells]
ntreated <- ncol(treated)

# define the responders
responder.cells <- which(treated["FILIONG01", ] > 0)
nonresponder.cells <- which(treated["FILIONG01", ] == 0)

In [None]:
# prepare the "factor" of responders
responders <- factor(rep("responder", ntreated), levels = c("responder", "nonresponder"))
responders[nonresponder.cells] <- "nonresponder"

In [None]:
table(responders)
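
The two-step construction of `responders` above can also be written in a single step. The sketch below uses a made-up vector of reporter counts (hypothetical values, not taken from the data) and assumes, as above, that any nonzero reporter count marks a responder.

```r
# made-up reporter counts for six hypothetical treated cells
toy.reporter <- c(cell1 = 0, cell2 = 3, cell3 = 0,
                  cell4 = 12, cell5 = 1, cell6 = 0)

# one-step equivalent of the responder / nonresponder factor
toy.responders <- factor(ifelse(toy.reporter > 0, "responder", "nonresponder"),
                         levels = c("responder", "nonresponder"))

table(toy.responders)
```

Fixing the `levels` explicitly keeps "responder" as the first level even if one of the two classes happens to be empty.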

In [None]:
# prepare the data for DESeq
treated.int <- as.data.frame(lapply(treated, as.integer))
rownames(treated.int) <- rownames(treated)

In [None]:
# define a new object with only the treated cells
cds.treated <- newCountDataSet(treated.int, responders)

In [None]:
# estimate size factors
cds.treated <- estimateSizeFactors(cds.treated)

# estimate the dispersions
cds.treated <- estimateDispersions(cds.treated, sharingMode="gene-est-only")
cds.treated.pooled <- estimateDispersions(cds.treated, method="per-condition", fitType="local")

In [None]:
# do the differential expression analysis
de.responders <- nbinomTest(cds.treated, "responder", "nonresponder")
de.responders.pooled <- nbinomTest(cds.treated.pooled, "responder", "nonresponder")

In [None]:
# perform the filtering and sorting here
de.responder.genes <- find.significant.genes(de.responders)
de.responder.genes.pooled <- find.significant.genes(de.responders.pooled)

Let's now have a look at the results.

In [None]:
de.responder.genes

In [None]:
# let's do the scatter plot
X <- data.frame(t(treated["ENSG00000162927.13",]), t(treated["FILIONG01", ]))
ggplot(X, aes_string("ENSG00000162927.13", "FILIONG01")) + geom_point()

The result is that only one gene appears to be differentially expressed between responders and non-responders. It is "PUS10", an enzyme that catalyzes the pseudouridylation of RNA.