In [None]:
# ggplot stuff
library(ggplot2)
library(RColorBrewer)
theme_set(theme_bw())

# 2019-03-11 First look at new data
I processed the FASTQ files from the new batch of experiments and generated a single expression matrix and a single sample sheet, to make life easier. Let's have a look at the data.

In [None]:
# load the expression matrix
data.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data"
matrix.fname <- sprintf('%s/matrices/exprMatrix.tsv', data.dir)
exprMatrix <- read.table(matrix.fname, header = TRUE, row.names = 1,
                         sep = "\t", check.names = FALSE)

In [None]:
# load the sample sheet
sample.sheet.fname <- sprintf("%s/metadata/sampleSheet.tsv", data.dir)
sampleSheet <- read.delim(sample.sheet.fname, header = TRUE, row.names = 1)

In [None]:
# load gene annotations file
gene.annotations <- sprintf("%s/matrices/gene_annotations.tsv", data.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, sep = "\t",
                        row.names = 1, stringsAsFactors = FALSE)
gene.data <- subset(gene.data, rownames(gene.data) %in% rownames(exprMatrix))

Let's do the PCA and try to retrieve what we already know.

In [None]:
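# remove genes that have (almost) no expression: total count of at most 1
norm.exprMatrix <- exprMatrix[rowSums(exprMatrix)>1, ]

# normalize each cell (column) by its total count: the matrix is transposed
# so that cells are rows, divided by the row sums, and transposed back
total <- colSums(norm.exprMatrix)
norm.exprMatrix <- t(norm.exprMatrix)
norm.exprMatrix <- norm.exprMatrix / rowSums(norm.exprMatrix)
norm.exprMatrix <- t(norm.exprMatrix)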
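
The double transpose above is equivalent to dividing every column (cell) by its column sum. A minimal sketch on a made-up count matrix, using `sweep()` as an alternative formulation (the toy matrix and the variable names are illustrative, not part of the analysis):

```r
# toy count matrix: 3 genes (rows) x 2 cells (columns); values are made up
toy <- matrix(c(1, 3, 6,
                0, 5, 5), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

# approach used in the notebook: transpose, divide by row sums, transpose back
norm.t <- t(t(toy) / rowSums(t(toy)))

# equivalent formulation: divide each column by its own sum
norm.sweep <- sweep(toy, 2, colSums(toy), "/")

# after normalization the values of every cell add up to 1
print(colSums(norm.sweep))
```

Either form works; `sweep()` just makes it explicit that the division runs over columns.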

In [None]:
# do the PCA (cells as rows); `scale.` is the full name of the prcomp argument
exprMatrix.pca <- prcomp(t(norm.exprMatrix), scale. = TRUE)
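
To judge how much of the variation the first components capture, the proportion of variance can be read off the `sdev` field returned by `prcomp`. A small sketch on simulated data (the random matrix only stands in for the real expression matrix):

```r
set.seed(1)
# simulated data: 20 "cells" (rows) x 5 "genes" (columns)
toy <- matrix(rnorm(100), nrow = 20, ncol = 5)

toy.pca <- prcomp(toy, scale. = TRUE)

# proportion of variance explained by each principal component
var.explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(var.explained, 3))
```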

In [None]:
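# prepare for plotting
pca <- as.data.frame(exprMatrix.pca$x)
# the batch identifier is the first five characters of the cell name
pca$batch <- substring(colnames(norm.exprMatrix), 1, 5)
# look the labels up by cell name rather than relying on row order
pca$label <- sampleSheet[rownames(pca), "label"]
pca$total <- total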

In [None]:
options(repr.plot.width = 6.5, repr.plot.height = 2)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=total))  +
scale_colour_gradient(low="blue", high="red") + theme_bw()

In [None]:
options(repr.plot.width = 4, repr.plot.height = 3)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=total))+
scale_colour_gradient(low="blue", high="red") +
xlim(-15, 40) +
ylim(-50, 20)

Okay, so when we do the PCA with all the samples together, PC1 again separates the samples that have low total expression from the others.

I'll now remove those cells, assuming that they are dead.

In [None]:
dead.cells <- rownames(pca)[pca$PC1 > 0]
alive.cells <- rownames(pca)[pca$PC1 < 0]
table(sampleSheet[dead.cells, "label"])

The great majority of the dead cells are the ones that have been treated with SAHA, which is known to be very toxic to the cells. So this makes sense, so far.
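
One caveat on the `PC1 > 0` cut: the sign of a principal component is arbitrary, so the side on which the low-count cells fall can flip between runs or datasets. A hedged sketch of one way to make the cut reproducible, by orienting the component against the total counts before thresholding (`orient.pc` is a hypothetical helper, and the numbers are made up):

```r
# flip a principal component so that it is negatively correlated with the
# total counts, i.e. low-count cells end up on the positive side
orient.pc <- function(pc, total) {
    if (cor(pc, total) > 0) -pc else pc
}

# made-up example: three healthy cells and two low-count cells
total <- c(cell1 = 9000, cell2 = 8500, cell3 = 9500, cell4 = 300, cell5 = 200)
pc1   <- c(cell1 = 4.1, cell2 = 3.8, cell3 = 4.4, cell4 = -6.0, cell5 = -6.3)

oriented <- orient.pc(pc1, total)
dead <- names(oriented)[oriented > 0]
print(dead)
```

On the real data one would pass `pca$PC1` and `pca$total`; whether a threshold of exactly 0 is appropriate still has to be checked on the plot.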

In [None]:
# remove the dead cells from the samples
clean.exprMatrix <- exprMatrix[, alive.cells]

Now let's redo the PCA without the dead cells.

In [None]:
# remove genes that have (almost) no expression: total count of at most 1
norm.clean.exprMatrix <- clean.exprMatrix[rowSums(clean.exprMatrix)>1, ]

# normalize each cell (column) by its total count: the matrix is transposed
# so that cells are rows, divided by the row sums, and transposed back
total <- colSums(norm.clean.exprMatrix)
norm.clean.exprMatrix <- t(norm.clean.exprMatrix)
norm.clean.exprMatrix <- norm.clean.exprMatrix / rowSums(norm.clean.exprMatrix)
norm.clean.exprMatrix <- t(norm.clean.exprMatrix)

In [None]:
# do the PCA (cells as rows); `scale.` is the full name of the prcomp argument
clean.exprMatrix.pca <- prcomp(t(norm.clean.exprMatrix), scale. = TRUE)

In [None]:
# prepare for plotting
clean.pca <- as.data.frame(clean.exprMatrix.pca$x)
# the batch identifier is the first five characters of the cell name
clean.pca$batch <- substring(colnames(clean.exprMatrix), 1, 5)
clean.pca$label <- sampleSheet[alive.cells, "label"]
clean.pca$total <- total

In [None]:
options(repr.plot.width = 5, repr.plot.height = 3)
gg <- ggplot(clean.pca, aes(PC1, PC2)) + geom_point(aes(color=label)) +
scale_color_manual(values=c("red", "purple", "blue", "black", "magenta", "brown")) +
theme_bw()
# pass the plot explicitly instead of relying on the last plot created
ggsave(filename = "../figures/PCA-new-experiments.png", plot = gg, width = 5, height = 3)
print(gg)

This is a beautiful plot. It shows clustering of the points in three different clouds, each of them corresponding to a different treatment. This shows that there is indeed a large effect of the treatment on the gene expression patterns.

Let's now have a look at the HIV.

In [None]:
# add the HIV expression (log-transformed) to our data frame
options(repr.plot.width = 4.5, repr.plot.height = 3)
# unlist() turns the one-row data frame into a plain numeric vector
clean.pca$HIV <- log1p(unlist(clean.exprMatrix["FILIONG01", ]))
gg <- ggplot(clean.pca, aes(PC1, PC2)) + geom_point(aes(color=HIV)) +
scale_colour_gradient(low="blue", high="red") +
theme_bw()
ggsave(filename = "../figures/PCA-new-experiments-HIV.png", plot = gg, width = 5, height = 3)
print(gg)

This plot shows the same result as we had before the second round of experiments came in: whether or not the HIV insertion gets activated does not really depend on *global* gene expression patterns, and is probably hidden in some local features.
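
The visual impression could be backed up by a number, for instance the rank correlation between the HIV signal and each of the leading components. A minimal sketch under that assumption (`pc.correlation` is a hypothetical helper and the data frame is made up; on the real data one would pass `clean.pca` and `clean.pca$HIV`):

```r
# Spearman correlation between a signal and selected principal components
pc.correlation <- function(pca.df, signal, pcs = c("PC1", "PC2")) {
    sapply(pcs, function(pc) cor(pca.df[[pc]], signal, method = "spearman"))
}

# made-up example in which the signal increases monotonically with PC1
toy.pca <- data.frame(PC1 = c(-2, -1, 0, 3), PC2 = c(1, -1, 2, -2))
toy.signal <- c(0, 0.5, 1.2, 4)

toy.cor <- pc.correlation(toy.pca, toy.signal)
print(toy.cor)
```

Correlations close to zero on the leading components would be consistent with the reading above, although they would not rule out an association with later components.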

## Differential expression analysis

In [None]:
# DESeq
library(DESeq)

We now want to attack the central question of this study: which genes are associated with HIV reactivation by latency-reversal drugs? To do this, we will use the `DESeq` package, which allows us to do differential expression analysis with some robustness.

Before doing the whole thing, I will prepare a couple of functions that will allow me to run the analysis in a simple way.

In [None]:
do.DEA <- function(expr.matrix, groups, gene.data,
                   g1, g2, method = "per-condition") {
    
    # cast to integer the expression matrix, otherwise DESeq will complain
    expr.matrix.int <- as.data.frame(lapply(expr.matrix, as.integer))
    
    # give the same names to the new matrix as the ones before
    rownames(expr.matrix.int) <- rownames(expr.matrix)

    # this is the basic data structure that DESeq understands
    cds <- newCountDataSet(expr.matrix.int, groups)

    # estimate size factors
    cds <- estimateSizeFactors(cds)

    # estimate dispersion
    if (method == "per-gene") {
        cds <- estimateDispersions(cds, sharingMode="gene-est-only")
    }
    else if (method == "per-condition"){
        cds <- estimateDispersions(cds, method="per-condition", fitType="local")
    }
    else {
        stop("Invalid method")
    }
    
    # do the differential expression analysis
    de.test <- nbinomTest(cds, g1, g2)
    
    # now attach the information on the genes to the data frames that we obtained
    de.test$symbol <- gene.data[de.test$id, ]
    
    # return
    de.test
}
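
For orientation, the size factors computed by `estimateSizeFactors` follow the median-of-ratios idea: each sample's counts are divided by the per-gene geometric mean across samples, and the median of those ratios is the sample's size factor. A base-R sketch of that idea on a made-up matrix (`size.factors.sketch` is an illustrative name, not a DESeq function, and the package's own implementation should be preferred in practice):

```r
size.factors.sketch <- function(counts) {
    # per-gene log geometric mean across samples (-Inf if any count is zero)
    log.geo.means <- rowMeans(log(counts))
    usable <- is.finite(log.geo.means)
    # per-sample median of the log ratios, back on the linear scale
    apply(counts, 2, function(k) exp(median((log(k) - log.geo.means)[usable])))
}

# made-up counts: the second sample is sequenced exactly twice as deep
# (the last gene has a zero count and is therefore ignored)
toy <- cbind(s1 = c(10, 20, 30, 0), s2 = c(20, 40, 60, 5))
sf <- size.factors.sketch(toy)
print(sf)
```

The ratio of the two size factors recovers the twofold difference in depth, which is what the subsequent tests correct for.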

In [None]:
# this function filters and sorts the results of the differential
# expression analysis
find.significant.genes <- function(de.result, alpha = 0.05) {

  # keep significant genes based on FDR-adjusted p-values; genes with an NA
  # adjusted p-value or a non-finite log2 fold change are dropped
  filtered <- de.result[!is.na(de.result$padj) &
                        (de.result$padj < alpha) &
                        is.finite(de.result$log2FoldChange),]

  # order by p-value and return the result explicitly
  filtered[order(filtered$pval),]
}
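
A detail worth keeping in mind when filtering on `padj`: logical indexing with `NA` returns a row of `NA`s rather than dropping the row, and adjusted p-values may contain `NA`. A toy illustration (the data frame is made up):

```r
toy <- data.frame(id = c("a", "b", "c"), padj = c(0.01, NA, 0.5),
                  stringsAsFactors = FALSE)

# plain logical indexing keeps an all-NA row for gene "b"
with.na <- toy[toy$padj < 0.05, ]

# excluding NA explicitly keeps only the genuinely significant gene
without.na <- toy[!is.na(toy$padj) & toy$padj < 0.05, ]

print(nrow(with.na))
print(nrow(without.na))
```

This matters when counting significant genes with `dim()`, since the `NA` rows inflate the count.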

In [None]:
# first test to see whether everything works well: differential expression analysis between
# cells that are treated with SAHA and latent cells that are not treated
groups <- sampleSheet[alive.cells, "label"]
de.test <- do.DEA(clean.exprMatrix, groups, gene.data, "J-LatA2+DMSO", "J-LatA2+SAHA")
de.genes <- find.significant.genes(de.test)

In [None]:
dim(de.genes)

Okay, this seems kind of right. It is to be expected that thousands of genes are differentially expressed in this case.

Moving on, let's now restrict ourselves to the cells that have been treated with SAHA, and let's divide them into the ones that have a reactivated HIV insertion and those that don't.

### SAHA-treated cells, previous round
The first thing that I want to check is that we are able to recover the results of the previous batch of experiments.

In [None]:
# select cells that are alive and that have been treated with SAHA, and that belong to the first
# two plates of cells
SAHA.treated <- intersect(rownames(sampleSheet)[sampleSheet$label == "J-LatA2+SAHA"], alive.cells)
SAHA.treated <- subset(SAHA.treated, startsWith(SAHA.treated, "P2449") | startsWith(SAHA.treated, "P2458"))

# get the expression matrix corresponding to those cells
SAHA <- exprMatrix[, SAHA.treated]

# prepare the groups of responders and non-responders
SAHA.responders <- factor(rep("non-responder", ncol(SAHA)), levels = c("non-responder", "responder"))
SAHA.responders[SAHA["FILIONG01", ] > 0] <- "responder"
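
The responder assignment used here boils down to a single rule: a cell is a responder if it has at least one count in the HIV row (`FILIONG01`). A compact sketch of the same rule on a made-up count vector:

```r
# made-up HIV counts for five cells
hiv.counts <- c(c1 = 0, c2 = 3, c3 = 0, c4 = 12, c5 = 1)

# fixing the levels keeps both groups present even if one of them is empty
responders <- factor(ifelse(hiv.counts > 0, "responder", "non-responder"),
                     levels = c("non-responder", "responder"))
print(table(responders))
```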

In [None]:
# do the differential expression analysis
de.test.SAHA <- do.DEA(SAHA, SAHA.responders, gene.data,
                       "non-responder", "responder", method = "per-gene")
de.genes.SAHA <- find.significant.genes(de.test.SAHA, alpha = 0.1)

In [None]:
de.genes.SAHA

In this table we see that the PUS10 gene is still present, and that there is another component of the INTS complex, but it is not INTS1. However, the most differentially expressed gene in the group had not been identified earlier, and it is a gene that apparently has nothing to do with HIV: ALDH1B1, which is an aldehyde dehydrogenase.

### SAHA-treated cells
Let's move on and look globally at which genes are differentially expressed between responders and non-responders in (all) the SAHA-treated cells.

In [None]:
all.SAHA.treated <- intersect(rownames(sampleSheet)[sampleSheet$label == "J-LatA2+SAHA"], alive.cells)
all.SAHA <- exprMatrix[, all.SAHA.treated]
all.SAHA.responders <- factor(rep("non-responder", ncol(all.SAHA)),
                              levels = c("non-responder", "responder"))
all.SAHA.responders[all.SAHA["FILIONG01", ] > 0] <- "responder"

In [None]:
table(all.SAHA.responders)

In [None]:
de.test.all.SAHA <- do.DEA(all.SAHA, all.SAHA.responders, gene.data,
                           "non-responder", "responder", method = "per-condition")
de.genes.all.SAHA <- find.significant.genes(de.test.all.SAHA, alpha = 0.1)

In [None]:
de.genes.all.SAHA

### PMA-treated cells

In [None]:
all.PMA.treated <- intersect(rownames(sampleSheet)[sampleSheet$label == "J-LatA2+PMA"], alive.cells)
all.PMA <- exprMatrix[, all.PMA.treated]
all.PMA.responders <- factor(rep("non-responder", ncol(all.PMA)),
                              levels = c("non-responder", "responder"))
all.PMA.responders[all.PMA["FILIONG01", ] > 0] <- "responder"

In [None]:
table(all.PMA.responders)

In [None]:
de.test.all.PMA <- do.DEA(all.PMA, all.PMA.responders, gene.data,
                          "non-responder", "responder", method = "per-gene")
de.genes.all.PMA <- find.significant.genes(de.test.all.PMA, alpha = 0.1)

In [None]:
de.genes.all.PMA