# 2018-11-14 PCA responders
Let's look at whether PCA can give us significant information about responders versus non-responders.

In [None]:
library(ggplot2)
library(RColorBrewer)

In [None]:
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

In [None]:
# basic data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
sample.names <- c("P2449", "P2458")

# init data structures that will hold our data
exprMatrices <- list()
sampleSheets <- list()

# load data
for (sample.name in sample.names) {
    
    # file names
    matrix.fname <- sprintf("%s/%s.tsv.gz", matrices.dir, sample.name)
    sampleSheet.fname <- sprintf("%s/monocle/%s.pd.tsv", matrices.dir, sample.name)

    # parse data
    exprMatrices[[sample.name]] <- read.table(matrix.fname, header = TRUE, row.names = 1,
                                sep = "\t", check.names = FALSE)
    sampleSheets[[sample.name]] <- read.delim(sampleSheet.fname, header = TRUE, row.names = 1)
}

# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, row.names = 1, sep = "\t")

## Unnormalized P2449

In [None]:
sample.name <- "P2449"

# select only one of the matrices
exprMatrix <- exprMatrices[[sample.name]]

# select only treated cells
jlat.treated <- sampleSheets[[sample.name]]$label == "J-Lat+SAHA"
exprMatrix <- exprMatrix[, jlat.treated]

# select only alive cells
totalExpression <- colSums(exprMatrix)
alive <- totalExpression > 100000
exprMatrix <- exprMatrix[, alive]

# store the log-expression of FILIONG01 (the HIV variable), then exclude it from the matrix
HIV <- log(exprMatrix["FILIONG01", ]+1)
exprMatrix <- exprMatrix[rownames(exprMatrix) != 'FILIONG01', ]

# remove genes with a total count of at most 1
exprMatrix <- exprMatrix[rowSums(exprMatrix)>1, ]

In [None]:
# do the PCA
exprMatrix.pca <- prcomp(t(exprMatrix), center = TRUE)

In [None]:
# prepare a data.frame for plotting
pca <- as.data.frame(exprMatrix.pca$x)
pca$totalExpression <- colSums(exprMatrix)
# HIV is a one-row data.frame: flatten it to a plain numeric vector
pca$HIV <- as.numeric(t(HIV))

In [None]:
options(repr.plot.width = 8, repr.plot.height = 2)
gg1 <- ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color = totalExpression)) +
scale_colour_gradient(low="blue", high="red") + theme_bw()
gg2 <- ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color = HIV))  +
scale_colour_gradient(low="blue", high="red") + theme_bw()
multiplot(gg1, gg2, cols=2)

It's clear from these first plots that the first principal component is essentially proportional to the total expression. Let's look at this more closely.
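
A quick numerical check is the correlation between PC1 and the total expression. The following is a minimal sketch on simulated counts: the `toy` matrix, its dimensions and the simulated depths are all made up for illustration and are not taken from the data above.

```r
set.seed(1)
# hypothetical toy data: 200 genes x 30 cells, each cell with its own depth
depth <- runif(30, min = 0.5, max = 5)
gene.means <- rexp(200, rate = 1/50)
toy <- sapply(depth, function(d) rpois(200, lambda = gene.means * d))
rownames(toy) <- paste0("gene", 1:200)
colnames(toy) <- paste0("cell", 1:30)

# unscaled PCA on the cells, as done for the real matrix
toy.pca <- prcomp(t(toy), center = TRUE)

# correlation between PC1 and the total counts of each cell
# (the sign of a principal component is arbitrary, so look at the absolute value)
pc1.vs.total <- cor(toy.pca$x[, "PC1"], colSums(toy))
print(abs(pc1.vs.total))
```

On the real data the same call would presumably read `cor(pca$PC1, pca$totalExpression)`.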

In [None]:
options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(pca, aes(PC1, totalExpression)) + geom_point() + geom_smooth(method='lm') +
theme_bw()

Since PC1 mostly reflects the total expression, I'll try plotting the deeper principal components.
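
How much of the variance each component carries can be read off the `sdev` field returned by `prcomp`. A minimal sketch on random data (the `toy` matrix is hypothetical and only stands in for `t(exprMatrix)`):

```r
set.seed(2)
# hypothetical toy matrix: 20 cells (rows) x 50 genes (columns)
toy <- matrix(rnorm(20 * 50), nrow = 20)
toy.pca <- prcomp(toy, center = TRUE)

# proportion of the total variance carried by each principal component
var.explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(head(var.explained, 5), 3))

# the same numbers (rounded) are reported by summary()
importance <- summary(toy.pca)$importance["Proportion of Variance", ]
```

If PC1 carries most of the variance and tracks the total expression, the deeper components are where any biological signal would have to show up.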

In [None]:
options(repr.plot.width = 4, repr.plot.height = 3)
ggplot(pca, aes(PC2, PC3)) + geom_point(aes(color = HIV)) +
scale_colour_gradient(low="blue", high="red") + theme_bw()

There's not a very clear pattern emerging here. We have few points though, so it would be interesting to perform the same analysis but with the two samples pooled together, and normalized.

## Normalized pooled samples

Here I take the two samples together, normalize the expression of the genes by the total expression of the cell, and do the PCA.
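
As a reminder of what the normalization does, here is a minimal sketch on a tiny made-up matrix (the `toy` counts are hypothetical). It divides every column, i.e. every cell, by its own total, once with `sweep` and once with the double transpose.

```r
# hypothetical toy counts: 3 genes (rows) x 2 cells (columns)
toy <- matrix(c(10, 30, 60,
                 1,  1,  2),
              nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))
totals <- colSums(toy)

# divide every column (cell) by its own total
toy.norm <- sweep(toy, 2, totals, "/")

# equivalent formulation with a double transpose
toy.norm2 <- t(t(toy) / totals)

print(toy.norm)
```

Note that a plain `toy / totals` would recycle `totals` element by element down each column, so the cells would not be divided by their own totals.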

In [None]:
# pool together the expression matrices
exprMatrix <- cbind(exprMatrices[["P2449"]], exprMatrices[["P2458"]])

# select only treated cells
# (the sample sheets have one row per cell, so they are stacked with rbind)
sampleSheet <- rbind(sampleSheets[["P2449"]], sampleSheets[["P2458"]])
jlat.treated <- sampleSheet$label == "J-Lat+SAHA"
exprMatrix <- exprMatrix[, jlat.treated]

# select only alive cells
totalExpression <- colSums(exprMatrix)
alive <- totalExpression > 100000
exprMatrix <- exprMatrix[, alive]

# store the log-expression of FILIONG01 (the HIV variable), then exclude it from the matrix
HIV <- log(exprMatrix["FILIONG01", ]+1)
exprMatrix <- exprMatrix[rownames(exprMatrix) != 'FILIONG01', ]

# exclude genes that have zero expression in all the cells
genesTotalExpression <- rowSums(exprMatrix)
exprMatrix <- exprMatrix[genesTotalExpression>0, ]

# now we can normalize by the total expression
totalExpression <- totalExpression[alive]

# normalize
exprMatrix <- t(t(exprMatrix) / totalExpression)

In [None]:
# do the PCA
exprMatrix.pca <- prcomp(t(exprMatrix), center = TRUE, scale. = TRUE)

In [None]:
# prepare a data.frame for plotting
pca <- as.data.frame(exprMatrix.pca$x)
# exprMatrix is normalized at this point, so use the raw per-cell totals
pca$totalExpression <- totalExpression
# HIV is a one-row data.frame: flatten it to a plain numeric vector
pca$HIV <- as.numeric(t(HIV))

# add information on the batch (first five characters of the cell name)
pca$Batch <- substring(rownames(pca), 1, 5)

# add a binary "responder" variable
pca$responder <- ifelse(pca$HIV > 0, "Responder", "Non-responder")

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)
gg1 <- ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color = HIV))  +
scale_colour_gradient(low="blue", high="red") + theme_bw()
gg2 <- ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color = Batch))  + theme_bw()
gg3 <- ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color = responder))  + theme_bw()
multiplot(gg1, gg2, gg3, cols=2)

Here the PCA reveals two groups of cells, but the grouping is not related to the level of HIV expression, to whether the cells are responders or non-responders, or to the batch.

In [None]:
options(repr.plot.width = 6, repr.plot.height = 3)
gg1 <- ggplot(pca, aes(PC1, HIV)) + geom_point() + geom_smooth(method='lm') +
theme_bw()
gg2 <- ggplot(pca, aes(PC2, HIV)) + geom_point() + geom_smooth(method='lm') +
theme_bw()
multiplot(gg1, gg2, cols=2)

In [None]:
PC1vsHIV <- lm(PC1 ~ HIV, data=pca)
summary(PC1vsHIV)

In [None]:
PC2vsHIV <- lm(PC2 ~ HIV, data=pca)
summary(PC2vsHIV)

Regressing the first and second principal components on the level of HIV expression suggests that there is a relationship, but I don't find the results very convincing.
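
To judge how convincing such a fit is, it helps to look at the fraction of variance explained next to the p-value of the slope. Below is a minimal sketch of how both can be pulled out of the `summary()` of a linear fit; the `toy` data frame is simulated and hypothetical, it only mimics the `PC1` and `HIV` columns of `pca`.

```r
set.seed(3)
# hypothetical data: a weak linear relation between a PC and HIV expression
toy <- data.frame(HIV = rexp(60))
toy$PC1 <- 0.5 * toy$HIV + rnorm(60, sd = 2)

fit <- summary(lm(PC1 ~ HIV, data = toy))

# fraction of the variance of PC1 explained by HIV
r.squared <- fit$r.squared
# p-value of the slope
p.value <- fit$coefficients["HIV", "Pr(>|t|)"]

print(c(r.squared = r.squared, p.value = p.value))
```

A small p-value together with a small R-squared would point to a detectable but weak relationship, which matches the impression given by the scatter plots.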

## Redefining dead cells

So far I have defined dead cells as those with fewer than 100000 transcripts per cell. However, Guillaume showed me a better way of doing it, based on PCA.

In [None]:
# pool together the expression matrices
exprMatrix <- cbind(exprMatrices[["P2449"]], exprMatrices[["P2458"]])

# remove genes with a total count of at most 1
exprMatrix <- exprMatrix[rowSums(exprMatrix)>1, ]

# normalize each cell (column) by its total expression
total <- colSums(exprMatrix)
exprMatrix <- t(exprMatrix)
exprMatrix <- exprMatrix / rowSums(exprMatrix)
exprMatrix <- t(exprMatrix)

In [None]:
# do the PCA
exprMatrix.pca <- prcomp(t(exprMatrix), center = TRUE, scale. = TRUE)

In [None]:
# prepare for plotting
pca <- as.data.frame(exprMatrix.pca$x)
pca$batch <- substring(colnames(exprMatrix), 1, 5)
sampleSheet <- rbind(sampleSheets[["P2449"]], sampleSheets[["P2458"]])
pca$label <- sampleSheet$label
pca$total <- total

In [None]:
options(repr.plot.width = 3.5, repr.plot.height = 2)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=total))  +
scale_colour_gradient(low="blue", high="red") + theme_bw()

In [None]:
options(repr.plot.width = 4, repr.plot.height = 2)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=batch)) + theme_bw()

In [None]:
options(repr.plot.width = 4, repr.plot.height = 2.5)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=label)) + theme_bw()

This last plot is very interesting. It shows three very distinct groups of cells. The dead cells are on the right and the live cells are on the left. Among the live cells, the untreated ones are at the top left and the treated ones at the bottom left.

Let's define the three groups *based only on the value of the PCA*.

In [None]:
# dead cells: right-hand cluster along PC1
dead <- pca$PC1 > -5
dead.cells <- rownames(pca)[dead]

# live cells, split by treatment along PC2
nontreated <- !dead & pca$PC2 > 0
nontreated.cells <- rownames(pca)[nontreated]
treated <- !dead & pca$PC2 < 0
treated.cells <- rownames(pca)[treated]

In [None]:
# one row per cell: name, total expression and PCA-based status (default: dead)
totalExpression <- data.frame(cellnames = names(total), total = total,
                              status = factor(rep("dead", ncol(exprMatrix)),
                                             c("dead", "treated", "nontreated")))
totalExpression[treated.cells, "status"] <- "treated"
totalExpression[nontreated.cells, "status"] <- "nontreated"

In [None]:
options(repr.plot.width = 15, repr.plot.height = 6)
ggplot(totalExpression, aes(x = cellnames, y = total)) +
geom_bar(aes(fill=status),stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

This plot is particularly interesting because it shows that a hard threshold on the total expression cannot reliably separate dead cells from live ones. If the threshold is too high, we lose live cells; if it is too low, we keep dead cells. Let's save this information for future reference.
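
One way to quantify this would be to cross-tabulate the PCA-based status against the outcome of a hard threshold. A minimal sketch with made-up numbers (the `toy` data frame is hypothetical and only mimics the `total` and `status` columns of `totalExpression`):

```r
# hypothetical toy table with the same columns as `totalExpression`
toy <- data.frame(
    total  = c(50000, 90000, 150000, 250000, 80000, 120000, 300000),
    status = c("dead", "dead", "dead", "dead", "treated", "nontreated", "treated")
)

# classification obtained from a hard threshold on the total expression
threshold <- 100000
toy$above.threshold <- toy$total > threshold

# rows: PCA-based status; columns: result of the hard threshold
confusion <- table(toy$status, toy$above.threshold)
print(confusion)
```

On the real data the same call would read `table(totalExpression$status, totalExpression$total > 100000)`. Dead cells above the threshold and live cells below it are the ones a hard threshold gets wrong.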

In [None]:
# look up the label of each cell in the pooled sample sheet
totalExpression$label <- as.character(sampleSheet[rownames(totalExpression), "label"])

In [None]:
# write the information to a separate file
# (note: the file is tab-separated despite its .csv extension)
write.table(x = totalExpression[, c('cellnames', 'status', 'label')],
            file = sprintf('%s/samplesheet.csv', matrices.dir),
            row.names = FALSE,
            quote = FALSE, sep='\t')