In [None]:
# ggplot stuff
library(ggplot2)
library(RColorBrewer)
theme_set(theme_bw())

# 2019-03-05-New_matrices

New gene expression matrices have been calculated by the CNAG. I'll try to get the matrices and the corresponding data sheets.

In [None]:
# directory containing the expression matrix data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"

# names of the new sample plates
new.matrix.names <- c("P2769", "P2770", "P2771")

# init the data structures that will contain the data
expr.matrices <- list()

# do a loop and load all the data. For the moment, I'll keep the various plates separate.
for (name in new.matrix.names) {
    # build the matrix file name and load it
    matrix.fname <- sprintf("%s/%s.tsv.gz", matrices.dir, name)
    expr.matrices[[name]] <- read.table(matrix.fname, header = TRUE, row.names = 1,
                                       sep = "\t", check.names = FALSE)
}

In [None]:
# load the sample sheet
sample.sheet.fname <- sprintf("%s/samplesheet_2.tsv", matrices.dir)
sample.sheet <- read.delim(sample.sheet.fname, header = TRUE, row.names = 1)

Okay, now that we have loaded all the data, we can look at some very basic things. First, let's look at the total number of reads that we have in each cell, that is, the sum of the reads over all the genes in a particular cell.

In [None]:
for (name in new.matrix.names) {
    # total number of reads per cell (one column of the matrix per cell)
    expr.matrix <- expr.matrices[[name]]
    total.reads <- data.frame(sum = colSums(expr.matrix))
    total.reads$cell <- rownames(total.reads)
    total.reads$label <- sample.sheet[total.reads$cell, "label"]

    # bar plot of the total reads per cell, colored by sample label
    options(repr.plot.width = 15, repr.plot.height = 6)
    gg <- ggplot(total.reads, aes(x = cell, y = sum)) +
        geom_col(aes(fill = label))
    print(gg)
}

Okay, so we have sort of the same problem that we had before: some of the "cells" in these matrices have so few reads that they might not be cells at all.
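
One way to make this concrete is to flag the cells whose total read count falls below a cutoff. The sketch below runs on a small made-up count matrix; the helper `flag.low.read.cells` and the cutoff `min.reads` are assumptions for illustration, not values derived from these plates.

```r
# hypothetical helper: flag cells (columns) whose total read count is below a cutoff
flag.low.read.cells <- function(expr.matrix, min.reads) {
    total.reads <- colSums(expr.matrix)
    total.reads < min.reads
}

# toy count matrix: 3 genes x 4 cells (made-up numbers, filled column by column)
toy.counts <- matrix(c(10, 0, 5,
                        0, 1, 0,
                       20, 7, 3,
                        0, 0, 0),
                     nrow = 3,
                     dimnames = list(c("g1", "g2", "g3"),
                                     c("c1", "c2", "c3", "c4")))

# min.reads = 5 is an arbitrary cutoff chosen for this toy example
low.cells <- flag.low.read.cells(toy.counts, min.reads = 5)
print(names(which(low.cells)))
```

On the real data, a sensible cutoff would have to be chosen by looking at the distribution of total reads per cell.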

In [None]:
# pool together the expression matrices
exprMatrix <- cbind(expr.matrices[["P2769"]],
                    expr.matrices[["P2770"]],
                    expr.matrices[["P2771"]])

# remove genes with at most one read in total across all cells
exprMatrix <- exprMatrix[rowSums(exprMatrix) > 1, ]

# normalize each cell (column) by its total read count
total <- colSums(exprMatrix)
exprMatrix <- t(exprMatrix)
exprMatrix <- exprMatrix / rowSums(exprMatrix)
exprMatrix <- t(exprMatrix)
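
As a quick sanity check of the normalization step, the same transpose-divide-transpose idiom can be run on a toy matrix: afterwards, every column (cell) should sum to 1. The matrix below is made up for illustration.

```r
# toy count matrix: 2 genes x 3 cells (made-up numbers, filled column by column)
toy <- matrix(c(1, 3,
                2, 2,
                5, 15),
              nrow = 2,
              dimnames = list(c("g1", "g2"), c("c1", "c2", "c3")))

# divide each cell (column) by its total: transpose, divide the rows, transpose back
toy.norm <- t(t(toy) / colSums(toy))

# each column should now sum to 1
print(colSums(toy.norm))
```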

In [None]:
# do the PCA
exprMatrix.pca <- prcomp(t(exprMatrix), scale. = TRUE)
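
To judge how much of the variation the first components capture, the proportion of variance explained can be computed from the `sdev` element returned by `prcomp`. The sketch below uses random toy data rather than the real matrix; `toy.data`, `toy.pca`, and `var.explained` are illustrative names.

```r
set.seed(1)
# toy data: 20 "cells" (rows) x 5 "genes" (columns), random values
toy.data <- matrix(rnorm(100), nrow = 20)

toy.pca <- prcomp(toy.data, scale. = TRUE)

# proportion of variance explained by each principal component
var.explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(var.explained, 3))
```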

In [None]:
# prepare for plotting
pca <- as.data.frame(exprMatrix.pca$x)
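pca$batch <- substring(colnames(exprMatrix), 1, 5)
# look up the labels by cell name, so that the row order of the sample sheet does not matter
pca$label <- sample.sheet[rownames(pca), "label"]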
pca$total <- total

In [None]:
options(repr.plot.width = 6.5, repr.plot.height = 2)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=total))  +
scale_colour_gradient(low="blue", high="red") + theme_bw()

The lonely point on the far right is clearly an outlier, maybe one of the cells that do not have expression at all.
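
A quick way to check this is to pull out the cell with the largest PC1 score and look at its total read count. The sketch below does this on a made-up data frame that has the same column names as `pca`; all the values are invented.

```r
# toy stand-in for the `pca` data frame (made-up values)
toy.pca.df <- data.frame(PC1 = c(-2.1, 0.4, 85.0, 1.3),
                         PC2 = c(0.5, -1.2, 0.1, 0.9),
                         total = c(12000, 9500, 3, 11000),
                         row.names = c("c1", "c2", "c3", "c4"))

# name of the cell with the largest PC1 score
outlier <- rownames(toy.pca.df)[which.max(toy.pca.df$PC1)]
print(outlier)

# total read count of that cell
print(toy.pca.df[outlier, "total"])
```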

In [None]:
options(repr.plot.width = 4, repr.plot.height = 3)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=total))  +
scale_colour_gradient(low="blue", high="red") + theme_bw() + xlim(-30, 40)

In [None]:
options(repr.plot.width = 4, repr.plot.height = 3)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=batch)) + theme_bw()  + xlim(-30, 40)

No obvious batch effect is visible, at least in this plot. Let's look at the cell identity.
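
The visual impression can be backed up with a rough numerical check, for example by comparing the mean PC scores per plate. This is only a sketch on simulated scores: it assumes a data frame with `PC1`, `PC2`, and `batch` columns like `pca` above, and similar means do not by themselves rule out a batch effect.

```r
set.seed(42)
# simulated PC scores for three plates, with no built-in batch effect
toy.scores <- data.frame(PC1 = rnorm(90),
                         PC2 = rnorm(90),
                         batch = rep(c("P2769", "P2770", "P2771"), each = 30))

# mean PC1 and PC2 score per plate
batch.means <- aggregate(cbind(PC1, PC2) ~ batch, data = toy.scores, FUN = mean)
print(batch.means)
```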

In [None]:
options(repr.plot.width = 6, repr.plot.height = 4)
ggplot(pca, aes(PC1, PC2)) + geom_point(aes(color=label)) + theme_bw() + xlim(-20, 40)

So from here it is clear that the first principal component is still the one that captures whether the cells are dead or not, and the second principal component captures the global shift in gene expression patterns that occurs between pre- and post-treatment.

The treatment with SAHA is still the one that causes the largest shift in the global expression patterns. As is seen in this plot, the dead or dying cells are almost all treated with SAHA.
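
This association can be tabulated by flagging the cells on one side of a PC1 cutoff and cross-tabulating the flag against the treatment label. In the sketch below, the data, the `control` label, the cutoff `pc1.cutoff`, and the direction of the cutoff are all assumptions for illustration.

```r
# toy stand-in for the `pca` data frame (made-up values; "control" is a hypothetical label)
toy.cells <- data.frame(PC1 = c(25, 30, -3, 1, 22, -5, 0, 2),
                        label = c("SAHA", "SAHA", "SAHA", "control",
                                  "SAHA", "control", "control", "control"))

# assumption: cells with PC1 above this cutoff are the dead or dying ones
pc1.cutoff <- 15
toy.cells$dying <- toy.cells$PC1 > pc1.cutoff

# contingency table of treatment label versus the dying flag
dying.by.label <- table(label = toy.cells$label, dying = toy.cells$dying)
print(dying.by.label)
```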