# 2018-11-29 Comparing contingency tables
In my previous analyses, I ran into a serious problem: the Pearson correlation between a gene's expression and the expression of the GFP reporter was highly sensitive to whether a single point was included in or excluded from the analysis.
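
To make the sensitivity concrete, here is a minimal sketch on made-up numbers (not the experimental data): 20 hypothetical cells with almost no linear relationship, to which a single cell with very high values is added. The names `gene`, `gfp`, `r.without` and `r.with` are illustrative only.

```r
# Toy data (hypothetical, not from the experiment): 20 cells with almost
# no linear relationship between a gene and the reporter.
gene <- 1:20
gfp  <- rep(c(1, 2), times = 10)

r.without <- cor(gene, gfp, method = "pearson")

# Add one single cell with very high values for both measurements.
gene.out <- c(gene, 100)
gfp.out  <- c(gfp, 50)

r.with <- cor(gene.out, gfp.out, method = "pearson")

# The coefficient goes from close to zero to close to one.
print(c(without = r.without, with = r.with))
```

One extra point is enough to turn a negligible correlation into a seemingly strong one, which is the kind of instability described above.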

Here, I want to try a different approach: comparing contingency tables, that is, looking simply at whether each gene is on or off.
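
As a rough sketch of the idea, on made-up values for eight hypothetical cells: each measurement is reduced to on/off with a cutoff (here `> 1`, the same cutoff used in the cells of this notebook), the two binary vectors are cross-tabulated, and the association is tested with Fisher's exact test. The vectors `gene.expr` and `gfp.expr` are invented for illustration.

```r
# Toy expression values (hypothetical) for one gene and the reporter in 8 cells.
gene.expr <- c(0, 0, 0, 5, 12, 0, 7, 3)
gfp.expr  <- c(0, 0, 1, 8, 20, 0, 4, 0)

# Call a measurement "on" when it exceeds the cutoff of 1.
gene.on <- gene.expr > 1
gfp.on  <- gfp.expr > 1

# 2 x 2 contingency table: rows are the gene state, columns the reporter state.
tab <- table(gene = gene.on, gfp = gfp.on)
print(tab)

# Fisher's exact test for association between the two on/off states.
test <- fisher.test(tab)
print(test$p.value)
```

Because only the on/off state enters the table, a single cell with an extreme expression value counts exactly as much as any other cell.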

In [None]:
library(ggplot2)
library(RColorBrewer)
theme_set(theme_bw())

In [None]:
# load the data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
merged <- read.table(sprintf('%s/exprMatrix.csv', matrices.dir),
                     header = TRUE, row.names = 1,
                     sep = "\t", check.names = FALSE)

# load sample sheet
sampleSheet <- read.table(sprintf('%s/samplesheet.csv', matrices.dir),
                          header = TRUE,
                          row.names = 1)

# remove dead cells
sampleSheet <- sampleSheet[sampleSheet$status != "dead", ]

# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, sep = "\t",
                        row.names = 1, stringsAsFactors = FALSE)
gene.data <- subset(gene.data, rownames(gene.data) %in% rownames(merged))

In [None]:
# define treated cells
treated <- sampleSheet$status == "treated"

In [None]:
onoff <- merged[, treated] > 1
hiv.onoff <- merged["FILIONG01", treated] > 1

In [None]:
# threshold for significance of p-values
threshold <- 0.5

# init data frame that will contain the list of interesting genes
interesting <- data.frame()

# for the Fisher exact test not to fail we need to init contingency
# tables that have predefined levels
levs <- c(TRUE, FALSE)

for (TF.name in rownames(merged)) {
    if (TF.name == "FILIONG01") next
    tab <- table(factor(onoff[TF.name, ], levs), factor(hiv.onoff, levs))
    test <- fisher.test(tab)
    if (test$p.value < threshold){
        interesting <- rbind(interesting, data.frame(name = TF.name, p = test$p.value))
    }
}

# add row names and gene symbols for readability
rownames(interesting) <- interesting$name
interesting$gene_symbol <- gene.data[rownames(interesting), "gene_symbol"]
interesting <- interesting[, -1]

# order by p-value
interesting <- interesting[order(interesting$p), ]

# show
interesting
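
The predefined levels in the loop above matter for genes that are on (or off) in every cell. A small sketch with invented vectors shows what happens otherwise: `table()` drops the missing level, the table is no longer 2 x 2, and `fisher.test()` stops with an error. The variable names below are illustrative only.

```r
# Hypothetical gene that is off in every cell, and a reporter that varies.
gene.on <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
gfp.on  <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)

# Without fixed levels the table collapses to a single row.
tab.raw <- table(gene.on, gfp.on)
print(dim(tab.raw))

# fisher.test() needs at least 2 rows and 2 columns, so this call fails;
# tryCatch() captures the error message instead of aborting.
res <- tryCatch(fisher.test(tab.raw), error = function(e) conditionMessage(e))
print(res)

# With predefined levels the table is always 2 x 2 and the test runs.
levs <- c(TRUE, FALSE)
tab.fixed <- table(factor(gene.on, levs), factor(gfp.on, levs))
print(dim(tab.fixed))
print(fisher.test(tab.fixed)$p.value)
```

A gene with an all-zero row carries no information about the reporter, so the test returns a p-value of 1 and the gene does not pass any threshold below 1.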

We obtained a list of candidate genes associated with HIV expression. Let's look at their scatter plots.

In [None]:
# prepare data for plotting
X <- data.frame(expr = t(merged[rownames(interesting), treated]), hiv = t(merged["FILIONG01", treated]))
colnames(X) <- c(rownames(interesting), "hiv")

In [None]:
# let's now plot all the results
options(repr.plot.width = 2.5, repr.plot.height = 2)
# head() avoids NA names if fewer than 10 genes passed the threshold
for (TF.name in head(rownames(interesting), 10)) {
    gg <- ggplot(X, aes_string(TF.name, "hiv")) + geom_point() +
        geom_smooth(method = 'lm') +
        labs(x = gene.data[TF.name, "gene_symbol"], y = "GFP expression",
             # the p-value column of `interesting` is named "p"
             title = sprintf("p = %.3e", interesting[TF.name, "p"]))
    print(gg)
}

Clearly, the scatter plots do not convey the information that the contingency tables capture. Let's look at the contingency tables for the best candidates.

In [None]:
i <- 3
INTS1 <- "ENSG00000164880.15"
PUS10 <- "ENSG00000162927.13"
TF.name <- INTS1
# TF.name <- rownames(interesting)[i]
table(onoff[TF.name, ], hiv.onoff, dnn = c(gene.data[TF.name, "gene_symbol"], "HIV"))
t(merged[c(INTS1, PUS10, "FILIONG01"), treated])

In [None]:
table(onoff[INTS1, ], onoff[PUS10, ], dnn = c("INTS1", "PUS10"))

In [None]:
options(repr.plot.width = 5, repr.plot.height = 4)
gg <- ggplot(X, aes_string(INTS1, PUS10)) + geom_point(aes(color=log10(1+hiv)), size=3)  +
scale_colour_gradient(low="blue", high="red") +
labs(x = "INTS1", y = "PUS10")
print(gg)

This result is quite striking. Let's look at how the expression of these two genes is distributed across the entire data set.

In [None]:
INTS1.expr <- as.data.frame(t(merged[INTS1, ]))
INTS1.expr$gene <- "INTS1"
INTS1.expr$label <- sampleSheet$label
colnames(INTS1.expr) <- c("expr", "gene", "label")
PUS10.expr <- as.data.frame(t(merged[PUS10, ]))
PUS10.expr$gene <- "PUS10"
PUS10.expr$label <- sampleSheet$label
colnames(PUS10.expr) <- c("expr", "gene", "label")
Y <- rbind(INTS1.expr, PUS10.expr)

In [None]:
ggplot(Y, aes(x = gene, y = expr, fill = label)) + geom_boxplot(outlier.size = 0.3) +
scale_y_log10() + labs(y = "Expression", title = "Non-zero expression values")

In [None]:
labels <- unique(Y$label)

# zero-expression fraction
zef <- data.frame()
for (label in labels) {
    # cells belonging to the current label
    in.label <- sampleSheet$label == label
    INTS1.zef <- sum(merged[INTS1, in.label] == 0) / sum(in.label)
    PUS10.zef <- sum(merged[PUS10, in.label] == 0) / sum(in.label)
    zef <- rbind(zef, data.frame(zef = INTS1.zef, label = label, gene = "INTS1"))
    zef <- rbind(zef, data.frame(zef = PUS10.zef, label = label, gene = "PUS10"))
}

In [None]:
ggplot(zef, aes(x = gene, y = zef, fill = label)) +
geom_bar(position = position_dodge(), stat = "identity") +
labs(y = "Fraction", title = "Fraction of cells with zero expression")
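
For reference, the zero-expression fraction can also be computed without an explicit loop. The following is a minimal sketch on made-up data, assuming one expression vector and one label per cell; `expr`, `label` and `zef.toy` are hypothetical names and not objects from this notebook.

```r
# Toy data (hypothetical): expression of one gene in six cells from two groups.
expr  <- c(0, 3, 0, 0, 7, 2)
label <- c("A", "A", "A", "B", "B", "B")

# mean() of a logical vector is the fraction of TRUE values, so this gives
# the fraction of zero-expression cells within each label.
zef.toy <- tapply(expr == 0, label, mean)
print(zef.toy)
```

In this toy example two of the three cells in group A have zero expression, against one of three in group B.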