# 2018-11-08 Other analyses part 1

After a meeting with Andreas, Mie, and Jordi, I was asked to do some other tests to understand a few more things.

1. Correlate cell cycle with gene expression patterns. If transcription shuts down globally in G2/M phase, then it's not so interesting to see that there are fewer cells expressing HIV. If, on the other hand, HIV levels correlate well with other marker genes that are active, then it's more interesting.
2. Pool together all the data and extract the gene expression modules directly from the whole data set. See whether there are any interesting additional effects that come from there.
3. Look individually at the genes in the significant modules, and search PubMed for candidate associations with HIV.
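
As a rough illustration of point 1, the sketch below correlates a viral read-out with marker genes on simulated data. Everything in it is hypothetical: the gene names (`HIV`, `markerA`, `markerB`), the toy matrix `expr`, and the assumption that transcription is globally halved in G2M. It only shows the kind of comparison meant here, not the analysis performed on the real data set.

```r
# Toy data only: all names and numbers below are made up for illustration.
set.seed(1)
n.cells <- 200
phase <- sample(c("G1", "G2M"), n.cells, replace = TRUE)

# assumed global transcriptional activity, halved in G2M
activity <- ifelse(phase == "G2M", 0.5, 1) * rlnorm(n.cells, sdlog = 0.3)

# genes x cells count matrix with one viral gene and two marker genes
expr <- rbind(HIV     = rpois(n.cells, 5 * activity),
              markerA = rpois(n.cells, 8 * activity),
              markerB = rpois(n.cells, 3 * activity))

# Spearman correlation between HIV and each marker gene across all cells
markers <- c("markerA", "markerB")
rho <- sapply(markers, function(g) {
    cor(expr["HIV", ], expr[g, ], method = "spearman")
})

# the same correlation for one marker, computed within each phase
rho.by.phase <- sapply(c("G1", "G2M"), function(p) {
    idx <- phase == p
    cor(expr["HIV", idx], expr["markerA", idx], method = "spearman")
})

rho
rho.by.phase
```

If HIV simply follows global activity, the correlation across all cells should be positive, and it should weaken once the phase is held fixed. A correlation that persists within a phase would be the more interesting case.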

## Cell cycle revisited

Let's start with the cell cycle, by looking at the patterns of gene expression in the various phases.

In [None]:
source("/home/rcortini/work/CRG/projects/sc_hiv/scripts/sc_hiv.R")
scHiv <- process.sc_hiv()

First we keep only the genes whose mean expression across all cells exceeds a threshold (a threshold of 0 removes the genes that have zero expression in all the cells), and we define the two expression matrices for the phases G1 and G2M.

We then calculate the average expression of each gene in the two subgroups, and plot the results as a box plot.

In [None]:
expressionByPhase <- function(scHiv, threshold) {
    # calculate the mean expression of each gene across all cells and keep
    # the genes whose mean expression is greater than threshold
    individual.gene.total.expr <- rowMeans(assays(scHiv)[["counts"]])
    active.genes <- names(individual.gene.total.expr[individual.gene.total.expr > threshold])

    # get the expression values for the cells in the two phases
    G1.idx <- which(scHiv$phases == 'G1')
    G1 <- assays(scHiv)[["counts"]][active.genes, G1.idx]
    G2M.idx <- which(scHiv$phases == 'G2M')
    G2M <- assays(scHiv)[["counts"]][active.genes, G2M.idx]

    # calculate the averages
    data.frame(expr  = c(rowMeans(G1),        rowMeans(G2M)),
               phase = c(rep("G1", nrow(G1)), rep("G2M", nrow(G2M))))
}

plotExpressionByPhase <- function(scHiv, threshold) {
    # get the expression values
    av.by.phase <- expressionByPhase(scHiv, threshold)
    
    # plot
    theme_set(theme_classic())
    g <- ggpubr::ggboxplot(av.by.phase, x = "phase", y = "expr",
                   add = "jitter", color = "phase", main = sprintf("Threshold = %.1f", threshold))
    g + ggpubr::stat_compare_means()# + coord_cartesian(ylim=c(6, 14))
}

In [None]:
options(repr.plot.width = 5, repr.plot.height = 6)
g1 <- plotExpressionByPhase(scHiv, 0)
g2 <- plotExpressionByPhase(scHiv, 3)
g3 <- plotExpressionByPhase(scHiv, 7)
g4 <- plotExpressionByPhase(scHiv, 10)
multiplot(g1, g2, g3, g4, cols = 2)

This analysis shows that there is indeed a significant difference between the average gene expression values of the cells in G1 and those of the cells in G2M.
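
For reference, the p-value printed on the plots comes from `ggpubr::stat_compare_means()`, which for two groups defaults to an unpaired Wilcoxon rank-sum test. Since every gene contributes one average to each phase, a paired test is a possible alternative. The sketch below compares the two on simulated values; `G1.toy` and `G2M.toy` are hypothetical vectors, not the real averages.

```r
# Toy data only: hypothetical per-gene average expression in the two phases.
set.seed(2)
n.genes <- 500
G1.toy  <- rexp(n.genes, rate = 1)
# assume each gene is reduced by 0-30% in G2M
G2M.toy <- G1.toy * runif(n.genes, min = 0.7, max = 1.0)

# unpaired Wilcoxon rank-sum test (what the box plots report by default)
unpaired <- wilcox.test(G1.toy, G2M.toy)

# paired Wilcoxon signed-rank test: each gene is compared with itself
paired <- wilcox.test(G1.toy, G2M.toy, paired = TRUE)

c(unpaired = unpaired$p.value, paired = paired$p.value)
```

On the real data, the paired version would need the per-gene averages of the two phases in the same gene order. The data frame returned by `expressionByPhase` satisfies this, because both halves are subset with the same `active.genes`.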

In [None]:
av.by.phase <- expressionByPhase(scHiv, 0)
G1 <- av.by.phase[av.by.phase$phase == "G1", ]$expr
G2M <- av.by.phase[av.by.phase$phase == "G2M", ]$expr

In [None]:
x <- seq(-0.1,10,0.001)
n <- length(x)
cumulative.phases <- data.frame(x = rep(x, 2),
                                cumdist = c(ecdf(G1)(x), ecdf(G2M)(x)),
                                phase = c(rep("G1", n), rep("G2M", n)))

In [None]:
options(repr.plot.width = 4,repr.plot.height = 2)
gg <- ggplot(cumulative.phases, aes(x = x, y = cumdist)) + 
      geom_line(aes(colour = phase), cex = 1.) +
      geom_hline(yintercept = 1, size = 0.2, linetype="dashed", color = "black") +
      labs(x = "Average Gene Expression", y = "Cumulative distribution")
gg

So the conclusion from this analysis is that cells in G1 phase tend to have fewer inactive genes than cells in G2M phase. This is in line with the fact that HIV is also less active in G2M cells than it is in G1 cells. However, this analysis does not point to a meaningful biological conclusion.
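
The gap between the two cumulative curves can also be put into a single number with a two-sample Kolmogorov-Smirnov test, whose statistic is the largest vertical distance between the two empirical distributions. This test was not part of the analysis above. The sketch below runs it on simulated values (`a` and `b` are hypothetical stand-ins for the per-gene averages) and checks it against the distance measured on a grid like the one used for the plot.

```r
# Toy data only: two hypothetical samples of average expression values.
set.seed(3)
a <- rexp(300, rate = 1)
b <- rexp(300, rate = 1.5)

# largest vertical gap between the two empirical CDFs, measured on a grid
x <- seq(0, 10, by = 0.01)
max.gap <- max(abs(ecdf(a)(x) - ecdf(b)(x)))

# two-sample Kolmogorov-Smirnov test; its statistic D is the same gap
# taken over all x, so the grid value can never exceed it
ks <- ks.test(a, b)

c(grid.gap = max.gap, D = unname(ks$statistic), p.value = ks$p.value)
```

A statistically significant D would only confirm that the two distributions differ. It would not change the biological interpretation given above.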