# 2018-10-30 Automated clustering
This notebook uses and tests the functions that perform automated clustering of the gene expression matrices.

In [None]:
# basic data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
sample.names <- c("P2449", "P2458")

# init data structures that will hold our data
exprMatrices <- list()
sampleSheets <- list()

# load data
for (sample.name in sample.names) {
    
    # file names
    matrix.fname <- sprintf("%s/%s.tsv.gz", matrices.dir, sample.name)
    sampleSheet.fname <- sprintf("%s/monocle/%s.pd.tsv", matrices.dir, sample.name)

    # parse data
    exprMatrices[[sample.name]] <- read.table(matrix.fname, header = TRUE, row.names = 1,
                                sep = "\t", check.names = FALSE)
    sampleSheets[[sample.name]] <- read.delim(sampleSheet.fname, header = TRUE, row.names = 1)
}

# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, row.names = 1, sep = "\t")

In [None]:
# load our lovely script
source("/home/rcortini/work/CRG/projects/sc_hiv/scripts/GeneExpressionClustering.R")

## Clustering of gene expression
Prepare parameters for cluster extraction.
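
The internals of `PrepareDatExpr` and `ClusterGenes` live in the sourced script and are not shown in this notebook, so the following is only a sketch of what these two parameters usually mean in a WGCNA-style workflow, on made-up data: `cut` as the height at which a hierarchical clustering tree of the cells is cut to drop outliers, and the soft threshold power as the exponent applied to the absolute gene-gene correlations. All names in it (`toy`, `toy.cut`, `toy.power`) are hypothetical.

```r
# toy expression data: 20 cells (rows) x 50 genes (columns); cell1 is an outlier
set.seed(1)
toy <- matrix(rnorm(20 * 50), nrow = 20,
              dimnames = list(paste0("cell", 1:20), paste0("gene", 1:50)))
toy[1, ] <- toy[1, ] + 20

# outlier detection: cluster the cells, cut the tree at a given height and
# keep only the largest branch
toy.cut <- 50
tree <- hclust(dist(toy), method = "average")
branch <- cutree(tree, h = toy.cut)
keep <- branch == as.integer(names(which.max(table(branch))))
toy.filtered <- toy[keep, ]

# soft thresholding: raise the absolute correlations to a power, which
# pushes weak correlations toward zero
toy.power <- 6
adjacency <- abs(cor(toy.filtered))^toy.power
```

A higher power suppresses weak correlations more aggressively, which is presumably why the value is tuned per sample.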

In [None]:
# cut for detection of outliers
cut <- list()
cut[["P2449"]] <- 14000
cut[["P2458"]] <- 8000

# soft threshold power for network extraction
softThresholdPower <- list()
softThresholdPower[["P2449"]] <- 5
softThresholdPower[["P2458"]] <- 6

In [None]:
# do all the network reconstruction and the module extraction for both samples
modules <- list()
nets <- list()
datExpr <- list()
for (sample.name in sample.names) {
    exprMatrix <- exprMatrices[[sample.name]]
    sampleSheet <- sampleSheets[[sample.name]]
    
    # filter the expression data
    datExpr[[sample.name]] <- PrepareDatExpr(exprMatrix, sampleSheet,
                                             ngenes = 5000,
                                             cut = cut[[sample.name]])
    
    # reconstruct the network
    nets[[sample.name]] <- ClusterGenes(datExpr[[sample.name]],
                            softThresholdPower = softThresholdPower[[sample.name]])
    
    # associate network motifs to HIV expression patterns
    modules[[sample.name]] <- AssociateClustersToHIV(datExpr[[sample.name]],
                                                     exprMatrix,
                                                     sampleSheet,
                                                     nets[[sample.name]],
                                                     aliveThreshold = 100000)
}

In [None]:
# create a data frame with the information on which gene is associated
# to which cluster (color) by the algorithm identifying the modules
colors <- data.frame(P2449 = nets[["P2449"]]$colors,
                     P2458 = nets[["P2458"]]$colors)
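# datExpr is a list, so take the gene names from one sample (this assumes
# both samples were filtered down to the same genes, in the same order)
rownames(colors) <- colnames(datExpr[["P2449"]])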

Once this is done, let's get an idea of the overlap between the modules found in the two samples.
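
One way to quantify that overlap is to cross-tabulate the two color assignments. The sketch below uses made-up assignments (`toy.colors` is hypothetical, standing in for the `colors` data frame). Note that module colors are assigned independently in each sample, so the same color name in the two columns does not by itself mean the same module.

```r
# made-up module assignments for six genes in two samples
toy.colors <- data.frame(
    P2449 = c("blue", "blue", "brown", "brown", "grey", "grey"),
    P2458 = c("blue", "blue", "blue", "grey", "grey", "grey"),
    row.names = paste0("gene", 1:6)
)

# contingency table: rows are P2449 modules, columns are P2458 modules
overlap <- table(P2449 = toy.colors$P2449, P2458 = toy.colors$P2458)
print(overlap)

# fraction of each P2449 module that falls in each P2458 module
print(prop.table(overlap, margin = 1))
```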

In [None]:
# list the distinct modules found in each sample
print(sort(unique(colors$P2449)))
print(sort(unique(colors$P2458)))

The number of modules is not the same in the two cases. Let's look at the correlation between the modules and the HIV expression in the two samples.
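
How `AssociateClustersToHIV` computes these statistics is not shown in this notebook. A common WGCNA-style approach, sketched here on simulated data as an assumption about the general idea rather than a copy of that function, is to summarize each module by its first principal component (the module eigengene) and correlate it with the trait. All variable names are hypothetical.

```r
set.seed(2)
n.cells <- 40

# simulated trait (standing in for HIV expression), a module of 10 genes
# that follow it with noise, and a module of 10 unrelated genes
trait <- rnorm(n.cells)
module.a <- sapply(1:10, function(i) trait + rnorm(n.cells, sd = 0.5))
module.b <- matrix(rnorm(n.cells * 10), nrow = n.cells)
toy.expr <- cbind(module.a, module.b)
toy.modules <- rep(c("a", "b"), each = 10)

# module eigengene: first principal component of the scaled module genes
# (its sign is arbitrary, so only the absolute correlation is meaningful)
eigengene <- function(x) prcomp(x, scale. = TRUE)$x[, 1]

# correlation of each eigengene with the trait, with its p-value
toy.stats <- t(sapply(unique(toy.modules), function(m) {
    test <- cor.test(eigengene(toy.expr[, toy.modules == m]), trait)
    c(cor = unname(test$estimate), p = test$p.value)
}))
print(toy.stats)
```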

In [None]:
print("P2449")
modules[["P2449"]]$stats
print("P2458")
modules[["P2458"]]$stats

In the P2458 sample only one module has a significant p-value, which is much worse than in the other sample. This suggests that what looked like batch effects are not really batch effects: the second sample is simply much noisier. I'll try one thing: use the modules identified in the first sample to try to predict the expression of HIV in the second sample.
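
One thing to watch when transferring module assignments between samples: the assignments are ordered like the genes of the sample they were computed on, so they have to be matched by gene name before being applied to another matrix. Whether `AssociateClustersToHIV` does this internally is not visible from this notebook; the sketch below only illustrates the matching step on made-up data with hypothetical names.

```r
# module assignments computed on sample A, named by gene
colors.a <- c(gene1 = "blue", gene2 = "blue", gene3 = "brown", gene4 = "grey")

# expression of sample B: 3 cells x 4 genes, partly different genes, other order
expr.b <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("cell", 1:3),
                                 c("gene3", "gene5", "gene1", "gene2")))

# keep the genes present in both, and reorder A's assignments to B's columns
shared <- intersect(colnames(expr.b), names(colors.a))
expr.b.shared <- expr.b[, shared, drop = FALSE]
colors.for.b <- colors.a[shared]
print(colors.for.b)
```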

In [None]:
modules.hybrid <- AssociateClustersToHIV(datExpr[["P2458"]],
                                         exprMatrices[["P2458"]],
                                         sampleSheets[["P2458"]],
                                         nets[["P2449"]],
                                         aliveThreshold = 100000)

In [None]:
modules.hybrid$stats

No good news here either: the p-values are considerably less significant, and moreover the module with the strongest association is not one of the modules identified before.

## Individual gene correlation

Let's try another strategy: let's look at which individual *genes* are most significantly associated with HIV expression in the two experiments.
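
For reference, the p-value attached to a Pearson correlation can be obtained from Student's t distribution; as far as I know this is what WGCNA's `corPvalueStudent`, used below, computes. A minimal base R sketch on simulated data (the function name `cor.pvalue.student` and the variables are hypothetical), checked against `cor.test`:

```r
# two-sided p-value of a Pearson correlation r computed on n observations
cor.pvalue.student <- function(r, n) {
    t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
    2 * pt(abs(t.stat), df = n - 2, lower.tail = FALSE)
}

set.seed(3)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)
r <- cor(x, y)

# the two p-values should agree
print(cor.pvalue.student(r, length(x)))
print(cor.test(x, y)$p.value)
```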

In [None]:
# this function returns the list of values of the correlations of the genes to HIV
# activity
GenesCorToHIV <- function (sample.name) {
    # select the correct expression matrix
    myExprMatrix <- exprMatrices[[sample.name]]
    
    # select the treated cells...
    myExprMatrix <- myExprMatrix[, sampleSheets[[sample.name]]$label == "J-Lat+SAHA"]
    
    # ... that are alive
    myExprMatrix <- myExprMatrix[, colSums(myExprMatrix) > 100000]
    myExprMatrix <- t(myExprMatrix)
    
    # get the HIV expression values
    myCells <- rownames(myExprMatrix)
    hiv <- t(exprMatrices[[sample.name]]["FILIONG01", myCells])

    # do the correlation
    nSamples <- nrow(myExprMatrix)
    correlation <- cor(myExprMatrix, hiv, use = "p")
    pvalue <- corPvalueStudent(correlation, nSamples)
    
    # return
    r <- data.frame(cor = correlation, p = pvalue)
    names(r) <- c("cor", "p")
    r
}

We get the list of correlation values and sort it by decreasing correlation coefficient. Finally, we attach the gene symbols, to get a list that is easy to read.

In [None]:
most.significant <- list()
genes.cor.to.hiv <- list()
for (sample.name in sample.names) {
    genes.cor.to.hiv[[sample.name]] <- GenesCorToHIV(sample.name)
    most.significant.idx <- order(genes.cor.to.hiv[[sample.name]]$cor, decreasing = TRUE)
    most.significant[[sample.name]] <- genes.cor.to.hiv[[sample.name]][most.significant.idx, ]
    gene.names <- as.character(gene.data[rownames(most.significant[[sample.name]]), ])
    most.significant[[sample.name]]$gene_symbol <- gene.names
}

Let's now have a look at the list.

In [None]:
head(most.significant[["P2449"]])
head(most.significant[["P2458"]])

Apart from the obvious fact that the correlation is maximal for the FILIONG01 gene itself (duh), some candidates have ridiculously low p-values and would be worth exploring. There is a ubiquitin peptidase that looks like a promising candidate. For the P2458 sample, I seriously doubt that there is valuable information hidden there: it seems too noisy to me.