# 2018-10-26 Validating gene networks

In the previous chapter I used the expression matrices to obtain gene expression networks, in the form of "modules" of genes that are expressed in a coherent way. The trouble was that I did not validate those modules, which left me wondering whether any modelling based on them would make sense.

Here, I want to follow the `WGCNA` tutorial to check whether what I did was actually meaningful. If so, I will proceed with modelling based on those modules. If not, I will go back and tweak the parameters of the clustering algorithm until I obtain something meaningful.

In [None]:
# basic data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
sample.names <- c("P2449", "P2458")

# init data structures that will hold our data
exprMatrices <- list()
sampleSheets <- list()

# load data
for (sample.name in sample.names) {
    
    # file names
    matrix.fname <- sprintf("%s/%s.tsv.gz", matrices.dir, sample.name)
    sampleSheet.fname <- sprintf("%s/monocle/%s.pd.tsv", matrices.dir, sample.name)

    # parse data
    exprMatrices[[sample.name]] <- read.table(matrix.fname, header = TRUE, row.names = 1,
                                sep = "\t", check.names = FALSE)
    sampleSheets[[sample.name]] <- read.delim(sampleSheet.fname, header = TRUE, row.names = 1)
}

# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, row.names = 1, sep = "\t")

In [None]:
source("/home/rcortini/work/CRG/projects/sc_hiv/scripts/sc_hiv.R")

In [None]:
sample.name <- "P2449"
exprMatrix <- exprMatrices[[sample.name]]
sampleSheet <- sampleSheets[[sample.name]]

In [None]:
datExpr <- PrepareDataForClustering(exprMatrix, sampleSheet,
                                    cut = 14000,
                                    ngenes = 5000)

In [None]:
suppressMessages(PrepareClustering(datExpr))

In [None]:
net <- blockwiseModules(datExpr,
                        power             = 5,
                        TOMType           = "unsigned",
                        minModuleSize     = 30,
                        reassignThreshold = 0,
                        mergeCutHeight    = 0.25,
                        numericLabels     = TRUE,
                        pamRespectsDendro = FALSE,
                        verbose           = 0)
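
To get a feeling for what the `power` argument does, here is a minimal sketch on toy data, in base R only. In an unsigned network the adjacency between two genes is the absolute correlation raised to the soft-thresholding power. The matrix `toyExpr` and the names `softPower`, `toyCor` and `toyAdj` are made up for this illustration and are not part of the analysis.

```r
# Toy expression matrix: 20 samples (rows) x 4 genes (columns).
# g1 and g2 share a common signal, g3 and g4 are pure noise.
set.seed(1)
base <- rnorm(20)
toyExpr <- cbind(g1 = base + rnorm(20, sd = 0.3),
                 g2 = base + rnorm(20, sd = 0.3),
                 g3 = rnorm(20),
                 g4 = rnorm(20))

# Unsigned soft-thresholded adjacency: |correlation| raised to a power.
softPower <- 5   # same value as the `power` argument used above
toyCor <- cor(toyExpr)
toyAdj <- abs(toyCor)^softPower

# Strong correlations survive, weak ones are pushed towards zero.
round(toyCor, 2)
round(toyAdj, 2)
```

Raising to a power keeps the ranking of the connections but suppresses the weak ones, which is why the choice of `power` affects which modules come out of the clustering.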

In [None]:
VisualizeClustering(net)

In [None]:
# get the module labels, transform them into colors
moduleLabels <- net$colors
moduleColors <- labels2colors(net$colors)

# get the names of the genes we selected from the original ones
myGenes <- colnames(datExpr)


In [None]:
myExprMatrix <- exprMatrices[[sample.name]]

# select only the genes that we selected before
myExprMatrix <- myExprMatrix[myGenes, ]

# select only J-Lat treated cells
myExprMatrix <- myExprMatrix[, sampleSheets[[sample.name]]$label == "J-Lat+SAHA"]

# select only alive cells
myExprMatrix <- myExprMatrix[, colSums(myExprMatrix) > 100000]

# finally, transpose to be interfaced to WGCNA
myExprMatrix <- t(myExprMatrix)
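
Selecting columns with a logical mask built from the sample sheet assumes that row *i* of the sample sheet describes column *i* of the expression matrix. A minimal sketch of how that alignment can be made explicit, on toy data (`toyCounts`, `toySheet` and the labels are invented for the illustration):

```r
# Toy genes x cells matrix, and a sample sheet whose rows are in a different order.
toyCounts <- matrix(c(5, 0, 3,
                      2, 7, 1), nrow = 2, byrow = TRUE,
                    dimnames = list(c("geneA", "geneB"),
                                    c("cell1", "cell2", "cell3")))
toySheet <- data.frame(label = c("treated", "untreated", "treated"),
                       row.names = c("cell3", "cell2", "cell1"))

# Reorder the sample sheet so that row i describes column i of the matrix.
toySheet <- toySheet[colnames(toyCounts), , drop = FALSE]
stopifnot(identical(rownames(toySheet), colnames(toyCounts)))

# The logical mask can now be applied to the columns safely.
treatedCounts <- toyCounts[, toySheet$label == "treated", drop = FALSE]
treatedCounts
```

If the two objects are already in the same order, the reordering step is a no-op and the `stopifnot` simply documents the assumption.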

In [None]:
# get the module eigengenes of the *new* data set: that is, we assign the
# expression profiles of the treated data set based on the gene modules of the
# untreated cells
MEs <- moduleEigengenes(myExprMatrix, moduleColors)$eigengenes
MEs <- orderMEs(MEs)
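
A module eigengene is, roughly speaking, the first principal component of the expression profiles of the genes in a module. The sketch below illustrates the idea in base R on toy data; it is not `WGCNA`'s exact implementation, and `toyModule`, `signal` and `toyEigengene` are names I made up for the example.

```r
set.seed(42)
nCells <- 30
signal <- rnorm(nCells)

# Toy module: 5 genes that all follow the same underlying signal plus noise.
toyModule <- sapply(1:5, function(i) signal + rnorm(nCells, sd = 0.5))
colnames(toyModule) <- paste0("gene", 1:5)

# Standardise each gene, then take the first left singular vector
# (one value per cell).
scaled <- scale(toyModule)
toyEigengene <- svd(scaled)$u[, 1]

# The sign of a singular vector is arbitrary: orient it so that it
# correlates positively with the average profile of the module.
if (cor(toyEigengene, rowMeans(scaled)) < 0) toyEigengene <- -toyEigengene

# The eigengene should track the signal shared by the genes.
cor(toyEigengene, signal)
```

The eigengene condenses a whole module into one profile across cells, which is what makes it possible to correlate a module with a single trait such as HIV expression.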

In [None]:
# get the names of the cells that we have selected, and extract the HIV profile
# of those cells
myCells <- rownames(myExprMatrix)
hiv <- t(exprMatrices[[sample.name]]["FILIONG01", myCells])

In [None]:
# parameters of our data set
nGenes <- ncol(myExprMatrix)
nSamples <- nrow(myExprMatrix)

In [None]:
# correlate the module eigengenes to the HIV expression patterns, and 
# calculate the corresponding p value
moduleHivCor <- cor(MEs, hiv, use = "p")
moduleHivPvalue <- corPvalueStudent(moduleHivCor, nSamples)
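
The p-value of a Pearson correlation under a Student t-test depends only on the correlation and on the number of samples. A minimal sketch of that calculation in base R, checked against `cor.test`, follows; `studentPvalue` is a hypothetical helper written for this illustration and the data are simulated.

```r
# Hypothetical helper: two-sided Student p-value for a Pearson correlation r
# computed from n samples.
studentPvalue <- function(r, n) {
    tStat <- r * sqrt((n - 2) / (1 - r^2))
    2 * pt(abs(tStat), df = n - 2, lower.tail = FALSE)
}

# Simulated data with a moderate correlation.
set.seed(7)
x <- rnorm(50)
y <- 0.4 * x + rnorm(50)
r <- cor(x, y)

# Compare with the p-value reported by base R's cor.test().
c(sketch = studentPvalue(r, length(x)), cor.test = cor.test(x, y)$p.value)
```

Because the number of samples enters through the degrees of freedom, even a small correlation can look significant when there are many cells.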

In [None]:
# look at the module statistics together: correlation and p-value
moduleStats <- data.frame(correlation = moduleHivCor, pvalue = moduleHivPvalue)
names(moduleStats) <- c("correlation", "p")
moduleStats
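
When reading a table of per-module p-values like this one, it may be worth keeping in mind that one test is performed per module. A minimal sketch of a Benjamini-Hochberg adjustment with base R's `p.adjust` follows. The numbers are hypothetical, invented for the illustration, and are not results of this analysis.

```r
# Hypothetical per-module statistics (NOT results from this analysis).
toyStats <- data.frame(correlation = c(0.45, -0.30, 0.05, 0.12),
                       p = c(0.0004, 0.012, 0.70, 0.31),
                       row.names = c("MEa", "MEb", "MEc", "MEd"))

# Benjamini-Hochberg adjustment across all the modules that were tested.
toyStats$p.adjusted <- p.adjust(toyStats$p, method = "BH")

# Order by adjusted p-value so that the strongest candidates come first.
toyStats[order(toyStats$p.adjusted), ]
```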

Looking at the module statistics, the **darkgreen** and **darkturquoise** module eigengenes show a significant correlation with the HIV expression pattern in this data set. The next step is to identify which genes in those modules are the relevant ones.

In [None]:
modNames <- substring(names(MEs), 3)

# evaluate gene module membership
geneModuleMembership <- as.data.frame(cor(myExprMatrix, MEs, use = "p"))
MMPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
names(geneModuleMembership) <- paste("MM", modNames, sep="")
names(MMPvalue) <- paste("p.MM", modNames, sep="")

# evaluate gene trait significance; `hiv` is a one-column matrix, so its label
# is read with colnames() rather than names()
geneTraitSignificance <- as.data.frame(cor(myExprMatrix, hiv, use = "p"))
GSPvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
names(geneTraitSignificance) <- paste("GS.", colnames(hiv), sep="")
names(GSPvalue) <- paste("p.GS.", colnames(hiv), sep="")

In [None]:
# we encapsulate the code to do a plot of Module Membership (MM)
# versus Gene Significance (for HIV, GS)
ShowMMvsGS <- function (module) {
    column <- match(module, modNames);
    moduleGenes <- moduleColors==module;

    options(repr.plot.width = 4, repr.plot.height = 4)
    par(mfrow = c(1,1));
    verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                       abs(geneTraitSignificance[moduleGenes, 1]),
                       xlab = paste("Module Membership in", module, "module"),
                       ylab = "Gene significance for HIV",
                       main = paste("Module membership vs. gene significance\n"),
                       cex.main = 1.0,
                       cex.lab = 1.0,
                       cex.axis = 0.8,
                       col = module)
}

# show the plots for the two interesting modules we identified
ShowMMvsGS("darkgreen")
ShowMMvsGS("darkturquoise")
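
The point of plotting module membership against gene significance is to spot genes that are both central to the module and associated with the trait. A minimal sketch of how such candidates could be listed, on simulated data in base R, follows. The cut-offs `0.8` and `0.5` and all the `toy*` names are arbitrary choices for the illustration.

```r
set.seed(3)
nCells <- 40
trait <- rnorm(nCells)

# Toy module: genes 1-3 track the trait, genes 4-6 are noise.
toyExpr <- cbind(sapply(1:3, function(i) trait + rnorm(nCells, sd = 0.5)),
                 sapply(1:3, function(i) rnorm(nCells)))
colnames(toyExpr) <- paste0("gene", 1:6)

# Module eigengene, taken here as the first left singular vector of the
# standardised data.
toyME <- svd(scale(toyExpr))$u[, 1]

# Module membership (MM) and gene significance (GS) as absolute correlations.
toyMM <- abs(cor(toyExpr, toyME))[, 1]
toyGS <- abs(cor(toyExpr, trait))[, 1]

# Hypothetical cut-offs, chosen only for illustration.
candidates <- names(toyMM)[toyMM > 0.8 & toyGS > 0.5]
candidates
```

Absolute values are used, as in the plots above, because the sign of an eigengene is arbitrary.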

## Gene Ontology enrichment analysis

The next step is Gene Ontology enrichment analysis. The `WGCNA` package provides a function designed to do this in one go. However, that function takes Entrez gene IDs as input, which I first have to retrieve. I will use the `biomaRt` package to do this.

In [None]:
# first, strip the version suffix (everything from the first dot onwards)
# from the Ensembl gene IDs
myGenes.ensembleIds <- gsub("\\..*", "", myGenes)
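
As a quick check of what this regular expression does, here is a small sketch on a few example identifiers. The version numbers are invented for the illustration. In `biomaRt`, the unversioned form is what the `ensembl_gene_id` filter expects, whereas the `ensembl_gene_id_version` filter expects the versioned form.

```r
# Example Ensembl gene IDs; the version suffixes are made up for illustration.
versioned <- c("ENSG00000141510.16", "ENSG00000012048.20", "ENSG00000139618")

# Drop everything from the first dot onwards; IDs without a dot are unchanged.
unversioned <- gsub("\\..*", "", versioned)
unversioned
```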

In [None]:
# load the biomaRt library
library(biomaRt)

In [None]:
# load the data corresponding to human genome
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

In [None]:
# get the entrez gene ids
myGenes.entrez <- getBM(attributes = c("ensembl_gene_id_version", "entrezgene"),
                  filters = "ensembl_gene_id_version",
                  values = myGenes,
                  mart = mart)

In [None]:
# some genes appear more than once in the biomaRt result, so we keep only the
# first mapping for each of them
duplicated.genes <- duplicated(myGenes.entrez$ensembl_gene_id_version)
good.mapping <- myGenes.entrez[!duplicated.genes, ]
good.genes <- good.mapping$ensembl_gene_id_version
# module colors are indexed by position in the expression matrix, whereas the
# Entrez IDs come straight from the deduplicated biomaRt table
good.gene.idx <- match(good.genes, colnames(myExprMatrix))
good.moduleColors <- moduleColors[good.gene.idx]
good.entrez <- good.mapping$entrezgene

In [None]:
# do the GO enrichment analysis
GOenr <- GOenrichmentAnalysis(good.moduleColors, good.entrez, organism = "human", nBestP = 10);
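
The enrichment call needs the module labels and the Entrez IDs to be aligned gene by gene. A minimal sketch of one way to build such aligned vectors from a mapping table that is in a different order and contains duplicates follows. All the data and names here (`toyGenes`, `toyColors`, `toyMap`) are invented for the illustration.

```r
# Toy inputs: gene order as in the expression matrix, one module label per gene.
toyGenes  <- c("geneA", "geneB", "geneC", "geneD")
toyColors <- c("blue", "blue", "brown", "grey")

# Toy mapping table: different order, one duplicated gene, one gene missing,
# and one gene without an Entrez ID.
toyMap <- data.frame(gene   = c("geneC", "geneA", "geneA", "geneD"),
                     entrez = c(30, 10, 11, NA),
                     stringsAsFactors = FALSE)

# Keep the first mapping per gene, then look up each matrix gene in the table.
toyMap <- toyMap[!duplicated(toyMap$gene), ]
idx <- match(toyGenes, toyMap$gene)

# One row per gene, in matrix order; unmapped genes get NA.
aligned <- data.frame(gene = toyGenes, module = toyColors,
                      entrez = toyMap$entrez[idx], stringsAsFactors = FALSE)
aligned
```

Keeping everything in a single data frame makes it harder for the labels and the identifiers to drift out of register.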

In [None]:
# extract the enrichment table of the best terms for each module, which is the
# most immediately interesting element of the return object
tab <- GOenr$bestPTerms[[4]]$enrichment
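
For intuition about what an enrichment p-value measures, here is a sketch of a one-sided over-representation test on hypothetical counts, written both as a hypergeometric tail probability and as Fisher's exact test. I am assuming that the p-values in the table come from a test of this general kind; the function's documentation is the place to check the details. The counts below are invented.

```r
# Hypothetical counts, used only for illustration.
nBackground <- 5000   # genes in the background set
nAnnotated  <- 200    # background genes annotated with the GO term
nModule     <- 100    # genes in the module
nOverlap    <- 12     # module genes annotated with the term

# P(overlap >= nOverlap) under the hypergeometric null.
pHyper <- phyper(nOverlap - 1, nAnnotated, nBackground - nAnnotated,
                 nModule, lower.tail = FALSE)

# The same one-sided test expressed as Fisher's exact test on a 2x2 table
# (columns: in module / not in module; rows: annotated / not annotated).
contingency <- matrix(c(nOverlap,
                        nModule - nOverlap,
                        nAnnotated - nOverlap,
                        nBackground - nAnnotated - nModule + nOverlap),
                      nrow = 2)
pFisher <- fisher.test(contingency, alternative = "greater")$p.value

c(hypergeometric = pHyper, fisher = pFisher)
```

With these counts about 4 annotated genes would be expected in the module by chance, so an overlap of 12 gives a small p-value.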

In [None]:
# write the information on an output file
write.table(tab, 
            file = sprintf("%s/GOEnrichmentAnalysis-%s.csv", matrices.dir, sample.name),
            sep = ",",
            quote = TRUE,
            row.names = FALSE)