# 2018-10-24-Gene_network_analysis

In this notebook I want to use the `WGCNA` R package to construct the gene network modules for the experiments in my data set.

I'll see what happens if I include only one group of cells at a time: the Jurkat cells, the treated J-Lat cells, or the untreated J-Lat cells.

## Loading data

In [None]:
# load the WGCNA library and allow multithreading
library(WGCNA)
allowWGCNAThreads()

In [None]:
# basic data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
sample.names <- c("P2449", "P2458")

# init data structures that will hold our data
exprMatrices <- list()
sampleSheets <- list()

# load data
for (sample.name in sample.names) {
    
    # file names
    matrix.fname <- sprintf("%s/%s.tsv.gz", matrices.dir, sample.name)
    sampleSheet.fname <- sprintf("%s/monocle/%s.pd.tsv", matrices.dir, sample.name)

    # parse data
    exprMatrices[[sample.name]] <- read.table(matrix.fname, header = TRUE, row.names = 1,
                                sep = "\t", check.names = FALSE)
    sampleSheets[[sample.name]] <- read.delim(sampleSheet.fname, header = TRUE, row.names = 1)
}

# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, row.names = 1, sep = "\t")

## Prepare clustering functions
Here I prepare a few functions (taken directly from the WGCNA tutorial) that perform the clustering of the gene expression profiles.

In [None]:
# This function prepares the data structure to be fed to the next function
PrepareDataForClustering <- function (exprMatrix, sampleSheet,
                          ngenes = 3600,
                          cut = 1000) {
    # select the untreated J-Lat cells (columns of the expression matrix)
    jlat.untreated <- exprMatrix[, sampleSheet$label == 'J-Lat+DMSO']
    
    # establish which are the most highly varying genes, based on a simple
    # criterion of maximum variance/mean.
    gene.variances <- apply(jlat.untreated, 1, var)
    gene.means <- apply(jlat.untreated, 1, mean)
    gene.variability <- gene.variances/gene.means
    
    # get the names of the genes that have the greatest biological variation, 
    # excluding the FILIONG01 gene (not really necessary)
    selected <- order(gene.variability, decreasing = TRUE)[1:ngenes]
    most.variable.genes <- rownames(jlat.untreated[selected, ])
    most.variable.genes <- most.variable.genes[most.variable.genes != 'FILIONG01']
    
    # extract a data frame with the values of the expressions for each of the genes
    # with the highest biological variation
    datExpr0 <- as.data.frame(t(jlat.untreated[most.variable.genes, ]))
    
    # do quality control
    gsg <- goodSamplesGenes(datExpr0, verbose = 3);
    if (!gsg$allOK) {
        stop("Do proper quality control on genes!") 
    }
    
    # plot size
    options(repr.plot.width = 10, repr.plot.height = 6)

    # detect outliers
    sampleTree <- hclust(dist(datExpr0), method = "average");
    par(cex = 0.6);
    par(mar = c(0,4,2,0))
    plot(sampleTree,
         main     = "Sample clustering to detect outliers",
         sub      = "",
         xlab     = "",
         cex.lab  = 1.5,
         cex.axis = 1.5,
         cex.main = 2)

    # Plot a line to show the cut
    abline(h = cut, col = "red");
    
    # Determine cluster under the line
    clust <- cutreeStatic(sampleTree, cutHeight = cut, minSize = 10)
    # print explicitly: a bare table() call inside a function shows nothing
    print(table(clust))
    
    # clust 1 contains the samples we want to keep.
    keepSamples <- (clust == 1)
    datExpr0[keepSamples, ]
}
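
To make the gene-selection step above concrete, here is a minimal sketch of the variance/mean ranking on a made-up count matrix. The matrix `toy` and its gene names are hypothetical and only serve as illustration; the ranking itself uses the same `apply`/`order` calls as `PrepareDataForClustering`. One thing it shows is that a gene with zero mean gives `NaN` (0/0), which `order` places last by default, so such genes are only picked up if `ngenes` exceeds the number of genes with a finite ratio.

```r
# Toy count matrix: 6 genes (rows) x 8 cells (columns); all values are made up
toy <- rbind(
    flat1  = rep(10, 8),
    flat2  = rep(50, 8),
    noisy1 = c(0, 40, 0, 40, 0, 40, 0, 40),
    noisy2 = c(5, 15, 5, 15, 5, 15, 5, 15),
    silent = rep(0, 8),
    mild   = c(9, 11, 9, 11, 9, 11, 9, 11)
)

# variance-to-mean ratio of each gene, as in PrepareDataForClustering
toy.variances <- apply(toy, 1, var)
toy.means <- apply(toy, 1, mean)
toy.variability <- toy.variances / toy.means

# the "silent" gene has zero mean and zero variance, hence 0/0 = NaN
toy.variability

# order() puts NA/NaN values last by default, so the two most variable
# genes are picked from the genes with a finite ratio
toy.selected <- order(toy.variability, decreasing = TRUE)[1:2]
rownames(toy)[toy.selected]
```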

In [None]:
# this function outputs plots that help choose the best value of the
# soft-thresholding power
PrepareClustering <- function (datExpr) {
    # Choose a set of soft-thresholding powers
    powers <- c(c(1:10), seq(from = 12, to=20, by=2))

    # Call the network topology analysis function
    sft <- pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
    
    # number of genes and number of samples
    nGenes <- ncol(datExpr)
    nSamples <- nrow(datExpr)

    # Plot the results:
    par(mfrow = c(1,2))
    cex1 = 0.9
    options(repr.plot.width = 10, repr.plot.height = 6)

    # Scale-free topology fit index as a function of the soft-thresholding power
    plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
         xlab = "Soft Threshold (power)",
         ylab = "Scale Free Topology Model Fit,signed R^2",
         type = "n",
         main = paste("Scale independence"))

    text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
         labels = powers,
         cex    = cex1,
         col    = "red");

    # this line marks an R^2 cut-off of 0.90
    abline(h = 0.90, col = "red")

    # Mean connectivity as a function of the soft-thresholding power
    plot(sft$fitIndices[,1], sft$fitIndices[,5],
         xlab = "Soft Threshold (power)",
         ylab = "Mean Connectivity",
         type = "n",
         main = paste("Mean connectivity"))

    text(sft$fitIndices[,1], sft$fitIndices[,5],
         labels = powers,
         cex    = cex1,
         col    = "red")
}
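
To get a feeling for what the soft-thresholding power does, here is a small base-R sketch that does not call `WGCNA` at all. It assumes the unsigned adjacency definition, that is, the absolute correlation between two genes raised to the chosen power. The data `toy.expr` and the helper `MeanConnectivity` are hypothetical names made up for this illustration. Since absolute correlations never exceed 1, a higher power can only shrink the adjacencies, which is why the mean connectivity curve goes down as the power goes up.

```r
set.seed(42)
n.cells <- 50

# hypothetical toy data: cells in rows, genes in columns (the orientation
# that WGCNA expects); g1-g3 share a common signal, g4-g6 are pure noise
shared <- rnorm(n.cells)
toy.expr <- cbind(
    g1 = shared + rnorm(n.cells, sd = 0.3),
    g2 = shared + rnorm(n.cells, sd = 0.3),
    g3 = shared + rnorm(n.cells, sd = 0.3),
    g4 = rnorm(n.cells),
    g5 = rnorm(n.cells),
    g6 = rnorm(n.cells)
)

# mean connectivity for a given power, assuming an unsigned adjacency
MeanConnectivity <- function (expr, power) {
    adjacency <- abs(cor(expr))^power
    diag(adjacency) <- 0            # a gene is not its own neighbour
    mean(rowSums(adjacency))
}

toy.powers <- c(1, 2, 4, 6, 10)
toy.connectivity <- sapply(toy.powers,
                           function (p) MeanConnectivity(toy.expr, p))
data.frame(power = toy.powers, mean.connectivity = toy.connectivity)
```

Weak (noise) correlations are suppressed much faster than strong ones, so the choice of power is a trade-off between removing spurious links and keeping enough connectivity to find modules.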

In [None]:
VisualizeClustering <- function (net) {
    # plot size
    options(repr.plot.width = 10, repr.plot.height = 6)

    # Convert labels to colors for plotting
    mergedColors <- labels2colors(net$colors)

    # Plot the dendrogram and the module colors underneath
    plotDendroAndColors(net$dendrograms[[1]],
                        mergedColors[net$blockGenes[[1]]],
                        "Module colors",
                        dendroLabels = FALSE,
                        hang = 0.03,
                        addGuide = TRUE,
                        guideHang = 0.05)
}

In [None]:
# prepare data structures for further analysis
datExpr <- list()
net <- list()

## P2449 clustering

In [None]:
sample.name <- "P2449"
exprMatrix <- exprMatrices[[sample.name]]
sampleSheet <- sampleSheets[[sample.name]]

In [None]:
datExpr[[sample.name]] <- PrepareDataForClustering(exprMatrix, sampleSheet, cut = 14000, ngenes = 5000)

In [None]:
PrepareClustering(datExpr[[sample.name]])

In [None]:
net[[sample.name]] <- blockwiseModules(datExpr[[sample.name]],
                        power             = 5,
                        TOMType           = "unsigned", 
                        minModuleSize     = 30,
                        reassignThreshold = 0,
                        mergeCutHeight    = 0.25,
                        numericLabels     = TRUE,
                        pamRespectsDendro = FALSE,
                        verbose           = 3)

In [None]:
table(net[[sample.name]]$colors)

In [None]:
VisualizeClustering(net[[sample.name]])

## P2458 clustering

In [None]:
sample.name <- "P2458"
exprMatrix <- exprMatrices[[sample.name]]
sampleSheet <- sampleSheets[[sample.name]]

In [None]:
datExpr[[sample.name]] <- PrepareDataForClustering(exprMatrix, sampleSheet, cut = 6000,
                                                  ngenes = 5000)

In [None]:
PrepareClustering(datExpr[[sample.name]])

In [None]:
net[[sample.name]] <- blockwiseModules(datExpr[[sample.name]],
                        power             = 6,
                        TOMType           = "unsigned", 
                        minModuleSize     = 30,
                        reassignThreshold = 0,
                        mergeCutHeight    = 0.25,
                        numericLabels     = TRUE,
                        pamRespectsDendro = FALSE,
                        verbose           = 3)

In [None]:
table(net[[sample.name]]$colors)

In [None]:
VisualizeClustering(net[[sample.name]])

## Projection of cells onto clustered space
Let's now focus on the "P2449" sample, which gives cleaner results. Once the gene modules have been identified, each cell can be projected onto a space of much lower dimension by summarizing the activity of the genes in each module.

In [None]:
mynet <- net[["P2449"]]
myColors <- mynet$colors
myGenes <- colnames(datExpr[["P2449"]])

I need to select the data from the **treated** J-Lat cells, keeping only the cells that are alive and only the genes that I selected before.

In [None]:
myExprMatrix <- exprMatrices[["P2449"]]

# select only the genes that we selected before
myExprMatrix <- myExprMatrix[myGenes, ]

# select only J-Lat treated cells
myExprMatrix <- myExprMatrix[, sampleSheets[["P2449"]]$label == "J-Lat+SAHA"]

# select only alive cells (more than 100000 total counts over the selected genes)
myExprMatrix <- myExprMatrix[, colSums(myExprMatrix) > 100000]

# finally, transpose to be interfaced to WGCNA
myExprMatrix <- t(myExprMatrix)

Now I can invoke the `moduleEigengenes` function from `WGCNA` to project the cells onto the space defined by the modules that we identified earlier.

In [None]:
# the module assignment must have one entry per column (gene) of myExprMatrix
MEs <- moduleEigengenes(myExprMatrix, myColors)
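
As a reminder of what a module eigengene is, here is a minimal base-R sketch for a single made-up module. As far as I understand, `moduleEigengenes` summarizes each module by the first principal component of the standardized expression of its genes; the code below reproduces that idea with `svd` and is not the package's implementation. The names `driver`, `module.expr` and `toy.eigengene` are hypothetical.

```r
set.seed(7)
n.cells <- 40

# hypothetical module: three genes that follow one hidden driver signal
driver <- rnorm(n.cells)
module.expr <- cbind(
    geneA = 2.0 * driver + rnorm(n.cells, sd = 0.2),
    geneB = 1.0 * driver + rnorm(n.cells, sd = 0.2),
    geneC = 0.5 * driver + rnorm(n.cells, sd = 0.2)
)

# standardize each gene, then take the first left singular vector of the
# cells x genes matrix: one value per cell, with unit norm
scaled <- scale(module.expr)
toy.eigengene <- svd(scaled)$u[, 1]

# the sign of a singular vector is arbitrary, so align it with the
# average standardized expression of the module
if (cor(toy.eigengene, rowMeans(scaled)) < 0) {
    toy.eigengene <- -toy.eigengene
}

# the eigengene should track the hidden driver closely
cor(toy.eigengene, driver)
```

Each module therefore contributes a single number per cell, which is what makes the projection low-dimensional.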

Based on these module eigengenes, we can now do the modelling.

In [None]:
myCells <- rownames(myExprMatrix)
hiv <- as.numeric(exprMatrices[["P2449"]]["FILIONG01", myCells])
eigengenes <- as.matrix(MEs$eigengenes)
model <- lm(formula = hiv ~ eigengenes)
summary(model)

Come to think of it, fitting a model that contains 34 variables may not be such a great idea, especially because the model contains modules that have not been tested for biological significance. I still need to check that those lovely dendrograms correspond to something meaningful.
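
To see why a model with 34 predictors deserves caution, here is a quick sanity check on pure noise. The number of cells (100) is an assumption chosen for illustration and is not the number of cells in my data; the predictors and the response are independent random numbers, so any apparent fit is spurious. All names starting with `noise.` are hypothetical.

```r
set.seed(123)
n.cells <- 100      # assumed for illustration only
n.modules <- 34     # same number of predictors as in the model above

# predictors and response are independent noise: there is nothing to find
noise.eigengenes <- matrix(rnorm(n.cells * n.modules), nrow = n.cells)
noise.response <- rnorm(n.cells)

noise.fit <- summary(lm(noise.response ~ noise.eigengenes))

# R^2 is inflated by the sheer number of predictors; the adjusted R^2
# penalizes for them and is the more honest number to look at
c(r.squared = noise.fit$r.squared, adj.r.squared = noise.fit$adj.r.squared)
```

Looking at the adjusted R^2, or fitting the HIV expression against one eigengene at a time, would be a safer first step than reading the full model at face value.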