# 2018-11-28 Transcription factors
So far we have analysed the genes without knowing anything about them (with a few exceptions). Now I want to study what happens to transcription factors, to assess whether there are any interesting results.

In [None]:
# biomaRt for obtaining information on genes
library(biomaRt)

# ggplot stuff
library(ggplot2)
library(RColorBrewer)
theme_set(theme_bw())

# DESeq
library(DESeq)

# extra goodies
library(Rfast)


In [None]:
# load the data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
merged <- read.table(sprintf('%s/exprMatrix.csv', matrices.dir),
                     header = TRUE, row.names = 1,
                     sep = "\t", check.names = FALSE)

# load sample sheet
sampleSheet <- read.table(sprintf('%s/samplesheet.csv', matrices.dir),
                          header = TRUE,
                          row.names = 1)

# remove dead cells
sampleSheet <- sampleSheet[sampleSheet$status != "dead", ]

# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, sep = "\t",
                        row.names = 1, stringsAsFactors = FALSE)
gene.data <- subset(gene.data, rownames(gene.data) %in% rownames(merged))

I downloaded the list of all the human transcription factors from http://humantfs.ccbr.utoronto.ca/download.php. Now I'll load that file and see how to match those names to the ones present in the gene list.

In [None]:
# load list
TF.list <- read.table(file = sprintf('%s/../TFs_Ensembl_v_1.01.txt', matrices.dir))
TF.list <- as.character(TF.list$V1)

In [None]:
# get gene short names and add the column to the list
gene.short.names <- gsub("\\..*", "", rownames(gene.data))
gene.data$gene_short_name <- gene.short.names

In [None]:
# get the list of transcription factors that are present in our list
TFs.short.names <- intersect(gene.short.names, TF.list)
TFs <- rownames(subset(gene.data, gene_short_name %in% TFs.short.names))

So now we have the names of the genes that we are interested in.
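
To make the matching step concrete, here is a minimal sketch on toy data. The identifiers below are made up for illustration and are not taken from the real files; the only thing carried over from the cells above is the idea of stripping the version suffix before comparing with the TF list.

```r
# Toy illustration with made-up identifiers (not from the real data).
versioned.ids <- c("ENSG00000000001.4", "ENSG00000000002.12", "ENSG00000000003.1")
tf.ids <- c("ENSG00000000002", "ENSG00000000003", "ENSG00000000009")

# strip everything from the first dot onwards, i.e. the version suffix
short.ids <- gsub("\\..*", "", versioned.ids)

# keep the versioned identifiers whose unversioned form is in the TF list
matched <- versioned.ids[short.ids %in% tf.ids]
print(matched)
```

Identifiers that appear only in the TF list (the last one in `tf.ids` here) are silently dropped, which is the same behaviour as the `intersect` call above.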

## Differential expression analysis

We begin the analysis by looking at which of the transcription factors are differentially expressed before and after treatment.

In [None]:
# group the cell types together as factors
groups <- factor(sampleSheet$label,
                 levels = c("Jurkat", "J-Lat+DMSO", "J-Lat+SAHA"))

# cast to integer
merged.int <- as.data.frame(lapply(merged, as.integer))
rownames(merged.int) <- rownames(merged)

# this is the basic data structure that DESeq understands
cds <- newCountDataSet(merged.int, groups)

# estimate size factors
cds <- estimateSizeFactors(cds)

# estimate dispersion
cds <- estimateDispersions(cds, sharingMode="gene-est-only")
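
As a reminder of what the size factors represent, here is a rough sketch of the median-of-ratios idea in base R. It is an illustration on a made-up count matrix, not the package's actual code, and the variable names are hypothetical.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns), made-up numbers.
counts <- matrix(c(10, 20, 40,
                   100, 200, 400,
                   5, 10, 20,
                   0, 3, 7),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# geometric mean of each gene across cells (zero if any count is zero)
geo.means <- exp(rowMeans(log(counts)))

# size factor of a cell: median ratio of its counts to the gene-wise
# geometric means, using only genes with a positive geometric mean
size.factors <- apply(counts, 2, function(cell) {
    median((cell / geo.means)[geo.means > 0])
})
print(size.factors)
```

Genes with a zero count in any cell do not contribute in this sketch, which is worth keeping in mind for sparse single-cell matrices.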

In [None]:
# do the differential expression analysis
de.test <- nbinomTest(cds, "J-Lat+DMSO", "J-Lat+SAHA")

In [None]:
# define treated cells
treated <- sampleSheet$status=="treated" & merged["FILIONG01", ] < 5000

In [None]:
# subset the TFs
TF.de.test <- subset(de.test, id %in% TFs)

# add also HIV correlation (rows taken in the same order as TF.de.test)
TF.de.test$corHIV <- as.numeric(cor(t(merged[as.character(TF.de.test$id), treated]),
                                    t(merged["FILIONG01", treated]), use="p"))

# remove genes whose log2 fold change is infinite or undefined
TF.de.test <- TF.de.test[!is.infinite(TF.de.test$log2FoldChange) & 
                       !is.nan(TF.de.test$log2FoldChange),]

# order by p-value
TF.de.test <- TF.de.test[order(TF.de.test$pval),]

# use ids as row names
rownames(TF.de.test) <- TF.de.test$id
TF.de.test <- TF.de.test[, -1]

# and calculate the p-value for the HIV correlation
TF.de.test$pHIV <- 0.0
for (i in 1:nrow(TF.de.test)) {
    TF.name <- rownames(TF.de.test)[i]
    TF.de.test$pHIV[i] <- cor.test(t(merged[TF.name, treated]), t(merged["FILIONG01", treated]),
                                  method = "pearson")$p.value
}

# add gene name for readability
TF.de.test$geneName <- gene.data[rownames(TF.de.test),]$gene_symbol
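
The loop over `cor.test` is fine for a few hundred genes, but the same p-values can be obtained in one go from the correlation coefficients. Below is a minimal sketch on simulated data (the matrix `expr` and the vector `reporter` are hypothetical stand-ins, not the real expression values), assuming no missing values and no constant rows.

```r
set.seed(1)
# Toy data: 5 hypothetical genes (rows) measured in 30 cells (columns)
expr <- matrix(rpois(5 * 30, lambda = 20), nrow = 5,
               dimnames = list(paste0("gene", 1:5), NULL))
reporter <- rpois(30, lambda = 50)

n <- length(reporter)
# Pearson correlation of every gene with the reporter in one call
r <- as.numeric(cor(t(expr), reporter))
# t statistic with n - 2 degrees of freedom and two-sided p-value
t.stat <- r * sqrt((n - 2) / (1 - r^2))
p.vectorised <- 2 * pt(-abs(t.stat), df = n - 2)

# the same p-values, one cor.test call per gene
p.loop <- apply(expr, 1, function(g) cor.test(g, reporter)$p.value)
print(all.equal(p.vectorised, unname(p.loop)))
```

With missing values the number of usable cells differs from gene to gene, so `n` would have to be computed per gene before using this shortcut.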

In [None]:
threshold <- 0.05

# add categories based on significance
TF.de.test$significance <- factor(rep("non-significant", nrow(TF.de.test)),
                                 levels = c("non-significant", "HIV", "DES", "both"))

# significant for differential expression
TF.de.test$significance[TF.de.test$pval < threshold] <- "DES"

# significant for HIV correlation
TF.de.test$significance[TF.de.test$pHIV < threshold] <- "HIV"

# significant for both
TF.de.test$significance[TF.de.test$pval < threshold & 
                       TF.de.test$pHIV < threshold] <- "both"
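
Note that the categories above are based on unadjusted p-values. As a sketch of how much a multiple-testing correction can change the picture, here is a toy example with base R's `p.adjust` and the Benjamini-Hochberg method. The p-values are made up; applying the same idea to `pval` and `pHIV` would be an alternative to what is done above, not what this notebook does.

```r
# Made-up p-values for six hypothetical genes
p.raw <- c(0.001, 0.008, 0.02, 0.04, 0.3, 0.9)
threshold <- 0.05

# Benjamini-Hochberg adjustment controls the false discovery rate
p.adj <- p.adjust(p.raw, method = "BH")

# number of genes passing the threshold before and after adjustment
n.raw <- sum(p.raw < threshold)
n.adj <- sum(p.adj < threshold)
print(c(raw = n.raw, adjusted = n.adj))
```

In this toy case one of the four genes that pass the raw threshold no longer passes after adjustment.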

In [None]:
options(repr.plot.width = 5, repr.plot.height = 3)
ggplot(TF.de.test, aes(pval, pHIV)) + geom_point(aes(color = significance)) +
scale_x_continuous(trans="log10") +
scale_y_continuous(trans="log10") +
geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
geom_vline(xintercept = threshold, linetype = "dashed", color = "red")

We identified transcription factors that are differentially expressed between non-treated and treated cells and whose expression is significantly correlated with HIV expression (both at an unadjusted p-value below 0.05). Let's have a look at the list of candidates.

In [None]:
significant.TFs <- subset(TF.de.test, significance == "both")
X <- data.frame(expr = t(merged[rownames(significant.TFs), treated]), hiv = t(merged["FILIONG01", treated]))
colnames(X) <- c(rownames(significant.TFs), "hiv")

In [None]:
# let's now plot all the results
options(repr.plot.width = 2.5, repr.plot.height = 2)
for (TF.name in rownames(significant.TFs)) {
    gg <- ggplot(X, aes_string(TF.name, "hiv")) + geom_point()  +
    geom_smooth(method='lm') +
    labs(x = gene.data[TF.name, "gene_symbol"], y = "GFP expression", 
         title = sprintf("p = %.3e", significant.TFs[TF.name, "pHIV"]))
    print(gg)
}

These results are interesting but not spectacular. More than anything else, we should really think of a way to analyse how the on-off results compare.