# 2018-11-15 Proximal genes
One question is whether the activity of genes located physically close to the HIV integration site affects whether HIV is reactivated.

Mie passed me the information on where the integration site is: it is reported in the paper "Chromatin Reassembly Factors Are Involved in Transcriptional Interference Promoting HIV Latency" by Gallastegui et al., Journal of Virology, 2011. Mie said that the cells used are J-Lat A2 clones. The first paragraph of the "Results" section of that paper reads, in part:

>J-Lat A2 cells contain the HIV construct at intron 8 of the UTX gene (ChXp11.3), in a configuration opposite to the transcriptional orientation of this gene (Fig. 1C) (28)

Figure 1C of that paper is the following:
![J-Lat A2 integration site](../figures/J-Lat-A2-integration.png)

From GeneCards I figured out that UTX is better known as KDM6A; its [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=KDM6A) states that it maps to chrX:44,732,423-44,971,847 (GRCh37/hg19).

So now I have a target region in the genome. I'd like to figure out which genes lie upstream or downstream of this region of interest, and assess whether their activity has any visible effect on the activity of the HIV promoter.

In [None]:
library(GenomicRanges)

In [None]:
library(Homo.sapiens)

In [None]:
library(dplyr)

In [None]:
library(biomaRt)

In [None]:
library(ggplot2)

In [None]:
# define the region of interest
chrom <- "chrX"
start <- 44732423
end <- 44971847
# padding (in bp) added on each side of the gene
amplitude <- 100000
region <- data.frame(chrom = chrom,
                     start = start-amplitude,
                     end   = end+amplitude)

# use the "GenomicRanges" package to define an object that we can use
region.gr <- makeGRangesFromDataFrame(region)

In [None]:
# now use the `subsetByOverlaps` function to determine which are the genes
# in the region defined
genes <- subsetByOverlaps(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), region.gr)
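
To make explicit what the overlap query above is doing, here is a minimal base-R sketch of the same idea. The gene names and coordinates in `toy.genes` are invented for illustration; only the KDM6A coordinates and the 100 kb padding come from this notebook. As far as I understand its defaults, `subsetByOverlaps` keeps any range that shares at least one base with the query, which is the test reproduced here.

```r
# hypothetical gene table: names and coordinates are invented for illustration
toy.genes <- data.frame(
    gene  = c("geneA", "geneB", "geneC", "geneD"),
    chrom = c("chrX", "chrX", "chrX", "chr1"),
    start = c(44500000, 44700000, 45050000, 44800000),
    end   = c(44600000, 44900000, 45200000, 44900000),
    stringsAsFactors = FALSE
)

# region of interest: KDM6A (hg19) padded by 100 kb on each side
roi.chrom <- "chrX"
roi.start <- 44732423 - 100000
roi.end   <- 44971847 + 100000

# two closed intervals overlap when each one starts before the other ends
overlaps <- toy.genes$chrom == roi.chrom &
            toy.genes$start <= roi.end &
            toy.genes$end   >= roi.start

proximal <- toy.genes$gene[overlaps]
print(proximal)
```

Only the two toy genes on chrX that touch the padded window are kept; the gene on chr1 is excluded even though its coordinates fall inside the same numeric range.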

In [None]:
# connect to the Ensembl BioMart human gene dataset
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

In [None]:
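# map the Entrez gene ids of the proximal genes to versioned Ensembl gene ids
# (note: more recent Ensembl releases name this attribute "entrezgene_id")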
gene.symbols <- getBM(attributes = c("entrezgene", "ensembl_gene_id_version"),
                  filters = "entrezgene",
                  values = genes$gene_id,
                  mart = mart)

Now that we have the list of genes that we are interested in, let's go back and load the data of the scRNA-seq project.

In [None]:
# basic data
matrices.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
sample.names <- c("P2449", "P2458")

# init data structures that will hold our data
exprMatrices <- list()
sampleSheets <- list()

# load data
for (sample.name in sample.names) {
    
    # file names
    matrix.fname <- sprintf("%s/%s.tsv.gz", matrices.dir, sample.name)
    sampleSheet.fname <- sprintf("%s/monocle/%s.pd.tsv", matrices.dir, sample.name)

    # parse data
    exprMatrices[[sample.name]] <- read.table(matrix.fname, header = TRUE, row.names = 1,
                                sep = "\t", check.names = FALSE)
    sampleSheets[[sample.name]] <- read.delim(sampleSheet.fname, header = TRUE, row.names = 1)
}

# load gene annotations file
gene.annotations <- sprintf("%s/gene_annotations.tsv", matrices.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, row.names = 1, sep = "\t")

In [None]:
# keep only the proximal genes that are present in the scRNA-seq gene annotations
gene.list <- intersect(rownames(gene.data), gene.symbols$ensembl_gene_id_version)

In [None]:
# prepare a data frame with the values of the expression of the genes that we selected
# and the expression of HIV

# pool together the expression matrices
exprMatrix <- cbind(exprMatrices[["P2449"]], exprMatrices[["P2458"]])

# select only treated cells
# (the sample sheets have one row per cell, so they are stacked with rbind)
sampleSheet <- rbind(sampleSheets[["P2449"]], sampleSheets[["P2458"]])
jlat.treated <- sampleSheet$label == "J-Lat+SAHA"
exprMatrix <- exprMatrix[, jlat.treated]

# select only alive cells
totalExpression <- colSums(exprMatrix)
alive <- totalExpression > 100000
exprMatrix <- exprMatrix[, alive]

# save the HIV values
HIV <- t(exprMatrix['FILIONG01', ])

# select only genes from our list
exprMatrix <- exprMatrix[gene.list, ]
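
The filtering above relies on the pooled sample sheet listing the cells in the same order as the columns of the pooled expression matrix. Here is a small sketch of that bookkeeping on toy data, assuming (as for the monocle `pd` files) that sample sheets have one row per cell and expression matrices have one column per cell. All values, cell names, and the `J-Lat+DMSO` label are invented for illustration.

```r
# toy expression matrices: genes in rows, cells in columns (invented values)
m1 <- matrix(1:6, nrow = 2, dimnames = list(c("g1", "g2"), c("c1", "c2", "c3")))
m2 <- matrix(7:10, nrow = 2, dimnames = list(c("g1", "g2"), c("c4", "c5")))

# toy sample sheets: one row per cell ("J-Lat+DMSO" is a hypothetical label)
s1 <- data.frame(label = c("J-Lat+SAHA", "J-Lat+DMSO", "J-Lat+SAHA"),
                 row.names = c("c1", "c2", "c3"), stringsAsFactors = FALSE)
s2 <- data.frame(label = c("J-Lat+DMSO", "J-Lat+SAHA"),
                 row.names = c("c4", "c5"), stringsAsFactors = FALSE)

pooled.matrix <- cbind(m1, m2)   # cells are columns: bind side by side
pooled.sheet  <- rbind(s1, s2)   # cells are rows: stack

# the logical mask is only meaningful if both objects list cells in the same order
stopifnot(identical(colnames(pooled.matrix), rownames(pooled.sheet)))

treated <- pooled.sheet$label == "J-Lat+SAHA"
treated.matrix <- pooled.matrix[, treated, drop = FALSE]
print(colnames(treated.matrix))
```

The `stopifnot` line is the useful part: it fails loudly if the matrix columns and the sheet rows ever get out of sync, instead of silently selecting the wrong cells.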

In [None]:
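# total: summed expression of the proximal genes in each cell
# coerce to plain numeric vectors so that the columns are really named HIV and UTX
dat <- data.frame(total = colSums(exprMatrix), HIV = as.numeric(HIV))
dat$UTX <- unlist(exprMatrix["ENSG00000147050.14", ], use.names = FALSE)

In [None]:
options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(dat, aes(total, HIV)) + geom_point()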

In [None]:
options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(dat, aes(HIV, UTX)) + geom_point()

In both scatter plots, HIV expression correlates poorly with the activity of the proximal genes, whether we look at their summed expression or at UTX alone.
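
To put a number on "poor correlation" rather than judging the scatter plots by eye, one option is a rank correlation between the per-cell HIV and UTX expression values. The sketch below uses simulated, independent counts as stand-ins (they are NOT the real data and say nothing about the actual result); on the real data, the two simulated vectors would be replaced by the HIV and UTX expression vectors.

```r
set.seed(1)

# simulated stand-ins for per-cell HIV and UTX expression: NOT real data
sim <- data.frame(HIV = rpois(200, lambda = 20),
                  UTX = rpois(200, lambda = 5))

# Spearman's rho is rank-based, so it tolerates the skew typical of count data
rho <- cor(sim$HIV, sim$UTX, method = "spearman")

# exact = FALSE avoids the warning about exact p-values in the presence of ties
p.value <- cor.test(sim$HIV, sim$UTX, method = "spearman", exact = FALSE)$p.value

print(c(rho = rho, p.value = p.value))
```

Spearman is chosen here over Pearson only because expression counts are skewed and full of ties; either would do as a first check.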