# Prediction of Transcription Factories: A Machine Learning Blueprint

Transcription factories are the purported nuclear sites where transcripts are produced. Their relevance becomes clearer when the spatial organization of the genome is considered. Because transcription factories bring distant segments of the genome together, possibly for joint regulation, they give a truer picture of which regions co-localize and are co-expressed. The current suite of enrichment tools for genomic regions largely overlooks this aspect.

## Setting up Target Genomic Regions

Following [GREG](http://www.dx.doi.org/10.1093/database/baz162), we define genomic regions with a bin size of 2 kilobases. These bins are the anchors against which the read-count features are tabulated. Additionally, we want our BED file to be [1-based](http://genome.ucsc.edu/FAQ/FAQformat.html#format1), to align with the format in GREG.

From the chromosome size definitions available in the loaded package, we create an index as follows. This is the basis from which our genomic bins are carved out.

The function **tileGenome** builds the bins as a GenomicRanges object.
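The binning logic can be sketched in base R without any Bioconductor dependency. This is only an illustration, not the notebook's actual cell: the chromosome names and sizes below are hypothetical toy values, `makeBins` is a helper introduced here, and we assume that bins are cut per chromosome with the last bin truncated at the chromosome end.

```r
# Toy chromosome sizes (hypothetical values, for illustration only).
chromSizes <- c(chrA = 5000L, chrB = 3500L)

# Build consecutive 1-based, closed bins of a given width for one chromosome.
makeBins <- function(chrom, size, width = 2000L) {
  starts <- seq(1L, size, by = width)
  ends <- pmin(starts + width - 1L, size)  # clip the last bin at the chromosome end
  data.frame(Chrom = chrom, Start = starts, End = ends, stringsAsFactors = FALSE)
}

# One data frame per chromosome, stacked into a single table.
toyBins <- do.call(rbind, Map(makeBins, names(chromSizes), chromSizes))
rownames(toyBins) <- NULL
print(toyBins)
```

The resulting table has the same `Chrom`, `Start`, `End` layout as the bins used in the rest of this notebook, only on a much smaller scale.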

In [19]:
export.bed(Bins2k, "temp.bed")

In [16]:
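targetBED <- as.data.frame(Bins2k)
targetBED <- targetBED[, 1:3]
colnames(targetBED) <- c("Chrom", "Start", "End")
head(targetBED)
## Write tab-separated, without quotes, row names, or a header line.
write.table(targetBED, file = "hg19_2K_bins.bed", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)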

| | Chrom | Start | End |
|---|---|---|---|
| | `<fct>` | `<int>` | `<int>` |
| 1 | chr1 | 1 | 2000 |
| 2 | chr1 | 2001 | 4000 |
| 3 | chr1 | 4001 | 6000 |
| 4 | chr1 | 6001 | 8000 |
| 5 | chr1 | 8001 | 10000 |
| 6 | chr1 | 10001 | 12000 |


These are the regions to which the read counts will be mapped.

## Feature Files 

As an example, we pick ChIP-Seq data for four epigenetic marks (H3K4me1, H3K4me2, H3K4me3, and H3K27ac) in the H1 hESC cell line. (WIP; more instances to follow.)

![](siftingBAM.jpg)

Each BAM file is indexed first.

In [6]:
system("samtools index -b /Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K27ac/ENCFF663SAM.bam")
system("samtools index -b /Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K4me1/ENCFF441KOL.bam")
system("samtools index -b /Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K4me2/ENCFF799BDH.bam")
system("samtools index -b /Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K4me3/ENCFF340UJK.bam")

With the index files in place, we proceed to get read counts for our defined bin size. We do this with *multiBamSummary*, a tool from the **deepTools** suite.

In [20]:
## The bash script can be viewed at the terminal.

system("bash deepToolsmultiBAMSummary.sh")
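The contents of `deepToolsmultiBAMSummary.sh` are not shown here. As a rough sketch of what such a call may look like, the cell below assembles a `multiBamSummary BED-file` command in R. The BAM file names and the `readCounts.npz` output name are hypothetical placeholders, and we assume the BED file exported earlier (`temp.bed`) is the one passed to the tool; the BAM order mirrors the column order used later in this notebook.

```r
# Hypothetical file names; substitute the real BED and BAM paths.
bedFile  <- "temp.bed"
bamFiles <- c("H3K4me1.bam", "H3K4me2.bam", "H3K4me3.bam", "H3K27ac.bam")

# Assemble the call; the BAM order determines the column order of the output.
cmd <- paste(
  "multiBamSummary BED-file",
  "--BED", bedFile,
  "--bamfiles", paste(bamFiles, collapse = " "),
  "--outRawCounts readCountsFromBED.tab",
  "-o readCounts.npz"   # compressed matrix output (placeholder name)
)
cat(cmd, "\n")
# system(cmd)  # uncomment to run; requires deepTools on the PATH
```

Keeping the command in R rather than in a separate script makes the BAM order visible next to the column renaming that depends on it.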

In [55]:
countMatrix <- read.table("readCountsFromBED.tab", header = FALSE, sep="\t", quote="")

In [56]:
## Rename the read-count columns in the order in which the BAM files were processed.

colnames(countMatrix) <- c("Chrom", "Start", "End", "H3K4me1", "H3K4me2", "H3K4me3", "H3K27ac")
head(countMatrix)

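| | Chrom | Start | End | H3K4me1 | H3K4me2 | H3K4me3 | H3K27ac |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | chr1 | 0 | 2000 | 0 | 0 | 0 | 0 |
| 2 | chr1 | 2000 | 4000 | 0 | 0 | 0 | 0 |
| 3 | chr1 | 4000 | 6000 | 0 | 0 | 0 | 0 |
| 4 | chr1 | 6000 | 8000 | 0 | 0 | 0 | 0 |
| 5 | chr1 | 8000 | 10000 | 1 | 0 | 0 | 0 |
| 6 | chr1 | 10000 | 12000 | 1 | 0 | 0 | 0 |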


In [57]:
## To make it consistent with GREG and 1-based UCSC format.

countMatrix$Start <- countMatrix$Start + 1 
head(countMatrix)

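| | Chrom | Start | End | H3K4me1 | H3K4me2 | H3K4me3 | H3K27ac |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | chr1 | 1 | 2000 | 0 | 0 | 0 | 0 |
| 2 | chr1 | 2001 | 4000 | 0 | 0 | 0 | 0 |
| 3 | chr1 | 4001 | 6000 | 0 | 0 | 0 | 0 |
| 4 | chr1 | 6001 | 8000 | 0 | 0 | 0 | 0 |
| 5 | chr1 | 8001 | 10000 | 1 | 0 | 0 | 0 |
| 6 | chr1 | 10001 | 12000 | 1 | 0 | 0 | 0 |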


In [58]:
## The total number of reads differs across the epigenetic marks, which is why normalization is needed.

cat("The raw count for H3K4me1 reads is", sum(countMatrix$H3K4me1),"\n")
cat("The raw count for H3K4me2 reads is", sum(countMatrix$H3K4me2),"\n")
cat("The raw count for H3K4me3 reads is", sum(countMatrix$H3K4me3),"\n")
cat("The raw count for H3K27ac reads is", sum(countMatrix$H3K27ac))

The raw count for H3K4me1 reads is 8956558 
The raw count for H3K4me2 reads is 11751166 
The raw count for H3K4me3 reads is 12576587 
The raw count for H3K27ac reads is 13818969

In [45]:
## We normalize the count data to BPM (Bins Per Million mapped reads), the analogue of TPM in RNA-seq.
## A simple function performs this task for us.

source("bpmNormalize.R")
countMatrix$H3K4me1 <- bpmNormalize(countMatrix$H3K4me1)
countMatrix$H3K4me2 <- bpmNormalize(countMatrix$H3K4me2)
countMatrix$H3K4me3 <- bpmNormalize(countMatrix$H3K4me3)
countMatrix$H3K27ac <- bpmNormalize(countMatrix$H3K27ac)

## Let's check out the transformed matrix.
head(countMatrix)

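| | Chrom | Start | End | H3K4me1 | H3K4me2 | H3K4me3 | H3K27ac |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | chr1 | 1 | 2000 | 0.0 | 0 | 0 | 0 |
| 2 | chr1 | 2001 | 4000 | 0.0 | 0 | 0 | 0 |
| 3 | chr1 | 4001 | 6000 | 0.0 | 0 | 0 | 0 |
| 4 | chr1 | 6001 | 8000 | 0.0 | 0 | 0 | 0 |
| 5 | chr1 | 8001 | 10000 | 0.11165 | 0 | 0 | 0 |
| 6 | chr1 | 10001 | 12000 | 0.11165 | 0 | 0 | 0 |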
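The file `bpmNormalize.R` is not reproduced in this notebook. A minimal sketch of what such a function might do, assuming equal-width bins and that BPM simply rescales a column so that it sums to one million, is given below. The name `bpmSketch` and the toy counts are hypothetical and chosen only for illustration. Under this assumption a bin holding a single H3K4me1 read maps to 1e6 / 8956558 ≈ 0.11165, which agrees with the table above.

```r
# Assumed definition: scale a vector of per-bin counts so it sums to one million.
bpmSketch <- function(counts) {
  total <- sum(counts)
  if (total == 0) return(counts)  # nothing to scale
  counts / total * 1e6
}

toyCounts <- c(0, 0, 1, 3, 6)     # hypothetical raw counts for five bins
toyBPM <- bpmSketch(toyCounts)
print(toyBPM)
print(sum(toyBPM))                # one million, up to floating-point error
```

Because every column is divided by its own total, marks sequenced at different depths become directly comparable bin by bin.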


In [54]:
## A quick sanity check for the normalization: every column should now sum to one million.

cat("The normalized total for H3K4me1 is", sum(countMatrix$H3K4me1),"\n")
cat("The normalized total for H3K4me2 is", sum(countMatrix$H3K4me2),"\n")
cat("The normalized total for H3K4me3 is", sum(countMatrix$H3K4me3),"\n")
cat("The normalized total for H3K27ac is", sum(countMatrix$H3K27ac))

The normalized total for H3K4me1 is 1000000 
The normalized total for H3K4me2 is 1000000 
The normalized total for H3K4me3 is 1000000 
The normalized total for H3K27ac is 1000000

Now we are good to proceed. We shall use the *computeMatrix* tool from **deepTools** to prepare the input for *plotProfile* from the same suite.