## Tools Installation 

The following tools need to be installed before continuing with the protocol.

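1. [bedtools](https://bedtools.readthedocs.io/en/latest/)
2. [samtools](http://www.htslib.org/) (used below for indexing the BAM files)
3. [deeptools](https://deeptools.readthedocs.io/) (used below for binning read counts)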

## Dataset Preparation

<p>The overall selection of data and the analysis protocol are loosely borrowed from Kim et al. (2016). In a typical machine learning problem, we have a single consolidated dataset (with class labels and variable/feature definitions) that is split, usually in a 3:7 or 2:8 ratio, into testing and training sets respectively. In this study, by contrast, the two sets are sourced differently, as you'll see.
In this exercise, the training dataset has both positive and negative examples, while the test dataset has only positive examples. That is perfectly fine.</p>
<p>The H1 cell line data for H3K27ac, H3K4me1, H3K4me2, and H3K4me3 have been sourced from the [ENCODE project](https://www.encodeproject.org/), as featured by the [Ren Lab](http://renlab.sdsc.edu/renlab_website/bing/). The genome version considered is **hg19**.</p>
<p>The data are downloaded as *bam* files and need to be converted to bed/bw files for further processing.</p>

In [10]:
# The individual bam files are indexed.

system("bash ./terminalScripts/samtoolsRun.sh /Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K27ac/ENCFF663SAM.bam \
/Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K4me1/ENCFF441KOL.bam \
/Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K4me2/ENCFF799BDH.bam \
/Users/soumyajauhari/Desktop/Machine_Learning/Machine_Learning_Deep_Learning/data/H1_Cell_Line/H3K4me3/ENCFF340UJK.bam")

After having the BAM files indexed, we proceed to bin them into the desired intervals of 2Kb (since that is the aggregate size of the enhancers), and for consistency the non-enhancer segments shall be of the same size as well. We achieve this via the **deeptools** suite, using a function called [multiBamSummary](https://deeptools.readthedocs.io/en/develop/content/tools/multiBamSummary.html).

In [11]:
## The bash script can be viewed at the terminal.

system("bash ./terminalScripts/deepToolsmultiBAMSummaryFeatures.sh")
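
The bash script itself is not reproduced in this notebook. To make the binning step concrete, here is a minimal base R sketch that tiles chromosomes into fixed 2Kb windows using 0-based starts and exclusive ends, as in BED files. The helper `make_bins` and the toy chromosome sizes are hypothetical and only illustrate the layout of the bins; the actual read counting is done by deeptools.

```r
# Hypothetical helper: tile each chromosome into fixed-width bins
# (0-based start, exclusive end, as in BED files).
make_bins <- function(chrom_sizes, width = 2000) {
  bins <- lapply(seq_len(nrow(chrom_sizes)), function(i) {
    starts <- seq(0, chrom_sizes$size[i] - 1, by = width)
    data.frame(chrom = chrom_sizes$chrom[i],
               start = starts,
               # the last bin is clipped at the chromosome end
               end   = pmin(starts + width, chrom_sizes$size[i]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, bins)
}

# Toy chromosome sizes (illustrative values, not real hg19 lengths)
toy_sizes <- data.frame(chrom = c("chrA", "chrB"),
                        size  = c(5000, 2000),
                        stringsAsFactors = FALSE)
toy_bins <- make_bins(toy_sizes)
print(toy_bins)
```

Note that the last bin of a chromosome is shorter than 2Kb whenever the chromosome length is not a multiple of the bin width.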

In [20]:
## Loading the output from multiBAMSummary into R.

countMatrix <- read.table("readCountsFeatures.tab", header = FALSE, sep="\t", quote="")

In [21]:
## Renaming columns of read counts on the order of execution

colnames (countMatrix) <- c("chrom", "start", "end", "H3K27ac","H3K4me1", "H3K4me2", "H3K4me3")
head(countMatrix)

Unnamed: 0_level_0,chrom,start,end,H3K27ac,H3K4me1,H3K4me2,H3K4me3
Unnamed: 0_level_1,<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,chr1,0,2000,0,0,0,0
2,chr1,2000,4000,0,0,0,0
3,chr1,4000,6000,0,0,0,0
4,chr1,6000,8000,0,0,0,0
5,chr1,8000,10000,0,1,0,0
6,chr1,10000,12000,0,1,0,0


In [22]:
## To make it consistent with GREG and 1-based UCSC format.

countMatrix$start <- countMatrix$start + 1 
head(countMatrix)

Unnamed: 0_level_0,chrom,start,end,H3K27ac,H3K4me1,H3K4me2,H3K4me3
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,chr1,1,2000,0,0,0,0
2,chr1,2001,4000,0,0,0,0
3,chr1,4001,6000,0,0,0,0
4,chr1,6001,8000,0,0,0,0
5,chr1,8001,10000,0,1,0,0
6,chr1,10001,12000,0,1,0,0
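
As a quick sanity check on the coordinate shift, the following base R sketch converts between the 0-based, half-open BED layout and the 1-based, closed layout, and confirms that the interval width is the same in both. The helper names `bed_to_one_based` and `one_based_to_bed` are hypothetical.

```r
# Hypothetical helpers for switching between coordinate conventions.
bed_to_one_based <- function(df) { df$start <- df$start + 1; df }
one_based_to_bed <- function(df) { df$start <- df$start - 1; df }

toy_bed <- data.frame(chrom = "chr1",
                      start = c(0, 2000),
                      end   = c(2000, 4000),
                      stringsAsFactors = FALSE)
toy_one_based <- bed_to_one_based(toy_bed)

# Width in BED (half-open) is end - start; in 1-based closed it is end - start + 1.
width_bed       <- toy_bed$end - toy_bed$start
width_one_based <- toy_one_based$end - toy_one_based$start + 1
print(toy_one_based)
print(rbind(width_bed, width_one_based))
```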


In [23]:
## The total number of reads differs across the epigenetic marks, which is worth noting and motivates the need for normalization.

cat("The sum of raw counts for H3K4me1 reads is", sum(countMatrix$H3K4me1),"\n")
cat("The sum of raw counts for H3K4me2 reads is", sum(countMatrix$H3K4me2),"\n")
cat("The sum of raw counts for H3K4me3 reads is", sum(countMatrix$H3K4me3),"\n")
cat("The sum of raw counts for H3K27ac reads is", sum(countMatrix$H3K27ac))

The sum of raw counts for H3K4me1 reads is 8956558 
The sum of raw counts for H3K4me2 reads is 11751166 
The sum of raw counts for H3K4me3 reads is 12576587 
The sum of raw counts for H3K27ac reads is 13818969

In [24]:
## We plan to normalize count data for BPM = Bins Per Million mapped reads, same as TPM in RNA-seq.
## A simple function performs this task for us.

source("./projectFunctions/bpmNormalize.R")
countMatrix$H3K4me1 <- bpmNormalize(countMatrix$H3K4me1)
countMatrix$H3K4me2 <- bpmNormalize(countMatrix$H3K4me2)
countMatrix$H3K4me3 <- bpmNormalize(countMatrix$H3K4me3)
countMatrix$H3K27ac <- bpmNormalize(countMatrix$H3K27ac)

## Let's check out the transformed matrix.
head(countMatrix)

Unnamed: 0_level_0,chrom,start,end,H3K27ac,H3K4me1,H3K4me2,H3K4me3
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
1,chr1,1,2000,0,0.0,0,0
2,chr1,2001,4000,0,0.0,0,0
3,chr1,4001,6000,0,0.0,0,0
4,chr1,6001,8000,0,0.0,0,0
5,chr1,8001,10000,0,0.11165,0,0
6,chr1,10001,12000,0,0.11165,0,0


In [25]:
## A quick check for the normalization is that every column shall now sum to the same total (one million).

cat("The sum of normalized counts for H3K4me1 reads is", sum(countMatrix$H3K4me1),"\n")
cat("The sum of normalized counts for H3K4me2 reads is", sum(countMatrix$H3K4me2),"\n")
cat("The sum of normalized counts for H3K4me3 reads is", sum(countMatrix$H3K4me3),"\n")
cat("The sum of normalized counts for H3K27ac reads is", sum(countMatrix$H3K27ac))

The sum of normalized counts for H3K4me1 reads is 1e+06 
The sum of normalized counts for H3K4me2 reads is 1e+06 
The sum of normalized counts for H3K4me3 reads is 1e+06 
The sum of normalized counts for H3K27ac reads is 1e+06

Cool. The normalized counts above are the genome-wide coverages of the respective histone marks in the given cell line (H1), over fixed 2Kb regions.

<img src="./props/Data_Schema.jpg">

### Testing Data

For testing purposes, we consider data only for enhancers. Even with this data alone, we shall be able to evaluate the model's ability to predict positive examples.

The individual BPM levels of the histone marks are stacked together over the 2Kb intervals that represent the probable enhancers.

In [26]:
## Labelling the coverage counts with the class "enhancer".

countMatrix$Class <- "enhancer"
head(countMatrix)


## Identifying examples of standard chromosomes only and filtering the residuals.

chromosomes <- c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14",
                 "chr15", "chr16", "chr17", "chr18", "chr19", "chr20","chr21", "chr22", "chrX", "chrY")
countMatrix<- as.data.frame(countMatrix[countMatrix$chrom %in% chromosomes, ])

Unnamed: 0_level_0,chrom,start,end,H3K27ac,H3K4me1,H3K4me2,H3K4me3,Class
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,chr1,1,2000,0,0.0,0,0,enhancer
2,chr1,2001,4000,0,0.0,0,0,enhancer
3,chr1,4001,6000,0,0.0,0,0,enhancer
4,chr1,6001,8000,0,0.0,0,0,enhancer
5,chr1,8001,10000,0,0.11165,0,0,enhancer
6,chr1,10001,12000,0,0.11165,0,0,enhancer


As highlighted above, the counts are the BPM levels; *BPM = Bins Per Million mapped reads, same as TPM in RNA-seq*. BPM (per bin) = number of reads in the bin / total number of reads across all bins (in millions).
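
The file `bpmNormalize.R` is not shown in this notebook, so the following is only a sketch of what such a function could look like under the formula above; `bpm_sketch` is a hypothetical stand-in, not the project's own function. It is consistent with the output shown earlier: one read in the H3K4me1 column, whose raw total is 8956558, scales to roughly 0.11165.

```r
# Hypothetical stand-in for bpmNormalize(): scale a vector of per-bin
# read counts so that the whole vector sums to one million.
bpm_sketch <- function(counts) {
  total <- sum(counts)
  if (total == 0) return(counts)  # nothing to scale
  counts / total * 1e6
}

raw_counts <- c(0, 0, 3, 5, 2)
bpm_counts <- bpm_sketch(raw_counts)
print(bpm_counts)
print(sum(bpm_counts))

# One read in a column holding 8956558 reads in total (the H3K4me1 sum above)
single_read_bpm <- 1 / 8956558 * 1e6
print(single_read_bpm)
```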

The deep learning model has two basic requirements for the input data.
1. The data has to be *numeric* in type.
2. It has to range from 0 to 1. If it doesn't already, a normalization procedure can bring it into that range. The preferred one is min-max normalization.

<p> We shall transform the data when applying the machine learning algorithms. </p>
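
For reference, min-max normalization maps each column to the range 0 to 1 via (x - min) / (max - min). A minimal base R sketch follows; the helper `min_max` and the toy values are hypothetical, and mapping a constant column to zero is an assumption made here to avoid division by zero.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: map to 0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy BPM-like values (illustrative only)
toy_scores <- data.frame(H3K27ac = c(0, 0.5, 2),
                         H3K4me1 = c(0.11165, 0, 0.4466))
toy_scaled <- as.data.frame(lapply(toy_scores, min_max))
print(toy_scaled)
```

When separate training and test sets are involved, it is common to compute the minimum and maximum on the training data and reuse them for the test data, so that both are scaled identically.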

### Positive Class

Importing relevant data that has been preprocessed explicitly. This is for the positive class labels.

Since the positive class labels have to be featured around 2Kb intervals as well, we need to synchronise the data accordingly. The individual BAM files for the DHS and EP300 binding sites that characterize enhancers are downloaded for the relevant cell type and sorted. The DNase-Seq data is from John Stamatoyannopoulos's laboratory at the University of Washington.

In [None]:
install.packages("curl")
library(curl)

curl_download("https://www.encodeproject.org/files/ENCFF923SKV/@@download/ENCFF923SKV.bam", "ENCFF923SKV.bam") # DNase-Seq data
curl_download("https://www.encodeproject.org/files/ENCFF832OFG/@@download/ENCFF832OFG.bam", "ENCFF832OFG.bam") # EP300

Same as before, we shall index the BAM files and then proceed towards creating a matrix of read counts for this data.

In [2]:
system("bash ./terminalScripts/samtoolsRun.sh ENCFF923SKV.bam ENCFF832OFG.bam")

In [None]:
system("bash ./terminalScripts/deepToolsmultiBAMSummaryPositiveClass.sh")

Let us import the coverage matrix now.

In [4]:
## Loading the output from multiBAMSummary into R.

countMatrixPositiveClass <- read.table("readCountsPositiveClass.tab", header = FALSE, sep="\t", quote="")

In [5]:
## Renaming columns of read counts on the order of execution

colnames (countMatrixPositiveClass) <- c("chrom", "start", "end", "EP300", "DHS")
head(countMatrixPositiveClass)

Unnamed: 0_level_0,chrom,start,end,EP300,DHS
Unnamed: 0_level_1,<fct>,<int>,<int>,<dbl>,<dbl>
1,chr1,0,2000,0,0
2,chr1,2000,4000,0,0
3,chr1,4000,6000,0,0
4,chr1,6000,8000,0,0
5,chr1,8000,10000,3,0
6,chr1,10000,12000,5,16


In [6]:
## We plan to normalize count data for BPM = Bins Per Million mapped reads, same as TPM in RNA-seq.
## A simple function performs this task for us.

source("./projectFunctions/bpmNormalize.R")
countMatrixPositiveClass$EP300 <- bpmNormalize(countMatrixPositiveClass$EP300)
countMatrixPositiveClass$DHS <- bpmNormalize(countMatrixPositiveClass$DHS)

## Let's check out the transformed matrix.
head(countMatrixPositiveClass)

Unnamed: 0_level_0,chrom,start,end,EP300,DHS
Unnamed: 0_level_1,<fct>,<int>,<int>,<dbl>,<dbl>
1,chr1,0,2000,0.0,0.0
2,chr1,2000,4000,0.0,0.0
3,chr1,4000,6000,0.0,0.0
4,chr1,6000,8000,0.0,0.0
5,chr1,8000,10000,0.2151464,0.0
6,chr1,10000,12000,0.3585773,0.648919


We have not yet incremented the "start" index by 1, as we did for the histone marks. We can do it now.

In [7]:
countMatrixPositiveClass$start <- countMatrixPositiveClass$start +1
head(countMatrixPositiveClass)

Unnamed: 0_level_0,chrom,start,end,EP300,DHS
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<dbl>,<dbl>
1,chr1,1,2000,0.0,0.0
2,chr1,2001,4000,0.0,0.0
3,chr1,4001,6000,0.0,0.0
4,chr1,6001,8000,0.0,0.0
5,chr1,8001,10000,0.2151464,0.0
6,chr1,10001,12000,0.3585773,0.648919


Now, to define enhancer regions we identify those entries in *countMatrixPositiveClass* that have valid, i.e. non-zero, coverages for both the marks.

In [8]:
# countMatrixPositiveClass <- countMatrixPositiveClass[countMatrixPositiveClass$EP300 !=0 & countMatrixPositiveClass$DHS!=0, ]

#### Working with DNase Hypersensitivity Sites (DHS) data

A few functional packages shall be necessary here. 

In [9]:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("rtracklayer", "GenomicRanges"))

Bioconductor version 3.9 (BiocManager 1.30.10), R 3.6.0 (2019-04-26)

Installing package(s) 'rtracklayer', 'GenomicRanges'







The corresponding file is sifted for standard chromosomes.

In [10]:
## Calling libraries

library(rtracklayer)
library(GenomicRanges)

dhs_rev <- import.bed("./data/H1_Cell_Line/GSM878621_H1_DNase_sorted.bed")
chromosomes <- c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14",
                 "chr15", "chr16", "chr17", "chr18", "chr19", "chr20","chr21", "chr22", "chrX", "chrY")
dhs_rev_chr <- dhs_rev[seqnames(dhs_rev) %in% chromosomes]

Our quest for enhancer prediction would likely require us to examine 2000bp regions. The bins are resized in accordance. 

In [13]:
dhs_rev_chr_resize <- resize(dhs_rev_chr, width = 2000)
dhs_final <- as.data.frame(dhs_rev_chr_resize)
export.bed(dhs_final[,1:3], "dhs_final.bed")



We are considering the basic attributes of genomic regions, i.e. chromosomal locations marked by chromosome, start index, and end index. To ensure the integrity of the data, let's check that every interval has a positive start and an end greater than its start.

In [None]:
# for data integrity

system("awk 'BEGIN{OFS=\"\\t\"} $2>0 && $3>$2 {print $1, $2, $3}' dhs_final.bed > dhs_finale.bed") # inner quotes escaped so that the R string parses
system("sortBed -i dhs_finale.bed > dhs_finale_sorted.bed")
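
The same integrity check can also be expressed directly in R, without shelling out to awk. The sketch below works on a plain data frame with columns `chrom`, `start`, and `end`; the helper `filter_valid_intervals` and the toy rows are hypothetical.

```r
# Hypothetical R equivalent of the awk filter above:
# keep rows with start > 0 and end > start, and only the first three columns.
filter_valid_intervals <- function(bed) {
  keep <- bed$start > 0 & bed$end > bed$start
  bed[keep, c("chrom", "start", "end")]
}

toy_dhs <- data.frame(chrom = c("chr1", "chr1", "chr2", "chr2"),
                      start = c(100, 0, 500, 900),
                      end   = c(2100, 2000, 400, 2900),
                      score = c(1, 2, 3, 4),
                      stringsAsFactors = FALSE)
valid_dhs <- filter_valid_intervals(toy_dhs)
print(valid_dhs)
```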

Looks good for now. Next, we treat the p300 binding sites' data.

#### Working with p300 bindings on the genome

In [23]:
## We shall broadly follow the same protocol for data treatment.

p300_rev <- import.bed("./data/H1_Cell_Line/GSM831036_H1_P300_sorted.bed")
BiocManager::install("diffloop")
library(diffloop)
p300_rev <- addchr(p300_rev) # Add prefix "chr" to the seqnames #
p300_rev_chr <- p300_rev[seqnames(p300_rev) %in% chromosomes]

p300_rev_chr_resize <- resize(p300_rev_chr, width = 2000)
p300_rev_final <- as.data.frame(p300_rev_chr_resize)
p300_rev_final <- unique(p300_rev_final[,c("seqnames","start","end")]) # removing redundant entries #
export.bed(p300_rev_final, "p300_rev_final.bed")

system("sortBed -i p300_rev_final.bed > p300_rev_final_sorted.bed")
system("awk 'BEGIN{OFS=\"\\t\"} {print $1, $2, $3}' p300_rev_final_sorted.bed > p300_rev_finale.bed") # inner quotes escaped so that the R string parses

Now that we have the p300 bindings genome-wide, we shall filter them for being distant from the promoter/TSS regions. Note that we are solely interested in the ones that are distal to the TSS; **distal as in non-proximal!**
GRanges objects are well suited to set operations.

In [None]:
p300_rev_finale <- import.bed("p300_rev_finale.bed")

#### Transcription Start Sites  (TSS)

According to Wikipedia, an enhancer is a short (50-1500 bp) region of DNA that can be bound by proteins. Enhancers can be located quite far from the promoter sequences of the genes that house the TSS. The transcription start sites' indices (start and end positions) are 'constant' throughout the genome: gene positions are the same across cell types, and what differs between cell types is the set of genes that get regulated. One source for downloading the TSS data is 'Ensembl Biomart'. <br>
> Step 1: Choose 'Human genes' under 'Dataset' tab on the left pane. <br> <br>
> Step 2: Under 'Attributes', select <br>
    (i) Chromosome/ scaffold name <br>
    (ii) Transcript start (bp) <br>
    (iii) Transcript end (bp) <br> <br>
> Step 3: Click on 'Results' button and download appropriately. <br>  
The other sources are refTSS and DBTSS databases. <br>

In [None]:
## Let's pull the TSS sites

tss_sites <- read.table("./data/H1_Cell_Line/TSS_Indices_Human_Genome.txt", sep = "\t", header = TRUE)
tss_sites$Chromosome.scaffold.name <- paste0("chr", tss_sites$Chromosome.scaffold.name)
tss_sites <- as.data.frame(tss_sites[tss_sites$Chromosome.scaffold.name  %in% chromosomes, ])
tss_sites <- tss_sites[order(tss_sites$Chromosome.scaffold.name),]
colnames(tss_sites) <- c("chrom", "start", "end")
tss_rev <- GRanges(tss_sites)

## We remove from the p300 binding sites the portions that overlap the TSS intervals. This will give us the distal p300 sites.

p300_nonTSS <- setdiff(p300_rev_finale,tss_rev)
export.bed(p300_nonTSS, "p300_nonTSS.bed")
system("awk 'BEGIN{OFS=\"\\t\"} {print $1, $2, $3}' p300_nonTSS.bed > p300_nonTSS_final.bed") # inner quotes escaped so that the R string parses

## Finally, saving the intersection of distal p300 and DHS sites

system("intersectBed -a dhs_finale_sorted.bed -b p300_nonTSS_final.bed > h1_distalp300_dhs_intersect.bed")
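
A note on semantics: `setdiff()` on GRanges objects subtracts the overlapping bases, so a p300 window that partly overlaps a TSS interval is trimmed rather than discarded. If one instead wants to drop every window that touches a TSS interval, the idea can be sketched in base R as below (GenomicRanges offers `subsetByOverlaps(..., invert = TRUE)` for the same purpose). The helper `overlaps_any` and the toy coordinates are hypothetical, and the loop is quadratic, so this is meant for illustration rather than genome-scale data.

```r
# Hypothetical helper: TRUE for each query interval that overlaps at least
# one subject interval on the same chromosome (1-based, closed coordinates).
overlaps_any <- function(query, subject) {
  vapply(seq_len(nrow(query)), function(i) {
    same_chr <- subject$chrom == query$chrom[i]
    any(same_chr &
        subject$start <= query$end[i] &
        subject$end   >= query$start[i])
  }, logical(1))
}

# Toy p300 windows and TSS intervals (illustrative coordinates)
p300_toy <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                       start = c(1000, 9000, 1000),
                       end   = c(2999, 10999, 2999),
                       stringsAsFactors = FALSE)
tss_toy  <- data.frame(chrom = c("chr1", "chr2"),
                       start = c(2500, 5000),
                       end   = c(4000, 6000),
                       stringsAsFactors = FALSE)

# Keep only the windows that do not touch any TSS interval
distal_toy <- p300_toy[!overlaps_any(p300_toy, tss_toy), ]
print(distal_toy)
```

In this toy example the first chr1 window overlaps a TSS interval and is dropped entirely, whereas a base-wise set difference would have kept its non-overlapping part (1000 to 2499).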

As illustrated above, DHS sites represent open chromatin regions that are accessible to protein binding. p300, in turn, is a co-activator protein regarded as a strong marker of enhancer regions. The intersection of p300 binding sites and open chromatin regions is therefore taken here as the set of likely enhancers.

In [None]:
## Add class to the data: "enhancer"

h1_distalp300_dhs_intersect <- read.table("h1_distalp300_dhs_intersect.bed", sep = "\t")
h1_distalp300_dhs_intersect$V4 <- "Enhancer"
write.table(h1_distalp300_dhs_intersect,"h1_distalp300_dhs_intersect_class.bed",sep="\t",row.names=FALSE, quote = FALSE)

Now, for the negative class labels, i.e. non-enhancers.

### Negative Class Labels

For the negative class, i.e. non-enhancers, we assemble the intersection of transcription start sites and DHS sites, along with random tracks from the genome that are distal to p300 and TSS sites, serving as a background.

In [None]:
# Here, we intend to find the intersection of TSS with DHS sites.
dhs_min <- import.bed("dhs_finale_sorted.bed")
tss_dhs <- intersect(tss_rev, dhs_min)

#### Random Sites in the Human Genome

In [9]:
## The other aspect for mapping negative class labels (non-enhancers) entails sourcing chromosomal lengths in the human genome for generating random tracks

hg38_chrom_sizes <- read.table(url("https://genome.ucsc.edu/goldenPath/help/hg38.chrom.sizes"), sep = "\t", header = FALSE, col.names = c("chrom", "size"))
hg38_chrom_sizes <- as.data.frame(hg38_chrom_sizes[hg38_chrom_sizes$chrom %in% chromosomes, ])

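## Saving file for generating random tracks via 'bedtools random' function ##
## The genome file is written without a header line, as it should only hold chromosome names and sizes.
write.table(hg38_chrom_sizes,"./data/H1_Cell_Line/hg38.genome", sep="\t", row.names=FALSE, col.names=FALSE, quote = FALSE)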

In [None]:
## Generating random tracks

system("bedtools random -l 2000 -g ./data/H1_Cell_Line/hg38.genome > ./data/H1_Cell_Line/hg38_random_tracks.bed")

## Recalling

hg38_random_tracks <- read.table("./data/H1_Cell_Line/hg38_random_tracks.bed", sep = "\t", header = FALSE)
hg38_random_tracks <- hg38_random_tracks[,c(1,2,3)]
hg38_random_tracks_ordered <- hg38_random_tracks[order(hg38_random_tracks[,1],hg38_random_tracks[,2]),]

write.table(hg38_random_tracks_ordered,"./data/H1_Cell_Line/hg38_random_tracks_sorted_required.bed", sep="\t", row.names=FALSE, quote = FALSE)
colnames(hg38_random_tracks_ordered) <- c("chrom", "start", "end")
hg38_random <- GRanges(hg38_random_tracks_ordered)


The option -l in *bedtools random* allows the user to specify the interval size of the tracks that are to be randomly generated from the genome file. The option itself is optional, but in our case we have to stay in sync with the interval size that we selected when compiling the score matrix for the histone marks. <br>
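
To illustrate what such random tracks look like, here is a minimal base R sketch that draws fixed-width intervals from a table of chromosome sizes, picking chromosomes in proportion to their length. It is not a reimplementation of *bedtools random*; the helper `random_intervals`, its arguments, and the toy sizes are hypothetical. For the real tool, the bedtools documentation also lists `-n` for the number of intervals and `-seed` for reproducibility.

```r
# Hypothetical helper: draw n random intervals of a fixed width
# (0-based start, exclusive end) that fit inside their chromosome.
random_intervals <- function(chrom_sizes, n, width = 2000, seed = 1) {
  set.seed(seed)  # fixed seed so the draw is reproducible
  eligible <- chrom_sizes[chrom_sizes$size >= width, ]
  idx <- sample(seq_len(nrow(eligible)), n, replace = TRUE,
                prob = eligible$size)
  max_start <- eligible$size[idx] - width
  starts <- floor(runif(n) * (max_start + 1))
  data.frame(chrom = eligible$chrom[idx],
             start = starts,
             end   = starts + width,
             stringsAsFactors = FALSE)
}

# Toy chromosome sizes (illustrative values only)
toy_genome <- data.frame(chrom = c("chrA", "chrB"),
                         size  = c(100000, 50000),
                         stringsAsFactors = FALSE)
random_toy <- random_intervals(toy_genome, n = 5)
print(random_toy)
```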

In [None]:
system("mergeBed -i hg_random_tracks_sorted_required.bed > hg_random_tracks_sorted_required_merged.bed")

In [None]:
system("awk '{if (NR!=1) {print}}' hg_random_tracks_sorted_required_merged.bed > hg_random_tracks_sorted_required_merged_header_removed.bed")

In [4]:
## Choosing random sites distal to TSS and p300 bindings

## The strategy is to keep only those random tracks that are distal to both the TSS and the p300 binding sites.
## We first combine the TSS and p300 binding sites, and then subtract this combined set from the random
## tracks, which leaves us with the residual (background) regions.


## Combine the two interval sets. Note that union() on GRanges also merges overlapping ranges.
p300_tss <- union(p300_rev_finale, tss_rev)


## Inferring random tracks distal to p300 bindings and TSS sites.
random_nonp300TSS <- setdiff(hg38_random,p300_tss)


## Merging the aforementioned random tracks and the TSS, DHS intersection.
random_tss_dhs <- union(random_nonp300TSS, tss_dhs)

## Output file.
write.table(random_tss_dhs,"./data/H1_Cell_Line/random_tss_dhs.bed", sep="\t", row.names=FALSE, quote = FALSE)
system("awk '{if (NR!=1) {print}}' ./data/H1_Cell_Line/random_tss_dhs.bed > ./data/H1_Cell_Line/random_tss_dhs_header_removed.bed")

## Import resultant files from intersection.
negative_class <- read.table("./data/H1_Cell_Line/random_tss_dhs_header_removed.bed", sep = "\t", header = FALSE)
negative_class$V4 <- "Non-Enhancer" # replaces the fourth column with the class label
negative_class$V5 <- NULL # removing strand information
write.table(negative_class,"./data/H1_Cell_Line/negative_class.bed", sep="\t", row.names=FALSE, quote = FALSE)

positive_class <- read.table("h1_distalp300_dhs_intersect_class.bed", sep = "\t", header = FALSE)

## Saving auxiliary information
save(p300_rev, dhs_rev, file="aux.RData")

Again as before, we remove the header and sort the file "tss_and_p300.bed". The resulting file is named "tss_and_p300_header_removed_sorted.bed". <br> Additionally, the resultant file after intersecting the TSS and DHS sites from the H1 cell line is "tss_h1_dhs_intersect_sorted_merged.bed". <br>
**The filenames are so chosen to reflect the order and type of manipulations that have been applied.**

Now to consolidate the negative class data, we have to : <br>
> 1. Find the random sites that are distal to known p300 and TSS regions. <br>
> 2. Find the intersection of TSS and DHS sites. <br>
> 3. Finally, club both of these together to represent a comprehensive non-enhancer region space.

In [None]:
## All the regions in the random sites of the human genome but not coinciding with the p300 and tss sites.
system("intersectBed -v -a hg19_random_tracks_sorted_required_header_removed.bed -b tss_and_p300_header_removed_sorted.bed > true_random_to_p300_and_TSS.bed")

In [None]:
system("cat true_random_to_p300_and_TSS.bed tss_h1_dhs_intersect_sorted_merged.bed > negative_class.bed")

In [20]:
## Import resultant files from intersection.
negative_class <- read.table("./data/H1_Cell_Line/negative_class.bed", sep = "\t", header = FALSE)
negative_class$V4 <- "Non-Enhancer"
write.table(negative_class,"./data/H1_Cell_Line/negative_class.bed", sep="\t", row.names=FALSE, quote = FALSE)

positive_class <- read.table("./data/H1_Cell_Line/h1_p300_dhs_intersect_class_header_removed.bed", sep = "\t", 
                             header = FALSE)

In [22]:
## Merging Data on the basis of overlapping intervals

## Positive class data (labels)
head(positive_class)

V1,V2,V3,V4
<fct>,<int>,<int>,<fct>
chr1,10101,10140,enhancer
chr1,10148,10209,enhancer
chr1,10235,10290,enhancer
chr1,10444,10580,enhancer
chr1,11362,11397,enhancer
chr1,12302,12333,enhancer


In [23]:
## Negative class (Labels)
head(negative_class)


V1,V2,V3,V4
<fct>,<int>,<int>,<chr>
chr1,847,947,Non-Enhancer
chr1,8952,9052,Non-Enhancer
chr1,39831,39931,Non-Enhancer
chr1,40608,40708,Non-Enhancer
chr1,43842,43942,Non-Enhancer
chr1,47908,48008,Non-Enhancer


In [1]:
## Score Matrix (Input data)
input_score_data <- merged_bw
input_score_data<- as.data.frame(input_score_data[input_score_data$chrom %in% chromosomes, ])


## Replacing NAs with zeros in the read columns. Where no score is available, the bin is treated as having
## no signal, which also keeps the downstream arithmetic simple.

input_score_data$H3K27ac[is.na(input_score_data$H3K27ac)] <- 0 
input_score_data$H3K4me3[is.na(input_score_data$H3K4me3)] <- 0 
input_score_data$H3K4me2[is.na(input_score_data$H3K4me2)] <- 0 
input_score_data$H3K4me1[is.na(input_score_data$H3K4me1)] <- 0 

## Converting data to GRanges objects
positive_class <- positive_class[-1,]
positive_class_labels <- GRanges(seqnames = positive_class$V1, ranges = IRanges(start = positive_class$V2, 
                                                                       end = positive_class$V3))
mcols(positive_class_labels) <- DataFrame(class= "enhancer")


negative_class_labels <- GRanges(seqnames = negative_class$V1, ranges = IRanges(start = negative_class$V2, 
                                                                       end = negative_class$V3))
mcols(negative_class_labels) <- DataFrame(class= "non-enhancer")


input_score <- GRanges(seqnames = input_score_data$chrom, ranges = IRanges(start = input_score_data$start,
                                                                        end = input_score_data$end))
mcols(input_score) <- DataFrame(reads_h3k27ac = input_score_data$H3K27ac, reads_h3k4me3 = input_score_data$H3K4me3,
                                reads_h3k4me2 = input_score_data$H3K4me2, reads_h3k4me1 = input_score_data$H3K4me1)

## Performing merge to figure out the score and class matrix.
## Appending negative and positive class intervals together.

positive_and_negative_class_intervals <- merge(as.data.frame(positive_class_labels), as.data.frame(negative_class_labels), all=TRUE) ## positive and negative classes



In [None]:
## Exporting 'positive_and_negative_class_intervals' and 'input_score' as bed files to merge.

write.table(positive_and_negative_class_intervals, "./data/class_labels.bed", sep ='\t', quote = FALSE, row.names = FALSE)
write.table(input_score, "./data/score.bed", sep ='\t', quote = FALSE, row.names = FALSE)

In [None]:
## Processing for syntax and merging

system("awk '{if (NR!=1) {print}}' ./data/class_labels.bed > ./data/class_labels_header_removed.bed")
system("awk '{if (NR!=1) {print}}' ./data/score.bed > ./data/score_header_removed.bed")

system("bedtools intersect -wa -wb -a ./data/score_header_removed.bed -b ./data/class_labels_header_removed.bed > ./data/score_labels.bed")

## While merging the intervals for class labels (positive and negative) and RPKM normalized scores, there are occasions where one interval of the latter may
## encompass several intervals from the former. In that case, 'bedtools intersect' reports one output row per overlapping pair.
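
To see why one score interval can appear in several output rows, the following base R sketch mimics the pairing behaviour of `bedtools intersect -wa -wb` on toy data: every overlapping (score interval, label interval) pair yields one row. The helper `overlap_join` and the toy intervals are hypothetical, and the approach (a per-chromosome cross join followed by a filter) is only suitable for small examples.

```r
# Hypothetical helper: pair up all overlapping intervals from a and b
# (0-based start, exclusive end), one output row per overlapping pair.
overlap_join <- function(a, b) {
  pairs <- merge(a, b, by = "chrom", suffixes = c("_a", "_b"))
  # two half-open intervals overlap when each starts before the other ends
  pairs[pairs$start_a < pairs$end_b & pairs$start_b < pairs$end_a, ]
}

scores_toy <- data.frame(chrom = "chr1",
                         start = c(0, 2000),
                         end   = c(2000, 4000),
                         H3K4me1 = c(0, 0.11165),
                         stringsAsFactors = FALSE)
labels_toy <- data.frame(chrom = "chr1",
                         start = c(100, 1500, 2500),
                         end   = c(300, 2200, 2600),
                         class = c("enhancer", "non-enhancer", "enhancer"),
                         stringsAsFactors = FALSE)

joined_toy <- overlap_join(scores_toy, labels_toy)
print(joined_toy)
```

Here the first score bin is paired with two label intervals, and the label interval spanning the bin boundary (1500 to 2200) is paired with both bins, so the same scores or the same label can be repeated across rows.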

The dataset we finally use for the machine learning model is the following.

In [11]:
## Importing the merged file
score_labels <- read.table("./data/score_labels.bed", sep = "\t", header = FALSE, stringsAsFactors=FALSE)

In [12]:
str(score_labels)

'data.frame':	7417279 obs. of  15 variables:
 $ V1 : chr  "chr1" "chr1" "chr1" "chr1" ...
 $ V2 : int  0 0 0 8000 8000 8000 8000 8000 8000 8000 ...
 $ V3 : int  8000 8000 8000 12000 12000 12000 12000 12000 12000 12000 ...
 $ V4 : int  8001 8001 8001 4001 4001 4001 4001 4001 4001 4001 ...
 $ V5 : chr  "*" "*" "*" "*" ...
 $ V6 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ V7 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ V8 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ V9 : num  0 0 0 0.0569 0.0569 ...
 $ V10: chr  "chr1" "chr1" "chr1" "chr1" ...
 $ V11: int  1149 4721 7284 7284 10101 10148 10235 10444 11362 11869 ...
 $ V12: int  3149 6721 8030 8030 10140 10209 10290 10580 11397 12525 ...
 $ V13: int  2001 2001 747 747 40 62 56 137 36 657 ...
 $ V14: chr  "*" "*" "*" "*" ...
 $ V15: chr  "non-enhancer" "non-enhancer" "non-enhancer" "non-enhancer" ...


We notice that our effective data is "sparse", i.e. there are few non-zero entries and a large portion of the data is occupied by zeroes. A good strategy for optimizing both memory use and data processing is the **sparse matrix** feature available from the Matrix library.

Let us first isolate the actual input data from the entire dataframe and convert it into matrix format. 

In [13]:
## Picking relevant columns
final_data <- score_labels[,c(6:9,15)]
colnames(final_data) <- c("reads_h3k27ac","reads_h3k4me3","reads_h3k4me2","reads_h3k4me1","class")
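
The conversion to a sparse matrix is not carried out in this section. A minimal sketch of how it could be done with the Matrix library is given below, using a small toy matrix with the same four feature columns; the object names and values are hypothetical, and only the numeric columns are converted since the class column is character.

```r
library(Matrix)

# Toy feature matrix: mostly zeros, with a handful of non-zero entries
dense_toy <- matrix(0, nrow = 1000, ncol = 4,
                    dimnames = list(NULL, c("reads_h3k27ac", "reads_h3k4me3",
                                            "reads_h3k4me2", "reads_h3k4me1")))
dense_toy[c(5, 6), "reads_h3k4me1"] <- 0.11165
dense_toy[10, "reads_h3k27ac"] <- 0.25

# Sparse representation: only the non-zero entries are stored
sparse_toy <- Matrix(dense_toy, sparse = TRUE)

print(class(sparse_toy))
print(nnzero(sparse_toy))            # number of non-zero entries
print(object.size(dense_toy))        # memory used by the dense matrix
print(object.size(sparse_toy))       # memory used by the sparse matrix
```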

In [None]:
## Saving this data
saveRDS(final_data,"./data/ep_data.rds")

## Sample data
final_data_sample <- final_data[sample(nrow(final_data), 10000), ]
saveRDS(final_data_sample,"./data/ep_data_sample.rds")

## Saving relevant files
save(positive_class_labels, tss_rev, p300_rev_finale, input_score, hg38_random, dhs_min, file = "relevant_files.RData")

## References

> Kim, S. G., Harwani, M., Grama, A., & Chaterji, S. (2016). EP-DNN: A Deep Neural Network-Based Global Enhancer Prediction Algorithm. Scientific Reports, 6, 38433. https://doi.org/10.1038/srep38433