## Dataset Preparation

The overall selection of data and the analysis protocol has been loosely borrowed from Kim et al. (2016). Typically for a machine learning problem, we shall have a dataset (consolidated, with class and variable/ feature definitions) and that'll be bifurcated (typically in 3:7 or 2:8 proportions) to be used as testing and training sets respectively. Contrarily, in this study both the categories have been sourced differently as you'll see.
In this exercise, we shall have positive and negative training examples for the training dataset but only positive examples for the test dataset. And that is perfectively fine.

<img src="./props/Data_Schema.jpg">

### Testing Data

For testing data, we consider data only for enhancers. Even with this data, we shall be able to evaluate the model's veracity of predicting positive examples. 

As noted, the primary step is to source the BAM files, index them, and finally tranform into BEDGRAPH files. The column naming is done appropriately. 

In [2]:
h3k27ac <- read.csv("./data/H1_Cell_Line/H3K27ac/ENCFF663SAM.bw", sep = '\t', header = FALSE)
colnames(h3k27ac) <- c("chrom","start","end","peaks")
h3k4me3 <- read.csv("./data/H1_Cell_Line/H3K4me3/ENCFF340UJK.bw", sep = '\t', header = FALSE)
colnames(h3k4me3) <- c("chrom","start","end","peaks")
h3k4me2 <- read.csv("./data/H1_Cell_Line/H3K4me2/ENCFF799BDH.bw", sep = '\t', header = FALSE)
colnames(h3k4me2) <- c("chrom","start","end","peaks")
h3k4me1 <- read.csv("./data/H1_Cell_Line/H3K4me1/ENCFF441KOL.bw", sep = '\t', header = FALSE)
colnames(h3k4me1) <- c("chrom","start","end","peaks")

These files (with extension **.bw**) are eventually imported to the online Galaxy interface and executed for the *bedtools Merge* function. The four files showcasing various histone markers for enhancer data are merged together on the basis of overlapping intervals to get a matrix of peak counts that act as features to the 'enhancer' class. 

The resultant file is eventually imported back to R for further processing.

In [2]:
## Importing the merged BEDGRAPH/BW file

merged_bw <- read.csv("./data/H1_Cell_Line/bedtools_Merge_2000.bedgraph", sep = '\t', header = FALSE)
colnames(merged_bw) <- c("chrom", "start", "end", "peaks_h3k27ac", "peaks_h3k4me3", "peaks_h3k4me2", "peaks_h3k4me1")
head(merged_bw)
merged_bw$class <- "enhancer"
head(merged_bw)


## Identifying examples of standard chromosomes only and filtering the residuals.

chromosomes <- c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14",
                 "chr15", "chr16", "chr17", "chr18", "chr19", "chr20","chr21", "chr22", "chrX", "chrY")
merged_bw<- as.data.frame(merged_bw[merged_bw$chrom %in% chromosomes, ])

chrom,start,end,peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,0,8000,0.0,0,0,0
chr1,8000,12000,0.0567974,0,0,0
chr1,12000,16000,0.0,0,0,0
chr1,16000,18000,0.0567974,0,0,0
chr1,18000,40000,0.0,0,0,0
chr1,40000,42000,0.0567974,0,0,0


chrom,start,end,peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1,class
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
chr1,0,8000,0.0,0,0,0,enhancer
chr1,8000,12000,0.0567974,0,0,0,enhancer
chr1,12000,16000,0.0,0,0,0,enhancer
chr1,16000,18000,0.0567974,0,0,0,enhancer
chr1,18000,40000,0.0,0,0,0,enhancer
chr1,40000,42000,0.0567974,0,0,0,enhancer


Since we are soleley concerned with the score values, we shall hack off the information on intervals and chromosome names. The resultant dataset shall have class labels (enhancers) and scores spanning from columns 4 to 8.

In [4]:
## Deriving test data as input to the deep learning model.

test <- merged_bw[,c(4:8)]

Let's check out the dataset. 

In [5]:
head(test)

peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1,class
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0,0,0,0,enhancer
0.0567974,0,0,0,enhancer
0.0,0,0,0,enhancer
0.0567974,0,0,0,enhancer
0.0,0,0,0,enhancer
0.0567974,0,0,0,enhancer


The deep learning model has two basic requisites with the input data.
1. The data has to be *numeric* in type.
2. It has to range from 0 to 1. So if it isn't already, some sort of normalization procedure can help do that. The preferred one is the min-max normalization.

In [6]:
# Transforming to numeric(double).

test$peaks_h3k27ac <- as.double(as.character(test$peaks_h3k27ac))
test$peaks_h3k4me3 <- as.double(as.character(test$peaks_h3k4me3))
test$peaks_h3k4me2 <- as.double(as.character(test$peaks_h3k4me2))
test$peaks_h3k4me1 <- as.double(as.character(test$peaks_h3k4me1))


In [7]:
## For fair results, let's make sure that no "NA" values exist.
na.omit(test)

## Min-Max normalization
test$peaks_h3k27ac <- (test$peaks_h3k27ac-min(test$peaks_h3k27ac, na.rm = T))/(max(test$peaks_h3k27ac, na.rm = T)-min(test$peaks_h3k27ac, na.rm = T))
test$peaks_h3k4me3 <- (test$peaks_h3k4me3-min(test$peaks_h3k4me3, na.rm = T))/(max(test$peaks_h3k4me3, na.rm = T)-min(test$peaks_h3k4me3, na.rm = T))
test$peaks_h3k4me2 <- (test$peaks_h3k4me2-min(test$peaks_h3k4me2, na.rm = T))/(max(test$peaks_h3k4me2, na.rm = T)-min(test$peaks_h3k4me2, na.rm = T))
test$peaks_h3k4me1 <- (test$peaks_h3k4me1-min(test$peaks_h3k4me1, na.rm = T))/(max(test$peaks_h3k4me1, na.rm = T)-min(test$peaks_h3k4me1, na.rm = T))


peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1,class
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0432946,0.0000000,0.0000000,enhancer


Importing relevant data that has been preprocessed explicitly. This is for the positive class labels.

BED files for p300 and DHS are (in that order): <br>
>1. pruned for standard chromosome entries <br>
>2. sorted <br>
>3. individually, bedtools "merged" for overlapping intervals <br>
>4. finally, the intersecting records are noted with the following expression, <br>
> $ intersectBed -a h1_p300_merged.bed -b h1_dhs_merged.bed > h1_p300_dhs_intersect.bed 

In [7]:
## Importing relevant data ##
## Positive Class Labels ##

dhs <- read.csv("./data/H1_Cell_Line/GSM878621_H1_DNase_sorted.bed", sep = '\t', header = FALSE)
p300 <- read.csv("./data/H1_Cell_Line/GSM831036_H1_P300_sorted.bed", sep = '\t', header = FALSE)

## We see that the chromosome names in p300 data are just numbers. Let's add "chr" in the beginning for consistency. 
p300$V1 <- paste0("chr",p300$V1)

## Selecting useful columns: chrom, start, end.
p300 <- p300[,c(1,2,3)]
dhs <- dhs[,c(1,2,3)]

## Sifting standard chromosomes.
colnames(dhs) <- c("chrom","start","end")
colnames(p300) <- c("chrom","start","end")

## Valid chromosomes.
dhs <- as.data.frame(dhs[dhs$chrom %in% chromosomes, ])
p300 <- as.data.frame(p300[p300$chrom %in% chromosomes, ])

## Saving files

write.table(p300,"./data/H1_Cell_Line/h1_p300.bed", sep="\t", row.names=FALSE, quote = FALSE)
write.table(dhs,"./data/H1_Cell_Line/h1_dhs.bed", sep="\t", row.names=FALSE, quote = FALSE)

## Add class to the data: "enhancer"

h1_p300_dhs_intersect <- read.table("./data/H1_Cell_Line/h1_p300_dhs_intersect.bed", sep = "\t")
h1_p300_dhs_intersect$V4 <- "Enhancer"
write.table(h1_p300_dhs_intersect,"./data/H1_Cell_Line/h1_p300_dhs_intersect_class.bed",sep="\t",row.names=FALSE, quote = FALSE)


Now, for the negative class labels, i.e. non-enhancers.

#### Transcription Start Sites  (TSS)

According to Wikipedia, an enhancer is a short (50-1500 bp) region of the DNA that can be bound by proteins. They can be located quite far from the promoter sequences of the genes that house the TSS. The transcription start sites' indices (start and end positions) are 'constant' throughout the genome. The gene positioning is the same across, rather the discrepenacy in distinct cell types is with the set of genes that get regulated. One source of downloading the TSS data is 'Ensembl Biomart'. <br>
> Step 1: Choose 'Human genes' under 'Dataset' tab on the left pane. <br> <br>
> Step 2: Under 'Attributes', select <br>
    (i) Chromosome/ scaffold name <br>
    (ii) Transcript start (bp) <br>
    (iii) Transcript end (bp) <br> <br>
> Step 3: Click on 'Results' button and download appropriately. <br>  
The other sources are refTSS and DBTSS databases. <br>

In [13]:
## Negative Class Labels ##
# TSS sites #

tss_sites <- read.table("./data/H1_Cell_Line/TSS_Indices_Human_Genome.txt", sep = "\t", header = TRUE)
tss_sites$Chromosome.scaffold.name <- paste0("chr", tss_sites$Chromosome.scaffold.name)
tss_sites <- as.data.frame(tss_sites[tss_sites$Chromosome.scaffold.name  %in% chromosomes, ])
tss_sites <- tss_sites[order(tss_sites$Chromosome.scaffold.name),]

## Export TSS sites to create an overlap with DHS sites

write.table(tss_sites,"./data/H1_Cell_Line/tss_sites.bed", sep="\t", row.names=FALSE, quote = FALSE)

#### Random Sites in the Human Genome

In [9]:
## Sourcing chromosomal lengths in the human genome for generating random tracks

hg19_chrom_sizes <- read.table(url("https://genome.ucsc.edu/goldenPath/help/hg19.chrom.sizes"), sep = "\t", 
                               header = FALSE, col.names = c("chrom", "size"))
hg19_chrom_sizes <- as.data.frame(hg19_chrom_sizes[hg19_chrom_sizes$chrom %in% chromosomes, ])

## Saving file for generating random tracks via 'bedtools random' function ##
write.table(hg19_chrom_sizes,"./data/H1_Cell_Line/hg19.genome", sep="\t", row.names=FALSE, quote = FALSE)

The option -l allows user to specifiy the interval size of the tracks that are to be randomly generated from the genome file. However this is optional, but in our case we have to stay in sync with the interval profiles tha twe selected for combining the score matrix for histine marks. <br>
> $ bedtools random -l 2000 -g hg19.genome > hg19_random_tracks.bed

In [5]:
## Recalling

hg19_random_tracks <- read.table("./data/H1_Cell_Line/hg19_random_tracks.bed", sep = "\t", header = FALSE)
hg19_random_tracks <- hg19_random_tracks[,c(1,2,3)]

# ordering on the basis of chromosome name and start indices.
hg19_random_tracks_ordered <- hg19_random_tracks[order(hg19_random_tracks[,1],hg19_random_tracks[,2]),]

write.table(hg19_random_tracks_ordered,"./data/H1_Cell_Line/hg19_random_tracks_sorted_required.bed", sep="\t", 
            row.names=FALSE, quote = FALSE)

> $ mergeBed -i hg_random_tracks_sorted_required.bed > hg_random_tracks_sorted_required_merged.bed

> $ awk '{if (NR!=1) {print}}' hg_random_tracks_sorted_required_merged.bed > hg_random_tracks_sorted_required_merged_header_removed.bed

In [4]:
## Choosing random sites distal to TSS

## The strategy is to combine random tracks distal to the TSS or p300 binding sites. Now, let us create a 
## combination of TSS and p300 binding sites and then subtract the random sites from these, thus giving us
## the residuals.

tss_final <- read.table("./data/H1_Cell_Line/tss_sites_header_removed.bed", sep = "\t", header = FALSE)
p300_final <- read.table("./data/H1_Cell_Line/h1_p300_merged.bed", sep = "\t", header = FALSE)

## Combine the intervals and not 'merge' them
tss_and_p300 <- rbind(tss_final,p300_final)

## sort on the basis of chromosome names.
tss_and_p300 <- tss_or_p300[order(tss_or_p300[,1]),]

## Output file.
write.table(tss_and_p300,"./data/H1_Cell_Line/tss_and_p300.bed", sep="\t", row.names=FALSE, quote = FALSE)

Again as before, we remove the header and and sort the file "tss_and_p300.bed". The resulting file is named "tss_and_p300_header_removed_sorted.bed". <br> Additionally, the resultant file after intersecting the TSS and DHS sites from H1 cell line is "tss_h1_dhs_intersect_sorted_merged.bed". <br>
**The filenames are so chosen to reflect the order and type of manipulations that have been applied.**

Now to consolidate the negative class data, we have to : <br>
> 1. Find the random sites that are distal to known p300 and TSS regions. <br>
> 2. Find the intersection of TSS and DHS sites. <br>
> 3. Finally, club both of these together to represent a comprehensive non-enhancer region space.

All the regions in the random sites of the human genome but not coinciding with the p300 and tss sites. <br>
> $ intersectBed -v -a hg19_random_tracks_sorted_required_header_removed.bed -b tss_and_p300_header_removed_sorted.bed > true_random_to_p300_and_TSS.bed

> $ cat true_random_to_p300_and_TSS.bed tss_h1_dhs_intersect_sorted_merged.bed > negative_class.bed

In [20]:
## Import resultant files from intersection.
negative_class <- read.table("./data/H1_Cell_Line/negative_class.bed", sep = "\t", header = FALSE)
negative_class$V4 <- "Non-Enhancer"
write.table(negative_class,"./data/H1_Cell_Line/negative_class.bed", sep="\t", row.names=FALSE, quote = FALSE)

positive_class <- read.table("./data/H1_Cell_Line/h1_p300_dhs_intersect_class_header_removed.bed", sep = "\t", 
                             header = FALSE)

In [22]:
## Merging Data on the basis of overlapping intervals

## Positive class data (labels)
head(positive_class)

V1,V2,V3,V4
<fct>,<int>,<int>,<fct>
chr1,10101,10140,enhancer
chr1,10148,10209,enhancer
chr1,10235,10290,enhancer
chr1,10444,10580,enhancer
chr1,11362,11397,enhancer
chr1,12302,12333,enhancer


In [23]:
## Negative class (Labels)
head(negative_class)


V1,V2,V3,V4
<fct>,<int>,<int>,<chr>
chr1,847,947,Non-Enhancer
chr1,8952,9052,Non-Enhancer
chr1,39831,39931,Non-Enhancer
chr1,40608,40708,Non-Enhancer
chr1,43842,43942,Non-Enhancer
chr1,47908,48008,Non-Enhancer


In [24]:
## Score Matrix (Input data)
input_score_data <- read.table("./data/H1_Cell_Line/bedtools_Merge_2000.bedgraph", sep = "\t", header = FALSE)
input_score_data<- as.data.frame(input_score_data[input_score_data$V1 %in% chromosomes, ])


## Converting data to GRanges objects
library(GenomicRanges)
positive_class_labels <- GRanges(seqnames = positive_class$V1, ranges = IRanges(start = positive_class$V2, 
                                                                       end = positive_class$V3))
mcols(positive_class_labels) <- DataFrame(class= "Enhancer")


negative_class_labels <- GRanges(seqnames = negative_class$V1, ranges = IRanges(start = negative_class$V2, 
                                                                       end = negative_class$V3))
mcols(negative_class_labels) <- DataFrame(class= "Non-Enhancer")


input_score <- GRanges(seqnames = input_score_data$V1, ranges = IRanges(start = input_score_data$V2,
                                                                        end = input_score_data$V3))
mcols(input_score) <- DataFrame(peaks_h3k27ac = input_score_data$V4, peaks_h3k4me3 = input_score_data$V5,
                                peaks_h3k4me2 = input_score_data$V6, peaks_h3k4me1 = input_score_data$V7)

## Performing merge to figure out the score and class matrix.

intermatrix1 <- merge(as.data.frame(positive_class_labels), as.data.frame(negative_class_labels), 
                      all= TRUE) ## positive and negative classes ##
intermatrix2 <- merge(as.data.frame(intermatrix1), as.data.frame(input_score), 
                      all= TRUE) ## scores and classes ##

In [25]:
str(intermatrix2)

## Let's make a duplicate copy of the final results dataframe for further processing. 
## We must preserve a copy for backup.
final_data <- intermatrix2

## Replacing NAs with 0s(zeros) in the peaks' columns. Since the score data is not available, 
## imputing empty cells with zero entries engenders mathematical convenience.

final_data$peaks_h3k27ac[is.na(final_data$peaks_h3k27ac)] <- 0
final_data$peaks_h3k4me3[is.na(final_data$peaks_h3k4me3)] <- 0
final_data$peaks_h3k4me2[is.na(final_data$peaks_h3k4me2)] <- 0
final_data$peaks_h3k4me1[is.na(final_data$peaks_h3k4me1)] <- 0

## Sorting on the basis of first two columns, viz. seqnames, start.
final_data <- final_data[order(final_data$seqnames, final_data$start),]


'data.frame':	17327990 obs. of  10 variables:
 $ seqnames     : Factor w/ 146 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ start        : int  0 847 8000 8952 10101 10148 10235 10444 11362 12000 ...
 $ end          : int  8000 947 12000 9052 10140 10209 10290 10580 11397 16000 ...
 $ width        : int  8001 101 4001 101 40 62 56 137 36 4001 ...
 $ strand       : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
 $ class        : chr  NA "Non-Enhancer" NA "Non-Enhancer" ...
 $ peaks_h3k27ac: num  0 NA 0.0568 NA NA ...
 $ peaks_h3k4me3: num  0 NA 0 NA NA NA NA NA NA 0 ...
 $ peaks_h3k4me2: num  0 NA 0 NA NA NA NA NA NA 0 ...
 $ peaks_h3k4me1: num  0 NA 0 NA NA NA NA NA NA 0 ...


The dataset we finally implement for the machine learning model is the following.

In [26]:
head(final_data)

seqnames,start,end,width,strand,class,peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1
<fct>,<int>,<int>,<int>,<fct>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,0,8000,8001,*,,0.0,0,0,0
chr1,847,947,101,*,Non-Enhancer,0.0,0,0,0
chr1,8000,12000,4001,*,,0.0567974,0,0,0
chr1,8952,9052,101,*,Non-Enhancer,0.0,0,0,0
chr1,10101,10140,40,*,Enhancer,0.0,0,0,0
chr1,10148,10209,62,*,Enhancer,0.0,0,0,0


In [27]:
str(final_data)

'data.frame':	17327990 obs. of  10 variables:
 $ seqnames     : Factor w/ 146 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ start        : int  0 847 8000 8952 10101 10148 10235 10444 11362 12000 ...
 $ end          : int  8000 947 12000 9052 10140 10209 10290 10580 11397 16000 ...
 $ width        : int  8001 101 4001 101 40 62 56 137 36 4001 ...
 $ strand       : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
 $ class        : chr  NA "Non-Enhancer" NA "Non-Enhancer" ...
 $ peaks_h3k27ac: num  0 0 0.0568 0 0 ...
 $ peaks_h3k4me3: num  0 0 0 0 0 0 0 0 0 0 ...
 $ peaks_h3k4me2: num  0 0 0 0 0 0 0 0 0 0 ...
 $ peaks_h3k4me1: num  0 0 0 0 0 0 0 0 0 0 ...


Let us write this data to the disk for use with other applications too.

In [28]:
table(final_data$class)


    Enhancer Non-Enhancer 
     6504779      9422552 

In [29]:
saveRDS(final_data,"./data/ep_data.rds")

## References

> Kim, S. G., Harwani, M., Grama, A., & Chaterji, S. (2016). EP-DNN : A Deep Neural Network- Based Global Enhancer Prediction Algorithm. Nature Publishing Group, (November), 1–13. https://doi.org/10.1038/srep38433