## Dataset Preparation

<p>The overall selection of data and the analysis protocol are loosely borrowed from Kim et al. (2016). In a typical machine learning problem, a single consolidated dataset (with class labels and variable/feature definitions) is split, usually in a 3:7 or 2:8 ratio, into testing and training sets respectively. In this study, by contrast, the two sets have been sourced differently, as you'll see.
The training dataset contains both positive and negative examples, whereas the test dataset contains only positive examples. That is perfectly fine for this exercise.</p>
<p>The H1 cell line data for H3K27ac, H3K4me1, H3K4me2, and H3K4me3 have been sourced from the [ENCODE project](https://www.encodeproject.org/) featured by [Ren Lab](http://renlab.sdsc.edu/renlab_website/bing/). The version of the genome considered is **GRCh38**.</p>
<p>The data is downloaded in *bam* format and needs to be converted to bed/bw files for further processing. The four resulting files, one per histone mark, are then merged with bedtools on the basis of overlapping intervals to get a matrix of coverage values that act as features for the 'enhancer' class.</p>

In [None]:
# The individual bam files are indexed.

system("samtools index -b ./H1_GrCh38/H3K27ac/ENCFF663SAM.bam")
system("samtools index -b ./H1_GrCh38/H3K4me1/ENCFF441KOL.bam")
system("samtools index -b ./H1_GrCh38/H3K4me2/ENCFF799BDH.bam")
system("samtools index -b ./H1_GrCh38/H3K4me3/ENCFF340UJK.bam")

After indexing the BAM files, we bin them into intervals of 2000 bp (since that is the aggregate size of the enhancers); for consistency, the non-enhancer segments are of the same size as well. We achieve this with [bamCoverage](https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html), a function from the **deeptools** suite.

In [None]:
system("bamCoverage --bam ENCFF663SAM.bam -o ENCFF663SAM2000_30.bw --binSize 2000 --normalizeUsing RPKM --effectiveGenomeSize 2913022398 --outFileFormat bedgraph --maxFragmentLength 30")
system("bamCoverage --bam ENCFF441KOL.bam -o ENCFF441KOL2000_36.bw --binSize 2000 --normalizeUsing RPKM --effectiveGenomeSize 2913022398 --outFileFormat bedgraph --maxFragmentLength 36")
system("bamCoverage --bam ENCFF799BDH.bam -o ENCFF799BDH2000_36.bw --binSize 2000 --normalizeUsing RPKM --effectiveGenomeSize 2913022398 --outFileFormat bedgraph --maxFragmentLength 36")
system("bamCoverage --bam ENCFF340UJK.bam -o ENCFF340UJK2000_36.bw --binSize 2000 --normalizeUsing RPKM --effectiveGenomeSize 2913022398 --outFileFormat bedgraph --maxFragmentLength 36")
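The four calls above differ only in the accession and the fragment length, so they can be generated from a small lookup vector. The following is a minimal sketch under that assumption; `build_bamcoverage_cmd` and `fragment_lengths` are hypothetical names introduced here, and the flags are copied verbatim from the calls above. Nothing is executed until a command is passed to `system()`.

```r
# Hypothetical helper: assemble one bamCoverage command string.
build_bamcoverage_cmd <- function(accession, max_fragment, bin_size = 2000) {
  paste0(
    "bamCoverage --bam ", accession, ".bam",
    " -o ", accession, bin_size, "_", max_fragment, ".bw",
    " --binSize ", bin_size,
    " --normalizeUsing RPKM --effectiveGenomeSize 2913022398",
    " --outFileFormat bedgraph --maxFragmentLength ", max_fragment
  )
}

# Accession -> maxFragmentLength, copied from the calls above.
fragment_lengths <- c(ENCFF663SAM = 30, ENCFF441KOL = 36,
                      ENCFF799BDH = 36, ENCFF340UJK = 36)

cmds <- mapply(build_bamcoverage_cmd, names(fragment_lengths), fragment_lengths)
print(unname(cmds))

# To actually run them (requires deepTools on the PATH):
# for (cmd in cmds) system(cmd)
```

Keeping the accession-to-fragment-length mapping in one place makes it harder for a file name and its parameters to drift apart.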

The resulting files hold the genome-wide coverage of the respective histone marks in the given cell line (H1).

<img src="./props/Data_Schema.jpg">

### Testing Data

For testing data, we consider data only for enhancers. Even with this data, we can evaluate how well the model recovers positive examples.

As noted, the first step is to source the BAM files, index them, and finally transform them into BEDGRAPH files. The columns are then named appropriately.

In [2]:
h3k27ac <- read.csv("./data/H1_Cell_Line/H3K27ac/ENCFF663SAM2000_30.bw", sep = '\t', header = FALSE)
colnames(h3k27ac) <- c("chrom","start","end","reads")   
h3k4me3 <- read.csv("./data/H1_Cell_Line/H3K4me3/ENCFF340UJK2000_36.bw", sep = '\t', header = FALSE)
colnames(h3k4me3) <- c("chrom","start","end","reads")
h3k4me2 <- read.csv("./data/H1_Cell_Line/H3K4me2/ENCFF799BDH2000_36.bw", sep = '\t', header = FALSE)
colnames(h3k4me2) <- c("chrom","start","end","reads")
h3k4me1 <- read.csv("./data/H1_Cell_Line/H3K4me1/ENCFF441KOL2000_36.bw", sep = '\t', header = FALSE)
colnames(h3k4me1) <- c("chrom","start","end","reads")

These files (with extension **.bw**) are then merged with *bedtools*, either through the online Galaxy interface or locally as in the next cell, which calls `bedtools unionbedg`. The four files, one per histone mark, are combined on the basis of overlapping intervals to get a matrix of coverage values that act as features for the 'enhancer' class.

In [8]:
## system() returns only the exit status, so the bedtools output is redirected to a file that is read back in the next cell.
system("bedtools unionbedg -i ./data/H1_Cell_Line/H3K27ac/ENCFF663SAM2000_30.bw ./data/H1_Cell_Line/H3K4me3/ENCFF340UJK2000_36.bw ./data/H1_Cell_Line/H3K4me2/ENCFF799BDH2000_36.bw ./data/H1_Cell_Line/H3K4me1/ENCFF441KOL2000_36.bw -header -names H3K27ac H3K4me3 H3K4me2 H3K4me1 > bedtools_unionbedg_all.bw")
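To make the idea of merging on overlapping intervals concrete, here is a toy sketch in base R for a single chromosome. It is only an approximation of what `bedtools unionbedg` does, not its implementation: `union_bedgraph` is a hypothetical function, the column names `start`, `end`, and `value` are assumptions, and sub-intervals not covered by a track are simply filled with 0.

```r
# Toy illustration of combining bedGraph tracks for ONE chromosome.
# `tracks` is a named list of data frames with columns start, end, value.
union_bedgraph <- function(tracks) {
  # Every start/end position is a potential breakpoint of the merged table.
  breaks <- sort(unique(unlist(lapply(tracks, function(trk) c(trk$start, trk$end)))))
  out <- data.frame(start = head(breaks, -1), end = tail(breaks, -1))
  for (nm in names(tracks)) {
    trk <- tracks[[nm]]
    out[[nm]] <- vapply(seq_len(nrow(out)), function(i) {
      # Look for the track interval that fully contains this sub-interval.
      hit <- which(trk$start <= out$start[i] & trk$end >= out$end[i])
      if (length(hit) == 0) 0 else trk$value[hit[1]]
    }, numeric(1))
  }
  out
}

toy_tracks <- list(
  H3K27ac = data.frame(start = c(0, 4000), end = c(4000, 6000), value = c(1, 2)),
  H3K4me1 = data.frame(start = c(0, 2000), end = c(2000, 6000), value = c(5, 0))
)
merged_toy <- union_bedgraph(toy_tracks)
print(merged_toy)
```

Each row of the result is a sub-interval on which every track has a single value, which is why the merged table has one column per histone mark.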

In [17]:
## Importing the merged BEDGRAPH/BW file

merged_bw <- read.csv("bedtools_unionbedg_all.bw", sep = '\t', header = TRUE)
head(merged_bw)
merged_bw$Class <- "enhancer"
head(merged_bw)


## Identifying examples of standard chromosomes only and filtering the residuals.

chromosomes <- c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14",
                 "chr15", "chr16", "chr17", "chr18", "chr19", "chr20","chr21", "chr22", "chrX", "chrY")
merged_bw<- as.data.frame(merged_bw[merged_bw$chrom %in% chromosomes, ])

chrom,start,end,H3K27ac,H3K4me3,H3K4me2,H3K4me1
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,0,8000,0,0,0,0.0
chr1,8000,12000,0,0,0,0.0568576
chr1,12000,16000,0,0,0,0.0
chr1,16000,18000,0,0,0,0.0568576
chr1,18000,40000,0,0,0,0.0
chr1,40000,42000,0,0,0,0.0568576


chrom,start,end,H3K27ac,H3K4me3,H3K4me2,H3K4me1,Class
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
chr1,0,8000,0,0,0,0.0,enhancer
chr1,8000,12000,0,0,0,0.0568576,enhancer
chr1,12000,16000,0,0,0,0.0,enhancer
chr1,16000,18000,0,0,0,0.0568576,enhancer
chr1,18000,40000,0,0,0,0.0,enhancer
chr1,40000,42000,0,0,0,0.0568576,enhancer


Since we are solely concerned with the score values, we drop the chromosome names and interval coordinates. The resulting dataset holds the coverage scores in columns 4 to 7 and the class label (enhancer) in column 8.

In [18]:
## Deriving test data as input to the deep learning model.

test <- merged_bw[,c(4:8)]

Let's check out the dataset. 

In [19]:
head(test)

H3K27ac,H3K4me3,H3K4me2,H3K4me1,Class
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0,0,0,0.0,enhancer
0,0,0,0.0568576,enhancer
0,0,0,0.0,enhancer
0,0,0,0.0568576,enhancer
0,0,0,0.0,enhancer
0,0,0,0.0568576,enhancer


The deep learning model has two basic requirements for the input data.
1. The data has to be *numeric* in type.
2. It has to range from 0 to 1. If it does not already, a normalization procedure can bring it into that range. The preferred one is min-max normalization.

In [21]:
# We have already scanned the dataset for existence of NA values and there exist none. 

## Min-Max normalization
test$H3K27ac <- (test$H3K27ac-min(test$H3K27ac, na.rm = T))/(max(test$H3K27ac, na.rm = T)-min(test$H3K27ac, na.rm = T))
test$H3K4me3 <- (test$H3K4me3-min(test$H3K4me3, na.rm = T))/(max(test$H3K4me3, na.rm = T)-min(test$H3K4me3, na.rm = T))
test$H3K4me2 <- (test$H3K4me2-min(test$H3K4me2, na.rm = T))/(max(test$H3K4me2, na.rm = T)-min(test$H3K4me2, na.rm = T))
test$H3K4me1 <- (test$H3K4me1-min(test$H3K4me1, na.rm = T))/(max(test$H3K4me1, na.rm = T)-min(test$H3K4me1, na.rm = T))
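The same normalization can be written once as a function and applied to every score column. This is a minimal sketch on made-up data; `min_max`, `toy`, and `marks` are hypothetical names, and the guard for constant columns is an addition that the cell above does not have.

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1].
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would give 0/0; return zeros instead (an assumption).
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up scores, only to show the mechanics.
toy <- data.frame(H3K27ac = c(0, 2, 4), H3K4me1 = c(1, 1, 1), Class = "enhancer")
marks <- c("H3K27ac", "H3K4me1")
toy[marks] <- lapply(toy[marks], min_max)
print(toy)
```

Applying the function through `lapply` leaves the `Class` column untouched and avoids repeating the formula for each histone mark.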


In [22]:
head(test)

H3K27ac,H3K4me3,H3K4me2,H3K4me1,Class
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0,0,0,0.0,enhancer
0,0,0,0.003164557,enhancer
0,0,0,0.0,enhancer
0,0,0,0.003164557,enhancer
0,0,0,0.0,enhancer
0,0,0,0.003164557,enhancer


### Positive Class

We now import the data for the positive class labels, which has been preprocessed beforehand.

BED files for p300 and DHS are (in that order): <br>
>1. pruned for standard chromosome entries <br>
>2. sorted <br>
>3. individually, bedtools "merged" for overlapping intervals <br>

In [23]:
## Importing relevant data ##
## Positive Class Labels ##

dhs <- read.csv("./data/H1_Cell_Line/GSM878621_H1_DNase_sorted.bed", sep = '\t', header = FALSE)
p300 <- read.csv("./data/H1_Cell_Line/GSM831036_H1_P300_sorted.bed", sep = '\t', header = FALSE)

## We see that the chromosome names in p300 data are just numbers. Let's add "chr" in the beginning for consistency. 
p300$V1 <- paste0("chr",p300$V1)

## Selecting useful columns: chrom, start, end.
p300 <- p300[,c(1,2,3)]
dhs <- dhs[,c(1,2,3)]

## Naming the columns.
colnames(dhs) <- c("chrom","start","end")
colnames(p300) <- c("chrom","start","end")

## Keeping standard chromosomes only.
dhs <- as.data.frame(dhs[dhs$chrom %in% chromosomes, ])
p300 <- as.data.frame(p300[p300$chrom %in% chromosomes, ])

## Saving files

write.table(p300,"./data/H1_Cell_Line/h1_p300.bed", sep="\t", row.names=FALSE, quote = FALSE)
write.table(dhs,"./data/H1_Cell_Line/h1_dhs.bed", sep="\t", row.names=FALSE, quote = FALSE)


## Saving as a final intersection of p300 and DHS sites
system("intersectBed -a h1_p300_merged.bed -b h1_dhs_merged.bed > h1_p300_dhs_intersect.bed")


## Add class to the data: "enhancer"

h1_p300_dhs_intersect <- read.table("./data/H1_Cell_Line/h1_p300_dhs_intersect.bed", 
                                    sep = "\t")
h1_p300_dhs_intersect$V4 <- "Enhancer"
write.table(h1_p300_dhs_intersect,
            "./data/H1_Cell_Line/h1_p300_dhs_intersect_class.bed", 
            sep="\t",
            row.names=FALSE, 
            quote = FALSE)


Now, for the negative class labels, i.e. non-enhancers.

#### Transcription Start Sites  (TSS)

According to Wikipedia, an enhancer is a short (50-1500 bp) region of DNA that can be bound by proteins. Enhancers can be located quite far from the promoter sequences of the genes that house the TSS. The indices (start and end positions) of the transcription start sites are 'constant' throughout the genome: gene positions are the same across cell types, and what differs between cell types is the set of genes that get regulated. One source for downloading the TSS data is 'Ensembl Biomart'. <br>
> Step 1: Choose 'Human genes' under 'Dataset' tab on the left pane. <br> <br>
> Step 2: Under 'Attributes', select <br>
    (i) Chromosome/ scaffold name <br>
    (ii) Transcript start (bp) <br>
    (iii) Transcript end (bp) <br> <br>
> Step 3: Click on 'Results' button and download appropriately. <br>  
The other sources are refTSS and DBTSS databases. <br>

### Negative Class Labels

In [13]:
# TSS sites

tss_sites <- read.table("./data/H1_Cell_Line/TSS_Indices_Human_Genome.txt", sep = "\t", header = TRUE)
tss_sites$Chromosome.scaffold.name <- paste0("chr", tss_sites$Chromosome.scaffold.name)
tss_sites <- as.data.frame(tss_sites[tss_sites$Chromosome.scaffold.name  %in% chromosomes, ])
tss_sites <- tss_sites[order(tss_sites$Chromosome.scaffold.name),]

## Export TSS sites to create an overlap with DHS sites

write.table(tss_sites,"./data/H1_Cell_Line/tss_sites.bed", sep="\t", row.names=FALSE, quote = FALSE)

#### Random Sites in the Human Genome

In [9]:
## Sourcing chromosomal lengths in the human genome for generating random tracks

hg19_chrom_sizes <- read.table(url("https://genome.ucsc.edu/goldenPath/help/hg19.chrom.sizes"), sep = "\t", 
                               header = FALSE, col.names = c("chrom", "size"))
hg19_chrom_sizes <- as.data.frame(hg19_chrom_sizes[hg19_chrom_sizes$chrom %in% chromosomes, ])

## Saving file for generating random tracks via 'bedtools random' function ##
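## A bedtools genome file has two columns (chrom, size), so no header row is written.
write.table(hg19_chrom_sizes,"./data/H1_Cell_Line/hg19.genome", sep="\t", row.names=FALSE, col.names=FALSE, quote = FALSE)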

The option -l lets the user specify the interval size of the tracks that are randomly generated from the genome file. The option is not mandatory in general, but in our case the size has to stay in sync with the interval size that we selected when building the score matrix for the histone marks. <br>

In [None]:
system("bedtools random -l 2000 -g hg19.genome > hg19_random_tracks.bed")
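For intuition, fixed-length random intervals can also be drawn directly in R. The sketch below is not how `bedtools random` is implemented; `random_intervals` is a hypothetical function, the chromosome sizes are made up, and sampling chromosomes in proportion to their usable length is an assumption.

```r
set.seed(1)  # reproducible toy example

# Draw n random intervals of fixed length that fit inside each chromosome.
random_intervals <- function(chrom_sizes, n, len = 2000) {
  usable <- chrom_sizes$size - len          # largest valid start per chromosome
  stopifnot(all(usable > 0))
  # Assumption: longer chromosomes are sampled proportionally more often.
  idx <- sample(nrow(chrom_sizes), n, replace = TRUE, prob = usable)
  start <- floor(runif(n) * (usable[idx] + 1))
  data.frame(chrom = chrom_sizes$chrom[idx], start = start, end = start + len)
}

toy_sizes <- data.frame(chrom = c("chrA", "chrB"), size = c(10000, 5000))
toy_random <- random_intervals(toy_sizes, n = 20)
print(head(toy_random))
```

Every interval has exactly the requested length, which is the property the `-l 2000` option provides for the real tracks.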

In [5]:
## Recalling

hg19_random_tracks <- read.table("./data/H1_Cell_Line/hg19_random_tracks.bed", sep = "\t", header = FALSE)
hg19_random_tracks <- hg19_random_tracks[,c(1,2,3)]

# ordering on the basis of chromosome name and start indices.
hg19_random_tracks_ordered <- hg19_random_tracks[order(hg19_random_tracks[,1],hg19_random_tracks[,2]),]

write.table(hg19_random_tracks_ordered,"./data/H1_Cell_Line/hg19_random_tracks_sorted_required.bed", sep="\t", 
            row.names=FALSE, quote = FALSE)

In [None]:
system("mergeBed -i hg19_random_tracks_sorted_required.bed > hg19_random_tracks_sorted_required_merged.bed")

In [None]:
system("awk '{if (NR!=1) {print}}' hg19_random_tracks_sorted_required_merged.bed > hg19_random_tracks_sorted_required_merged_header_removed.bed")

In [4]:
## Choosing random sites distal to TSS

## The strategy is to keep only the random tracks that are distal to TSS and p300 binding sites. 
## To do so, we first combine the TSS and p300 binding sites and later subtract this combination 
## from the random sites, which leaves us with the residuals.

tss_final <- read.table("./data/H1_Cell_Line/tss_sites_header_removed.bed", sep = "\t", header = FALSE)
p300_final <- read.table("./data/H1_Cell_Line/h1_p300_merged.bed", sep = "\t", header = FALSE)

## Combine the intervals and not 'merge' them
tss_and_p300 <- rbind(tss_final,p300_final)

## Sort on the basis of chromosome names.
tss_and_p300 <- tss_and_p300[order(tss_and_p300[,1]),]

## Output file.
write.table(tss_and_p300,"./data/H1_Cell_Line/tss_and_p300.bed", sep="\t", row.names=FALSE, quote = FALSE)

As before, we remove the header and sort the file "tss_and_p300.bed". The resulting file is named "tss_and_p300_header_removed_sorted.bed". <br> Additionally, the file obtained by intersecting the TSS and DHS sites from the H1 cell line is "tss_h1_dhs_intersect_sorted_merged.bed". <br>
**The filenames are so chosen to reflect the order and type of manipulations that have been applied.**

Now, to consolidate the negative class data, we have to: <br>
> 1. Find the random sites that are distal to known p300 and TSS regions. <br>
> 2. Find the intersection of TSS and DHS sites. <br>
> 3. Finally, club both of these together to represent a comprehensive non-enhancer region space.

In [None]:
## All the regions in the random sites of the human genome but not coinciding with the p300 and tss sites.
system("intersectBed -v -a hg19_random_tracks_sorted_required_header_removed.bed -b tss_and_p300_header_removed_sorted.bed > true_random_to_p300_and_TSS.bed")

In [None]:
system("cat true_random_to_p300_and_TSS.bed tss_h1_dhs_intersect_sorted_merged.bed > negative_class.bed")
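The two commands above can be mimicked on toy data in base R: drop every random site that overlaps an excluded region (the effect of `intersectBed -v`), then append the TSS/DHS intervals (the effect of `cat`). This is a sketch on made-up coordinates; `overlaps_any` and all data frames are hypothetical, and intervals are treated as half-open as in the BED format.

```r
# TRUE for each query interval that overlaps at least one subject interval.
overlaps_any <- function(query, subject) {
  vapply(seq_len(nrow(query)), function(i) {
    any(subject$chrom == query$chrom[i] &
        subject$start < query$end[i] &
        subject$end > query$start[i])
  }, logical(1))
}

random_sites <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                           start = c(0, 5000, 1000),
                           end   = c(2000, 7000, 3000))
tss_and_p300_toy <- data.frame(chrom = "chr1", start = 1500, end = 2500)
tss_dhs_toy <- data.frame(chrom = "chr1", start = 9000, end = 9100)

# Step 1: random sites distal to TSS and p300 regions.
distal <- random_sites[!overlaps_any(random_sites, tss_and_p300_toy), ]
# Steps 2 and 3: append the TSS/DHS intersection to form the negative class.
negative_toy <- rbind(distal, tss_dhs_toy)
print(negative_toy)
```

The first random site is dropped because it overlaps the excluded region on chr1, while the chr2 site survives because the chromosomes differ.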

In [20]:
## Import resultant files from intersection.
negative_class <- read.table("./data/H1_Cell_Line/negative_class.bed", sep = "\t", header = FALSE)
negative_class$V4 <- "Non-Enhancer"
write.table(negative_class,"./data/H1_Cell_Line/negative_class.bed", sep="\t", row.names=FALSE, quote = FALSE)

positive_class <- read.table("./data/H1_Cell_Line/h1_p300_dhs_intersect_class_header_removed.bed", sep = "\t", 
                             header = FALSE)

In [22]:
## Merging Data on the basis of overlapping intervals

## Positive class data (labels)
head(positive_class)

V1,V2,V3,V4
<fct>,<int>,<int>,<fct>
chr1,10101,10140,enhancer
chr1,10148,10209,enhancer
chr1,10235,10290,enhancer
chr1,10444,10580,enhancer
chr1,11362,11397,enhancer
chr1,12302,12333,enhancer


In [23]:
## Negative class (Labels)
head(negative_class)


V1,V2,V3,V4
<fct>,<int>,<int>,<chr>
chr1,847,947,Non-Enhancer
chr1,8952,9052,Non-Enhancer
chr1,39831,39931,Non-Enhancer
chr1,40608,40708,Non-Enhancer
chr1,43842,43942,Non-Enhancer
chr1,47908,48008,Non-Enhancer


In [1]:
## Score Matrix (Input data)
input_score_data <- merged_bw

## Converting data to GRanges objects
library(GenomicRanges)
positive_class_labels <- GRanges(seqnames = positive_class$V1, ranges = IRanges(start = positive_class$V2, 
                                                                       end = positive_class$V3))
mcols(positive_class_labels) <- DataFrame(class= "Enhancer")


negative_class_labels <- GRanges(seqnames = negative_class$V1, ranges = IRanges(start = negative_class$V2, 
                                                                       end = negative_class$V3))
mcols(negative_class_labels) <- DataFrame(class= "Non-Enhancer")


input_score <- GRanges(seqnames = input_score_data$chrom, ranges = IRanges(start = input_score_data$start,
                                                                        end = input_score_data$end))
mcols(input_score) <- DataFrame(H3K27ac = input_score_data$H3K27ac, H3K4me3 = input_score_data$H3K4me3,
                                H3K4me2 = input_score_data$H3K4me2, H3K4me1 = input_score_data$H3K4me1)

## Performing merge to figure out the score and class matrix.

intermatrix <- merge(as.data.frame(positive_class_labels), as.data.frame(negative_class_labels), 
                      all= TRUE) ## positive and negative classes ##


**Exporting 'intermatrix' and 'input_score' as bed files to merge.**

> write.table(intermatrix, "./data/class_labels.bed", sep ='\t', quote = FALSE, row.names = FALSE) <br>
> write.table(input_score, "./data/score.bed", sep ='\t', quote = FALSE, row.names = FALSE) <br>

The headers are removed from both files, which are then merged with the **intersectBed** tool.

> awk '{if (NR!=1) {print}}' score.bed > score_header_removed.bed <br>
> awk '{if (NR!=1) {print}}' class_labels.bed > class_labels_header_removed.bed <br>


> bedtools intersect -wa -wb -a score_header_removed.bed -b class_labels_header_removed.bed > score_labels.bed <br>

The dataset we finally use for the machine learning model is the following.

In [2]:
## Importing the merged file

score_labels <- read.table("./data/score_labels.bed", sep = "\t", header = FALSE, stringsAsFactors=FALSE)

In [3]:
str(score_labels)

'data.frame':	19834154 obs. of  15 variables:
 $ V1 : chr  "chr1" "chr1" "chr1" "chr1" ...
 $ V2 : int  0 0 10100 10100 10100 10100 10100 10100 10100 10100 ...
 $ V3 : int  9900 9900 17400 17400 17400 17400 17400 17400 17400 17400 ...
 $ V4 : int  9901 9901 7301 7301 7301 7301 7301 7301 7301 7301 ...
 $ V5 : chr  "*" "*" "*" "*" ...
 $ V6 : chr  "0" "0" "0" "0" ...
 $ V7 : chr  "0" "0" "0" "0" ...
 $ V8 : chr  "0" "0" "0" "0" ...
 $ V9 : chr  "0" "0" "0" "0" ...
 $ V10: chr  "chr1" "chr1" "chr1" "chr1" ...
 $ V11: int  847 8952 10101 10148 10235 10444 11362 12302 12302 14931 ...
 $ V12: int  947 9052 10140 10209 10290 10580 11397 12333 12337 14956 ...
 $ V13: int  101 101 40 62 56 137 36 32 36 26 ...
 $ V14: chr  "*" "*" "*" "*" ...
 $ V15: chr  "Non-Enhancer" "Non-Enhancer" "Enhancer" "Enhancer" ...


We notice that our effective data is "sparse", i.e. there are few non-zero entries and a large portion of the data is occupied by zeroes. A good strategy for saving memory and processing time is the **sparse matrix** feature available from the Matrix library.

Let us first isolate the actual input data from the entire dataframe and convert it into matrix format. 

In [1]:
## Picking relevant columns
final_data <- score_labels[,c(6:9,15)]
colnames(final_data)=c("peaks_h3k27ac","peaks_h3k4me3","peaks_h3k4me2","peaks_h3k4me1","class")

## The matrix is sparse, i.e. it has many zero entries. Let us convert it into a sparse matrix object.
library(Matrix)
final_sparse_data <- final_data
## as.factor() orders levels alphabetically, so "Enhancer" becomes 0 and "Non-Enhancer" becomes 1.
final_sparse_data$class <- as.numeric(as.factor(final_sparse_data$class))-1
final_sparse_data$peaks_h3k27ac <- as.numeric(final_sparse_data$peaks_h3k27ac)
final_sparse_data$peaks_h3k4me3 <- as.numeric(final_sparse_data$peaks_h3k4me3)
final_sparse_data$peaks_h3k4me2 <- as.numeric(final_sparse_data$peaks_h3k4me2)
final_sparse_data$peaks_h3k4me1 <- as.numeric(final_sparse_data$peaks_h3k4me1)

final_sparse_data <- final_sparse_data[complete.cases(final_sparse_data), ]

## Converting data frame to matrix and then to sparse matrix.
final_sparse_data_matrix <- data.matrix(final_sparse_data)
final_sparse_data_matrix <- Matrix(final_sparse_data_matrix, sparse=TRUE)

“NAs introduced by coercion”
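The numeric class codes depend on the order of the factor levels, which is alphabetical by default. The short sketch below, on made-up labels, shows the default encoding next to an explicit one; `labels`, `default_codes`, and `explicit_codes` are hypothetical names, and choosing `Enhancer = 1` is only an illustration of how the level order can be fixed if that convention is preferred.

```r
labels <- c("Non-Enhancer", "Enhancer", "Enhancer")

# Default: levels sorted alphabetically -> Enhancer = 0, Non-Enhancer = 1.
default_codes <- as.numeric(as.factor(labels)) - 1

# Explicit levels: Non-Enhancer = 0, Enhancer = 1.
explicit_codes <- as.numeric(factor(labels, levels = c("Non-Enhancer", "Enhancer"))) - 1

print(default_codes)
print(explicit_codes)
```

Stating the levels explicitly keeps the encoding stable even if the set of labels changes.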

We observe that all zero entries have been suppressed with a dot (.) symbol. Let us write this data to the disk for use with other applications too.

In [2]:
head(final_sparse_data)
print(object.size(final_sparse_data),units="auto")

peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1,class
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,0,0,0,1
0,0,0,0,1
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0


830.4 Mb


We see that the resultant data matrix is substantially smaller in size, while preserving the integrity of the original data.
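The memory effect of a sparse representation can be checked on a small scale by comparing a dense matrix with its sparse counterpart. This is a minimal sketch on synthetic data, assuming the Matrix package is installed; `dense` and `sparse` are hypothetical objects, and the exact sizes printed will vary with the R version.

```r
library(Matrix)

# Synthetic data: 10000 x 4 matrix with only 100 non-zero entries.
dense <- matrix(0, nrow = 10000, ncol = 4)
dense[seq(1, 10000, by = 100), 1] <- 0.5

# Same values, stored in compressed sparse form.
sparse <- Matrix(dense, sparse = TRUE)

print(object.size(dense), units = "auto")
print(object.size(sparse), units = "auto")
print(sparse[1:3, ])  # zero entries are displayed as "."
```

Only the non-zero values and their positions are stored, so the saving grows with the proportion of zeroes in the data.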

In [29]:
saveRDS(final_sparse_data,"./data/ep_data.rds")

## References

> Kim, S. G., Harwani, M., Grama, A., & Chaterji, S. (2016). EP-DNN: A Deep Neural Network-Based Global Enhancer Prediction Algorithm. Scientific Reports, 6, 38433. https://doi.org/10.1038/srep38433