# Deep Learning 

## Background

Machine learning has become pervasive in our digital lives. Artificial Intelligence is the umbrella discipline covering systems that exhibit human-like behaviour. **Deep Learning** is formally a subfield of Machine Learning, although it has grown into a largely distinct stream with its own methods and tooling.  
In the previous session, we discussed decision trees and random forests and how crucial the former are to the latter. The diversity of the individual trees and the aggregation of their results give random forests the ability to handle multidimensional data. Although neural nets do not have exactly the *same* relationship with deep learning, it is important to understand them before proceeding to the broader theme of deep learning. 

<img src="./props/AI_branches.jpg">

As mentioned above, deep learning is founded on **neural networks**, which have outperformed other classification algorithms such as Support Vector Machines (SVM) and logistic regression on many tasks. *Neural networks* are a family of algorithms that assign and adjust weights as data flows from the input, through the layers, to the classification result (output).  

<img src="./props/deep_learning_NN_better.jpg">

### Exercise

> Loosely, Neural Networks is to Deep Learning as ________ is to Random Forests.

## Package Installation

### keras

#### Installation Errors

Installation may fail on some systems because of missing or conflicting dependencies. A few of the errors encountered are shown below:

<img src="./props/install_keras_error.jpg">

TensorFlow is Google's framework for machine learning and deep learning tasks; *Anaconda* and *Python* are the major prerequisites for deploying *TensorFlow* in R. 

<img src="./props/install_keras_error1.jpg">

Often,
- an entry is missing from the PATH variable, or 
- the *Anaconda* and *Python* installations shadow one another. 
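
Both situations can be checked from within R by looking at which Python executable is resolved first and how the PATH is composed. The sketch below uses only base R; the pattern `conda|python` used to filter the PATH entries is an assumption chosen for illustration and is not exhaustive.

```r
# Which python executable does R resolve first? ("" means none was found on the PATH)
python_path <- Sys.which("python")
print(python_path)

# Split PATH into its entries and keep those mentioning conda or python
# (the filter pattern is illustrative only)
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
python_entries <- grep("conda|python", path_entries, ignore.case = TRUE, value = TRUE)
print(python_entries)
```

If several entries are listed, the one that appears first takes precedence, which is usually how one installation ends up shadowing another.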

<img src="./props/install_keras_error2.jpg">

### neuralnet

In [1]:
## Install the package and load the library

install.packages("neuralnet", 
                 repos = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/",
                 dependencies = TRUE)
library(neuralnet)


The downloaded binary packages are in
	/var/folders/hm/c3_fjypn62v5xh5b5ygv267m0000gn/T//Rtmp2kJhmA/downloaded_packages


## Dataset Preparation

The selection of data and the analysis protocol are loosely borrowed from Kim et al. (2016). In a typical machine learning problem, we have a single consolidated dataset (with class and variable/feature definitions) that is split, typically in 3:7 or 2:8 proportions, into testing and training sets respectively. In this study, by contrast, the two sets are sourced differently, as you'll see.
In this exercise, the training dataset has both positive and negative examples, but the test dataset has only positive examples. And that is perfectly fine.

### Testing Data

For testing data, we consider data only for enhancers. Even with this data, we can evaluate how well the model predicts positive examples. 
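
With a positives-only test set, the quantity that can be measured is the fraction of true enhancers the model recovers (recall, or sensitivity); precision and specificity would need negative examples. A minimal sketch, assuming hypothetical predicted probabilities and an illustrative decision threshold of 0.5:

```r
# Hypothetical model outputs for five test regions, all of which are true enhancers
predicted_prob <- c(0.91, 0.40, 0.77, 0.55, 0.12)

# Call a region an enhancer when its score reaches the (assumed) threshold
threshold <- 0.5
predicted_enhancer <- predicted_prob >= threshold

# Recall = correctly recovered positives / all positives
recall <- mean(predicted_enhancer)
print(recall)
```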

As noted, the primary step is to source the BAM files, index them, and finally transform them into BEDGRAPH files. The columns are then named appropriately. 

In [2]:
h3k27ac <- read.csv("./data/H1_Cell_Line/H3K27ac/ENCFF663SAM.bw", sep = '\t', header = FALSE)
colnames(h3k27ac) <- c("chrom","start","end","peaks")
h3k4me3 <- read.csv("./data/H1_Cell_Line/H3K4me3/ENCFF340UJK.bw", sep = '\t', header = FALSE)
colnames(h3k4me3) <- c("chrom","start","end","peaks")
h3k4me2 <- read.csv("./data/H1_Cell_Line/H3K4me2/ENCFF799BDH.bw", sep = '\t', header = FALSE)
colnames(h3k4me2) <- c("chrom","start","end","peaks")
h3k4me1 <- read.csv("./data/H1_Cell_Line/H3K4me1/ENCFF441KOL.bw", sep = '\t', header = FALSE)
colnames(h3k4me1) <- c("chrom","start","end","peaks")

These files (with the extension **.bw**) are then imported into the online Galaxy interface and processed with the *bedtools Merge* function. The four files, one per histone mark, are merged on the basis of overlapping intervals to give a matrix of peak counts that act as features for the 'enhancer' class. 

The resultant file is eventually imported back to R for further processing.

In [3]:
## Importing the merged BEDGRAPH/BW file

merged_bw <- read.csv("./data/H1_Cell_Line/bedtools_Merge_2000.bedgraph", sep = '\t', header = FALSE)
#merged_bw <- merged_bw[-1,]
colnames(merged_bw) <- c("chrom", "start", "end", "peaks_h3k27ac", "peaks_h3k4me3", "peaks_h3k4me2", "peaks_h3k4me1")
head(merged_bw)
merged_bw$class <- "enhancer"
head(merged_bw)


## Keeping examples from standard chromosomes only and filtering out the rest.

chromosomes <- c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14",
                 "chr15", "chr16", "chr17", "chr18", "chr19", "chr20","chr21", "chr22", "chrX", "chrY")
merged_bw<- as.data.frame(merged_bw[merged_bw$chrom %in% chromosomes, ])


## Deriving test data as input to the deep learning model.

test <- merged_bw[,c(4:8)]


chrom,start,end,peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,0,8000,0.0,0,0,0
chr1,8000,12000,0.0567974,0,0,0
chr1,12000,16000,0.0,0,0,0
chr1,16000,18000,0.0567974,0,0,0
chr1,18000,40000,0.0,0,0,0
chr1,40000,42000,0.0567974,0,0,0


chrom,start,end,peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1,class
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
chr1,0,8000,0.0,0,0,0,enhancer
chr1,8000,12000,0.0567974,0,0,0,enhancer
chr1,12000,16000,0.0,0,0,0,enhancer
chr1,16000,18000,0.0567974,0,0,0,enhancer
chr1,18000,40000,0.0,0,0,0,enhancer
chr1,40000,42000,0.0567974,0,0,0,enhancer


Let's check out the dataset. 

In [4]:
head(test)

peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1,class
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0,0,0,0,enhancer
0.0567974,0,0,0,enhancer
0.0,0,0,0,enhancer
0.0567974,0,0,0,enhancer
0.0,0,0,0,enhancer
0.0567974,0,0,0,enhancer


The deep learning model has two basic requirements for the input data.
1. The data has to be *numeric* in type.
2. It has to range from 0 to 1. If it doesn't already, a normalization procedure can bring it into that range. The preferred one is min-max normalization.
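
Min-max normalization rescales each value as (x - min) / (max - min). A small helper keeps this in one place. The function name `min_max` and the handling of constant columns (returned as zeros) are assumptions made for this sketch, not part of the original analysis.

```r
# Rescale a numeric vector to the [0, 1] range, ignoring NA values
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant column would divide by zero; return zeros (NA stays NA)
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled <- min_max(c(2, 4, 6, NA, 10))
print(scaled)
```

Applied column by column, a helper of this kind could stand in for repeated per-column expressions.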

In [5]:
# Transforming to numeric(double).

test$peaks_h3k27ac <- as.double(as.character(test$peaks_h3k27ac))
test$peaks_h3k4me3 <- as.double(as.character(test$peaks_h3k4me3))
test$peaks_h3k4me2 <- as.double(as.character(test$peaks_h3k4me2))
test$peaks_h3k4me1 <- as.double(as.character(test$peaks_h3k4me1))


In [6]:
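## For fair results, let's make sure that no "NA" values exist.
## na.omit() returns a new data frame, so the result is assigned back to 'test'.
test <- na.omit(test)
head(test, 10)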

## Min-Max normalization
test$peaks_h3k27ac <- (test$peaks_h3k27ac-min(test$peaks_h3k27ac, na.rm = T))/(max(test$peaks_h3k27ac, na.rm = T)-min(test$peaks_h3k27ac, na.rm = T))
test$peaks_h3k4me3 <- (test$peaks_h3k4me3-min(test$peaks_h3k4me3, na.rm = T))/(max(test$peaks_h3k4me3, na.rm = T)-min(test$peaks_h3k4me3, na.rm = T))
test$peaks_h3k4me2 <- (test$peaks_h3k4me2-min(test$peaks_h3k4me2, na.rm = T))/(max(test$peaks_h3k4me2, na.rm = T)-min(test$peaks_h3k4me2, na.rm = T))
test$peaks_h3k4me1 <- (test$peaks_h3k4me1-min(test$peaks_h3k4me1, na.rm = T))/(max(test$peaks_h3k4me1, na.rm = T)-min(test$peaks_h3k4me1, na.rm = T))


peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1,class
<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0000000,0.0000000,0.0000000,enhancer
0.0000000,0.0000000,0.0000000,0.0000000,enhancer
0.0567974,0.0432946,0.0000000,0.0000000,enhancer


Next, we import data that has already been preprocessed. This is for the positive class labels.

In [7]:
## Importing relevant data ##
## Positive Class Labels ##

dhs <- read.csv("./data/H1_Cell_Line/GSM878621_H1_DNase_sorted.bed", sep = '\t', header = FALSE)
p300 <- read.csv("./data/H1_Cell_Line/GSM831036_H1_P300_sorted.bed", sep = '\t', header = FALSE)

## We see that the chromosome names in p300 data are just numbers. Let's add "chr" in the beginning for consistency. 
p300$V1 <- paste0("chr",p300$V1)

## Selecting useful columns: chrom, start, end.
p300 <- p300[,c(1,2,3)]
dhs <- dhs[,c(1,2,3)]

## Naming the columns.
colnames(dhs) <- c("chrom","start","end")
colnames(p300) <- c("chrom","start","end")

## Keeping standard chromosomes only.
dhs <- as.data.frame(dhs[dhs$chrom %in% chromosomes, ])
p300 <- as.data.frame(p300[p300$chrom %in% chromosomes, ])

## Saving files

write.table(p300,"./data/H1_Cell_Line/h1_p300.bed", sep="\t", row.names=FALSE, quote = FALSE)
write.table(dhs,"./data/H1_Cell_Line/h1_dhs.bed", sep="\t", row.names=FALSE, quote = FALSE)

## Add the class label "Enhancer" to the p300/DHS intersection (this file is generated outside R).

h1_p300_dhs_intersect <- read.table("./data/H1_Cell_Line/h1_p300_dhs_intersect.bed", sep = "\t")
h1_p300_dhs_intersect$V4 <- "Enhancer"
write.table(h1_p300_dhs_intersect,"./data/H1_Cell_Line/h1_p300_dhs_intersect_class.bed",sep="\t",row.names=FALSE, quote = FALSE)


Now, for the negative class labels, i.e. non-enhancers.

In [13]:
## Negative Class Labels ##

tss_sites <- read.table("./data/H1_Cell_Line/TSS_Indices_Human_Genome.txt", sep = "\t", header = TRUE)
tss_sites$Chromosome.scaffold.name <- paste0("chr", tss_sites$Chromosome.scaffold.name)
tss_sites <- as.data.frame(tss_sites[tss_sites$Chromosome.scaffold.name  %in% chromosomes, ])
tss_sites <- tss_sites[order(tss_sites$Chromosome.scaffold.name),]

## Export TSS sites to create an overlap with DHS sites

write.table(tss_sites,"./data/H1_Cell_Line/tss_sites.bed", sep="\t", row.names=FALSE, quote = FALSE)


## Shifting intervals forward by 3000 bp
tss_sites_test <- tss_sites
tss_sites_test$Transcript.start..bp. <- tss_sites_test$Transcript.start..bp.+3000
tss_sites_test$Transcript.end..bp. <- tss_sites_test$Transcript.end..bp.+3000

## Sourcing chromosomal lengths in the human genome for generating random tracks

hg19_chrom_sizes <- read.table(url("https://genome.ucsc.edu/goldenPath/help/hg19.chrom.sizes"), sep = "\t", header = FALSE, col.names = c("chrom", "size"))
hg19_chrom_sizes <- as.data.frame(hg19_chrom_sizes[hg19_chrom_sizes$chrom %in% chromosomes, ])

## Saving file for generating random tracks via 'bedtools random' function ##
write.table(hg19_chrom_sizes,"./data/H1_Cell_Line/hg19.genome", sep="\t", row.names=FALSE, quote = FALSE)

## Recalling

hg19_random_tracks <- read.table("./data/H1_Cell_Line/hg19_random_tracks.bed", sep = "\t", header = FALSE)
hg19_random_tracks <- hg19_random_tracks[,c(1,2,3)]
hg19_random_tracks_ordered <- hg19_random_tracks[order(hg19_random_tracks[,1],hg19_random_tracks[,2]),]

write.table(hg19_random_tracks_ordered,"./data/H1_Cell_Line/hg19_random_tracks_sorted_required.bed", sep="\t", row.names=FALSE, quote = FALSE)

## Choosing random sites distal to TSS

## The strategy is to keep random tracks that are distal to both the TSS and the p300 binding sites. To do so,
## let us create a combination of TSS and p300 binding sites and then subtract these from the random sites,
## thus giving us the residuals.

tss_final <- read.table("./data/H1_Cell_Line/tss_sites_header_removed.bed", sep = "\t", header = FALSE)
p300_final <- read.table("./data/H1_Cell_Line/h1_p300_merged.bed", sep = "\t", header = FALSE)

## Combine the intervals and not 'merge' them
tss_or_p300 <- rbind(tss_final,p300_final)

## sort on the basis of chromosome names.
tss_or_p300 <- tss_or_p300[order(tss_or_p300[,1]),]

## Output file.
write.table(tss_or_p300,"./data/H1_Cell_Line/tss_or_p300.bed", sep="\t", row.names=FALSE, quote = FALSE)


## Import resultant files from intersection.
negative_class <- read.table("./data/H1_Cell_Line/negative_class.bed", sep = "\t", header = TRUE)
negative_class$V4 <- "Non-Enhancer"
write.table(negative_class,"./data/H1_Cell_Line/negative_class.bed", sep="\t", row.names=FALSE, quote = FALSE)

positive_class <- read.table("./data/H1_Cell_Line/h1_p300_dhs_intersect_class_header_removed.bed", sep = "\t", header = FALSE)

In [9]:
## Merging Data on the basis of overlapping intervals

## Positive class data (labels)
positive_class

V1,V2,V3,V4
<fct>,<int>,<int>,<fct>
chr1,10101,10140,enhancer
chr1,10148,10209,enhancer
chr1,10235,10290,enhancer
chr1,10444,10580,enhancer
chr1,11362,11397,enhancer
chr1,12302,12333,enhancer
chr1,14931,14956,enhancer
chr1,15690,15708,enhancer
chr1,16214,16230,enhancer
chr1,17465,17485,enhancer


V1,V2,V3,V4
<fct>,<fct>,<fct>,<chr>
V1,V2,V3,Non-Enhancer
chr1,847,947,Non-Enhancer
chr1,8952,9052,Non-Enhancer
chr1,39831,39931,Non-Enhancer
chr1,40608,40708,Non-Enhancer
chr1,43842,43942,Non-Enhancer
chr1,47908,48008,Non-Enhancer
chr1,64271,64371,Non-Enhancer
chr1,76123,76223,Non-Enhancer
chr1,79121,79221,Non-Enhancer


In [14]:
## Negative class (Labels)
negative_class


V1,V2,V3,V4
<fct>,<int>,<int>,<chr>
chr1,847,947,Non-Enhancer
chr1,8952,9052,Non-Enhancer
chr1,39831,39931,Non-Enhancer
chr1,40608,40708,Non-Enhancer
chr1,43842,43942,Non-Enhancer
chr1,47908,48008,Non-Enhancer
chr1,64271,64371,Non-Enhancer
chr1,76123,76223,Non-Enhancer
chr1,79121,79221,Non-Enhancer
chr1,179003,179103,Non-Enhancer


In [18]:
## Score Matrix (Input data)
input_score_data <- read.table("./data/H1_Cell_Line/bedtools_Merge_2000.bedgraph", sep = "\t", header = FALSE)
input_score_data<- as.data.frame(input_score_data[input_score_data$V1 %in% chromosomes, ])


## Converting data to GRanges objects
library(GenomicRanges)
positive_class_labels <- GRanges(seqnames = positive_class$V1, ranges = IRanges(start = positive_class$V2, 
                                                                       end = positive_class$V3))
mcols(positive_class_labels) <- DataFrame(class= "Enhancer")


negative_class_labels <- GRanges(seqnames = negative_class$V1, ranges = IRanges(start = negative_class$V2, 
                                                                       end = negative_class$V3))
mcols(negative_class_labels) <- DataFrame(class= "Non-Enhancer")


input_score <- GRanges(seqnames = input_score_data$V1, ranges = IRanges(start = input_score_data$V2,
                                                                        end = input_score_data$V3))
mcols(input_score) <- DataFrame(peaks_h3k27ac = input_score_data$V4, peaks_h3k4me3 = input_score_data$V5,
                                peaks_h3k4me2 = input_score_data$V6, peaks_h3k4me1 = input_score_data$V7)

## Performing merge to figure out the score and class matrix.

intermatrix1 <- merge(positive_class_labels, negative_class_labels, all= TRUE) ## positive and negative classes ##
intermatrix2 <- merge(intermatrix1, input_score, all= TRUE) ## scores and classes ##

## Pulling back the data frame from the GenomicRanges format for (i) sorting (ii) removing NAs.

final_data <- data.frame(intermatrix2)
str(final_data)

## Sorting on the basis of first two columns, viz. seqnames, start.
final_data <- final_data[order(final_data$seqnames, final_data$start),]

## Let's make a duplicate copy of the final results dataframe for further processing. We must preserve a copy for backup.
buffer_final_data <- final_data

## Pruning rows involving NA terms.
## Note: a nested loop that deletes row i while iterating over 1:nrow() does not work,
## because every deletion shifts the remaining row indices (it would also be very slow
## on roughly 8 million rows). A vectorised alternative would be:
# buffer_final_data <- buffer_final_data[complete.cases(buffer_final_data), ]
## Here all rows are retained and the missing peak values are zero-filled instead.

## Replacing NAs with 0s(zeros) in the peaks' columns.

buffer_final_data$peaks_h3k27ac[is.na(buffer_final_data$peaks_h3k27ac)] <- 0
buffer_final_data$peaks_h3k4me3[is.na(buffer_final_data$peaks_h3k4me3)] <- 0
buffer_final_data$peaks_h3k4me2[is.na(buffer_final_data$peaks_h3k4me2)] <- 0
buffer_final_data$peaks_h3k4me1[is.na(buffer_final_data$peaks_h3k4me1)] <- 0


“Some of the objects to merge contain duplicated elements. These
  elements were removed by applying unique() to each object before the
  merging.”

'data.frame':	8170997 obs. of  10 variables:
 $ seqnames     : Factor w/ 146 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ start        : int  0 847 8000 8952 10101 10148 10235 10444 11362 12000 ...
 $ end          : int  8000 947 12000 9052 10140 10209 10290 10580 11397 16000 ...
 $ width        : int  8001 101 4001 101 40 62 56 137 36 4001 ...
 $ strand       : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
 $ class        : chr  NA "Non-Enhancer" NA "Non-Enhancer" ...
 $ peaks_h3k27ac: num  0 NA 0.0568 NA NA ...
 $ peaks_h3k4me3: num  0 NA 0 NA NA NA NA NA NA 0 ...
 $ peaks_h3k4me2: num  0 NA 0 NA NA NA NA NA NA 0 ...
 $ peaks_h3k4me1: num  0 NA 0 NA NA NA NA NA NA 0 ...
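
Two vectorised ways of dealing with the NA entries are to drop incomplete rows or to zero-fill the numeric columns, as done above. The toy data frame below is hypothetical and only illustrates the difference between the two:

```r
# Toy data with missing values in both a label column and a score column
toy <- data.frame(class = c("Enhancer", NA, "Non-Enhancer"),
                  peaks = c(NA, 0.5, 0.2),
                  stringsAsFactors = FALSE)

# Option 1: keep only rows without any NA
complete_rows <- toy[complete.cases(toy), ]

# Option 2: keep every row and replace missing scores with 0
zero_filled <- toy
zero_filled$peaks[is.na(zero_filled$peaks)] <- 0

print(nrow(complete_rows))
print(zero_filled$peaks)
```

Zero-filling keeps all rows but treats "no measurement" as "no signal", which is worth bearing in mind when interpreting the model.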


The final dataset used for the machine learning model is the following.

In [19]:
buffer_final_data

seqnames,start,end,width,strand,class,peaks_h3k27ac,peaks_h3k4me3,peaks_h3k4me2,peaks_h3k4me1
<fct>,<int>,<int>,<int>,<fct>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,0,8000,8001,*,,0.0000000,0,0,0
chr1,847,947,101,*,Non-Enhancer,0.0000000,0,0,0
chr1,8000,12000,4001,*,,0.0567974,0,0,0
chr1,8952,9052,101,*,Non-Enhancer,0.0000000,0,0,0
chr1,10101,10140,40,*,Enhancer,0.0000000,0,0,0
chr1,10148,10209,62,*,Enhancer,0.0000000,0,0,0
chr1,10235,10290,56,*,Enhancer,0.0000000,0,0,0
chr1,10444,10580,137,*,Enhancer,0.0000000,0,0,0
chr1,11362,11397,36,*,Enhancer,0.0000000,0,0,0
chr1,12000,16000,4001,*,,0.0000000,0,0,0


In [20]:
str(buffer_final_data)

'data.frame':	8170997 obs. of  10 variables:
 $ seqnames     : Factor w/ 146 levels "chr1","chr10",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ start        : int  0 847 8000 8952 10101 10148 10235 10444 11362 12000 ...
 $ end          : int  8000 947 12000 9052 10140 10209 10290 10580 11397 16000 ...
 $ width        : int  8001 101 4001 101 40 62 56 137 36 4001 ...
 $ strand       : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
 $ class        : chr  NA "Non-Enhancer" NA "Non-Enhancer" ...
 $ peaks_h3k27ac: num  0 0 0.0568 0 0 ...
 $ peaks_h3k4me3: num  0 0 0 0 0 0 0 0 0 0 ...
 $ peaks_h3k4me2: num  0 0 0 0 0 0 0 0 0 0 ...
 $ peaks_h3k4me1: num  0 0 0 0 0 0 0 0 0 0 ...
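
In the output above, rows carrying a class label have zero peak values and rows carrying peak values have no label, which suggests that the merge matched identical ranges rather than overlapping ones. One way to attach bin scores to labelled intervals is to look up the bin that contains each interval's start. The sketch below uses base R `findInterval` on hypothetical toy data and assumes a single chromosome with sorted, contiguous bins; a full solution would also have to handle multiple chromosomes and intervals that span several bins.

```r
# Score bins (sorted, contiguous) for one chromosome: toy values
bins <- data.frame(start = c(0, 8000, 12000),
                   end   = c(8000, 12000, 16000),
                   peaks_h3k27ac = c(0, 0.0567974, 0))

# Labelled intervals on the same chromosome: toy values
labels <- data.frame(start = c(847, 10101, 12302),
                     end   = c(947, 10140, 12333),
                     class = c("Non-Enhancer", "Enhancer", "Enhancer"),
                     stringsAsFactors = FALSE)

# Index of the bin whose start is the largest one not exceeding the interval start
bin_index <- findInterval(labels$start, bins$start)

# Attach the score of the containing bin to each labelled interval
labels$peaks_h3k27ac <- bins$peaks_h3k27ac[bin_index]
print(labels)
```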


#### Transcription Start Sites  (TSS)

According to Wikipedia, an enhancer is a short (50-1500 bp) region of DNA that can be bound by proteins. Enhancers can be located quite far from the promoter sequences of the genes that house the TSS. The transcription start sites' indices (start and end positions) are 'constant' throughout the genome: gene positions are the same across cell types, and what differs between cell types is the set of genes that get regulated. One source for downloading the TSS data is 'Ensembl Biomart'.
> Step 1: Choose 'Human genes' under the 'Dataset' tab on the left pane.  
> Step 2: Under 'Attributes', select  
>     (i) Chromosome/scaffold name  
>     (ii) Transcript start (bp)  
>     (iii) Transcript end (bp)  
> Step 3: Click on the 'Results' button and download appropriately.

Other sources are the refTSS and DBTSS databases. 

In [6]:
# Data Partition
set.seed(108)
ind <- sample(2, nrow(test), replace = TRUE, prob = c(0.7, 0.3))
training <- test[ind==1,]
testing <- test[ind==2,]

# Neural Networks
set.seed(007)
nn <- neuralnet(class~peaks_h3k4me3+peaks_h3k4me2+peaks_h3k4me1+peaks_h3k27ac,
               data = training,
               hidden = 5,
               err.fct = "sse",
               act.fct = "logistic",
               linear.output = FALSE)
plot(nn)

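ERROR: Error in neuralnet(class ~ peaks_h3k4me3 + peaks_h3k4me2 + peaks_h3k4me1 + : could not find function "neuralnet"

This error means that the *neuralnet* package was not attached in the current session; running `library(neuralnet)`, as in the installation cell, before this cell resolves it. Depending on the package version, the character `class` column may also need to be converted to a factor or a 0/1 numeric before training.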


## References

> Kim, S. G., Harwani, M., Grama, A., & Chaterji, S. (2016). EP-DNN: A Deep Neural Network-Based Global Enhancer Prediction Algorithm. Scientific Reports, 6, 38433. https://doi.org/10.1038/srep38433