# Deep Learning 

## Background

Machine learning has become a pervasive part of our digital lives. Artificial Intelligence is the umbrella discipline covering the various ways in which a system can exhibit human-like behavior. **Deep Learning** is usually described as a subfield of Machine Learning, although it has evolved into a distinct stream of its own.  
Recall from the previous session that we discussed decision trees and random forests, and how crucial the former are to the latter. The diversity of the individual trees and the aggregation of their results give random forests the capacity to handle multidimensional data. Although neural networks do not stand in exactly the *same* relationship to deep learning, it is important to understand them before proceeding to the broader theme of deep learning.

<img src="./props/AI_branches.jpg">

As mentioned above, deep learning is founded on **neural networks**, which have outperformed other classification algorithms such as Support Vector Machines (SVM) and logistic regression on many tasks. *Neural networks* are a family of algorithms that learn by assigning and adjusting weights as the original data (input) passes through successive layers to the classification result (output).

<img src="./props/deep_learning_NN_better.jpg">
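To make the idea of weights carrying an input through layers more concrete, here is a minimal sketch of a single forward pass in base R. The layer sizes, weight values, and the sigmoid activation are arbitrary assumptions chosen for illustration; this is not the architecture used later, and in practice the weights are learned rather than set by hand.

```r
# Sigmoid activation: squashes any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# One made-up example with four input features.
x <- c(0.2, 0.5, 0.1, 0.9)

# Arbitrary (untrained) weights and biases: 4 inputs -> 3 hidden units -> 1 output.
W1 <- matrix(0.1, nrow = 3, ncol = 4)
b1 <- rep(0, 3)
W2 <- matrix(0.5, nrow = 1, ncol = 3)
b2 <- 0

# Forward pass: each layer is a weighted sum followed by the activation.
hidden <- sigmoid(W1 %*% x + b1)
output <- sigmoid(W2 %*% hidden + b2)

# The output can be read as a score between 0 and 1 for the positive class.
print(drop(output))
```

Training consists of adjusting `W1`, `b1`, `W2`, and `b2` so that this score agrees with the known class labels.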

### Exercise

> Loosely, Neural Networks is to Deep Learning as ________ is to Random Forests.

## Package Installation

### keras

#### Installation Errors

Installation can fail on some systems because of missing or conflicting dependencies. A few of the errors encountered are shown below:

<img src="./props/install_keras_error.jpg">

TensorFlow is Google's framework for machine learning and deep learning tasks; *Anaconda* and *Python* are the major prerequisites for deploying *TensorFlow* in R.

<img src="./props/install_keras_error1.jpg">

Often,
- an entry is missing from the PATH variable, or
- one of the *Anaconda* and *Python* installations is shadowing the other.

<img src="./props/install_keras_error2.jpg">
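A quick way to investigate both causes is to ask R which executables it can actually find on the PATH. The sketch below uses only base R; the helper name `check_python_path` and the list of tool names are assumptions made for illustration, and the executables may be named differently on your system.

```r
# Hypothetical helper: report which Python/conda executables R can see on the PATH.
check_python_path <- function(tools = c("python", "python3", "conda")) {
  found <- Sys.which(tools)  # named character vector; "" means "not found"
  data.frame(
    tool    = names(found),
    path    = unname(found),
    on_path = unname(nzchar(found)),
    stringsAsFactors = FALSE
  )
}

print(check_python_path())

# The usual installation route for the keras R package (not run here):
# install.packages("keras")
# keras::install_keras()
```

If a tool is reported as not found, or the reported path points to an unexpected installation, the PATH variable is the first thing to inspect.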

## Dataset Preparation

The overall selection of data and the analysis protocol are loosely borrowed from Kim et al. (2016). In a typical machine learning problem, we have a single consolidated dataset (with class and variable/feature definitions) that is split, typically in 3:7 or 2:8 proportions, into testing and training sets respectively. In this study, by contrast, the two sets are sourced separately, as you'll see.
In this exercise, the training dataset contains both positive and negative examples, whereas the test dataset contains only positive examples. And that is perfectly fine.
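For contrast, the conventional single-dataset split described above can be sketched in base R as follows. The toy data frame, the 30% test fraction, the seed, and the `background` label are illustrative assumptions and not part of this study's data.

```r
set.seed(42)  # make the random split reproducible

# Toy dataset: ten examples with one feature and a class label (made-up values).
toy <- data.frame(
  feature = runif(10),
  class   = rep(c("enhancer", "background"), each = 5)
)

# Draw 30% of the row indices for testing; the remaining rows form the training set.
test_idx  <- sample(nrow(toy), size = round(0.3 * nrow(toy)))
toy_test  <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```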

### Testing Data

For the testing data, we consider enhancers only. Even with positive examples alone, we can evaluate how reliably the model predicts positive examples.
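With a positives-only test set, the quantity that can be measured is the fraction of true enhancers the model recovers (recall, or sensitivity). The sketch below illustrates the arithmetic on made-up predictions; the `predicted` vector and the `non-enhancer` label are hypothetical.

```r
# Hypothetical predictions for six test intervals that are all true enhancers.
predicted <- c("enhancer", "enhancer", "non-enhancer",
               "enhancer", "enhancer", "non-enhancer")

# Every test example is a true positive case, so each prediction is either
# a true positive (called "enhancer") or a false negative (called anything else).
true_positives  <- sum(predicted == "enhancer")
false_negatives <- sum(predicted != "enhancer")

recall <- true_positives / (true_positives + false_negatives)
print(recall)  # 4 of the 6 enhancers were recovered
```

Measures that depend on negative examples, such as specificity or precision, cannot be estimated from this test set alone.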

The first step is to source the BAM files, index them, and finally transform them into BEDGRAPH files. The columns are then named appropriately.

```r
# Read each histone-mark file (tab-separated, no header row) and name its columns.
bedgraph_cols <- c("chrom", "start", "end", "peaks")
h3k27ac <- read.delim("./data/H1_Cell_Line/H3K27ac/ENCFF663SAM.bw", header = FALSE, col.names = bedgraph_cols)
h3k4me1 <- read.delim("./data/H1_Cell_Line/H3K4me1/ENCFF441KOL.bw", header = FALSE, col.names = bedgraph_cols)
h3k4me2 <- read.delim("./data/H1_Cell_Line/H3K4me2/ENCFF799BDH.bw", header = FALSE, col.names = bedgraph_cols)
h3k4me3 <- read.delim("./data/H1_Cell_Line/H3K4me3/ENCFF340UJK.bw", header = FALSE, col.names = bedgraph_cols)
```

These files (with the extension **.bw**) are then uploaded to the online Galaxy interface and processed with the *bedtools Merge* function. The four files, each representing a different histone mark for the enhancer data, are merged on the basis of overlapping intervals to give a matrix of peak counts that serve as features for the 'enhancer' class.

The resulting file is then imported back into R for further processing.

```r
# Import the merged intervals (tab-separated, read without a header row).
merged_bw <- read.delim("./data/H1_Cell_Line/bedtools_Merge.bw", header = FALSE)

# Drop the first row of the imported file, then name the columns.
merged_bw <- merged_bw[-1, ]
colnames(merged_bw) <- c("chrom", "start", "end", "peaks_h3k4me3", "peaks_h3k4me2", "peaks_h3k4me1", "peaks_h3k27ac")
head(merged_bw)

# Label every interval as a positive example.
merged_bw$class <- "enhancer"
head(merged_bw)

# Keep only the standard chromosomes (chr1 to chr22, chrX, chrY) and discard the rest.
chromosomes <- paste0("chr", c(1:22, "X", "Y"))
merged_bw <- as.data.frame(merged_bw[merged_bw$chrom %in% chromosomes, ])

# Derive the test data (four peak columns plus the class label) as input to the deep learning model.
test <- merged_bw[, 4:8]
```


Let's check out the dataset. 

```r
head(test)
plot(test)
```

The deep learning model places two basic requirements on the input data.
1. The data has to be *numeric* in type.
2. The values have to range from 0 to 1. If they don't already, a normalization procedure can bring them into that range; min-max normalization is the preferred one.
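Min-max normalization maps the smallest value of a column to 0 and the largest to 1 by computing `(x - min(x)) / (max(x) - min(x))`. A small worked example on a made-up vector (the values are purely illustrative):

```r
# Toy vector with minimum 2 and maximum 10.
x <- c(2, 4, 10)

# Subtract the minimum, then divide by the range (10 - 2 = 8).
x_scaled <- (x - min(x)) / (max(x) - min(x))

print(x_scaled)  # 0.00 0.25 1.00
```

Note that a column whose values are all identical has a range of zero, so this formula would return `NaN` for it.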

```r
# Convert the peak columns to numeric (double). Going through as.character()
# first ensures that the printed values, not internal factor codes, are converted.
test$peaks_h3k4me3 <- as.double(as.character(test$peaks_h3k4me3))
test$peaks_h3k4me2 <- as.double(as.character(test$peaks_h3k4me2))
test$peaks_h3k4me1 <- as.double(as.character(test$peaks_h3k4me1))
test$peaks_h3k27ac <- as.double(as.character(test$peaks_h3k27ac))
```
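The detour through `as.character()` matters whenever a column has been read in as a factor, which can happen when a column contains non-numeric entries. The sketch below shows the difference on a made-up factor; the values are illustrative only.

```r
# A factor whose labels look like numbers (made-up values).
f <- factor(c("10.5", "2", "10.5"))

# Converting the factor directly returns its internal level codes.
codes <- as.double(f)
print(codes)

# Converting via character returns the values the labels actually spell out.
values <- as.double(as.character(f))
print(values)
```

Here the direct conversion yields the level codes (1, 2, 1), whereas the two-step conversion recovers 10.5, 2, 10.5.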

```r
# Min-max normalization: rescale each peak column to the range [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
peak_cols <- c("peaks_h3k4me3", "peaks_h3k4me2", "peaks_h3k4me1", "peaks_h3k27ac")
test[peak_cols] <- lapply(test[peak_cols], min_max)
```


Now, let's check the data again.

```r
head(test)
plot(test)
```
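Besides inspecting the output by eye, the two input requirements can also be checked programmatically. The helper below is a hypothetical sketch (the name `check_model_input` is not part of any package), demonstrated on a made-up data frame; it assumes the columns contain no missing values.

```r
# Hypothetical helper: TRUE when every listed column is numeric and lies in [0, 1].
# Assumes the columns contain no missing values.
check_model_input <- function(df, cols) {
  all(vapply(df[cols],
             function(x) is.numeric(x) && all(x >= 0 & x <= 1),
             logical(1)))
}

# Toy data frame that satisfies both requirements.
toy <- data.frame(a = c(0, 0.5, 1), b = c(0.2, 0.4, 0.9), class = "enhancer")
ok_before <- check_model_input(toy, c("a", "b"))
print(ok_before)

# Push column b outside [0, 1]; the check now fails.
toy$b <- toy$b * 10
ok_after <- check_model_input(toy, c("a", "b"))
print(ok_after)
```

On the data prepared above, the equivalent call would be `check_model_input(test, setdiff(names(test), "class"))`.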

## References

> Kim, S. G., Harwani, M., Grama, A., & Chaterji, S. (2016). EP-DNN: A Deep Neural Network-Based Global Enhancer Prediction Algorithm. Scientific Reports, 6, 38433. https://doi.org/10.1038/srep38433