#  Decision Trees and Random Forests

Shaurya Jauhari (Email: shauryajauhari@gzhmu.edu.cn)

To appreciate the concept of random forests, we first need to understand decision trees. A decision tree is the basic unit of a random forest.

Decision trees (and random forests as well) also carry a notion of dimensionality reduction. Recall the same concept in logistic regression: the idea was to penalize "non-performing" features by shrinking their coefficients close to zero (alpha = 0: Ridge) or exactly to zero (alpha = 1: Lasso). (P.S. Cross-validation is also available there, serving the same objective as train-test data partitioning.) In decision trees, entropy and information gain are calculated to determine the best attribute on which to split the tree. It is crucial to give the tree a definite structure; otherwise, the bias of decision trees with continuous and nominal data comes into play.
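
To make the entropy and information gain calculation concrete, here is a minimal sketch in base R. The helper names `entropy()` and `information_gain()` and the toy vectors are assumptions introduced for illustration; they are not part of **party** or of the dataset used later.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain obtained by splitting `labels` on a categorical `attribute`
information_gain <- function(labels, attribute) {
  groups <- split(labels, attribute)
  weights <- sapply(groups, length) / length(labels)
  child_entropy <- sapply(groups, entropy)
  entropy(labels) - sum(weights * child_entropy)
}

# Toy data (hypothetical): one attribute separates the classes, one does not
labels      <- c("yes", "yes", "yes", "no", "no", "no")
informative <- c("a", "a", "a", "b", "b", "b")
noisy       <- c("a", "b", "a", "b", "a", "b")

gain_informative <- information_gain(labels, informative)
gain_noisy       <- information_gain(labels, noisy)
```

The attribute with the larger gain would be preferred as the split; here `informative` removes all uncertainty about the class, while `noisy` removes very little.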

Random forests embody the democratic principle that the *majority wins*. Decision trees are rudimentary classification algorithms that, at a low level, are analogous to *if-then* conditional statements in programming languages. They follow a strategy of recursive partitioning, and intuitively the leaf nodes hold the final verdict. The class with the highest count among the leaf nodes (terminals) reached is taken as the output of that decision tree.
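
The two ideas above, a tree as nested *if-then* rules and a forest as a majority vote, can be sketched in a few lines of base R. The function names, feature names, and thresholds below are hypothetical and chosen purely for illustration.

```r
# A hand-written "tree": nested if-then rules on two hypothetical features
toy_tree <- function(peak_width, signal) {
  if (peak_width > 150) {
    if (signal > 0.5) "Enhancer" else "Non-Enhancer"
  } else {
    "Non-Enhancer"
  }
}

# Majority vote over the predictions of several trees
# (ties go to the label that sorts first alphabetically)
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Pretend three trees voted on the same observation
votes  <- c("Enhancer", "Non-Enhancer", "Enhancer")
winner <- majority_vote(votes)
```

A real random forest grows each tree on a bootstrap sample with a random subset of features; only the voting step is shown here.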

In this module, we shall build basic decision trees to understand how they work. For this purpose, we shall load the package **party** and use the function **ctree()** to fit and analyze decision trees.

In [1]:
install.packages("party", dependencies = TRUE, repos = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/")
library(party)


Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
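
Before turning to the enhancer data, a minimal sketch of **ctree()** on R's built-in `iris` dataset shows the basic workflow. The choice of `iris` is an assumption made here for illustration only; it is not part of the original analysis.

```r
library(party)

# Fit a conditional inference tree predicting Species from all other columns
fit <- ctree(Species ~ ., data = iris)

# Inspect the fitted splits
print(fit)

# Predict on the training data and tabulate predictions against the truth
pred <- predict(fit, newdata = iris)
confusion <- table(Predicted = pred, Actual = iris$Species)
print(confusion)
```

Calling `plot(fit)` afterwards draws the tree, which is often the quickest way to read off the splits.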


Let us now pick a dataset. The dataset pertains to the computational **enhancer prediction** problem in bioinformatics. We first take note of the biology of enhancer regions (see Figure 2 in the handout).
Certain known classes of proteins called Transcription Factors (TFs) and Transcription Co-Activators (TCoAs) bind to cis-regulatory regions in the genome called *Enhancers*, which remotely orchestrate gene regulation. Enhancers lie distal to the *Promoters*, the regions associated with genes and their respective Transcription Start Sites (TSS). On stimulus from TFs, the enhancer and promoter sequences interact and activate the transcription machinery.

In [1]:
mydata_enhancers <- read.csv("./GSM2445787.bed", sep = '\t', header = FALSE)

## We keep only those peaks with an occurrence (column V5) of 1 or more, and drop the columns for peak IDs and frequencies. A "Class" column is then added that uniformly holds the value "Enhancer", since all the peaks correspond to p300 binding sites in the genome.

mydata_enhancers <- mydata_enhancers[mydata_enhancers$V5 > 0,]
mydata_enhancers <- mydata_enhancers[,c(1,2,3)]
mydata_enhancers$Class <- "Enhancer"
colnames(mydata_enhancers) <- c("Chrom", "Start", "End", "Class")

## We have hit a roadblock here! To build a decision tree on this dataset, every node would need at least two attributes. The fields "Chrom", "Start", and "End" are intimately related: "Chrom" is always needed, along with "Start", "End", or both. There are two ways to solve this problem. The first is to build a tree with two attributes at each node. The second is to (i) sort the file by chromosome name, and then (ii) add new columns holding the cumulative "Start" and "End" coordinates.

## Let's try the second approach.
## First, make sure the data are sorted by chromosome.

mydata_enhancers_sorted_chrom_names <- mydata_enhancers[with(mydata_enhancers, order(Chrom)), ]

## Cool!
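
The cumulative coordinates mentioned in the comments above could be built along the following lines. This is only a sketch under stated assumptions: the toy `peaks` data frame is hypothetical, and the largest `End` observed on each chromosome stands in for the true chromosome length, which the dataset itself does not provide.

```r
# Toy peaks (hypothetical values, not taken from GSM2445787.bed)
peaks <- data.frame(
  Chrom = c("chr2", "chr1", "chr1", "chr3", "chr2"),
  Start = c(400, 100, 300, 10, 50),
  End   = c(500, 200, 400, 110, 150),
  stringsAsFactors = FALSE
)

# Sort by chromosome name, then by start position within each chromosome
peaks <- peaks[order(peaks$Chrom, peaks$Start), ]

# Assumption: approximate each chromosome's length by its largest End value
chrom_len <- tapply(peaks$End, peaks$Chrom, max)

# Offset of a chromosome = summed length of all chromosomes sorted before it
offsets <- c(0, cumsum(as.numeric(chrom_len))[-length(chrom_len)])
names(offsets) <- names(chrom_len)

# Shift every peak by the offset of its chromosome
peaks$CumStart <- peaks$Start + unname(offsets[peaks$Chrom])
peaks$CumEnd   <- peaks$End   + unname(offsets[peaks$Chrom])
```

Note that `order()` sorts chromosome names as text, so `chr10` comes before `chr2`; the offsets stay internally consistent, but a real genome build's chromosome sizes and ordering would be the more faithful choice.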

In [3]:
mydata_enhancers_sorted_chrom_names

| | Chrom | Start | End | Class |
|---|---|---|---|---|
| | `<fct>` | `<int>` | `<int>` | `<chr>` |
| 101 | chr1 | 10000 | 10100 | Enhancer |
| 102 | chr1 | 10100 | 10200 | Enhancer |
| 103 | chr1 | 10200 | 10300 | Enhancer |
| 104 | chr1 | 10300 | 10400 | Enhancer |
| 105 | chr1 | 10400 | 10500 | Enhancer |
| 106 | chr1 | 10500 | 10600 | Enhancer |
| 107 | chr1 | 10600 | 10700 | Enhancer |
| 116 | chr1 | 11500 | 11600 | Enhancer |
| 117 | chr1 | 11600 | 11700 | Enhancer |
| 118 | chr1 | 11700 | 11800 | Enhancer |
