# Tree-based Methods

The materials used in this tutorial are based on the applied exercises in the book "An Introduction to Statistical Learning with Applications in R" (ISLR). We demonstrate how to implement the following tree-based methods:

* Decision tree
* Random forest
* Bagging
* Boosting


First, install the required libraries:

```r
# Every package loaded below must be installed once
install.packages(c("tree", "randomForest", "glmnet", "gbm", "ISLR"))
```

In [None]:
# Load all the libraries
library(tree)
library(randomForest)
library(glmnet)
library(gbm)
library(ISLR)

## 1. Predict purchase in OJ dataset
In this question, we treat <font color="brown">Purchase</font> as the response in the <font color="brown">OJ</font> dataset. The dataset consists of 1070 observations on 18 variables, described in the help page below (the descriptions come from the <a href="https://cran.r-project.org/web/packages/ISLR/ISLR.pdf">ISLR package manual</a>).

In [None]:
?OJ

In [None]:
head(OJ)

In [None]:
table(OJ$Purchase)

### 1.1 Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

In [None]:
set.seed(1)

Generate the training and test samples.

Fit a classification tree.
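
The two steps above could be sketched as follows. This is a minimal sketch, not the only possible solution: the object names `train`, `OJ.train`, `OJ.test` and `oj.tree` are assumptions chosen for illustration, and the exact split depends on the R version's random number generator.

```r
library(ISLR)   # provides the OJ data
library(tree)

set.seed(1)
# Draw 800 row indices for training; the remaining 270 rows form the test set
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test  <- OJ[-train, ]

# Classification tree with Purchase as the response and all other columns as predictors
oj.tree <- tree(Purchase ~ ., data = OJ.train)

# Reports the variables used, the number of terminal nodes and the training error rate
summary(oj.tree)
```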

The tree uses only four variables:
* LoyalCH
* PriceDiff
* SpecialCH
* ListPriceDiff

These are the predictors that were actually used to construct the decision tree.

The training error rate of this classification tree is 0.165, and the tree has 8 terminal nodes.

### 1.2 Create a text output for the tree
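
Printing the fitted tree object produces the text output. The sketch below refits the tree so that it runs on its own; the object names are assumptions, and the node labels and values you see may differ from those discussed here depending on the random split.

```r
library(ISLR)
library(tree)

set.seed(1)
train <- sample(nrow(OJ), 800)
oj.tree <- tree(Purchase ~ ., data = OJ[train, ])

# Each line shows: node label, split, number of observations, deviance,
# predicted class and class probabilities; terminal nodes are marked with *
print(oj.tree)
```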

Let's pick the terminal node labeled “11)”. The splitting variable at this node is `PriceDiff`.

* The splitting value of this node is 0.195.
* There are 101 observations in this node.
* The deviance of the observations in this node is 139.20.
* The `*` at the end of the line denotes that this is a terminal node.
* The prediction at this node is `Purchase` = `CH`.
* About 54.4% of the observations in this node have `CH` as the value of `Purchase`, and about 45.5% have `MM`.

### 1.3 Create a plot of the tree, and interpret the results
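
A plot of the tree can be drawn with `plot()` and labeled with `text()`. The following is a self-contained sketch with assumed object names; `pretty = 0` asks `text()` to print factor level names in full.

```r
library(ISLR)
library(tree)

set.seed(1)
train <- sample(nrow(OJ), 800)
oj.tree <- tree(Purchase ~ ., data = OJ[train, ])

plot(oj.tree)               # draw the tree skeleton
text(oj.tree, pretty = 0)   # add split rules and predicted classes
```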

`LoyalCH` is the most important variable in the tree; in fact, the top three nodes all split on `LoyalCH`.

If `LoyalCH` < 0.264, the tree predicts `MM`.

If `LoyalCH` > 0.765, the tree predicts `CH`.

For intermediate values of `LoyalCH`, the decision also depends on the value of `PriceDiff`.



### 1.4 Predict the response on the test data, and produce the confusion matrix comparing the test labels to the predicted test labels. What is the error rate?
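
One way to compute the test predictions, the confusion matrix and the error rate is sketched below. The object names are assumptions, and the numbers obtained depend on the random split.

```r
library(ISLR)
library(tree)

set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test  <- OJ[-train, ]
oj.tree  <- tree(Purchase ~ ., data = OJ.train)

# type = "class" returns predicted labels instead of class probabilities
oj.pred <- predict(oj.tree, OJ.test, type = "class")

# Rows are predicted labels, columns are true labels
conf.mat <- table(Predicted = oj.pred, Actual = OJ.test$Purchase)
conf.mat

# Test error rate: share of test observations that are misclassified
test.error <- mean(oj.pred != OJ.test$Purchase)
test.error
```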

### 1.5 Apply the <font color="blue">cv.tree()</font> function to the training set in order to determine the optimal tree size. Produce plots with tree size and cross-validation classification error rate. Which tree size is chosen?
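
A possible sketch of the cross-validation step follows. Passing `FUN = prune.misclass` makes `cv.tree()` count misclassifications rather than deviance, so dividing by the training set size gives an error rate. Object names are assumptions.

```r
library(ISLR)
library(tree)

set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
oj.tree  <- tree(Purchase ~ ., data = OJ.train)

# 10-fold cross-validation (the default K) over the cost-complexity pruning sequence
cv.oj <- cv.tree(oj.tree, FUN = prune.misclass)

# cv.oj$dev holds the number of cross-validated misclassifications per tree size
plot(cv.oj$size, cv.oj$dev / nrow(OJ.train), type = "b",
     xlab = "Tree size", ylab = "CV classification error rate")
```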

The tree sizes with the lowest cross-validation error are 2, 5 and 8. Let's choose 5.

### 1.6 Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation.
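
Pruning to a given size can be sketched with `prune.misclass()`. The size `best = 5` is the value chosen in this tutorial; with a different random split another size may be preferable, and if no subtree of exactly that size exists the function returns the next largest one with a warning.

```r
library(ISLR)
library(tree)

set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
oj.tree  <- tree(Purchase ~ ., data = OJ.train)

# Prune the full tree back to (at most) 5 terminal nodes
oj.pruned <- prune.misclass(oj.tree, best = 5)

plot(oj.pruned)
text(oj.pruned, pretty = 0)
```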

### 1.7 Compare the training error rates between the pruned and unpruned trees. Which is higher?

You can use the summary() function.
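
As a sketch, the two summaries can be printed side by side, and the training error rates can also be computed directly with `misclass.tree()` from the `tree` package. Object names and the pruned size are assumptions carried over from the earlier steps.

```r
library(ISLR)
library(tree)

set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train  <- OJ[train, ]
oj.tree   <- tree(Purchase ~ ., data = OJ.train)
oj.pruned <- prune.misclass(oj.tree, best = 5)

summary(oj.tree)     # unpruned tree
summary(oj.pruned)   # pruned tree

# misclass.tree() returns the number of misclassified training observations
train.err.full   <- misclass.tree(oj.tree) / nrow(OJ.train)
train.err.pruned <- misclass.tree(oj.pruned) / nrow(OJ.train)
c(unpruned = train.err.full, pruned = train.err.pruned)
```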

The training error rate is the same for both classification trees (0.165), but the residual mean deviance of the pruned tree is higher.

### 1.8 Compare the testing error rates between the pruned and unpruned trees. Which is higher?
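
The test error rates of the two trees could be compared as in the sketch below. The helper `test.err()` is a hypothetical convenience function introduced here for illustration, not part of any package.

```r
library(ISLR)
library(tree)

set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train  <- OJ[train, ]
OJ.test   <- OJ[-train, ]
oj.tree   <- tree(Purchase ~ ., data = OJ.train)
oj.pruned <- prune.misclass(oj.tree, best = 5)

# Hypothetical helper: misclassification rate of a fitted tree on the test set
test.err <- function(fit) {
  mean(predict(fit, OJ.test, type = "class") != OJ.test$Purchase)
}

test.err.full   <- test.err(oj.tree)
test.err.pruned <- test.err(oj.pruned)
c(unpruned = test.err.full, pruned = test.err.pruned)
```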

## 2. Use boosting to predict Salary in the Hitters dataset 

In this task, we are going to study how to fit a boosted regression tree.

### 2.1 Prepare the training and the testing datasets.
The salary is unknown for some observations. We exclude those observations from the dataset and then log-transform the salaries.

In [None]:
sum(is.na(Hitters$Salary))

In [None]:
Hitters <- na.omit(Hitters)
Hitters$Salary <- log(Hitters$Salary)

As in Task 1 above, we split the data into training and test sets. Here the first 200 observations form the training set and the remaining observations form the test set.

In [None]:
train <- 1:200
Hitters.train <- Hitters[train, ]
Hitters.test <- Hitters[-train, ]

### 2.2 Fit a boosted tree 
Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter $\lambda$.
Produce two plots:
* one with the shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
* one with the shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

In [None]:
set.seed(1)

In [None]:
pows = seq(-10, -0.2, by = 0.1)
lambdas = 10^pows
length.lambdas = length(lambdas)
train.errors = rep(NA, length.lambdas)
test.errors = rep(NA, length.lambdas)

Now, perform the boosting with the setting above.

In [None]:
for (i in 1:length.lambdas) {
# Write your code here
    
    
}
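
One way to fill in the loop is sketched below, together with the two plots. It is written to run on its own, so it repeats the data preparation and the grid from above; `boost.hitters`, `pred.train` and `pred.test` are assumed names. The Gaussian loss is chosen because the response (log salary) is continuous.

```r
library(ISLR)
library(gbm)

# Same preparation as in 2.1, starting from the untouched ISLR copy of the data
Hitters <- na.omit(ISLR::Hitters)
Hitters$Salary <- log(Hitters$Salary)
Hitters.train <- Hitters[1:200, ]
Hitters.test  <- Hitters[-(1:200), ]

set.seed(1)
lambdas <- 10^seq(-10, -0.2, by = 0.1)
length.lambdas <- length(lambdas)
train.errors <- rep(NA, length.lambdas)
test.errors  <- rep(NA, length.lambdas)

for (i in 1:length.lambdas) {
  # Boosted regression trees with 1000 trees and shrinkage lambdas[i]
  boost.hitters <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian",
                       n.trees = 1000, shrinkage = lambdas[i])
  pred.train <- predict(boost.hitters, Hitters.train, n.trees = 1000)
  pred.test  <- predict(boost.hitters, Hitters.test, n.trees = 1000)
  train.errors[i] <- mean((pred.train - Hitters.train$Salary)^2)
  test.errors[i]  <- mean((pred.test - Hitters.test$Salary)^2)
}

# Training and test MSE against the shrinkage value, side by side
par(mfcol = c(1, 2), pty = "s")
plot(lambdas, train.errors, type = "b", xlab = "Shrinkage", ylab = "Training MSE")
plot(lambdas, test.errors, type = "b", xlab = "Shrinkage", ylab = "Test MSE")
```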

Plot the training and test MSE as functions of $\lambda$ below.

In [None]:
par(mfcol=c(1,2),  pty = "s")

What is the minimum test MSE, and which value of $\lambda$ achieves it?

In [None]:
min(test.errors)

In [None]:
lambdas[which.min(test.errors)]

### 2.3 Comparison

Compare the test MSE of boosting to the test MSE obtained with the following regression approaches:
* linear regression
* lasso regression
* ridge regression

In [None]:
x <- model.matrix(Salary ~ ., data = Hitters.train)
x.test <- model.matrix(Salary ~ ., data = Hitters.test)
y <- Hitters.train$Salary
#Write your ridge regression code here


In [None]:
#Write your lasso regression code
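
The three regression fits could be sketched as follows. Choosing the penalty by cross-validation with `cv.glmnet()` and `s = "lambda.min"` is an assumption of this sketch; a fixed penalty value would also work. In `glmnet`, `alpha = 0` gives ridge regression and `alpha = 1` gives the lasso.

```r
library(ISLR)
library(glmnet)

# Same preparation as in 2.1, starting from the untouched ISLR copy of the data
Hitters <- na.omit(ISLR::Hitters)
Hitters$Salary <- log(Hitters$Salary)
Hitters.train <- Hitters[1:200, ]
Hitters.test  <- Hitters[-(1:200), ]

# Ordinary least squares
lm.fit <- lm(Salary ~ ., data = Hitters.train)
lm.mse <- mean((predict(lm.fit, Hitters.test) - Hitters.test$Salary)^2)

# glmnet needs numeric design matrices rather than data frames
x      <- model.matrix(Salary ~ ., data = Hitters.train)
x.test <- model.matrix(Salary ~ ., data = Hitters.test)
y      <- Hitters.train$Salary

set.seed(1)
# Ridge regression, penalty chosen by cross-validation
ridge.cv  <- cv.glmnet(x, y, alpha = 0)
ridge.mse <- mean((predict(ridge.cv, newx = x.test, s = "lambda.min") - Hitters.test$Salary)^2)

# Lasso regression, penalty chosen by cross-validation
lasso.cv  <- cv.glmnet(x, y, alpha = 1)
lasso.mse <- mean((predict(lasso.cv, newx = x.test, s = "lambda.min") - Hitters.test$Salary)^2)

c(linear = lm.mse, ridge = ridge.mse, lasso = lasso.mse)
```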


Both the linear model and regularized regressions such as the lasso have a higher test MSE than boosting.

### 2.4 Which variables appear to be the most important predictors in the boosted model?

We can see that “CAtBat” is by far the most important variable.
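
Variable importance can be read from `summary()` applied to a fitted `gbm` object, which returns the relative influence of each predictor. In the sketch below the shrinkage value `0.01` is only an assumed placeholder; in practice you would substitute the best value found above, `lambdas[which.min(test.errors)]`.

```r
library(ISLR)
library(gbm)

# Same preparation as in 2.1, starting from the untouched ISLR copy of the data
Hitters <- na.omit(ISLR::Hitters)
Hitters$Salary <- log(Hitters$Salary)
Hitters.train <- Hitters[1:200, ]

set.seed(1)
# shrinkage = 0.01 is an assumed placeholder, not the tuned value
boost.best <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.01)

# Data frame of predictors sorted by relative influence (percentages summing to 100)
importance <- summary(boost.best, plotit = FALSE)
head(importance)
```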

### 2.5 Now apply bagging to the training set. What is the test set MSE for this approach?

In [None]:
set.seed(1)
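
Bagging is a random forest in which every predictor is considered at each split, so it can be sketched with `randomForest()` by setting `mtry` to the number of predictors. The choice of `ntree = 500` and the object names are assumptions, and the resulting MSE depends on the random seed and R version.

```r
library(ISLR)
library(randomForest)

# Same preparation as in 2.1, starting from the untouched ISLR copy of the data
Hitters <- na.omit(ISLR::Hitters)
Hitters$Salary <- log(Hitters$Salary)
Hitters.train <- Hitters[1:200, ]
Hitters.test  <- Hitters[-(1:200), ]

set.seed(1)
# mtry equal to the number of predictors (all columns except Salary) gives bagging
bag.hitters <- randomForest(Salary ~ ., data = Hitters.train,
                            ntree = 500, mtry = ncol(Hitters.train) - 1)

bag.pred <- predict(bag.hitters, Hitters.test)
bag.mse  <- mean((bag.pred - Hitters.test$Salary)^2)
bag.mse
```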

The test MSE for bagging is 0.23, which is slightly lower than the test MSE for boosting.