### Comparing tree-based models with regression & classification methods

In [70]:
options(warn=-1)
# Get and describe the dataset
library(MASS)
data(Boston)
head(Boston)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7


In [71]:
summary(Boston)

      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax 

I am interested in predicting whether a given suburb has a crime rate above or below the median.

In [72]:
# Create a dummy variable indicating whether the crime rate is above the median
# (crim is referenced through Boston$ because the data frame is not attached)
crime <- rep(0, nrow(Boston))
crime[Boston$crim > median(Boston$crim)] <- 1
Boston <- data.frame(Boston, crime)
summary(Boston)

      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax 
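
As a quick sanity check on the indicator, the sketch below rebuilds it on a local copy and counts the two classes. The name `boston_check` is hypothetical and only used here; a median split is expected to give two groups of roughly equal size.

```r
library(MASS)  # provides the Boston data frame

# Work on a local copy (hypothetical name) so the notebook's Boston is untouched
boston_check <- Boston
boston_check$crime <- as.integer(boston_check$crim > median(boston_check$crim))

# Number of suburbs at or below the median (0) and above it (1)
class_counts <- table(boston_check$crime)
print(class_counts)
```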

For the tree-based methods, I will apply bagging, random forest, and boosting methods. 

In [73]:
# Splitting data into training and test sets
set.seed(6)
train=sample(1:nrow(Boston),nrow(Boston)/2)
test=-train
training_data=Boston[train,]
testing_data=Boston[test,]
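
The split above relies on negative indexing: `test=-train` selects every row that is not in `train`. A minimal sketch, with hypothetical variable names, that checks the two halves are disjoint and together cover the whole data set:

```r
library(MASS)  # provides the Boston data frame

set.seed(6)
n <- nrow(Boston)
train_idx <- sample(1:n, n / 2)       # row numbers drawn without replacement
test_idx  <- setdiff(1:n, train_idx)  # every row number not drawn for training

# Negative indexing, as used in the cell above, selects the same rows as test_idx
same_rows <- identical(rownames(Boston[-train_idx, ]), rownames(Boston[test_idx, ]))

overlap <- length(intersect(train_idx, test_idx))
cat("train:", length(train_idx), "test:", length(test_idx), "overlap:", overlap, "\n")
```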

In [74]:
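# Bagging
library("randomForest")
set.seed(4)
# crime is numeric (0/1), so randomForest fits a regression forest here
bag.data=randomForest(crime~., data=training_data, mtry=13, importance=TRUE)
bag_predict=predict(bag.data, testing_data, type="class")
# Test mean squared error of the numeric predictions
mean((bag_predict - testing_data$crime)^2)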

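Bagging was performed with 13 candidate predictors at each split (mtry=13) and the default number of trees (500). Because crime is stored as a numeric 0/1 variable, the forest is fit as a regression, so the test error computed above is a mean squared error rather than a misclassification rate. The test error for bagging was 0.1164%.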
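
A minimal sketch of how the squared-error quantity computed in these cells differs from a misclassification rate. The vectors are made up for illustration and are not output from this notebook, and the 0.5 cutoff is an assumption.

```r
# Hypothetical labels and numeric predictions, for illustration only
y_true <- c(0, 0, 1, 1, 1, 0)
y_hat  <- c(0.10, 0.40, 0.80, 0.45, 0.95, 0.20)

# Mean squared error: what mean((pred - y)^2) measures on numeric predictions
mse <- mean((y_hat - y_true)^2)

# Misclassification rate: threshold the predictions first (0.5 is an assumed cutoff)
y_class  <- as.integer(y_hat > 0.5)
misclass <- mean(y_class != y_true)

cat("MSE:", mse, " misclassification rate:", misclass, "\n")
```

The two numbers generally differ, so a squared error should not be read directly as the share of misclassified suburbs.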

In [75]:
# Random Forest
set.seed(15)
rf.data=randomForest(crime~., data=training_data, mtry=4, importance=TRUE, proximity=TRUE)
rf_predict=predict(rf.data, testing_data, type="class")
mean((rf_predict - testing_data$crime)^2)

Random forest was performed with 4 candidate predictors at each split (mtry=4) and the default number of trees (500). The test error for random forest was 0.48%.
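
For context on the choice of mtry, the sketch below counts the predictors that `crime ~ .` would use and computes the defaults that `randomForest` documents for regression (p/3) and classification (sqrt(p)). It assumes the data frame is built as in the cells above, so crim is still a column and is counted as a predictor; `boston_check` is a hypothetical name.

```r
library(MASS)  # provides the Boston data frame

# Rebuild the data frame as in the notebook, on a local copy (hypothetical name)
boston_check <- Boston
boston_check$crime <- as.integer(boston_check$crim > median(boston_check$crim))

# crime ~ . uses every column except the response, including crim itself
p <- ncol(boston_check) - 1

# Defaults documented for randomForest when mtry is not supplied
mtry_regression     <- max(floor(p / 3), 1)
mtry_classification <- floor(sqrt(p))

cat("predictors:", p,
    " regression default:", mtry_regression,
    " classification default:", mtry_classification, "\n")
```

Bagging in the strict sense sets mtry equal to p, so it is worth comparing this count with the value passed to `randomForest`.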

In [76]:
# Boosting
library("gbm")
set.seed(9)
boosting.model = gbm(crime~.,data=training_data,distribution="gaussian",
                      n.trees=500,shrinkage=0.1)
boosting_predict=predict(boosting.model, testing_data, type="response")
mean((boosting_predict - testing_data$crime)^2)

Using 500 trees...



Boosting was performed with 500 trees and a shrinkage of 0.1, which produced a test error of 0.34%.
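
Since the shrinkage value was fixed at 0.1, one way to see how sensitive the boosting error is to that choice is to refit over a small grid. This is a sketch only: the grid values and the names `boston_check`, `train_df`, `test_df`, and `shrink_grid` are assumptions, and the resulting numbers are not part of the analysis above.

```r
library(MASS)  # provides the Boston data frame
library(gbm)

# Rebuild the data and split as in the notebook, on local copies
boston_check <- Boston
boston_check$crime <- as.integer(boston_check$crim > median(boston_check$crim))
set.seed(6)
train_idx <- sample(1:nrow(boston_check), nrow(boston_check) / 2)
train_df  <- boston_check[train_idx, ]
test_df   <- boston_check[-train_idx, ]

# Assumed grid of shrinkage values to compare
shrink_grid <- c(0.001, 0.01, 0.1)

test_mse <- sapply(shrink_grid, function(s) {
  set.seed(9)
  fit <- gbm(crime ~ ., data = train_df, distribution = "gaussian",
             n.trees = 500, shrinkage = s)
  # n.trees is passed explicitly so predict does not have to guess it
  pred <- predict(fit, newdata = test_df, n.trees = 500)
  mean((pred - test_df$crime)^2)
})

print(data.frame(shrinkage = shrink_grid, test_mse = test_mse))
```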

I will use logistic regression and KNN for comparison.

In [77]:
# Logistic Regression
set.seed(8)
logit.fit <- glm(crime~., data = training_data, family=binomial)
logit.pred <-  predict(logit.fit, testing_data, type="response")
mean((logit.pred - testing_data$crime)^2)

In [78]:
# KNN
library("caret")
set.seed(44)
trControl <- trainControl(method = "repeatedcv",number = 10,repeats = 3)
fit.1 <- train(crime~.,data = training_data,method = 'knn',tuneLength = 20,
       trControl = trControl,preProc = c("center", "scale"))
pred.1 <- predict(fit.1, newdata = testing_data)
mean((pred.1 - testing_data$crime)^2)

Logistic regression and KNN yielded considerably higher test errors than the tree-based methods. The logistic regression model produced a test error of 4.29%, and KNN was the worst of all the models at 7.21%.
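
To read the logistic regression result as a classification outcome, the fitted probabilities can be thresholded and tabulated against the true labels. The sketch below assumes a 0.5 cutoff and uses hypothetical names (`boston_check`, `train_df`, `test_df`); it keeps the notebook's formula and suppresses warnings as the notebook does with `options(warn=-1)`.

```r
library(MASS)  # provides the Boston data frame

# Rebuild the data and split as in the notebook, on local copies
boston_check <- Boston
boston_check$crime <- as.integer(boston_check$crim > median(boston_check$crim))
set.seed(6)
train_idx <- sample(1:nrow(boston_check), nrow(boston_check) / 2)
train_df  <- boston_check[train_idx, ]
test_df   <- boston_check[-train_idx, ]

# Same formula as the notebook; glm may warn about fitted probabilities of 0 or 1
fit  <- suppressWarnings(glm(crime ~ ., data = train_df, family = binomial))
prob <- predict(fit, newdata = test_df, type = "response")

# Threshold the probabilities (0.5 is an assumed cutoff) and tabulate
pred_class <- as.integer(prob > 0.5)
conf_mat   <- table(predicted = pred_class, actual = test_df$crime)
print(conf_mat)

misclass_rate <- mean(pred_class != test_df$crime)
cat("misclassification rate:", misclass_rate, "\n")
```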

### Conclusions

Overall, bagging performed best with the lowest test error, followed by boosting and then random forest. The boosting error, however, depends on the shrinkage level and could change with a different value. The logistic regression model performed better than KNN.

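On this data set the tree-based models performed better than logistic regression and KNN. This is worth noting because a single decision tree generally lacks the predictive accuracy of other regression and classification methods and is preferred mainly for its ease of interpretability and visualization. Bagging, random forests, and boosting combine many trees, which improves accuracy but gives up much of that interpretability. The low test errors above therefore make these models useful here, although the result comes from a single train/test split and other classification methods may predict better on other data.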