In [2]:
suppressMessages(library(ipred))
suppressMessages(library(caret))
suppressMessages(library(Metrics))
suppressMessages(library(plyr))
suppressMessages(library(e1071))

# Ignoring warning for better presentation
options(warn=-1)

In [3]:
#This notebook covers an R-based approach to bagged trees. We'll start by fitting a simple bagged tree model and evaluating it with a confusion matrix. Then we'll fit a cross-validated bagged model and compare its results with the first one. Compared with a single decision tree, a bagged tree model usually reduces variance, and often improves accuracy, by averaging the predictions of many trees, each trained on a bootstrap sample of the training data. However, bagged trees are harder to understand, interpret and visualize than a single decision tree.
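#To make the idea concrete, here is a minimal base-R sketch of bagging done by hand. It is not how ipred implements bagging(): the toy data, the decision-stump base learner and the names fit_stump and predict_stump are assumptions made purely for illustration. Each stump is fit on a bootstrap sample and the final class is a majority vote.

```r
# Hypothetical toy data: one numeric predictor and a noisy binary label
set.seed(42)
n <- 200
x <- runif(n)
y <- ifelse(x + rnorm(n, sd = 0.3) > 0.5, "yes", "no")

# Illustrative base learner: a decision stump that picks the split point
# with the lowest training misclassification, predicting "yes" above it
fit_stump <- function(x, y) {
  candidates <- sort(unique(x))
  errs <- vapply(candidates,
                 function(split_at) mean(ifelse(x > split_at, "yes", "no") != y),
                 numeric(1))
  candidates[which.min(errs)]
}
predict_stump <- function(split_at, x) ifelse(x > split_at, "yes", "no")

# Bagging: fit one stump per bootstrap sample of the rows
n_trees <- 25
splits <- replicate(n_trees, {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregate: each column holds one stump's votes; take the majority class
votes  <- sapply(splits, predict_stump, x = x)
bagged <- ifelse(rowMeans(votes == "yes") > 0.5, "yes", "no")

mean(bagged != y)   # training misclassification of the bagged ensemble
```

#An odd number of stumps is used so that the majority vote can never tie.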

In [5]:
# Reading the data file credit.csv,
# taken from https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

# credit.csv is composed of qualitative and quantitative variables.
# For this exercise we will focus on the following variables only (name: column number):
# months_loan_duration: 2
# percent_of_income: 8
# years_at_residence: 9
# age: 10
# default: 17

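# 'NULL' drops a column while reading; NA keeps it and lets read.csv infer its type
cols <- rep('NULL', 17)
cols[c(2, 8, 9, 10, 17)] <- NA

# stringsAsFactors = TRUE keeps `default` a factor on R >= 4.0, where the default changed to FALSE
creditsub <- read.csv(file = '/home/dell/R_programming/case studies/trees/machinelearning-R/credit.csv', 
                      colClasses = cols,
                      header = TRUE,
                      stringsAsFactors = TRUE)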

# Let's take a look at the dataframe
head(creditsub)

months_loan_duration,percent_of_income,years_at_residence,age,default
6,4,4,67,no
48,2,2,22,yes
12,2,3,49,no
42,2,4,45,no
24,3,4,53,yes
36,2,4,35,no
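
#The column selection used above can be tried on a small inline CSV. This is only a sketch: the three-column data and the names csv_text, keep and toy are hypothetical and stand in for credit.csv.

```r
# Hypothetical three-column CSV supplied inline instead of a file on disk
csv_text <- "id,amount,status\n1,250,ok\n2,90,late\n3,400,ok"

# The string "NULL" drops a column; NA lets read.csv infer the column type
keep <- rep("NULL", 3)
keep[c(2, 3)] <- NA

toy <- read.csv(text = csv_text, colClasses = keep, header = TRUE)
str(toy)   # only `amount` and `status` remain
```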


In [6]:
#If you have been following my other notebook, "Decision Trees in R", you'll notice that we are using the same dataset. That notebook used a single decision tree; this one uses bagged trees.

In [7]:
# Let's split the data into train and test

# Setting seed for reproducible train and test partitions
set.seed(123)

smp_size <- floor(0.75 * nrow(creditsub))

train_ind <- sample(seq_len(nrow(creditsub)), size = smp_size)

credit_train <- creditsub[train_ind, ]

credit_test <- creditsub[-train_ind, ]
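
# A quick sanity check of this kind of split, sketched on a hypothetical
# data frame (toy, with an id column) rather than on creditsub: the two
# partitions should not share rows and together should cover every row.

```r
# Hypothetical data frame standing in for creditsub
toy <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

set.seed(123)
n_train    <- floor(0.75 * nrow(toy))
train_rows <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]   # negative indexing keeps the remaining rows

# Number of rows present in both partitions (should be zero)
shared <- length(intersect(toy_train$id, toy_test$id))
shared
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```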

In [8]:
# Let's train our model.
# Training a bagged tree model; coob = TRUE requests the out-of-bag error estimate
credit_model <- bagging(formula = default ~ ., 
                        data = credit_train,
                        coob = TRUE)

# Let's print the model
print(credit_model)


Bagging classification trees with 25 bootstrap replications 

Call: bagging.data.frame(formula = default ~ ., data = credit_train, 
    coob = TRUE)

Out-of-bag estimate of misclassification error:  0.344 



In [9]:
#In the cell above, we passed coob = TRUE to bagging(). Setting it to TRUE lets us estimate the model's error from the "out-of-bag" (OOB) samples. For each tree, the OOB samples are the training observations that were not drawn into that tree's bootstrap sample. Since the tree never saw these observations during training, they can be used to evaluate it, which bagging() does automatically. Here the OOB estimate of the misclassification error is 0.344.
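
#How many observations end up out-of-bag? The simulation below is a small sketch with a hypothetical training-set size (n = 1000); it does not use the credit data or ipred. It counts the share of rows left out of each bootstrap sample, which should be close to exp(-1), or roughly 37%.

```r
set.seed(1)
n      <- 1000   # hypothetical training-set size
n_boot <- 25     # same number of replications that bagging() reports above

# For each bootstrap sample, compute the share of rows that were NOT drawn
oob_fraction <- replicate(n_boot, {
  in_bag <- sample(n, replace = TRUE)
  mean(!(seq_len(n) %in% in_bag))
})

mean(oob_fraction)   # empirical share of OOB rows per replication
(1 - 1 / n)^n        # theoretical share, which approaches exp(-1) as n grows
```

#So each tree has roughly a third of the training set available as held-out data at no extra cost.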

#Let's make predictions with the model we just created and evaluate its performance with a confusion matrix.

In [10]:

# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,    
                            newdata = credit_test,  
                            type = "class")

# Let's calculate the confusion matrix for the test set
confusionMatrix(data = class_prediction,       
                reference = credit_test$default)

Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  138  55
       yes  37  20
                                          
               Accuracy : 0.632           
                 95% CI : (0.5689, 0.6919)
    No Information Rate : 0.7             
    P-Value [Acc > NIR] : 0.99127         
                                          
                  Kappa : 0.0593          
 Mcnemar's Test P-Value : 0.07633         
                                          
            Sensitivity : 0.7886          
            Specificity : 0.2667          
         Pos Pred Value : 0.7150          
         Neg Pred Value : 0.3509          
             Prevalence : 0.7000          
         Detection Rate : 0.5520          
   Detection Prevalence : 0.7720          
      Balanced Accuracy : 0.5276          
                                          
       'Positive' Class : no              
                                          

In [11]:
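#The accuracy of the bagged model (0.632) is below the no-information rate of 0.7, so it does worse than always predicting "no". It is also worse than the decision tree model, which gave an accuracy of around 70%.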
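
#The headline statistics can be recomputed by hand from the four counts in the confusion matrix above. This is a sketch in base R, not caret's implementation; the variable names are my own, and the positive class is "no", as in the printed output.

```r
# Counts copied from the confusion matrix printed above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(138, 37, 55, 20), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

total       <- sum(cm)
accuracy    <- sum(diag(cm)) / total
sensitivity <- cm["no", "no"] / sum(cm[, "no"])      # true "no" predicted as "no"
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])   # true "yes" predicted as "yes"
nir         <- max(colSums(cm)) / total              # no-information rate

# Cohen's kappa: observed agreement corrected for chance agreement
expected    <- sum(rowSums(cm) * colSums(cm)) / total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir, kappa = cohen_kappa), 4)
```

#The low specificity shows where the model struggles: only 20 of the 75 actual defaulters in the test set are identified.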

#Let's see if we can improve on this. In the predict() call above we used type = "class", which returns a single predicted class for each observation. We could instead pass type = "prob", which returns the predicted probability of each class for every observation. Let's have a look at it.

In [12]:
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")
                
# Let's look at the pred format
head(pred)

no,yes
0.88,0.12
0.84,0.16
0.36,0.64
0.76,0.24
1.0,0.0
1.0,0.0


In [14]:
#Since we have the probabilities, we can choose the classification threshold ourselves instead of relying on the default. One way to choose it is to compare the results at several thresholds. The area under the ROC curve (AUC) summarizes performance across all thresholds in a single number. It is not an error rate: a value of 0.5 corresponds to random ranking and 1 to a perfect ranking. Let's compute it.
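
#Comparing thresholds could look like the sketch below. The probabilities and labels are hypothetical (they are not the credit_test predictions), and evaluate_threshold is an illustrative helper, not a function from caret or Metrics.

```r
# Hypothetical predicted probabilities of "yes" and the matching true labels
p_yes  <- c(0.12, 0.16, 0.64, 0.24, 0.05, 0.55, 0.80, 0.35)
actual <- c("no", "no", "yes", "no", "no", "no", "yes", "yes")

# Classify at a given cut-off, then report accuracy and recall for "yes"
evaluate_threshold <- function(threshold) {
  predicted <- ifelse(p_yes >= threshold, "yes", "no")
  c(threshold  = threshold,
    accuracy   = mean(predicted == actual),
    recall_yes = sum(predicted == "yes" & actual == "yes") / sum(actual == "yes"))
}

# One row per candidate threshold
threshold_table <- t(sapply(c(0.3, 0.5, 0.7), evaluate_threshold))
threshold_table
```

#Lowering the threshold catches more defaulters at the cost of more false alarms, so the right value depends on which mistake is more expensive.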

In [15]:
auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
    predicted = pred[,"yes"])
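
# For intuition, the AUC equals the probability that a randomly chosen
# defaulter receives a higher score than a randomly chosen non-defaulter.
# The sketch below computes it from ranks on hypothetical scores; rank_auc
# is an illustrative helper and is not claimed to be how Metrics::auc is written.

```r
# Hypothetical scores and 0/1 labels (1 = default)
score  <- c(0.12, 0.16, 0.64, 0.24, 0.05, 0.55, 0.80, 0.35)
actual <- c(0, 0, 1, 0, 0, 0, 1, 1)

# AUC as the Mann-Whitney statistic; average ranks handle tied scores
rank_auc <- function(actual, score) {
  r     <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

rank_auc(actual, score)
```

# Here 14 of the 15 (defaulter, non-defaulter) pairs are ranked correctly,
# so the value is 14/15.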

In [16]:
#To check that our result does not depend on one particular train/test split, we should use cross-validation. Let's cross-validate a bagged tree model using caret.
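
#Before handing the work to caret, here is a sketch of what k-fold cross-validation does. It is written in base R on hypothetical data, and it uses a logistic regression as a stand-in model rather than a bagged tree, so it illustrates the resampling scheme only.

```r
set.seed(7)
# Hypothetical data: one predictor and a binary outcome
n     <- 150
toy   <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x))

k <- 5
# Assign each row to one of k equal-sized folds at random
fold <- sample(rep(seq_len(k), length.out = n))

# Train on k - 1 folds, evaluate on the held-out fold, repeat for each fold
fold_accuracy <- vapply(seq_len(k), function(i) {
  fit  <- glm(y ~ x, data = toy[fold != i, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  mean((prob > 0.5) == (toy$y[fold == i] == 1))
}, numeric(1))

fold_accuracy         # one held-out estimate per fold
mean(fold_accuracy)   # cross-validated estimate
```

#Every observation is used for evaluation exactly once, which is why the averaged estimate is less sensitive to a single lucky or unlucky split.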

In [17]:
# Specify the training configuration
ctrl <- trainControl(method = "cv",                      # Cross-validation
                     number = 5,                         # 5 folds
                     classProbs = TRUE,                  # For AUC
                     summaryFunction = twoClassSummary)  # For AUC


credit_caret_model <- train(default ~ .,
                            data = credit_train, 
                            method = "treebag",
                            metric = "ROC",
                            trControl = ctrl)

print(credit_caret_model)

Bagged CART 

750 samples
  4 predictors
  2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 600, 600, 600, 600, 600 
Resampling results:

  ROC        Sens       Spec     
  0.5884868  0.8133333  0.3066667



In [18]:
# Inspect the contents of the model list 
names(credit_caret_model)

# Printing the CV AUC
credit_caret_model$results[,"ROC"]

In [19]:
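#So after cross-validation, the estimated AUC is about 0.588, measured on held-out folds of the training set. Note that cross-validation does not improve the model itself; it gives a more stable estimate of its performance. Let's check the performance of the cross-validated bagged tree model on the test dataset.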

In [20]:
pred <- predict(object = credit_caret_model, 
                newdata = credit_test,
                type = "prob")

# AUC on the test set
auc(actual = ifelse(credit_test$default == "yes", 1, 0), 
    predicted = pred[,"yes"])

In [21]:
#The test-set AUC of the cross-validated bagged tree model is 0.529, which can be taken as the expected AUC of the model on new data. This is barely better than the 0.5 we would get by flipping a coin to decide whether a person will default. We need to look at some other method now.

In [22]:
# Let's look at the classification error (CE) as well
pred <- predict(object = credit_caret_model, 
                newdata = credit_test,
                type = "raw")

# classification error (ce)
ce(actual = credit_test$default, 
   predicted = pred)

In [23]:
# As seen, the classification error is 0.376, which is worse than the 0.288 obtained with the decision tree.
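
# Classification error is simply the share of predictions that differ from
# the true label, i.e. one minus accuracy. A sketch with hypothetical labels
# (not the credit_test predictions):

```r
# Hypothetical true and predicted labels
actual    <- factor(c("no", "yes", "no", "no",  "yes", "no"), levels = c("no", "yes"))
predicted <- factor(c("no", "no",  "no", "yes", "yes", "no"), levels = c("no", "yes"))

class_error <- mean(actual != predicted)   # share of mismatched labels
class_error
1 - class_error                            # accuracy is the complement
```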