### This assignment is an exercise in working with imbalanced data

In [3]:
library(caret)
library(ranger)

Loading required package: lattice
Loading required package: ggplot2


Importing the data and fixing any variable types that were loaded incorrectly

In [1]:
hmeq = read.csv('hmeq.csv')
str(hmeq)

hmeq$BAD=as.factor(hmeq$BAD)

'data.frame':	5960 obs. of  13 variables:
 $ BAD    : int  1 1 1 1 0 1 1 1 1 1 ...
 $ LOAN   : int  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
 $ MORTDUE: num  25860 70053 13500 NA 97800 ...
 $ VALUE  : num  39025 68400 16700 NA 112000 ...
 $ REASON : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
 $ JOB    : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
 $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
 $ DEROG  : int  0 0 0 NA 0 0 3 0 0 0 ...
 $ DELINQ : int  0 2 0 NA 0 0 2 0 2 0 ...
 $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
 $ NINQ   : int  1 0 1 NA 0 1 1 0 1 0 ...
 $ CLNO   : int  9 14 10 NA 14 8 17 8 12 13 ...
 $ DEBTINC: num  NA NA NA NA NA ...


Cleaning the data: treating empty strings as missing, keeping only complete cases, and renaming the response variable BAD to target

In [2]:
sum(is.na(hmeq))
summary(hmeq)
hmeq[hmeq=='']=NA
hmeq_complete=hmeq[complete.cases(hmeq),]
nrow(hmeq_complete)

names(hmeq_complete)[1]='target'

 BAD           LOAN          MORTDUE           VALUE            REASON    
 0:4771   Min.   : 1100   Min.   :  2063   Min.   :  8000          : 252  
 1:1189   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076   DebtCon:3928  
          Median :16300   Median : 65019   Median : 89236   HomeImp:1780  
          Mean   :18608   Mean   : 73761   Mean   :101776                 
          3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824                 
          Max.   :89900   Max.   :399550   Max.   :855909                 
                          NA's   :518      NA's   :112                    
      JOB            YOJ             DEROG             DELINQ       
        : 279   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000  
 Mgr    : 767   1st Qu.: 3.000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Office : 948   Median : 7.000   Median : 0.0000   Median : 0.0000  
 Other  :2388   Mean   : 8.922   Mean   : 0.2546   Mean   : 0.4494  
 ProfExe:1276   3rd Qu.:13.000   3rd Qu.: 0.0000   3rd 

Split the data into 70% training, 30% testing

In [4]:
set.seed(2018)
splitIndex=createDataPartition(hmeq_complete$target, p=.70, list=FALSE, times=1)
train_hmeq=hmeq_complete[splitIndex,]
test=hmeq_complete[-splitIndex,]

Training and testing a random forest (ranger), then reporting the accuracy (one minus the misclassification rate) and the balanced accuracy

In [5]:
forest1=ranger(target~.,data=train_hmeq)
pred=predict(forest1,data=test)$predictions
cm=confusionMatrix(pred,test$target,positive="1")
cm$overall['Accuracy']
cm$byClass['Balanced Accuracy']
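
Balanced accuracy is the mean of sensitivity and specificity, which is why it is reported next to plain accuracy here. The following is a minimal base-R sketch on made-up labels (the `actual` and `predicted` vectors are hypothetical, not taken from hmeq) showing how a classifier that always predicts the majority class can look good on accuracy while its balanced accuracy stays at chance level.

```r
# Hypothetical toy labels: 91 non-defaults ("0") and 9 defaults ("1").
actual <- factor(c(rep("0", 91), rep("1", 9)), levels = c("0", "1"))

# A degenerate classifier that always predicts the majority class.
predicted <- factor(rep("0", 100), levels = c("0", "1"))

# Confusion table: rows are predictions, columns are true labels.
tab <- table(predicted = predicted, actual = actual)

accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["1", "1"] / sum(tab[, "1"])  # true positive rate for class 1
specificity <- tab["0", "0"] / sum(tab[, "0"])  # true negative rate for class 0
balanced_accuracy <- (sensitivity + specificity) / 2

print(c(accuracy = accuracy, balanced_accuracy = balanced_accuracy))
```

On these toy labels the accuracy is 0.91 while the balanced accuracy is 0.5, because no default is ever detected.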

The proportions of non-default (0) and default (1) clients in the training data

In [6]:
prop.table(table(train_hmeq$target))


         0          1 
0.91082803 0.08917197 

Balancing the data using undersampling and rerunning the forest

In [7]:
train1=train_hmeq[train_hmeq$target=="1",]
n1=nrow(train1)
table(train1$target)

train0=train_hmeq[train_hmeq$target=="0",]
n0=nrow(train0)
table(train0$target)

train00=train0[sample(1:n0,n1),]

train_under=rbind(train00,train1)

model_under=ranger(target~.,data=train_under)
pred_under=predict(model_under,data=test)$predictions
cm_under=confusionMatrix(pred_under,test$target,positive="1")
cm_under$byClass['Balanced Accuracy']


  0   1 
  0 210 


   0    1 
2145    0 
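
The undersampling draw depends on the state of the random number generator, so re-running the cell on its own will generally pick different majority rows. One possible way to make the draw repeatable is to set a seed immediately before sampling. The sketch below uses a hypothetical toy data frame `toy` and a hypothetical helper `undersample()`; neither is part of the notebook.

```r
# Hypothetical toy data: 20 rows of class "0" and 5 rows of class "1".
toy <- data.frame(x = 1:25,
                  target = factor(c(rep("0", 20), rep("1", 5))))

# Illustrative helper (an assumption, not the notebook's code):
# undersample the majority class after fixing the seed.
undersample <- function(df, seed) {
  set.seed(seed)                                   # make the draw repeatable
  minority <- df[df$target == "1", ]
  majority <- df[df$target == "0", ]
  keep <- sample(nrow(majority), nrow(minority))   # without replacement
  rbind(majority[keep, ], minority)
}

a <- undersample(toy, seed = 2018)
b <- undersample(toy, seed = 2018)

print(table(a$target))     # both classes now have 5 rows
print(identical(a, b))     # same seed, same rows
```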

Balancing the data using oversampling and rerunning the forest

In [8]:
train11=train1[sample(1:n1,n0,replace = TRUE),]

train_over=rbind(train11,train0)

model_over=ranger(target~.,data=train_over)
pred_over=predict(model_over,data=test)$predictions
cm_over=confusionMatrix(pred_over,test$target,positive="1")
cm_over$byClass['Balanced Accuracy']
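
Oversampling with `replace = TRUE` balances the class counts by repeating minority rows; it does not add new observations. A small base-R sketch on a hypothetical toy data frame (`toy` is an assumption made for illustration) makes this visible by counting the distinct minority rows after resampling.

```r
# Hypothetical toy data: 20 rows of class "0" and 5 rows of class "1".
toy <- data.frame(x = 1:25,
                  target = factor(c(rep("0", 20), rep("1", 5))))

set.seed(2018)
minority <- toy[toy$target == "1", ]
majority <- toy[toy$target == "0", ]

# Draw minority rows with replacement until they match the majority count.
draw <- sample(nrow(minority), nrow(majority), replace = TRUE)
over <- rbind(minority[draw, ], majority)

print(table(over$target))                        # 20 rows in each class
n_distinct_minority <- length(unique(over$x[over$target == "1"]))
print(n_distinct_minority)                       # at most 5 distinct rows
```

Because the extra minority rows are copies, the resampling should only ever be applied to the training data, as the cell above does, so that the test set keeps its original class proportions.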

Writing a function that takes a dataset argument with a target variable named target and a method argument specifying undersampling or oversampling, then outputs a dataset with a balanced target

In [11]:
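quick_bal=function(x,method){
  # Split the data by class: "1" is the minority, "0" the majority
  train1=x[x[,'target']=="1",]
  n1=nrow(train1)
  train0=x[x[,'target']=="0",]
  n0=nrow(train0)
  
  if (method=="over"){
    # Resample the minority class with replacement up to the majority size
    train11=train1[sample(1:n1,n0,replace = TRUE),]
    return(rbind(train11,train0))
  }
  else if (method=="under"){
    # Sample the majority class without replacement down to the minority size
    train00=train0[sample(1:n0,n1),]
    return(rbind(train00,train1))
  }
  else{
    # Fail loudly instead of returning the printed message as the "dataset"
    stop("Please input either under or over for the method parameter")
  }
}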
train_bal=quick_bal(train_hmeq,"under")
table(train_bal$target)


  0   1 
210 210 

Writing a function that takes a dataset with a target variable named target and outputs the balanced accuracies of random forests with both undersampling and oversampling being applied on the training dataset. 

In [12]:
quick_model_bal=function(x){
  # Note: predictions are scored against the global `test` data frame
  # Split the training data by class: "1" is the minority, "0" the majority
  train1=x[x[,'target']=="1",]
  n1=nrow(train1)
  
  train0=x[x[,'target']=="0",]
  n0=nrow(train0)
  
  # Undersampling: shrink the majority class to the minority size
  print("Undersample:")
  train00=train0[sample(1:n0,n1),]
  train_under=rbind(train00,train1)
  model_under=ranger(target~.,data=train_under)
  pred_under=predict(model_under,data=test)$predictions
  cm_under=confusionMatrix(pred_under,test$target,positive="1")
  print(cm_under$byClass['Balanced Accuracy'])
  
  # Oversampling: resample the minority class up to the majority size
  print("Oversample:")
  train11=train1[sample(1:n1,n0,replace = TRUE),]
  train_over=rbind(train11,train0)
  model_over=ranger(target~.,data=train_over)
  pred_over=predict(model_over,data=test)$predictions
  cm_over=confusionMatrix(pred_over,test$target,positive="1")
  print(cm_over$byClass['Balanced Accuracy'])
  
}
train_model_bal=quick_model_bal(train_hmeq)
train_model_bal

[1] "Undersample:"
Balanced Accuracy 
        0.8301596 
[1] "Oversample:"
Balanced Accuracy 
        0.7544674 
