## Resampling methods

#### Four resampling techniques are explored and evaluated, all of which are provided by the **ROSE library**.<br>
The other libraries are needed for training the model and extracting evaluation metrics.<br>
Documentation for the **ROSE library** can be found here: __[Documentation](https://www.rdocumentation.org/packages/ROSE/versions/0.0-4/topics/ovun.sample)__ <br>
The linked blog was used as a reference for checking the correctness of the implementation: __[Blog](https://www.r-bloggers.com/2021/05/class-imbalance-handling-imbalanced-data-in-r/)__
***
### This notebook takes approx. 30 minutes to run from start to finish
***

### First, the dataset is loaded and split for training.

In [19]:
# loading required libraries
library(ROSE)
library(randomForest)
library(caret)
library(e1071)

In [20]:
# load the smaller dataset
newDataset <- read.csv(file = 'brfssCleanedSmall.csv')

# set the target variable as a factor
newDataset$ASTHMA3 <- as.factor(newDataset$ASTHMA3)

# split dataset into test and train data
set.seed(100)
ind <- sample(nrow(newDataset), 0.7*nrow(newDataset), replace = FALSE)
train <- newDataset[ind,]
test <- newDataset[-ind,]

In [21]:
# view number of instances of both classes on the train data
table(train$ASTHMA3)
# view number of instances of both classes on the test data
table(test$ASTHMA3)


    0     1 
24310  3826 


    0     1 
10449  1610 
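
To make the imbalance explicit, the class counts printed above can be turned into proportions. This is a small sketch that hard-codes the counts from the output rather than reading the dataset; the variable names are illustrative only. It shows that the positive class makes up roughly 13.6% of the training rows and about 13.4% of the test rows.

```r
# Class counts copied from the table() output above
train_counts <- c("0" = 24310, "1" = 3826)
test_counts  <- c("0" = 10449, "1" = 1610)

# Share of each class within each split
train_share <- round(prop.table(train_counts), 3)
test_share  <- round(prop.table(test_counts), 3)
print(train_share)
print(test_share)

# Majority-to-minority ratio in the training data
imbalance_ratio <- unname(train_counts["0"] / train_counts["1"])
print(round(imbalance_ratio, 2))
```

Because the split is a simple random sample, the two splits end up with similar but not identical class shares.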

***
### Then, the training data is resampled and saved as separate subsets while observing the change in dataset size.

In [22]:
# oversampling and viewing the train dataset size
oversample <- ovun.sample(ASTHMA3~., data = train, method = "over", p = 0.5)$data
table(oversample$ASTHMA3)


    0     1 
24310 24141 

In [23]:
# undersampling and viewing the train dataset size
undersample <- ovun.sample(ASTHMA3~., data = train, method = "under", p = 0.5)$data
table(undersample$ASTHMA3)


   0    1 
3789 3826 
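
As an illustration of the idea behind random undersampling, here is a minimal base-R sketch on toy data. The helper `undersample_majority` is hypothetical and is not how `ovun.sample` is implemented: it forces an exact balance, whereas the counts in the output above are close to, but not exactly, equal.

```r
# Hypothetical helper (not part of ROSE): randomly drop rows from the
# larger classes until every class has as many rows as the smallest one.
undersample_majority <- function(df, target) {
  counts <- table(df[[target]])
  minority_n <- min(counts)
  keep <- unlist(lapply(names(counts), function(cls) {
    idx <- which(df[[target]] == cls)
    # only sample when the class is larger than the minority class
    if (length(idx) > minority_n) idx <- sample(idx, minority_n)
    idx
  }))
  df[sort(keep), , drop = FALSE]
}

# Toy data: 90 rows of class 0 and 10 rows of class 1
set.seed(1)
toy <- data.frame(x = rnorm(100),
                  y = factor(rep(c(0, 1), times = c(90, 10))))
balanced <- undersample_majority(toy, "y")
print(table(balanced$y))
```

All minority rows are kept, so the information lost comes entirely from the discarded majority rows.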

In [24]:
# combination of over and under sampling
bothsample <- ovun.sample(ASTHMA3~., data = train, method = "both", p = 0.5, seed = 222)$data
table(bothsample$ASTHMA3) # view dataset size


    0     1 
14162 13974 

In [25]:
# using the ROSE function to generate synthetic samples
rosesample <- ROSE(ASTHMA3~., data = train, p=0.5, seed=111)$data
table(rosesample$ASTHMA3) # view dataset size


    0     1 
14057 14079 
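
The dataset sizes printed above can be checked against a rough expectation. The sketch below assumes that, with `p = 0.5`, oversampling keeps the majority class and grows the minority class, undersampling keeps the minority class and shrinks the majority class, and the combined method and `ROSE` keep the original number of rows. These formulas are an approximation for sanity-checking, not the library's exact procedure, and the observed counts are copied from the outputs above.

```r
n_maj <- 24310  # class 0 rows in the training data
n_min <- 3826   # class 1 rows in the training data
p     <- 0.5    # requested share of the minority class

expected <- c(
  over  = n_maj / (1 - p),  # majority kept, minority grown
  under = n_min / p,        # minority kept, majority shrunk
  both  = n_maj + n_min,    # total size unchanged
  rose  = n_maj + n_min     # total size unchanged
)

# Totals taken from the table() outputs above
observed <- c(
  over  = 24310 + 24141,
  under = 3789 + 3826,
  both  = 14162 + 13974,
  rose  = 14057 + 14079
)

print(rbind(expected, observed))
```

The oversampled set is more than six times the size of the undersampled set, which is worth keeping in mind when comparing training times.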

***
### Random Forest models are now trained on the original training data and on each resampled subset, and metrics are printed for comparison

In [26]:
# setting the start time to log training duration
start_time <- Sys.time()

In [27]:
# training random forest on unsampled data
model_rf1 <- randomForest(ASTHMA3~., data = train, ntree = 500, mtry = 6, importance = TRUE)
# predicting on the test set and printing metrics of interest
cmrf1 <- confusionMatrix(predict(model_rf1, test), test$ASTHMA3, positive = '1')
cmrf1$overall['Accuracy']
cmrf1$byClass['Sensitivity']
cmrf1$byClass['Balanced Accuracy']
end_time1 <- Sys.time() #log time after prediction
print(end_time1 - start_time) # print time taken

Time difference of 4.851454 mins
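
Note that this first duration is measured from `start_time`, which is set in a separate cell, so any delay between running the two cells is included. One possible alternative is to time each step on its own. The helper `time_minutes` below is hypothetical and uses `Sys.sleep` as a stand-in for the training call.

```r
# Hypothetical helper: evaluate an expression and return the elapsed
# wall-clock time in minutes.
time_minutes <- function(expr) {
  elapsed_seconds <- system.time(expr)[["elapsed"]]
  elapsed_seconds / 60
}

# Stand-in workload; in the notebook this would be the training and
# prediction code for one model.
mins <- time_minutes(Sys.sleep(0.1))
print(mins)
```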


In [28]:
# training random forest on oversampled data
model_rf1over <- randomForest(ASTHMA3~., data = oversample, ntree = 500, mtry = 6, importance = TRUE)
# predicting on the test set and printing metrics of interest
cmrf1over <- confusionMatrix(predict(model_rf1over, test), test$ASTHMA3, positive = '1')
cmrf1over$overall['Accuracy']
cmrf1over$byClass['Sensitivity']
cmrf1over$byClass['Balanced Accuracy']
end_time2 <- Sys.time() #log time after prediction
print(end_time2 - end_time1) # print time taken

Time difference of 9.305904 mins


In [29]:
# training random forest on undersampled data
model_rf1under <- randomForest(ASTHMA3~., data = undersample, ntree = 500, mtry = 6, importance = TRUE)
# predicting on the test set and printing metrics of interest
cmrf1under <- confusionMatrix(predict(model_rf1under, test), test$ASTHMA3, positive = '1')
cmrf1under$overall['Accuracy']
cmrf1under$byClass['Sensitivity']
cmrf1under$byClass['Balanced Accuracy']
end_time3 <- Sys.time() #log time after prediction
print(end_time3 - end_time2) #print time taken

Time difference of 1.062139 mins


In [30]:
# training random forest on a combination of oversampled and undersampled data
model_rf1both <- randomForest(ASTHMA3~., data = bothsample, ntree = 500, mtry = 6, importance = TRUE)
# predicting on the test set and printing metrics of interest
cmrf1both <- confusionMatrix(predict(model_rf1both, test), test$ASTHMA3, positive = '1')
cmrf1both$overall['Accuracy']
cmrf1both$byClass['Sensitivity']
cmrf1both$byClass['Balanced Accuracy']
end_time4 <- Sys.time() #log time after prediction
print(end_time4 - end_time3) #print time taken

Time difference of 4.702458 mins


In [31]:
# training random forest on synthetically generated data using ROSE function
model_rf1rose <- randomForest(ASTHMA3~., data = rosesample, ntree = 500, mtry = 6, importance = TRUE)
# predicting on the test set and printing metrics of interest
cmrf1rose <- confusionMatrix(predict(model_rf1rose, test), test$ASTHMA3, positive = '1')
cmrf1rose$overall['Accuracy']
cmrf1rose$byClass['Sensitivity']
cmrf1rose$byClass['Balanced Accuracy']
end_time5 <- Sys.time() #log time after prediction
print(end_time5 - end_time4) #print time taken

Time difference of 5.058254 mins
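
For reference, the three metrics printed in each cell can be computed by hand from a confusion matrix. The sketch below uses made-up toy labels, not results from this notebook, with `1` as the positive class to match `positive = '1'` above.

```r
# Hypothetical toy labels; "1" is the positive class
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1), levels = c(0, 1))

# Rows are predictions, columns are true labels
cm <- table(Predicted = predicted, Actual = actual)
tp <- cm["1", "1"]  # true positives
tn <- cm["0", "0"]  # true negatives
fp <- cm["1", "0"]  # false positives
fn <- cm["0", "1"]  # false negatives

accuracy          <- (tp + tn) / sum(cm)
sensitivity       <- tp / (tp + fn)   # share of actual positives found
specificity       <- tn / (tn + fp)   # share of actual negatives found
balanced_accuracy <- (sensitivity + specificity) / 2

print(c(accuracy = accuracy, sensitivity = sensitivity,
        balanced_accuracy = balanced_accuracy))
```

On imbalanced data, accuracy can stay high even when few positives are detected, which is why sensitivity and balanced accuracy are reported alongside it.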


***
### This concludes the experiments for choosing the ideal sampling method for our imbalanced data
We choose undersampling: its accuracy was relatively similar to that of the other methods, while its sensitivity was significantly higher. It was also the cheapest to run, taking about 1 minute to train and evaluate compared with roughly 5 to 9 minutes for the other training sets.
***