# R Customer Churn Model
This notebook demonstrates the process of creating a customer churn model using R. We will use the XGBoost and logistic regression algorithms to create two separate models and compare their performance.

The dataset used in this notebook is located at `../datasets/bank_customer_churn.csv`.

## Prerequisites

#### Install R using Homebrew if you don't already have it:

```bash
brew install r
```

If you run into errors while running this notebook, you might also need the following system libraries:

```bash
brew install harfbuzz fribidi libtiff libomp
```

#### Install the following packages in an R console:

```R
install.packages("xgboost")
install.packages("dplyr")
install.packages("caret")
install.packages("pROC")
``` 
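
Because the notebook loads `xgboost`, `dplyr`, `caret`, and `pROC`, it can help to check which of them are missing before installing anything. The following is a minimal base-R sketch; the helper name `missing_packages` is hypothetical and not part of any package.

```R
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("xgboost", "dplyr", "caret", "pROC")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```

The install call is left commented out so that running the check has no side effects beyond loading the namespaces that are already present.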

## Load Libraries
First, we load the required libraries.

In [18]:
# load the required libraries
library(xgboost)
library(dplyr)
library(caret)
library(pROC)

## Import and Preprocess Dataset
Now, we import the dataset and preprocess it by removing the identifier columns and `CreditScore`, encoding `Gender` as 0 or 1, and one-hot encoding the remaining categorical column (`Geography`).

In [2]:
# import the dataset
df <- read.csv("../datasets/bank_customer_churn.csv", header = TRUE)

# remove irrelevant columns
df <- df %>% select(-c(RowNumber, CustomerId, Surname, CreditScore))

# Convert the 'Gender' column to 0 or 1 (assuming "Female" should be 0 and "Male" should be 1)
df$Gender <- ifelse(df$Gender == "Female", 0, 1)

# one-hot encode categorical columns with caret
df <- dummyVars(" ~ .", data = df) %>% predict(df)

# drop GeographySpain as the reference level: keeping all three Geography dummies makes them perfectly collinear
df <- subset(df, select = -GeographySpain)

summary(df)

 GeographyFrance  GeographyGermany     Gender            Age       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :18.00  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:32.00  
 Median :1.0000   Median :0.0000   Median :1.0000   Median :37.00  
 Mean   :0.5012   Mean   :0.2511   Mean   :0.5495   Mean   :38.95  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:44.00  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :92.00  
     Tenure          Balance       NumOfProducts     HasCrCard     
 Min.   : 0.000   Min.   :     0   Min.   :1.000   Min.   :0.0000  
 1st Qu.: 3.000   1st Qu.:     0   1st Qu.:1.000   1st Qu.:0.0000  
 Median : 5.000   Median : 97264   Median :1.000   Median :1.0000  
 Mean   : 5.034   Mean   : 76434   Mean   :1.532   Mean   :0.7026  
 3rd Qu.: 8.000   3rd Qu.:128045   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :10.000   Max.   :250898   Max.   :4.000   Max.   :1.0000  
 IsActiveMember   EstimatedSalary         Exited
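
To illustrate why `GeographySpain` is dropped above, the sketch below builds the same kind of indicator columns on a toy data frame with base R's `model.matrix`. This is only an illustration with made-up values, not the `caret` call used in the notebook.

```R
# Toy data; the values are made up for illustration.
toy <- data.frame(
  Geography = factor(c("France", "Spain", "Germany", "France", "Spain"))
)

# Full one-hot encoding: one indicator column per level.
full <- model.matrix(~ Geography - 1, data = toy)
colnames(full)

# Every row sums to 1, so the three columns together duplicate an intercept.
rowSums(full)

# Dropping one level (here Spain) removes that redundancy.
reduced <- full[, colnames(full) != "GeographySpain", drop = FALSE]
colnames(reduced)
```

With the dropped level as the reference, the remaining coefficients are read as differences relative to customers in Spain.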

## Split Data
Next, we split the data into training and testing sets using a 70/30 ratio.

In [3]:
# split data into training and testing sets
set.seed(123)
train_index <- sample(1:nrow(df), size = round(0.7*nrow(df)), replace = FALSE)
df_train <- df[train_index, ]
df_test <- df[-train_index, ]
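
The split above is a simple random sample. When the target is imbalanced, a stratified split keeps the churn rate close to identical in both sets. The following base-R sketch shows one way to do that on synthetic labels; the helper `stratified_split_index` is hypothetical, and `caret::createDataPartition` offers similar functionality.

```R
# Hypothetical helper: sample a fraction `p` of row indices within each class of `y`.
stratified_split_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
# Synthetic target with roughly 20% positives (made-up data).
y <- rbinom(1000, size = 1, prob = 0.2)

train_index <- stratified_split_index(y, p = 0.7)
y_train <- y[train_index]
y_test <- y[-train_index]

# The positive rate should be nearly identical in both parts.
c(overall = mean(y), train = mean(y_train), test = mean(y_test))
```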

## Save Train and Test Datasets
We save the train and test datasets as CSV files.

In [4]:
# save the train and test datasets as csv files
write.csv(df_train, file = "r_churn_train.csv", row.names = FALSE)
write.csv(df_test, file = "r_churn_test.csv", row.names = FALSE)

## Convert Data to DMatrix Format
We convert the data into DMatrix format, which is required by the XGBoost library.

In [5]:
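# convert data into DMatrix format (select feature columns by name rather than by position)
feature_cols <- colnames(df_train) != "Exited"
dtrain <- xgb.DMatrix(data = df_train[, feature_cols], label = df_train[, "Exited"])
dtest <- xgb.DMatrix(data = df_test[, feature_cols], label = df_test[, "Exited"])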

## Set Up XGBoost Parameters
We set up the XGBoost parameters to be used during the training process.

In [6]:
# set up XGBoost parameters
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  max_depth = 3,
  eta = 0.1,
  gamma = 0.5,
  subsample = 0.8,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  nthread = 4
)

## Train the XGBoost Model
We train the XGBoost model using the parameters and data prepared earlier.

In [7]:
# train the XGBoost model
model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10
)

[1]	train-auc:0.795105	test-auc:0.793822 
Multiple eval metrics are present. Will use test_auc for early stopping.
Will train until test_auc hasn't improved in 10 rounds.

[2]	train-auc:0.820697	test-auc:0.808123 
[3]	train-auc:0.823965	test-auc:0.811294 
[4]	train-auc:0.837212	test-auc:0.823692 
[5]	train-auc:0.839206	test-auc:0.827146 
[6]	train-auc:0.843781	test-auc:0.832219 
[7]	train-auc:0.853531	test-auc:0.836494 
[8]	train-auc:0.857080	test-auc:0.838679 
[9]	train-auc:0.857191	test-auc:0.839732 
[10]	train-auc:0.856166	test-auc:0.840575 
[11]	train-auc:0.857386	test-auc:0.841168 
[12]	train-auc:0.857084	test-auc:0.841385 
[13]	train-auc:0.856794	test-auc:0.842336 
[14]	train-auc:0.857827	test-auc:0.841208 
[15]	train-auc:0.858503	test-auc:0.842312 
[16]	train-auc:0.860074	test-auc:0.843007 
[17]	train-auc:0.858916	test-auc:0.843317 
[18]	train-auc:0.858676	test-auc:0.843113 
[19]	train-auc:0.858821	test-auc:0.843309 
[20]	train-auc:0.859816	test-auc:0.845118 
[21]	train-auc:0.86
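
The log above reports that training stops once `test_auc` has not improved for 10 rounds. The following base-R sketch shows the idea behind such a patience rule on a made-up score vector. The helper `early_stop_round` is hypothetical and is not XGBoost's internal implementation.

```R
# Hypothetical helper: apply a patience rule to a vector of per-round scores
# where higher is better (such as AUC). Returns the best round and the round
# at which training would stop, or NA if the patience never runs out.
early_stop_round <- function(scores, patience = 10) {
  best_score <- -Inf
  best_round <- NA_integer_
  for (i in seq_along(scores)) {
    if (scores[i] > best_score) {
      best_score <- scores[i]
      best_round <- i
    } else if (i - best_round >= patience) {
      return(list(best_round = best_round, stopped_at = i))
    }
  }
  list(best_round = best_round, stopped_at = NA_integer_)
}

# Made-up scores: improvement until round 4, then a plateau.
scores <- c(0.79, 0.81, 0.82, 0.84, 0.83, 0.84, 0.83, 0.82)
early_stop_round(scores, patience = 3)
```

Note that the notebook monitors the test set for early stopping, so the test metrics reported later may be slightly optimistic. A separate validation set would avoid that.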

## Model Summary
We display a summary of the trained XGBoost model.

In [8]:
summary(model)

                Length Class              Mode       
handle              1  xgb.Booster.handle externalptr
raw             81624  -none-             raw        
best_iteration      1  -none-             numeric    
best_ntreelimit     1  -none-             numeric    
best_score          1  -none-             numeric    
best_msg            1  -none-             character  
niter               1  -none-             numeric    
evaluation_log      3  data.table         list       
call                6  -none-             call       
params             10  -none-             list       
callbacks           3  -none-             list       
feature_names      10  -none-             character  
nfeatures           1  -none-             numeric    

## Predict on Test Data
We make predictions on the test data, convert the predicted probabilities to class labels using a 0.5 threshold, and calculate the accuracy of the model.

In [9]:
# predict on test data
test_preds <- predict(model, dtest)

# Convert predicted probabilities to binary predictions
test_preds_binary <- ifelse(test_preds > 0.5, 1, 0)

# Calculate accuracy on test set
accuracy <- sum(test_preds_binary == df_test[,"Exited"])/nrow(df_test)
accuracy
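
Accuracy on an imbalanced target is easier to interpret next to the majority-class baseline, that is, the accuracy of always predicting the most frequent class. The following sketch uses made-up labels and predictions rather than the notebook's data.

```R
# Made-up labels and predictions for illustration (1 = churned).
actual    <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
predicted <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0)

# Accuracy: share of predictions that match the labels.
accuracy <- mean(predicted == actual)

# Baseline: accuracy of always predicting the most frequent class.
baseline <- max(table(actual)) / length(actual)

cat("Accuracy:", accuracy, "\n")
cat("Majority-class baseline:", baseline, "\n")
```

In this toy example both values are 0.8, which shows how a model can reach a seemingly high accuracy without beating the trivial baseline.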

## Model Evaluation Metrics
We calculate the confusion matrix, precision, recall, F1 score, and ROC AUC for the model.

In [10]:
# Calculate the confusion matrix
cm <- confusionMatrix(as.factor(test_preds_binary), as.factor(df_test[,"Exited"]))

# Calculate precision, recall, and F1 score
precision <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[2, 1])
recall <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[1, 2])
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1 Score:", f1_score, "\n")

# Calculate ROC AUC
roc_obj <- roc(df_test[,"Exited"], test_preds)
roc_auc <- auc(roc_obj)
cat("ROC AUC:", roc_auc, "\n")

Precision: 0.7687075 
Recall: 0.4584178 
F1 Score: 0.5743329

Setting levels: control = 0, case = 1
Setting direction: controls < cases

ROC AUC: 0.86181
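
The precision and recall formulas above index the confusion table by position. As a cross-check, the same quantities can be computed from a base-R `table` indexed by class label. The sketch below assumes labels coded as 0 and 1; the helper `binary_metrics` is hypothetical, and the data is made up.

```R
# Hypothetical helper: precision, recall and F1 for the positive class (1).
binary_metrics <- function(predicted, actual) {
  # Fixing the levels keeps the table 2x2 even if one class is never predicted.
  pred <- factor(predicted, levels = c(0, 1))
  ref  <- factor(actual, levels = c(0, 1))
  tab  <- table(Prediction = pred, Reference = ref)

  tp <- tab["1", "1"]  # predicted 1, actually 1
  fp <- tab["1", "0"]  # predicted 1, actually 0
  fn <- tab["0", "1"]  # predicted 0, actually 1

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Made-up labels and predictions for illustration.
actual    <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
predicted <- c(0, 0, 1, 0, 1, 0, 1, 1, 0, 0)
binary_metrics(predicted, actual)
```

Indexing by label rather than by position makes it harder to mix up false positives and false negatives when the level order changes.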


## Save the Model
We save the trained XGBoost model as a JSON file.

In [11]:
# save the model (note the .json extension; it could also be saved as .bin)
# the JSON format ensures compatibility with the ValidMind SDK
xgb.save(model, "r_xgb_churn_model.json")

## Train a Simple Logistic Regression Model
As a comparison, we train a simple logistic regression model using the training data.

In [12]:
# now let's train a simple logistic regression model
lg_reg_model <- glm(Exited ~ ., data = as.data.frame(df_train), family = "binomial")

## Model Summary
We display a summary of the trained logistic regression model.

In [13]:
summary(lg_reg_model)


Call:
glm(formula = Exited ~ ., family = "binomial", data = as.data.frame(df_train))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3284  -0.6470  -0.4563  -0.2781   2.8954  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.732e+00  2.305e-01 -16.187  < 2e-16 ***
GeographyFrance  -1.113e-01  9.474e-02  -1.174 0.240252    
GeographyGermany  7.394e-01  1.054e-01   7.018 2.25e-12 ***
Gender           -4.974e-01  7.319e-02  -6.796 1.07e-11 ***
Age               7.142e-02  3.433e-03  20.803  < 2e-16 ***
Tenure           -1.170e-02  1.263e-02  -0.926 0.354301    
Balance           2.525e-06  6.995e-07   3.610 0.000306 ***
NumOfProducts    -1.300e-01  6.475e-02  -2.008 0.044643 *  
HasCrCard        -1.468e-02  8.025e-02  -0.183 0.854836    
IsActiveMember   -9.979e-01  7.682e-02 -12.989  < 2e-16 ***
EstimatedSalary   2.248e-07  6.337e-07   0.355 0.722854    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In [14]:
coef(lg_reg_model)
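
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below uses a few estimates copied from the summary output above, so it runs without the fitted model.

```R
# Estimates copied from the summary output above (log-odds scale).
log_odds <- c(
  GeographyGermany = 7.394e-01,
  Gender           = -4.974e-01,
  Age              = 7.142e-02,
  IsActiveMember   = -9.979e-01
)

# exp() turns a log-odds coefficient into an odds ratio:
# values above 1 raise the odds of churn, values below 1 lower them.
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)
```

With the fitted model in memory, the same quantity for all features is `exp(coef(lg_reg_model))`. For example, each additional year of `Age` multiplies the odds of churn by about 1.07, holding the other features fixed.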

## Predict on Test Data
We make predictions on the test data, convert the predicted probabilities to class labels using a 0.5 threshold, and calculate the accuracy of the logistic regression model.

In [15]:
# Make predictions on test set
test_preds <- predict(lg_reg_model, newdata = as.data.frame(df_test), type = "response")

# Convert predicted probabilities to binary predictions
test_preds_binary <- ifelse(test_preds > 0.5, 1, 0)

# Calculate accuracy on test set
accuracy <- sum(test_preds_binary == df_test[,"Exited"])/nrow(df_test)
accuracy

## Model Evaluation Metrics
We calculate the confusion matrix, precision, recall, F1 score, and ROC AUC for the logistic regression model.

In [16]:
# Calculate the confusion matrix
cm <- confusionMatrix(as.factor(test_preds_binary), as.factor(df_test[,"Exited"]))

# Calculate precision, recall, and F1 score
precision <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[2, 1])
recall <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[1, 2])
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1 Score:", f1_score, "\n")

# Calculate ROC AUC
roc_obj <- roc(df_test[,"Exited"], test_preds)
roc_auc <- auc(roc_obj)
cat("ROC AUC:", roc_auc, "\n")

Precision: 0.5890411 
Recall: 0.1744422 
F1 Score: 0.2691706

Setting levels: control = 0, case = 1
Setting direction: controls < cases

ROC AUC: 0.7616043
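
The logistic regression model's recall of about 0.17 is measured at the 0.5 threshold. One common lever for trading precision against recall is the classification threshold itself. The sketch below shows that trade-off on synthetic scores; it assumes nothing about the notebook's actual predictions, and the helper `metrics_at` is hypothetical.

```R
set.seed(42)
# Synthetic data: about 20% positives, with higher scores for positives on average.
actual <- rbinom(2000, size = 1, prob = 0.2)
scores <- plogis(rnorm(2000, mean = ifelse(actual == 1, 0, -2), sd = 1))

# Precision and recall of the positive class at a given threshold.
metrics_at <- function(threshold) {
  predicted <- as.integer(scores > threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(threshold = threshold, precision = tp / (tp + fp), recall = tp / (tp + fn))
}

thresholds <- c(0.2, 0.3, 0.4, 0.5)
sweep_result <- t(sapply(thresholds, metrics_at))
round(sweep_result, 3)
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of precision, so the right value depends on the relative cost of missing a churner versus contacting a loyal customer.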


## Save the Model
We save the trained logistic regression model as an RDS file.

In [17]:
# save the model
saveRDS(lg_reg_model, "r_log_reg_churn_model.rds")
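
To confirm that a saved model can be used later, it can be loaded back with `readRDS` and its predictions compared with the original. The sketch below shows the round trip on a small synthetic model written to a temporary file, so it does not touch the file saved above.

```R
set.seed(1)
# Synthetic data standing in for the churn features (made up for illustration).
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 * toy$x))

fit <- glm(y ~ x, data = toy, family = "binomial")

# Save to a temporary file and load it back.
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
reloaded <- readRDS(path)

# The reloaded model should reproduce the original predictions.
p_original <- predict(fit, newdata = toy, type = "response")
p_reloaded <- predict(reloaded, newdata = toy, type = "response")
all.equal(p_original, p_reloaded)
```

For the model saved in this notebook, the equivalent call is `readRDS("r_log_reg_churn_model.rds")`.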