# R Customer Churn Model
This notebook demonstrates the process of creating a customer churn model using R. We will use the XGBoost and logistic regression algorithms to create two separate models and compare their performance.

The dataset used in this notebook is located at `../datasets/bank_customer_churn.csv`.

## Prerequisites

#### Install R using Homebrew if you don't have it already:

```bash
brew install r
```

If you run into errors, you may also need to install the following system libraries:

```bash
brew install harfbuzz fribidi libtiff libomp
```

#### Install the following packages in an R console:

```R
install.packages("xgboost")
install.packages("dplyr")
install.packages("caret")
install.packages("pROC")
```

## Load Libraries
First, we load the required libraries.

In [None]:
# load the required libraries
library(xgboost)
library(dplyr)
library(caret)
library(pROC)

## Import and Preprocess Dataset
Now, we import the dataset and preprocess it by removing irrelevant columns, encoding `Gender` as 0 or 1, and one-hot encoding the remaining categorical columns.

In [None]:
# import the dataset
df <- read.csv("../datasets/bank_customer_churn.csv", header = TRUE)

# remove irrelevant columns
df <- df %>% select(-c(RowNumber, CustomerId, Surname, CreditScore))

# Convert the 'Gender' column to 0 or 1 (assuming "Female" should be 0 and "Male" should be 1)
df$Gender <- ifelse(df$Gender == "Female", 0, 1)

# one-hot encode categorical columns with caret
df <- dummyVars(" ~ .", data = df) %>% predict(df)

# remove GeographySpain: keeping a dummy column for every level makes the columns perfectly collinear
df <- subset(df, select = -GeographySpain)

summary(df)
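
To illustrate why one dummy column is dropped, here is a minimal sketch on toy data. It uses base R's `model.matrix` instead of caret's `dummyVars`, and the `toy` data frame is hypothetical; only the `Geography` level names are taken from the notebook.

```R
# Toy data (hypothetical); the level names mirror the Geography column
toy <- data.frame(Geography = factor(c("France", "Germany", "Spain", "France")))

# Full one-hot encoding: one column per level, no intercept
full <- model.matrix(~ Geography - 1, data = toy)

# Every row sums to 1, so the three dummies together duplicate an intercept term
rowSums(full)

# With an intercept added, the design has 4 columns but only rank 3
rank_full <- qr(cbind(Intercept = 1, full))$rank

# Dropping one level (here Spain) removes the redundancy
reduced <- full[, colnames(full) != "GeographySpain", drop = FALSE]
rank_reduced <- qr(cbind(Intercept = 1, reduced))$rank
colnames(reduced)
```

After dropping the column, Spain becomes the reference level: a row with both remaining dummies equal to 0 represents a customer from Spain.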

## Split Data
Next, we split the data into training and testing sets using a 70/30 ratio.

In [None]:
# split data into training and testing sets
set.seed(123)
train_index <- sample(1:nrow(df), size = round(0.7*nrow(df)), replace = FALSE)
df_train <- df[train_index, ]
df_test <- df[-train_index, ]
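
As a sanity check on a split like the one above, the sketch below verifies on hypothetical toy data that the two sets are disjoint, that together they cover every row, and that the churn rate is similar in both. The `toy` data frame and its `id` column are assumptions made for illustration.

```R
set.seed(123)

# Toy data (hypothetical): 1000 customers with roughly 20% churn
n <- 1000
toy <- data.frame(id = seq_len(n), Exited = rbinom(n, 1, 0.2))

# Same 70/30 sampling scheme as in the notebook
idx <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)
train <- toy[idx, ]
test <- toy[-idx, ]

# The two sets should partition the data with no shared rows
n_overlap <- length(intersect(train$id, test$id))
n_total <- nrow(train) + nrow(test)

# Compare the churn rate in each set; a large gap suggests an unlucky split
churn_rates <- c(train = mean(train$Exited), test = mean(test$Exited))
churn_rates
```

If the churn rates differ noticeably, a stratified split such as caret's `createDataPartition` may be worth considering.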

## Save Train and Test Datasets
We save the train and test datasets as CSV files.

In [None]:
# save the train and test datasets as csv files
write.csv(df_train, file = "r_churn_train.csv", row.names = FALSE)
write.csv(df_test, file = "r_churn_test.csv", row.names = FALSE)

## Convert Data to DMatrix Format
We convert the data into DMatrix format, which is required by the XGBoost library.

In [None]:
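# convert data into DMatrix format
# select the feature columns by name rather than by position, so the code
# keeps working if the column order changes
feature_cols <- colnames(df_train) != "Exited"
dtrain <- xgb.DMatrix(data = df_train[, feature_cols], label = df_train[, "Exited"])
dtest <- xgb.DMatrix(data = df_test[, feature_cols], label = df_test[, "Exited"])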

## Set Up XGBoost Parameters
We set up the XGBoost parameters to be used during the training process.

In [None]:
# set up XGBoost parameters
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  max_depth = 3,
  eta = 0.1,
  gamma = 0.5,
  subsample = 0.8,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  nthread = 4
)

## Train the XGBoost Model
We train the XGBoost model using the parameters and data prepared earlier.

In [None]:
# train the XGBoost model
model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10
)

## Model Summary
We display a summary of the trained XGBoost model.

In [None]:
summary(model)

## Predict on Test Data
We make predictions on the test data and calculate the accuracy of the model.

In [None]:
# predict on test data
test_preds <- predict(model, dtest)

# Convert predicted probabilities to binary predictions
test_preds_binary <- ifelse(test_preds > 0.5, 1, 0)

# Calculate accuracy on test set
accuracy <- sum(test_preds_binary == df_test[,"Exited"])/nrow(df_test)
accuracy
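
Accuracy is easier to interpret next to a baseline. The sketch below uses hypothetical labels, with a 20% churn rate chosen purely for illustration, to show the accuracy of always predicting "no churn".

```R
# Toy labels (hypothetical): 80 customers stayed, 20 churned
labels <- c(rep(0, 80), rep(1, 20))

# A trivial model that always predicts the majority class
baseline_preds <- rep(0, length(labels))

# Accuracy of the trivial model
baseline_accuracy <- mean(baseline_preds == labels)
baseline_accuracy  # 0.8
```

A model is only useful on such data if its accuracy clearly exceeds this baseline, which is one reason the notebook also reports precision, recall, and ROC AUC.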

## Model Evaluation Tests
We calculate the confusion matrix, precision, recall, F1 score, and ROC AUC for the model.

In [None]:
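# Calculate the confusion matrix (rows are predictions, columns are actual labels)
# fixing the factor levels keeps the table 2x2 even if only one class is predicted
cm <- confusionMatrix(
  factor(test_preds_binary, levels = c(0, 1)),
  factor(df_test[, "Exited"], levels = c(0, 1))
)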

# Calculate precision, recall, and F1 score
precision <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[2, 1])
recall <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[1, 2])
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1 Score:", f1_score, "\n")

# Calculate ROC AUC
roc_obj <- roc(df_test[,"Exited"], test_preds)
roc_auc <- auc(roc_obj)
cat("ROC AUC:", roc_auc, "\n")
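
The indexing into `cm$table` above relies on rows being predictions and columns being actual labels. The following sketch reproduces the same arithmetic on a small hypothetical example using base R's `table`, so each count can be checked by hand.

```R
# Toy predictions and actual labels (hypothetical), with 1 = churned
pred <- factor(c(1, 1, 1, 0, 0, 0, 0, 1), levels = c(0, 1))
actual <- factor(c(1, 1, 0, 0, 0, 1, 0, 1), levels = c(0, 1))

# Rows are predictions, columns are actual labels
tab <- table(Prediction = pred, Reference = actual)

tp <- tab[2, 2]  # predicted 1, actually 1
fp <- tab[2, 1]  # predicted 1, actually 0
fn <- tab[1, 2]  # predicted 0, actually 1

precision_toy <- tp / (tp + fp)  # 3 / 4 = 0.75
recall_toy <- tp / (tp + fn)     # 3 / 4 = 0.75
f1_toy <- 2 * precision_toy * recall_toy / (precision_toy + recall_toy)  # 0.75
```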

## Save the Model
We save the trained XGBoost model as a JSON file.

In [None]:
# save the model (note the .json extension; we could also save it as .bin)
# this ensures compatibility with the ValidMind SDK
xgb.save(model, "r_xgb_churn_model.json")

## Train a Simple Logistic Regression Model
As a comparison, we train a simple logistic regression model using the training data.

In [None]:
# now let's train a simple logistic regression model
lg_reg_model <- glm(Exited ~ ., data = as.data.frame(df_train), family = "binomial")

## Model Summary
We display a summary of the trained logistic regression model.

In [None]:
summary(lg_reg_model)

In [None]:
coef(lg_reg_model)
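
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below shows this on simulated data; the `toy` data frame, the `age` predictor, and the simulation parameters are assumptions for illustration and are not taken from the dataset.

```R
set.seed(42)

# Simulated data (hypothetical): churn probability increases with age
toy <- data.frame(age = rnorm(200, mean = 40, sd = 10))
toy$exited <- rbinom(200, 1, plogis(-4 + 0.1 * toy$age))

# Fit a logistic regression, as in the notebook
fit <- glm(exited ~ age, data = toy, family = "binomial")

# Exponentiate the coefficients to obtain odds ratios
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio above 1 for `age` means each additional year multiplies the odds of churn by that factor, holding other predictors fixed.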

## Predict on Test Data
We make predictions on the test data and calculate the accuracy of the logistic regression model.

In [None]:
# Make predictions on test set
test_preds <- predict(lg_reg_model, newdata = as.data.frame(df_test), type = "response")

# Convert predicted probabilities to binary predictions
test_preds_binary <- ifelse(test_preds > 0.5, 1, 0)

# Calculate accuracy on test set
accuracy <- sum(test_preds_binary == df_test[,"Exited"])/nrow(df_test)
accuracy

## Model Evaluation Tests
We calculate the confusion matrix, precision, recall, F1 score, and ROC AUC for the logistic regression model.

In [None]:
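# Calculate the confusion matrix (rows are predictions, columns are actual labels)
# fixing the factor levels keeps the table 2x2 even if only one class is predicted
cm <- confusionMatrix(
  factor(test_preds_binary, levels = c(0, 1)),
  factor(df_test[, "Exited"], levels = c(0, 1))
)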

# Calculate precision, recall, and F1 score
precision <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[2, 1])
recall <- cm$table[2, 2] / (cm$table[2, 2] + cm$table[1, 2])
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1 Score:", f1_score, "\n")

# Calculate ROC AUC
roc_obj <- roc(df_test[,"Exited"], test_preds)
roc_auc <- auc(roc_obj)
cat("ROC AUC:", roc_auc, "\n")
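
ROC AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it from ranks on a tiny hypothetical example, using the Mann-Whitney formulation in base R rather than pROC.

```R
# Toy labels and predicted scores (hypothetical)
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

# Rank all scores (ties would receive average ranks)
r <- rank(scores)

n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)

# Sum of positive-class ranks, shifted and scaled to the [0, 1] range
auc_toy <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_toy  # 0.75
```

Here three of the four churner/non-churner pairs are ordered correctly, which gives 0.75.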

## Save the Model
We save the trained logistic regression model as an RDS file.

In [None]:
# save the model
saveRDS(lg_reg_model, "r_log_reg_churn_model.rds")
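
A saved RDS file can be restored with `readRDS`. As a minimal sketch, the round trip below fits a model on simulated data, writes it to a temporary file, and checks that the restored model gives the same predictions. The toy data and the temporary path are assumptions for illustration.

```R
set.seed(1)

# Simulated data (hypothetical)
toy <- data.frame(x = rnorm(50))
toy$y <- rbinom(50, 1, plogis(toy$x))

fit <- glm(y ~ x, data = toy, family = "binomial")

# Write the model to a temporary file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
restored <- readRDS(path)

# The restored model should reproduce the original predictions
same_preds <- isTRUE(all.equal(
  predict(fit, newdata = toy, type = "response"),
  predict(restored, newdata = toy, type = "response")
))
same_preds
```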