# LOGISTIC REGRESSION WITH UNIVERSAL BANK DATASET

Adapted from Shmueli (2017) Chapter 10

The dataset concerns the acceptance of a personal loan offered by Universal Bank. It includes data on 5000 customers.

The data include the customer’s response to the last personal loan campaign (Personal Loan), as well as customer demographic information (Age, Income, etc.) and the customer’s relationship with the bank (mortgage, securities account, etc.).

Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan offered to them in a previous campaign.

The goal is to build a model that identifies customers who are most likely to accept the loan offer in future mailings.

## Libraries and dataset

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(pROC) # for ROC curve
library(plotROC) # for pretty plot ROC curve
library(IRdisplay) # for displaying interactive ROC curves
library(lift) # for lift chart
library(caret) # for lift chart and confusion matrix
library(gains) # for lift chart

In [None]:
bank_dt <- fread("../data/csv/05_03_universalbank.csv", stringsAsFactors = TRUE)

## Explore data

Delete unnecessary variables:

In [None]:
bank_dt[, c("ID", "ZIP Code") := NULL] # drop the ID and ZIP Code columns

Convert education to factor:

In [None]:
bank_dt[, Education := factor(Education,
                              levels = 1:3,
                              labels = c("Undergrad",
                                        "Graduate",
                                        "Advanced/Professional"))]

Get the structure of the data:

In [None]:
str(bank_dt)

"Personal Loan" will be our dependent variable 

View data:

In [None]:
bank_dt

Summarize numeric variables:

In [None]:
bank_dt %>% purrr::keep(is.numeric) %>% sapply(quantile) %>% t()

In [None]:
bank_dt[,Age:Mortgage][,!"Education"]

We see that the last five variables are actually categorical: they take only the values 0 and 1.

We will keep them as they are, since numeric 0/1 columns already behave as dummy variables in the model.
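As a small check of that claim, the sketch below fits the same logistic regression twice on synthetic data, once with a numeric 0/1 predictor and once with the same predictor converted to a factor. The variable names (`flag`, `y`) and the simulated data are assumptions made for illustration only; they are not taken from the bank dataset.

```r
# Synthetic data: purely illustrative, not the Universal Bank data
set.seed(3)
n <- 400
flag <- rbinom(n, 1, 0.3)                    # hypothetical 0/1 indicator
y <- rbinom(n, 1, plogis(-1 + 1.5 * flag))   # synthetic binary response

# Same model, predictor coded as numeric 0/1 vs. as a two-level factor
fit_numeric <- glm(y ~ flag, family = "binomial")
fit_factor  <- glm(y ~ factor(flag), family = "binomial")

# The intercept and slope agree; only the coefficient label differs
coef(fit_numeric)
coef(fit_factor)
```

Both fits use the same design matrix, which is why leaving the indicator columns as integers does not change the estimates.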

In [None]:
bank_dt[,Age:Mortgage][,!"Education"] %>% # select columns
    tidyr::gather() %>% # reshape into long format in columns "key" and "value"
    ggplot(aes(value)) + # plot value
        facet_wrap(~ key, scales = "free") + # divide into separate plots by key
        geom_density(fill = "green")  # get density plots

Summarize factor variables:

In [None]:
bank_factors <- bank_dt[,Education:CreditCard][,!"Mortgage"] %>% # select columns
    tidyr::gather() %>% # convert into long format for faceting
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar() # get bar plots of the category counts

plotly::ggplotly(bank_factors)

## Partition the dataset

We will use 60% of the rows as the training set and the remaining 40% as the test set:

In [None]:
#partition data
set.seed(2)
train <- bank_dt[,sample(.I, .N * 0.6)]

In [None]:
bank_train <- bank_dt[train]
bank_test <- bank_dt[-train]

## Build and train a model

Build a logistic regression model: 

In [None]:
logit_reg <- glm(`Personal Loan` ~ ., data = bank_train, family = "binomial")

Summarize the model:

In [None]:
summary(logit_reg)

All variables except Age, Experience and Mortgage are significant at the 5% level.

The residual deviance is far below the null deviance, so the predictors explain a good portion of the variation that an intercept-only model leaves unexplained.
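The deviance comparison can be made concrete with a short sketch. It uses synthetic data (an assumption, not the bank data) and computes two common summaries: McFadden's pseudo R-squared, which for a 0/1 response equals one minus the ratio of residual to null deviance, and the likelihood-ratio test against the intercept-only model.

```r
# Synthetic data: purely illustrative, not the Universal Bank data
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))

fit <- glm(y ~ x, family = "binomial")

# Deviance of the intercept-only model vs. the fitted model
null_dev  <- fit$null.deviance
resid_dev <- fit$deviance

# McFadden's pseudo R-squared: share of the null deviance removed by the predictors
pseudo_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- fit$df.null - fit$df.residual
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(pseudo_r2 = pseudo_r2, lr_stat = lr_stat, p_value = p_value)
```

The same three quantities could be read off `logit_reg` by replacing `fit` with it.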

## Evaluate classification performance

Get the fitted values for train set:

In [None]:
pred_train <- predict(logit_reg, bank_train[,!"Personal Loan"], type = "response")

Convert them to discrete values for labels:

In [None]:
train_class <- ifelse(pred_train > 0.5, 1, 0)

### A detailed look at the confusion matrix:

Create a confusion matrix:

In [None]:
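# caret expects predictions in rows and actual values in columns;
# positive = "1" makes loan acceptance the class of interest
table(Predicted = train_class, Actual = bank_train[,`Personal Loan`]) %>%
    caret::confusionMatrix(positive = "1")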

Now let's go into the details of a confusion matrix:

![confusion matrix](https://3.bp.blogspot.com/--jLXutUe5Ss/VvPIO6ZH2tI/AAAAAAAACkU/pvVL4L-a70gnFEURcfBbL_R-GnhBR6f1Q/s1600/ConfusionMatrix.png)

According to Lantz (2015) Chapter 10:

- True Positive (TP): Correctly classified as the class of interest
- True Negative (TN): Correctly classified as not the class of interest
- False Positive (FP): Incorrectly classified as the class of interest
- False Negative (FN): Incorrectly classified as not the class of interest



- Accuracy is the number of true positives plus true negatives, divided by the total number of cases
- The error rate is "1 - accuracy"
- The sensitivity of a model (also called the true positive rate) measures the proportion of positive examples that were correctly classified. Therefore, it is calculated as the number of true positives divided by the total number of positives, both correctly classified (the true positives) as well as incorrectly classified (the false negatives)
- The specificity of a model (also called the true negative rate) measures the proportion of negative examples that were correctly classified. As with sensitivity, this is computed as the number of true negatives, divided by the total number of negatives—the true negatives plus the false positives
- The precision (also known as the positive predictive value) is the proportion of predicted positives that are truly positive; in other words, when a model predicts the positive class, how often is it correct? A precise model only predicts the positive class in cases that are very likely to be positive, so its positive predictions are trustworthy.
- On the other hand, recall is a measure of how complete the results are. This is defined as the number of true positives over the total number of positives. You may have already recognized this as the same as sensitivity. However, in this case, the interpretation differs slightly. A model with a high recall captures a large portion of the positive examples, meaning that it has wide breadth. For example, a search engine with a high recall returns a large number of documents pertinent to the search query. Similarly, an SMS spam filter has a high recall if the majority of spam messages are correctly identified.
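The definitions above can be sketched directly from the four cell counts. The counts below are hypothetical numbers chosen for easy arithmetic; they are not the output of the model in this notebook.

```r
# Hypothetical confusion-matrix counts (assumed for illustration only)
tp <- 80    # actual positive, predicted positive
fn <- 20    # actual positive, predicted negative
fp <- 10    # actual negative, predicted positive
tn <- 890   # actual negative, predicted negative
total <- tp + fn + fp + tn

accuracy    <- (tp + tn) / total
error_rate  <- 1 - accuracy
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
precision   <- tp / (tp + fp)   # positive predictive value
recall      <- sensitivity      # same quantity, different name

round(c(accuracy = accuracy, error_rate = error_rate,
        sensitivity = sensitivity, specificity = specificity,
        precision = precision, recall = recall), 4)
```

With these counts, accuracy is 0.97 even though one positive in five is missed, which is why sensitivity and precision are worth reading alongside it.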

Now another important metric is the kappa statistic:

The kappa statistic (labeled Kappa in the previous output) adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.

This is especially important for datasets with a severe class imbalance, because a classifier can obtain high accuracy simply by always guessing the most frequent class.

The kappa statistic will only reward the classifier if it is correct more often than this simplistic strategy.

Kappa reaches a maximum of 1, which indicates perfect agreement between the model's predictions and the true values, and equals 0 when agreement is no better than chance (it can even be negative when agreement is worse than chance). Values less than one indicate imperfect agreement. Depending on how a model is to be used, the interpretation of the kappa statistic might vary. One common interpretation is shown as follows:
- Poor agreement = less than 0.20
- Fair agreement = 0.20 to 0.40
- Moderate agreement = 0.40 to 0.60
- Good agreement = 0.60 to 0.80
- Very good agreement = 0.80 to 1.00
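To show how kappa corrects for chance agreement, here is a minimal sketch that computes it from confusion-matrix counts. The helper `kappa_from_counts` and the counts passed to it are hypothetical and exist only for this illustration; they are not part of caret.

```r
# Cohen's kappa from the four cells of a 2x2 confusion matrix
# (hypothetical helper written for this illustration)
kappa_from_counts <- function(tp, fn, fp, tn) {
  total <- tp + fn + fp + tn
  p_observed <- (tp + tn) / total
  # Agreement expected by chance, from the row and column margins
  p_expected <- ((tp + fn) / total) * ((tp + fp) / total) +
                ((tn + fp) / total) * ((tn + fn) / total)
  (p_observed - p_expected) / (1 - p_expected)
}

# A useful classifier on imbalanced data (10% positives): accuracy 0.97
kappa_model <- kappa_from_counts(tp = 80, fn = 20, fp = 10, tn = 890)

# Always predicting the majority class: accuracy 0.90, but no skill
kappa_naive <- kappa_from_counts(tp = 0, fn = 100, fp = 0, tn = 900)

c(model = kappa_model, naive = kappa_naive)
```

The naive strategy reaches 90% accuracy yet scores a kappa of 0, while the hypothetical model scores about 0.83.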

### ROC curve

Now let's draw the ROC curve:

In [None]:
p1 <- data.table(D = bank_train[,`Personal Loan`], M = pred_train) %>%
    ggplot(aes(m = M, d = D)) + # m: predicted probability, d: actual class
    plotROC::geom_roc() +
    plotROC::style_roc(theme = theme_grey)

plotROC::export_interactive_roc(p1) %>% IRdisplay::display_html()

The curve is quite close to that of a perfect classifier.

Let's also calculate the area under the curve (AUC):

In [None]:
pROC::auc(bank_train[,`Personal Loan`], pred_train)

An AUC near 1 confirms that the model separates the two classes well on the training set.
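One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks on a tiny hypothetical example; `auc_by_rank` is a helper written for this illustration, not a pROC function.

```r
# AUC via the rank-sum (Mann-Whitney) identity
# (hypothetical helper written for this illustration)
auc_by_rank <- function(labels, scores) {
  r <- rank(scores)                # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (assumed): three negatives and three positives
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.3, 0.35, 0.6, 0.7, 0.9)

auc_toy <- auc_by_rank(labels, scores)
auc_toy
```

Of the nine positive-negative pairs, eight are ordered correctly, so the result is 8/9.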

## Evaluate prediction performance

Let's predict the probabilities of test set:

In [None]:
pred_test <- predict(logit_reg, bank_test[,!"Personal Loan"], type = "response")

Convert probabilities to classes:

In [None]:
test_class <- ifelse(pred_test > 0.5, 1, 0)

And create a confusion matrix:

In [None]:
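# caret expects predictions in rows and actual values in columns;
# positive = "1" makes loan acceptance the class of interest
table(Predicted = test_class, Actual = bank_test[,`Personal Loan`]) %>%
    caret::confusionMatrix(positive = "1")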

While kappa and accuracy on the test set are slightly lower than on the training set, the model still performs well.

### Lift chart and decile-wise lift chart

Next we will draw a lift chart using the caret and ggplot2 packages:

First create classes:

In [None]:
test_class2 <- as.factor(bank_test[,`Personal Loan`]) %>% forcats::fct_relevel("1")

And the lift object:

In [None]:
lift1 <- caret::lift(test_class2 ~ pred_test)
lift1

See the percent of classes:

In [None]:
bank_test[,`Personal Loan` %>% table()] %>% prop.table()

And create the lift chart:

In [None]:
lift1 %>%
    ggplot(plot = "gain") %>% # cumulative gains version of the lift plot
    plotly::ggplotly()

And next draw the decile-wise lift chart:

In [None]:
lift::plotLift(pred_test, bank_test[,`Personal Loan` %>% as.numeric()],
               n.buckets = 10,
               cumulative = FALSE)

In [None]:
lift::TopDecileLift(pred_test, bank_test[,`Personal Loan` %>% as.numeric()])

When the 10% of cases with the highest predicted probabilities are targeted, 79.06% of all positive cases are found.
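The link between the top-decile lift and the share of positives captured can be sketched by hand. The example below uses synthetic scores and outcomes (an assumption, not the test set above) and assumes the usual definition of decile lift: the positive rate within a decile divided by the overall positive rate.

```r
# Synthetic scores and outcomes: purely illustrative
set.seed(42)
n <- 1000
score  <- runif(n)
actual <- rbinom(n, 1, prob = score^3)     # higher scores give more positives

# Sort cases by descending score and cut into ten equal-sized groups
ord    <- order(score, decreasing = TRUE)
decile <- ceiling(seq_len(n) / (n / 10))   # 1 = top 10% by score

# Positive rate per decile relative to the overall positive rate
decile_rate <- tapply(actual[ord], decile, mean)
decile_lift <- decile_rate / mean(actual)

# Share of all positives that fall in the top decile
captured_top <- sum(actual[ord][decile == 1]) / sum(actual)

round(decile_lift, 2)
captured_top
```

Because each decile holds 10% of the cases, the share of positives captured in the top decile equals the top-decile lift multiplied by 0.1.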