# Introduction

This is an exploration notebook for the credit card customer churn problem.

## Table of Contents:

* [Data Dictionary](#section-one)
* [Packages](#section-two)
* [Reading and Splitting Data](#section-three)
* [Initial Analysis](#section-four)
    - [Target](#section-four-one)
    - [Feature Type](#section-four-two)
    - [Correlation](#section-four-three)
    - [Feature EDA](#section-four-four)
* [Feature Engineering](#section-five)
    - [Outlier treatment](#section-five-one)
* [Pre Processing](#section-six)
    - [One Hot Encoder](#section-six-one)
    - [Data Scaling](#section-six-two)
    - [Class Balance](#section-six-three)
* [Comparing Algorithms](#section-seven)
    - [Creating Models](#section-seven-one)
    - [Is XGBoost Overfitting?](#section-seven-two)
    - [Measuring XGBoost Performance on Test Set](#section-seven-three)
    - [Feature Selection](#section-seven-four)
* [Model Evaluation](#section-eight)
    - [Final Model](#section-eight-one)
    - [Model Interpretability - SHAP](#section-eight-two)

<a id="section-one"></a>
# Data Dictionary:

- **CLIENTNUM**: Client number. Unique identifier for the customer holding the account.
- **Attrition_Flag**: Internal event (customer activity) variable - if the account is closed then 1 else 0.
- **Customer_Age**: Demographic variable - Customer's Age in Years
- **Gender**: Demographic variable - M=Male, F=Female
- **Dependent_count**: Demographic variable - Number of dependents
- **Education_Level**: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)
- **Marital_Status**: Demographic variable - Married, Single, Divorced, Unknown
- **Income_Category**: Demographic variable - Annual Income Category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K, Unknown)
- **Card_Category**: Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
- **Months_on_book**: Period of relationship with bank
- **Total_Relationship_Count**: Total no. of products held by the customer
- **Months_Inactive_12_mon**: No. of months inactive in the last 12 months
- **Contacts_Count_12_mon**: No. of Contacts in the last 12 months
- **Credit_Limit**: Credit Limit on the Credit Card
- **Total_Revolving_Bal**: Total Revolving Balance on the Credit Card
- **Avg_Open_To_Buy**: Open to Buy Credit Line (Average of last 12 months)
- **Total_Amt_Chng_Q4_Q1**: Change in Transaction Amount (Q4 over Q1)
- **Total_Trans_Amt**: Total Transaction Amount (Last 12 months)
- **Total_Trans_Ct**: Total Transaction Count (Last 12 months)
- **Total_Ct_Chng_Q4_Q1**: Change in Transaction Count (Q4 over Q1)
- **Avg_Utilization_Ratio**: Average Card Utilization Ratio

<a id="section-two"></a>
# Packages

In [None]:
# Basic packages
import pandas as pd
import numpy as np

# Graphs
import matplotlib.pyplot as plt

# Feature Engineering
from scipy.stats import boxcox

# Class balance
from imblearn.over_sampling import SMOTENC

# Sklearn
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, roc_curve

# ML Algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Model evaluation
import shap

import warnings
warnings.filterwarnings("ignore")

<a id="section-three"></a>

# Reading and Splitting Data

In [None]:
df = pd.read_csv("../input/credit-card-customers/BankChurners.csv")

# Remove last two columns
df = df.iloc[:,:21]

In [None]:
# Check for null values
df.isna().sum().any()

In [None]:
# Make target binary
df["Attrition_Flag"] = df["Attrition_Flag"].apply(lambda x: 0 if x == 'Existing Customer' else 1)

In [None]:
X = df.loc[:,df.columns != "Attrition_Flag"]

y = df[["Attrition_Flag"]]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 20)

In [None]:
# Remove client id column from train set
X_train = X_train.iloc[:,1:]

In [None]:
X_train.shape

<a id="section-four"></a>
# Initial Analysis

<a id="section-four-one"></a>
## Target

There is a class imbalance in the dataset

In [None]:
y_train.iloc[:,0].value_counts().plot(kind = "bar", rot = 0, title = "Class distribution - Target")

plt.show()
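
To put a number on the imbalance instead of reading it off the bar chart, the class shares can be computed directly. This is a minimal sketch on a made-up label vector; `toy_target` and its values are illustrative assumptions, not figures from the dataset.

```python
import pandas as pd

# Hypothetical labels: 1 = churned, 0 = existing customer (illustrative only)
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Attrition_Flag")

# Share of each class in the sample
class_share = toy_target.value_counts(normalize=True)

# Ratio between the majority and the minority class
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print("Imbalance ratio:", imbalance_ratio)
```

The same two lines should work on `y_train.iloc[:,0]` in place of `toy_target`.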

<a id="section-four-two"></a>
## Feature type

In [None]:
num_vars = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
            'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
            'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']

cat_vars = ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']

<a id="section-four-three"></a>
## Correlation

There seems to be a high correlation between "Avg_Open_To_Buy" and "Credit_Limit" features

In [None]:
# Correlation is only defined for the numerical features
corr = X_train[num_vars].corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
# Remove one of them, decided to remove Credit_Limit at first
X_train = X_train.drop(columns = "Credit_Limit")

num_vars.remove("Credit_Limit")
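
Spotting correlated pairs by eye in a coloured matrix gets harder as the number of features grows, so the check can also be done programmatically. The sketch below is one possible way to do it; `high_corr_pairs`, the 0.9 threshold and the toy data are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, correlation) for pairs with |corr| >= threshold."""
    corr = frame.corr()
    return [(a, b, round(float(corr.loc[a, b]), 3))
            for a, b in combinations(corr.columns, 2)
            if abs(corr.loc[a, b]) >= threshold]

# Toy data: "b" is almost a copy of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a,
                    "b": a + rng.normal(scale=0.01, size=200),
                    "c": rng.normal(size=200)})

pairs = high_corr_pairs(toy)
print(pairs)
```

Which feature of a correlated pair to drop remains a judgment call; the helper only lists the candidates.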

<a id="section-four-four"></a>
## Feature EDA

### Categorical Features

We can see a high churn ratio amongst Platinum card users

In [None]:
for var in cat_vars:
    # Join training objects into a single dataframe
    feat = pd.DataFrame({"Feature": X_train[var].values, "target": y_train["Attrition_Flag"].values})
    # Aggregate count and sum to calculate ratio
    feat = feat.groupby(by = "Feature").agg(count = pd.NamedAgg(column = "target", aggfunc = "count"),
                                            sum = pd.NamedAgg(column = "target", aggfunc = "sum"))
    feat['churn_ratio'] = feat['sum']/feat['count']
    
    # Create subplot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize = [10, 4.5])
    fig.suptitle(var)
    X_train[var].value_counts().plot(kind = 'bar', ax = ax1, title = "Bar plot")
    feat['churn_ratio'].plot(kind = 'bar', ax = ax2, title = "Churn Ratio")
    
    plt.show()

### Numerical Features

For numerical features, we can see some interesting things.
- As expected, clients who hold more products tend to be more loyal than those with fewer products. (Total_Relationship_Count)
- Clients who were inactive for less than two months also tend to churn less. However, if a client stays inactive for 3 or 4 months, chances are higher that they will cancel their products. (Months_Inactive_12_mon)
- Also, the more often a client has to get in touch with the bank, the more dissatisfied they tend to be. (Contacts_Count_12_mon)
... and so on

As for outliers, the Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt and Total_Ct_Chng_Q4_Q1 features must be examined more closely.

In [None]:
for var in num_vars:
    # Join training objects into a single dataframe
    feat = pd.DataFrame({"Feature": X_train[var].values, "target": y_train["Attrition_Flag"].values})
    
    # Create subplot
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = [14, 4.5])
    fig.suptitle(var)
    
    X_train[var].plot(kind = 'hist', ax = ax1, title = "Histogram")
    
    X_train[var].plot(kind = 'box', ax = ax2, title = "Box plot")
    
    feat["Feature"].plot(kind = 'kde', ax = ax3, title = "Churn ratio", label = "Total Distribution")
    feat.loc[feat["target"] == 1, "Feature"].plot(kind = 'kde', ax = ax3, label = "Churned")
    ax3.legend(loc = 2)
    
    plt.show()

### Bi-variate analysis

We could plot all features against each other and see how they behave. For simplicity, I'll plot one example each of categorical vs categorical, categorical vs numerical and numerical vs numerical.

#### Education_Level vs Income_Category

In [None]:
bivariate_cat = X_train[['Education_Level', 'Income_Category', 'Gender']].groupby(by = ['Education_Level', 'Income_Category']).agg(count = pd.NamedAgg(column = 'Gender', aggfunc = 'count'))

bivariate_cat = bivariate_cat.reset_index().pivot(index = 'Education_Level', columns = 'Income_Category', values = 'count')

# Sum row-wise
bivariate_cat['Total'] = bivariate_cat.sum(axis = 1)

# Calculate % by class, except Total
for column in bivariate_cat.columns:
    if column != 'Total':
        bivariate_cat[column] = 100*(bivariate_cat[column]/bivariate_cat['Total'])

# Plot without Total column
bivariate_cat.drop(columns = 'Total').plot.bar(stacked = True)

plt.show()

#### Gender vs Avg_Open_To_Buy

In [None]:
ax = X_train[['Gender', 'Avg_Open_To_Buy']].pivot(columns = 'Gender').boxplot()

# Change x-axis labels
ax.set_xticklabels(['F', 'M'])

plt.show()

#### Total_Trans_Ct vs Total_Trans_Amt

In [None]:
fig, ax = plt.subplots()

X_train[['Total_Trans_Amt', 'Total_Trans_Ct']].join(y_train).plot.scatter(x = 'Total_Trans_Ct', 
                                                                          y = 'Total_Trans_Amt', 
                                                                          c = 'Attrition_Flag',
                                                                          colormap = 'viridis',
                                                                          ax = ax)

plt.show()

<a id="section-five"></a>
# Feature Engineering

<a id="section-five-one"></a>
## Outlier treatment

The graphs beneath show two different kinds of treatments that could be applied to features in order to adjust outlier values. 

For a better understanding on how boxcox works: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html

For these transformations, if the minimum value of a feature is 0, we have to add 1 to it first: log(0) is undefined (it tends to -inf), and Box-Cox requires strictly positive values.
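
For reference, the Box-Cox transform with parameter lambda maps a positive value x to (x^lambda - 1) / lambda when lambda is not 0, and to log(x) when lambda is 0. Below is a hand-written sketch of that formula; `boxcox_by_hand` is a hypothetical helper for illustration, while the cells that follow use `scipy.stats.boxcox`, which also estimates lambda from the data.

```python
import numpy as np

def boxcox_by_hand(x, lmbda):
    """Box-Cox transform for strictly positive x (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

values = np.array([1.0, 10.0, 100.0, 1000.0])

print(boxcox_by_hand(values, 0))     # same as np.log(values)
print(boxcox_by_hand(values, 0.5))   # milder compression than the log
print(boxcox_by_hand(values, 1))     # just values - 1, shape unchanged
```

So the log transformation is the special case lambda = 0, which is why the two treatments are compared side by side below.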

### Avg_Open_To_Buy

In [None]:
print("The minimum value of the feature is:", X_train['Avg_Open_To_Buy'].min())

In [None]:
fig, axs = plt.subplots(2, 3, figsize = [14, 9])

X_train['Avg_Open_To_Buy'].plot(kind = 'box', ax = axs[0, 0], title = 'As is boxplot')

X_train['Avg_Open_To_Buy'].apply(np.log).plot(kind = 'box', ax = axs[0, 1], title = 'Log transformation boxplot')

values, lmbda = boxcox(X_train['Avg_Open_To_Buy'])

pd.DataFrame({'Avg_Open_To_Buy': values}).plot(kind = 'box', ax = axs[0, 2], title = 'Boxcox transformation boxplot')

X_train['Avg_Open_To_Buy'].plot(kind = 'hist', ax = axs[1, 0], title = 'As is histogram')

X_train['Avg_Open_To_Buy'].apply(np.log).plot(kind = 'hist', ax = axs[1, 1], title = 'Log transformation histogram')

pd.DataFrame({'Avg_Open_To_Buy': values}).plot(kind = 'hist', ax = axs[1, 2], title = 'Boxcox transformation histogram')

plt.show()

### Total_Trans_Amt

In [None]:
print("The minimum value of the feature is:", X_train['Total_Trans_Amt'].min())

In [None]:
fig, axs = plt.subplots(2, 3, figsize = [14, 9])

X_train['Total_Trans_Amt'].plot(kind = 'box', ax = axs[0, 0], title = 'As is boxplot')

X_train['Total_Trans_Amt'].apply(np.log).plot(kind = 'box', ax = axs[0, 1], title = 'Log transformation boxplot')

values, lmbda = boxcox(X_train['Total_Trans_Amt'])

pd.DataFrame({'Total_Trans_Amt': values}).plot(kind = 'box', ax = axs[0, 2], title = 'Boxcox transformation boxplot')

X_train['Total_Trans_Amt'].plot(kind = 'hist', ax = axs[1, 0], title = 'As is histogram')

X_train['Total_Trans_Amt'].apply(np.log).plot(kind = 'hist', ax = axs[1, 1], title = 'Log transformation histogram')

pd.DataFrame({'Total_Trans_Amt': values}).plot(kind = 'hist', ax = axs[1, 2], title = 'Boxcox transformation histogram')

plt.show()

### Total_Amt_Chng_Q4_Q1

In [None]:
print("The minimum value of the feature is:", X_train['Total_Amt_Chng_Q4_Q1'].min())

In [None]:
fig, axs = plt.subplots(2, 3, figsize = [14, 9])

X_train['Total_Amt_Chng_Q4_Q1'].plot(kind = 'box', ax = axs[0, 0], title = 'As is boxplot')

X_train['Total_Amt_Chng_Q4_Q1'].apply(lambda x: np.log(x + 1)).plot(kind = 'box', ax = axs[0, 1], title = 'Log transformation boxplot')

values, lmbda = boxcox(X_train['Total_Amt_Chng_Q4_Q1'].apply(lambda x: x + 1))

pd.DataFrame({'Total_Amt_Chng_Q4_Q1': values}).plot(kind = 'box', ax = axs[0, 2], title = 'Boxcox transformation boxplot')

X_train['Total_Amt_Chng_Q4_Q1'].plot(kind = 'hist', ax = axs[1, 0], title = 'As is histogram')

X_train['Total_Amt_Chng_Q4_Q1'].apply(lambda x: np.log(x + 1)).plot(kind = 'hist', ax = axs[1, 1], title = 'Log transformation histogram')

pd.DataFrame({'Total_Amt_Chng_Q4_Q1': values}).plot(kind = 'hist', ax = axs[1, 2], title = 'Boxcox transformation histogram')

plt.show()

### Total_Ct_Chng_Q4_Q1

In [None]:
print("The minimum value of the feature is:", X_train['Total_Ct_Chng_Q4_Q1'].min())

In [None]:
fig, axs = plt.subplots(2, 3, figsize = [14, 9])

X_train['Total_Ct_Chng_Q4_Q1'].plot(kind = 'box', ax = axs[0, 0], title = 'As is boxplot')

X_train['Total_Ct_Chng_Q4_Q1'].apply(lambda x: np.log(x + 1)).plot(kind = 'box', ax = axs[0, 1], title = 'Log transformation boxplot')

values, lmbda = boxcox(X_train['Total_Ct_Chng_Q4_Q1'].apply(lambda x: x + 1))

pd.DataFrame({'Total_Ct_Chng_Q4_Q1': values}).plot(kind = 'box', ax = axs[0, 2], title = 'Boxcox transformation boxplot')

X_train['Total_Ct_Chng_Q4_Q1'].plot(kind = 'hist', ax = axs[1, 0], title = 'As is histogram')

X_train['Total_Ct_Chng_Q4_Q1'].apply(lambda x: np.log(x + 1)).plot(kind = 'hist', ax = axs[1, 1], title = 'Log transformation histogram')

pd.DataFrame({'Total_Ct_Chng_Q4_Q1': values}).plot(kind = 'hist', ax = axs[1, 2], title = 'Boxcox transformation histogram')

plt.show()

We can see that even Box-Cox could not handle all the outliers. However, for most features the distribution became much closer to a normal distribution. Therefore, let's use it.

In [None]:
X_train_box = X_train.copy()

# Generate lambda values for features
outlier_vars = ['Avg_Open_To_Buy', 'Total_Trans_Amt', 'Total_Amt_Chng_Q4_Q1', 'Total_Ct_Chng_Q4_Q1']

lmbdas_treatment = {}

for var in outlier_vars:
    if (X_train_box[var].min() > 0):
        values, lmbda = boxcox(X_train_box[var])
        X_train_box[var] = values
        lmbdas_treatment[var] = lmbda
    else:
        values, lmbda = boxcox(X_train_box[var].apply(lambda x: x + 1))
        X_train_box[var] = values
        lmbdas_treatment[var] = lmbda
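
The lambdas stored in `lmbdas_treatment` are what would be needed to transform unseen data the same way as the training set. The sketch below shows one way this could look on toy data. It assumes the +1 shift decision is stored as well, in a hypothetical `shifts` dict that this notebook does not create, and the `train`/`test` frames are made-up stand-ins.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Toy stand-ins for the train and test splits (illustrative values only)
train = pd.DataFrame({"amount": [10.0, 20.0, 50.0, 400.0, 3000.0],
                      "change": [0.0, 0.5, 0.7, 1.2, 3.0]})
test = pd.DataFrame({"amount": [15.0, 800.0], "change": [0.1, 2.0]})

lmbdas, shifts = {}, {}          # `shifts` is an assumption, not in the notebook
train_box, test_box = train.copy(), test.copy()

for var in train.columns:
    # Shift by 1 only when the training minimum is not strictly positive
    shifts[var] = 0.0 if train[var].min() > 0 else 1.0
    # Estimate lambda on the training split only
    train_box[var], lmbdas[var] = boxcox(train[var] + shifts[var])
    # Reuse the training shift and lambda on the test split
    test_box[var] = boxcox(test[var] + shifts[var], lmbda=lmbdas[var])

print(lmbdas)
print(test_box)
```

When `lmbda` is passed, `boxcox` returns only the transformed values, so nothing is re-estimated on the test data.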

<a id="section-six"></a>
# Pre Processing

<a id="section-six-one"></a>
## One Hot Encoding Categorical Features

In [None]:
# Create encoder and fit encoder
encoder = OneHotEncoder(drop = 'first', sparse = False)

encoder = encoder.fit(X_train[cat_vars])

# Create new training set
X_train_new = pd.DataFrame(encoder.transform(X_train[cat_vars]), columns = encoder.get_feature_names(cat_vars))

X_train_new = X_train_new.join(X_train[num_vars].reset_index(drop = True))

# Create new training set with boxcox
X_train_box_new = pd.DataFrame(encoder.transform(X_train_box[cat_vars]), columns = encoder.get_feature_names(cat_vars))

X_train_box_new = X_train_box_new.join(X_train_box[num_vars].reset_index(drop = True))
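
The `drop = 'first'` option removes one dummy column per feature, since that category is implied when all the remaining dummies are 0. Here is a small sketch of the effect using pandas `get_dummies`, a swapped-in tool chosen only to keep the example short; the notebook itself uses scikit-learn's `OneHotEncoder`, and the toy frame is made up.

```python
import pandas as pd

# Toy categorical feature (illustrative values only)
toy = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold", "Blue"]})

# One column per category
full = pd.get_dummies(toy, columns=["Card_Category"], dtype=int)

# First category (alphabetically "Blue") dropped: it becomes the all-zeros row
reduced = pd.get_dummies(toy, columns=["Card_Category"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

Dropping a level mainly matters for linear models such as logistic regression, where the full set of dummies is perfectly collinear with the intercept.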

<a id="section-six-two"></a>
## Data scaling

We have numerical features on all sorts of scales. For distance-based algorithms, we need them to be on a comparable scale, so that we do not give a feature a higher weight just because of the magnitude of its values.

The two most commonly used options for this are Normalization and Standardization.

Normalization: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Standardization: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
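
As a quick reference, both rescalings can be written out by hand. The sketch below uses a made-up array and plain NumPy; it mirrors what `MinMaxScaler` and `StandardScaler` compute per feature with their default settings.

```python
import numpy as np

# Toy feature with one large value (illustrative only)
x = np.array([2.0, 4.0, 6.0, 8.0, 30.0])

# Normalization (min-max): rescales the values to the [0, 1] range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean and unit variance
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```

Note how the large value squeezes the other normalized points towards 0, which is one reason to look at outliers before scaling.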

### Normalization

In [None]:
normal_scaler = MinMaxScaler()

normal_scaler = normal_scaler.fit(X_train_new[num_vars])

X_train_normal_new = X_train_new.copy()

X_train_normal_new[num_vars] = normal_scaler.transform(X_train_new[num_vars])

In [None]:
normal_box_scaler = MinMaxScaler()

normal_box_scaler = normal_box_scaler.fit(X_train_box_new[num_vars])

X_train_box_normal_new = X_train_box_new.copy()

X_train_box_normal_new[num_vars] = normal_box_scaler.transform(X_train_box_new[num_vars])

### Standardization

In [None]:
standard_scaler = StandardScaler()

standard_scaler = standard_scaler.fit(X_train_new[num_vars])

X_train_standard_new = X_train_new.copy()

X_train_standard_new[num_vars] = standard_scaler.transform(X_train_new[num_vars])

In [None]:
standard_box_scaler = StandardScaler()

standard_box_scaler = standard_box_scaler.fit(X_train_box_new[num_vars])

X_train_box_standard_new = X_train_box_new.copy()

X_train_box_standard_new[num_vars] = standard_box_scaler.transform(X_train_box_new[num_vars])

<a id="section-six-three"></a>
## Balance classes

There are quite a few ways of handling imbalanced datasets:
- Leave the data as it is, in which case the scoring metric must be chosen accordingly (accuracy would simply reward predicting the majority class)
- Use stratified cross-validation, so that every fold keeps the original class proportions and the scores vary less between folds
- Undersample the majority class
- Oversample the minority class
- Generate synthetic data, etc.

Each method has its ups and downs. Here I'll apply synthetic data generation to balance the classes while trying to keep the data as representative of reality as possible. However, this changes the class proportions the algorithm sees, and therefore the probabilities it predicts, so choose an approach carefully and plan accordingly.
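
To give an intuition for synthetic data generation, the core SMOTE idea is to create a new minority point somewhere on the segment between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration for numerical features only; `synthetic_sample` and the toy points are assumptions, and this is not the SMOTENC implementation used in the next cell, which also has to handle the categorical columns.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples with two numerical features (illustrative)
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]])

def synthetic_sample(samples, index, k=2):
    """Interpolate between one sample and a random one of its k nearest neighbours."""
    base = samples[index]
    distances = np.linalg.norm(samples - base, axis=1)
    # Skip position 0 of the sorted order, which is the sample itself
    neighbours = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbours)]
    gap = rng.random()  # position along the segment, in [0, 1)
    return base + gap * (neighbour - base)

new_point = synthetic_sample(minority, index=0)
print(new_point)
```

Because every new point lies between two real minority samples, the synthetic data stays inside the region the minority class already occupies.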

In [None]:
seed = 100

# Categorical features
new_cat_vars = encoder.get_feature_names(cat_vars)

cat_vars_index = []

for var in encoder.get_feature_names(cat_vars):
    cat_vars_index.append(X_train_new.columns.get_loc(var))

# Create SMOTE object to balance minority class
oversample = SMOTENC(categorical_features = cat_vars_index, random_state = seed)

# Apply SMOTE to training set without any FE
X_train_balanced, y_train_balanced = oversample.fit_resample(X_train_new, y_train)

# Apply SMOTE to training set with normalization
X_train_normal_balanced, y_train_normal_balanced = oversample.fit_resample(X_train_normal_new, y_train)

# Apply SMOTE to training set with standardization
X_train_std_balanced, y_train_std_balanced = oversample.fit_resample(X_train_standard_new, y_train)

# Apply SMOTE to training set with boxcox
X_train_box_balanced, y_train_box_balanced = oversample.fit_resample(X_train_box_new, y_train)

# Apply SMOTE to training set with boxcox and normalization
X_train_box_normal_balanced, y_train_box_normal_balanced = oversample.fit_resample(X_train_box_normal_new, y_train)

# Apply SMOTE to training set with boxcox and standardization
X_train_box_std_balanced, y_train_box_std_balanced = oversample.fit_resample(X_train_box_standard_new, y_train)

In [None]:
y_train_balanced.iloc[:,0].value_counts().plot(kind = 'bar', rot = 0, title = "Balanced target")

plt.show()

<a id="section-seven"></a>
# Comparing performances

<a id="section-seven-one"></a>
## Create models

I'll set max_depth to 4, given the size of the training dataset. By doing so, we at least try to avoid overfitting the model. Changing the learning rate and gamma (for XGBoost), among other parameters, could also help avoid it.

In [None]:
# Define the number of folds, the scoring metric and the seed
num_folds = 10
scoring = 'roc_auc'
seed = 7

# Prepare the list of models
models = []
models.append(('LR w/o FE', LogisticRegression()))
models.append(('RF w/o FE', RandomForestClassifier(max_depth = 4, criterion = 'entropy')))
models.append(('XGB w/o FE', XGBClassifier(max_depth = 4, verbosity = 0)))
models.append(('LR normal', LogisticRegression()))
models.append(('RF normal', RandomForestClassifier(max_depth = 4, criterion = 'entropy')))
models.append(('XGB normal', XGBClassifier(max_depth = 4, verbosity = 0)))
models.append(('LR std', LogisticRegression()))
models.append(('RF std', RandomForestClassifier(max_depth = 4, criterion = 'entropy')))
models.append(('XGB std', XGBClassifier(max_depth = 4, verbosity = 0)))
models.append(('LR boxcox', LogisticRegression()))
models.append(('RF boxcox', RandomForestClassifier(max_depth = 4, criterion = 'entropy')))
models.append(('XGB boxcox', XGBClassifier(max_depth = 4, verbosity = 0)))
models.append(('LR boxcox normal', LogisticRegression()))
models.append(('RF boxcox normal', RandomForestClassifier(max_depth = 4, criterion = 'entropy')))
models.append(('XGB boxcox normal', XGBClassifier(max_depth = 4, verbosity = 0)))
models.append(('LR boxcox standard', LogisticRegression()))
models.append(('RF boxcox standard', RandomForestClassifier(max_depth = 4, criterion = 'entropy')))
models.append(('XGB boxcox standard', XGBClassifier(max_depth = 4, verbosity = 0)))

# Evaluate each model in a loop
results = []
names = []

for name, model in models:
    # Create KFold validation
    kfold = KFold(n_splits = num_folds, random_state = seed, shuffle = True)
    # Conditions for each treatment
    if ("w/o FE" in name):
        cv_results = cross_val_score(model, 
                                     X_train_balanced, 
                                     y_train_balanced, 
                                     cv = kfold, 
                                     scoring = scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    elif (("boxcox" not in name) & ("normal" in name)):
        cv_results = cross_val_score(model, 
                                     X_train_normal_balanced, 
                                     y_train_normal_balanced, 
                                     cv = kfold, 
                                     scoring = scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    elif (("boxcox" not in name) & ("std" in name)):
        cv_results = cross_val_score(model, 
                                     X_train_std_balanced, 
                                     y_train_std_balanced, 
                                     cv = kfold, 
                                     scoring = scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    elif (("boxcox" in name) & ("normal" in name)):
        cv_results = cross_val_score(model, 
                                     X_train_box_normal_balanced, 
                                     y_train_box_normal_balanced, 
                                     cv = kfold, 
                                     scoring = scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    elif (("boxcox" in name) & ("standard" in name)):
        cv_results = cross_val_score(model, 
                                     X_train_box_std_balanced, 
                                     y_train_box_std_balanced, 
                                     cv = kfold, 
                                     scoring = scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    else: 
        cv_results = cross_val_score(model, 
                                     X_train_box_balanced, 
                                     y_train_box_balanced, 
                                     cv = kfold, 
                                     scoring = scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

# Boxplot to compare algorithms
fig = plt.figure(figsize = [14, 9])
fig.suptitle('Comparing Classification Algorithms')
ax = fig.add_subplot(111)
plt.boxplot(results)
plt.xticks(rotation = 45)
ax.set_xticklabels(names)
plt.show()

<a id="section-seven-two"></a>
## Is XGB classifier overfitting?

As the XGB performances were all pretty close, for simplicity of treatment (and of deployment later on) I'll use the model with no FE.

In [None]:
# Transform test set
X_test_encoded = encoder.transform(X_test[cat_vars])

# Concatenate one hot encoded categorical features with numerical
X_test_new = pd.DataFrame(X_test_encoded, columns = encoder.get_feature_names(cat_vars))

X_test_new = X_test_new.join(X_test[num_vars].reset_index(drop = True))

In [None]:
# Create classifier
model = XGBClassifier(max_depth = 4)

# Create eval set with both training and test set
eval_set = [(X_train_balanced, y_train_balanced), (X_test_new, y_test)]

# Fit model 
model.fit(X_train_balanced, y_train_balanced, eval_metric=["auc", "logloss"], eval_set=eval_set, verbose=False)

# Predict churn probabilities for test data
y_pred = model.predict_proba(X_test_new)[:,1]

# Evaluate predictions (AUC needs scores, not hard class labels)
auc = roc_auc_score(y_test, y_pred)
print("AUC: %.2f%%" % (auc * 100.0))

# Retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['auc'])
x_axis = range(0, epochs)

# Plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()

# Plot AUC
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Test')
ax.legend()
plt.ylabel('AUC')
plt.title('XGBoost AUC')
plt.show()

As we can see, the log loss and AUC curves for the training and test sets stay close over the training rounds, so there is no evidence of overfitting. On the other hand, as stated previously, this is a small dataset, so predictions may drift a little over time.

<a id="section-seven-three"></a>
## Measuring XGBoost Performance on Test set

In [None]:
# Create classifier
xgb = XGBClassifier(max_depth = 4, verbosity = 0)

# Fit model
xgb = xgb.fit(X_train_balanced, y_train_balanced)

# Make Predictions
y_pred = xgb.predict_proba(X_test_new)[:,1]

In [None]:
# Create ROC curve variables
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_pred)

# Print model AUC
print("AUC: %.2f%%" % (roc_auc_score(y_test, y_pred) * 100.0))

# Print best threshold
optimal_idx = np.argmax(true_positive_rate - false_positive_rate)
optimal_threshold = threshold[optimal_idx]
print("Best threshold value is:", optimal_threshold)

# Plotting ROC curves
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
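
The "best threshold" in the cell above is the one that maximizes TPR - FPR, a quantity known as Youden's J statistic. To make the selection explicit, here is a hand-rolled sketch on toy scores; the labels and probabilities are made up, and the loop stands in for what `roc_curve` plus `np.argmax` do above.

```python
import numpy as np

# Toy labels and predicted churn probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

best_threshold, best_j = None, -1.0

for candidate in np.unique(y_score):
    # A sample is predicted positive when its score reaches the threshold
    predicted = y_score >= candidate
    tpr = np.sum(predicted & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(predicted & (y_true == 0)) / np.sum(y_true == 0)
    j = tpr - fpr
    if j > best_j:
        best_threshold, best_j = candidate, j

print("Best threshold:", best_threshold, "Youden's J:", best_j)
```

This criterion weighs false positives and false negatives equally; if missing a churner costs more than a wasted retention offer, a lower threshold may be the better business choice.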

In [None]:
# Convert probabilities to class labels using the optimal threshold
y_pred_round = [1 if pred >= optimal_threshold else 0 for pred in y_pred]

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_round)

cmd = ConfusionMatrixDisplay(cm, display_labels=['not churn','churn'])
cmd.plot()
plt.show()

<a id="section-seven-four"></a>
## Feature Selection

We could lower model complexity by using only the most informative features. This can be achieved with many techniques.

I'll use mutual information as the score function. It is estimated non-parametrically, which means it does not make any assumptions about the distribution of the data. Chi-square is another option here; however, scikit-learn's `chi2` expects non-negative feature values and is meant for count-like or categorical features.
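
For intuition, mutual information measures how much knowing a feature reduces the uncertainty about the target. The sketch below computes it by hand for discrete variables; `mutual_information` is a hypothetical helper on toy arrays, whereas `mutual_info_classif` uses a nearest-neighbour estimator so that it can also handle continuous features.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for x_value in np.unique(x):
        for y_value in np.unique(y):
            p_xy = np.mean((x == x_value) & (y == y_value))
            p_x = np.mean(x == x_value)
            p_y = np.mean(y == y_value)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

target = np.array([0, 0, 1, 1, 0, 0, 1, 1])
informative = np.array([0, 0, 1, 1, 0, 0, 1, 1])    # identical to the target
uninformative = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # independent of the target

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(mi_informative, mi_uninformative)
```

A feature that is independent of the target scores 0, and the score grows as the feature tells us more about the class.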

In [None]:
# Mutual info score
info_score = mutual_info_classif(X_train_balanced, y_train_balanced, n_neighbors = 3, random_state = seed)

mutual_info = pd.DataFrame({'Feature': X_train_balanced.columns.values, 'Mutual Info Gain': info_score})

fig, ax = plt.subplots(figsize = [10,6])
mutual_info.sort_values(by = 'Mutual Info Gain').plot.barh(x = 'Feature', y = 'Mutual Info Gain', ax = ax, title = "Mutual Information Gain by Feature")
plt.show()

In [None]:
# Create KBest processor
best_processor = SelectKBest(score_func = mutual_info_classif, k = 12).fit(X_train_balanced, y_train_balanced)

# Transform training set
columns = X_train_balanced.columns.values[best_processor.get_support(indices=True)]

X_train_balanced_fs = pd.DataFrame(best_processor.transform(X_train_balanced), 
                                   columns = columns)

In [None]:
# Transform test set
X_test_new_fs = X_test_new[columns]

### Train new model

In [None]:
# Create classifier
xgb = XGBClassifier(max_depth = 4, verbosity = 0)

# Tuple of sets
#eval_set = [(X_train_balanced, y_train_balanced), (X_test_new, y_test)]

# Fit model
xgb = xgb.fit(X_train_balanced_fs, y_train_balanced)

# Make Predictions
y_pred = xgb.predict_proba(X_test_new_fs)[:,1]

In [None]:
# Create ROC curve variables
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_pred)

# Print model AUC
print("AUC: %.2f%%" % (roc_auc_score(y_test, y_pred) * 100.0))

# Print best threshold
optimal_idx = np.argmax(true_positive_rate - false_positive_rate)
optimal_threshold = threshold[optimal_idx]
print("Best threshold value is:", optimal_threshold)

# Plotting ROC curves
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Not bad!

In [None]:
# Convert probabilities to class labels using the optimal threshold
y_pred_round = [1 if pred >= optimal_threshold else 0 for pred in y_pred]

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_round)

cmd = ConfusionMatrixDisplay(cm, display_labels=['not churn','churn'])
cmd.plot()

<a id="section-eight"></a>
# Model evaluation

There should be a hyperparameter tuning step before evaluating the final model, but the performance is already so high that I don't think it is necessary in this case.

So, let's jump to understanding why the classifier gave the results it did. For that, the SHAP package is one of the best tools available.

To better understand SHAP: https://github.com/slundberg/shap and https://shap.readthedocs.io/en/latest/index.html

<a id="section-eight-one"></a>
## Final model

Using model with all features to see how changes in each of them impact the outcome

In [None]:
# Create classifier
xgb_final = XGBClassifier(max_depth = 4, verbosity = 0)

# Fit model
xgb_final = xgb_final.fit(X_train_balanced, y_train_balanced)

# Make Predictions
y_pred = xgb_final.predict_proba(X_test_new)[:,1]

<a id="section-eight-two"></a>
## How do feature values impact classification?

In [None]:
# Create explainer object
explainer = shap.TreeExplainer(xgb_final)
shap_values = explainer.shap_values(X_train_balanced)

print("Business person: Why is this person being rated as such? \nData scientist: Historically, people who buy and spend approximately 1.35 times more on Q4 compared to Q1 are less prone to churning")

# Visualize prediction
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train_balanced.iloc[0,:], matplotlib=True)

In [None]:
print("Business person: Cool! Ok, so... what else affects the output in general? \nData scientist: Total transaction count greatly impacts the outcome. In general, people who make many purchases tend to churn less. On the other hand, people who spend a lot are more prone to churning. An explanation for that could be that people with high average ticket sales may cancel their credit card to cut some expenses!")

# Plot 
shap.summary_plot(shap_values, X_train_balanced)

Taking a look at how the combination of values of two features affects the log odds of the outcome can be really insightful as well!

In [None]:
for var in X_train_balanced.columns:
    shap.dependence_plot(var, shap_values, X_train_balanced)
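
Since these plots are on the log-odds scale, it helps to see how SHAP values map back to a probability. Assuming the default `TreeExplainer` settings for a binary XGBoost classifier, the base value plus the sum of a row's SHAP values gives the model's raw output in log odds, and the sigmoid turns it into a churn probability. The numbers below are hypothetical, not taken from the model above.

```python
import numpy as np

def sigmoid(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical numbers for one customer (not taken from the model above)
base_value = -1.2                                # average model output, in log odds
contributions = np.array([0.9, -0.4, 1.5, 0.1])  # per-feature SHAP values

# SHAP values are additive: base value + contributions = raw model output
log_odds = base_value + contributions.sum()

print("Log odds:", log_odds)
print("Churn probability:", sigmoid(log_odds))
```

Positive contributions push the customer towards churn and negative ones away from it, but equal steps in log odds do not translate into equal steps in probability.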