# Table of Contents

* [Bank Customer Churn Modelling](#modelling)
  * [Understanding the dataset](#understanding)
  * [Visualization](#visualization)
  * [Correlation](#correlation)
* [Transformation](#transformation)
  * [Log Balance](#log-balance)
  * [One-Hot Encoding](#one-hot)
* [Principal Component Analysis (optional)](#pca)
* [Partitioning](#partitioning)
* [Prediction](#prediction)
  * [Logistic Regression](#logistic)
    * [scikit](#scikit)
    * [keras](#keras)
    * [log](#log)
  * [k-NN](#knn)
  * [Decision Tree Classifier](#decision)
  * [Neural Network (optional)](#nn)
    * [FCNN](#fcnn)
    * [FCNN with SMOTE](#smote)
  * [Ensemble Methods](#ensemble)
    * [Soft Vote](#soft-vote)
    * [Stacking](#stacking)
    * [AdaBoost](#adaboost)
* [Model Selection](#selection)

# Bank Customer Churn Modelling <a class="anchor" id="modelling"></a>

Predicting customer churn using this [kaggle dataset](https://www.kaggle.com/adammaus/predicting-churn-for-bank-customers)

In [None]:
# Need to do until kaggle supports seaborn 0.11.0 by default (I use histplot)
!pip install seaborn --upgrade

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from sklearn import tree
from sklearn.decomposition import PCA
from sklearn.ensemble import VotingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, roc_curve, auc
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler
from sklearn.tree import DecisionTreeClassifier

from imblearn.over_sampling import SMOTE
import scikitplot as skplt
from mlxtend.classifier import EnsembleVoteClassifier

from numpy.random import seed
seed(12345)
tf.random.set_seed(12345)

%matplotlib inline

## Understanding the dataset <a class="anchor" id="understanding"></a>

In [None]:
bank_data = pd.read_csv("../input/predicting-churn-for-bank-customers/Churn_Modelling.csv");
display(bank_data.shape) # rows & columns

In [None]:
bank_data.head()

In [None]:
bank_data.dtypes

There are 10000 records. RowNumber and CustomerId are unique for every row, so they carry no predictive information and we will remove them. Surname should not carry any useful information either (unless we are profiling by name, which we will not).

In [None]:
display(bank_data.isnull().sum()) # display missing values

In [None]:
display(bank_data.nunique()) # display unique values

There are 10000 rows and no missing values. CustomerId and RowNumber are both unique per row and can be removed. Surname will also be removed.

In [None]:
bank_data.drop(columns=['RowNumber','CustomerId','Surname'], inplace=True);

## Visualization <a class="anchor" id="visualization"></a>

In [None]:
# Display all numeric values
fig, axes = plt.subplots(2, 3, figsize=(16, 8));
sns.histplot(ax=axes[0, 0], data=bank_data["CreditScore"]);
sns.histplot(ax=axes[0, 1], data=bank_data["Age"]);
sns.histplot(ax=axes[0, 2], data=bank_data["Tenure"]);
sns.histplot(ax=axes[1, 0], data=bank_data["Balance"]);
sns.histplot(ax=axes[1, 1], data=bank_data["NumOfProducts"]);
sns.histplot(ax=axes[1, 2], data=bank_data["EstimatedSalary"]);

In [None]:
# Plot all numerical variables versus each other
bd_numeric = bank_data[["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary", "Exited"]];
sns.pairplot(bd_numeric, hue="Exited");

Observations:

* Balance is bimodal; a log transformation may be worth trying.
* People with more products tend to exit.
* Older people tend to exit.

In [None]:
# Plot the categorical variables
fig, axes = plt.subplots(2, 3, figsize=(16, 8));
sns.countplot(ax=axes[0, 0], data=bank_data, x="Geography", hue="Exited");
sns.countplot(ax=axes[0, 1], data=bank_data, x="Gender", hue="Exited");
sns.countplot(ax=axes[0, 2], data=bank_data, x="HasCrCard", hue="Exited");
sns.countplot(ax=axes[1, 0], data=bank_data, x="IsActiveMember", hue="Exited");
sns.countplot(ax=axes[1, 1], data=bank_data, x="Exited");
fig.delaxes(axes[1][2]);

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(14, 16));
sns.boxplot(ax=axes[0, 0], data=bank_data, x="Exited", y="CreditScore");
sns.boxplot(ax=axes[0, 1], data=bank_data, x="Exited", y="Age");
sns.boxplot(ax=axes[1, 0], data=bank_data, x="Exited", y="Tenure");
sns.boxplot(ax=axes[1, 1], data=bank_data, x="Exited", y="Balance");
sns.boxplot(ax=axes[2, 0], data=bank_data, x="Exited", y="NumOfProducts");
sns.boxplot(ax=axes[2, 1], data=bank_data, x="Exited", y="EstimatedSalary");

In [None]:
display(pd.crosstab(bank_data["Exited"], bank_data["Geography"], margins=True, normalize=False))
display(pd.crosstab(bank_data["Exited"], bank_data["Geography"], margins=True, normalize=True))

In [None]:
display(pd.crosstab(bank_data["Exited"], bank_data["Gender"], margins=True, normalize=False))
display(pd.crosstab(bank_data["Exited"], bank_data["Gender"], margins=True, normalize=True))

In [None]:
display(pd.crosstab(bank_data["Exited"], bank_data["HasCrCard"], margins=True, normalize=False))
display(pd.crosstab(bank_data["Exited"], bank_data["HasCrCard"], margins=True, normalize=True))

In [None]:
display(pd.crosstab(bank_data["Exited"], bank_data["IsActiveMember"], margins=True, normalize=False))
display(pd.crosstab(bank_data["Exited"], bank_data["IsActiveMember"], margins=True, normalize=True))

## Correlation <a class="anchor" id="correlation"></a>

In [None]:
plt.figure(figsize=(16,4));
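# Restrict to numeric columns; Geography and Gender are still strings at this point
sns.heatmap(bank_data.select_dtypes(include="number").corr(), annot=True, fmt=".2f", vmin=-1.0, vmax=1, cmap="Spectral");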

# Transformation <a class="anchor" id="transformation"></a>

In [None]:
# Copy so that the new columns added below do not modify bank_data itself
bank_data_transformed = bank_data.copy();

## Log_Balance <a class="anchor" id="log-balance"></a>
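The logarithm of 0 is undefined, so all 0 balances are changed to 1 first (log10(1) = 0). The next cell counts the negative, zero, and small (1 to 1000) balances to confirm that this replacement is still valid for all of the data.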
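
As a minimal sketch of the replacement described above (the balances below are made-up illustrative values, not rows from the dataset), mapping 0 to 1 before taking the base-10 logarithm sends zero balances to exactly 0 and keeps every result finite:

```python
import numpy as np
import pandas as pd

# Made-up balances for illustration only
balance = pd.Series([0.0, 1500.0, 98000.0, 0.0, 125000.0])

# log10(0) is -inf, so zero balances are mapped to 1 first (log10(1) == 0)
balance_1 = balance.replace(0, 1)
log10_balance_1 = np.log10(balance_1)

print(log10_balance_1.round(3).tolist())
```

An alternative with the same intent is `np.log1p`, which computes log(1 + x) without the explicit replacement, although it uses the natural logarithm rather than base 10.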

In [None]:
print("         Balance < 0: ", bank_data_transformed[bank_data_transformed["Balance"].lt(0)].shape[0])
print("         Balance = 0: ", bank_data_transformed[bank_data_transformed["Balance"].eq(0)].shape[0])
print("1 <= Balance <= 1000: ", bank_data_transformed[bank_data_transformed["Balance"].between(1, 1000)].shape[0])

In [None]:
bank_data_transformed["Balance_1"] = bank_data_transformed["Balance"].replace(0, 1);
print("         Balance_1 = 0: ", bank_data_transformed[bank_data_transformed["Balance_1"].eq(0)].shape[0])
print("         Balance_1 = 1: ", bank_data_transformed[bank_data_transformed["Balance_1"].eq(1)].shape[0])

In [None]:
bank_data_transformed["Log10_Balance_1"] = np.log10(bank_data_transformed["Balance_1"]);

In [None]:
sns.histplot(data=bank_data_transformed["Log10_Balance_1"]);

## One-Hot Encoding <a class="anchor" id="one-hot"></a>

In [None]:
bank_data_transformed = pd.concat([bank_data_transformed,pd.get_dummies(bank_data_transformed['Geography'], prefix='Geography')],axis=1)
bank_data_transformed = pd.concat([bank_data_transformed,pd.get_dummies(bank_data_transformed['Gender'], prefix='Gender')],axis=1)
bank_data_transformed.head()

# Principal Component Analysis (Optional) <a class="anchor" id="pca"></a>

In [None]:
standard_scaler = StandardScaler();
scaled_columns = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_France", "Geography_Germany", "Geography_Spain", "Gender_Male", "Gender_Female"]
scaled_x = pd.DataFrame(standard_scaler.fit_transform(bank_data_transformed[scaled_columns]), columns=scaled_columns)
scaled_y = bank_data_transformed["Exited"]

pca = PCA()
pca.fit(scaled_x)

fig, axes = plt.subplots(1, 2, figsize=(16, 6));

skplt.decomposition.plot_pca_component_variance(clf=pca, ax=axes[0]);
skplt.decomposition.plot_pca_2d_projection(pca, scaled_x, scaled_y, ax=axes[1]);

# Partitioning (60:20:20) <a class="anchor" id="partitioning"></a>

In [None]:
random_seed=12345
bdt_train = bank_data_transformed.sample(frac=0.6, random_state=random_seed) # Train 60%
bdt_test = bank_data_transformed.drop(bdt_train.index)

bdt_validate = bdt_test.sample(frac=0.5,random_state=random_seed)   # Validate = 50% of remaining 40% ==> 20%
bdt_test = bdt_test.drop(bdt_validate.index)                        # Test = 20%

print("training set: ", len(bdt_train))
print("validation set: ", len(bdt_validate))
print("test set: ", len(bdt_test))

display(bdt_train.head());
display(bdt_validate.head());
display(bdt_test.head());
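
The cell above draws the three partitions with `DataFrame.sample`. As a hedged alternative sketch (not what this notebook does), scikit-learn's `train_test_split` can produce the same 60:20:20 proportions and, with `stratify`, keep the class ratio the same in every partition. The toy arrays below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a 80/20 class imbalance (invented for illustration)
x = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# First split: 60% train, 40% held out, preserving the class ratio
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, train_size=0.6, stratify=y, random_state=12345)

# Second split: halve the held-out 40% into validation and test (20% each)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=12345)

print(len(x_train), len(x_val), len(x_test))
```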

# Prediction <a class="anchor" id="prediction"></a>

In [None]:
# Predict output, confusion matrix, and ROC curve
#def create_summary(name, data, prediction, metrics, summaries):
def create_summary(name, y, y_scores, train_accuracy, val_accuracy, summaries):
    fpr, tpr, threshold = roc_curve(y.to_numpy(), y_scores[:,1])    
    optimal_idx = np.argmax(tpr - fpr) # Youden's J-statistic
    optimal_threshold = threshold[optimal_idx]

    print("[" + name + "] Optimal threshold: {:.3}".format(optimal_threshold))
    # roc_curve treats score >= threshold as positive, so use >= here as well
    y_hat = y_scores[:,1] >= optimal_threshold
    
    name = name + " (c={:.3})".format(optimal_threshold);

    cnf_matrix = confusion_matrix(y, y_hat)
    display(cnf_matrix)
    print(classification_report(y, y_hat))
    
    y_hat_0_5 = y_scores[:,1] > 0.5
    auc_val = auc(fpr, tpr)

    summary = pd.DataFrame([[name,
                             train_accuracy,
                             val_accuracy,
                             accuracy_score(y, y_hat_0_5),
                             f1_score(y, y_hat_0_5),
                             auc_val,
                             tpr,
                             fpr,
                             cnf_matrix,
                             y_scores[:,1]
                            ]], 
                           columns=summary_column_names);
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    summaries = pd.concat([summaries, summary], ignore_index=True);

    fig = sns.heatmap(cnf_matrix, annot=True, fmt="d", linewidths=.5, cmap="Blues");
    fig.set_title("Confusion Matrix (Exited=1)");
    fig.set_ylabel("Actual");
    fig.set_xlabel("Predicted");

    title = name + ": " + str(round(auc_val, 3))
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 16));
    fig.suptitle(title)
    skplt.metrics.plot_roc(ax=axes[0][0], y_true=y, y_probas=y_scores)
    skplt.metrics.plot_ks_statistic(ax=axes[0][1], y_true=np.ravel(y), y_probas=y_scores)
    skplt.metrics.plot_cumulative_gain(ax=axes[1][0], y_true=y, y_probas=y_scores)
    skplt.metrics.plot_lift_curve(ax=axes[1][1], y_true=y, y_probas=y_scores)

    return summaries;

summary_column_names = ["method", "training_accuracy", "validation_accuracy", "test_accuracy", "f1_score", "auc", "tpr", "fpr", "cnf_matrix", "y_scores"]
summaries = pd.DataFrame(columns = summary_column_names)
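
To illustrate the threshold selection inside `create_summary`, here is a small sketch of Youden's J statistic (J = TPR - FPR) on toy labels and scores. The numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores, invented for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR
j = tpr - fpr
optimal_threshold = thresholds[np.argmax(j)]

# roc_curve counts a sample as positive when score >= threshold
y_hat = y_score >= optimal_threshold

print(optimal_threshold, y_hat.astype(int))
```

The threshold that maximizes J is the point on the ROC curve farthest above the diagonal, which is why it can differ noticeably from the default cut-off of 0.5 on an imbalanced dataset.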

In [None]:
# summarize results
def summarize_grid_results(grid_result):
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

## Logistic Regression <a class="anchor" id="logistic"></a>

In [None]:
# First I am going to scale all inputs from 0 to 1 using the min-max method
# minmax = (x_i - x_min) / (x_max - x_min)
minmax_scaler = MinMaxScaler()

lr_columns_x = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_France", "Geography_Germany", "Gender_Male"]
lr_columns_y = "Exited"

lr_train_x = pd.DataFrame(minmax_scaler.fit_transform(bdt_train[lr_columns_x]), columns=lr_columns_x)
lr_train_y = bdt_train[lr_columns_y]

# Reuse the min/max learned from the training set on the held-out sets
lr_val_x = pd.DataFrame(minmax_scaler.transform(bdt_validate[lr_columns_x]), columns=lr_columns_x)
lr_val_y = bdt_validate[lr_columns_y]

lr_test_x = pd.DataFrame(minmax_scaler.transform(bdt_test[lr_columns_x]), columns=lr_columns_x)
lr_test_y = bdt_test[lr_columns_y]
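
The min-max formula in the comment above can be sketched on toy numbers. A common convention, assumed here, is to learn x_min and x_max from the training split only and reuse them on held-out data, so held-out values may land slightly outside [0, 1]. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature splits, invented for illustration
train = np.array([[0.0], [5.0], [10.0]])
held_out = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)     # learns x_min=0, x_max=10
held_out_scaled = scaler.transform(held_out)   # reuses the training x_min/x_max

print(train_scaled.ravel(), held_out_scaled.ravel())
```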

In [None]:
log_columns_x = ["CreditScore", "Age", "Tenure", "Log10_Balance_1", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_France", "Geography_Germany", "Gender_Male"]
log_columns_y = "Exited"

log_train_x = pd.DataFrame(minmax_scaler.fit_transform(bdt_train[log_columns_x]), columns=log_columns_x)
log_train_y = bdt_train[log_columns_y]

# Reuse the min/max learned from the training set on the held-out sets
log_val_x = pd.DataFrame(minmax_scaler.transform(bdt_validate[log_columns_x]), columns=log_columns_x)
log_val_y = bdt_validate[log_columns_y]

log_test_x = pd.DataFrame(minmax_scaler.transform(bdt_test[log_columns_x]), columns=log_columns_x)
log_test_y = bdt_test[log_columns_y]

### Simple Logistic Regression (scikit-learn) <a class="anchor" id="scikit"></a>

In [None]:
lr_model = LogisticRegression(solver='liblinear', C=10.0, random_state=0)
lr_model.fit(lr_train_x, lr_train_y)
lr_y_scores = lr_model.predict_proba(lr_test_x);

In [None]:
training_accuracy = lr_model.score(lr_train_x, lr_train_y)
validation_accuracy = lr_model.score(lr_val_x, lr_val_y)

summaries = create_summary("slr-scikit",
                           lr_test_y, 
                           lr_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);

### Simple Logistic Regression (Keras) <a class="anchor" id="keras"></a>

In [None]:
# 2-class logistic regression in Keras
lrk_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
lrk_model = Sequential()
lrk_model.add(Dense(units=1, kernel_initializer='glorot_uniform', activation='sigmoid', input_dim=lr_train_x.shape[1]))
lrk_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['binary_accuracy'])
lrk_history = lrk_model.fit(x=lr_train_x, y=lr_train_y, 
                            batch_size=64,
                            epochs=500, validation_data=(lr_val_x, lr_val_y), 
                            callbacks=[lrk_callback],
                            verbose=0);

**Note: This model takes around 30s to run**

In [None]:
fig = sns.lineplot(data=lrk_history.history['loss']);
fig.set(ylabel="loss", xlabel = "epoch");

In [None]:
# Predict output, confusion matrix, and ROC curve
lrk_y_scores = lrk_model.predict(lr_test_x);
lrk_y_scores = np.append(1-lrk_y_scores, lrk_y_scores, axis=1)

training_accuracy = lrk_history.history["binary_accuracy"][-1]
validation_accuracy = lrk_history.history["val_binary_accuracy"][-1]

summaries = create_summary("slr-keras",
                           lr_test_y, 
                           lrk_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);
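
`create_summary` reads the class-1 probability from `y_scores[:,1]`, which is the layout that scikit-learn's `predict_proba` returns. A single sigmoid unit only outputs one column, so the cell above expands it to two. A minimal sketch of that step, with invented probabilities:

```python
import numpy as np

# Shape (n, 1), like the output of a single sigmoid unit (values invented)
p_exit = np.array([[0.2], [0.9], [0.5]])

# Column 0 = P(Exited=0), column 1 = P(Exited=1), matching predict_proba
y_scores = np.append(1 - p_exit, p_exit, axis=1)

print(y_scores.shape)
```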

### Logistic Regression with Log_Balance and Regularization (Optional) <a class="anchor" id="log"></a>

In [None]:
log_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

log_parameters = {
    'epochs': 500,
    'verbose': 0
}

def create_log_model(kernel_initializer='zeros',
                     activation='sigmoid', 
                     l1_lambda=0.001,
                     l2_lambda=0.0, 
                     optimizer='sgd',
                     loss='binary_crossentropy', 
                     metrics=['accuracy']):
    model = Sequential()
    model.add(Dense(units=1,
                    kernel_initializer=kernel_initializer,
                    activation=activation,
                    input_dim=log_train_x.shape[1], 
                    kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1_lambda, l2=l2_lambda),
                    bias_initializer='zeros'))
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

    return model;

In [None]:
# Tuning the model

# best: sgd
#log_optimizer=['rmsprop', 'sgd', 'adam']
#log_param_grid = dict(optimizer=log_optimizer)

# best: zeros
#log_kernel_initializer=['glorot_uniform', 'glorot_normal', 'random_normal', 'zeros']
#log_param_grid = dict(kernel_initializer=log_kernel_initializer)

# best: 0.001
#log_l1_lambda=[0.0, 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3]
#log_param_grid = dict(l1_lambda=log_l1_lambda)

# best: 0.0
# log_l2_lambda=[0.0, 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3]
# log_param_grid = dict(l2_lambda=log_l2_lambda)

# perform a Grid Search
#log_model = KerasClassifier(build_fn=create_log_model, 
#                            epochs=log_parameters['epochs'], 
#                            verbose=log_parameters['verbose'])

#log_grid = GridSearchCV(estimator=log_model, param_grid=log_param_grid, n_jobs=-1)
#log_grid_result = log_grid.fit(X=log_train_x, y=log_train_y, callbacks=[log_callback])

#summarize_grid_results(log_grid_result)

In [None]:
# uses the parameters tuned above
log_model = create_log_model()
log_history = log_model.fit(x=log_train_x, y=log_train_y,
                            epochs=log_parameters['epochs'],
                            callbacks=[log_callback],
                            validation_data=(log_val_x, log_val_y),
                            verbose=log_parameters['verbose']);

**Note: This model takes ~30s to run**

In [None]:
fig = sns.lineplot(data=log_history.history['loss']);
fig.set(ylabel="loss", xlabel = "epoch");

In [None]:
# Predict output, confusion matrix, and ROC curve
log_y_scores = log_model.predict(log_test_x);
log_y_scores = np.append(1-log_y_scores, log_y_scores, axis=1)

training_accuracy = log_history.history["accuracy"][-1]
validation_accuracy = log_history.history["val_accuracy"][-1]

summaries = create_summary("slr-log-reg",
                           log_test_y, 
                           log_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);

## k-NN <a class="anchor" id="knn"></a>

In [None]:
standard_scaler = StandardScaler();
knn_x_columns = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_France", "Geography_Germany", "Geography_Spain", "Gender_Male", "Gender_Female"]
knn_y_columns = ["Exited"]

knn_train_x = pd.DataFrame(standard_scaler.fit_transform(bdt_train[knn_x_columns]), columns=knn_x_columns)
knn_train_y = bdt_train[knn_y_columns]

# Reuse the mean/std learned from the training set on the held-out sets
knn_val_x = pd.DataFrame(standard_scaler.transform(bdt_validate[knn_x_columns]), columns=knn_x_columns)
knn_val_y = bdt_validate[knn_y_columns]

knn_test_x = pd.DataFrame(standard_scaler.transform(bdt_test[knn_x_columns]), columns=knn_x_columns)
knn_test_y = bdt_test[knn_y_columns]

In [None]:
neighbor_column_names = ["n", "validation_accuracy", "f1_score"]
neighbors = pd.DataFrame(columns = neighbor_column_names)

rows = []
for n in range(1, 30):
    knn_model = KNeighborsClassifier(n_neighbors=n)
    knn_model.fit(knn_train_x, np.ravel(knn_train_y))
    # Probability of class 1, thresholded at 0.5
    y_scores = knn_model.predict_proba(knn_test_x)[:, 1]
    y_pred = y_scores > 0.5
    rows.append([n,
                 knn_model.score(knn_val_x, knn_val_y),
                 f1_score(knn_test_y, y_pred)])

# Build the frame once; DataFrame.append was removed in pandas 2.0
neighbors = pd.DataFrame(rows, columns=neighbor_column_names)

In [None]:
reshaped_neighbors = pd.melt(neighbors, id_vars="n", var_name="metric", value_name="values")
plt.figure(figsize=(16,4))
sns.set_style("whitegrid")
fig = sns.lineplot(data=reshaped_neighbors, x="n", y="values", hue="metric");
fig.set_title("kNN(n = x) comparison");
fig.legend(loc='best');
fig.set_ylim([0, 1.05]);

In [None]:
max_k_neighbor = neighbors.at[neighbors["validation_accuracy"].idxmax(), "n"]
knn_model = KNeighborsClassifier(n_neighbors=max_k_neighbor)
knn_model.fit(knn_train_x, np.ravel(knn_train_y))
knn_y_scores = knn_model.predict_proba(knn_test_x);

In [None]:
title = "kNN [n=" + str(max_k_neighbor) + "]"
print("The best kNN fit was " + title)

training_accuracy = knn_model.score(knn_train_x, knn_train_y)
validation_accuracy = knn_model.score(knn_val_x, knn_val_y)

summaries = create_summary(title, 
                           knn_test_y, 
                           knn_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);


## Decision Tree Classifier <a class="anchor" id="decision"></a>

In [None]:
dtc_columns_x = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_France", "Geography_Germany", "Geography_Spain", "Gender_Male", "Gender_Female"]
dtc_columns_y = "Exited"

dtc_train_x = pd.DataFrame(minmax_scaler.fit_transform(bdt_train[dtc_columns_x]), columns=dtc_columns_x)
dtc_train_y = bdt_train[dtc_columns_y]

# Reuse the min/max learned from the training set on the held-out sets
dtc_val_x = pd.DataFrame(minmax_scaler.transform(bdt_validate[dtc_columns_x]), columns=dtc_columns_x)
dtc_val_y = bdt_validate[dtc_columns_y]

dtc_test_x = pd.DataFrame(minmax_scaler.transform(bdt_test[dtc_columns_x]), columns=dtc_columns_x)
dtc_test_y = bdt_test[dtc_columns_y]

In [None]:
# Tuning a classification tree from scikit
# https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(dtc_train_x, dtc_train_y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

fig, axes = plt.subplots(2, 2, figsize=(16, 16));
fig = sns.lineplot(ax=axes[0][0], x=ccp_alphas[:-1], y=impurities[:-1], marker='o', drawstyle="steps-post")
fig.set_xlabel("effective alpha")
fig.set_ylabel("total impurity of leaves")
fig.set_title("Total Impurity vs effective alpha for training set")

clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(dtc_train_x, dtc_train_y)
    clfs.append(clf)

print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]

fig=sns.lineplot(ax=axes[0][1], x=ccp_alphas, y=node_counts, marker='o', drawstyle="steps-post")
fig.set_xlabel("alpha")
fig.set_ylabel("number of nodes")
fig.set_title("Number of nodes vs alpha")

fig=sns.lineplot(ax=axes[1][0], x=ccp_alphas, y=depth, marker='o', drawstyle="steps-post")
fig.set_xlabel("alpha")
fig.set_ylabel("depth of tree")
fig.set_title("Depth vs alpha")

train_scores = [clf.score(dtc_train_x, dtc_train_y) for clf in clfs]
val_scores = [clf.score(dtc_val_x, dtc_val_y) for clf in clfs]

fig=sns.lineplot(ax=axes[1][1], x=ccp_alphas, y=train_scores, marker='o', label="train",
        drawstyle="steps-post")
fig.plot(ccp_alphas, val_scores, marker='o', label="validation",
        drawstyle="steps-post")
fig.set_xlabel("alpha")
fig.set_ylabel("accuracy")
fig.set_title("Accuracy vs alpha for training and validation sets")
fig.legend();


In [None]:
# find the highest accuracy on validation set
max_idx = np.argmax(val_scores)
max_ccp_alpha=ccp_alphas[max_idx]
max_val_accuracy=val_scores[max_idx]
#max_ccp_alpha = neighbors.at[neighbors["validation_accuracy"].idxmax(), "n"]
dtc_title = "CART [α={:.4}]".format(max_ccp_alpha)
print("The best CART ccp_alpha was " + dtc_title)
print("Best validation accuracy: {:.4} with α={:.4}".format(
      max_val_accuracy, max_ccp_alpha)
     )

In [None]:
# Using the tree selected above
dtc_model = DecisionTreeClassifier(random_state=random_seed, ccp_alpha=max_ccp_alpha)
dtc_tree = dtc_model.fit(dtc_train_x, dtc_train_y)
dtc_y_scores = dtc_model.predict_proba(dtc_test_x);

In [None]:
fig = plt.figure(figsize=(16,16))
tree.plot_tree(dtc_tree,
               feature_names=dtc_columns_x,
               class_names=["0", "1"],
               filled=True);
plt.title(dtc_title);

In [None]:
training_accuracy = dtc_model.score(dtc_train_x, dtc_train_y)
validation_accuracy = dtc_model.score(dtc_val_x, dtc_val_y)

summaries = create_summary(dtc_title, 
                           dtc_test_y, 
                           dtc_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);


## Neural Network (Optional) <a class="anchor" id="nn"></a>

In [None]:
## Inputs
## Dense, 32 units (relu)
## Dropout (0.3)
## Dense, 64 units (relu)
## Dropout (0.3)
## Output - sigmoid

nn_parameters = {
    'kernel_initializer': 'glorot_uniform',
    'bias_initializer': 'zeros',
    'l2_regularizer': 0.003,
    'first_layer': 32,
    'second_layer': 64,
    'third_layer': 0,
    'dropout': 0.3,
    'activation': 'relu',
    'learning_rate': 0.001,
    'optimizer': 'adam',
    'batch_size': 32,
    'epochs': 200,
    'verbose': 0
}

In [None]:
def create_nn_model(parameters):
    model = Sequential()
    model.add(Dense(parameters['first_layer'],
                input_dim=nn_train_x.shape[1],
                kernel_regularizer=tf.keras.regularizers.l2(parameters['l2_regularizer']),  
                activation=parameters['activation'],
                kernel_initializer=parameters['kernel_initializer'],
                bias_initializer=parameters['bias_initializer']))
    model.add(Dropout(rate=parameters['dropout'], seed=random_seed))
    model.add(Dense(parameters['second_layer'],
                kernel_regularizer=tf.keras.regularizers.l2(parameters['l2_regularizer']), 
                activation=parameters['activation'],
                kernel_initializer=parameters['kernel_initializer'],
                bias_initializer=parameters['bias_initializer']))
    model.add(Dropout(rate=parameters['dropout'], seed=random_seed))

    if (parameters['third_layer'] != 0):
        model.add(Dense(parameters['third_layer'],
                    kernel_regularizer=tf.keras.regularizers.l2(parameters['l2_regularizer']), 
                    activation=parameters['activation'],
                    kernel_initializer=parameters['kernel_initializer'],
                    bias_initializer=parameters['bias_initializer']))
        model.add(Dropout(rate=parameters['dropout'], seed=random_seed))

    model.add(Dense(1, activation='sigmoid',
                kernel_initializer=parameters['kernel_initializer'],
                bias_initializer=parameters['bias_initializer']))

    # Build a fresh optimizer per model instead of overwriting the shared dict entry
    optimizer = parameters['optimizer']
    if optimizer == 'adam':
        optimizer = keras.optimizers.Adam(learning_rate=parameters['learning_rate'])

    model.compile(loss = 'binary_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model;

In [None]:
def tune_nn_model(kernel_initializer = nn_parameters['kernel_initializer'],
                  bias_initializer = nn_parameters['bias_initializer'],
                  l2_regularizer = nn_parameters['l2_regularizer'],
                  first_layer = nn_parameters['first_layer'],
                  second_layer = nn_parameters['second_layer'],
                  third_layer=nn_parameters['third_layer'],
                  dropout = nn_parameters['dropout'],
                  activation = nn_parameters['activation'],
                  optimizer = nn_parameters['optimizer'],
                  learning_rate = nn_parameters['learning_rate'],
                  batch_size = nn_parameters['batch_size'],
                  epochs = nn_parameters['epochs'],
                  verbose = nn_parameters['verbose']):
    parameters = {
        'kernel_initializer': kernel_initializer,
        'bias_initializer': bias_initializer,
        'l2_regularizer': l2_regularizer,
        'first_layer': first_layer,
        'second_layer': second_layer,
        'third_layer': third_layer,
        'dropout': dropout,
        'activation': activation,
        'optimizer': optimizer,
        'learning_rate': learning_rate,
        'batch_size': batch_size,
        'epochs': epochs,
        'verbose': verbose
    }
    return create_nn_model(parameters);
    

### FCNN <a class="anchor" id="fcnn"></a>

In [None]:
quantile_transformer = QuantileTransformer(output_distribution='normal')
nn_columns_x = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_France", "Geography_Germany", "Geography_Spain", "Gender_Male", "Gender_Female"]
nn_columns_y = ["Exited"]

nn_train_x = pd.DataFrame(quantile_transformer.fit_transform(bdt_train[nn_columns_x]), columns=nn_columns_x)
nn_train_y = bdt_train[nn_columns_y]

# Reuse the quantiles learned from the training set on the held-out sets
nn_val_x = pd.DataFrame(quantile_transformer.transform(bdt_validate[nn_columns_x]), columns=nn_columns_x)
nn_val_y = bdt_validate[nn_columns_y]

nn_test_x = pd.DataFrame(quantile_transformer.transform(bdt_test[nn_columns_x]), columns=nn_columns_x)
nn_test_y = bdt_test[nn_columns_y]

In [None]:
## Neural network architecture
## https://medium.com/finc-engineering/user-churn-prediction-using-neural-network-with-keras-c48f23ef4e8b
## http://drunkendatascience.com/predicting-customer-churn-with-neural-networks-in-keras/
## - Dropouts to decrease overfitting (if necessary)
nn_model = KerasClassifier(create_nn_model, 
                           parameters=nn_parameters,
                           batch_size=nn_parameters['batch_size'],
                           epochs=nn_parameters['epochs'], 
                           verbose=nn_parameters['verbose']);

nn_history = nn_model.fit(nn_train_x, nn_train_y, 
                          validation_data=(nn_val_x, nn_val_y));


In [None]:
tune_nn = False
if tune_nn:
    #Best: 0.842326 using {'learning_rate': 0.001}
    #learning_rates = [0.00001, 0.00003, 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3]

    #Best: 0.845853 using {'batch_size': 128}
    #batch_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096]
    batch_sizes = [32, 128]

    # Best fit: adam
    #optimizers=['sgd', 'rmsprop', 'adam']

    #Best: 0.853359 using {'l2_regularizer': 0.003}
    #l2_regularizers=[0, 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3]

    #Best: 0.846593 using {'first_layer': 32, 'second_layer': 64, 'third_layer': 0}
    #first_layers = [16, 32, 64]
    #second_layers = [32, 64, 128]
    #third_layers = [0, 32, 64, 128, 256]

    #Best: 0.859833 using {'dropout': 0.3}
    #dropouts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

    nn_param_grid = dict(
        #learning_rate=learning_rates,
        #l2_regularizer=l2_regularizers,
        #optimizer=optimizers,
        batch_size=batch_sizes,
        #first_layer=first_layers,
        #second_layer=second_layers,
        #third_layer=third_layers,
        #dropout=dropouts
    )
    
    nn_model_tune = KerasClassifier(tune_nn_model,
                                    batch_size=nn_parameters['batch_size'],
                                    epochs=nn_parameters['epochs'], 
                                    verbose=nn_parameters['verbose']);

    nn_grid = GridSearchCV(estimator=nn_model_tune, 
                           param_grid=nn_param_grid,
                           scoring='roc_auc',
                           n_jobs=2)
    
    nn_grid_result = nn_grid.fit(X=nn_train_x, y=nn_train_y,
                                 validation_data=(nn_val_x, nn_val_y))
    summarize_grid_results(nn_grid_result)

**Note: This model takes ~60s to run**

In [None]:
fig = sns.lineplot(data=nn_history.history['loss']);
fig.set(ylabel="loss", xlabel = "epoch");

In [None]:
nn_y_scores = nn_model.predict_proba(nn_test_x);

training_accuracy = nn_model.score(nn_train_x, nn_train_y)
validation_accuracy = nn_model.score(nn_val_x, nn_val_y)

summaries = create_summary("neural-network", 
                           nn_test_y, 
                           nn_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);

### FCNN w/ SMOTE <a class="anchor" id="smote"></a>

The NN seems to do a good job of classifying class 1, but we still do not see much gain in class 0. This may be because 80% of the data is class 0, so the default position of the algorithm and dataset is just to predict 0. I am going to use the Synthetic Minority Oversampling Technique (SMOTE) to augment the minority class and see if we realize improvements.
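As a rough illustration of what SMOTE does, the sketch below creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. It uses only NumPy and is a simplified stand-in, not the `imblearn` implementation; the function name `smote_like_oversample` and the toy data are assumptions made for this example.

```python
import numpy as np

def smote_like_oversample(minority_x, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority rows
    diffs = minority_x[:, None, :] - minority_x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest rows

    # Pick a random base row and one of its k neighbours for each new sample
    base_idx = rng.integers(0, len(minority_x), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority_x[base_idx] + gaps * (minority_x[nb_idx] - minority_x[base_idx])

# Hypothetical minority-class rows with two scaled features
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(minority, n_new=6)
print(synthetic.shape)
```

Each synthetic row lies on the line segment between two real minority rows, which is why the features should be scaled before oversampling.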

In [None]:
# Create the SMOTE dataset
smote_train_x = nn_train_x;
smote_train_y = nn_train_y;

smote_val_x = nn_val_x;
smote_val_y = nn_val_y;

smote_test_x = nn_test_x;
smote_test_y = nn_test_y;

In [None]:
print("Before SMOTE, count (y = 1): ", np.sum(smote_train_y==1))
print("Before SMOTE, count (y = 0): ", np.sum(smote_train_y==0))

sm = SMOTE(random_state=2, k_neighbors=max_k_neighbor)
smote_train_x, smote_train_y = sm.fit_resample(smote_train_x, smote_train_y)

print("After SMOTE, count (y = 1): ", np.sum(smote_train_y==1))
print("After SMOTE, count (y = 0): ", np.sum(smote_train_y==0))

In [None]:
# Increase epochs (work on a copy so that nn_parameters itself is not modified)
smote_parameters = nn_parameters.copy();
smote_parameters['epochs'] = 500;

# We use the same NN above and train it w/ the smote dataset.
smote_model = KerasClassifier(create_nn_model, 
                              parameters=smote_parameters,
                              batch_size=smote_parameters['batch_size'],
                              epochs=smote_parameters['epochs'], 
                              verbose=smote_parameters['verbose']);

smote_history = smote_model.fit(smote_train_x, smote_train_y, 
                             validation_data=(smote_val_x, smote_val_y));



**The above model takes ~5m to train**

In [None]:
# Plot the loss function
fig = sns.lineplot(data=smote_history.history['loss']);
fig.set(ylabel="loss", xlabel = "epoch");

In [None]:
# Print the summary information
smote_y_scores = smote_model.predict_proba(smote_test_x);

training_accuracy = smote_model.score(smote_train_x, smote_train_y)
validation_accuracy = smote_model.score(smote_val_x, smote_val_y)

summaries = create_summary("nn-smote", 
                           smote_test_y, 
                           smote_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);

## Ensemble <a class="anchor" id="ensemble"></a>

In [None]:
ensemble_columns_x = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary", "Geography_France", "Geography_Germany", "Geography_Spain", "Gender_Male", "Gender_Female"]
ensemble_columns_y = ["Exited"]

ensemble_train_x = pd.DataFrame(bdt_train[ensemble_columns_x], columns=ensemble_columns_x)
ensemble_train_y = bdt_train[ensemble_columns_y]

ensemble_val_x = pd.DataFrame(bdt_validate[ensemble_columns_x], columns=ensemble_columns_x)
ensemble_val_y = bdt_validate[ensemble_columns_y]

ensemble_test_x = pd.DataFrame(bdt_test[ensemble_columns_x], columns=ensemble_columns_x)
ensemble_test_y = bdt_test[ensemble_columns_y]

### Soft Vote <a class="anchor" id="soft-vote"></a>

The soft voting consensus method averages the predicted churn probabilities of the individual models.
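A minimal sketch of the averaging step, using made-up probabilities for three models and four customers (the numbers and variable names here are illustrative assumptions, not results from this notebook):

```python
import numpy as np

# Hypothetical P(Exited=1) from three models for the same four customers
model_scores = [
    np.array([0.9, 0.2, 0.6, 0.1]),
    np.array([0.7, 0.4, 0.5, 0.2]),
    np.array([0.8, 0.3, 0.7, 0.3]),
]

avg_p1 = np.mean(model_scores, axis=0)              # average P(Exited=1) per customer
soft_vote = np.column_stack((1 - avg_p1, avg_p1))   # columns: P(Exited=0), P(Exited=1)
predicted = (avg_p1 >= 0.5).astype(int)             # class label from the averaged score
print(soft_vote)
print(predicted)
```

Because the vote works on probabilities rather than hard labels, a confident model can outweigh two lukewarm ones.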

In [None]:
all_y_scores = []
methods = ['slr-log-reg', 'kNN', 'CART', 'neural-network']

for m in methods:
    s = summaries.loc[summaries['method'].str.startswith(m), 'y_scores'].item()
    all_y_scores.append(s)    

In [None]:
soft_vote_y_scores = np.mean(all_y_scores, axis=0)
soft_vote_y_scores = soft_vote_y_scores[..., None]
soft_vote_y_scores = np.append(1-soft_vote_y_scores, soft_vote_y_scores, axis=1)

training_accuracy = 0.0
validation_accuracy = 0.0

summaries = create_summary("soft_vote", 
                           ensemble_test_y,
                           soft_vote_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);

### Stacking Classifier <a class="anchor" id="stacking"></a>

The stacking classifier fits a logistic regression (the meta-learner) on the outputs of the other classifiers.
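The idea can be sketched with NumPy alone: the base models' scores become the feature columns, and a small logistic regression is fitted on top of them. The gradient-descent fit, the helper names `fit_meta_logistic` and `predict_meta`, and the toy scores are all assumptions for illustration; the notebook itself uses scikit-learn's `LogisticRegression` for this step.

```python
import numpy as np

def fit_meta_logistic(meta_x, y, lr=0.5, n_iter=2000):
    """Fit a logistic-regression meta-learner by plain gradient descent."""
    x = np.column_stack([np.ones(len(meta_x)), meta_x])  # prepend intercept column
    w = np.zeros(x.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))   # current predicted probabilities
        w -= lr * (x.T @ (p - y)) / len(y)   # gradient of the mean log loss
    return w

def predict_meta(w, meta_x):
    """Return P(y=1) from the fitted meta-learner weights."""
    x = np.column_stack([np.ones(len(meta_x)), meta_x])
    return 1.0 / (1.0 + np.exp(-(x @ w)))

# Hypothetical churn scores from two base models for six customers
base_a = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
base_b = np.array([0.6, 0.9, 0.4, 0.5, 0.1, 0.2])
y = np.array([1, 1, 1, 0, 0, 0])

meta_x = np.column_stack((base_a, base_b))  # one column per base model
weights = fit_meta_logistic(meta_x, y)
stacked_scores = predict_meta(weights, meta_x)
print(stacked_scores)
```

In practice the meta-learner is usually trained on predictions for data the base models have not seen, so that it does not simply learn to trust the model that overfits the most.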

In [None]:
# the x values are the predictions of the base models
stacking_y_column = "Exited"

stacking_train_x = np.column_stack((log_model.predict(log_train_x),
                                   knn_model.predict_proba(knn_train_x)[:,1],
                                   dtc_model.predict_proba(dtc_train_x)[:,1],
                                   nn_model.predict_proba(nn_train_x)[:,1]))
stacking_train_y = bdt_train[stacking_y_column]

stacking_val_x = np.column_stack((log_model.predict(log_val_x),
                                   knn_model.predict_proba(knn_val_x)[:,1],
                                   dtc_model.predict_proba(dtc_val_x)[:,1],
                                   nn_model.predict_proba(nn_val_x)[:,1]))
stacking_val_y = bdt_validate[stacking_y_column]

stacking_test_x = np.column_stack((log_model.predict(log_test_x),
                                   knn_model.predict_proba(knn_test_x)[:,1],
                                   dtc_model.predict_proba(dtc_test_x)[:,1],
                                   nn_model.predict_proba(nn_test_x)[:,1]))
stacking_test_y = bdt_test[stacking_y_column]

In [None]:
stacking_model = LogisticRegression(solver='liblinear', C=10.0, random_state=0)
stacking_model.fit(stacking_train_x, stacking_train_y)
stacking_y_scores = stacking_model.predict_proba(stacking_test_x);

In [None]:
training_accuracy = stacking_model.score(stacking_train_x, stacking_train_y)
validation_accuracy = stacking_model.score(stacking_val_x, stacking_val_y)

summaries = create_summary("stacking",
                           stacking_test_y, 
                           stacking_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);

### AdaBoost <a class="anchor" id="adaboost"></a>

Boosting algorithm using decision trees.

In [None]:
# define the model with default hyperparameters
ada_model = AdaBoostClassifier()
# define the grid of values to search
ada_parameters = dict()
ada_parameters['n_estimators'] = [100, 500, 1000]
ada_parameters['learning_rate'] = [0.01, 0.03, 0.1, 0.3, 1, 3]

# define the grid search procedure
ada_grid = GridSearchCV(estimator=ada_model, param_grid=ada_parameters,
                        scoring='roc_auc', n_jobs=2)
    
ada_grid_result = ada_grid.fit(X=dtc_train_x, y=dtc_train_y)
summarize_grid_results(ada_grid_result)

In [None]:
# Refit using the best hyperparameters found by the grid search above
ada_best_model = AdaBoostClassifier(learning_rate=ada_grid_result.best_params_['learning_rate'],
                                    n_estimators=ada_grid_result.best_params_['n_estimators'])
ada_best_model.fit(X=dtc_train_x, y=dtc_train_y)
ada_y_scores = ada_best_model.predict_proba(dtc_test_x);

In [None]:
training_accuracy = ada_best_model.score(dtc_train_x, dtc_train_y)
validation_accuracy = ada_best_model.score(dtc_val_x, dtc_val_y)

summaries = create_summary("ada-boost", 
                           dtc_test_y, 
                           ada_y_scores, 
                           training_accuracy,
                           validation_accuracy,
                           summaries);


# Model Selection <a class="anchor" id="selection"></a>

In [None]:
metrics = summaries[["method", "training_accuracy", "validation_accuracy", "test_accuracy", "f1_score", "auc"]]
reshaped_barplot = pd.melt(metrics, id_vars="method", var_name="metric", value_name="values")

plt.figure(figsize=(16,4))
sns.set_style("whitegrid")
fig = sns.barplot(data=reshaped_barplot, x="metric", y="values", hue="method");
fig.set_title("Method comparison");
fig.legend(loc='center left', bbox_to_anchor=(1, 0.5))
fig.set_ylim([0, 1.05]);

In [None]:
fig, axes = plt.subplots(1, 1, figsize=(16, 8));
 
for index, row in summaries.iterrows():
    sns.lineplot(ax=fig.axes[0], x=row["fpr"], y=row["tpr"], label = row["method"] + ": " + str(round(row["auc"], 3)))

plt.plot([0,1], [0,1], 'r--', label = 'Random: 0.5')
plt.plot([0,0,1], [0,1,1], 'g-', label = 'Optimal: 1.0')

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()

In [None]:
cnf_grid_cols = 3;
cnf_n_items = summaries.shape[0]
cnf_grid_rows = np.ceil(cnf_n_items/cnf_grid_cols).astype('int')

fig, axes = plt.subplots(cnf_grid_rows, cnf_grid_cols, figsize=(16, 4*cnf_grid_rows));

for index, row in summaries.iterrows():
    r = index // cnf_grid_cols
    c = index%cnf_grid_cols;
    sub_fig = sns.heatmap(ax=axes[r,c], data=row["cnf_matrix"], annot=True, fmt="d", linewidths=.5, cmap="Blues", vmin=0, vmax=1600);
    title = row["method"]
    sub_fig.set_title(title);
    sub_fig.set_ylabel("Actual");
    sub_fig.set_xlabel("Predicted");

# delete the unused columns
cols_to_delete = (cnf_grid_cols-cnf_n_items%cnf_grid_cols)%cnf_grid_cols
for i in range(cols_to_delete):
    fig.delaxes(axes[r][cnf_grid_cols-1-i])
    
fig.suptitle("Confusion Matrix Summary (Exited=1)");
fig.subplots_adjust(hspace=0.4)

Let us calculate the value of the above charts:

We want to penalize a lazy learner (since ~80% of the customers do not churn). A lost customer costs us \\$100 in revenue. A kept customer has no net change in expected revenue. We spend $5 to reach out to each customer we predict will churn. If contacted, 20% of the customers who would have churned stay instead (e.g. if 10 people will churn and we contact all 10 of them, only 8 will leave). Based on these assumptions, here is how much money we keep by using each of the algorithms.

The benefit of this calculation is that correctly detecting customers who will churn (TP) is rewarded and missing customers who will churn (FN) is heavily penalized. Misclassifying customers who will stay (FP) is penalized by the cost of reaching out to them.

In real life, a manager could then choose to optimize further by lowering the cost to reach out, or else improving the retention rate.
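To make the arithmetic concrete, here is a small standalone sketch of the calculation for a single confusion matrix, treating Exited=1 as the positive class. The function name `expected_revenue` and the 100-customer example are assumptions for illustration; the default values mirror the figures stated above.

```python
def expected_revenue(tn, fp, fn, tp, lost=100.0, contact=5.0, retention=0.20, gain=0.0):
    """Net revenue for one confusion matrix, with Exited=1 as the positive class."""
    stay_untouched = tn * gain        # stayers we correctly leave alone
    stay_contacted = -fp * contact    # stayers we contact needlessly
    churn_missed = -fn * lost         # churners we never contact
    # Churners we contact: pay for the contact, still lose (1 - retention) of them
    churn_contacted = -tp * contact - tp * (1 - retention) * lost + tp * retention * gain
    return stay_untouched + stay_contacted + churn_missed + churn_contacted

# Hypothetical test set: 100 customers, 20 of whom will churn
lazy = expected_revenue(tn=80, fp=0, fn=20, tp=0)      # always predicts "stays"
perfect = expected_revenue(tn=80, fp=0, fn=0, tp=20)   # finds every churner
print(lazy, perfect)
```

Under these assumptions the lazy model loses about 2,000 while the perfect model loses about 1,700, so even perfect detection only recovers the 20% of churners who can be retained, minus the contact cost.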

In [None]:
revenue_gain_customer = 0
revenue_lost_customer = 100
retention_rate = 0.20
cost_per_contact = 5

revenues = np.empty(summaries.shape[0])

for index, row in summaries.iterrows():
    cfm = row["cnf_matrix"]
    # Rows are actual classes, columns are predicted classes (Exited=1 is the positive class)
    tn = cfm[0][0]  # Customers who stay and whom we do not contact
    fp = cfm[0][1]  # Customers who would stay but whom we reach out to anyway
    fn = cfm[1][0]  # Customers we will lose but do not predict, so we do not reach out
    tp = cfm[1][1]  # Customers we predict will leave and reach out to
    
    customer_revenue = tn*revenue_gain_customer - fp*cost_per_contact - fn*revenue_lost_customer - tp*cost_per_contact - tp*(1-retention_rate)*revenue_lost_customer + tp*retention_rate*revenue_gain_customer
    revenues[index] = customer_revenue;
    
summaries["revenue"] = revenues;

plt.figure(figsize=(16,4));
chart = sns.lineplot(data=summaries, x="method", y="revenue", marker="o")
plt.xticks(
    rotation=30, 
    horizontalalignment='right',
)

max_index = summaries["revenue"].idxmax()
max_method = summaries.loc[max_index, "method"]
max_revenue = summaries.loc[max_index, "revenue"]
print(max_method + " has the highest revenue at ${:,.0f}".format(max_revenue))