# Analyzing Customer Churn

## Introduction

The [data](https://www.kaggle.com/shubh0799/churn-modelling) we are using for this analysis consists of customers subscribed to services at a company. Our goal is to explore and solve the problem of predicting **customer churn**. The dataset contains features like Age, Tenure, Salary, and Credit Score, which could give insight into why a customer ends their subscription or stops buying products from a company.
<br/><br/>The workflow will cover EDA, where we explore the features of the dataset and try to determine which of them are correlated with customer churn. We will then do feature engineering and selection with the intention of building a model that predicts whether a given customer will churn.

### References
 * [numpy API reference](https://numpy.org/doc/stable/reference/index.html)
 * [pandas API reference](https://pandas.pydata.org/docs/reference/index.html#api)
 * [scikit-learn documentation](https://scikit-learn.org/stable/)
 * [xgboost parameter documentation](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster)
 * [original SMOTE paper](https://arxiv.org/abs/1106.1813)

In [None]:
# Let's load in the dataset then check the head
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('/kaggle/input/churn-modelling/Churn_Modelling.csv')

df.head()

In [None]:
# Let's take a look at the summary statistics
df.describe()

`RowNumber` is just a number that identifies each row. We can drop it and use the dataframe's index instead.

In [None]:
# Drop RowNumber
df.drop('RowNumber', axis=1, inplace=True)

Let's first check for null values in our dataset.

In [None]:
# Check for null values
df.isnull().values.any()

Let's now look at the `CustomerId` column in our dataset.

In [None]:
# Check number of unique values for CustomerId
df['CustomerId'].nunique()

The number of unique `CustomerId` values matches the number of rows, so every entry in the dataset is a distinct customer. Because the identifier is unique per row, it carries no predictive information and we can drop it.

In [None]:
# Drop CustomerId
df.drop('CustomerId', axis=1, inplace=True)

In [None]:
# Let's look at the data once again.
df.head()

Looking at the data again, we see the column `Surname`. Thinking intuitively, any correlation between this column and our target variable would be coincidental. Such relationships would be spurious, since we cannot predict whether a customer will churn based on their name. Thus, we will drop this column as well.

In [None]:
# Drop Surname
df.drop('Surname', axis=1, inplace=True)

## EDA

Let's create a correlation heatmap to see linear relationships between variables.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.patches import Rectangle

fig, ax = plt.subplots(figsize=(10, 10))
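# Restrict to numeric columns; recent pandas versions raise an error on string columns in corr()
sns.heatmap(df.select_dtypes('number').corr().round(2), annot=True, ax=ax)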
highlight_color = 'blue'
ax.add_patch(Rectangle((0, 8), 9, 1, fill=False, edgecolor=highlight_color, lw=3))
ax.add_patch(Rectangle((8, 0), 1, 9, fill=False, edgecolor=highlight_color, lw=3))
ax.set_title('Feature Correlations')

for axis in [ax.get_xticklabels(), ax.get_yticklabels()]:
    label = [i for i in axis if i.get_text() == 'Exited']
    [(l.set_weight('bold'), l.set_size(25), l.set_color(highlight_color)) for l in label]

Looking at the correlation heatmap, we can see that `Age` has the highest correlation with our target variable `Exited`: older customers in this dataset are more likely to have churned. Let's take a closer look at some of the variables.

In [None]:
def feature_bar_graph(values, title, xlab, ylab='Proportion of Exited', rotate_x=False):
    temp_df = pd.DataFrame({'Feature': values, 'Exited': df['Exited']})
    gb_obj = temp_df.groupby('Feature')['Exited'].mean()
    plt.bar(gb_obj.index.astype(str), gb_obj.values, width=0.5)
    plt.title(title, fontsize=15)
    plt.xlabel(xlab, fontsize=15)
    plt.ylabel(ylab, fontsize=15)
    if rotate_x:
        plt.xticks(rotation=45)
        
# Bin Age and plot a bar graph
bin_age = pd.qcut(df['Age'].values, q=5).astype(str)
feature_bar_graph(bin_age, 'Age Analysis', 'Binned Age', rotate_x=True)

We see from the bar graph that `Age` positively correlates with customer churn. We have binned the data and can observe that the bins with higher `Age` have a larger proportion of customers churning.
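
As a small illustration of the binning step, the sketch below applies `pd.qcut` to a made-up age column and computes the churn proportion per bin. The `toy` frame and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not drawn from the churn dataset
toy = pd.DataFrame({
    'Age':    [22, 25, 31, 35, 38, 42, 47, 51, 58, 64],
    'Exited': [0,  0,  0,  0,  1,  0,  1,  1,  1,  1],
})

# qcut splits on quantiles, so each bin holds roughly the same number of rows
toy['AgeBin'] = pd.qcut(toy['Age'], q=2)

# The mean of a 0/1 target within a bin is the proportion of churned customers
churn_by_bin = toy.groupby('AgeBin', observed=True)['Exited'].mean()
print(churn_by_bin)
```

Because the target is coded as 0/1, the per-bin mean can be read directly as a churn rate, which is what `feature_bar_graph` plots.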

In [None]:
feature_bar_graph(df['Tenure'].values, 'Tenure Analysis', 'Tenure')

`Tenure` doesn't seem to show any trends when it comes to churn. The proportions are roughly the same for each of the values.

In [None]:
bin_cc = pd.qcut(df['CreditScore'].values, q=5).astype(str)
feature_bar_graph(bin_cc, 'Credit Score Analysis', 'Binned Credit Score', rotate_x=True)

We see the same thing when we look at `CreditScore`. There is no notable difference between any of the values.

In [None]:
bin_salary = pd.qcut(df['EstimatedSalary'].values, q=5).astype(str)
feature_bar_graph(bin_salary, 'Salary Analysis', 'Estimated Salary', rotate_x=True)

Looking at `EstimatedSalary`, the pattern continues: the churn proportions are roughly the same across the bins.

In [None]:
feature_bar_graph(df['IsActiveMember'].values, 'Active Member Analysis', 'Is Active Member')

Looking at `IsActiveMember` we can see that being inactive is correlated to customer churn.

In [None]:
feature_bar_graph(df['NumOfProducts'].values, 'Number of Products Analysis', 'Number of Products')

We see a notable trend in `NumOfProducts` as well. It is not linear, but when a customer has more than 2 products, the proportion of churn increases by a large amount.

In [None]:
# I will save a copy of our dataframe then one-hot encode it.
df_old = df.copy()
df = pd.get_dummies(df, drop_first=True)

In [None]:
# Separate into features and target variable
features = df.drop('Exited', axis=1)
target = df['Exited']

## Model Training Run 1

We will first do a baseline test with all features and no hyperparameter tuning to see which models perform best on our dataset. The only preprocessing we will do is feature scaling, mainly to help our non-tree-based models.

In [None]:
# Import all libraries that we will need and split the data into training and testing sets
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score, make_scorer
import os
rs = {'random_state': 42}

X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.6, **rs)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, train_size=0.5, **rs)

We will use a variety of models and see which one performs the best.<br/>
The models include:
* Logistic Regression
* Naive Bayes
* k-nearest neighbors
* SVM
* Neural Network
* Decision Tree
* Extra Trees
* Random Forest
* XGBoost

For this training run we will use 3-fold cross-validation on the validation set using `cross_val_score`. Every model that accepts a `random_state` is given a fixed seed to ensure reproducibility.
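
To make the cross-validation step concrete, here is a minimal sketch on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, and `demo_pipe` are illustrative assumptions. The point is that passing a whole `Pipeline` to `cross_val_score` refits the scaler inside each fold, so the held-out fold never influences the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (roughly 80% negatives)
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('classifier', LogisticRegression(random_state=42))])

# cross_val_score clones the pipeline and fits it on each training fold,
# then scores it on the corresponding held-out fold
fold_scores = cross_val_score(demo_pipe, X_demo, y_demo,
                              scoring=make_scorer(f1_score), cv=3)
print(fold_scores, fold_scores.mean())
```

Note that `cross_val_score` works on clones, so any earlier call to `fit` on the same pipeline does not affect the fold scores.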

In [None]:
def train_models(X_train, X_val, X_test, y_train, y_val, y_test):
    log_reg = LogisticRegression(**rs)
    nb = BernoulliNB()
    knn = KNeighborsClassifier()
    svm = SVC(**rs)
    mlp = MLPClassifier(max_iter=5000, **rs)
    dt = DecisionTreeClassifier(**rs)
    et = ExtraTreesClassifier(**rs)
    rf = RandomForestClassifier(**rs)
    xgb = XGBClassifier(**rs, verbosity=0)
    scorer = make_scorer(f1_score)

    clfs = [('Logistic Regression', log_reg), ('Naive Bayes', nb),
            ('K-Nearest Neighbors', knn), ('SVM', svm), 
            ('MLP', mlp), ('Decision Tree', dt), ('Extra Trees', et), 
            ('Random Forest', rf), ('XGBoost', xgb)]
    pipelines = []
    score_rows = []
    for clf_name, clf in clfs:
        pipeline = Pipeline(steps=[
            ('scaler', StandardScaler()),
            ('classifier', clf)])
        pipeline.fit(X_train, y_train)
        # cross_val_score clones the pipeline and refits it on folds of the validation set
        val_score = cross_val_score(pipeline, X_val, y_val, scoring=scorer, cv=3).mean()
        print(f'{clf_name}\n{"-" * 30}\nModel Score Validation: {val_score:.4f}')
        # The test score comes from the pipeline fitted on the training set
        test_score = f1_score(y_test, pipeline.predict(X_test))
        print(f'Model Score Testing: {test_score:.4f}\n\n')
        pipelines.append(pipeline)
        score_rows.append({'model': clf_name,
                           'val_score': val_score,
                           'test_score': test_score})
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
    scores_df = pd.DataFrame(score_rows, columns=['model', 'val_score', 'test_score'])
    return pipelines, scores_df

pipelines1, scores1 = train_models(X_train, X_val, X_test, y_train, y_val, y_test)

In [None]:
scores1.sort_values('test_score', ascending=False)

The MLP model performs best on the test set in our initial training run. Let's analyze its performance with a classification report and confusion matrix on the test set.

In [None]:
# Create classification report
from sklearn.metrics import classification_report, confusion_matrix
model = pipelines1[4]
print(model['classifier'])
preds = model.predict(X_test)
print(classification_report(y_test, preds))

The model is very good at predicting negative examples as seen from the `f1-score` for class 0. However, it struggles with positive examples. This can be seen from the low recall score that it gets for class 1.
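
For readers less familiar with these metrics, the sketch below works through precision, recall, and F1 for the positive class by hand. The four counts are hypothetical and are not the notebook's results.

```python
# Hypothetical confusion-matrix counts for the positive (churn) class
tn, fp, fn, tp = 1500, 100, 250, 150

precision = tp / (tp + fp)   # of predicted churners, how many actually churned
recall = tp / (tp + fn)      # of actual churners, how many were caught (sensitivity)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```

With these made-up counts the model misses most churners (`fn` is larger than `tp`), which is the kind of low recall described above.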

In [None]:
# Create confusion matrix
cfm = confusion_matrix(y_test, preds)
print(cfm)

In [None]:
# Create confusion matrix with seaborn
def create_confusion_matrix(y_true, y_pred):
    cfm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(7,7))
    sns.heatmap(cfm, annot=True, annot_kws={"size": 15}, ax=ax,
                cbar=False, square=True, cmap='Blues', fmt='d')
    sns.set(font_scale=1.5)
    plt.xlabel('Predicted', fontsize=15)
    plt.ylabel('Actual', fontsize=15)
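    # Use the sorted union of labels, which matches the row/column order of confusion_matrix
    labels = np.union1d(y_true, y_pred)
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)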
    plt.title('Confusion Matrix\nChurn Data', fontsize=18)
    
create_confusion_matrix(y_test, preds)

Looking at the confusion matrix reinforces our previous analysis of the model struggling to correctly classify positive examples.

In [None]:
# Check class distribution
y_train.value_counts()

## Model Training Run 2

When looking at the class distribution of our training set, we see a significant skew: most of our data consists of negative (0) examples. To overcome this problem, we could either undersample the majority class or oversample the minority class. Undersampling would leave us with less training data, so we will instead generate synthetic examples of the minority class. There are many methods to do this, but we will be using ADASYN (Adaptive Synthetic) sampling.

In [None]:
# Generate synthetic examples of minority class
from imblearn.over_sampling import ADASYN

adasyn = ADASYN(**rs)
X_train, y_train = adasyn.fit_resample(X_train, y_train)

In [None]:
y_train.value_counts()

Looking at the generated training data, we see that our class distribution is now pretty much even. Let's try another round of training and see the results.
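
ADASYN itself is provided by `imblearn`, as used above. The sketch below is not ADASYN. It shows only the simpler SMOTE-style interpolation idea that both methods build on: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. ADASYN additionally generates more points around minority samples that are surrounded by majority samples. The function name `interpolate_minority` and the toy data are hypothetical.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=3, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        # Distances from the base point to every minority point
        dists = np.linalg.norm(X_min - X_min[base], axis=1)
        # Skip position 0 of the sort, which is the base point itself
        neighbours = np.argsort(dists)[1:k + 1]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[partner] - X_min[base])
    return synthetic

# Hypothetical minority-class samples with two features
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = interpolate_minority(X_min, n_new=5)
print(new_rows)
```

Every synthetic row lies between two existing minority rows, so the new points stay inside the region the minority class already occupies.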

In [None]:
# Training with synthetic dataset
pipelines2, scores2 = train_models(X_train, X_val, X_test, y_train, y_val, y_test)

In [None]:
scores2.sort_values('test_score', ascending=False)

The XGBoost model performs the best in our second training run. Let's dig deeper, starting with its classification report.

In [None]:
# Model classification report
model = pipelines2[-1]
print(model['classifier'])
preds = model.predict(X_test)
print(classification_report(y_test, preds))

Though our f1_score remains comparable to our first training run, we notice a large increase in sensitivity when adding synthetic samples. This is beneficial to us since we would like to be able to detect when a customer will churn.

In [None]:
create_confusion_matrix(y_test, preds)

A look at the confusion matrix verifies our findings. Model sensitivity has increased, although specificity has taken a small hit. This is an acceptable compromise. Let's look at feature importance.

In [None]:
feat_imp = pd.DataFrame({'feature': features.columns, 'importance': model['classifier'].feature_importances_})
feat_imp.sort_values('importance', ascending=False, inplace=True)
feat_imp

In line with what we saw in our earlier analysis, `IsActiveMember` and `NumOfProducts` play a large role in determining whether a customer will churn or not.

## Feature Selection

Let's look at some feature selection methods now. First we will try Recursive Feature Elimination with Logistic Regression. We will test the chosen features with our XGBoost model.

In [None]:
# RFE with Logistic Regression
from sklearn.feature_selection import SelectFromModel, RFE

params = [StandardScaler(), XGBClassifier(**rs, verbosity=0), X_train, X_val, X_test, y_train, y_val, y_test]

def create_pipeline(feature_selection, scaler, classifier, X_train, X_val, X_test, y_train, y_val, y_test):
    # max_iter must be an integer; recent scikit-learn versions reject a float such as 5e3
    pipeline = Pipeline(steps=[('feature_selection', feature_selection(LogisticRegression(max_iter=5000))),
                        ('scaler', scaler),
                        ('classifier', classifier)])
    scorer = make_scorer(f1_score)
    pipeline.fit(X_train, y_train)
    chosen_features = X_train.iloc[:, pipeline['feature_selection'].get_support(indices=True)]
    print(f'Feature Selection Method {feature_selection.__name__} selected {len(chosen_features.columns)} features')
    print(f'Model Score Validation: {cross_val_score(pipeline, X_val, y_val, scoring=scorer, cv=3).mean():.4f}')
    print(f'Model Score Testing: {f1_score(y_test, pipeline.predict(X_test)):.4f}')
    return pipeline, chosen_features

rfe_pipe, rfe_feats = create_pipeline(RFE, *params)


RFE chose 5 features. Our model's score did not improve. Let's try using SelectFromModel.

In [None]:
# SelectFromModel with Logistic Regression
sfm_pipe, sfm_feats = create_pipeline(SelectFromModel, *params)

In [None]:
# Check columns
sfm_feats.columns

SelectFromModel chose 1 feature. Our model's score did not improve. What is interesting to note though is we are able to achieve a respectable score with just `Age`.

We will try a different method of feature selection using our model. By getting the cumulative percentage of feature importance, we can set a percentage cutoff after which we will drop all other features. I will set it to 0.9.
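
The cutoff idea can be illustrated with a toy table. The importances below are hypothetical and are not the values produced by the model; the key detail is that the features are sorted before the running total is computed.

```python
import pandas as pd

# Hypothetical importances, not the values produced by the model above
toy_imp = pd.DataFrame({
    'feature': ['a', 'b', 'c', 'd', 'e'],
    'importance': [0.05, 0.40, 0.10, 0.30, 0.15],
})

# Sort first, then accumulate, so the running total follows the ranking
toy_imp = toy_imp.sort_values('importance', ascending=False)
toy_imp['CumPerc'] = toy_imp['importance'].cumsum() / toy_imp['importance'].sum()

# Keep features whose running share is still below the cutoff
kept = toy_imp[toy_imp['CumPerc'] < 0.9]['feature'].tolist()
print(kept)
```

With a strict `<` comparison, the feature whose running share crosses the cutoff is dropped along with everything ranked after it.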

In [None]:
# Perform feature selection using feature importance
# feat_imp is already sorted by importance, so accumulate its own column to keep rows aligned
feat_imp['CumPerc'] = feat_imp['importance'].cumsum() / feat_imp['importance'].sum()
cutoff = 0.9
new_feats = feat_imp[feat_imp['CumPerc'] < cutoff]
new_feats

In [None]:
# Create a new pipeline using new features
X_train, X_val, X_test = X_train[new_feats['feature']], X_val[new_feats['feature']], X_test[new_feats['feature']]

fi_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                          ('classifier', XGBClassifier(**rs, verbosity=0))])
scorer = make_scorer(f1_score)
fi_pipe.fit(X_train, y_train)
print(f'Model Score Validation: {cross_val_score(fi_pipe, X_val, y_val, scoring=scorer, cv=3).mean():.4f}')
print(f'Model Score Testing: {f1_score(y_test, fi_pipe.predict(X_test)):.4f}')

Using feature importance has improved our f1 score by a respectable margin. We will use this feature set moving forward. Let's check the classification report.

In [None]:
print(classification_report(y_test, fi_pipe.predict(X_test)))

Sensitivity has increased yet again. Our model is getting better at detecting churn among customers.

## Hyperparameter Tuning

Let's compute a `weight` for XGBoost's `scale_pos_weight` parameter to put more emphasis on the positive class.
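
As a quick worked example of the ratio involved, the XGBoost parameter documentation suggests the count of negative instances divided by the count of positive instances as a typical value for `scale_pos_weight`. The labels below are hypothetical.

```python
from collections import Counter

# Hypothetical labels: 8 negatives and 2 positives
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
counts = Counter(labels)

# Negative-to-positive ratio, a typical starting value for scale_pos_weight
ratio = counts[0] / counts[1]
print(ratio)
```

A ratio above 1 tells the booster to penalise errors on positive examples more heavily than errors on negative ones.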

In [None]:
# Compute the negative-to-positive ratio and pass it as scale_pos_weight when creating the classifier.
# y_val keeps the original class distribution; y_train was rebalanced by ADASYN.

weight = (y_val == 0).sum() / (y_val == 1).sum()
cw_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                          ('classifier', XGBClassifier(scale_pos_weight=weight, **rs, verbosity=0))])
cw_pipe.fit(X_train, y_train)
print(f'Model Score Validation: {cross_val_score(cw_pipe, X_val, y_val, scoring=scorer, cv=3).mean():.4f}')
print(f'Model Score Testing: {f1_score(y_test, cw_pipe.predict(X_test)):.4f}')

Weighting the positive class seems to decrease the model's overall f1_score. Let's check the classification report.

In [None]:
print(classification_report(y_test, cw_pipe.predict(X_test)))

When adding the `scale_pos_weight` parameter, we are able to make improvements to sensitivity, but specificity takes a large hit. It is debatable whether we would keep a change like this in our model. Using it would certainly allow us to detect more churn among customers, but at the cost of an increase in type I errors (false positives). I will opt to leave this out moving forward.

## Model Analysis

In [None]:
test_preds = fi_pipe.predict(X_test)
print(classification_report(y_test, test_preds))

In [None]:
# Create confusion matrix
cfm = confusion_matrix(y_test, test_preds)
print(cfm)

In [None]:
# Create confusion matrix with seaborn
create_confusion_matrix(y_test, test_preds)

Finally, let's create an ROC plot.
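
Before plotting, it may help to recall that the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; `pairwise_auc` and the toy scores are hypothetical. It also shows that thresholded 0/1 predictions give a different area than the underlying probabilities.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_demo = [0, 0, 1, 0, 1, 1]
proba = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]   # hypothetical predicted probabilities
hard = [int(p >= 0.5) for p in proba]          # thresholded 0/1 predictions

auc_from_proba = pairwise_auc(y_demo, proba)   # uses the full ranking
auc_from_hard = pairwise_auc(y_demo, hard)     # ranking collapsed to two values
print(auc_from_proba, auc_from_hard)
```

For this reason the area reported next to an ROC curve is normally computed from the same probabilities that generate the curve.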

In [None]:
# Plot roc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

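# Use predicted probabilities so the reported area matches the plotted curve
test_proba = fi_pipe.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, test_proba)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_proba)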

fig = plt.figure()
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.plot(false_positive_rate, true_positive_rate, label=f'XGBoost (area = {logit_roc_auc:.2f})')
plt.plot([0, 1], [0, 1], '--', color='grey')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic plot')
plt.legend(loc="lower right")
plt.show()

## Out of Sample Prediction

With our final model we are able to address some of the problems regarding sensitivity and arrive at a respectable f1_score. Let's make predictions on out-of-sample data.

In [None]:
# Create out-of-sample data
oos = pd.DataFrame({'RowNumber': [10000, 10001, 10002, 10003, 10004], 
                    'CustomerId': [15849068, 15784210, 15690576, 15984739, 15893045],
                   'Surname': ['Garcia', 'Miller', 'Rodriguez', 'Lee', 'Hill'],
                   'CreditScore': [145, 566, 392, 669, 478],
                   'Geography': ['France', 'Germany', 'Spain', 'France', 'France'],
                    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
                   'Age': [46, 32, 25, 66, 47],
                   'Tenure': [1, 5, 4, 4, 8],
                   'Balance': [14569.43, 0.00, 129804.44, 1589.04, 0.00],
                   'NumOfProducts': [3, 3, 1, 3, 2],
                   'HasCrCard': [1, 1, 1, 1, 1],
                   'IsActiveMember': [0, 1, 1, 0, 1],
                   'EstimatedSalary': [164032.87, 56890.44, 98349.51, 57098.64, 122548.65]})

oos

In [None]:
# Let's write a function to preprocess the dataset and make sure it matches with our original
def preprocess_data(df):
    df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
    df = pd.get_dummies(df, drop_first=True)
    df = df[new_feats['feature']]
    return df

oos_processed = preprocess_data(oos)
assert all(oos_processed.columns == X_train.columns)

In [None]:
# Predict and append to the dataframe
oos['predictions'] = fi_pipe.predict(oos_processed)
oos
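
One caveat about `preprocess_data`: `pd.get_dummies` only creates columns for categories present in the data it is given, so it works here because the out-of-sample rows happen to contain every country and gender. A possible way to make the step more robust, sketched below under assumed column names, is to encode without `drop_first` and then align to the training columns with `reindex`.

```python
import pandas as pd

# Columns the model was trained on (hypothetical subset for illustration)
train_columns = ['Age', 'Geography_Germany', 'Geography_Spain', 'Gender_Male']

# New data that happens to contain a single country and a single gender
new_rows = pd.DataFrame({'Age': [30, 52],
                         'Geography': ['Germany', 'Germany'],
                         'Gender': ['Male', 'Male']})

# Encode without drop_first, then align to the training columns;
# categories that are absent from the new data are filled with 0
encoded = pd.get_dummies(new_rows).reindex(columns=train_columns, fill_value=0)
print(encoded)
```

Aligning to a fixed column list also guarantees the column order the fitted pipeline expects.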