# 1. Introduction

## 1.1 Premise

A manager at a bank is concerned that more and more customers are leaving the bank's credit card services. 

The bank wants to predict which customers are likely to churn, so that it can proactively approach them with better services and persuade them to stay.

## 1.2 Plan

- Perform **exploratory data analysis** to learn the *properties/characteristics* of the features present.
- Fit several **classification models** to predict whether a customer will churn or not.
- Apply *hyper-parameter optimization* techniques.
- **Evaluate performance** of fitted models.

In [None]:
import math
import numpy as np 
import pandas as pd
from scipy import stats

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Preprocessing
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Modelling
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# 2. Exploratory Data Analysis

In [None]:
# Load the data
data = pd.read_csv(
    '/kaggle/input/credit-card-customers/BankChurners.csv',
    index_col='CLIENTNUM',
    na_values=['Unknown']  # interpret "Unknown" as missing 
)
data.head()

## 2.1 Summary

Once the last two columns are dropped (see the cell below), the dataset consists of 10,127 rows and 20 columns.

In [None]:
# Drop the last 2 columns as advised in the dataset's description
# See https://www.kaggle.com/sakshigoyal7/credit-card-customers
data = data.iloc[:, :-2]
data.info()

## 2.2 Missing Values

`Education_Level`, `Marital_Status` and `Income_Category` have missing values. 

In [None]:
# Check for missing values
missing = data.isna().sum()
pd.DataFrame({
    'No. of missing values': missing,
    '% missing': missing.apply(lambda x: f'{x/len(data):.2%}')
}).style.background_gradient()

### Strategy for Handling Missing Values

A common and straightforward way of dealing with missing values is to *drop the affected rows or columns*. The advantage is that you are left with genuine, unaltered data. The disadvantage is that you lose some data, which is especially undesirable if the dataset is small or a large proportion of its values is missing.

Another common tactic is *imputation*, which involves determining values to fill in the blanks. The advantage here is that no data is thrown out. But then, depending on the method used, the imputed values might be misleading.

Removing rows with missing values would in this case leave only 7,081 of the 10,127 rows for modelling, a loss of about 30%. To retain as much of the data as possible, we'll use imputation instead. This will be implemented as a component in the model fitting pipeline.
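
The trade-off between the two strategies can be seen on a small example. The sketch below uses a made-up toy frame (the column names mirror the dataset, the values are invented) and fills the categorical blanks with each column's mode, which is one possible imputation choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    'Education_Level': ['Graduate', np.nan, 'Graduate', 'High School', np.nan],
    'Marital_Status': ['Married', 'Single', np.nan, 'Married', 'Married'],
    'Customer_Age': [45, 52, 38, 61, 29],
})
cat_cols = ['Education_Level', 'Marital_Status']

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each categorical column's blanks with its most frequent value
imputed = toy.copy()
imputed[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(f'Rows after dropping: {len(dropped)} of {len(toy)}')
print(f'Rows after imputing: {len(imputed)} of {len(toy)}')
print(imputed)
```

Dropping keeps only the fully observed rows, while imputing keeps every row at the cost of introducing values that were never actually recorded.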

## 2.3 Numeric Columns

There are 14 numeric columns.

In [None]:
numeric_cols = data.select_dtypes(include='number')
numeric_cols.columns

### 2.3.1 Summary Statistics

In [None]:
# Get summary statistics for numeric columns
numeric_cols.describe()

### 2.3.2 Histograms

In [None]:
# Plot histograms of numeric columns
histograms = numeric_cols.hist(figsize=(12, 12))

>`Credit_Limit`, `Avg_Open_To_Buy`, `Total_Amt_Chng_Q4_Q1`, `Total_Trans_Amt`, `Total_Ct_Chng_Q4_Q1` and `Avg_Utilization_Ratio` are skewed to the right (positively skewed).
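
Skewness can also be checked numerically rather than by eye. The sketch below uses synthetic data (not the bank data) to show what `DataFrame.skew()` reports for a symmetric and a right-skewed column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins, not the bank data: one symmetric and one right-skewed column
toy = pd.DataFrame({
    'symmetric': rng.normal(loc=50, scale=10, size=5_000),
    'right_skewed': rng.lognormal(mean=0, sigma=1, size=5_000),
})

# Sample skewness: close to 0 for symmetric data, positive for a long right tail
skewness = toy.skew()
print(skewness.round(2))

# For right-skewed data the mean is pulled above the median
centres = toy.agg(['mean', 'median'])
print(centres.round(2))
```

On the notebook's data, `numeric_cols.skew()` should give the same kind of per-column summary.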

>`Total_Revolving_Bal` has a curious peak close to the origin, which is investigated below:

In [None]:
total_rev_bal = data['Total_Revolving_Bal'].value_counts()
print(f'A very large number of customers ({total_rev_bal.loc[0]:,}) have 0 Total_Revolving_Bal.')
print(total_rev_bal.nlargest(5))  # top 5 frequencies
print('\nFrequencies of the first 5 values confirm that the peak is specifically at 0:')
print(total_rev_bal.sort_index().head())

### 2.3.3 Box Plots

In [None]:
# Plot boxplots
fig, axes = plt.subplots(nrows=math.ceil(numeric_cols.shape[1] / 3),
                         ncols=3, figsize=(12, 18))

fig.tight_layout(h_pad=3)  # Add padding to sub-plots

for col, ax in zip(numeric_cols.columns, axes.flatten()):
    numeric_cols[col].plot.box(ax=ax,)

> `Credit_Limit`, `Avg_Open_To_Buy`, `Total_Amt_Chng_Q4_Q1`, `Total_Trans_Amt` and `Total_Ct_Chng_Q4_Q1` have a very large number of outliers.
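
The points drawn individually in a box plot are, by matplotlib's default, the values lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule follows; `count_outliers` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def count_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the default whisker rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())


# Made-up values for illustration only
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
n_outliers = count_outliers(toy)
print(n_outliers)
```

Assuming `numeric_cols` is still the DataFrame of numeric columns, `numeric_cols.apply(count_outliers)` would give a per-column count to back up the visual impression.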

### 2.3.4 Normal Probability Plots

In [None]:
# Plot probability plots (qq-plots)
fig, axes = plt.subplots(nrows=math.ceil(numeric_cols.shape[1] / 3),
                         ncols=3, figsize=(12, 18))

fig.tight_layout(h_pad=5)  # Add padding to sub-plots

for col, ax in zip(numeric_cols.columns, axes.flatten()):
    stats.probplot(numeric_cols[col], dist='norm', plot=ax)
    ax.set_title(col)

> `Customer_Age`, `Dependent_count`, `Months_on_book` and `Total_Trans_Ct` are approximately normally distributed: their points follow the reference line closely.
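
How closely the points follow the reference line can be quantified: `stats.probplot` also returns the correlation coefficient `r` of the least-squares fit, which approaches 1 for normally distributed data. The sketch below uses synthetic samples rather than the bank data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples, not the bank data
samples = {
    'normal': rng.normal(size=1_000),
    'right_skewed': rng.lognormal(size=1_000),
}

r_values = {}
for name, values in samples.items():
    # Without a `plot` argument, probplot only returns the plot data and the fit:
    # ((theoretical quantiles, ordered values), (slope, intercept, r))
    _, (slope, intercept, r) = stats.probplot(values, dist='norm')
    r_values[name] = r
    print(f'{name}: r = {r:.3f}')
```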

## 2.4 Categorical Columns

There are 6 categorical columns.

In [None]:
categorical_cols = data.select_dtypes(include='O')
categorical_cols.columns

### 2.4.1 Summary Statistics

In [None]:
# Get summary statistics for categorical columns
categorical_cols.describe()

### 2.4.2 Count Plots

In [None]:
# Plot countplots of categorical columns
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 15), )

# Note: passing `axis` positionally to drop() is no longer supported in pandas 2.x
for col, ax in zip(categorical_cols.drop(columns='Attrition_Flag'), axes.flatten()):
    sns.countplot(x=col, ax=ax, hue='Attrition_Flag', data=categorical_cols)
    ax.tick_params(axis='x', rotation=45)
    ax.set_title(col, size=16)
plt.tight_layout()

# 3. Modelling & Prediction

Let's now attempt to fit several classification models to predict whether a customer will leave.

The target variable, `Attrition_Flag`, is heavily imbalanced: one class occurs far more often than the other.

In [None]:
attrition_bar_plot = data['Attrition_Flag'].value_counts().plot.bar()
_ = attrition_bar_plot.set_title('Bar Plot of Attrition_Flag', size=14,
                                 fontweight='bold')

> There exist strategies for handling such situations, some of which can be implemented using the [imbalanced-learn][1] package.

> In this case, we'll use [random over-sampling][2], which duplicates randomly chosen minority-class rows until the classes are balanced.

[1]: https://imbalanced-learn.org/stable/user_guide.html
[2]: https://imbalanced-learn.org/stable/over_sampling.html#naive-random-over-sampling
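
To make the idea concrete, here is a conceptual sketch of random over-sampling in plain pandas. It is not imbalanced-learn's implementation, and the toy values are made up: rows of the smaller class are re-drawn with replacement until both classes have the same size.

```python
import pandas as pd

# Made-up data for illustration: 8 existing customers, 2 attrited
toy = pd.DataFrame({
    'Total_Trans_Ct': [42, 65, 71, 30, 58, 80, 66, 49, 25, 33],
    'Attrition_Flag': ['Existing Customer'] * 8 + ['Attrited Customer'] * 2,
})

counts = toy['Attrition_Flag'].value_counts()
majority_size = counts.max()

# Keep every original row, then top up each smaller class with random duplicates
parts = [toy]
for label, group in toy.groupby('Attrition_Flag'):
    n_extra = majority_size - len(group)
    if n_extra > 0:
        parts.append(group.sample(n=n_extra, replace=True, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
balanced_counts = balanced['Attrition_Flag'].value_counts()

print(counts)
print(balanced_counts)
```

Because the added rows are exact copies, over-sampling should only ever be applied to the training set, so that the test set keeps the original class proportions.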

In [None]:
# Select the features and target
X = data.drop('Attrition_Flag', axis=1)
y = data['Attrition_Flag']

numeric_cols = X.select_dtypes(include='number').columns
categorical_cols = X.select_dtypes(include='O').columns

# Prepare a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Apply random oversampling to counteract imbalance
random_oversampler = RandomOverSampler(random_state=0)
X_resampled, y_resampled = random_oversampler.fit_resample(X_train, y_train)

# Preprocessing pipeline for numeric cols
numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # fill blanks with column's mean
    ('scaler', StandardScaler())  # standardize to zero mean and unit variance
])

# Preprocessing pipeline for categorical cols
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # fill blanks with column's mode
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # one-hot-encode values; skip categories unseen in training
])

# Combined preprocessing pipeline
combined_features = ColumnTransformer([
    ('numeric', numeric_pipe, numeric_cols),
    ('categorical', categorical_pipe, categorical_cols)
])


def fit_and_evaluate(classifier, params=None):
    """Fit a classification model and print metrics.
    
    Parameters
    ----------
    classifier: An instance of a classification predictor that follows the sklearn API.
        The model to fit.
    params: dict
        A dictionary of hyper-parameter values to pass to RandomizedSearchCV 
        for model tuning.
    
    Returns
    -------
    The model with hyper-parameters yielding the highest cross-validated score.
    """
    pipe = Pipeline([
        ('features', combined_features),
        ('classifier', classifier)
    ])
    
    params = {} if not params else params  # set empty dict as default
    
    model = RandomizedSearchCV(estimator=pipe, param_distributions=params,
                               scoring='roc_auc', n_jobs=2, cv=4, random_state=0)
    model.fit(X_resampled, y_resampled)
    
    print(f'Cross Validation AUC: {model.best_score_:.2%}')
    print(f'Test AUC: {model.score(X_test, y_test):.2%}')
    print('\nClassification Report:\n' + '-'*58)
    # classification_report expects the true labels first, then the predictions
    print(classification_report(y_test, model.predict(X_test)))
    
    return model.best_estimator_
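
When reading the classification report, keep in mind that scikit-learn's metric functions take the true labels first and the predictions second. The sketch below, using made-up labels, shows that swapping the two makes precision and recall trade places.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 0]
y_pred = [1, 0, 0, 0]

# Correct order: ground truth first, predictions second
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Swapped order: the two metrics trade places
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(f'precision={precision:.2f}, recall={recall:.2f}')
print(f'swapped: precision={swapped_precision:.2f}, recall={swapped_recall:.2f}')
```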

## 3.1 Random Forest Classifier

In [None]:
clf = RandomForestClassifier(random_state=7)
params = {'classifier__max_depth': range(3, 8),
          'classifier__class_weight': ["balanced", "balanced_subsample"]}

rf_model = fit_and_evaluate(clf, params)

## 3.2 Gradient Boosting Classifier

In [None]:
clf = GradientBoostingClassifier(random_state=0)
params = {'classifier__max_depth': range(3, 8),
          'classifier__n_estimators': range(100, 500, 100)}

gb_model = fit_and_evaluate(clf, params)

## 3.3 Support Vector Classifier

In [None]:
clf = SVC(class_weight='balanced', random_state=0)
params = {'classifier__C': np.logspace(1, 4, 10)}

svc_model = fit_and_evaluate(clf, params)

## 3.4 Logistic Regression

In [None]:
clf = LogisticRegression(random_state=2, class_weight='balanced')
params = {'classifier__C': np.logspace(1, 4, 10)}

lr_model = fit_and_evaluate(clf, params)

# 4. Conclusion

Among the models tested above, the `GradientBoostingClassifier` seems the most promising.

Let's visualise sample predictions to check if the predictions from the various models are consistent:

In [None]:
sample = X_test.sample(25, random_state=5)  # draw only from data the models have not seen

results = pd.DataFrame({
    'Random Forest Classifier': rf_model.predict(sample),
    'Gradient Boosting Classifier': gb_model.predict(sample),
    'Support Vector Classifier': svc_model.predict(sample),
    'Logistic Regression': lr_model.predict(sample)
})


def color_code(cell):
    """Set a DataFrame cell's background color according to its value."""
    if cell == 'Existing Customer':
        color = 'aqua'
    else:
        color = 'orangered'
    return f'background-color: {color}'

# Note: from pandas 2.1 onwards, Styler.applymap is deprecated in favour of Styler.map
results.style.applymap(color_code)
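
Besides colour-coding, the consistency of the models can be summarised numerically. The sketch below uses made-up predictions standing in for the `results` frame (the model names are placeholders) and computes how often all models agree, plus the pairwise agreement between them.

```python
import pandas as pd

# Made-up predictions standing in for the `results` frame
toy_results = pd.DataFrame({
    'Model A': ['Existing Customer', 'Attrited Customer', 'Existing Customer', 'Existing Customer'],
    'Model B': ['Existing Customer', 'Attrited Customer', 'Attrited Customer', 'Existing Customer'],
    'Model C': ['Existing Customer', 'Existing Customer', 'Existing Customer', 'Existing Customer'],
})

# A row is unanimous when it contains exactly one distinct prediction
unanimous = toy_results.nunique(axis=1).eq(1)
print(f'All models agree on {unanimous.mean():.0%} of the sampled customers')

# Pairwise agreement: fraction of rows on which two models give the same label
agreement = pd.DataFrame({
    a: {b: (toy_results[a] == toy_results[b]).mean() for b in toy_results}
    for a in toy_results
})
print(agreement)
```

Applying the same two steps to `results` would show at a glance which pairs of models disagree most often.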