# Bank Churners

Intro: In this notebook I'll use a few standard ML algorithms to predict bank churners.

Imports:

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

Let us read the data.

Note: I'll delete the last two columns as the dataset description states.

In [None]:
df = pd.read_csv("/kaggle/input/credit-card-customers/BankChurners.csv")
df = df.iloc[:, :-2]  # drop the last two columns
print(f"The data shape is: {df.shape}")

## Data Preprocessing

In [None]:
df.info()

Let's take a look at the categorical columns:

In [None]:
cat_columns = df.select_dtypes(include = ['object']).nunique(dropna=False)
print(cat_columns)

In [None]:
for col in cat_columns.index:
    print("Feature: ", col)
    print("   Vals: ", df[col].unique())

Let's handle the categorical columns that can be interpreted as ordinal features:

In [None]:
# Education level
# Change 'College'=14 'Doctorate'=21 'Graduate'=16 'High School'=12 'Post-Graduate'=18 'Uneducated'=8 'Unknown'= Mode
df.loc[df['Education_Level'] == 'College',       'Education_Level'] = 14
df.loc[df['Education_Level'] == 'Doctorate',     'Education_Level'] = 21
df.loc[df['Education_Level'] == 'Graduate',      'Education_Level'] = 16
df.loc[df['Education_Level'] == 'High School',   'Education_Level'] = 12
df.loc[df['Education_Level'] == 'Post-Graduate', 'Education_Level'] = 18
df.loc[df['Education_Level'] == 'Uneducated',    'Education_Level'] = 8
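# .mode() returns a Series, so take its first element to assign a scalar
df.loc[df['Education_Level'] == 'Unknown',       'Education_Level'] = df['Education_Level'].mode()[0]
# Cast to int so pd.get_dummies later keeps this column numeric instead of one-hot encoding it
df['Education_Level'] = df['Education_Level'].astype(int)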
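
As a side note, the same ordinal encoding can be written more compactly with a dictionary and `Series.map`. The sketch below runs on a small toy frame (the rows are made up for illustration) and assumes 'Unknown' should be replaced by the most common known level before mapping:

```python
import pandas as pd

# Toy data, not taken from the real dataset
toy = pd.DataFrame(
    {"Education_Level": ["Graduate", "Unknown", "High School", "Graduate", "Doctorate"]}
)

# Same year values as in the cell above
years = {
    "Uneducated": 8, "High School": 12, "College": 14,
    "Graduate": 16, "Post-Graduate": 18, "Doctorate": 21,
}

# Most common level among the known values
known = toy["Education_Level"] != "Unknown"
most_common = toy.loc[known, "Education_Level"].mode()[0]

# Fill 'Unknown' first, then map every label to its number
encoded = toy["Education_Level"].replace("Unknown", most_common).map(years)
toy["Education_Level"] = encoded.astype(int)
print(toy["Education_Level"].tolist())
```

Building a new integer column this way also avoids leaving the column with object dtype.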

In [None]:
# Income 
df.loc[df['Income_Category'] == 'Less than $40K', 'Income_Category'] = 30
df.loc[df['Income_Category'] == '$40K - $60K',    'Income_Category'] = 50
df.loc[df['Income_Category'] == '$60K - $80K',    'Income_Category'] = 70
df.loc[df['Income_Category'] == '$80K - $120K',   'Income_Category'] = 100
df.loc[df['Income_Category'] == '$120K +',        'Income_Category'] = 200
df.loc[df['Income_Category'] == 'Unknown',        'Income_Category'] = 0
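df.loc[df['Income_Category'] == 0, 'Income_Category'] = df['Income_Category'].mode()[0]
# Cast to int so pd.get_dummies later keeps this column numeric instead of one-hot encoding it
df['Income_Category'] = df['Income_Category'].astype(int)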

In [None]:
# Card Category
df.loc[df['Card_Category'] == 'Blue', 'Card_Category'] = 1
df.loc[df['Card_Category'] == 'Silver', 'Card_Category'] = 2
df.loc[df['Card_Category'] == 'Gold', 'Card_Category'] = 3
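df.loc[df['Card_Category'] == 'Platinum', 'Card_Category'] = 4
# Cast to int so pd.get_dummies later keeps this column numeric instead of one-hot encoding it
df['Card_Category'] = df['Card_Category'].astype(int)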

Let's separate the features from the label and split both into train and test sets:

In [None]:
# Encode the target: 1 = existing customer, 2 = attrited (churned) customer
Y = df['Attrition_Flag'].map({'Existing Customer': 1, 'Attrited Customer': 2}).to_numpy()

X = pd.get_dummies(df.drop(['CLIENTNUM', 'Attrition_Flag'], axis=1))

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
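
A note on `pd.get_dummies`: it one-hot encodes every object-dtype column, even when the values stored in it are already numbers. The sketch below shows this on a toy frame (hypothetical values), where an object column holding integers is expanded into dummy columns until it is converted to a numeric dtype:

```python
import pandas as pd

# Toy frame: Card_Category holds integers but has object dtype
toy = pd.DataFrame({
    "Card_Category": pd.Series([1, 2, 1], dtype=object),
    "Gender": ["M", "F", "F"],
})

# Object dtype: the numeric codes are treated as categories
before = pd.get_dummies(toy)
print(list(before.columns))

# After converting to a numeric dtype the column is passed through unchanged
toy["Card_Category"] = pd.to_numeric(toy["Card_Category"])
after = pd.get_dummies(toy)
print(list(after.columns))
```

So an ordinal encoding only survives `pd.get_dummies` if the column has a numeric dtype by the time it is called.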

Let's check target balance:

In [None]:
print(np.unique(Y, return_counts=True))

The label is imbalanced. Let's use SMOTE to oversample the minority class. It is applied to the training set only, so the test set keeps its original class distribution:

In [None]:
sm = SMOTE(random_state=42)
x_res, y_res = sm.fit_resample(x_train, y_train)
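
For intuition, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not imblearn's implementation; the function name `smote_like_oversample` and the toy points are made up for this example:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per row

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with two features
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

Each synthetic row lies on the line segment between two real minority rows, which is why the new points stay inside the region the minority class already occupies.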

Check that the data is now more balanced:

In [None]:
print(np.unique(y_res, return_counts=True))

## Modeling
I'll use several common classification algorithms. Recall on the attrited class is my main performance measure: it is the share of actual churners that the model catches, which matters more for churn prediction than overall accuracy on an imbalanced label.
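
To make the metric concrete, here is a small sketch that computes recall by hand as TP / (TP + FN) for the attrited class. The label arrays are toy values, using the same encoding as above (1 = existing, 2 = attrited):

```python
import numpy as np

# Toy labels: 2 = attrited customer, 1 = existing customer
y_true = np.array([2, 2, 2, 1, 1, 1, 1, 2])
y_pred = np.array([2, 1, 2, 1, 1, 2, 1, 2])

true_pos = np.sum((y_true == 2) & (y_pred == 2))   # churners the model caught
false_neg = np.sum((y_true == 2) & (y_pred != 2))  # churners the model missed
recall = true_pos / (true_pos + false_neg)
print(recall)  # 0.75
```

This is the quantity that `recall_score(y_true, y_pred, pos_label=2)` reports.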

In [None]:
def ClassPrediction(classifier, mdl):
    """Fit on the SMOTE-resampled training data and store recall for the attrited class (label 2)."""
    model = classifier.fit(x_res, y_res)
    # Recall on the original (non-resampled) training set
    y_hat = model.predict(x_train)
    results.loc[mdl, 'Train'] = recall_score(y_train, y_hat, pos_label=2)
    # Recall on the held-out test set
    y_hat = model.predict(x_test)
    results.loc[mdl, 'Test'] = recall_score(y_test, y_hat, pos_label=2)

# Storing results: one row per model
results = pd.DataFrame()

# Models
ClassPrediction(DecisionTreeClassifier(), 'Decision Tree')
ClassPrediction(RandomForestClassifier(), 'Random Forest')
ClassPrediction(GradientBoostingClassifier(), 'Gradient Boost')
ClassPrediction(AdaBoostClassifier(), 'AdaBoost')
print(results)

Conclusion: Gradient Boost achieved the best recall on the test set (~88.9%), with no visible overfitting.