# Bank Customer Churn Prediction

A bank is losing customers. They know that customer retention is cheaper than CAC(customer acquisition cost). Hence, they are trying to predict when a customer will churn based on historic data.

In [618]:
import pandas as pd
import numpy as np

Note: We will utilise the following modules later on:
1. `scikit-learn`

In [619]:
try:
    bank = pd.read_csv('/datasets/Churn.csv')
    print('Loaded!')
except:
    print('Could not load data set, please check path.')

Loaded!


## Preparing the Data

### Understanding the Data

Let's explore the data a little and find any inconsistencies

In [620]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


### Cleaning and Filling in Missing Values

In [621]:
bank.columns = bank.columns.str.lower()

In [622]:
bank.head()

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


We only see missing values for tenure. The documentation says that tenure is the: *period of maturation for a customer’s fixed deposit (years)*. Let's take the safest bet and assume that the tenure is 0 and customers do not have fixed deposits where the values are NaN. 

In [623]:
bank[bank['tenure'].isna()]

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.00,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.00,1,0,0,84509.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,9945,15703923,Cameron,744,Germany,Male,41,,190409.34,2,1,1,138361.48,0
9956,9957,15707861,Nucci,520,France,Female,46,,85216.61,1,1,0,117369.52,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
9985,9986,15586914,Nepean,659,France,Male,36,,123841.49,2,1,0,96833.00,0


In [624]:
bank['tenure'] = bank['tenure'].fillna(0)

We can begin to prepare the data. Keeping in mind that this is a classifaction problem, we are likely to use Logistic Regression or Random Forest Regressor to train our model. 

We can get rid of some unnecessary columns like `customerid`, `rownumber`, `surname` as these are not indicators that affect the churn likelihood and we do not want the model to give any importance to them.

In [625]:
bank = bank.drop(['rownumber','customerid','surname'], axis = 1)

In [626]:
bank.head()

Unnamed: 0,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


Let's carry out some One Hot Encoding to encode columns such as `geography` and `gender`. 

In [627]:
bank_ohe = pd.get_dummies(bank, drop_first = True)

In [628]:
bank_ohe.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


Now, let's split the data into 3 sets: Train, Validate and Test(3:1:1 ratio). We'll carry out feature scaling further along the way.  

In [629]:
features = bank_ohe.drop(['exited'],axis = 1)
features.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,1,0


In [630]:
target = bank_ohe['exited']
target.head()

0    1
1    0
2    1
3    0
4    0
Name: exited, dtype: int64

### Splitting the Dataset

In [631]:
from sklearn.model_selection import train_test_split

In [632]:
features_train,features_valid_test,target_train,target_valid_test = train_test_split(features, target, 
                                                                                     test_size = 0.4)

In [633]:
features_valid_test.shape

(4000, 11)

In [634]:
features_train.shape

(6000, 11)

In [635]:
features_valid, features_test, target_valid, target_test = train_test_split(features_valid_test,target_valid_test,
                                                                           test_size = 0.5)

In [636]:
features_valid.shape

(2000, 11)

In [637]:
target_test.shape

(2000,)

### Feature Scaling

Let's carry out feature scaling to avoid weightages to features on the basis of their scale for Logistic Regression Models

In [638]:
numeric = ['creditscore','age','tenure','balance','estimatedsalary']

In [639]:
from sklearn.preprocessing import StandardScaler

In [640]:
scaler = StandardScaler()
scaler.fit(features_train[numeric])

StandardScaler(copy=True, with_mean=True, with_std=True)

In [641]:
#Avoid SettingWithCopyWarning
pd.options.mode.chained_assignment = None

In [642]:
features_train[numeric] = scaler.transform(features_train[numeric])

In [643]:
features_valid[numeric] = scaler.transform(features_valid[numeric])

In [644]:
features_test[numeric] = scaler.transform(features_test[numeric])

Now that the features data is prepared, let's check the target data for imbalances. 

In [645]:
target_train.value_counts(normalize = True)

0    0.7975
1    0.2025
Name: exited, dtype: float64

Only ~20% of the data is positive while the rest is negative. There is some imbalance in the data which will influence the final results. We will attempt to generate a model without correcting the imbalance. If the results are not satisfactory(i.e. worse than the sanity check), we will then correct the imbalance and try again.

## Building and Validating Models

### Logistic Regression

In [646]:
from sklearn.linear_model import LogisticRegression

In [647]:
logistic_model = LogisticRegression(random_state = 12345, solver = 'liblinear', max_iter = 500)

In [648]:
logistic_model.fit(features_train, target_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12345, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

In [649]:
predicted_valid = logistic_model.predict(features_valid)

In [650]:
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

In [651]:
score = f1_score(target_valid, predicted_valid)
print(f'F1 Score for Imbalanced Model, Logistic Regression: {score:.3f}')

F1 Score for Imbalanced Model, Logistic Regression: 0.331


In [652]:
probabilities_valid = logistic_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f'AUC-ROC Score for Imbalanced Model, Logistic Regression: {auc_roc:.3f}')

AUC-ROC Score for Imbalanced Model, Logistic Regression: 0.754


The AUC-ROC score is not perfect but it's higher than 0.5.

### Random Forest Classifier

Let's try the same with a RandomForestClassifier. 

In [653]:
from sklearn.ensemble import RandomForestClassifier

In [654]:
forest_model = RandomForestClassifier(random_state = 12345, n_estimators = 300, max_depth = 15)

In [655]:
forest_model.fit(features_train, target_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=15, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [656]:
predicted_valid = forest_model.predict(features_valid)

In [657]:
score = f1_score(target_valid, predicted_valid)
print(f'F1 Score for Imbalanced Model, Random Forest Classifier: {score:.3f}')

F1 Score for Imbalanced Model, Random Forest Classifier: 0.597


In [658]:
probabilities_valid = forest_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f'AUC-ROC Score for Imbalanced Model, Random Forest Classifier: {auc_roc:.3f}')

AUC-ROC Score for Imbalanced Model, Random Forest Classifier: 0.859


This is much better but still lower than our requisite F1-Score of 0.59. We could make the model deeper but lets try balancing the data to attempt to get better results

## Balancing Data

We will experiment with 2 methods of balancing: Upsampling and Downsampling and compare the results. 

### Upsampling

The ratio of negative to positive samples was approximately 4 : 1. We will upsample positive samples by a factor of 4 to balance the data.

In [659]:
#We will of course need to shuffle the new data
from sklearn.utils import shuffle

In [660]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled

In [661]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

In [662]:
features_upsampled.shape

(9645, 11)

In [663]:
target_upsample.value_counts()

0    4815
1    4740
Name: exited, dtype: int64

In [664]:
logistic_model = LogisticRegression(random_state = 12345, solver = 'liblinear', max_iter = 800) 

In [665]:
logistic_model.fit(features_upsampled, target_upsampled)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=800,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12345, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

In [666]:
predicted_valid = logistic_model.predict(features_valid)

In [667]:
score = f1_score(target_valid, predicted_valid)
print(f'F1 Score for Upsampled Model, Logistic Regression: {score:.3f}')

F1 Score for Upsampled Model, Logistic Regression: 0.479


Much better! Let's see what a forest can do :-D

In [668]:
model_forest = RandomForestClassifier(random_state = 12345, n_estimators = 300, max_depth = 15)

In [669]:
model_forest.fit(features_upsampled, target_upsampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=15, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [670]:
predicted_valid = model_forest.predict(features_valid)

In [671]:
score = f1_score(target_valid, predicted_valid)
print(f'F1 Score for Upsampled Model, Random Forest: {score:.3f}')

F1 Score for Upsampled Model, Random Forest: 0.624


Excellent! The model crosses the expected minimum threshold for the F-1 Score. But life is all aboout improvement. Can we do better?

In [672]:
probabilities_valid = logistic_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:,1]

In [673]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f'AUC-ROC Score for Upsampled Model, Random Forest Classifier: {auc_roc:.3f}')

AUC-ROC Score for Upsampled Model, Random Forest Classifier: 0.758


### Downsampling

Let's try to downsample this time

In [674]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)]
        + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)]
        + [target_ones]
    )

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )

    return features_downsampled, target_downsampled


In [675]:
features_downsampled, target_downsampled = downsample(features,target,0.25)

In [676]:
target_downsampled.value_counts()

1    2037
0    1991
Name: exited, dtype: int64

In [677]:
logistic_model.fit(features_downsampled, target_downsampled)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=800,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12345, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

In [678]:
predicted_valid = logistic_model.predict(features_valid)

In [679]:
score = f1_score(target_valid, predicted_valid)
print(f'F1 Score for Downsampled Model, Logistic Regression: {score:.3f}')

F1 Score for Downsampled Model, Logistic Regression: 0.448


Worse!

In [691]:
model_forest_downsampled = RandomForestClassifier(random_state = 12345, n_estimators = 350, max_depth = 15)

In [693]:
model_forest_downsampled.fit(features_downsampled, target_downsampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=15, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [694]:
predicted_valid = model_forest_downsampled.predict(features_valid)

In [695]:
score = f1_score(target_valid, predicted_valid)
print(f'F1 Score for Downsampled Model, Random Forest: {score:.3f}')

F1 Score for Downsampled Model, Random Forest: 0.252


In [696]:
probabilities_valid = logistic_model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:,1]

In [685]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(f'AUC-ROC Score for Upsampled Model, Random Forest Classifier: {auc_roc:.3f}')

AUC-ROC Score for Upsampled Model, Random Forest Classifier: 0.738


We get a terrible F-1 Score of 0.292. This is likely due to the lack of data, as we only have about 4,000 data points to train the dataset on. Let's test the best fit model on the test set and see if we can get the same F-1 score.

In [697]:
predicted_test = model_forest.predict(features_test) 

In [698]:
score = f1_score(target_test, predicted_test)
print(f'F1 Score for Upsampled Model, Random Forest, Test Data: {score:.3f}')

F1 Score for Upsampled Model, Random Forest, Test Data: 0.596


In [699]:
probabilities_test = model_forest.predict_proba(features_test)
probabilities_one_test = probabilities_test[:,1]

In [700]:
auc_roc = roc_auc_score(target_test, probabilities_one_test)
print(f'AUC-ROC Score for Upsampled Model, Random Forest Classifier: {auc_roc:.3f}')

AUC-ROC Score for Upsampled Model, Random Forest Classifier: 0.846


Fantastic! The model has a F1-score of 0.596 on the test set and AUC-ROC of 0.846.

**Conclusion:**
The best model is the Random Forest Classifier that utilises a dataset balanced with upsampling of the rare target value.

## Conclusions

### The data: 

1. We had historic data on customer churn for customers from a bank which included several fields such as `age`, `balance`, `tenure of fixed deposit`, `esitmated salary` etc.


2. We filled in missing values for the `tenure` field and dropped certain fields such as `surname`, `row_number` and `customer_id` that do not directly impact churn risk


3. We scaled features to avoid undue influence of scale Logistic Regression


4. We attempted to build Logistic Regression and Random Forest Classifier without balancing the data. The F1 score was below the required minimum of 0.59


5. We balanced the dataset using two methods: Upsampling and Downsampling


6. Upsampling provided the best results with a F1 score *0.596* on the test dataset which clears out threshold.