# Predicting the Upcoming Bank Client Churn

A considerable number of clients have started leaving the bank that provided the data. Retaining them is worthwhile because, as it turns out, it is more cost-efficient than running a new advertising campaign to rebuild the client base.

## Overview

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

data = pd.read_csv("/Churn.csv")
print(data.shape)

(10000, 14)


In [2]:
data.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


So there are 10000 rows and 14 columns. Three columns (RowNumber, CustomerId, Surname) are irrelevant for the future model, and one column (Tenure) has missing values that should be filled. There is no need to change the data types of the columns.

In [4]:
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)

This removes the columns that are irrelevant for machine learning. They appeared when the data were extracted from the source database and may be needed to identify a person, but not to predict churn.

In [5]:
data

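| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2.0 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8.0 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1.0 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 771 | France | Male | 39 | 5.0 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 516 | France | Male | 35 | 10.0 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 709 | France | Female | 36 | 7.0 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 772 | Germany | Male | 42 | 3.0 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |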


There are 11 columns and 10000 rows now.

## Data Preparation

In [6]:
# One-hot encode the categorical columns; drop_first=True avoids the dummy variable trap
data = pd.get_dummies(data, drop_first=True)
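To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical toy frame (the rows are illustrative, not taken from the bank data). The first category of each column, in alphabetical order, is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with the same categorical columns as the bank data (values are illustrative).
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 35],
})

# drop_first=True drops the first category of each encoded column,
# so "France" and "Female" are represented by all-zero dummy columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A French female client therefore has zeros in `Geography_Germany`, `Geography_Spain` and `Gender_Male`, which is why no separate `Geography_France` or `Gender_Female` column is needed.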

In [7]:
print(np.isnan(data).sum().sort_values(ascending=False))

Tenure               909
Gender_Male            0
Geography_Spain        0
Geography_Germany      0
Exited                 0
EstimatedSalary        0
IsActiveMember         0
HasCrCard              0
NumOfProducts          0
Balance                0
Age                    0
CreditScore            0
dtype: int64


There are 909 NaN values in *Tenure*.

In [8]:
data['Tenure'].value_counts(dropna=False).sort_index(ascending = False)

10.0    446
9.0     882
8.0     933
7.0     925
6.0     881
5.0     927
4.0     885
3.0     928
2.0     950
1.0     952
0.0     382
NaN     909
Name: Tenure, dtype: int64

Tenure is treated here as the number of properties a client owns, so I assume that a missing value means the client owns none.

In [9]:
data['Tenure'] = data['Tenure'].fillna(0) #filling NA with zeros
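One consequence of this choice is that the filled values become indistinguishable from genuine zeros. With the counts printed above, the zero group would grow from 382 to 382 + 909 = 1291 rows. A small sketch on a hypothetical series shows the same effect:

```python
import numpy as np
import pandas as pd

# Hypothetical Tenure values: one genuine zero and two missing entries.
tenure = pd.Series([2.0, np.nan, 0.0, 8.0, np.nan], name="Tenure")
filled = tenure.fillna(0)

# Missing values before and after the fill.
print(int(tenure.isna().sum()), int(filled.isna().sum()))
# Zeros before and after the fill: the former NaNs now count as zeros.
print(int((tenure == 0).sum()), int((filled == 0).sum()))
```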

In [10]:
print(np.isnan(data).sum())

CreditScore          0
Age                  0
Tenure               0
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
Geography_Germany    0
Geography_Spain      0
Gender_Male          0
dtype: int64


There are no missing values left in the dataframe.

## Exploring the Task

In [11]:
target = data['Exited']
features = data.drop('Exited', axis=1)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.5, random_state=12345)

<p>Defining the features. 'Exited' is the target variable.</p>
<p>I split the rows into training, validation and test sets: 60%, 20%, and 20% of the whole dataset, respectively.</p>
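The 60/20/20 proportions follow from applying `train_test_split` twice: the first call sets 40% aside, and the second call halves that part. A minimal sketch on hypothetical dummy arrays of 100 rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows so that the resulting sizes read directly as percentages.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# First split: 60% for training, 40% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)
# Second split: the held-back 40% is halved into validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

print(len(X_train), len(X_valid), len(X_test))
```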

In [12]:
print("Set Sizes")
print("Training Set:    ", features_train.shape, target_train.shape)
print("Validation Set:  ", features_valid.shape, target_valid.shape)
print("Test Set:        ", features_test.shape, target_test.shape)

Set Sizes
Training Set:     (6000, 11) (6000,)
Validation Set:   (2000, 11) (2000,)
Test Set:         (2000, 11) (2000,)


In [13]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

accuracy = accuracy_score(target_valid, predicted_valid)
F1 = f1_score(target_valid, predicted_valid)

probabilities = model.predict_proba(features_valid)
probabilities_one = probabilities[:, 1]
ROCAUC = roc_auc_score(target_valid, probabilities_one)

print("Decision Tree Classifier:")
print("Accuracy =", '{:.4f}'.format(round(accuracy,4)))
print("F1 Score =", '{:.4f}'.format(round(F1,4)))
print("ROC-AUC  =", '{:.4f}'.format(round(ROCAUC,4)))

Decision Tree Classifier:
Accuracy = 0.7870
F1 Score = 0.4818
ROC-AUC  = 0.6717


In [14]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

accuracy = accuracy_score(target_valid, predicted_valid)
F1 = f1_score(target_valid, predicted_valid)

probabilities = model.predict_proba(features_valid)
probabilities_one = probabilities[:, 1]
ROCAUC = roc_auc_score(target_valid, probabilities_one)

print("Unbalanced Logistic Regression:")
print("Accuracy =", '{:.4f}'.format(round(accuracy,4)))
print("F1 Score =", '{:.4f}'.format(round(F1,4)))
print("ROC-AUC  =", '{:.4f}'.format(round(ROCAUC,4)))

Unbalanced Logistic Regression:
Accuracy = 0.7815
F1 Score = 0.0839
ROC-AUC  = 0.6728


In [15]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=12345, class_weight = 'balanced', solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

accuracy = accuracy_score(target_valid, predicted_valid)
F1 = f1_score(target_valid, predicted_valid)

probabilities = model.predict_proba(features_valid)
probabilities_one = probabilities[:, 1]
ROCAUC = roc_auc_score(target_valid, probabilities_one)

print("Balanced Logistic Regression:")
print("Accuracy =", '{:.4f}'.format(round(accuracy,4)))
print("F1 Score =", '{:.4f}'.format(round(F1,4)))
print("ROC-AUC  =", '{:.4f}'.format(round(ROCAUC,4)))

Balanced Logistic Regression:
Accuracy = 0.7010
F1 Score = 0.4975
ROC-AUC  = 0.7565


The further *ROC-AUC* rises above 0.5, the less random the predictions are, so the balanced logistic regression predicts less randomly than the unbalanced logistic regression and the decision tree. Its *F1*, the harmonic mean of precision and recall, is also the highest of the three, which indicates better prediction quality even though its *Accuracy* is lower.
<p>I can now conclude that the classes are imbalanced. Let's look at the random forest.</p>
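Why accuracy can mislead on imbalanced classes is easiest to see on a tiny example. The labels below are hypothetical: a "model" that never predicts churn reaches the same accuracy as one that actually finds a churner, but its F1 is zero.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 8 clients stayed (0) and 2 left (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_stay = [0] * 10                       # never predicts churn
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]     # one churner found, one false alarm

# The constant prediction is right 8 times out of 10 but finds no churners.
acc_const = accuracy_score(y_true, always_stay)
f1_const = f1_score(y_true, always_stay, zero_division=0)

# F1 is the harmonic mean of precision and recall.
acc_hit = accuracy_score(y_true, one_hit)
precision = precision_score(y_true, one_hit)
recall = recall_score(y_true, one_hit)
f1_hit = f1_score(y_true, one_hit)
harmonic = 2 * precision * recall / (precision + recall)

print(acc_const, f1_const)
print(acc_hit, f1_hit, harmonic)
```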

In [16]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=12345, n_estimators = 10)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

accuracy = accuracy_score(target_valid, predicted_valid)
F1 = f1_score(target_valid, predicted_valid)

probabilities = model.predict_proba(features_valid)
probabilities_one = probabilities[:, 1]
ROCAUC = roc_auc_score(target_valid, probabilities_one)

print("Random Forest Classifier:")
print("Accuracy =", '{:.4f}'.format(round(accuracy,4)))
print("F1 Score =", '{:.4f}'.format(round(F1,4)))
print("ROC-AUC  =", '{:.4f}'.format(round(ROCAUC,4)))

Random Forest Classifier:
Accuracy = 0.8530
F1 Score = 0.5559
ROC-AUC  = 0.8131


Random Forest shows the best result on all three metrics with just 10 estimators (the default in scikit-learn versions before 0.22).

Although the classes are still imbalanced, the Random Forest Classifier is clearly superior, so I will use this model for the rest of the study.

## Fighting Imbalanced Classes

In [17]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
 
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    return features_upsampled, target_upsampled

In [18]:
for repeat in range(1,11,1):
    features_upsampled, target_upsampled = upsample(features_train, target_train, repeat)
    model = RandomForestClassifier(random_state=12345, n_estimators = 10)
    model.fit(features_upsampled, target_upsampled)
    predicted_valid = model.predict(features_valid) 
    F1 = f1_score(target_valid, predicted_valid)
    print("Repeat =", str(repeat).zfill(2), "F1:", '{:.4f}'.format(round(F1, 4)))

Repeat = 01 F1: 0.5542
Repeat = 02 F1: 0.5839
Repeat = 03 F1: 0.5787
Repeat = 04 F1: 0.5848
Repeat = 05 F1: 0.5753
Repeat = 06 F1: 0.5661
Repeat = 07 F1: 0.5729
Repeat = 08 F1: 0.5676
Repeat = 09 F1: 0.5722
Repeat = 10 F1: 0.5508


Upsampling the smaller class balances the dataset best when its rows are repeated 4 times (the highest F1 score).
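To see what `repeat` does to the class counts, the sketch below applies the same `upsample` logic to a hypothetical toy set with four rows of class 0 and one row of class 1. Only the training data is changed this way; the validation set keeps its natural class balance.

```python
import pandas as pd

def upsample(features, target, repeat):
    """Repeat the rows of class 1 `repeat` times and keep class 0 unchanged."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    return features_upsampled, target_upsampled

# Hypothetical toy data: four clients stayed, one left.
toy_features = pd.DataFrame({"Age": [30, 40, 50, 60, 70]})
toy_target = pd.Series([0, 0, 0, 0, 1])

counts_by_repeat = {}
for repeat in (1, 2, 4):
    _, target_up = upsample(toy_features, toy_target, repeat)
    counts_by_repeat[repeat] = target_up.value_counts().to_dict()

print(counts_by_repeat)
```

With `repeat=4` the toy classes end up with four rows each, which is the kind of balance the loop above searches for on the real training set.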

In [19]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

In [20]:
for estim in range(10, 51, 5):
    for depth in range(6,12,1):
        model = RandomForestClassifier(n_estimators=estim, max_depth=depth, random_state=12345)
        model.fit(features_upsampled, target_upsampled)
        predicted_valid = model.predict(features_valid)
        
        accuracy = accuracy_score(target_valid, predicted_valid)
        F1 = f1_score(target_valid, predicted_valid)
        probabilities = model.predict_proba(features_valid)
        probabilities_one = probabilities[:, 1]
        ROCAUC = roc_auc_score(target_valid, probabilities_one)
        
        print("n_estimators =", str(estim).zfill(2),
              "max_depth =", str(depth).zfill(2),
              "Accuracy =", '{:.4f}'.format(round(accuracy, 4)),
              "ROC-AUC =",'{:.4f}'.format(round(ROCAUC, 4)),
              "F1 Score =", '{:.4f}'.format(round(F1, 4)))

n_estimators = 10 max_depth = 06 Accuracy = 0.8105 ROC-AUC = 0.8511 F1 Score = 0.6152
n_estimators = 10 max_depth = 07 Accuracy = 0.8035 ROC-AUC = 0.8496 F1 Score = 0.5977
n_estimators = 10 max_depth = 08 Accuracy = 0.8015 ROC-AUC = 0.8446 F1 Score = 0.5928
n_estimators = 10 max_depth = 09 Accuracy = 0.8035 ROC-AUC = 0.8392 F1 Score = 0.5876
n_estimators = 10 max_depth = 10 Accuracy = 0.8175 ROC-AUC = 0.8384 F1 Score = 0.5967
n_estimators = 10 max_depth = 11 Accuracy = 0.8280 ROC-AUC = 0.8355 F1 Score = 0.6143
n_estimators = 15 max_depth = 06 Accuracy = 0.8080 ROC-AUC = 0.8506 F1 Score = 0.6113
n_estimators = 15 max_depth = 07 Accuracy = 0.8005 ROC-AUC = 0.8491 F1 Score = 0.5933
n_estimators = 15 max_depth = 08 Accuracy = 0.8135 ROC-AUC = 0.8490 F1 Score = 0.6221
n_estimators = 15 max_depth = 09 Accuracy = 0.8085 ROC-AUC = 0.8429 F1 Score = 0.5947
n_estimators = 15 max_depth = 10 Accuracy = 0.8180 ROC-AUC = 0.8391 F1 Score = 0.6043
n_estimators = 15 max_depth = 11 Accuracy = 0.8300 ROC

With 50 estimators and a max depth of 8, the model's *Accuracy score* of 0.8245 is only the third best (after 0.8265 and 0.826), but these hyperparameters give the highest *ROC-AUC* and *F1 Score*.
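Reading the best combination off a long printout is easy to get wrong. One possible way to pick it programmatically is to store every result and take the maximum. The sketch below does this on synthetic data from `make_classification` with a much smaller grid, both of which are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data (an assumption for this sketch).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

results = []
for estim in (10, 20):
    for depth in (4, 8):
        model = RandomForestClassifier(n_estimators=estim, max_depth=depth,
                                       random_state=12345)
        model.fit(X_train, y_train)
        f1 = f1_score(y_valid, model.predict(X_valid), zero_division=0)
        # Keep every combination together with its validation score.
        results.append({"n_estimators": estim, "max_depth": depth, "f1": f1})

# The combination with the highest validation F1.
best = max(results, key=lambda row: row["f1"])
print(best)
```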

In [21]:
from sklearn.utils import shuffle

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    

    return features_downsampled, target_downsampled
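As a quick check of the function above, the sketch below runs the same `downsample` logic on a hypothetical toy set with ten rows of class 0 and two rows of class 1. Because features and target are sampled and shuffled with the same `random_state`, their rows stay aligned.

```python
import pandas as pd
from sklearn.utils import shuffle

def downsample(features, target, fraction):
    """Keep a random `fraction` of class 0 rows and all class 1 rows."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345), features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345), target_ones])

    return shuffle(features_downsampled, target_downsampled, random_state=12345)

# Hypothetical toy data: ten clients stayed, two left.
toy_features = pd.DataFrame({"Age": range(20, 32)})
toy_target = pd.Series([0] * 10 + [1] * 2)

features_down, target_down = downsample(toy_features, toy_target, 0.5)
counts = target_down.value_counts().to_dict()
print(counts)
```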

In [22]:
for repeat in np.arange(0.9, 1.01, 0.01):
    features_downsampled, target_downsampled = downsample(features_upsampled, target_upsampled, repeat)
    model = RandomForestClassifier(random_state=12345, n_estimators=50, max_depth = 8)
    model.fit(features_downsampled, target_downsampled)
    predicted_valid = model.predict(features_valid)
    
    accuracy = accuracy_score(target_valid, predicted_valid)
    probabilities = model.predict_proba(features_valid)
    probabilities_one = probabilities[:, 1]
    ROCAUC = roc_auc_score(target_valid, probabilities_one)
    F1 = f1_score(target_valid, predicted_valid)
    print("Repeat =", '{:.2f}'.format(round(repeat, 2)),
          "Accuracy =", '{:.4f}'.format(round(accuracy,4)),
          "ROC-AUC =", '{:.4f}'.format(round(ROCAUC, 4)), 
          "F1 =", '{:.4f}'.format(round(F1, 4)))

Repeat = 0.90 Accuracy = 0.8035 ROC-AUC = 0.8524 F1 = 0.6090
Repeat = 0.91 Accuracy = 0.8005 ROC-AUC = 0.8491 F1 = 0.6038
Repeat = 0.92 Accuracy = 0.8065 ROC-AUC = 0.8552 F1 = 0.6126
Repeat = 0.93 Accuracy = 0.8145 ROC-AUC = 0.8550 F1 = 0.6203
Repeat = 0.94 Accuracy = 0.8065 ROC-AUC = 0.8554 F1 = 0.6111
Repeat = 0.95 Accuracy = 0.8105 ROC-AUC = 0.8559 F1 = 0.6183
Repeat = 0.96 Accuracy = 0.8115 ROC-AUC = 0.8511 F1 = 0.6133
Repeat = 0.97 Accuracy = 0.8135 ROC-AUC = 0.8546 F1 = 0.6198
Repeat = 0.98 Accuracy = 0.8105 ROC-AUC = 0.8522 F1 = 0.6129
Repeat = 0.99 Accuracy = 0.8135 ROC-AUC = 0.8519 F1 = 0.6182
Repeat = 1.00 Accuracy = 0.8105 ROC-AUC = 0.8545 F1 = 0.6137


Accuracy and F1 are best with repeat = 0.93. ROC-AUC at that setting (0.8550) is only slightly below the best values and still higher than in most of the other iterations.

I need the following hyperparameters in order to improve the quality of the model:

- n_estimators = 50 
- max_depth = 8

For the right class balance, I need to upsample the small class 4 times and downsample the large one to a fraction of 0.93.
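The combined effect of both steps on the class balance can be estimated with simple arithmetic. The counts below are hypothetical, since the notebook does not print the per-class sizes of the training set, and `rebalanced_counts` is a helper introduced only for this illustration.

```python
def rebalanced_counts(n_zeros, n_ones, repeat, fraction):
    """Class sizes after repeating class 1 `repeat` times
    and keeping a `fraction` of class 0."""
    ones_after = n_ones * repeat
    zeros_after = round(n_zeros * fraction)
    return zeros_after, ones_after

# Hypothetical training set: 4800 clients stayed, 1200 left.
zeros_after, ones_after = rebalanced_counts(4800, 1200, repeat=4, fraction=0.93)
share_of_ones = ones_after / (zeros_after + ones_after)

print(zeros_after, ones_after, round(share_of_ones, 3))
```

Under these assumed counts the two classes end up close to an even split, with the churned class slightly in the majority.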

## Test

In [23]:
features = pd.concat([features_train, features_valid])
target = pd.concat([target_train, target_valid]) #concatenating training and validation sets

features_upsampled, target_upsampled = upsample(features, target, 4)
features_downsampled, target_downsampled = downsample(features_upsampled, target_upsampled, 0.93) #upsampling + downsampling

The training and validation sets are joined and then rebalanced with the chosen upsampling and downsampling settings before the final model is built.

In [24]:
model = RandomForestClassifier(random_state=12345, n_estimators=50, max_depth = 8)
model.fit(features_downsampled, target_downsampled)
predicted_test = model.predict(features_test)

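# Note: the probabilities below are computed on `features` (training + validation data),
# not on `features_test`, so the ROC-AUC printed here is not a test-set score.
# For a test-set ROC-AUC, pass features_test here and target_test to roc_auc_score.
probabilities = model.predict_proba(features)
probabilities_one = probabilities[:, 1]
accuracy = accuracy_score(target_test, predicted_test)
ROCAUC = roc_auc_score(target, probabilities_one)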
F1 = f1_score(target_test, predicted_test)

print("Accuracy =", '{:.4f}'.format(round(accuracy, 4)),
      "ROC-AUC =",  '{:.4f}'.format(round(ROCAUC, 4)), 
      "F1 =", '{:.4f}'.format(round(F1, 4)))

Accuracy = 0.8040 ROC-AUC = 0.9168 F1 = 0.6164


The Accuracy and F1 obtained on the test set are similar to the validation results. The model works.
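When several metrics are reported side by side, it helps to compute all of them from one and the same dataset. The helper below is a sketch of that idea: `evaluate` is a hypothetical function, and the data comes from `make_classification` rather than from the bank file. Scores measured on the data a model was fitted on are typically optimistic, so the test-set figures are the ones to report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, features, target):
    """Compute accuracy, F1 and ROC-AUC on a single (features, target) pair."""
    predicted = model.predict(features)
    probabilities_one = model.predict_proba(features)[:, 1]
    return {
        "accuracy": accuracy_score(target, predicted),
        "f1": f1_score(target, predicted, zero_division=0),
        "roc_auc": roc_auc_score(target, probabilities_one),
    }

# Synthetic, imbalanced stand-in for the bank data (an assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_fit, X_test, y_fit, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

model = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=12345)
model.fit(X_fit, y_fit)

fit_metrics = evaluate(model, X_fit, y_fit)      # data the model has already seen
test_metrics = evaluate(model, X_test, y_test)   # held-out data
print(fit_metrics)
print(test_metrics)
```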

## Conclusion

On the test set the model almost reaches the validation results: Accuracy is 0.804 and F1 is 0.6164, compared with 0.8145 and 0.6203 during validation. The higher ROC-AUC (0.9168) is not a like-for-like comparison, because in the cell above it is computed on `features` and `target`, the combined training and validation data the model was fitted on, rather than on the test set.