# Background

This bank's customers are leaving slowly but surely. It has been identified that it is cheaper to retain existing customers than to attract new ones. The data provided cover clients' past behavior and their contract terminations with the bank. A model is needed to predict whether a customer is on the verge of leaving the bank.

# Preparing Data

In [1]:
import pandas as pd 
import numpy as np
df = pd.read_csv('Churn.csv')
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [2]:
# Collect the names of the columns that contain at least one missing value
columnsWithNaN = [x for x in df.columns if df[x].isnull().any()]
columnsWithNaN

['Tenure']

We see that the Tenure column has missing (NaN) values that need to be filled before modeling. Tenure only ranges from 0 to 10, so I believe filling the null values with the median is appropriate: it keeps the imputed rows typical of the column without distorting its distribution much.

In [3]:
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())
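
As a minimal sketch of what the median fill above does, the toy series below stands in for the Tenure column. The values are made up for illustration and are not taken from Churn.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Tenure column (illustrative values only)
tenure = pd.Series([2.0, 1.0, 8.0, np.nan, 5.0, np.nan, 7.0])

# Series.median() skips NaN by default, so it uses the 5 known values
median_tenure = tenure.median()

# Count missing values before and after the fill
missing_before = int(tenure.isnull().sum())
filled = tenure.fillna(median_tenure)
missing_after = int(filled.isnull().sum())

print(median_tenure, missing_before, missing_after)
```

Because the median is computed only from the known values, the fill does not depend on how many rows are missing.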

### Prep: One-Hot Encoding and Feature Scaling

In [4]:
print(df['Geography'].value_counts()); print()
print(df['Gender'].value_counts())

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

Male      5457
Female    4543
Name: Gender, dtype: int64


I have identified the categorical features in this data set as Geography and Gender. These columns will need to be converted to dummy variables for the model to understand categorical data.
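
To make the idea of dummy variables concrete, here is a small sketch using `pd.get_dummies` on a toy frame. This is an alternative illustration, not the encoder used in this notebook, and the rows are made up.

```python
import pandas as pd

# Toy frame with the two categorical columns (rows are illustrative)
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
})

# Each category becomes its own indicator column
dummies = pd.get_dummies(toy, columns=['Geography', 'Gender'])
geo_cols = ['Geography_France', 'Geography_Germany', 'Geography_Spain']

print(dummies.columns.tolist())
# Exactly one Geography indicator is set in every row
print(dummies[geo_cols].astype(int).sum(axis=1).tolist())
```

Since the indicators of one category always sum to 1, one of them is redundant, which is why a column per category is dropped later on.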

In [5]:
print("\n----------- Minimum -----------\n")
print(df[['CreditScore', 'Age', 'Tenure', 'Balance','NumOfProducts','EstimatedSalary']].min())
 
print("\n----------- Maximum -----------\n")
print(df[['CreditScore', 'Age', 'Tenure', 'Balance','NumOfProducts','EstimatedSalary']].max())


----------- Minimum -----------

CreditScore        350.00
Age                 18.00
Tenure               0.00
Balance              0.00
NumOfProducts        1.00
EstimatedSalary     11.58
dtype: float64

----------- Maximum -----------

CreditScore           850.00
Age                    92.00
Tenure                 10.00
Balance            250898.09
NumOfProducts           4.00
EstimatedSalary    199992.48
dtype: float64


The ranges of these features differ widely: Balance and EstimatedSalary run into the hundreds of thousands, while NumOfProducts only goes from 1 to 4. Feature scaling is needed here; otherwise the features with smaller numerical values can be heavily overshadowed by the larger ones in scale-sensitive models, such as the logistic regression tested later. This will be done with sklearn's StandardScaler class.
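
As a rough illustration of what standardization does, the sketch below scales the first five CreditScore and Balance values shown in `df.head()` and compares the result with the formula (x - mean) / std written out by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five CreditScore and Balance values from df.head()
X = np.array([
    [619.0, 0.0],
    [608.0, 83807.86],
    [502.0, 159660.80],
    [699.0, 0.0],
    [850.0, 125510.82],
])

scaled = StandardScaler().fit_transform(X)

# Same thing by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0, the NumPy default)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.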

In [6]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define which columns should be encoded vs scaled
columns_to_encode = ['Geography','Gender']
columns_to_scale  = ['CreditScore','Age','Tenure', 'Balance','NumOfProducts','EstimatedSalary']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older releases use sparse=False

# Scale and Encode Separate Columns
scaled_columns  = scaler.fit_transform(df[columns_to_scale]) 
encoded_columns = ohe.fit_transform(df[columns_to_encode]) #purposefully left out Surname 

# Concatenate (Column-Bind) Processed Columns Back Together
df2 = np.concatenate([scaled_columns, encoded_columns], axis=1)

# https://stackoverflow.com/questions/43798377/one-hot-encode-categorical-variables-and-scale-continuous-ones-simultaneouely 

I have converted the categorical columns Geography and Gender into dummy variable columns. The continuous numerical columns CreditScore, Age, Tenure, Balance, NumOfProducts, and EstimatedSalary have been scaled so that they carry comparable weight. The encoded and scaled columns were then concatenated into a new array called "df2", which is turned back into a dataframe in the next cell.

In [7]:
df2 = pd.DataFrame(df2, columns=['CreditScore','Age','Tenure', 'Balance','NumOfProducts','EstimatedSalary',
                                'Geography_France','Geography_Germany','Geography_Spain','Female','Male'])
df2 = df2.drop(['Geography_France', 'Female'], axis=1)  # drop one dummy per category to avoid the dummy variable trap
# Binary columns are copied over unchanged (no scaling needed)
df2['HasCrCard'] = df['HasCrCard']
df2['IsActiveMember'] = df['IsActiveMember']
df2['Exited'] = df['Exited']
df2.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,EstimatedSalary,Geography_Germany,Geography_Spain,Male,HasCrCard,IsActiveMember,Exited
0,-0.326221,0.293517,-1.086246,-1.225848,-0.911583,0.021886,0.0,0.0,0.0,1,1,1
1,-0.440036,0.198164,-1.448581,0.11735,-0.911583,0.216534,0.0,1.0,0.0,0,1,0
2,-1.536794,0.293517,1.087768,1.333053,2.527057,0.240687,0.0,0.0,0.0,1,0,1
3,0.501521,0.007457,-1.448581,-1.225848,0.807737,-0.108918,0.0,0.0,0.0,0,0,0
4,2.063884,0.388871,-1.086246,0.785728,-0.911583,-0.365276,0.0,1.0,0.0,1,1,0


The dataframe df2 was modified to remove the columns Geography_France and Female to prevent the dummy variable trap. The binary columns HasCrCard, IsActiveMember, and Exited were then added unchanged, because they are already 0/1 and scaling them would be inappropriate. Columns that would act as noise to the model, such as Surname, RowNumber, and CustomerId, were left out of the dataframe so they do not hurt accuracy.
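
The same preparation (scale some columns, encode others with one dummy dropped, pass binary columns through, discard the rest) can also be expressed in one step. The sketch below uses sklearn's `ColumnTransformer` on a toy frame; it is an alternative to the manual approach above, not what this notebook ran, and the rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same kinds of columns as Churn.csv (illustrative rows)
toy = pd.DataFrame({
    'CreditScore': [619, 608, 502, 699],
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Female', 'Male', 'Male'],
    'HasCrCard': [1, 0, 1, 0],
    'Surname': ['A', 'B', 'C', 'D'],
})

preprocess = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), ['CreditScore']),
        # drop='first' removes one dummy per category (dummy variable trap)
        ('encode', OneHotEncoder(drop='first'), ['Geography', 'Gender']),
        ('keep', 'passthrough', ['HasCrCard']),
    ],
    remainder='drop',     # Surname and other unlisted columns are discarded
    sparse_threshold=0,   # always return a dense array
)

processed = preprocess.fit_transform(toy)
print(processed.shape)
```

The result has 5 columns: 1 scaled, 2 Geography dummies, 1 Gender dummy, and 1 passed-through binary column.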

### Prep: How is the balance?

In [8]:
df2['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64

In [9]:
2037 / (2037 + 7963)

0.2037

There is a significant class imbalance. The majority class is 0 for 'Exited', where the customer has not left; it makes up about 80% of the dataset. The minority class is 1, where the customer has left; it makes up about 20%. Training on data dominated by 0's will bias the model toward predicting 0.
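
To see why the imbalance matters, the sketch below rebuilds the label counts reported above and scores a hypothetical "model" that always predicts 0. No real classifier is involved; this only illustrates how accuracy can look good while f1 does not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Label counts reported above: 7963 stayed (0), 2037 left (1)
y_true = np.array([0] * 7963 + [1] * 2037)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning: no positive predictions were made
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(baseline_accuracy, baseline_f1)
```

An accuracy near 0.80 is therefore the floor for this dataset, which is why f1 and auc_roc are tracked alongside accuracy in the sections that follow.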

# Training

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

target = df2['Exited']
features = df2.drop('Exited', axis=1)

features_train, features_test, target_train, target_test = train_test_split(features, target, 
                                                                            test_size=0.2, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, 
                                                                       test_size=0.25, random_state=12345) 
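
The two calls above produce a 60/20/20 split: 20% is held out for testing, then 25% of the remaining 80% becomes the validation set. A small sketch with a stand-in array of 10,000 row numbers confirms the sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10000)  # stand-in for the 10,000 rows of the dataset

# First split: hold out 20% for the final test
rest, test = train_test_split(rows, test_size=0.2, random_state=12345)
# Second split: 25% of the remaining 80% becomes validation (0.25 * 0.8 = 0.2)
train, valid = train_test_split(rest, test_size=0.25, random_state=12345)

print(len(train), len(valid), len(test))
```

This prints 6000, 2000, and 2000 rows for training, validation, and test respectively.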

In [11]:
for d in range(10,21,2):
    model = RandomForestClassifier(random_state=12345, max_depth = d, n_estimators = 1000)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    # roc_auc_score receives the hard 0/1 predictions here, not predicted probabilities
    auc_roc = roc_auc_score(target_valid, predicted_valid)

    print('depth: ',d,', training accuracy:', model.score(features_train, target_train), 
          ',validation accuracy:', model.score(features_valid, target_valid))
    print('f1:', f1_score(target_valid, predicted_valid), 'auc:', auc_roc)

depth:  10 , training accuracy: 0.9045 ,validation accuracy: 0.864
f1: 0.548172757475083 auc: 0.696702849540389
depth:  12 , training accuracy: 0.9336666666666666 ,validation accuracy: 0.8615
f1: 0.5436573311367381 auc: 0.6951490894409484
depth:  14 , training accuracy: 0.9706666666666667 ,validation accuracy: 0.863
f1: 0.5537459283387622 auc: 0.7009214472937552
depth:  16 , training accuracy: 0.9945 ,validation accuracy: 0.862
f1: 0.551948051948052 auc: 0.7002999432539789
depth:  18 , training accuracy: 0.9995 ,validation accuracy: 0.8605
f1: 0.5521669341894061 auc: 0.7013037279115716
depth:  20 , training accuracy: 1.0 ,validation accuracy: 0.861
f1: 0.5544871794871795 auc: 0.7025825002900882


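Looping over max_depth with n_estimators = 1000, I found that the validation scores of the imbalanced random forest classifier barely change across depths: f1 stays around 0.55 and auc_roc around 0.70. max_depth = 20 scored highest (f1 = 0.554, auc_roc = 0.7026), with max_depth = 18 essentially tied (f1 = 0.552, auc_roc = 0.7013). Training accuracy climbs to 1.0 while validation accuracy stays near 0.86, so the deeper trees are overfitting.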
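
One note on the auc_roc values in this notebook: they are computed from hard 0/1 predictions. In that case the ROC curve has a single interior point and the area equals the balanced accuracy; `roc_auc_score` is usually fed the predicted probability of class 1 instead. The sketch below illustrates the difference on synthetic data (the dataset and model settings are assumptions for illustration and do not reproduce the scores above).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, roughly 80/20 imbalanced data standing in for the churn features
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=12345)
model.fit(X_train, y_train)

labels = model.predict(X_valid)             # hard 0/1 predictions
probs = model.predict_proba(X_valid)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_valid, labels)
auc_from_probs = roc_auc_score(y_valid, probs)
bal_acc = balanced_accuracy_score(y_valid, labels)

print('AUC from labels:       ', auc_from_labels)
print('balanced accuracy:     ', bal_acc)
print('AUC from probabilities:', auc_from_probs)
```

The first two numbers coincide, while the probability-based AUC measures how well the model ranks customers regardless of the 0.5 decision threshold.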

# Dealing with imbalance 

In [12]:
from sklearn.utils import shuffle
def balance_resample(features, target, direction, factor):
    """Rebalance classes, assuming 0 is the majority class and 1 the minority.

    direction='up' repeats the 1 rows `factor` times (integer);
    direction='down' keeps a random fraction `factor` of the 0 rows.
    """
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    if direction == 'up':
        # this will upsample the count of ones
        features_resample = pd.concat([features_zeros] + [features_ones] * factor)
        target_resample = pd.concat([target_zeros] + [target_ones] * factor)
        features_resample, target_resample = shuffle(features_resample, target_resample, random_state=12345)
    elif direction == 'down':
        # this will downsample the count of zeroes
        features_resample = pd.concat([features_zeros.sample(frac=factor, random_state=12345)] + [features_ones])
        target_resample = pd.concat([target_zeros.sample(frac=factor, random_state=12345)] + [target_ones])
    else:
        # fail loudly instead of returning undefined variables
        raise ValueError("direction must be 'up' or 'down'")
    return features_resample, target_resample

Here is a function for both upsampling and downsampling. I am aware that it assumes classification 0 is the majority and classification 1 is the minority.
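
To show the two resampling ideas in isolation, here is a self-contained sketch on toy data with 8 majority rows and 2 minority rows. The data and the chosen factor and fraction are made up for illustration.

```python
import pandas as pd
from sklearn.utils import shuffle

# Toy data: 8 majority rows (0) and 2 minority rows (1)
features = pd.DataFrame({'x': range(10)})
target = pd.Series([0] * 8 + [1] * 2)

# Upsampling: repeat the minority rows 4 times, then shuffle
# features and labels together so they stay aligned
up_features = pd.concat([features[target == 0]] + [features[target == 1]] * 4)
up_target = pd.concat([target[target == 0]] + [target[target == 1]] * 4)
up_features, up_target = shuffle(up_features, up_target, random_state=12345)

# Downsampling: keep a random 25% of the majority rows; the same
# random_state picks the same row positions in features and target
down_features = pd.concat([features[target == 0].sample(frac=0.25, random_state=12345),
                           features[target == 1]])
down_target = pd.concat([target[target == 0].sample(frac=0.25, random_state=12345),
                         target[target == 1]])

print(up_target.value_counts().to_dict())
print(down_target.value_counts().to_dict())
```

Upsampling ends with 8 rows of each class and downsampling with 2 of each; only the second approach throws data away.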

### Dealing with imbalance: Upsampling

In [13]:
features_upsampled_t, target_upsampled_t = balance_resample(features_train, target_train, 'up', 4)

for depth in range(8,17,2):
    model = RandomForestClassifier(random_state=12345, max_depth = depth, n_estimators = 1000)
    model.fit(features_upsampled_t, target_upsampled_t)
    predicted_valid = model.predict(features_valid)
    auc_roc = roc_auc_score(target_valid, predicted_valid)  # reuse the predictions instead of predicting twice

    print('       d:',depth, '-------------')
    print('training accuracy:', model.score(features_upsampled_t, target_upsampled_t), 
          ',validation accuracy:', model.score(features_valid, target_valid))
    print('f1:', f1_score(target_valid, predicted_valid), 'Area under curve score:', auc_roc)

       d: 8 -------------
training accuracy: 0.8486072279175727 ,validation accuracy: 0.805
f1: 0.5788336933045356 Area under curve score: 0.7597402081323248
       d: 10 -------------
training accuracy: 0.9046287666977322 ,validation accuracy: 0.824
f1: 0.5999999999999999 Area under curve score: 0.7676767034535597
       d: 12 -------------
training accuracy: 0.9715232473853164 ,validation accuracy: 0.835
f1: 0.5975609756097561 Area under curve score: 0.756120861077157
       d: 14 -------------
training accuracy: 0.9953401677539608 ,validation accuracy: 0.844
f1: 0.5948051948051948 Area under curve score: 0.7462260716970875
       d: 16 -------------
training accuracy: 0.999585792689241 ,validation accuracy: 0.848
f1: 0.5880758807588076 Area under curve score: 0.7370958435526507


In [14]:
target_upsampled_t.value_counts()

1    4876
0    4781
Name: Exited, dtype: int64

For upsampling, the observations classified as 1 were repeated 4 times, which brings the two classes close to parity in the training set (4876 ones vs. 4781 zeros). max_depth = 10 achieved the highest f1 (0.600) and auc_roc (0.768), with max_depth = 12 close behind (f1 = 0.598, auc_roc = 0.756). Both are better than the scores of the imbalanced model.

### Dealing with imbalance: Downsampling

In [15]:
features_downsampled_t, target_downsampled_t = balance_resample(features_train, target_train, 'down', 0.6)

for depth in range(18,27,2):
    model = RandomForestClassifier(random_state=12345, max_depth = depth, n_estimators = 1000)
    model.fit(features_downsampled_t, target_downsampled_t)
    predicted_valid = model.predict(features_valid)
    # AUC is computed from the hard 0/1 predictions, reusing predicted_valid
    auc_roc = roc_auc_score(target_valid, predicted_valid)

    print('      d:',depth,'---------------')
    print('training accuracy:', model.score(features_downsampled_t, target_downsampled_t), 
          '| validation accuracy:', model.score(features_valid, target_valid))
    print('f1:', f1_score(target_valid, predicted_valid), '| Area under curve score:', auc_roc)

      d: 18 ---------------
training accuracy: 1.0 | validation accuracy: 0.851
f1: 0.5872576177285319 | Area under curve score: 0.7341202538788368
      d: 20 ---------------
training accuracy: 1.0 | validation accuracy: 0.8515
f1: 0.5846153846153845 | Area under curve score: 0.7315269448228395
      d: 22 ---------------
training accuracy: 1.0 | validation accuracy: 0.851
f1: 0.5872576177285319 | Area under curve score: 0.7341202538788368
      d: 24 ---------------
training accuracy: 1.0 | validation accuracy: 0.853
f1: 0.5916666666666667 | Area under curve score: 0.7363312823170178
      d: 26 ---------------
training accuracy: 1.0 | validation accuracy: 0.852
f1: 0.5877437325905293 | Area under curve score: 0.7337737375599848


In [16]:
target_downsampled_t.value_counts()

0    2869
1    1219
Name: Exited, dtype: int64

To achieve an f1 score greater than 0.59, I kept 60% of the customers classified as 0 (a 40% cut). The best score came from max_depth = 24 and n_estimators = 1000. While this did achieve the required score, downsampling seems to be a poor move overall because it removes rows, and a significant class imbalance still remains: observations of 0 outnumber 1's by more than two to one (2869 vs. 1219). I will run the downsampling process again to achieve a closer balance.

In [17]:
features_downsampled_t, target_downsampled_t = balance_resample(features_train, target_train, 'down', 0.27)

for depth in range(18,27,2):
    model = RandomForestClassifier(random_state=12345, max_depth = depth, n_estimators = 1000)
    model.fit(features_downsampled_t, target_downsampled_t)
    predicted_valid = model.predict(features_valid)
    # AUC is computed from the hard 0/1 predictions, reusing predicted_valid
    auc_roc = roc_auc_score(target_valid, predicted_valid)

    print('      d:',depth,'---------------')
    print('training accuracy:', model.score(features_downsampled_t, target_downsampled_t), 
          '| validation accuracy:', model.score(features_valid, target_valid))
    print('f1:', f1_score(target_valid, predicted_valid), '| Area under curve score:', auc_roc)

      d: 18 ---------------
training accuracy: 1.0 | validation accuracy: 0.7765
f1: 0.558736426456071 | Area under curve score: 0.7565476483781288
      d: 20 ---------------
training accuracy: 1.0 | validation accuracy: 0.774
f1: 0.557729941291585 | Area under curve score: 0.7569299289959451
      d: 22 ---------------
training accuracy: 1.0 | validation accuracy: 0.773
f1: 0.5557729941291585 | Area under curve score: 0.7553404045975404
      d: 24 ---------------
training accuracy: 1.0 | validation accuracy: 0.772
f1: 0.5546875 | Area under curve score: 0.7547189005577641
      d: 26 ---------------
training accuracy: 1.0 | validation accuracy: 0.7725
f1: 0.555229716520039 | Area under curve score: 0.7550296525776523


In [18]:
target_downsampled_t.value_counts()

0    1291
1    1219
Name: Exited, dtype: int64

While there is now a closer balance between observations of 0's and 1's, the f1 score has taken a hit (about 0.56, down from 0.59) and validation accuracy fell from about 0.85 to 0.77, although the area-under-curve score rose slightly (about 0.76 vs. 0.73). The drop is most likely the cost of deleting data to balance the classes. Overall, both downsampling attempts have proven to be poor.
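
The class counts after each downsampling run follow directly from the fraction passed in. The sketch below reproduces them from the training counts shown in the value_counts outputs (4781 zeros, 1219 ones), assuming the sample size is the rounded product of the fraction and the row count, which matches the outputs above. The helper name `zeros_kept` is hypothetical.

```python
# Class counts in the training split, taken from the value_counts outputs above
n_zeros = 4781
n_ones = 1219

def zeros_kept(frac):
    """Majority rows left after sampling a fraction `frac` of them (hypothetical helper)."""
    return round(n_zeros * frac)

for frac in (0.6, 0.27):
    kept = zeros_kept(frac)
    # fraction, zeros kept, and the resulting ratio of zeros to ones
    print(frac, kept, round(kept / n_ones, 2))

# Fraction that would make the two classes exactly equal in size
balanced_frac = n_ones / n_zeros
print(round(balanced_frac, 3))
```

With frac = 0.27 the classes are nearly even, but only about half of the original 6000 training rows survive, which is consistent with the weaker f1 seen above.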

### Dealing with imbalance: Logistic Regression Parameter Balance

Another class balancing approach I wanted to test was the class_weight parameter of a logistic regression model. Below are the results before and after setting class_weight to 'balanced'.

#### Before class weight balance

In [19]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
auc_roc = roc_auc_score(target_valid, predicted_valid)  # reuse the predictions instead of predicting twice

print('training accuracy:', model.score(features_train, target_train), 
      ',validation accuracy:', model.score(features_valid, target_valid))
print('f1:', f1_score(target_valid, predicted_valid), 'Area under curve score:', auc_roc)

training accuracy: 0.813 ,validation accuracy: 0.8145
f1: 0.30131826741996237 Area under curve score: 0.5836566690880421


#### After class weight balance

In [20]:
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
auc_roc = roc_auc_score(target_valid, predicted_valid)  # reuse the predictions instead of predicting twice

print('training accuracy:', model.score(features_train, target_train), 
      ',validation accuracy:', model.score(features_valid, target_valid))
print('f1:', f1_score(target_valid, predicted_valid), 'Area under curve score:', auc_roc)

training accuracy: 0.7115 ,validation accuracy: 0.705
f1: 0.4741532976827095 Area under curve score: 0.6956537634374419


The original logistic regression performed very poorly, with an f1 score of 0.30 and an auc_roc score of 0.58.

With class_weight='balanced', the f1 score improves to 0.47 and the auc_roc score to 0.70. Unfortunately, the f1 score is still well below the other class balancing approaches we have seen.
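
For reference, `class_weight='balanced'` weights each class by n_samples / (n_classes * class_count). The sketch below computes those weights for labels with the same class counts as the training split (4781 zeros, 1219 ones) and checks them against sklearn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same class counts as the training split
y = np.array([0] * 4781 + [1] * 1219)

# 'balanced' weights: n_samples / (n_classes * count of each class)
manual = len(y) / (2 * np.bincount(y))
from_sklearn = compute_class_weight(class_weight='balanced',
                                    classes=np.array([0, 1]), y=y)

print(manual.round(3), from_sklearn.round(3))
```

Each minority observation counts roughly four times as much as a majority one in the loss, which has a similar intent to upsampling by a factor of 4 without duplicating rows.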

# The Test 

Out of all the balancing approaches, **upsampling** performed best, with the highest f1 and area-under-curve ROC scores on the validation set. The upsampled model with max_depth = 12 and n_estimators = 1000 will be used with the test data set.

In [21]:
features_upsampled_t, target_upsampled_t = balance_resample(features_train, target_train, 'up', 4)

model = RandomForestClassifier(random_state=12345, max_depth = 12, n_estimators = 1000)
model.fit(features_upsampled_t, target_upsampled_t)

predicted_test = model.predict(features_test)
auc_roc = roc_auc_score(target_test, predicted_test)  # reuse the predictions instead of predicting twice

print('      [Test] accuracy of', model.score(features_test, target_test))
print('f1:', f1_score(target_test, predicted_test), ', Area under curve score:', auc_roc)

      [Test] accuracy of 0.8405
f1: 0.6345933562428407 , Area under curve score: 0.7706369636324927


With the parameters from upsampling, the model scored f1 = 0.635 and auc_roc = 0.77 on the test set. This is noticeably better than the imbalanced model, which reached f1 = 0.55 and auc_roc = 0.70 on the validation set. The model is now better at identifying when a customer is about to leave the bank.