<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-preparation" data-toc-modified-id="Data-preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data preparation</a></span></li><li><span><a href="#Problem-exploration" data-toc-modified-id="Problem-exploration-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Problem exploration</a></span></li><li><span><a href="#Fighting-imbalance" data-toc-modified-id="Fighting-imbalance-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Fighting imbalance</a></span></li><li><span><a href="#Model-testing" data-toc-modified-id="Model-testing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model testing</a></span></li></ul></div>

# Customer churn

Clients began to leave Beta Bank. Every month. A little, but noticeable. Bank marketers have calculated that it is cheaper to retain current customers than to attract new ones.

It is necessary to predict whether the client will leave the bank in the near future or not. You are provided with historical data on customer behavior and termination of contracts with the bank.

Build a model with a large value of the *F1*-measure. To pass the project successfully, you need to bring the metric to 0.59. Check the *F1*-measure on the test sample yourself.

Additionally, measure *AUC-ROC*, compare its value with the *F1*-measure.

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

## Data preparation

In [74]:
import pandas as pd

data = pd.read_csv('/datasets/Churn.csv')

In [75]:
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [76]:
df = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

In [77]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           9091 non-null   float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


In [78]:
df.isna().sum()

CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

In [79]:
df['Tenure'].fillna(df['Tenure'].median(), inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


In [80]:
df.columns = df.columns.str.lower()
df.rename(columns={'numofproducts': 'num_of_products',
                   'hascrcard': 'has_credit_card',
                   'isactivemember': 'is_active_member',
                   'estimatedsalary': 'estimated_salary'}, inplace=True)
df.columns

Index(['creditscore', 'geography', 'gender', 'age', 'tenure', 'balance',
       'num_of_products', 'has_credit_card', 'is_active_member',
       'estimated_salary', 'exited'],
      dtype='object')

In [81]:
df.duplicated().sum()

0

In [82]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
creditscore,10000.0,650.5288,96.653299,350.0,584.0,652.0,718.0,850.0
age,10000.0,38.9218,10.487806,18.0,32.0,37.0,44.0,92.0
tenure,10000.0,4.9979,2.76001,0.0,3.0,5.0,7.0,10.0
balance,10000.0,76485.889288,62397.405202,0.0,0.0,97198.54,127644.24,250898.09
num_of_products,10000.0,1.5302,0.581654,1.0,1.0,1.0,2.0,4.0
has_credit_card,10000.0,0.7055,0.45584,0.0,0.0,1.0,1.0,1.0
is_active_member,10000.0,0.5151,0.499797,0.0,0.0,1.0,1.0,1.0
estimated_salary,10000.0,100090.239881,57510.492818,11.58,51002.11,100193.915,149388.2475,199992.48
exited,10000.0,0.2037,0.402769,0.0,0.0,0.0,0.0,1.0


In [83]:
df['gender'].unique()

array(['Female', 'Male'], dtype=object)

In [84]:
df['geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [85]:
df.shape

(10000, 11)

The data was checked for missing values, data types, duplicates, and unnecessary columns. Data with user ids and last names should not affect the classification model, so they were removed. The index column is already there, which means "RowNumber" is not needed. Gaps in the column with the duration of cooperation with the bank (in years) "Tenure" have been replaced with 0. It should also be noted that the balance and estimated_salary columns have a large range and can greatly affect the model, so in the future we will use scaling.

## Problem exploration

In [86]:
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [87]:
df['exited'].value_counts(normalize=True)

0    0.7963
1    0.2037
Name: exited, dtype: float64

Classes are unbalanced, 20% of clients have left.

In [88]:
df_ohe = pd.get_dummies(df, drop_first=True)
df_ohe.head()

Unnamed: 0,creditscore,age,tenure,balance,num_of_products,has_credit_card,is_active_member,estimated_salary,exited,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


In [89]:
from sklearn.preprocessing import StandardScaler

features = df_ohe.drop(columns=['exited'])
target = df_ohe['exited']

features_train, features_valid, target_train, target_valid = train_test_split(
features, target, test_size=0.4, random_state=12345)

features_valid, features_test, target_valid, target_test = train_test_split(
features_valid, target_valid, test_size=0.5, random_state=12345)

scaler = StandardScaler()
scaler.fit(features_train[['balance', 'estimated_salary']])
features_train[['balance', 'estimated_salary']] = scaler.transform(features_train[['balance', 'estimated_salary']])
features_valid[['balance', 'estimated_salary']] = scaler.transform(features_valid[['balance', 'estimated_salary']])
features_test[['balance', 'estimated_salary']] = scaler.transform(features_test[['balance', 'estimated_salary']])

pd.options.mode.chained_assignment = None

In [90]:
logreg = LogisticRegression(random_state=12345, solver='liblinear', max_iter=1000)
logreg.fit(features_train, target_train)
predicted_valid = logreg.predict(features_valid)
predicted_prob = logreg.predict_proba(features_valid)[:, 1]

f1_score = f1_score(target_valid, predicted_valid)
auc_roc_score = roc_auc_score(target_valid, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.3316412859560068
auc_roc_score: 0.758238617460788


In [91]:
from sklearn import metrics

In [92]:
predicted_test = logreg.predict(features_test)
predicted_prob = logreg.predict_proba(features_test)[:, 1]

f1_score = metrics.f1_score(target_test, predicted_test)
auc_roc_score = roc_auc_score(target_test, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.2792321116928447
auc_roc_score: 0.7381103360811667


In [93]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
encoder.fit(df)
df_ord = encoder.transform(df)
df_ord = pd.DataFrame(df_ord, columns=df.columns)
df_ord.head()

Unnamed: 0,creditscore,geography,gender,age,tenure,balance,num_of_products,has_credit_card,is_active_member,estimated_salary,exited
0,228.0,0.0,0.0,24.0,2.0,0.0,0.0,1.0,1.0,5068.0,1.0
1,217.0,2.0,0.0,23.0,1.0,743.0,0.0,0.0,1.0,5639.0,0.0
2,111.0,0.0,0.0,24.0,8.0,5793.0,2.0,1.0,0.0,5707.0,1.0
3,308.0,0.0,0.0,21.0,1.0,0.0,1.0,0.0,0.0,4704.0,0.0
4,459.0,2.0,0.0,25.0,2.0,3696.0,0.0,1.0,1.0,3925.0,0.0


In [94]:
features = df_ord.drop(columns=['exited'])
target = df_ord['exited']

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

forest = None
max_depth = 0
n_est = 0
f1_best = 0
for est in range(10, 81, 10):
    for depth in range(3, 7):
        forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        forest.fit(features_train, target_train)
        predicted_valid = forest.predict(features_valid)
        result = metrics.f1_score(target_valid, predicted_valid)
    
        if result > f1_best:
            f1_best = result
            n_est = est
            max_depth = depth
            forest_best = forest

f1_score = metrics.f1_score(target_valid, predicted_valid)
predicted_prob = forest.predict_proba(features_valid)[:, 1]
auc_roc_score = roc_auc_score(target_valid, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.5222929936305732
auc_roc_score: 0.8539259470642792


In [95]:
print('max_depth', max_depth, '| number of estimators', n_est)

max_depth 6 | number of estimators 60


Without taking into account class imbalance, the logistic regression model scores poorly, but the random forest has an f1 score of 0.52. We will use the model in further research.

## Fighting imbalance

In [97]:
target_train.value_counts(normalize=True)

0.0    0.800667
1.0    0.199333
Name: exited, dtype: float64

Upsampling

In [98]:
from sklearn.utils import shuffle

features_zeros = features_train[target_train==0]
features_ones = features_train[target_train==1]
target_zeros = target_train[target_train==0]
target_ones = target_train[target_train==1]

features_upsampled = pd.concat([features_zeros] + [features_ones]*4)
target_upsampled = pd.concat([target_zeros] + [target_ones]*4)
features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)

In [99]:
f1_score = 0
n_estimators = 0
max_depth = 0
best_forest = None
for est in range(10, 91, 10):
    for depth in range(3, 10):
        forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        forest.fit(features_upsampled, target_upsampled)
        predicted_valid = forest.predict(features_valid)
        result = metrics.f1_score(target_valid, predicted_valid)
        if result > f1_score:
            f1_score = result
            n_estimators = est
            max_depth = depth
            best_forest = forest
        
predicted_prob = forest.predict_proba(features_valid)[:, 1]
auc_roc_score = roc_auc_score(target_valid, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.6208333333333333
auc_roc_score: 0.8503574906695539


In [100]:
print('max_depth', max_depth, '| number of estimators', n_est)

max_depth 8 | number of estimators 60


Balanced class weight

In [101]:
f1_score = 0
n_estimators = 0
max_depth = 0
best_forest = None
for est in range(10, 91, 10):
    for depth in range(3, 10):
        forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth, class_weight='balanced')
        forest.fit(features_train, target_train)
        predicted_valid = forest.predict(features_valid)
        result = metrics.f1_score(target_valid, predicted_valid)
        if result > f1_score:
            f1_score = result
            n_estimators = est
            max_depth = depth
            best_forest = forest
        
predicted_prob = forest.predict_proba(features_valid)[:, 1]
auc_roc_score = roc_auc_score(target_valid, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.6223132036847493
auc_roc_score: 0.8497556239754657


In [102]:
print('max_depth', max_depth, '| number of estimators', n_estimators)

max_depth 6 | number of estimators 20


In [103]:
features = df_ohe.drop(columns=['exited'])
target = df_ohe['exited']

features_train, features_valid, target_train, target_valid = train_test_split(
features, target, test_size=0.4, random_state=12345)

features_valid, features_test, target_valid, target_test = train_test_split(
features_valid, target_valid, test_size=0.5, random_state=12345)

scaler = StandardScaler()
scaler.fit(features_train[['balance', 'estimated_salary']])
features_train[['balance', 'estimated_salary']] = scaler.transform(features_train[['balance', 'estimated_salary']])
features_valid[['balance', 'estimated_salary']] = scaler.transform(features_valid[['balance', 'estimated_salary']])
features_test[['balance', 'estimated_salary']] = scaler.transform(features_test[['balance', 'estimated_salary']])


In [104]:
features_zeros = features_train[target_train==0]
features_ones = features_train[target_train==1]
target_zeros = target_train[target_train==0]
target_ones = target_train[target_train==1]

features_upsampled = pd.concat([features_zeros] + [features_ones]*4)
target_upsampled = pd.concat([target_zeros] + [target_ones]*4)
features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)

In [105]:
logreg = LogisticRegression(random_state=12345, solver='liblinear', max_iter=1000)
logreg.fit(features_upsampled, target_upsampled)
predicted_valid = logreg.predict(features_valid)
predicted_prob = logreg.predict_proba(features_valid)[:, 1]

f1_score = metrics.f1_score(target_valid, predicted_valid)
auc_roc_score = roc_auc_score(target_valid, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.4892703862660944
auc_roc_score: 0.7633393620817933


In [106]:
logreg = LogisticRegression(random_state=12345, solver='liblinear', max_iter=1000, class_weight='balanced')
logreg.fit(features_train, target_train)
predicted_valid = logreg.predict(features_valid)
predicted_prob = logreg.predict_proba(features_valid)[:, 1]

f1_score = metrics.f1_score(target_valid, predicted_valid)
auc_roc_score = roc_auc_score(target_valid, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.488013698630137
auc_roc_score: 0.7632365305863209


Several methods were used to combat imbalance: scaling, upsampling, and adding the class_weight='balanced' parameter. The model performed better than the previous one, its f1 score is 0.62 on the validation set.

## Model testing

In [107]:
features = df_ord.drop(columns=['exited'])
target = df_ord['exited']

features_train, features_valid, target_train, target_valid = train_test_split(
features, target, test_size=0.4, random_state=12345)

features_valid, features_test, target_valid, target_test = train_test_split(
features_valid, target_valid, test_size=0.5, random_state=12345)

scaler = StandardScaler()
scaler.fit(features_train[['balance', 'estimated_salary']])
features_train[['balance', 'estimated_salary']] = scaler.transform(features_train[['balance', 'estimated_salary']])
features_valid[['balance', 'estimated_salary']] = scaler.transform(features_valid[['balance', 'estimated_salary']])
features_test[['balance', 'estimated_salary']] = scaler.transform(features_test[['balance', 'estimated_salary']])

In [108]:
predicted_test = best_forest.predict(features_test)
predicted_prob = best_forest.predict_proba(features_test)[:, 1]

f1_score = metrics.f1_score(target_test, predicted_test)
auc_roc_score = roc_auc_score(target_test, predicted_prob)

print('f1_score:', f1_score)
print('auc_roc_score:', auc_roc_score)

f1_score: 0.601015228426396
auc_roc_score: 0.8522953328806078


The model performed well on the test sample as well. Judging by the roc_auc_score metric, the model will perform 0.85-0.5=0.35=35% better than a random model.

**Final conclusion**:

Data was provided about the bank's clients and their characteristics, such as client activity, account balance, availability of a credit card, etc. Many bank clients stop cooperating. Therefore, our task was to teach the model to identify loyal and leaving customers. Random forest and logistic regression models were trained with balanced and unbalanced classes and various hyperparameters. As a result, the best model turned out to be a random forest with parameters n_estimators=20, max_depth=6, class_weight='balanced'. The test sample score is 60%