![](https://i.ibb.co/yQ3D64X/face.jpg)

# Customer churn



*It is necessary to predict whether the client will leave the bank in the near future or not.
We have been provided with historical data on customer behavior and termination of agreements with the bank*


## General information about the data and preliminary inspection

### Importing required libraries

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score, roc_curve, precision_score, recall_score
from sklearn.metrics import precision_recall_curve, accuracy_score
from catboost import CatBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from catboost import Pool, cv
from sklearn.utils import shuffle
import warnings
warnings.filterwarnings("ignore")



### Exploring Data Set

In [None]:
df = pd.read_csv('../input/bank-customer-churn-modeling/Churn_Modelling.csv')
df.head()

Features: 

- `RowNumber` - the index of the row in the data
- `CustomerId` - unique customer identifier
- `Surname` - surname
- `CreditScore` - credit rating
- `Geography` - country of residence
- `Gender` - gender
- `Age` - age
- `Tenure` - how many years a person has been a client of the bank
- `Balance` - account balance
- `NumOfProducts` - the number of bank products used by the client
- `HasCrCard` - availability of a credit card
- `IsActiveMember` - client activity
- `EstimatedSalary` - estimated salary

Target column:

- `Exited` - the fact of the client's departure

For convenience, we will convert the column names to lowercase and then to snake_case

In [None]:
df.columns = df.columns.str.lower()
df.columns

In [None]:
df.columns = ['row_number', 'customer_id', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'num_of_products', 'has_crcard',
       'isactive_member', 'estimated_salary', 'exited']
df.columns
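The manual list above works for a fixed schema. As a minimal sketch, assuming the original CamelCase names from the CSV, the conversion could also be automated with a regular expression. The helper `to_snake_case` is hypothetical, and it produces `credit_score`, `has_cr_card` and `is_active_member`, which differ slightly from the hand-written names used in the rest of this notebook.

```python
import re

def to_snake_case(name):
    # hypothetical helper: insert "_" at every boundary between
    # a lowercase letter or digit and an uppercase letter, then lowercase
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# a few of the original column names from the CSV
original_columns = ['RowNumber', 'CustomerId', 'NumOfProducts', 'EstimatedSalary']
snake_columns = [to_snake_case(c) for c in original_columns]
print(snake_columns)
```

With a dataframe this would be applied as `df.columns = [to_snake_case(c) for c in df.columns]`, before any lowercasing.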

Let's look at general information about the data

In [None]:
print(df.shape)
df.info()

We have no NaN in our set


Let's take a look at the data types in our dataframe separately:

In [None]:
df.dtypes

We need `tenure` to be an integer

In [None]:
df.describe().T

At first glance, we do not observe anomalies or significant outliers in the data. For example, the minimum age is 18 and the maximum is 92, which is plausible.

Let's see the target column `exited`

In [None]:
df['exited'].value_counts().to_frame()

In [None]:
print('Percentage of positive marks: {:.2%}'.format(df['exited'].mean()))

There is a class imbalance. We will train the model on the initial data, then we will try to overcome the imbalance and train again. Let's see the results later.


In [None]:
df.hist(bins=50, figsize=(20,15), edgecolor='black', linewidth=2)
plt.show()

During feature engineering for our model, we can drop three columns, `customer_id`, `row_number` and `surname`, which carry no useful information in our case

In [None]:
df.duplicated().sum()

### Conclusion

Customer churn is the loss of customers, expressed in the absence of purchases or payments over a period of time. Churn rate is extremely important for companies with a subscription and transactional business model that means recurring payments to the company.

We've previewed our dataset:

- no duplicates found, no need to delete rows
- when preparing the features for analysis, remove the columns `customer_id`, `row_number` and `surname`


We can start preparing the features

## Research of task

We are faced with the task of classification - it is necessary to determine whether the client will leave in the near future or not. Thus, to achieve the goals of this task, I propose to use the algorithms of Logistic Regression, Random Forest and Catboost.

To evaluate the models, we will use the F1 score; we will consider a value above 0.59 good.

To evaluate the final model, we will also use the ROC curve and the area under it (`ROC-AUC`).

Because the classes are imbalanced, accuracy alone is not a suitable metric.

### Feature engineering

In [None]:
df.head()

Let's remove unnecessary features and form a new data set so as not to overwrite the original variable

In [None]:
data = df.drop(['row_number', 'customer_id', 'surname'], axis=1).copy()
data.head()

### One-hot Encoding

The categorical features `geography` and `gender` must be converted to numerical ones using One-Hot Encoding (OHE). The models we use need numerical features as input

In [None]:
data['geography'].value_counts()

In [None]:
data['gender'].value_counts()

In [None]:
# OHE of features
gender_ohe = pd.get_dummies(df["gender"], drop_first=True)
country_ohe = pd.get_dummies(df["geography"], drop_first=True)

# delete catfeatures
data.drop(["gender", "geography"], axis=1, inplace=True)

#concat new sets
df_ohe = pd.concat([data, gender_ohe, country_ohe], axis=1)

df_ohe.head()

In [None]:
df_ohe.info()

The columns have been encoded.

It is also necessary to standardize the features, since the scales of the quantitative values vary greatly. We will not apply standardization to the columns `tenure`, `num_of_products`, `has_crcard`, `isactive_member`, nor to the target or the encoded categorical columns

### Split data set

We have prepared the features. Now we will split the data into a training sample, a validation sample for selecting hyperparameters, and a test sample for the final check. We will not touch the test sample until the end, when we evaluate the best model on it

In [None]:
def split_data(data, target_column):
    # return the feature matrix and the target vector separately
    return data.drop(columns=[target_column]), data[target_column]

In [None]:
features, target = split_data(df_ohe,'exited')

We set aside a validation sample of 20% and then split the remaining 80% again (25% of it, i.e. 20% of the whole) to obtain a test sample. We will train on the remaining 60% of the data

In [None]:
features_df, features_valid, target_df, target_valid = ( 
                                train_test_split(
                                features, target, test_size=0.20, random_state=42)
)


In [None]:
features_train, features_test, target_train, target_test = ( 
                                train_test_split(
                                features_df, target_df, test_size=0.25, random_state=42)
)
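The two splits above are purely random, so the share of churned clients can drift slightly between the three samples. As a sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions, not the bank data), passing `stratify` to `train_test_split` keeps the class share close to equal in every part while preserving the same 60/20/20 proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 1000 objects, 20% positive class
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# first split: 20% validation, stratified by the target
X_rest, X_valid_demo, y_rest, y_valid_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

# second split: 25% of the remainder (20% of the whole) becomes the test sample
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

for name, part in [('train', y_train_demo), ('valid', y_valid_demo), ('test', y_test_demo)]:
    print(name, len(part), 'positive share: {:.2%}'.format(part.mean()))
```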


In [None]:
print('Objects of train:', len(features_train))
print('Objects of valid:', len(features_valid))
print('Objects of test:', len(features_test))
print('Sum of objects:', len(features_train) + len(features_valid) + len(features_test))
print()
print('Objects of original set (check sum):', len(df_ohe))

The data has been split, so we can proceed to trial training of the models. In our task there is a class imbalance, which has a bad effect on training. We will evaluate the models by the F1 score, which is a good candidate for a formal metric of classifier quality: it combines two other fundamental metrics, `precision` and `recall`, into a single number
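F1 is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). The following sketch, on a small hand-made set of labels (the lists `y_true` and `y_pred` are illustrative assumptions), computes it by hand and compares the result with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# toy labels: 3 positives, 7 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# count true positives, false positives and false negatives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print('manual  F1:', f1_manual)
print('sklearn F1:', f1_score(y_true, y_pred))
```

Here accuracy is 0.8 even though one of the three positive objects is missed, which is why accuracy is less informative than F1 on imbalanced classes.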

### Scaling

Fitting the scaler on the entire dataset can leak information from the validation and test samples. The scaler must be fitted on the training sample only.

We will fit it on the training sample and then apply it to all three samples

In [None]:
numeric = ['creditscore', 'age', 'balance', 'estimated_salary']
scaler = StandardScaler()
scaler.fit(features_train[numeric])
pd.options.mode.chained_assignment = None
features_train[numeric] = scaler.transform(features_train[numeric])
features_train.head()

Let's apply the fitted scaler to the validation set

In [None]:
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_valid.head()

Let's apply the fitted scaler to the test set

In [None]:
features_test[numeric] = scaler.transform(features_test[numeric])
features_test.head()
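An alternative to fitting and applying the scaler by hand is to bundle it with the model, so that it is always fitted on whatever data the model is trained on. The sketch below is not the approach used in this notebook; it uses a small synthetic frame (`demo` and `demo_target` are assumptions) with scikit-learn's `ColumnTransformer` and `Pipeline`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a few of the bank features
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'age': rng.integers(18, 92, size=200),
    'balance': rng.uniform(0, 250000, size=200),
    'has_crcard': rng.integers(0, 2, size=200),
})
demo_target = rng.integers(0, 2, size=200)

# scale only the quantitative columns, pass the rest through unchanged
preprocess = ColumnTransformer(
    [('scale', StandardScaler(), ['age', 'balance'])],
    remainder='passthrough',
)
pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(solver='liblinear', random_state=42)),
])

# fit() learns the scaling statistics from the first 150 rows only
pipe.fit(demo.iloc[:150], demo_target[:150])
demo_pred = pipe.predict(demo.iloc[150:])
print(demo_pred[:10])
```

Because the scaler lives inside the pipeline, the same object can also be passed to `GridSearchCV` without leaking statistics between folds.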

### Trial training of models without considering class imbalance

#### Logistic regression

Let's start with basic logistic regression. We do not indicate the weight of the classes

In [None]:
model = LogisticRegression(random_state=42, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))


A rather poor result. Let's try setting `class_weight='balanced'`

In [None]:
model = LogisticRegression(random_state=42, solver='liblinear', class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))
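With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`, so the rare class gets a proportionally larger weight. A small sketch on a made-up target with an 80/20 split (the array `y_demo` is an assumption) shows the resulting weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy target: 80 negatives, 20 positives
y_demo = np.array([0] * 80 + [1] * 20)

# weights that scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_demo)

# the same formula written out by hand
manual = len(y_demo) / (2 * np.bincount(y_demo))

print('sklearn:', weights)
print('manual :', manual)
```

On this toy target the positive class is weighted four times as heavily as the negative class (2.5 against 0.625).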

Better now. At this stage, we will not select the hyperparameters, we will move on to the next algorithm

#### Random forest

In [None]:
model = RandomForestClassifier(random_state=42, n_estimators=10)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

The random forest copes better with the class imbalance. Similar to logistic regression, let's try setting the `class_weight` parameter

In [None]:
model = RandomForestClassifier(random_state=42, n_estimators=10, class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

The score has worsened. For now we will not change the hyperparameters either; we will return to this after we address the imbalance problem

#### Catboost

In [None]:
model = CatBoostClassifier(verbose=100, random_state=42)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

The F1 score is fairly high. Let's look at the results again after we select the hyperparameters and test the model on the test sample

### Conclusion

We are faced with the task of classification. In order to improve the forecasting results and facilitate the training of the model, we have transformed the data:

- removed unnecessary features: surname, customer id and row number
- carried out coding of categorical variables
- carried out scaling of quantitative variables
- divided the samples in a ratio of 60%: 20%: 20% - training, validation for the selection of hyperparameters and model verification, test - for the final model verification and evaluation

We tried to train the models on data with class imbalance. Now let's try to get rid of this problem and select the model hyperparameters.

## Dealing with imbalances and improving models

The classes are not equally represented in our problem. Let's look again:

In [None]:
df['exited'].value_counts().to_frame()

Let's try to solve this problem in three ways. We will choose the best one and use it to improve our model.

### Upsampling

To do this, let's use a function that performs the following transformations:

- divide the training sample into negative and positive objects
- copy positive objects several times
- build a new training sample from the resulting data
- shuffle the data 

In [None]:
def upsample(features, target, repeat):
    
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled = shuffle(features_upsampled, random_state=12345)
    target_upsampled = shuffle(target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

    
    
features_upsampled, target_upsampled = upsample(features_train, target_train, 5)

print(features_upsampled.shape)
print(target_upsampled.shape)
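The `upsample` function above repeats every positive object a fixed number of times. As an alternative sketch (not the method used in this notebook), `sklearn.utils.resample` can draw positive objects with replacement until the classes are exactly balanced. The frame `demo` below is a synthetic assumption.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# synthetic frame: 80 negatives, 20 positives
demo = pd.DataFrame({'x': np.arange(100), 'exited': [0] * 80 + [1] * 20})

majority = demo[demo['exited'] == 0]
minority = demo[demo['exited'] == 1]

# draw minority rows with replacement until they match the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=12345)

# combine and shuffle the rows
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=12345)
print(balanced['exited'].value_counts())
```

In either variant, only the training sample should be resampled; the validation and test samples keep their original class shares.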

#### Logistic regression

In [None]:
model = LogisticRegression(random_state=42, solver='liblinear')
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

We observe a slight increase in the metric, close to the one we got by specifying the `class_weight` parameter

#### Random Forest

In [None]:
model = RandomForestClassifier(random_state=42, n_estimators=10)
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

There is also an improvement here

#### Catboost

In [None]:
model = CatBoostClassifier(verbose=100, random_state=42)
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

Catboost again shows the best result on the validation set

Let's try another way: downsampling

### Downsampling

To do this, let's use a function that performs the following transformations:

- divide the training sample into negative and positive objects
- randomly discard some of the negative objects
- build a new training sample from the resulting data
- shuffle the data

In [None]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # use the `fraction` argument instead of a hard-coded value
    features_sample = features_zeros.sample(frac=fraction, random_state=12345)
    target_sample = target_zeros.sample(frac=fraction, random_state=12345)
    
    features_downsampled = pd.concat([features_sample] + [features_ones])
    target_downsampled = pd.concat([target_sample] + [target_ones])
    
    features_downsampled = shuffle(features_downsampled, random_state=12345)
    target_downsampled = shuffle(target_downsampled, random_state=12345)
    

    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

print(features_downsampled.shape)
print(target_downsampled.shape)


#### Logistic regression

In [None]:
model = LogisticRegression(random_state=42, solver='liblinear')
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

#### Random forest

In [None]:
model = RandomForestClassifier(random_state=42, n_estimators=10)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

#### Catboost

In [None]:
model = CatBoostClassifier(verbose=100, random_state=42)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))



`Downsampling` shows worse results than `upsampling` for all three algorithms.

Let's try changing the threshold and see what the metrics will be. This time we will also look at `recall` and `precision`

### Change threshold

For convenience, we will work with class probabilities rather than hard class labels (we have two classes, 0 and 1). The probability of class "1" is enough for us. By default, an object is assigned to class "1" when this probability exceeds 0.5. Let's try other thresholds, from 0 to 0.9 in steps of 0.05

#### Logistic regression

In [None]:
model = LogisticRegression(random_state=42, solver='liblinear')
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

for threshold in np.arange(0, 0.95, 0.05):
    predicted_valid = probabilities_one_valid > threshold
    precision = precision_score(target_valid, predicted_valid)
    recall = recall_score(target_valid, predicted_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("Threshold = {:.2f} | Precision = {:.3f}, Recall = {:.3f} | F1-score = {:.3f}".format(
        threshold, precision, recall, f1))

precision, recall, thresholds = precision_recall_curve(target_valid, probabilities_valid[:, 1])    
plt.figure(figsize=(10, 10))
plt.step(recall, precision, where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('PR curve')
plt.show() 

For a threshold of 0, recall is 1, because all answers are positive. At a threshold of 0.85, the model stops predicting the positive class correctly. The highest F1 value is observed with a threshold of 0.25

#### Random forest

In [None]:
model = RandomForestClassifier(random_state=42, n_estimators=10)
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

for threshold in np.arange(0, 0.95, 0.05):
    predicted_valid = probabilities_one_valid > threshold
    precision = precision_score(target_valid, predicted_valid)
    recall = recall_score(target_valid, predicted_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("Threshold = {:.2f} | Precision = {:.3f}, Recall = {:.3f} | F1-score = {:.3f}".format(
        threshold, precision, recall, f1))

precision, recall, thresholds = precision_recall_curve(target_valid, probabilities_valid[:, 1])    
plt.figure(figsize=(10, 10))
plt.step(recall, precision, where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('PR curve')
plt.show() 

The highest F1 is reached at a threshold of 0.2
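The loops above scan a fixed grid of thresholds. As a sketch on synthetic data (everything named `*_demo`, `clf` and `proba` here is an assumption, not the bank data), the F1-optimal threshold can also be read directly from the output of `precision_recall_curve`, which evaluates every distinct predicted probability as a candidate threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# synthetic imbalanced problem, roughly 80% / 20%
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.8, 0.2],
                                     random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

clf = LogisticRegression(solver='liblinear', random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, proba)

# the last precision/recall pair has no matching threshold, so drop it
p, r = precision[:-1], recall[:-1]
denom = p + r
# F1 for every candidate threshold; 0 where precision + recall is 0
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best = np.argmax(f1)
best_threshold = thresholds[best]
print('Best threshold = {:.3f}, F1 = {:.3f}'.format(best_threshold, f1[best]))
```

Note that `precision_recall_curve` treats a score equal to the threshold as positive (`>=`), while the loops above use a strict `>`. The threshold should be chosen on the validation sample, never on the test sample.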

Let's choose upsampling. We will train our models on the enlarged sample and select the hyperparameters. We will not change the threshold or downsample

### Train Models and Tuning Hyperparameters

We will train the models on the enlarged sample and evaluate them on the validation sample by the F1 score.

The parameters will be selected through `GridSearchCV`, which scores each combination by 5-fold cross-validation (`cv=5`) on the enlarged sample; manual loops over parameter values will not be used
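If the intent is to score each parameter combination on one fixed validation sample rather than on cross-validation folds, scikit-learn's `PredefinedSplit` can express that inside `GridSearchCV`. The following is a sketch on synthetic data (all `*_demo`, `X_tr`, `X_val` names are assumptions), not the procedure used in the cells of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# synthetic imbalanced data standing in for the train and validation samples
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.8, 0.2],
                                     random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# stack both parts; -1 marks rows used only for training, 0 marks the validation fold
X_all = np.vstack([X_tr, X_val])
y_all = np.concatenate([y_tr, y_val])
test_fold = np.concatenate([np.full(len(X_tr), -1),
                            np.zeros(len(X_val), dtype=int)])

search = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    {'C': [0.5, 1, 1.5], 'class_weight': [None, 'balanced']},
    cv=PredefinedSplit(test_fold),  # a single train/validation split
    scoring='f1',
)
# with the default refit=True, the best model is refitted on X_all afterwards
search.fit(X_all, y_all)
print(search.best_params_, search.best_score_)
```

With this setup, upsampling could be applied to the training rows only, so that copies of the same object never appear on both sides of the split.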

#### Logistic regression

In [None]:
par_grid_logist = {
                   'intercept_scaling': [0.5, 1.0, 1.5],
                   'class_weight': [None, 'balanced'],
                   'C': [0.5, 1, 1.5]
                   }
model = LogisticRegression(solver='liblinear',random_state=42)

grid_search = GridSearchCV(model, par_grid_logist, cv=5,
                           scoring='f1')
grid_search.fit(features_upsampled, target_upsampled)

In [None]:
grid_search.best_params_

Let's apply our parameters and see the result:

In [None]:
model_lreg = LogisticRegression(C=1.5, class_weight=None, intercept_scaling=0.5,
                                solver='liblinear', random_state=42
)
model_lreg.fit(features_upsampled, target_upsampled)
predicted_valid = model_lreg.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))


The F1 score is below the threshold of 0.59. Let's see how the model behaves during testing

In [None]:
probabilities_valid = model_lreg.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

fpr, tpr, thresholds = roc_curve(target_valid, probabilities_one_valid) 

plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, linestyle='-')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC-curve')

plt.show()

auc_roc = roc_auc_score (target_valid, probabilities_one_valid)

print("AUC:", auc_roc)

The AUC is greater than 0.5, so our model is better than random
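The comparison with 0.5 follows from what ROC-AUC measures: the probability that a randomly chosen positive object gets a higher score than a randomly chosen negative one, with ties counted as one half. A short sketch on made-up scores (the arrays `y_true` and `scores` are assumptions) checks this against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print('pairwise AUC:', pairwise_auc)
print('sklearn  AUC:', roc_auc_score(y_true, scores))
```

A model that scores objects at random ranks a positive above a negative about half of the time, which gives an AUC of about 0.5.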

#### Random forest

In [None]:
par_grid_ensemble = {'n_estimators': [3, 10, 30],
                     'criterion': ['gini', 'entropy'],
                     'min_samples_split': range(5, 15)
                    }
model = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(model, par_grid_ensemble, cv=5,
                           scoring='accuracy'
                          )
grid_search.fit(features_upsampled, target_upsampled)

In [None]:
grid_search.best_params_

In [None]:
model_rfc = RandomForestClassifier(random_state=42, criterion='gini', 
                               min_samples_split=5, n_estimators=30
                              )
model_rfc.fit(features_upsampled, target_upsampled)
predicted_valid = model_rfc.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

The F1 score is above the threshold of 0.59 on the validation set. Later we will try the test sample and see how the model behaves on unfamiliar data

In [None]:
probabilities_valid = model_rfc.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

fpr, tpr, thresholds = roc_curve(target_valid, probabilities_one_valid) 

plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, linestyle='-')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC-curve')
plt.show()

auc_roc = roc_auc_score (target_valid, probabilities_one_valid)

print("AUC:", auc_roc)

The AUC also tells us that the model is better than random and better than logistic regression.

#### Catboost (bonus)

Let's try to configure Catboost using cross-validation. We will build a basic model and check it by the F1 score, without correcting the class imbalance

In [None]:
model_cat = CatBoostClassifier(
                           custom_loss=['F1'],
                           random_seed=42,
                           logging_level='Silent'
)

In [None]:
model_cat.fit(
          features_train, target_train,
          eval_set=(features_valid, target_valid)

)

Take the model's parameters and cross-validate with the built-in `cv` function, passing the data as a `Pool`

In [None]:
cv_params = model_cat.get_params()
cv_params.update({
                 'loss_function': 'Logloss'
})
cv_data = cv(
             Pool(features_train, target_train),
             cv_params
)

In [None]:
print('F1-score: {}'.format(np.max(cv_data['test-F1-mean'])))

In [None]:
probabilities_valid = model_cat.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

fpr, tpr, thresholds = roc_curve(target_valid, probabilities_one_valid) 

plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, linestyle='-')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC-curve')
plt.show()

auc_roc = roc_auc_score (target_valid, probabilities_one_valid)

print("AUC:", auc_roc)

We got a very good result: the model is better than random. Let's see how it behaves on the test set

## Model testing and validation

In [None]:
# collect indicators in lists

table_of_model = []
table_of_prec = []
table_of_acc = []

### Testing Models

#### Logistic regression

In [None]:
predictions_test = model_lreg.predict(features_test)
test_f1 = f1_score(target_test, predictions_test)
test_acc = accuracy_score(target_test, predictions_test)

print("Accuracy")
print("Test set:", test_acc)
print("F1 score")
print("Test set:", test_f1)

table_of_acc.append(round(test_acc, 2))
table_of_prec.append(round(test_f1, 2))
table_of_model.append('LogisticRegression')

#### Random forest

In [None]:
predictions_test = model_rfc.predict(features_test)
test_f1 = f1_score(target_test, predictions_test)
test_acc = accuracy_score(target_test, predictions_test)

print("Accuracy")
print("Test set:", test_acc)
print("F1 score")
print("Test set:", test_f1)

table_of_acc.append(round(test_acc, 2))
table_of_prec.append(round(test_f1, 2))
table_of_model.append('RandomForestClassifier')

In [None]:
model_rfc.feature_importances_

In [None]:
features_test.columns

In [None]:
fi = pd.DataFrame({'name':features_test.columns,'fi':model_rfc.feature_importances_})
fi.sort_values('fi',ascending=False)
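The `feature_importances_` attribute of a random forest is impurity-based and is computed on the training data. As a complementary sketch (not part of the original analysis), `sklearn.inspection.permutation_importance` measures how much a chosen metric drops on held-out data when one feature is shuffled. The data and the model `forest` below are synthetic assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data: 5 features, of which 2 are informative
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=2, n_redundant=0,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

forest = RandomForestClassifier(n_estimators=30, random_state=42).fit(X_tr, y_tr)

# shuffle each feature 5 times on held-out data and record the drop in F1
result = permutation_importance(forest, X_te, y_te, scoring='f1',
                                n_repeats=5, random_state=42)

for idx in result.importances_mean.argsort()[::-1]:
    print('feature {}: {:.3f} +/- {:.3f}'.format(
        idx, result.importances_mean[idx], result.importances_std[idx]))
```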

#### Catboost

In [None]:
# store the CatBoost predictions under the name used by the metrics below
predictions_test = model_cat.predict(features_test)
test_f1 = f1_score(target_test, predictions_test)
test_acc = accuracy_score(target_test, predictions_test)

print("Accuracy")
print("Test set:", test_acc)
print("F1 score")
print("Test set:", test_f1)

table_of_acc.append(round(test_acc, 2))
table_of_prec.append(round(test_f1, 2))
table_of_model.append('Catboost')

In [None]:
model_cat.feature_importances_

In [None]:
fi_cat = pd.DataFrame({'name':features_test.columns,'fi_cat':model_cat.feature_importances_})
fi_cat.sort_values('fi_cat',ascending=False)

#### Conclusion


For convenience, we will display a table of the metrics for each model:

In [None]:
table_of_models = (pd.DataFrame({'Model':table_of_model, 'Accuracy':table_of_acc, 
                                'F1 score':table_of_prec}).sort_values(by='F1 score', ascending=False).
                  reset_index(drop=True))
table_of_models['Threshold of testing'] = (
                   table_of_models['F1 score'].apply(lambda x: 'good model' if x>0.59 else 'bad model')
)
table_of_models

The best result was obtained with the Random forest: 0.61. Catboost takes second place, and this is without correcting the imbalance problem! Logistic regression could not pass the F1 score threshold of 0.59

We also looked at which features are important for the classification models:
age, estimated salary, credit score, balance and number of products, with age in the lead

### Sanity test

#### Comparison with constant

Let's compare our models with a constant model: it predicts class "0" for any object

In [None]:
target_const = target*0
acc_const = accuracy_score(target, target_const)


print("Accuracy")
print("const:", acc_const)
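The same constant baseline can be expressed with scikit-learn's `DummyClassifier`, which also makes it easy to compute F1 for it. The sketch below uses a made-up target with 20% positives (the arrays `X_demo` and `y_demo` are assumptions) to show why accuracy alone flatters the constant model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# toy target: 80 negatives, 20 positives; the features are irrelevant here
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))

# a model that always answers "0"
const_model = DummyClassifier(strategy='constant', constant=0).fit(X_demo, y_demo)
const_pred = const_model.predict(X_demo)

const_acc = accuracy_score(y_demo, const_pred)
# zero_division=0 avoids a warning: there are no predicted positives
const_f1 = f1_score(y_demo, const_pred, zero_division=0)

print('Accuracy:', const_acc)
print('F1:', const_f1)
```

The constant model reaches an accuracy equal to the share of the majority class while its F1 is zero, because it never finds a single departing client.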


Random Forest and Catboost pass the sanity check: their accuracy is higher than that of the constant model. We also looked at ROC-AUC, where our models performed better than random.

## Conclusion

We were provided with historical data on customer behavior and termination of agreements with the bank. Based on this data, we formed features for training the model in order to predict customer churn. We have achieved the best results with a model based on the Random Forest algorithm - F1 measure - `0.61`.

Based on the analysis (using the best model as an example):

In [None]:
fi.sort_values('fi',ascending=False).reset_index(drop=True).head()

The most important features to look out for are:

 - client's age
 - credit score
 - estimated salary
 - account balance
 - number of products

To predict churn, you can use a model based on the Random Forest algorithm