# Customer churn

Clients have been leaving Beta Bank every month: not many, but enough to notice. The bank's marketers have calculated that retaining current clients is cheaper than attracting new ones.

We need to predict whether a client will leave the bank in the near future. We are given historical data on client behavior and on contract terminations.

Let's build a model with the highest possible *F1* score. To pass the project, the metric must reach at least 0.59. We check the *F1* score on the test sample ourselves.

Additionally, we measure *AUC-ROC* and compare it with the *F1* score.
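As a reminder of what the two metrics measure, here is a minimal sketch on invented toy labels (none of these numbers come from the bank data). *F1* is the harmonic mean of precision and recall computed from hard 0/1 predictions, while *AUC-ROC* is computed from the predicted probabilities and does not depend on a decision threshold.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy labels and probabilities, invented purely for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.9, 0.05]

# Hard predictions at the usual 0.5 threshold
y_pred = [int(p >= 0.5) for p in y_prob]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

# AUC-ROC uses the probabilities directly, not the thresholded predictions
auc = roc_auc_score(y_true, y_prob)

print('f1 (manual):', f1_manual)
print('f1 (sklearn):', f1)
print('auc_roc:', auc)
```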

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

# Description of data

- The data is located in the file /datasets/Churn.csv ("churn" means customer outflow).
- Features
- RowNumber - row index in the data
- CustomerId - unique client identifier
- Surname - surname
- CreditScore - credit score
- Geography - country of residence
- Gender - gender
- Age - age
- Tenure - how many years the person has been a bank client
- Balance - account balance
- NumOfProducts - number of bank products used by the client
- HasCrCard - whether the client has a credit card
- IsActiveMember - whether the client is an active member
- EstimatedSalary - estimated salary

<b>Target feature</b>

- Exited - whether the client has left the bank

## Data preparation

Import the necessary libraries and look at the data

In [1]:
import pandas as pd

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

df = pd.read_csv('/datasets/Churn.csv')

print(df.info())
display(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None


| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


Let's drop the rows with a missing value in the Tenure column

In [2]:
# Keep only rows where Tenure is present; .copy() gives an independent frame
df = df.dropna(subset=['Tenure']).copy()

Let's convert Tenure from float to integer

In [3]:
df['Tenure'] = df['Tenure'].astype(int)

We drop the RowNumber, CustomerId and Surname columns. They will not be useful for the model

In [4]:
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

In [5]:
df.head()

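| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |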


Let's one-hot encode the categorical columns using the get_dummies() function

In [6]:
df = pd.get_dummies(df, columns=['Geography', 'Gender'], drop_first=True)

In [7]:
display(df.head())

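| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 | 0 |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 0 |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 0 |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 | 0 |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 1 | 0 |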
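To make the effect of `drop_first=True` concrete, here is a small sketch on an invented four-row frame (the rows are made up; only the column names mirror the dataset). The first category of each column, in sorted order, is dropped, so a row with zeros in every `Geography_*` column stands for France.

```python
import pandas as pd

# Toy frame invented for illustration; it is not taken from Churn.csv
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
})

# drop_first=True removes one dummy per column to avoid redundant features:
# 'France' and 'Female' become the implicit baseline categories
encoded = pd.get_dummies(toy, columns=['Geography', 'Gender'], drop_first=True)

print(list(encoded.columns))
print(encoded.astype(int))
```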


<b>Conclusion</b>

- Handled the missing values
- Changed the data type of Tenure
- Encoded the categorical features using the OHE method
- Removed the columns that are not needed to build the model

## Examining the task

Let's look at the class balance of the target

In [8]:
df.Exited.value_counts(normalize=True)

0    0.796062
1    0.203938
Name: Exited, dtype: float64

The classes are imbalanced in favor of 0. The ratio is approximately 4:1
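A 4:1 ratio matters because a model that never predicts churn already looks good on accuracy. The sketch below illustrates this on a synthetic target of 80 zeros and 20 ones (an invented stand-in for the real data), using `DummyClassifier` as a constant baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic target with the same approximate 4:1 ratio (not the real data)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the feature values

# Baseline 1: always predict the majority class (0)
always_zero = DummyClassifier(strategy='most_frequent').fit(X, y)
pred_zero = always_zero.predict(X)

# Baseline 2: always predict churn (1)
always_one = DummyClassifier(strategy='constant', constant=1).fit(X, y)
pred_one = always_one.predict(X)

acc_zero = accuracy_score(y, pred_zero)
f1_zero = f1_score(y, pred_zero, zero_division=0)  # no positives predicted
f1_one = f1_score(y, pred_one)

print('always 0: accuracy', acc_zero, 'f1', f1_zero)
print('always 1: f1', f1_one)
```

On this toy target the "always 0" baseline reaches 0.8 accuracy with an F1 of 0, which is why F1 rather than accuracy is the working metric here.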

Let's separate the features from the target

In [9]:
df_features = df.drop(['Exited'], axis=1) # features
df_target = df['Exited'] # target

We split off the training sample (60%); the remaining 40% is held out for validation and testing

In [10]:
features_train, features_test_valid, \
target_train, target_test_valid = train_test_split(df_features, df_target, \
    test_size=0.4, random_state=12345)

Let's split the held-out part in half: one half becomes the test sample, the other the validation sample

In [11]:
features_test, features_valid, target_test, target_valid = train_test_split(features_test_valid, target_test_valid,\
    test_size=0.5,
    random_state=12345)
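The two-step 60/20/20 split can be sketched on synthetic data as follows. Note one deliberate difference: the sketch passes `stratify`, which the cells above do not use; it is shown only as an assumption-labeled option that keeps the share of class 1 the same in every sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the target (20% positives)
rng = np.random.RandomState(12345)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# Step 1: 60% training, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y)

# Step 2: split the held-out part in half -> 20% validation, 20% test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest)

for name, part in [('train', y_train), ('valid', y_valid), ('test', y_test)]:
    print(name, len(part), 'share of class 1:', part.mean())
```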

Let's look at the number of rows in each sample

In [12]:
print('Sample sizes')
print(df_target.shape, '- full dataset')
print(target_train.shape, '- training sample')
print(target_test.shape, '- test sample')
print(target_valid.shape, '- validation sample')

Sample sizes
(9091,) - full dataset
(5454,) - training sample
(1818,) - test sample
(1819,) - validation sample


The features are on different scales. We standardize the numeric ones using StandardScaler

In [13]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']

# Work on explicit copies so the assignments below do not raise SettingWithCopyWarning
features_train = features_train.copy()
features_valid = features_valid.copy()
features_test = features_test.copy()

# Fit the scaler on the training sample only, then apply it to all three samples
scaler = StandardScaler()
scaler.fit(features_train[numeric])

features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

display(features_train.head())

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9344 | 0.809075 | -1.039327 | -1.025995 | 0.554904 | 1 | 1 | 0 | 0.019508 | 0 | 0 | 0 |
| 3796 | -1.152518 | -1.227561 | 0.696524 | 0.480609 | 1 | 0 | 0 | 0.056167 | 0 | 0 | 1 |
| 7462 | -0.398853 | 0.090079 | 1.385532 | -1.23783 | 1 | 1 | 1 | 0.848738 | 0 | 0 | 1 |
| 1508 | -0.749875 | -0.286389 | 0.35202 | -1.23783 | 2 | 1 | 1 | -0.894953 | 0 | 0 | 1 |
| 4478 | -1.028628 | -0.756975 | -0.336987 | -1.23783 | 2 | 0 | 1 | -1.284516 | 0 | 0 | 1 |
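The scaler is fitted on the training sample only, so the validation and test samples are transformed with the training mean and standard deviation. A minimal sketch on two invented columns shows the consequence: the training data ends up with zero mean, while other samples generally do not.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented toy samples; only the column names mirror the dataset
train = pd.DataFrame({'Age': [20.0, 30.0, 40.0, 50.0],
                      'Balance': [0.0, 1000.0, 2000.0, 3000.0]})
valid = pd.DataFrame({'Age': [35.0, 60.0],
                      'Balance': [1500.0, 6000.0]})

# Learn mean and standard deviation from the training sample only
scaler = StandardScaler().fit(train)

train_scaled = pd.DataFrame(scaler.transform(train),
                            columns=train.columns, index=train.index)
valid_scaled = pd.DataFrame(scaler.transform(valid),
                            columns=valid.columns, index=valid.index)

print(train_scaled.mean())  # approximately 0 for every column
print(valid_scaled)         # scaled with the training statistics
```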


Let's find the best hyperparameters for our decision tree model

In [14]:
f1_best = 0
max_depth_best = 1
auc_roc_best = 0
caracter_best = []

for max_depth_i in range(2, 11):
    for caracter_i in ['gini', 'entropy']:
        model = DecisionTreeClassifier(random_state=12345, max_depth=max_depth_i, criterion=caracter_i)
        model.fit(features_train, target_train)
        prediction = model.predict(features_valid)
        f1 = f1_score(target_valid, prediction)
        if f1  > f1_best:
            f1_best = f1
            max_depth_best = max_depth_i
            caracter_best = caracter_i
        
        probabilities_valid = model.predict_proba(features_valid)
        probabilities_one_valid = probabilities_valid[:, 1]
        roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
        if roc_auc > auc_roc_best:
            auc_roc_best = roc_auc
            
        print('Depth', max_depth_i)
        print('Criterion', caracter_i)
        print('f1', f1)
        print('auc_roc', roc_auc)
        print('*' * 20)

print('Best model: depth -', max_depth_best, ', criterion -', caracter_best)
print('f1 - ', f1_best)
print('auc_roc - ', auc_roc_best)

Depth 2
Criterion gini
f1 0.509274873524452
auc_roc 0.7404758300534868
********************
Depth 2
Criterion entropy
f1 0.5072697899838449
auc_roc 0.7410978364656838
********************
Depth 3
Criterion gini
f1 0.41290322580645167
auc_roc 0.8032004704348029
********************
Depth 3
Criterion entropy
f1 0.4127659574468085
auc_roc 0.7982176265555013
********************
Depth 4
Criterion gini
f1 0.5261261261261261
auc_roc 0.8231803643928984
********************
Depth 4
Criterion entropy
f1 0.5231316725978647
auc_roc 0.8234307195478856
********************
Depth 5
Criterion gini
f1 0.48192771084337344
auc_roc 0.8270928914661885
********************
Depth 5
Criterion entropy
f1 0.48412698412698413
auc_roc 0.8397736711769409
********************
Depth 6
Criterion gini
f1 0.5284403669724772
auc_roc 0.8321931499724416
********************
Depth 6
Criterion entropy
f1 0.5350089766606821
auc_roc 0.8301049162765783
********************
Depth 7
Criterion gini
f1 0.55348047538200

Let's find the best parameters for Random Forest

In [15]:
f1_best = 0
max_depth_best = 1
auc_roc_best = 0
n_estimators_best = []

for max_depth_i in range(2, 11, 1):
    for n_estimators_i in range(10, 101, 10):
        model = RandomForestClassifier(random_state=12345, max_depth=max_depth_i, n_estimators=n_estimators_i)
        model.fit(features_train, target_train)
        prediction = model.predict(features_valid)
        f1 = f1_score(target_valid, prediction)
        if f1  > f1_best:
            f1_best = f1
            max_depth_best = max_depth_i
            n_estimators_best = n_estimators_i
            
        probabilities_valid = model.predict_proba(features_valid)
        probabilities_one_valid = probabilities_valid[:, 1]
        roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
        if roc_auc > auc_roc_best:
            auc_roc_best = roc_auc
            
        print('Depth', max_depth_i)
        print('n_estimators', n_estimators_i)
        print('f1', f1)
        print('auc_roc', roc_auc)
        print('*' * 20)

print('Best model: depth -', max_depth_best, ', n_estimators -', n_estimators_best)
print('f1 - ', f1_best)
print('auc_roc - ', auc_roc_best)

Depth 2
n_estimators 10
f1 0.20408163265306126
auc_roc 0.8020544648610045
********************
Depth 2
n_estimators 20
f1 0.17142857142857143
auc_roc 0.802281531164365
********************
Depth 2
n_estimators 30
f1 0.11290322580645161
auc_roc 0.8069509459155236
********************
Depth 2
n_estimators 40
f1 0.15748031496062995
auc_roc 0.8210174122980664
********************
Depth 2
n_estimators 50
f1 0.1671018276762402
auc_roc 0.8170117298182693
********************
Depth 2
n_estimators 60
f1 0.13793103448275862
auc_roc 0.8159792573961511
********************
Depth 2
n_estimators 70
f1 0.13297872340425532
auc_roc 0.8170350186698961
********************
Depth 2
n_estimators 80
f1 0.13793103448275862
auc_roc 0.820135347042704
********************
Depth 2
n_estimators 90
f1 0.12299465240641712
auc_roc 0.8230940015681161
********************
Depth 2
n_estimators 100
f1 0.13793103448275862
auc_roc 0.8230115202186046
********************
Depth 3
n_estimators 10
f1 0.2
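The two search loops above share the same structure: try every combination of hyperparameters, score on the validation sample, keep the best. One possible way to avoid repeating the loop is a small helper such as the hypothetical `search_best_f1` below. It is a sketch, not the code used in this notebook, and it runs on synthetic data from `make_classification` rather than on the bank dataset.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def search_best_f1(model_cls, grid, X_train, y_train, X_valid, y_valid):
    """Hypothetical helper: fit one model per parameter combination
    and return the best validation F1 with its parameters."""
    best_f1, best_params = -1.0, None
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = model_cls(random_state=12345, **params)
        model.fit(X_train, y_train)
        score = f1_score(y_valid, model.predict(X_valid))
        if score > best_f1:
            best_f1, best_params = score, params
    return best_f1, best_params


# Synthetic imbalanced data (about 4:1) stands in for the bank dataset
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

best_f1, best_params = search_best_f1(
    DecisionTreeClassifier,
    {'max_depth': range(2, 6), 'criterion': ['gini', 'entropy']},
    X_train, y_train, X_valid, y_valid)

print(best_params, best_f1)
```

The same helper would accept `RandomForestClassifier` with a grid over `max_depth` and `n_estimators`.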

<b>Conclusion</b>

The best f1 is 0.55 for the Decision Tree model and 0.58 for the Random Forest model

Neither model reaches the required value of 0.59.

The class ratio in the target is 4:1

Next we deal with the imbalance

## Fighting imbalance

Let's write a function that upsamples the minority class (1) by repeating its rows

In [16]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

Let's write a function that downsamples the majority class (0) by keeping a random fraction of its rows

In [17]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

Let's repeat the rows of the training sample with target 1 four times

In [18]:
features_train_upsampled, target_train_upsampled = upsample(features_train, target_train, 4)

In [19]:
target_train_upsampled.value_counts(normalize=True)

1    0.509964
0    0.490036
Name: Exited, dtype: float64

<b>Let's test the models</b>

Decision tree

In [20]:
f1_best = 0
max_depth_best = 1
auc_roc_best = 0
caracter_best = []

for max_depth_i in range(2, 11):
    for caracter_i in ['gini', 'entropy']:
        model = DecisionTreeClassifier(random_state=12345, max_depth=max_depth_i, criterion=caracter_i)
        model.fit(features_train_upsampled, target_train_upsampled)
        prediction = model.predict(features_valid)
        f1 = f1_score(target_valid, prediction)
        if f1  > f1_best:
            f1_best = f1
            max_depth_best = max_depth_i
            caracter_best = caracter_i
        
        probabilities_valid = model.predict_proba(features_valid)
        probabilities_one_valid = probabilities_valid[:, 1]
        roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
        if roc_auc > auc_roc_best:
            auc_roc_best = roc_auc
            
        print('Depth', max_depth_i)
        print('Criterion', caracter_i)
        print('f1', f1)
        print('auc_roc', roc_auc)
        print('*' * 20)

print('Best model: depth -', max_depth_best, ', criterion -', caracter_best)
print('f1 - ', f1_best)
print('auc_roc - ', auc_roc_best)

Depth 2
Criterion gini
f1 0.4983888292158968
auc_roc 0.7410978364656838
********************
Depth 2
Criterion entropy
f1 0.4983888292158968
auc_roc 0.7410978364656838
********************
Depth 3
Criterion gini
f1 0.5138248847926268
auc_roc 0.7938849297841124
********************
Depth 3
Criterion entropy
f1 0.4983888292158968
auc_roc 0.7989842179215476
********************
Depth 4
Criterion gini
f1 0.5146771037181996
auc_roc 0.8137542016969811
********************
Depth 4
Criterion entropy
f1 0.5468245425188375
auc_roc 0.8267969289767655
********************
Depth 5
Criterion gini
f1 0.5653631284916201
auc_roc 0.8374962155616106
********************
Depth 5
Criterion entropy
f1 0.5495867768595041
auc_roc 0.8413320834982961
********************
Depth 6
Criterion gini
f1 0.537864077669903
auc_roc 0.8373244602808636
********************
Depth 6
Criterion entropy
f1 0.5325779036827195
auc_roc 0.8344191760404294
********************
Depth 7
Criterion gini
f1 0.5384615384615384


Random forest

In [21]:
f1_best = 0
max_depth_best = 1
auc_roc_best = 0
n_estimators_best = []

for max_depth_i in range(1, 11, 1):
    for n_estimators_i in range(10, 101, 10):
        model = RandomForestClassifier(random_state=12345, max_depth=max_depth_i, n_estimators=n_estimators_i)
        model.fit(features_train_upsampled, target_train_upsampled)
        prediction = model.predict(features_valid)
        f1 = f1_score(target_valid, prediction)
        if f1  > f1_best:
            f1_best = f1
            max_depth_best = max_depth_i
            n_estimators_best = n_estimators_i
            
        probabilities_valid = model.predict_proba(features_valid)
        probabilities_one_valid = probabilities_valid[:, 1]
        roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
        if roc_auc > auc_roc_best:
            auc_roc_best = roc_auc
            
        print('Depth', max_depth_i)
        print('n_estimators', n_estimators_i)
        print('f1', f1)
        print('auc_roc', roc_auc)
        print('*' * 20)

print('Best model: depth -', max_depth_best, ', n_estimators -', n_estimators_best)
print('f1 - ', f1_best)
print('auc_roc - ', auc_roc_best)

Depth 1
n_estimators 10
f1 0.47602441150828245
auc_roc 0.7917976664570671
********************
Depth 1
n_estimators 20
f1 0.508646998982706
auc_roc 0.7959421116778065
********************
Depth 1
n_estimators 30
f1 0.4904214559386973
auc_roc 0.7916453185526756
********************
Depth 1
n_estimators 40
f1 0.5
auc_roc 0.806957738497248
********************
Depth 1
n_estimators 50
f1 0.4941634241245136
auc_roc 0.8043416241645125
********************
Depth 1
n_estimators 60
f1 0.49272550921435504
auc_roc 0.8079164628892149
********************
Depth 1
n_estimators 70
f1 0.4995064165844027
auc_roc 0.8045570460420597
********************
Depth 1
n_estimators 80
f1 0.501984126984127
auc_roc 0.8093108828803651
********************
Depth 1
n_estimators 90
f1 0.49609375
auc_roc 0.8075516042137295
********************
Depth 1
n_estimators 100
f1 0.49951219512195116
auc_roc 0.8106858954951598
********************
Depth 2
n_estimators 10
f1 0.5452631578947368
auc_roc 0.8224

Let's keep only a quarter of the rows of the training sample with target 0

In [22]:
features_train_downsample, target_train_downsample = downsample(features_train, target_train, 0.25)

In [23]:
target_train_downsample.value_counts(normalize=True)

1    0.509964
0    0.490036
Name: Exited, dtype: float64
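The class shares after downsampling match those after upsampling because both operations change the ratio of ones to zeros by the same factor of 4; only the sample size differs. A small sketch on an invented 80/20 target (a simplified inline version of the two functions above, without shuffling) illustrates this:

```python
import pandas as pd

# Invented toy target with a 4:1 ratio
target = pd.Series([0] * 80 + [1] * 20)

zeros = target[target == 0]
ones = target[target == 1]

# Upsampling: repeat the minority class 4 times
upsampled = pd.concat([zeros] + [ones] * 4)

# Downsampling: keep a random 25% of the majority class
downsampled = pd.concat([zeros.sample(frac=0.25, random_state=12345), ones])

share_up = upsampled.mean()      # share of class 1 after upsampling
share_down = downsampled.mean()  # share of class 1 after downsampling

print('upsampled:', len(upsampled), 'rows, share of 1 =', share_up)
print('downsampled:', len(downsampled), 'rows, share of 1 =', share_down)
```

The balance is the same, but the downsampled training set is four times smaller, so the model sees less data.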

<b>Let's test the models</b>

Decision tree

In [24]:
f1_best = 0
max_depth_best = 1
auc_roc_best = 0
caracter_best = []

for max_depth_i in range(2, 11):
    for caracter_i in ['gini', 'entropy']:
        model = DecisionTreeClassifier(random_state=12345, max_depth=max_depth_i, criterion=caracter_i)
        model.fit(features_train_downsample, target_train_downsample)
        prediction = model.predict(features_valid)
        f1 = f1_score(target_valid, prediction)
        if f1  > f1_best:
            f1_best = f1
            max_depth_best = max_depth_i
            caracter_best = caracter_i
        
        probabilities_valid = model.predict_proba(features_valid)
        probabilities_one_valid = probabilities_valid[:, 1]
        roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
        if roc_auc > auc_roc_best:
            auc_roc_best = roc_auc
            
        print('Depth', max_depth_i)
        print('Criterion', caracter_i)
        print('f1', f1)
        print('auc_roc', roc_auc)
        print('*' * 20)

print('Best model: depth -', max_depth_best, ', criterion -', caracter_best)
print('f1 - ', f1_best)
print('auc_roc - ', auc_roc_best)

Depth 2
Criterion gini
f1 0.4692737430167598
auc_roc 0.7374249904903858
********************
Depth 2
Criterion entropy
f1 0.4983888292158968
auc_roc 0.7410978364656838
********************
Depth 3
Criterion gini
f1 0.4692737430167598
auc_roc 0.7892213372458604
********************
Depth 3
Criterion entropy
f1 0.4983888292158968
auc_roc 0.7982176265555012
********************
Depth 4
Criterion gini
f1 0.5362776025236593
auc_roc 0.8197559328349519
********************
Depth 4
Criterion entropy
f1 0.5483870967741936
auc_roc 0.826211796579644
********************
Depth 5
Criterion gini
f1 0.5281306715063522
auc_roc 0.823915903956776
********************
Depth 5
Criterion entropy
f1 0.5405921680993314
auc_roc 0.839826071093101
********************
Depth 6
Criterion gini
f1 0.5485148514851486
auc_roc 0.8290210143071178
********************
Depth 6
Criterion entropy
f1 0.545273631840796
auc_roc 0.8382763920911058
********************
Depth 7
Criterion gini
f1 0.5476190476190476
auc

Random forest

In [25]:
f1_best = 0
max_depth_best = 1
auc_roc_best = 0
n_estimators_best = []

for max_depth_i in range(2, 11, 1):
    for n_estimators_i in range(10, 101, 10):
        model = RandomForestClassifier(random_state=12345, max_depth=max_depth_i, n_estimators=n_estimators_i)
        model.fit(features_train_downsample, target_train_downsample)
        prediction = model.predict(features_valid)
        f1 = f1_score(target_valid, prediction)
        if f1  > f1_best:
            f1_best = f1
            max_depth_best = max_depth_i
            n_estimators_best = n_estimators_i
            
        probabilities_valid = model.predict_proba(features_valid)
        probabilities_one_valid = probabilities_valid[:, 1]
        roc_auc = roc_auc_score(target_valid, probabilities_one_valid)
        if roc_auc > auc_roc_best:
            auc_roc_best = roc_auc
            
        print('Depth', max_depth_i)
        print('n_estimators', n_estimators_i)
        print('f1', f1)
        print('auc_roc', roc_auc)
        print('*' * 20)

print('Best model: depth -', max_depth_best, ', n_estimators -', n_estimators_best)
print('f1 - ', f1_best)
print('auc_roc - ', auc_roc_best)

Depth 2
n_estimators 10
f1 0.49789915966386555
auc_roc 0.7919014959205695
********************
Depth 2
n_estimators 20
f1 0.5275181723779855
auc_roc 0.8152524511516337
********************
Depth 2
n_estimators 30
f1 0.5305699481865285
auc_roc 0.8236781635964197
********************
Depth 2
n_estimators 40
f1 0.5389473684210527
auc_roc 0.8273228688760024
********************
Depth 2
n_estimators 50
f1 0.5403141361256545
auc_roc 0.8269599509381527
********************
Depth 2
n_estimators 60
f1 0.537344398340249
auc_roc 0.8238955262116026
********************
Depth 2
n_estimators 70
f1 0.5227963525835867
auc_roc 0.819406600060551
********************
Depth 2
n_estimators 80
f1 0.5295315682281059
auc_roc 0.822758253957164
********************
Depth 2
n_estimators 90
f1 0.5321100917431193
auc_roc 0.8243739180387681
********************
Depth 2
n_estimators 100
f1 0.5346938775510204
auc_roc 0.8243069625903413
********************
Depth 3
n_estimators 10
f1 0.5110663983

<b>Conclusion</b>

The Random Forest model gives the best f1 when the imbalance is corrected by upsampling the minority class.
- f1 - 0.599
- auc_roc - 0.86

Upsampling turned out to be the best method of dealing with the imbalance for our task

auc_roc almost always grows together with f1, but the two metrics do not peak at exactly the same parameters. The best f1 (0.599) is reached with max_depth - 9 and n_estimators - 90, where auc_roc is 0.8612. The best auc_roc (0.8619) is reached with max_depth - 8 and n_estimators - 100. The difference in auc_roc is negligible, approximately 0.0007
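One reason the two metrics can rank models slightly differently is that f1 is computed from hard predictions at a fixed decision threshold, while auc_roc summarizes the ranking of probabilities over all thresholds. The sketch below uses invented probabilities (not the output of our models) to show f1 changing with the threshold while auc_roc stays the same.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Invented labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.35, 0.4, 0.7, 0.9]

# AUC-ROC depends only on how the probabilities rank the objects
auc = roc_auc_score(y_true, y_prob)

# F1 depends on where the probabilities are cut into 0/1 predictions
f1_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = [int(p >= threshold) for p in y_prob]
    f1_by_threshold[threshold] = f1_score(y_true, y_pred)

print('auc_roc:', auc)
for threshold, score in f1_by_threshold.items():
    print('threshold', threshold, 'f1', score)
```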

For the final test we use the parameters that gave the best f1

## Model testing

Let's check the selected Random Forest model on the test sample

In [26]:
model = RandomForestClassifier(random_state=12345, max_depth=9, n_estimators=90)
model.fit(features_train_upsampled, target_train_upsampled)
prediction = model.predict(features_test)

f1 = f1_score(target_test, prediction)

probabilities_test = model.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
roc_auc = roc_auc_score(target_test, probabilities_one_test)

print('f1 -', f1)
print('roc_auc_score -', roc_auc)

f1 - 0.6254295532646048
roc_auc_score - 0.8691356024864844


## General conclusion

Working with the data on Beta Bank clients, we built a model that predicts customer churn.
For the current data, the Random Forest model combined with upsampling of the minority class works best

The data is imbalanced: the share of clients who left the bank is significantly smaller than the share of those who stayed.

After all the transformations, the resulting churn prediction model meets the requirement of the task (f1 of at least 0.59).

The best parameters for the Random Forest model turned out to be
- max_depth - 9
- n_estimators - 90

With these parameters we get on the test sample:
- f1 - 0.6254295532646048
- roc_auc_score - 0.8691356024864844