Project Description:

Beta Bank is gradually losing clients and wants a machine learning model that predicts whether a customer will leave or stay, since it is cheaper to retain a customer than to find a new one. We are given a dataset of clients' past behavior and contract terminations, which we split for model training, validation, and testing. A model is considered successful if it reaches an F1 score of at least 0.59 on the testing set. We are also asked to compare the F1 score with the AUC-ROC metric.


In [5]:
#importing libraries 
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle



In [6]:
#importing dataset 
try:
    df = pd.read_csv("/Users/sallyhuang/Downloads/Churn.csv") 
except FileNotFoundError:
    #fall back to the shared dataset location if the local file is missing
    df = pd.read_csv("/datasets/Churn.csv")

    

In [7]:
df.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


Looking at the first rows of the dataframe, some columns are unlikely to contribute to the prediction and should be dropped: the row number, the customer ID, and the surname. These are identifiers and should not affect whether a client decides to stay with Beta Bank.

In [8]:
#dropping the columns

df = df.drop(["RowNumber", "CustomerId", "Surname"], axis = 1)

df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


df now has 11 columns. 

In [9]:
df.duplicated().sum()

0

We checked for duplicate rows in df and found none. Lastly, we need to check for missing values before working with the data.

In [10]:
#checking for missing values 

df.isnull().sum()

CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

We checked for missing values in df and see that the Tenure column has 909 missing values. Tenure could be a factor in whether a client leaves, so we keep the column and fill the missing values with the mean Tenure instead of dropping those rows.

In [11]:
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].mean()) 

print(df.isnull().sum())

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64


We filled the missing values with the mean of the Tenure column so that no rows had to be dropped. Note that mean imputation keeps the column's average unchanged but reduces its variance.
Now that we have verified that no column has missing values, we can move on to transforming the categorical columns. 
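
As a small illustration of what the fill does, the sketch below applies it to a made-up Tenure-like column. The values are hypothetical and are not rows from Churn.csv. The median is shown only as an alternative we did not use in this notebook.

```python
import pandas as pd

# Hypothetical Tenure-like values with gaps (not taken from Churn.csv)
tenure = pd.Series([2.0, 1.0, 8.0, None, 4.0, None, 10.0])

# Mean fill, as done in the cell above
mean_filled = tenure.fillna(tenure.mean())

# Median fill, an alternative that was not used in this notebook
median_filled = tenure.fillna(tenure.median())

print("value used by mean fill:", tenure.mean())
print("value used by median fill:", tenure.median())
print("missing values after mean fill:", mean_filled.isnull().sum())
```

Both fills keep every row; they differ only in the value written into the gaps.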

In [12]:
#transform categorical columns -- Gender and Geography 

df = pd.get_dummies(df, columns = ['Geography', 'Gender'], drop_first = True)
df


Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.00000,0.00,1,1,1,101348.88,1,0,0,0
1,608,41,1.00000,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.00000,159660.80,3,1,0,113931.57,1,0,0,0
3,699,39,1.00000,0.00,2,0,0,93826.63,0,0,0,0
4,850,43,2.00000,125510.82,1,1,1,79084.10,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39,5.00000,0.00,2,1,0,96270.64,0,0,0,1
9996,516,35,10.00000,57369.61,1,1,1,101699.77,0,0,0,1
9997,709,36,7.00000,0.00,1,0,1,42085.58,1,0,0,0
9998,772,42,3.00000,75075.31,2,1,0,92888.52,1,1,0,1


We applied one-hot encoding (OHE) to the categorical columns Geography and Gender. The resulting dataframe has the new columns 'Geography_Germany', 'Geography_Spain', and 'Gender_Male'; because we set drop_first = True, Geography_France and Gender_Female are implied when those columns are 0. We used OHE because the models we train require numerical input and cannot work directly with string categories.
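
To make the effect of drop_first concrete, here is a minimal sketch on a hypothetical four-row frame that has the same two categorical columns. The rows are invented for illustration.

```python
import pandas as pd

# Hypothetical mini-frame with the same categorical columns as df
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
})

# drop_first=True removes the first category of each column
# (France and Female), which becomes the implied baseline
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)

print(list(encoded.columns))
print(encoded.astype(int))
```

The first row (France, Female) is encoded as all zeros, which is how the dropped categories remain recoverable.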

In [13]:
print(df['Exited'].value_counts())

0    7963
1    2037
Name: Exited, dtype: int64


We printed and examined the number of 0s and 1s in the Exited column. In the context of the project, this means 7,963 clients are still with Beta Bank and 2,037 clients have terminated their contract. There is a class imbalance, since there are far more 0s than 1s (roughly 80% versus 20%), and we will have to consider that when training the models. We can upsample the minority class, which is 1 (clients who have exited), by making more instances of it so that the numbers of '0' and '1' instances are more balanced. Alternatively, we can downsample, randomly discarding some instances of '0' so that the difference between the two classes is reduced. We should try both techniques on the models we build to see which one gives the best score.
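
The class counts above also give rough starting points for the resampling settings. The arithmetic below is a sketch based on the full-dataset counts; the training split will differ slightly, and the values actually used later in the notebook were picked by manual experimentation.

```python
# Class counts reported by value_counts() above
stayed, exited = 7963, 2037

# How many times larger the majority class is
ratio = stayed / exited

# Copies of the minority class needed to roughly match the majority class
suggested_repeat = round(ratio)

# Share of the majority class to keep so it roughly matches the minority class
suggested_fraction = round(exited / stayed, 2)

print("majority/minority ratio:", round(ratio, 2))
print("suggested repeat for upsampling:", suggested_repeat)
print("suggested fraction for downsampling:", suggested_fraction)
```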

In [14]:
#splitting dataset now for training set, validation set, and testing set 

features = df.drop(['Exited'], axis = 1) 
target = df['Exited']

features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size = 0.4, random_state = 12345 ) #splitting into training and validation set
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, target_valid, test_size = 0.5, random_state = 12345) #splitting into validation and testing sets



To get ready to build and train our models, df is split into training, validation, and testing sets: 60% of the data is allotted for training and the remaining 40% is split evenly into validation and testing sets (20% each).
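
With imbalanced classes, a random split can leave each subset with a slightly different share of positives. The sketch below shows the same 60/20/20 split with the stratify argument of train_test_split, using synthetic stand-in data. This is an option the notebook does not use, and switching to it would change the scores reported below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with an 80/20 class balance (not the bank data)
toy_features = pd.DataFrame({"x": range(1000)})
toy_target = pd.Series([0] * 800 + [1] * 200)

# 60% training, 40% held out, keeping the class proportions
f_train, f_rest, t_train, t_rest = train_test_split(
    toy_features, toy_target, test_size=0.4, random_state=12345,
    stratify=toy_target)

# Split the held-out 40% evenly into validation and testing sets
f_valid, f_test, t_valid, t_test = train_test_split(
    f_rest, t_rest, test_size=0.5, random_state=12345, stratify=t_rest)

for name, part in [("train", t_train), ("valid", t_valid), ("test", t_test)]:
    print(name, len(part), "share of class 1:", round(part.mean(), 2))
```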

In [39]:
#logistic regression model 

model = LogisticRegression(random_state=12345, 
                           solver='liblinear', 
                           class_weight = 'balanced')

model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

predicted_valid_proba = model.predict_proba(features_valid)[:, 1]
#using predict_proba for calculating probabilities of the positive class('1' in this case, clients leaving) for AUC-ROC


print('Validation set:')
print('F1:', f1_score(target_valid, predicted_valid))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_valid_proba)) 



Validation set:
F1: 0.49166666666666664
AUC-ROC: 0.7537231050272503


The first machine learning model we built is a Logistic Regression model. We set random_state for reproducibility and class_weight to 'balanced' to help mitigate the class imbalance. 
We trained the LR model on the training set and then used it to predict the target on the validation set. The F1 score is 0.49, which fails to meet our project threshold of 0.59, so right away we know this model is not good enough. 
A low F1 score of 0.49 means the model has low precision, low recall, or both: it is making too many false positive predictions (saying a client will leave when they actually stay) or too many false negative predictions (saying a client will stay when they actually leave). The AUC-ROC score of 0.75 is decent and means the model ranks clients who leave above clients who stay more often than not. Despite that, the F1 score tells us that at the default 0.5 decision threshold the model still makes many wrong predictions. 
We definitely need to improve the model, so we will next address the class imbalance with upsampling or downsampling.
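
To show how false positives and false negatives feed into F1, here is a minimal sketch on ten hypothetical predictions. The labels are invented and are not output from our model.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = client left, 0 = client stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]

# Here TP = 2, FN = 2, FP = 3, TN = 3
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print("precision:", round(precision, 2))
print("recall:", round(recall, 2))
print("F1:", round(f1, 2))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.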

In [40]:
#logistic model with upsampling 

def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)

    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

model_up = LogisticRegression(random_state=12345, 
                           solver='liblinear')

model_up.fit(features_upsampled, target_upsampled)
predicted_valid_up = model_up.predict(features_valid) #f1 calculation

predicted_valid_up_proba = model_up.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc

print('Validation set:')
print('F1:', f1_score(target_valid, predicted_valid_up))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_valid_up_proba))




Validation set:
F1: 0.4512489927477841
AUC-ROC: 0.7202726244412317


In the above code cell, we created an upsample function that takes the features and target, separates them into the majority class (0) and the minority class (1), and concatenates the majority class with 'repeat' copies of the minority class. The function then shuffles the upsampled data with random_state = 12345 for reproducibility and returns it. 
We call the function on the training set with a repeat value of 4 (chosen after manually trying different values) and save the result in the "features_upsampled" and "target_upsampled" variables, which are then used for model training. We train the LR model on the upsampled data with the fit method and use it to make predictions on the validation set. 
The resulting F1 and AUC-ROC scores are 0.45 and 0.72, respectively. This model still does not meet our criteria and performs worse than our previous LR model, which used class_weight = 'balanced' instead of upsampling. 
We will downsample the dataset next to see if there is any improvement. 
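
The effect of upsampling on the class counts is easier to see on a tiny example. The sketch below uses a condensed version of the same function on ten hypothetical rows. Only the training data is resampled; the validation and testing sets keep their original class balance.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    """Repeat the minority-class rows (target == 1) `repeat` times."""
    features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    # Shuffle both with the same seed so rows and labels stay aligned
    return shuffle(features_up, target_up, random_state=12345)

# Hypothetical data: 8 clients who stayed, 2 who left
toy_features = pd.DataFrame({"x": range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

up_features, up_target = upsample(toy_features, toy_target, 4)

print("before:", toy_target.value_counts().to_dict())
print("after: ", up_target.value_counts().to_dict())
```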

In [42]:
#logistic regression with downsampling 

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=12345)]+ [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])

    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)

    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.3)

model_down = LogisticRegression(random_state=12345, 
                           solver='liblinear')

model_down.fit(features_downsampled, target_downsampled)
predicted_valid_down = model_down.predict(features_valid) 
predicted_test_down = model_down.predict(features_test) 

predicted_valid_down_proba = model_down.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc

print('F1:', f1_score(target_valid, predicted_valid_down))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_valid_down_proba))


F1: 0.4601449275362319
AUC-ROC: 0.7152837846829463


In the above code we trained the LR model with the downsampling technique, hoping to balance the classes and get a better F1 score. The downsample function takes the features, the target, and a fraction parameter that specifies the fraction of the majority class to keep. Similar to upsampling, the function separates the features and target by class and randomly samples a fraction of the majority class. The data is then shuffled and returned. 
We call the function on the training set (features_train and target_train) with a fraction of 0.3 and save the result in "features_downsampled" and "target_downsampled". 
We train the LR model on the downsampled data and then use it to make predictions on the validation set. We get an F1 score of 0.46 and an AUC-ROC of 0.72. 
Alas, none of the Logistic Regression models obtained an F1 score close to our project threshold of 0.59. 
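
As with upsampling, a toy example shows what the fraction parameter does to the class counts. The sketch below is a condensed version of the same function applied to 100 hypothetical rows.

```python
import pandas as pd
from sklearn.utils import shuffle

def downsample(features, target, fraction):
    """Keep only `fraction` of the majority-class rows (target == 0)."""
    # The same random_state on same-length inputs picks the same row positions
    zeros_f = features[target == 0].sample(frac=fraction, random_state=12345)
    zeros_t = target[target == 0].sample(frac=fraction, random_state=12345)
    features_down = pd.concat([zeros_f, features[target == 1]])
    target_down = pd.concat([zeros_t, target[target == 1]])
    return shuffle(features_down, target_down, random_state=12345)

# Hypothetical data: 80 clients who stayed, 20 who left
toy_features = pd.DataFrame({"x": range(100)})
toy_target = pd.Series([0] * 80 + [1] * 20)

down_features, down_target = downsample(toy_features, toy_target, 0.3)

print("before:", toy_target.value_counts().to_dict())
print("after: ", down_target.value_counts().to_dict())
```

The trade-off is that downsampling discards real majority-class rows, so the model trains on less data.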

In [43]:
#randomforestclassifier 

model_RFC = RandomForestClassifier(random_state = 12345, 
                                   class_weight = 'balanced', 
                                   n_estimators = 100, 
                                   min_samples_split = 30,
                                   )
model_RFC.fit(features_train, target_train)


predicted_RFC_valid = model_RFC.predict(features_valid)
predicted_RFC_valid_proba = model_RFC.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc


print('Validation set:')
print('F1:', f1_score(target_valid, predicted_RFC_valid))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_RFC_valid_proba))



Validation set:
F1: 0.6245847176079734
AUC-ROC: 0.8496270846061252


We created a RandomForestClassifier model to predict the target. We set random_state for reproducibility, class_weight to 'balanced', min_samples_split to 30, and n_estimators to 100. The RFC model is trained on the training set with the fit method and used to make predictions on the validation set. We get an F1 score of 0.62 and an AUC-ROC of 0.85, which passes our threshold F1 score of 0.59 for this project! This means that the model is doing reasonably well. A perfect AUC-ROC score is 1, and our score of 0.85 indicates that the model is good at distinguishing between clients who will leave and clients who will stay: it tends to assign higher predicted probabilities to the clients who actually leave.
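
One reason the two metrics can disagree is that AUC-ROC depends only on how the predicted probabilities rank the clients, while F1 depends on the threshold used to turn probabilities into 0/1 predictions. The sketch below illustrates this with eight hypothetical probabilities; the 0.33 threshold is an arbitrary value chosen for the example, not a setting used in this notebook.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.90]

# AUC-ROC uses only the ranking of the scores, so it has a single value
auc = roc_auc_score(y_true, proba)

# F1 changes with the cut-off applied to the same scores
f1_by_threshold = {}
for threshold in (0.5, 0.33):
    y_pred = [int(p >= threshold) for p in proba]
    f1_by_threshold[threshold] = f1_score(y_true, y_pred)

print("AUC-ROC:", round(auc, 2))
for threshold, score in f1_by_threshold.items():
    print("threshold", threshold, "-> F1", round(score, 2))
```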

Although this model is good, we want to explore if upsampling/downsampling will improve it even more. 

In [44]:
#randomforestclassifier with upsampling 

def upsample_RFC(features, target, repeat):
    features_RFC_zeros = features[target == 0]
    features_RFC_ones = features[target == 1]
    target_RFC_zeros = target[target == 0]
    target_RFC_ones = target[target == 1]

    features_RFC_upsampled = pd.concat([features_RFC_zeros] + [features_RFC_ones] * repeat)
    target_RFC_upsampled = pd.concat([target_RFC_zeros] + [target_RFC_ones] * repeat)

    features_RFC_upsampled, target_RFC_upsampled = shuffle(features_RFC_upsampled, target_RFC_upsampled, random_state=12345)

    return features_RFC_upsampled, target_RFC_upsampled

features_RFC_upsampled, target_RFC_upsampled = upsample_RFC(features_train, target_train, 4)

model_RFC_up = RandomForestClassifier(random_state = 12345, 
                                   n_estimators = 100,
                                   min_samples_split = 30,
                                   )
model_RFC_up.fit(features_RFC_upsampled, target_RFC_upsampled)
predicted_RFC_valid_up = model_RFC_up.predict(features_valid)

predicted_RFC_valid_up_proba = model_RFC_up.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc


print('F1:', f1_score(target_valid, predicted_RFC_valid_up))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_RFC_valid_up_proba))

F1: 0.6258351893095768
AUC-ROC: 0.8479394382980783


Similar to what we did for the LR model, we created an upsample_RFC function that separates the data into the minority and majority classes, concatenates the majority class with repeated copies of the minority class, and shuffles the result so that the rows are not ordered by class. We called upsample_RFC on the training set with a repeat value of 4, as before, and made predictions on the validation set. 
We get an F1 score of 0.626 and an AUC-ROC of 0.848, which is essentially the same as our RFC model without upsampling but with class_weight set to 'balanced' (F1 of 0.625, AUC-ROC of 0.850). 
An F1 score of about 0.63 is the harmonic mean of precision and recall for the positive class (clients leaving); it is not the share of correct predictions. The AUC-ROC of 0.85 shows the model can still separate clients who left from those who have not. 
Overall, this is a decent model, but its F1 score is only marginally higher than that of the RFC without upsampling. 

In [45]:
#randomforestclassifier with downsampling 

def downsample_RFC(features, target, fraction):
    features_RFC_zeros = features[target == 0]
    features_RFC_ones = features[target == 1]
    target_RFC_zeros = target[target == 0]
    target_RFC_ones = target[target == 1]

    features_RFC_downsampled = pd.concat([features_RFC_zeros.sample(frac=fraction, random_state=12345)]+ [features_RFC_ones])
    target_RFC_downsampled = pd.concat([target_RFC_zeros.sample(frac=fraction, random_state=12345)] + [target_RFC_ones])

    features_RFC_downsampled, target_RFC_downsampled = shuffle(features_RFC_downsampled, target_RFC_downsampled, random_state=12345)

    return features_RFC_downsampled, target_RFC_downsampled

features_RFC_downsampled, target_RFC_downsampled = downsample_RFC(features_train, target_train, 0.3)

model_RFC_down = RandomForestClassifier(random_state = 12345, 
                                   n_estimators = 100,
                                   min_samples_split = 30,
                                   )

model_RFC_down.fit(features_RFC_downsampled, target_RFC_downsampled)
predicted_RFC_valid_down = model_RFC_down.predict(features_valid)

predicted_RFC_valid_down_proba = model_RFC_down.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc


print('F1:', f1_score(target_valid, predicted_RFC_valid_down))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_RFC_valid_down_proba))


F1: 0.6134969325153374
AUC-ROC: 0.8510243831622498


In the above code, we trained an RFC on downsampled features and target. The downsample_RFC function separates the data by class and then samples a fraction of the majority class ('0', clients who stay) to help balance the dataset. We initialized the model with the same hyperparameters, except class_weight, and trained it on the downsampled data. We make predictions on the validation set and print the F1 and AUC-ROC scores. 
We obtained an F1 score of 0.61 and an AUC-ROC of 0.85. The F1 score is lower than the ones we obtained with the RFC model and the RFC model with upsampling. It still passes the threshold of 0.59, which means its ability to identify clients who leave is reasonably good, but its predictions are slightly less accurate than those of the previous two. Its AUC-ROC is, ever so slightly, the highest of the three RFC models, but by F1 score it is not the best. 
Next we will create DecisionTree models to explore how they perform. 

In [30]:
#decisiontreeclassifier 

model_DTC = DecisionTreeClassifier(random_state = 12345, 
                                   class_weight = 'balanced', 
                                   min_samples_split = 100)

model_DTC.fit(features_train, target_train)
predicted_DTC_valid = model_DTC.predict(features_valid)

predicted_DTC_valid_proba = model_DTC.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc


print('F1:', f1_score(target_valid, predicted_DTC_valid))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_DTC_valid_proba))


F1: 0.5551330798479086
AUC-ROC: 0.8125654038555762


In the code above we created a DecisionTreeClassifier model with random_state = 12345 for reproducibility, class_weight set to 'balanced', and min_samples_split set to 100, since that value returned a higher F1 score when we manually tuned the hyperparameters. We trained the model on the training set and made predictions on the validation set. 
The F1 score for this model is 0.56 and the AUC-ROC is 0.81. The F1 score falls short of the threshold of 0.59, so this model does not pass. However, the model does well at distinguishing between those who will leave and those who will stay. 
We will create models with upsampling and downsampling to explore whether they improve the F1 score. 
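
The manual tuning mentioned above can also be written as a loop. The sketch below is one possible way to do it, not the procedure we actually followed: it uses synthetic data from make_classification as a stand-in for the bank features, and the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank features (assumption)
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Arbitrary candidate values for min_samples_split
candidates = [2, 10, 30, 100, 300]
scores = {}
for split in candidates:
    tree = DecisionTreeClassifier(random_state=12345, class_weight="balanced",
                                  min_samples_split=split)
    tree.fit(X_train, y_train)
    # Score each candidate on the validation set, never on the testing set
    scores[split] = f1_score(y_valid, tree.predict(X_valid))

best_split = max(scores, key=scores.get)
print("best min_samples_split:", best_split)
print("validation F1:", round(scores[best_split], 3))
```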

In [46]:
#decisiontreeclassifier with upsampling 

def upsample_DTC(features, target, repeat):
    features_DTC_zeros = features[target == 0]
    features_DTC_ones = features[target == 1]
    target_DTC_zeros = target[target == 0]
    target_DTC_ones = target[target == 1]

    features_DTC_upsampled = pd.concat([features_DTC_zeros] + [features_DTC_ones] * repeat)
    target_DTC_upsampled = pd.concat([target_DTC_zeros] + [target_DTC_ones] * repeat)

    features_DTC_upsampled, target_DTC_upsampled = shuffle(features_DTC_upsampled, target_DTC_upsampled, random_state=12345)

    return features_DTC_upsampled, target_DTC_upsampled

features_DTC_upsampled, target_DTC_upsampled = upsample_DTC(features_train, target_train, 4)


model_DTC_up = DecisionTreeClassifier(random_state = 12345, min_samples_split = 100)
model_DTC_up.fit(features_DTC_upsampled, target_DTC_upsampled)
predicted_DTC_valid_up = model_DTC_up.predict(features_valid)

predicted_DTC_valid_up_proba = model_DTC_up.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc

print('F1:', f1_score(target_valid, predicted_DTC_valid_up))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_DTC_valid_up_proba))



F1: 0.5473684210526316
AUC-ROC: 0.7858949969452996


We created an upsample_DTC function that upsamples the minority class (clients who left) by replicating it 'repeat' (4) times and then shuffles the data. The upsampled data is then used to train the DecisionTree model with the same hyperparameters, except class_weight. After training, we make predictions on the validation set and calculate the F1 and AUC-ROC metrics. 
We get an F1 score of 0.55 and an AUC-ROC of 0.79, which still falls short of our threshold.
We will create one last model to see if it produces an F1 score that passes the project threshold. 

In [47]:
#decision tree classifier with downsampling 

def downsample_DTC(features, target, fraction):
    features_DTC_zeros = features[target == 0]
    features_DTC_ones = features[target == 1]
    target_DTC_zeros = target[target == 0]
    target_DTC_ones = target[target == 1]

    features_DTC_downsampled = pd.concat([features_DTC_zeros.sample(frac=fraction, random_state=12345)]+ [features_DTC_ones])
    target_DTC_downsampled = pd.concat([target_DTC_zeros.sample(frac=fraction, random_state=12345)] + [target_DTC_ones])

    features_DTC_downsampled, target_DTC_downsampled = shuffle(features_DTC_downsampled, target_DTC_downsampled, random_state=12345)

    return features_DTC_downsampled, target_DTC_downsampled

features_DTC_downsampled, target_DTC_downsampled = downsample_DTC(features_train, target_train, .3)

model_DTC_down = DecisionTreeClassifier(random_state = 12345, min_samples_split = 100)
model_DTC_down.fit(features_DTC_downsampled, target_DTC_downsampled)

predicted_DTC_valid_down = model_DTC_down.predict(features_valid)
predicted_DTC_test_down = model_DTC_down.predict(features_test) #predictions on the testing set; not used for the validation metrics below

predicted_DTC_valid_down_proba = model_DTC_down.predict_proba(features_valid)[:, 1] #probs of 1 for auc-roc (validation set)


print('F1:', f1_score(target_valid, predicted_DTC_valid_down))
print('AUC-ROC:', roc_auc_score(target_valid, predicted_DTC_valid_down_proba))



F1: 0.5764023210831721
AUC-ROC: 0.812093588758703


We created the downsample_DTC function, which takes the features, the target, and a third argument (0.3 here) that sets the fraction of the majority class to keep. The function downsamples the data by randomly selecting that fraction of the majority class (clients who stay) and concatenating it with all of the minority class. The result is then shuffled so that the rows are not ordered by class. We call downsample_DTC on the training set and train the model on the downsampled data. The model is then used to make predictions on the validation set, and the F1 and AUC-ROC metrics are calculated to evaluate it. We get an F1 score of 0.58 and an AUC-ROC score of 0.81, which is close to, but still below, our criterion of an F1 score of at least 0.59.

DecisionTree models are not the best choice for this project, since the RandomForestClassifier models have F1 scores > 0.59 on the validation set; however, the DecisionTreeClassifier models did perform better than the Logistic Regression models in terms of F1 score. 

So, the last thing to do is perform final testing with the model that returned the best F1 score on the validation set: RandomForestClassifier with the upsampling technique. 
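
For reference, the validation F1 scores printed in the cells above can be collected in one place to confirm the selection. The numbers are copied from those outputs and rounded to four decimals; the short model names are labels made up for this summary.

```python
# Validation F1 scores copied from the cell outputs above (rounded)
valid_f1 = {
    "LR balanced": 0.4917,
    "LR upsampled": 0.4512,
    "LR downsampled": 0.4601,
    "RFC balanced": 0.6246,
    "RFC upsampled": 0.6258,
    "RFC downsampled": 0.6135,
    "DTC balanced": 0.5551,
    "DTC upsampled": 0.5474,
    "DTC downsampled": 0.5764,
}

threshold = 0.59  # project requirement for F1

best = max(valid_f1, key=valid_f1.get)
passing = [name for name, score in valid_f1.items() if score >= threshold]

print("best on validation:", best, valid_f1[best])
print("models at or above the threshold:", passing)
```

The gap between the two best RFC variants is about 0.001, so the choice between them is close.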

In [48]:
#Final testing 

predicted_RFC_test_up = model_RFC_up.predict(features_test)
predicted_RFC_test_up_proba = model_RFC_up.predict_proba(features_test)[:, 1] 

print("RFC model with upsampling: ", "F1 = ", round(f1_score(target_test, predicted_RFC_test_up), 2), " | ", 
      "AUC-ROC =", round(roc_auc_score(target_test, predicted_RFC_test_up_proba), 2))


RFC model with upsampling:  F1 =  0.62  |  AUC-ROC = 0.86


We printed the F1 and AUC-ROC scores on the testing set for the RandomForestClassifier model with upsampling and obtained F1 = 0.62 and AUC-ROC = 0.86. These scores indicate that the model performs reasonably well and exceeds our project's threshold of 0.59 F1 score on the testing set. The F1 score of 0.62 is the harmonic mean of precision and recall for the clients who leave, not an accuracy, and it indicates the model is fairly good at identifying them. Additionally, the AUC-ROC score of 0.86 is relatively high given that 1 is the perfect score; it suggests that the model is effective at differentiating the clients who will likely leave from those who will likely stay.

In conclusion, RandomForestClassifier with upsampling proves to be the best of all the models we built, and it is the model we recommend Beta Bank employ for this imbalanced dataset. It produced the highest validation F1 score and a good AUC-ROC, meaning the model is fairly good at predicting whether a client will leave while limiting false positives (predicting that a client will leave when they actually stay).
