# Project description

- Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones.
- We need to predict whether a customer will leave the bank soon. You have the data on clients’ past behavior and termination of contracts with the bank.
- Build a model with the maximum possible F1 score. To pass the project, you need an F1 score of at least 0.59. Check the F1 for the test set.
- Additionally, measure the AUC-ROC metric and compare it with the F1.

## Research question:
***Will a customer leave the bank soon?***

## Model type: CLASSIFICATION
Because we want to classify each customer as "leave" or "not leave".

## Potential models:
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression

## Model performance metrics:
- F1 score
- AUC-ROC 
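
The two metrics answer different questions: F1 is computed from hard 0/1 predictions, while AUC-ROC is computed from scores or probabilities. A minimal sketch on made-up labels and scores (the arrays and the 0.5 threshold are illustrative assumptions, not taken from the bank data):

```python
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical ground truth and predicted probabilities of the positive class
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.7, 0.4, 0.1, 0.9]

# F1 needs hard labels, so a threshold (0.5 here) has to be chosen first
y_pred = [int(s >= 0.5) for s in y_score]

f1 = f1_score(y_true, y_pred)
# AUC-ROC works on the raw scores and does not depend on a threshold
auc = roc_auc_score(y_true, y_score)
print(f"F1: {f1:.3f}  AUC-ROC: {auc:.3f}")
```

Changing the threshold changes F1 but leaves AUC-ROC untouched, which is why the two numbers are worth reporting side by side.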

# Step 1: Open the data file and study the general information

Load necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

Load data and look at basic info

In [2]:
churn = pd.read_csv('https://code.s3.yandex.net/datasets/Churn.csv')
churn.info()
churn.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


# Conclusions for Step 1
- There are 10000 observations, describing the info for each customer of Beta Bank.
- The identifier variables are RowNumber, CustomerId and Surname; they are not needed for model building
- The features that are crucial for model building include: 
    - CreditScore (integer): credit score
    - Geography (string): country of residence
    - Gender (string): gender
    - Age (integer): age in years
    - Tenure (float): period of maturation for a customer’s fixed deposit (years)
    - Balance (float): account balance
    - NumOfProducts (integer): number of banking products used by the customer
    - HasCrCard (integer): customer has a credit card
    - IsActiveMember (integer): customer’s activeness
    - EstimatedSalary (float): estimated salary
- The target is the **Exited** variable: whether the customer has left (1 = left, 0 = not left).
- Missing values occur only in the 'Tenure' variable
- We need to create dummy variables for categorical variables, "Geography" and "Gender"
### We need to preprocess the data in Step 2

# Step 2: Prepare the data for training the model
## Preprocess the data
Based on our initial investigation in Step 1, we need to take the following steps to prepare the data:
- Make all column names lowercase for faster typing
- Remove unnecessary variables (RowNumber, CustomerId and Surname)
- Investigate duplicated rows

## Prepare the features
- Create dummy variables for the categorical features using one-hot encoding
- Scale the numeric features 

## Deal with missing values
- Impute missing values in Tenure with the mode, because the variable takes only integer values.

## Resampling the data to fix the imbalance (will do together with Step 3)
- Investigate the balance of classes
- Train the model without taking into account the imbalance. Briefly describe your findings.
- Improve the quality of the model. Make sure you use at least two approaches to fixing class imbalance.

## 2.1 Preprocess the data

Make all column names lowercase for faster typing

In [3]:
churn.columns = churn.columns.str.lower()
churn.head()

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


Investigate if there are any duplicated rows

In [4]:
churn[churn.duplicated()]
#No duplicates

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited


Remove unnecessary variables (RowNumber, CustomerId and Surname)

In [5]:
churn_df = churn.iloc[:,3:]
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
creditscore        10000 non-null int64
geography          10000 non-null object
gender             10000 non-null object
age                10000 non-null int64
tenure             9091 non-null float64
balance            10000 non-null float64
numofproducts      10000 non-null int64
hascrcard          10000 non-null int64
isactivemember     10000 non-null int64
estimatedsalary    10000 non-null float64
exited             10000 non-null int64
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


## 2.2 Prepare the features

### Create dummy variables using one-hot encoding

Investigate potential categorical variables

In [6]:
churn_df['geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [7]:
churn_df['gender'].unique()

array(['Female', 'Male'], dtype=object)

In [8]:
churn_ohe = pd.get_dummies(churn_df, drop_first=True)
churn_ohe.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0
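
To see what `drop_first=True` does, here is a small sketch on a made-up three-row frame (the `toy` data is an assumption for illustration). The first category of each column, in sorted order, is dropped because it is implied when all remaining dummies are 0:

```python
import pandas as pd

# Toy frame with the same two categorical columns as the bank data
toy = pd.DataFrame({
    'geography': ['France', 'Spain', 'Germany'],
    'gender': ['Female', 'Male', 'Male'],
})

# 'France' and 'Female' are the first categories alphabetically, so they are dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per variable avoids perfectly collinear columns, which matters mainly for the logistic regression model.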


### Deal with missing values

In [9]:
churn_ohe.isna().mean()

creditscore          0.0000
age                  0.0000
tenure               0.0909
balance              0.0000
numofproducts        0.0000
hascrcard            0.0000
isactivemember       0.0000
estimatedsalary      0.0000
exited               0.0000
geography_Germany    0.0000
geography_Spain      0.0000
gender_Male          0.0000
dtype: float64

Missing values occur only in the **tenure** variable and account for about 9% of all observations. We are going to impute them with the mode, because tenure is inherently an integer variable.

In [10]:
churn_ohe['tenure'].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

In [11]:
#Note: this is a second name for the same DataFrame, not a copy, so churn_ohe is modified as well
churn_ohe_imputena = churn_ohe
churn_ohe_imputena['tenure'] = churn_ohe_imputena['tenure'].fillna(churn_ohe_imputena['tenure'].mode()[0])
churn_ohe_imputena['tenure'].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0.])

In [12]:
type(churn_ohe_imputena)

pandas.core.frame.DataFrame

### So now the new dataframe (whose missing values are imputed) is churn_ohe_imputena (10000 observations)

### Scale the numeric features
- Look at the range of the numeric variables first to see if there are any anomalies
- Scale them

In [13]:
churn_ohe_imputena.describe() #nothing abnormal

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,650.5288,38.9218,4.6343,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037,0.2509,0.2477,0.5457
std,96.653299,10.487806,2.989725,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769,0.433553,0.431698,0.497932
min,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0,0.0,0.0,0.0
25%,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0,0.0,0.0,0.0
50%,652.0,37.0,4.0,97198.54,1.0,1.0,1.0,100193.915,0.0,0.0,0.0,1.0
75%,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0,1.0,0.0,1.0
max,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0,1.0,1.0,1.0


There are no obvious anomalies in this dataset

In [14]:
#Save numeric variables 
churn_numeric = ['creditscore', 'age', 'tenure', 'balance', 'estimatedsalary']

#Save the scaler 
scaler = StandardScaler()

#Fit the scaler into the data and transform it
churn_ohe_imputena[churn_numeric] = scaler.fit_transform(churn_ohe_imputena[churn_numeric])

churn_ohe_imputena.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
0,-0.326221,0.293517,-0.881162,-1.225848,1,1,1,0.021886,1,0,0,0
1,-0.440036,0.198164,-1.215657,0.11735,1,0,1,0.216534,0,0,1,0
2,-1.536794,0.293517,1.125812,1.333053,3,1,0,0.240687,1,0,0,0
3,0.501521,0.007457,-1.215657,-1.225848,2,0,0,-0.108918,0,0,0,0
4,2.063884,0.388871,-0.881162,0.785728,1,1,1,-0.365276,0,0,1,0
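
The cell above fits the scaler on all rows before the data is split. A common alternative, sketched below on made-up data, fits the scaler on the training rows only and reuses those statistics for the held-out rows, so no information from validation or test data enters the preprocessing. The `toy` frame and all variable names here are assumptions for illustration, not part of this notebook's pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two numeric bank features (values are made up)
rng = np.random.default_rng(12)
toy = pd.DataFrame({
    'creditscore': rng.normal(650, 97, size=200),
    'age': rng.normal(39, 10, size=200),
})

toy_train, toy_holdout = train_test_split(toy, test_size=0.4, random_state=12)
toy_train, toy_holdout = toy_train.copy(), toy_holdout.copy()

numeric = ['creditscore', 'age']
toy_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
toy_train[numeric] = toy_scaler.fit_transform(toy_train[numeric])
# Apply the training statistics to the held-out rows
toy_holdout[numeric] = toy_scaler.transform(toy_holdout[numeric])

print(toy_train[numeric].mean().round(6).tolist())
```

The held-out columns will not have exactly zero mean after this step, which is expected: they are scaled with the training statistics.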


### Investigate the balance of classes


In [15]:
sum(churn_ohe_imputena['exited'] == 1)/len(churn_ohe_imputena['exited'])

0.2037

The positive class accounts for only about 20% of the dataset --> **IMBALANCE** --> we need to upsample the positive class or downsample the negative class

# Step 3: Build the model (without and with fixing target imbalance)
- Split the source data into a training set, a validation set, and a test set with the ratio 3:1:1
- Fit different models that are appropriate for classification problem, investigate the quality of different models by changing hyperparameters
- Try with the unbalanced data, the upsampled and downsampled data too

# 3.1 Split the source data into a training set, a validation set, and a test set with the ratio 3:1:1

Create the feature and target variables

In [16]:
feature = churn_ohe_imputena.drop('exited', axis=1)
target = churn_ohe_imputena['exited']

Split the source data into a training set, a validation set, and a test set with the ratio 3:1:1.
Use train_test_split twice

In [17]:
#First split the data into 2 sets: 60% for train and 40% for test (actually for test and validate)
feature_train, feature_test, target_train, target_test = train_test_split(feature, target, 
                                                                           test_size=0.4, random_state=12)

In [18]:
#Second split the above data into further 2 sets of test and validate, each has 50% of data
feature_val, feature_test, target_val, target_test = train_test_split(feature_test, target_test,
                                                                     test_size=0.5, random_state=12)

Test if each dataset has the correct proportion (60%, 20%, 20% of the total number of observations of the source data)

In [19]:
for data in [target_train, target_val, target_test]:
    print(round(len(data)/len(target), 2))

0.6
0.2
0.2
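
Because the classes are imbalanced, a plain random split can leave each subset with a slightly different share of positives. A hedged sketch of the same two-step 3:1:1 split with the `stratify` argument of `train_test_split`, on a made-up target with 20% positives (the `toy_X` and `toy_y` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 20% positive class
toy_X = np.arange(1000).reshape(-1, 1)
toy_y = np.array([1] * 200 + [0] * 800)

# First split: 60% train, 40% held out, keeping the class share in both parts
X_tr, X_rest, y_tr, y_rest = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=12, stratify=toy_y)
# Second split: halve the held-out part into validation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12, stratify=y_rest)

for name, part in [('train', y_tr), ('valid', y_val), ('test', y_te)]:
    print(name, len(part), round(part.mean(), 3))
```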


## A - Build the model without fixing target imbalance 

# 3.2 Fit different models that are appropriate for classification problem, investigate the quality of different models by changing hyperparameters

The three appropriate models for this classification problem are:
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression

To pass the project, you need an F1 score of at least 0.59

### Create the DecisionTreeClassifier model
- Create a loop to test different hyperparameters (max_depth): test from 1-49
- Then choose the best model based on F1 score of the validation set

In [20]:
for depth in range(1, 50):
    model1 = DecisionTreeClassifier(random_state=12, max_depth=depth)
    model1.fit(feature_train, target_train)
    prediction_train = model1.predict(feature_train)
    prediction_val = model1.predict(feature_val)
    #Print training and validation F1 side by side for each depth
    print('max_depth: {}'.format(depth),
          '| F1 score',
          '- Training: {}'.format(f1_score(target_train, prediction_train)),
          '- Validate: {}'.format(f1_score(target_val, prediction_val)))

max_depth: 1 | F1 score - Training: 0.0 - Validate: 0.0
max_depth: 2 | F1 score - Training: 0.5019493177387914 - Validate: 0.5354107648725214
max_depth: 3 | F1 score - Training: 0.3835616438356165 - Validate: 0.38113207547169814
max_depth: 4 | F1 score - Training: 0.5254054054054054 - Validate: 0.5318818040435458
max_depth: 5 | F1 score - Training: 0.5482456140350876 - Validate: 0.5339805825242718
max_depth: 6 | F1 score - Training: 0.5909568874868558 - Validate: 0.5727554179566563
max_depth: 7 | F1 score - Training: 0.6268806419257774 - Validate: 0.5775862068965517
max_depth: 8 | F1 score - Training: 0.6740172579098754 - Validate: 0.6008344923504868
max_depth: 9 | F1 score - Training: 0.7045565899069084 - Validate: 0.5580736543909348
max_depth: 10 | F1 score - Training: 0.7498795180722891 - Validate: 0.5594405594405594
max_depth: 11 | F1 score - Training: 0.807832422586521 - Validate: 0.5387647831800263
max_depth: 12 | F1 score - Training: 0.8531150522964985 - Validate: 0.5065963060686016
max_depth: 13 | F1 score - Training: 0.8972209969122188 - Validate: 0.5168831168831168
max_depth: 14 | F1 score - Training: 0.928695652173913 - Validate: 0.5083014048531289
max_depth: 15 | F1 score - Training: 0.9556313993174061 - Validate: 0.5030978934324659
max_depth: 16 | F1 score - Training: 0.9723521905572098 - Validate: 0.50187265917603
max_depth: 17 | F1 score - Training: 0.9852008456659619 - Validate: 0.4975369458128079
max_depth: 18 | F1 score - Training: 0.99070160608

In [21]:
#Investigate why F1 = 0 for max_depth = 1
model1 = DecisionTreeClassifier(random_state=12, max_depth=1)
model1.fit(feature_train, target_train)
prediction_val = model1.predict(feature_val)
pd.Series(prediction_val).unique() #array([0])

#With max_depth = 1 the model predicts only 0, so there are no positive predictions.
#F1 = 2 * (precision * recall) / (precision + recall)
#precision = TP/(TP+FP): TP = 0 and FP = 0, so precision is 0/0 (undefined);
#scikit-learn sets it to 0 and raises a warning.
#recall = TP/(TP+FN): TP = 0, so recall is 0.
#F1 would then also be 0/0, which scikit-learn reports as 0.

array([0])
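
The same effect can be reproduced on a handful of made-up labels. This sketch assumes scikit-learn 0.22 or later, where the metric functions accept a `zero_division` argument that sets the value of an undefined score and silences the warning:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: the model predicts the negative class for every row
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0]

# TP = 0 and FP = 0, so precision is 0/0; zero_division=0 makes it 0 without a warning
precision = precision_score(y_true, y_pred, zero_division=0)
# TP = 0 and FN = 2, so recall is 0/2 = 0
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(precision, recall, f1)
```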

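### Decision Tree model: max_depth = 8 gives a validation F1 score of 0.601, which meets the requirement (>= 0.59)
We also observed that from about max_depth = 7 onward, the gap between training and validation F1 widens as max_depth grows, which means the model overfits more and more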

### Create the RandomForestClassifier model
- Create a loop to iterate through different max_depth values  to find the optimal hyperparameters, set n_estimators=100
- Then choose the best model based on F1 score of the validation set

Let’s first fit a random forest with default parameters to get a baseline idea of the performance

In [22]:
rf = RandomForestClassifier(random_state=12)
rf.fit(feature_train, target_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=12, verbose=0,
                       warm_start=False)

In [23]:
target_pred = rf.predict(feature_val)

We will use the F1 score and AUC-ROC (area under the ROC curve) as evaluation metrics. The target is binary, so this is a binary classification problem, and both metrics are well suited to it.

In [24]:
f1 = f1_score(target_val, target_pred)
f1

0.5236593059936908

This F1 score does not meet the requirement, so we need to tune the model. Let's try different max_depth values with n_estimators=100 (the default from scikit-learn 0.22 on; the version used here defaults to 10, as the output above shows)

In [25]:
#Create a loop to iterate through different max_depth values
for n in range(1,21,1): 
    model2 = RandomForestClassifier(max_depth=n, n_estimators=100, random_state=12)
    model2.fit(feature_train,target_train)
    prediction_val = model2.predict(feature_val)
    f1 = f1_score(target_val,prediction_val)
    print(n, f1)

1 0.0
2 0.1050228310502283
3 0.1592920353982301
4 0.3488372093023256
5 0.4335154826958106
6 0.5109983079526227
7 0.5439739413680782
8 0.5603864734299516
9 0.5650793650793652
10 0.5830721003134797
11 0.5996908809891809
12 0.5823170731707317
13 0.5871559633027523
14 0.5966514459665144
15 0.5889387144992526
16 0.5964391691394659
17 0.5740181268882175
18 0.5744360902255639
19 0.5967503692762186
20 0.5924812030075188


### Random Forest model: max_depth = 11 with n_estimators=100 is the best combination, producing a validation F1 score of about 0.600
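
Rather than reading the best depth off the printed list, the search can return it directly. A minimal sketch on synthetic data from `make_classification`; the data, the helper name `best_depth`, and the small `n_estimators=20` are assumptions chosen to keep the example quick, not the settings used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank features
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.8, 0.2], random_state=12)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=12, stratify=y)

def best_depth(depths, n_estimators=20):
    """Return (depth, validation F1) for the depth with the highest validation F1."""
    scores = {}
    for depth in depths:
        model = RandomForestClassifier(
            max_depth=depth, n_estimators=n_estimators, random_state=12)
        model.fit(X_tr, y_tr)
        scores[depth] = f1_score(y_val, model.predict(X_val), zero_division=0)
    depth = max(scores, key=scores.get)
    return depth, scores[depth]

depth, score = best_depth(range(1, 8))
print(depth, round(score, 3))
```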

### Create the LogisticRegression model
We do not tune any hyperparameters for this model

In [26]:
model3 = LogisticRegression(random_state=12, solver = 'liblinear')
model3.fit(feature_train, target_train)
prediction_train = model3.predict(feature_train)
prediction_val = model3.predict(feature_val)

In [27]:
print('F1 score')
print('Training set: ', f1_score(target_train, prediction_train))
print('Validation set: ', f1_score(target_val, prediction_val))

F1 score
Training set:  0.2900569981000633
Validation set:  0.3037037037037037


### Logistic Regression model: F1 score is too low for both training and validation set. 

## Conclusion for the data whose imbalance has not been fixed:
- This data has many more 0s than 1s, so the models mostly predict 0 for new data points.
- Model performance:
    - The Decision Tree and Random Forest models with the parameters mentioned above produce validation F1 scores that meet the minimum requirement of this assignment, but only barely (about 0.60).
    - Logistic Regression, without any parameter tuning, did not reach the required F1 score.

## B - Build the model with fixing target imbalance 
Two approaches to fix the imbalance:
- Upsampling
- Downsampling

### B.1. Upsampling the positive class
Try a few ratios between positive and negative class
- about 1:1 (repeat = 4, roughly 50% positive)
- about 3:4 (repeat = 3, roughly 43% positive)

Create the function to upsample

In [28]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12)
    
    return features_upsampled, target_upsampled

Apply the upsample function for the training set

### Ratio between positive and negative class: **1:1**, repeat = 4

In [29]:
feature_train_upsampled, target_train_upsampled = upsample(feature_train, target_train, 4)

In [30]:
#Check the share of the positive class - should be ~50%
sum(target_train_upsampled == 1)/len(target_train_upsampled)

0.49764963961140707

Now we build the models, tune hyperparameters to see if there is any improvement in model performance

### Create the DecisionTreeClassifier model
- Create a loop to test different hyperparameters (max_depth): test from 1-49
- Then choose the best model based on F1 score of the validation set

In [31]:
for depth in range (1,50, 1):
    model1 = DecisionTreeClassifier(random_state=12, max_depth=depth)
    model1.fit(feature_train_upsampled, target_train_upsampled)
    prediction_train = model1.predict(feature_train_upsampled)
    prediction_val = model1.predict(feature_val)
    print('max_depth:', depth, '- F1 Score', f1_score(target_val, prediction_val))

max_depth: 1 - F1 Score 0.4837476099426386
max_depth: 2 - F1 Score 0.5116279069767442
max_depth: 3 - F1 Score 0.5283757338551859
max_depth: 4 - F1 Score 0.5307497893850042
max_depth: 5 - F1 Score 0.577490774907749
max_depth: 6 - F1 Score 0.5680365296803653
max_depth: 7 - F1 Score 0.5782442748091603
max_depth: 8 - F1 Score 0.5827686350435625
max_depth: 9 - F1 Score 0.5803842264914054
max_depth: 10 - F1 Score 0.572
max_depth: 11 - F1 Score 0.5683760683760684
max_depth: 12 - F1 Score 0.5569060773480662
max_depth: 13 - F1 Score 0.552026286966046
max_depth: 14 - F1 Score 0.5340782122905029
max_depth: 15 - F1 Score 0.5339265850945495
max_depth: 16 - F1 Score 0.5291375291375292
max_depth: 17 - F1 Score 0.5301204819277109
max_depth: 18 - F1 Score 0.5270758122743683
max_depth: 19 - F1 Score 0.5320197044334974
max_depth: 20 - F1 Score 0.5167701863354037
max_depth: 21 - F1 Score 0.5186104218362283
max_depth: 22 - F1 Score 0.4987468671679198
max_depth: 23 - F1 Score 0.506900878293601
max_depth: 24

### Comments:
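- The F1 score improved for max_depth = 1 and 3 compared to the unbalanced data (at max_depth = 2 it dropped slightly, from 0.535 to 0.512)
- In general, however, the F1 score did not improve compared to the unbalanced data
- No max_depth produces an F1 score that satisfies the requirement (the best is about 0.583 at max_depth = 8)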


### Create the RandomForestClassifier model
- Create a loop to find the optimal hyperparameters
- Then choose the best model based on F1 score of the validation set

Let’s first fit a random forest with default parameters to get a baseline idea of the performance

In [32]:
rf = RandomForestClassifier(random_state=12)
rf.fit(feature_train_upsampled, target_train_upsampled)
target_pred = rf.predict(feature_val)
f1 = f1_score(target_val, target_pred)
f1



0.5825242718446602

This F1 score, though improved compared to the unbalanced data, does not meet the requirement, so we need to tune the model. Let's try different max_depth values with n_estimators=100 (the default from scikit-learn 0.22 on; older versions default to 10)

In [33]:
#Create a loop to iterate through different max_depth values
for n in range(1,21,1): 
    model2 = RandomForestClassifier(max_depth=n, n_estimators=100, random_state=12)
    model2.fit(feature_train_upsampled, target_train_upsampled)
    prediction_val = model2.predict(feature_val)
    f1 = f1_score(target_val,prediction_val)
    print(n, f1)

1 0.524074074074074
2 0.5512572533849128
3 0.5912334352701325
4 0.5939393939393939
5 0.6043165467625898
6 0.6230529595015577
7 0.6251319957761352
8 0.64340239912759
9 0.6388583973655325
10 0.6468571428571428
11 0.6526806526806528
12 0.6577992744860944
13 0.6517412935323382
14 0.6480304955527318
15 0.6336375488917863
16 0.6380208333333333
17 0.6335078534031414
18 0.6265060240963854
19 0.6352624495289367
20 0.6374501992031872


### Comments:
- The F1 score of Random Forest on the upsampled data (1:1) improved a lot compared to the unbalanced data
- Random Forest with max_depth = 12 and n_estimators=100 is the best combination, producing an F1 score of about 0.658. Every other max_depth from 3 to 20 also satisfies the requirement

### Create the LogisticRegression model
We do not tune any hyperparameters for this model

In [34]:
model3 = LogisticRegression(random_state=12, solver = 'liblinear')
model3.fit(feature_train_upsampled, target_train_upsampled)
prediction_train = model3.predict(feature_train_upsampled)
prediction_val = model3.predict(feature_val)

In [35]:
print('F1 score')
print('Training set: ', f1_score(target_train_upsampled, prediction_train))
print('Validation set: ', f1_score(target_val, prediction_val))

F1 score
Training set:  0.7054329371816638
Validation set:  0.5215859030837003


### Comments:
- The F1 scores of the training and validation sets improved on the upsampled data compared to the unbalanced data
- However, the F1 score of the validation data still did not satisfy the requirement

### Ratio between positive and negative class: about **3:4** (roughly 43% positive), repeat = 3

In [36]:
feature_train_upsampled, target_train_upsampled = upsample(feature_train, target_train, 3)

In [37]:
#Check the share of the positive class - should be ~43% for repeat = 3
sum(target_train_upsampled == 1)/len(target_train_upsampled)

0.4262705798138869

Now we build the models, tune hyperparameters to see if there is any improvement in model performance

### Create the DecisionTreeClassifier model
- Create a loop to test different hyperparameters (max_depth): test from 1-49
- Then choose the best model based on F1 score of the validation set

In [38]:
for depth in range (1,50, 1):
    model1 = DecisionTreeClassifier(random_state=12, max_depth=depth)
    model1.fit(feature_train_upsampled, target_train_upsampled)
    prediction_train = model1.predict(feature_train_upsampled)
    prediction_val = model1.predict(feature_val)
    print('max_depth:', depth, '- F1 Score', f1_score(target_val, prediction_val))

max_depth: 1 - F1 Score 0.4837476099426386
max_depth: 2 - F1 Score 0.5116279069767442
max_depth: 3 - F1 Score 0.5598923283983849
max_depth: 4 - F1 Score 0.5625
max_depth: 5 - F1 Score 0.5974304068522485
max_depth: 6 - F1 Score 0.5984598459845983
max_depth: 7 - F1 Score 0.596153846153846
max_depth: 8 - F1 Score 0.5858170606372045
max_depth: 9 - F1 Score 0.586864406779661
max_depth: 10 - F1 Score 0.5816216216216216
max_depth: 11 - F1 Score 0.5711229946524062
max_depth: 12 - F1 Score 0.5570776255707763
max_depth: 13 - F1 Score 0.5509259259259259
max_depth: 14 - F1 Score 0.5360360360360361
max_depth: 15 - F1 Score 0.5393518518518519
max_depth: 16 - F1 Score 0.5300613496932515
max_depth: 17 - F1 Score 0.5291262135922329
max_depth: 18 - F1 Score 0.516209476309227
max_depth: 19 - F1 Score 0.5248756218905473
max_depth: 20 - F1 Score 0.5238095238095237
max_depth: 21 - F1 Score 0.517948717948718
max_depth: 22 - F1 Score 0.5252525252525252
max_depth: 23 - F1 Score 0.5126903553299492
max_depth: 24

### Comments:
- The F1 score of the Decision Tree on the upsampled data improved for max_depth = 1 and 3 compared to the unbalanced data (at max_depth = 2 it dropped slightly, from 0.535 to 0.512)
- The F1 score for the remaining max_depths did not improve much compared to the unbalanced data
- The max_depth values whose F1 score satisfies the requirement (>= 0.59) are 5, 6 and 7; the highest F1 score (about 0.598) is at max_depth = 6


### Create the RandomForestClassifier model
- Create a loop to find the optimal max_depth (with n_estimators fixed at 100).
- Then choose the best model based on F1 score of the validation set

Let’s first fit a random forest with default parameters to get a baseline idea of the performance

In [39]:
rf = RandomForestClassifier(random_state=12)
rf.fit(feature_train_upsampled, target_train_upsampled)
target_pred = rf.predict(feature_val)
f1 = f1_score(target_val, target_pred)
f1



0.5880758807588076

This F1 score (0.588) falls just short of the 0.59 requirement, so we will tune the model to see if we can improve it. Let's try different max_depth values with n_estimators=100 (the default from scikit-learn 0.22 on; older versions default to 10)

In [40]:
#Create a loop to iterate through different max_depth values
for n in range(1,21,1): 
    model2 = RandomForestClassifier(max_depth=n, n_estimators=100, random_state=12)
    model2.fit(feature_train_upsampled, target_train_upsampled)
    prediction_val = model2.predict(feature_val)
    f1 = f1_score(target_val,prediction_val)
    print(n, f1)

1 0.4193011647254576
2 0.5897771952817824
3 0.5788849347568208
4 0.5992865636147443
5 0.6124852767962308
6 0.6252983293556086
7 0.6442307692307693
8 0.6507177033492823
9 0.6514423076923077
10 0.6552147239263802
11 0.6465408805031447
12 0.6445012787723785
13 0.6433203631647212
14 0.6354166666666666
15 0.6321381142098274
16 0.6418109187749667
17 0.6255033557046981
18 0.6243243243243243
19 0.6253369272237197
20 0.6396761133603238


### Comments:
- The F1 score of Random Forest on the upsampled data (repeat = 3) improved a lot compared to the unbalanced data
- Random Forest with max_depth = 10 and n_estimators=100 is the best combination, producing an F1 score of about 0.655. Every other max_depth from 4 to 20 also satisfies the requirement (max_depth = 2 misses it narrowly at 0.5898)

### Create the LogisticRegression model
We do not tune the model's own hyperparameters; instead we vary the upsampling factor

### Create a function to iterate through different "repeat" for Logistic Regression model

In [41]:
def f1_class(feature, target, repeat):
    #Upsample the given training data for each repeat value, then fit and score
    #(feature_val and target_val are taken from the notebook scope)
    for i in repeat:
        feature_train_upsampled, target_train_upsampled = upsample(feature, target, i)
        model3 = LogisticRegression(random_state=12, solver = 'liblinear')
        model3.fit(feature_train_upsampled, target_train_upsampled)
        prediction_train = model3.predict(feature_train_upsampled)
        prediction_val = model3.predict(feature_val)
        print('Repeat = {}'.format(i),
              'F1 score',
              'Training: {}'.format(f1_score(target_train_upsampled, prediction_train)),
              'Validation: {}'.format(f1_score(target_val, prediction_val)), 
              '-----', sep='\n')

In [42]:
repeat = range(1,11, 1) 
f1_class(feature_train, target_train, repeat)

Repeat = 1
F1 score
Training: 0.2900569981000633
Validation: 0.3037037037037037
-----
Repeat = 2
F1 score
Training: 0.5180294565769425
Validation: 0.49122807017543857
-----
Repeat = 3
F1 score
Training: 0.6215102974828375
Validation: 0.5241090146750524
-----
Repeat = 4
F1 score
Training: 0.7054329371816638
Validation: 0.5215859030837003
-----
Repeat = 5
F1 score
Training: 0.7481215289121202
Validation: 0.500780031201248
-----
Repeat = 6
F1 score
Training: 0.7763297872340426
Validation: 0.4777385159010601
-----
Repeat = 7
F1 score
Training: 0.8030922637387261
Validation: 0.45905867182462934
-----
Repeat = 8
F1 score
Training: 0.8205526686416396
Validation: 0.44135429262394194
-----
Repeat = 9
F1 score
Training: 0.8343858744683592
Validation: 0.42956120092378747
-----
Repeat = 10
F1 score
Training: 0.8449181340747605
Validation: 0.4132231404958677
-----


### Comments:
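- In the logistic regression model, the F1 scores of both training and validation data improved on the upsampled data compared to the unbalanced data. The validation F1 peaks at repeat = 3 (about 0.524) and then declines, while the training F1 keeps rising
- However, the F1 score of the validation data still did not satisfy the requirement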
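
As a point of comparison, scikit-learn's `class_weight='balanced'` option reweights the classes inside the loss function instead of duplicating rows. It is not used in this notebook; the sketch below, on synthetic data from `make_classification`, only illustrates how the option is passed and how it changes the share of predicted positives. All names and data here are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 20% positives) standing in for the bank features
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.8, 0.2], random_state=12)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=12, stratify=y)

shares = {}
for weights in (None, 'balanced'):
    model = LogisticRegression(solver='liblinear', class_weight=weights,
                               random_state=12)
    model.fit(X_tr, y_tr)
    # Share of validation rows predicted as positive under each setting
    shares[weights] = model.predict(X_val).mean()
    print(weights, 'predicted positive share:', round(shares[weights], 3))
```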

### Conclusion for the Upsampling approach
- Generally, the F1 score improved
- Random Forest Classifier seems to be the best model, with the highest validation F1 score (about 0.66), which also satisfies the requirement of this assignment (>= 0.59)

Now we try the **downsampling** approach

### B.2 Downsampling the negative class

Create the function to downsample

In [43]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    
    features_downsampled = pd.concat([features_ones] + [features_zeros.sample(frac=fraction, random_state=12)])
    target_downsampled = pd.concat([target_ones] + [target_zeros.sample(frac=fraction, random_state=12)])
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12)
    return features_downsampled, target_downsampled

Apply the downsample function for the training set with different fractions
- Fraction = 0.2: the ratio between positive and negative class is roughly 1:1 (about 55% positive)
- Fraction = 0.3: the ratio between positive and negative class is roughly 5:6 (about 45% positive)

### Ratio between positive and negative class: approximately **1:1**, fraction = 0.2

In [44]:
feature_train_downsampled, target_train_downsampled = downsample(feature_train, target_train, 0.2)

In [45]:
#Check the share of the positive class (target == 1) - should be ~50%
sum(target_train_downsampled == 1)/len(target_train_downsampled)

0.5531816070599164

Now we build the models, tune hyperparameters to see if there is any improvement in model performance

### Create the DecisionTreeClassifier model
- Create a loop to test different hyperparameters (max_depth): test from 1-49
- Then choose the best model based on F1 score of the validation set

In [46]:
for depth in range (1,50, 1):
    model1 = DecisionTreeClassifier(random_state=12, max_depth=depth)
    model1.fit(feature_train_downsampled, target_train_downsampled)
    prediction_val = model1.predict(feature_val)
    print('max_depth:', depth, '- F1 Score', f1_score(target_val, prediction_val))

max_depth: 1 - F1 Score 0.4837476099426386
max_depth: 2 - F1 Score 0.5116279069767442
max_depth: 3 - F1 Score 0.5283757338551859
max_depth: 4 - F1 Score 0.5262267343485618
max_depth: 5 - F1 Score 0.5795768169273229
max_depth: 6 - F1 Score 0.557006092254134
max_depth: 7 - F1 Score 0.5473501303214596
max_depth: 8 - F1 Score 0.551293487957181
max_depth: 9 - F1 Score 0.5260416666666667
max_depth: 10 - F1 Score 0.5256849315068493
max_depth: 11 - F1 Score 0.5153061224489796
max_depth: 12 - F1 Score 0.5076923076923077
max_depth: 13 - F1 Score 0.4971287940935193
max_depth: 14 - F1 Score 0.4853896103896104
max_depth: 15 - F1 Score 0.4909688013136289
max_depth: 16 - F1 Score 0.48569092395748165
max_depth: 17 - F1 Score 0.4850444624090542
max_depth: 18 - F1 Score 0.4806451612903225
max_depth: 19 - F1 Score 0.47871485943775094
max_depth: 20 - F1 Score 0.4862236628849271
max_depth: 21 - F1 Score 0.4927302100161551
max_depth: 22 - F1 Score 0.4927302100161551
max_depth: 23 - F1 Score 0.49273021001615

### Comments:
- The F1 score at low max_depth (1-3) on the DOWNSAMPLED data improved compared to the unbalanced data
- However, in general the F1 score did not improve compared to the unbalanced data
- No max_depth produces an F1 score that satisfies the requirement (the best is 0.580 at max_depth = 5)
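Rather than reading the best depth off the printed list, the loop can track it directly. The following is a minimal sketch on synthetic data (the `make_classification` data set and the variable names `X_train`, `X_val`, `best_depth`, `best_f1` are stand-ins, not the notebook's own objects):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (assumption: ~20% positives)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=12)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12)

best_depth, best_f1 = None, -1.0
for depth in range(1, 21):
    model = DecisionTreeClassifier(random_state=12, max_depth=depth)
    model.fit(X_train, y_train)
    score = f1_score(y_val, model.predict(X_val))
    # Keep the depth with the highest validation F1 seen so far
    if score > best_f1:
        best_depth, best_f1 = depth, score

print('best max_depth:', best_depth, '- F1 Score', best_f1)
```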


### Create the RandomForestClassifier model
- Create a loop to iterate through different max_depth values to find the optimal hyperparameters (n_estimators is kept at 100).
- Then choose the best model based on F1 score of the validation set

Let’s first fit a random forest with default parameters to get a baseline idea of the performance

In [47]:
rf = RandomForestClassifier(random_state=12)
rf.fit(feature_train_downsampled, target_train_downsampled)
target_pred = rf.predict(feature_val)
f1 = f1_score(target_val, target_pred)
f1



0.569060773480663

This F1 score, though improved compared to the unbalanced data, did not satisfy the requirement, so we need to tune the model parameters. Let's try different max_depth values with n_estimators=100 (the default in scikit-learn 0.22 and later).

In [48]:
#Create a loop to iterate through different max_depth values
for n in range(1,21,1): 
    model2 = RandomForestClassifier(max_depth=n, n_estimators=100, random_state=12)
    model2.fit(feature_train_downsampled, target_train_downsampled)
    prediction_train = model2.predict(feature_train_downsampled)
    prediction_val = model2.predict(feature_val)
    f1_train = f1_score(target_train_downsampled, prediction_train)
    f1_val = f1_score(target_val,prediction_val)
    print(n, f1_train, f1_val)

1 0.7662863452337136 0.45271122320302637
2 0.7765273311897106 0.5252365930599369
3 0.7828220858895706 0.5469522240527183
4 0.7945544554455446 0.5579831932773109
5 0.8054977092877967 0.5674499564838991
6 0.8223435531289374 0.5749342681858018
7 0.8514934791754313 0.5821428571428572
8 0.8751576292559899 0.5814581458145814
9 0.9190096516995385 0.5818505338078291
10 0.950544844928751 0.5714285714285715
11 0.9785443836769037 0.5729632945389436
12 0.9920402178466694 0.5853227232537577
13 0.9974789915966387 0.574430823117338
14 0.9991596638655461 0.5855855855855856
15 1.0 0.578853046594982
16 1.0 0.5779735682819384
17 1.0 0.5772646536412078
18 1.0 0.5884444444444444
19 1.0 0.583111111111111
20 1.0 0.5841495992876224


### Comments:
- The F1 score of Random Forest on the DOWNSAMPLED data improved a lot compared to the unbalanced data, but it is worse than on the UPSAMPLED data
- When rounded to 2 decimals, models with max_depth = [12, 14, 18] returned an F1 score of ~0.59; none of them reaches 0.59 before rounding.
- There is also a large discrepancy between the training and validation sets starting right from max_depth = 1 --> a warning sign of overfitting
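The loop above varies only max_depth while n_estimators stays at 100. If both were to be searched together, one option is a small grid over the two values. This is a sketch on synthetic data with an illustrative grid (the data, the grid values, and the names `results` and `best` are assumptions made for the example):

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=12)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12)

results = []
for depth, n_est in product([4, 8, 12], [20, 50]):
    model = RandomForestClassifier(max_depth=depth, n_estimators=n_est,
                                   random_state=12)
    model.fit(X_train, y_train)
    f1_train = f1_score(y_train, model.predict(X_train))
    f1_val = f1_score(y_val, model.predict(X_val))
    results.append((depth, n_est, f1_train, f1_val))

# Pick the combination with the highest validation F1
best = max(results, key=lambda r: r[3])
print('max_depth, n_estimators, F1 train, F1 validation:', best)
```

Keeping the training F1 next to the validation F1 makes the train/validation gap visible for every combination.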

### Create the LogisticRegression model
No hyperparameters are tuned for this model

### Create a function to iterate through different "fraction" values for the Logistic Regression model

In [49]:
def f1_class(feature, target, fraction):
    # Fit a logistic regression for each downsampling fraction and report F1
    for i in fraction:
        # Use the function arguments rather than the global training set
        feature_train_downsampled, target_train_downsampled = downsample(feature, target, i)
        model3 = LogisticRegression(random_state=12, solver='liblinear')
        model3.fit(feature_train_downsampled, target_train_downsampled)
        prediction_train = model3.predict(feature_train_downsampled)
        prediction_val = model3.predict(feature_val)
        # The printed label 'Repeat' holds the downsampling fraction here
        print('Repeat = {}'.format(i),
              'F1 score',
              'Training: {}'.format(f1_score(target_train_downsampled, prediction_train)),
              'Validation: {}'.format(f1_score(target_val, prediction_val)), 
              '-----', sep='\n')


In [50]:
fraction = np.arange(0.1,0.9, 0.05)
f1_class(feature_train, target_train, fraction)

Repeat = 0.1
F1 score
Training: 0.8445475638051044
Validation: 0.41881298992161264
-----
Repeat = 0.15000000000000002
F1 score
Training: 0.79874213836478
Validation: 0.47137150466045274
-----
Repeat = 0.20000000000000004
F1 score
Training: 0.7422512234910277
Validation: 0.4964539007092199
-----
Repeat = 0.25000000000000006
F1 score
Training: 0.7039390088945362
Validation: 0.5288808664259929
-----
Repeat = 0.30000000000000004
F1 score
Training: 0.6515353805073432
Validation: 0.5290581162324649
-----
Repeat = 0.3500000000000001
F1 score
Training: 0.5959221501390176
Validation: 0.5269058295964126
-----
Repeat = 0.40000000000000013
F1 score
Training: 0.5717035611164581
Validation: 0.5095693779904307
-----
Repeat = 0.45000000000000007
F1 score
Training: 0.5400885391047712
Validation: 0.4961636828644502
-----
Repeat = 0.5000000000000001
F1 score
Training: 0.5111561866125761
Validation: 0.49392712550607293
-----
Repeat = 0.5500000000000002
F1 score
Training: 0.4789915966386555
Validation: 0.4

### Comments:
- With downsampling, the F1 scores on the training and validation data improved compared to the unbalanced data
- However, the validation F1 score still did not satisfy the requirement
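The fraction labels in the output above carry floating-point noise (for example 0.15000000000000002). One way to get clean values, sketched here under the assumption that the intended grid runs from 0.10 to 0.85 in steps of 0.05, is to build it with `np.linspace` and round to two decimals:

```python
import numpy as np

# 16 evenly spaced fractions from 0.10 to 0.85, rounded to two decimals
fractions = np.round(np.linspace(0.1, 0.85, 16), 2)
print([float(f) for f in fractions])
```

`np.linspace` takes the number of points explicitly, so the length of the grid does not depend on how the end point rounds.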

### Ratio between positive and negative class: **2:3**, fraction = 0.3


In [51]:
feature_train_downsampled, target_train_downsampled = downsample(feature_train, 
                                                                 target_train, 0.3)

In [52]:
#Check the share of the positive class (target == 1) in the downsampled training set
sum(target_train_downsampled == 1)/len(target_train_downsampled)

0.4521640091116173

Now we build the models, tune hyperparameters to see if there is any improvement in model performance

### Create the DecisionTreeClassifier model
- Create a loop to test different hyperparameters (max_depth): test from 1-49
- Then choose the best model based on F1 score of the validation set

In [53]:
for depth in range (1,50, 1):
    model1 = DecisionTreeClassifier(random_state=12, max_depth=depth)
    model1.fit(feature_train_downsampled, target_train_downsampled)
    prediction_val = model1.predict(feature_val)
    print('max_depth:', depth, '- F1 Score', f1_score(target_val, prediction_val))

max_depth: 1 - F1 Score 0.4837476099426386
max_depth: 2 - F1 Score 0.5116279069767442
max_depth: 3 - F1 Score 0.5116279069767442
max_depth: 4 - F1 Score 0.5625587958607714
max_depth: 5 - F1 Score 0.5830845771144278
max_depth: 6 - F1 Score 0.5856459330143541
max_depth: 7 - F1 Score 0.5916919959473151
max_depth: 8 - F1 Score 0.5586272640610105
max_depth: 9 - F1 Score 0.5693730729701952
max_depth: 10 - F1 Score 0.5515832482124617
max_depth: 11 - F1 Score 0.5443668993020937
max_depth: 12 - F1 Score 0.5248780487804878
max_depth: 13 - F1 Score 0.5330739299610895
max_depth: 14 - F1 Score 0.5228136882129277
max_depth: 15 - F1 Score 0.5210384959713519
max_depth: 16 - F1 Score 0.5163704396632367
max_depth: 17 - F1 Score 0.5231910946196661
max_depth: 18 - F1 Score 0.5185856754306437
max_depth: 19 - F1 Score 0.5238970588235294
max_depth: 20 - F1 Score 0.5181058495821727
max_depth: 21 - F1 Score 0.5089605734767025
max_depth: 22 - F1 Score 0.5054744525547444
max_depth: 23 - F1 Score 0.50582959641255

### Comments:
- The F1 score at low max_depth (1-3) on the DOWNSAMPLED data improved compared to the unbalanced data
- The best result is max_depth = 7 with F1 = 0.592, which just meets the 0.59 requirement; max_depth = 6 (0.586) reaches it only after rounding to two decimals
- Overall, the F1 score did not improve compared to the unbalanced data
### Therefore Decision Tree is not an optimal model

### Create the RandomForestClassifier model
- Create a loop to iterate through different max_depths of the model.
- Then choose the best model based on F1 score of the validation set.

Let’s first fit a random forest with default parameters to get a baseline idea of the performance

In [54]:
rf = RandomForestClassifier(random_state=12)
rf.fit(feature_train_downsampled, target_train_downsampled)
target_pred = rf.predict(feature_val)
f1 = f1_score(target_val, target_pred)
f1



0.6011049723756906

This F1 score satisfied the requirement. Let's try different max_depth values with n_estimators=100 to see if we can improve the score further.

In [55]:
#Create a loop to iterate through different max_depth values
for n in range(1,21,1): 
    model2 = RandomForestClassifier(max_depth=n, n_estimators=100, random_state=12)
    model2.fit(feature_train_downsampled, target_train_downsampled)
    prediction_train = model2.predict(feature_train_downsampled)
    prediction_val = model2.predict(feature_val)
    f1_train = f1_score(target_train_downsampled, prediction_train)
    f1_val = f1_score(target_val,prediction_val)
    print(n, f1_train, f1_val)

1 0.6547441629408842 0.5547263681592041
2 0.7038745387453873 0.5646794150731158
3 0.7130434782608696 0.5748898678414097
4 0.7234042553191489 0.5782092772384034
5 0.7403314917127073 0.59375
6 0.772501130710086 0.6070640176600443
7 0.7936936936936937 0.6335540838852097
8 0.8318425760286225 0.6301969365426696
9 0.8668758404303003 0.6338797814207651
10 0.9077328646748684 0.6295081967213114
11 0.9470486111111112 0.6237942122186495
12 0.9778346121057119 0.6244725738396626
13 0.9894291754756871 0.6223175965665235
14 0.997896508203618 0.6207627118644068
15 0.9987389659520807 0.6191489361702127
16 0.9995800083998321 0.634453781512605
17 0.9995800083998321 0.6226415094339621
18 0.9995800083998321 0.6222222222222222
19 0.9995800083998321 0.6264550264550264
20 0.9995800083998321 0.6292372881355933


### Comments:
- The F1 score of Random Forest on the DOWNSAMPLED data improved a lot compared to the unbalanced data
- Models with max_depth = [5, 6, 7, ..., 20] returned an F1 score that satisfies the requirement.

### Conclusion for the Downsampling approach
- Generally, F1 score improved in the downsampled data compared to the unbalanced data
- Random Forest Classifier seems to be the optimal model with the highest F1 score (also satisfied the requirement of this assignment, which is >= 0.59)

## Summary table of different models with different sampling approach and their F1 score

| Model               | Unbalanced/Upsampling/Downsampling | Optimal max_depth | Acceptable max_depth       | n_estimators | F1_Score      |
|---------------------|------------------------------------|-------------------|----------------------------|--------------|---------------|
| Decision Tree       | Unbalanced                         | 8                 | 8                          | NA           | 0.600         |
|                     | Upsampling: 4                      | Not satisfied     | Not satisfied              | NA           | Not satisfied |
|                     | Upsampling: 3                      | 6                 | 5, 6, 7, 8, 9              | NA           | 0.598         |
|                     | Downsampling: 0.2                  | Not satisfied     | Not satisfied              | NA           | Not satisfied |
|                     | Downsampling: 0.3                  | 7                 | 6, 7                       | NA           | 0.592         |
| Random Forest       | Unbalanced                         | 11                | 11, 13, 14, 15, 16, 19, 20 | 100          | 0.599         |
|                     | Upsampling: 4                      | 12                | 3, ..., 20                 | 100          | 0.658         |
|                     | Upsampling: 3                      | 10                | 2, 4, 5, ..., 20           | 100          | 0.655         |
|                     | Downsampling: 0.2                  | 12                | 12, 14, 18                 | 100          | 0.585         |
|                     | Downsampling: 0.3                  | 9                 | 5, 6, ..., 20              | 100          | 0.634         |
| Logistic Regression | Unbalanced                         | NA                | NA                         | NA           | Not satisfied |
|                     | Upsampling: 4                      | NA                | NA                         | NA           | Not satisfied |
|                     | Upsampling: 3                      | NA                | NA                         | NA           | Not satisfied |
|                     | Downsampling: 0.2                  | NA                | NA                         | NA           | Not satisfied |
|                     | Downsampling: 0.3                  | NA                | NA                         | NA           | Not satisfied |
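A table like this can also be assembled in code, which makes sorting and threshold checks less error-prone than manual editing. A minimal sketch using a few rows copied from the table above (the column names and the `meets_threshold` flag are choices made for this example):

```python
import pandas as pd

# A few rows copied from the summary table above
records = [
    {'model': 'Decision Tree', 'sampling': 'Unbalanced', 'max_depth': 8, 'f1': 0.600},
    {'model': 'Decision Tree', 'sampling': 'Upsampling: 3', 'max_depth': 6, 'f1': 0.598},
    {'model': 'Random Forest', 'sampling': 'Upsampling: 4', 'max_depth': 12, 'f1': 0.658},
    {'model': 'Random Forest', 'sampling': 'Upsampling: 3', 'max_depth': 10, 'f1': 0.655},
    {'model': 'Random Forest', 'sampling': 'Downsampling: 0.3', 'max_depth': 9, 'f1': 0.634},
]

# Sort by validation F1 and flag rows that meet the 0.59 requirement
summary = (pd.DataFrame(records)
           .sort_values('f1', ascending=False)
           .reset_index(drop=True))
summary['meets_threshold'] = summary['f1'] >= 0.59
print(summary)
```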

# Conclusion of Step 3:
- Random Forest reached a satisfactory F1 score for all approaches (for downsampling with fraction 0.2 only after rounding to two decimals: 0.585).
- Logistic Regression did not reach a satisfactory F1 score for any approach.
- Decision Tree reached a satisfactory F1 score for the unbalanced data, for upsampling with a positive-to-negative class ratio of 2:3 (repeat = 3), and for downsampling with a positive-to-negative class ratio of 2:3 (fraction = 0.3).
- Upsampling with a positive-to-negative class ratio of 2:3 returns a high enough F1 score and more acceptable max_depth values for both Decision Tree and Random Forest. The F1 score of Logistic Regression under this approach did not reach the threshold, but it is higher than under the other approaches. Therefore I will use this approach for all 3 models in the next step, together with their optimal parameters.

# Step 4: Check the quality of the model using the test set

### Compare the three models by comparing the F1 score on the test set

Save the selected models with their corresponding hyperparameters

In [56]:
model1 = DecisionTreeClassifier(random_state=12, max_depth=6)
model2 = RandomForestClassifier(random_state=12, max_depth=10, n_estimators=100)
model3 = LogisticRegression(random_state=12, solver = 'liblinear')                                    

Fit the 3 models on our upsampled training data

In [57]:
#Upsampling the training data
feature_train_upsampled_3, target_train_upsampled_3 = upsample(feature_train, target_train, 3)

In [58]:
model1.fit(feature_train_upsampled_3, target_train_upsampled_3)
model2.fit(feature_train_upsampled_3, target_train_upsampled_3)
model3.fit(feature_train_upsampled_3, target_train_upsampled_3)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Get predictions for the test set using 3 models

In [59]:
prediction_test_decisiontree = model1.predict(feature_test)

prediction_test_randomforest = model2.predict(feature_test)

prediction_test_logisticregression = model3.predict(feature_test)

Calculate F1 score using test set

In [60]:
print("Decision Tree Model", ": ", end="")
print(f1_score(target_test, prediction_test_decisiontree))
print("Random Forest Model", ": ", end="")
print(f1_score(target_test, prediction_test_randomforest))
print("Logistic Regression Model", ": ", end="")
print(f1_score(target_test, prediction_test_logisticregression))

Decision Tree Model : 0.5997952917093142
Random Forest Model : 0.6257521058965102
Logistic Regression Model : 0.5119170984455959
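For reference, F1 is the harmonic mean of precision and recall for the positive class. The toy example below (labels invented purely for illustration) computes it by hand from the counts of true positives, false positives and false negatives, and compares the result with `f1_score`:

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print('manual F1:', f1_manual)
print('sklearn F1:', f1_score(y_true, y_pred))
```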


Check for overfitting by comparing the F1 score between the train and test sets for each model

Get predictions using training set

In [61]:
prediction_train_decisiontree = model1.predict(feature_train_upsampled_3)

prediction_train_randomforest = model2.predict(feature_train_upsampled_3)

prediction_train_logisticregression = model3.predict(feature_train_upsampled_3)

Compare F1 score of training and test set to see if there are warnings of overfit

In [62]:
print('Decision Tree Model')
print('Training set:', f1_score(target_train_upsampled_3, prediction_train_decisiontree))
print('Test set:', f1_score(target_test, prediction_test_decisiontree))
      
print('Random Forest Model')
print('Training set:', f1_score(target_train_upsampled_3, prediction_train_randomforest))
print('Test set:', f1_score(target_test, prediction_test_randomforest))    
      
print('Logistic Regression Model')
print('Training set:', f1_score(target_train_upsampled_3, prediction_train_logisticregression))
print('Test set:', f1_score(target_test, prediction_test_logisticregression))

Decision Tree Model
Training set: 0.741978021978022
Test set: 0.5997952917093142
Random Forest Model
Training set: 0.8821226620269682
Test set: 0.6257521058965102
Logistic Regression Model
Training set: 0.6215102974828375
Test set: 0.5119170984455959
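One caveat for the comparison above: the training F1 is computed on the upsampled set, where positives are more frequent than in the test set. Repeating positive rows leaves recall unchanged but raises precision for the same predictions, so part of the train/test gap may come from the class balance rather than from overfitting. A minimal sketch on synthetic data, assuming the notebook's `upsample` repeats the positive rows (its definition is outside this section); `X_up`, `y_up` and the data itself are stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12)

# Hypothetical stand-in for upsample(): repeat the positive rows 3 times
repeat = 3
pos = y_train == 1
X_up = np.concatenate([X_train[~pos]] + [X_train[pos]] * repeat)
y_up = np.concatenate([y_train[~pos]] + [y_train[pos]] * repeat)

model = DecisionTreeClassifier(random_state=12, max_depth=6).fit(X_up, y_up)

f1_train_upsampled = f1_score(y_up, model.predict(X_up))
f1_train_original = f1_score(y_train, model.predict(X_train))
f1_test = f1_score(y_test, model.predict(X_test))

print('Training set (upsampled):', f1_train_upsampled)
print('Training set (original): ', f1_train_original)
print('Test set:                ', f1_test)
```

Comparing the test F1 with the F1 on the original (not upsampled) training set keeps the class balance the same on both sides of the comparison.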


### Calculate AUC-ROC metric and compare it with F1 score

In [None]:
#Decision Tree
probabilities_test_dt = model1.predict_proba(feature_test)
probabilities_one_test_dt = probabilities_test_dt[:, 1]
auc_roc_dt = roc_auc_score(target_test, probabilities_one_test_dt)

#Random Forest
probabilities_test_rf = model2.predict_proba(feature_test)
probabilities_one_test_rf = probabilities_test_rf[:, 1]
auc_roc_rf = roc_auc_score(target_test, probabilities_one_test_rf)

#Logistic Regression
probabilities_test_lr = model3.predict_proba(feature_test)
probabilities_one_test_lr = probabilities_test_lr[:, 1]
auc_roc_lr = roc_auc_score(target_test, probabilities_one_test_lr)

print("AUC - ROC", 
      "------------",
      "Decision Tree: {}".format(auc_roc_dt),
      "Random Forest: {}".format(auc_roc_rf), 
      "Logistic Regression: {}".format(auc_roc_lr), sep='\n')

# Conclusion: Best model: Random Forest Model
- Pro: Its F1 score and AUC-ROC are the highest, and its F1 score also satisfies the threshold
- Con: The discrepancy in F1 score between the training data and the test data might be a warning sign of overfitting. However, to confirm it we would need to check whether the error on the test data starts to increase while the error on the training data continues to decrease. This check is beyond the scope of this assignment.
- AUC-ROC is higher than the F1 score for all 3 models.
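When comparing the two metrics, it helps to keep in mind that they measure different things: AUC-ROC is computed from the predicted probabilities and does not depend on a decision threshold, while F1 is computed from hard predictions at one threshold (0.5 for `predict`). A small sketch on synthetic data, with illustrative thresholds and stand-in variable names, shows F1 changing with the threshold while AUC-ROC stays a single number:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12)

model = LogisticRegression(random_state=12, solver='liblinear')
model.fit(X_train, y_train)
proba_one = model.predict_proba(X_test)[:, 1]

# AUC-ROC uses the probabilities directly, no threshold involved
auc_roc = roc_auc_score(y_test, proba_one)

# F1 needs hard labels, so it depends on the chosen threshold
f1_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    prediction = (proba_one >= threshold).astype(int)
    f1_by_threshold[threshold] = f1_score(y_test, prediction)

print('AUC-ROC:', auc_roc)
print('F1 by threshold:', f1_by_threshold)
```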