# Contents

1. Project Description
2. Data Description
3. Project Instruction
   - 3.1 Open the dataset and describe the procedure
     - 3.1.1 Checking dataset information
     - 3.1.2 Data preparation
   - 3.2 Check the class imbalance and fit the models without correcting for it
     - 3.2.1 Check the imbalance of data classes
     - 3.2.2 Fitting the model without considering the class imbalance
   - 3.3 Improve model quality
     - 3.3.1 Class weight adjustment
     - 3.3.2 Upsampling approach
     - 3.3.3 Downsampling approach
4. Model Testing / Run the last test
5. Conclusion

## Project Description

Beta Bank is losing customers every month. Bank employees realized that it is cheaper to retain loyal existing customers than to attract new ones.

Goal:
- Predict whether a customer will leave the bank soon, by:
  - Developing a model with the highest possible F1 score.
  - Reaching an F1 score of at least 0.59 on the test set.
  - Calculating the AUC-ROC metric and comparing it with the F1 score.
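
As a reminder of what the target metric measures, the sketch below computes precision, recall, and F1 directly from confusion-matrix counts. The function name `f1_from_counts` and the counts passed to it are hypothetical and chosen only for illustration; they are not results from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        # No true positives at all: F1 is conventionally reported as 0
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only (not taken from the Churn data)
precision, recall, f1 = f1_from_counts(tp=60, fp=40, fn=40)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a stricter target than accuracy on imbalanced classes.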


## Data Description

- Data source file: /datasets/Churn.csv
- Features:
  - RowNumber: row index in the data
  - CustomerId: unique customer ID
  - Surname: customer surname
  - CreditScore: credit score
  - Geography: country of residence
  - Gender: gender
  - Age: age
  - Tenure: maturity period of the customer's fixed deposit (years)
  - Balance: account balance
  - NumOfProducts: number of bank products used by the customer
  - HasCrCard: whether the customer has a credit card
  - IsActiveMember: customer activity level
  - EstimatedSalary: estimated salary
- Target:
  - Exited: whether the customer has left the bank

## Project Instruction
### Open the dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle


#open the dataset
df = pd.read_csv('/datasets/Churn.csv')

#### checking dataset information

In [2]:
#check dataset
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [3]:
#check dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
#check Geography columns
df['Geography'].value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [5]:
#checking dataset volume
df.shape

(10000, 14)

In [6]:
#checking dataset missing value in %
df.isnull().sum() / df.shape[0]*100

RowNumber          0.00
CustomerId         0.00
Surname            0.00
CreditScore        0.00
Geography          0.00
Gender             0.00
Age                0.00
Tenure             9.09
Balance            0.00
NumOfProducts      0.00
HasCrCard          0.00
IsActiveMember     0.00
EstimatedSalary    0.00
Exited             0.00
dtype: float64

In [7]:
#describing the dataset
df.describe()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


1. Findings:
- There are 909 missing values in the Tenure column.

2. Insight:
- Several columns will be dropped because they are not used in further evaluation:
  - RowNumber
  - CustomerId
  - Surname
- Columns with the same characteristics can be grouped into numerical columns and categorical columns.

3. Summary/Recommendation:
- Missing values can be filled with the median of the 'Tenure' column.
- Columns with numerical values will be grouped into numeric_cols:
  - CreditScore
  - Age
  - Tenure
  - NumOfProducts
  - HasCrCard
  - IsActiveMember
  - EstimatedSalary
- Columns with categorical values will be grouped into categorical_cols:
  - Geography
  - Gender

#### Data Preparation

In [8]:
#check dataset
df.head(3)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1


In [9]:
# define columns category
drop_cols = ['RowNumber','CustomerId','Surname']
numeric_cols = ['CreditScore','Age','Tenure','NumOfProducts','HasCrCard','IsActiveMember','EstimatedSalary']
categorical_cols = ['Geography','Gender']

In [10]:
# drop unused columns
df = df.drop(columns = drop_cols)

In [11]:
# dataset after drop col
df.head(3)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1


In [12]:
#encoding variable features
df = pd.get_dummies(df, columns = categorical_cols)

In [13]:
# dataset after encoding
df.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,1,0,0,1,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,0,1,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,1,0,0,1,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,1,0,0,1,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,0,1,1,0


In [14]:
#fill the missing value
df['Tenure']=df['Tenure'].fillna(value=df['Tenure'].median())

In [15]:
#check missing value
df.isnull().sum()

CreditScore          0
Age                  0
Tenure               0
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
Geography_France     0
Geography_Germany    0
Geography_Spain      0
Gender_Female        0
Gender_Male          0
dtype: int64

Note:
- The unused columns 'RowNumber', 'CustomerId', and 'Surname' have been dropped.
- Columns with numerical values are listed in numeric_cols.
- The categorical columns 'Gender' and 'Geography' are listed in categorical_cols and one-hot encoded.
- Missing values in the 'Tenure' column are filled with the median value.

### Check the class imbalance & fit the model
#### Check the imbalance of data classes.

In [16]:
df

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,619,42,2.0,0.00,1,1,1,101348.88,1,1,0,0,1,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,0,1,1,0
2,502,42,8.0,159660.80,3,1,0,113931.57,1,1,0,0,1,0
3,699,39,1.0,0.00,2,0,0,93826.63,0,1,0,0,1,0
4,850,43,2.0,125510.82,1,1,1,79084.10,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,39,5.0,0.00,2,1,0,96270.64,0,1,0,0,0,1
9996,516,35,10.0,57369.61,1,1,1,101699.77,0,1,0,0,0,1
9997,709,36,7.0,0.00,1,0,1,42085.58,1,1,0,0,1,0
9998,772,42,3.0,75075.31,2,1,0,92888.52,1,0,1,0,0,1


In [17]:
#checking the 'Exited' column
df['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64

In [18]:
# percentage (%) of Exited data
df['Exited'].value_counts()/df.shape[0]*100

0    79.63
1    20.37
Name: Exited, dtype: float64

Findings:

 - The classes are imbalanced:
   - 79.63% of customers stayed with the bank
   - 20.37% of customers exited
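
With this imbalance, accuracy alone can be misleading. The sketch below uses only the class counts reported above and evaluates a hypothetical constant model that always predicts class 0 (the customer stays). The variable names are illustrative assumptions.

```python
# Class counts taken from df['Exited'].value_counts() above
n_stay, n_exit = 7963, 2037

# A constant model that always predicts 0 ("stays"):
# every exiting customer is a false negative, every staying one a true negative
tp, fp, fn, tn = 0, 0, n_exit, n_stay

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
# Precision is undefined here (0 / 0), so F1 is conventionally reported as 0
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print('accuracy :', accuracy)
print('recall   :', recall)
print('F1       :', f1)
```

The constant model is right for roughly four customers out of five while never identifying a single customer who exits, which is why F1 rather than accuracy is tracked in the rest of the notebook.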

#### Fitting the model without considering the class imbalance

In [19]:
# separate the data into training, validation, and test sets

train_valid, test = train_test_split(df, test_size = 0.15)
train, valid = train_test_split(train_valid, test_size = 0.15)

#train
features_train = train.drop(['Exited'], axis=1)
target_train = train['Exited']

#validation
features_valid = valid.drop(['Exited'], axis=1)
target_valid = valid['Exited']

#test
features_test = test.drop(['Exited'], axis=1)
target_test = test['Exited']

#check features shape
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(7225, 13)
(1275, 13)
(1500, 13)
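
The split above does not fix a random seed, so the rows that land in each set can differ from run to run. The sketch below, run on synthetic stand-in data rather than the Churn dataset, shows one possible way to make the split repeatable and to keep the class ratio the same in every set, using the `random_state` and `stratify` parameters of `train_test_split`. The seed value and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in target: 1000 rows, 20% positives (illustrative only)
y = np.array([1] * 200 + [0] * 800)
X = np.arange(len(y)).reshape(-1, 1)

# random_state makes the split repeatable; stratify preserves the class ratio
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=8080)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tv, y_tv, test_size=0.15, stratify=y_tv, random_state=8080)

for name, part in [('train', y_train), ('valid', y_valid), ('test', y_test)]:
    print(name, len(part), 'positive share:', round(part.mean(), 3))
```

Stratifying matters most for the smaller validation and test sets, where a purely random draw can shift the share of the minority class noticeably.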


In [20]:
#check dataset shape
df.shape

(10000, 14)

In [21]:
# scaling features

scaler = StandardScaler()

features_train[numeric_cols] = scaler.fit_transform(features_train[numeric_cols])
features_valid[numeric_cols] = scaler.transform(features_valid[numeric_cols])
features_test[numeric_cols] = scaler.transform(features_test[numeric_cols])

In [22]:
#check features_train dataset
features_train.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
7649,0.677296,-0.752448,-0.73715,110581.29,-0.911879,0.643561,0.970134,-0.082739,1,0,0,1,0
5501,-0.196652,-0.752448,1.081683,137687.72,-0.911879,0.643561,-1.030785,1.574902,1,0,0,1,0
9192,1.572053,1.820721,-1.464683,79954.61,0.815074,0.643561,0.970134,-1.19936,0,0,1,0,1
3430,-0.259077,-0.180633,-1.464683,177069.24,0.815074,0.643561,0.970134,-0.058865,1,0,0,0,1
4055,-2.058995,2.011326,0.35415,121730.49,-0.911879,0.643561,0.970134,0.756027,1,0,0,1,0


In [24]:
#check features_valid dataset
features_valid.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
7567,-0.21746,0.486485,-0.009617,0.0,4.268978,-1.553854,-1.030785,0.314624,1,0,0,0,1
952,-0.820901,2.583141,1.44545,111577.01,-0.911879,-1.553854,0.970134,1.561077,0,1,0,1,0
6972,-0.571201,1.248906,1.44545,0.0,-0.911879,0.643561,0.970134,0.119363,1,0,0,0,1
2490,0.479618,-0.371238,-0.373383,174902.26,-0.911879,0.643561,-1.030785,-0.528396,0,1,0,1,0
8471,-1.299492,1.153603,1.809216,0.0,2.542026,0.643561,-1.030785,1.244218,1,0,0,0,1


In [25]:
#check features_test dataset
features_test.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
3387,-1.174642,4.298588,1.081683,92242.34,-0.911879,0.643561,0.970134,1.508299,0,1,0,0,1
8081,-0.841709,-0.847751,0.35415,106629.49,-0.911879,-1.553854,0.970134,-0.962398,0,1,0,0,1
2377,-1.049792,1.0583,-0.009617,94748.76,0.815074,-1.553854,0.970134,-1.49274,1,0,0,0,1
8425,1.093462,-0.943054,1.44545,117035.89,-0.911879,0.643561,0.970134,-1.349249,1,0,0,0,1
1651,-0.16544,-0.752448,1.44545,108632.48,-0.911879,0.643561,0.970134,1.390954,0,0,1,0,1


In [26]:
# LogisticRegression on the imbalanced dataset

logreg = LogisticRegression(random_state=8080)
logreg.fit(features_train, target_train)

prediction_train = logreg.predict(features_train)
prediction_valid = logreg.predict(features_valid)
probabilities_valid = logreg.predict_proba(features_valid)[:,1]

print('LogisticRegression with imbalance dataset')

print('   Train F1 score :', f1_score(target_train, prediction_train))
print('   Valid F1 score :', f1_score(target_valid, prediction_valid))
print('   AUC-ROC score :', roc_auc_score(target_valid, probabilities_valid))

LogisticRegression with imbalance dataset
   Train F1 score : 0.025065963060686012
   Valid F1 score : 0.022058823529411766
   AUC-ROC score : 0.4194757961496265


In [27]:
# Decision tree classifier on the imbalanced dataset

for depth in [1,2,4,6,8,10, None]:
    dtree = DecisionTreeClassifier(random_state = 8080, max_depth = depth)
    dtree.fit(features_train, target_train)

    prediction_valid = dtree.predict(features_valid)
    prediction_train = dtree.predict(features_train)
    probabilities_valid = dtree.predict_proba(features_valid)[:,1]

    print('Max_depth =', depth)
    print('    Train F1 score :', f1_score(target_train, prediction_train))
    print('    Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

Max_depth = 1
    Train F1 score : 0.0
    Valid F1 score : 0.0
    AUC-ROC  : 0.6929413576999954
Max_depth = 2
    Train F1 score : 0.5091053048297705
    Valid F1 score : 0.49673202614379075
    AUC-ROC  : 0.7351309006747921
Max_depth = 4
    Train F1 score : 0.40674394099051625
    Valid F1 score : 0.40935672514619886
    AUC-ROC  : 0.8190797877936247
Max_depth = 6
    Train F1 score : 0.5894308943089431
    Valid F1 score : 0.6044444444444445
    AUC-ROC  : 0.8514499015614904
Max_depth = 8
    Train F1 score : 0.6601638704642996
    Valid F1 score : 0.6052631578947368
    AUC-ROC  : 0.847429702881017
Max_depth = 10
    Train F1 score : 0.7370855821125675
    Valid F1 score : 0.5892473118279571
    AUC-ROC  : 0.8193953921760171
Max_depth = None
    Train F1 score : 1.0
    Valid F1 score : 0.5155393053016453
    AUC-ROC  : 0.6974086625888577


In [28]:
#RandomForest with imbalance dataset

for estim in range ( 10, 101, 10):
    rf = RandomForestClassifier(random_state = 8080, n_estimators = estim)

    rf.fit(features_train, target_train)
    prediction_valid = rf.predict(features_valid)
    prediction_train = rf.predict(features_train)
    probabilities_valid = rf.predict_proba(features_valid)[:,1]

    print('n_estimators =', estim)
    print('   Train F1 score :', f1_score(target_train, prediction_train))
    print('   Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('   AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

n_estimators = 10
   Train F1 score : 0.9636109167249826
   Valid F1 score : 0.5346534653465347
   AUC-ROC  : 0.8288503734651858
n_estimators = 20
   Train F1 score : 0.9897330595482547
   Valid F1 score : 0.5714285714285714
   AUC-ROC  : 0.8473996453207894
n_estimators = 30
   Train F1 score : 0.9948927477017365
   Valid F1 score : 0.5371702637889689
   AUC-ROC  : 0.8478711732968636
n_estimators = 40
   Train F1 score : 0.9972826086956521
   Valid F1 score : 0.5419664268585132
   AUC-ROC  : 0.8503809795758879
n_estimators = 50
   Train F1 score : 0.9993220338983051
   Valid F1 score : 0.5645933014354066
   AUC-ROC  : 0.8501893626294353
n_estimators = 60
   Train F1 score : 1.0
   Valid F1 score : 0.5569007263922519
   AUC-ROC  : 0.8501743338493214
n_estimators = 70
   Train F1 score : 1.0
   Valid F1 score : 0.5590361445783132
   AUC-ROC  : 0.8484009377958791
n_estimators = 80
   Train F1 score : 1.0
   Valid F1 score : 0.5520581113801453
   AUC-ROC  : 0.8491279550338898
n_estimators 

Findings :
- The highest F1 score comes from the RandomForestClassifier with n_estimators = 100
  - F1 score : 0.5578231292517007
  - AUC-ROC  : 0.8348249465270741

### Improve model Quality

#### class_weight adjustment

In [29]:
# LogisticRegression with balanced class_weight

logreg = LogisticRegression(random_state=8080, class_weight = 'balanced')
logreg.fit(features_train, target_train)

prediction_train = logreg.predict(features_train)
prediction_valid = logreg.predict(features_valid)
probabilities_valid = logreg.predict_proba(features_valid)[:,1]

print('LogisticRegression with balanced class_weight')
print('   Train F1 score :', f1_score(target_train, prediction_train))
print('   Valid F1 score :', f1_score(target_valid, prediction_valid))
print('   AUC-ROC score :', roc_auc_score(target_valid, probabilities_valid))

LogisticRegression with balanced class_weight
   Train F1 score : 0.39210526315789473
   Valid F1 score : 0.4035369774919615
   AUC-ROC score : 0.6293076241001518


In [30]:
# Decision tree classifier with balanced class_weight

for depth in [1,2,4,6,8,10, None]:
    dtree = DecisionTreeClassifier(random_state = 8080, max_depth = depth,class_weight = 'balanced')
    dtree.fit(features_train, target_train)

    prediction_valid = dtree.predict(features_valid)
    prediction_train = dtree.predict(features_train)
    probabilities_valid = dtree.predict_proba(features_valid)[:,1]

    print('Max_depth =', depth)
    print('    Train F1 score :', f1_score(target_train, prediction_train))
    print('    Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

Max_depth = 1
    Train F1 score : 0.4883720930232558
    Valid F1 score : 0.4971751412429378
    AUC-ROC  : 0.7016956221163528
Max_depth = 2
    Train F1 score : 0.51440329218107
    Valid F1 score : 0.5165745856353591
    AUC-ROC  : 0.7502611250544793
Max_depth = 4
    Train F1 score : 0.5510567863509039
    Valid F1 score : 0.5810055865921788
    AUC-ROC  : 0.8314428380348368
Max_depth = 6
    Train F1 score : 0.5927192768140728
    Valid F1 score : 0.5714285714285715
    AUC-ROC  : 0.8536516178481792
Max_depth = 8
    Train F1 score : 0.6515110992243915
    Valid F1 score : 0.5861561119293078
    AUC-ROC  : 0.833811749500293
Max_depth = 10
    Train F1 score : 0.704913678618858
    Valid F1 score : 0.551622418879056
    AUC-ROC  : 0.7823325418175807
Max_depth = None
    Train F1 score : 1.0
    Valid F1 score : 0.5009708737864078
    AUC-ROC  : 0.6844763973008311


In [31]:
# RandomForest with balanced class_weight

for estim in range ( 10, 101, 10):
    rf = RandomForestClassifier(random_state = 8080, n_estimators = estim,class_weight = 'balanced' )

    rf.fit(features_train, target_train)
    prediction_valid = rf.predict(features_valid)
    prediction_train = rf.predict(features_train)
    probabilities_valid = rf.predict_proba(features_valid)[:,1]

    print('n_estimators =', estim)
    print('    Train F1 score :', f1_score(target_train, prediction_train))
    print('    Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

n_estimators = 10
    Train F1 score : 0.9631966351209252
    Valid F1 score : 0.5012531328320802
    AUC-ROC  : 0.8153075639850313
n_estimators = 20
    Train F1 score : 0.9886947584789311
    Valid F1 score : 0.5590361445783132
    AUC-ROC  : 0.8355363020183653
n_estimators = 30
    Train F1 score : 0.9945504087193461
    Valid F1 score : 0.5402843601895734
    AUC-ROC  : 0.8445122409414029
n_estimators = 40
    Train F1 score : 0.9976230899830221
    Valid F1 score : 0.5598086124401914
    AUC-ROC  : 0.8515137738769744
n_estimators = 50
    Train F1 score : 0.9996611318197222
    Valid F1 score : 0.5502392344497608
    AUC-ROC  : 0.853007258900795
n_estimators = 60
    Train F1 score : 1.0
    Valid F1 score : 0.5507246376811594
    AUC-ROC  : 0.8521600114218728
n_estimators = 70
    Train F1 score : 1.0
    Valid F1 score : 0.5414634146341464
    AUC-ROC  : 0.8523027848329551
n_estimators = 80
    Train F1 score : 1.0
    Valid F1 score : 0.5458937198067633
    AUC-ROC  : 0.8529602

Findings :
- The highest F1 score comes from the RandomForestClassifier with n_estimators = 60
  - F1 score : 0.5765765765765766
  - AUC-ROC  : 0.8361090003377237
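
To make the effect of `class_weight = 'balanced'` concrete: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula by hand to the class counts of `target_train` reported by `value_counts()` in this notebook (5750 and 1475). The variable names are illustrative.

```python
# Class counts of target_train as reported by target_train.value_counts()
class_counts = {0: 5750, 1: 1475}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count of that class)
weights = {label: n_samples / (n_classes * count)
           for label, count in class_counts.items()}

for label, weight in weights.items():
    print('class', label, 'weight', round(weight, 4))
```

With these weights, each class contributes the same total weight to the loss, so an error on an exiting customer costs roughly four times as much as an error on a staying one.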

#### using upsampling approach

In [32]:
# define upsampling method
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat )
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=8080)

    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 3)


In [33]:
#counting the target_train data
target_train.value_counts()

0    5750
1    1475
Name: Exited, dtype: int64

In [34]:
#counting the target_upsampled data
target_upsampled.value_counts()

0    5750
1    4425
Name: Exited, dtype: int64
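
The `repeat` argument controls how close the classes end up to an even split. As a rough sketch based only on the counts reported above, the share of class 1 after upsampling can be computed for several values of `repeat`. The helper name `positive_share_after_upsampling` is hypothetical.

```python
# target_train class counts reported above
n_neg, n_pos = 5750, 1475


def positive_share_after_upsampling(n_neg, n_pos, repeat):
    """Share of class 1 once its rows are repeated `repeat` times."""
    return n_pos * repeat / (n_neg + n_pos * repeat)


shares = {repeat: positive_share_after_upsampling(n_neg, n_pos, repeat)
          for repeat in range(1, 6)}

for repeat, share in shares.items():
    print('repeat =', repeat, '-> class 1 share:', round(share, 3))
```

With `repeat = 3`, as used here, class 1 makes up a little under half of the upsampled training set; `repeat = 4` would push it slightly past half.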

In [35]:
# LogisticRegression with upsampled data

logreg = LogisticRegression(random_state=8080)
logreg.fit(features_upsampled, target_upsampled)

prediction_train = logreg.predict(features_train)
prediction_valid = logreg.predict(features_valid)
probabilities_valid = logreg.predict_proba(features_valid)[:,1]

print('LogisticRegression with upsampling data')
print('   Train F1 score :', f1_score(target_train, prediction_train))
print('   Valid F1 score :', f1_score(target_valid, prediction_valid))
print('   AUC-ROC score :', roc_auc_score(target_valid, probabilities_valid))

LogisticRegression with upsampling data
   Train F1 score : 0.1836839404822986
   Valid F1 score : 0.1776504297994269
   AUC-ROC score : 0.4204000661266325


In [36]:
# Decision tree classifier with upsampled data

for depth in [1,2,4,6,8,10, None]:
    dtree = DecisionTreeClassifier(random_state = 8080, max_depth = depth)
    dtree.fit(features_upsampled, target_upsampled)

    prediction_valid = dtree.predict(features_valid)
    prediction_train = dtree.predict(features_train)
    probabilities_valid = dtree.predict_proba(features_valid)[:,1]

    print('Max_depth =', depth)
    print('    Train F1 score :', f1_score(target_train, prediction_train))
    print('    Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

Max_depth = 1
    Train F1 score : 0.4883720930232558
    Valid F1 score : 0.4971751412429378
    AUC-ROC  : 0.7016956221163528
Max_depth = 2
    Train F1 score : 0.51440329218107
    Valid F1 score : 0.5165745856353591
    AUC-ROC  : 0.7502611250544793
Max_depth = 4
    Train F1 score : 0.5510567863509039
    Valid F1 score : 0.5810055865921788
    AUC-ROC  : 0.8314428380348368
Max_depth = 6
    Train F1 score : 0.6078061911170929
    Valid F1 score : 0.5896296296296297
    AUC-ROC  : 0.8544274786215603
Max_depth = 8
    Train F1 score : 0.662049062049062
    Valid F1 score : 0.5822784810126581
    AUC-ROC  : 0.833913193766062
Max_depth = 10
    Train F1 score : 0.7309260337798487
    Valid F1 score : 0.5673076923076922
    AUC-ROC  : 0.7759584604517651
Max_depth = None
    Train F1 score : 1.0
    Valid F1 score : 0.4830188679245283
    AUC-ROC  : 0.6746701182764995


In [37]:
#RandomForest with upsampling data

for estim in range ( 10, 101, 10):
    rf = RandomForestClassifier(random_state = 8080, n_estimators = estim)

    rf.fit(features_upsampled, target_upsampled)
    prediction_valid = rf.predict(features_valid)
    prediction_train = rf.predict(features_train)
    probabilities_valid = rf.predict_proba(features_valid)[:,1]

    print('n_estimators =', estim)
    print('    Train F1 score :', f1_score(target_train, prediction_train))
    print('    Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

n_estimators = 10
    Train F1 score : 0.9979702300405954
    Valid F1 score : 0.5925925925925927
    AUC-ROC  : 0.8178568208118547
n_estimators = 20
    Train F1 score : 0.9996611318197222
    Valid F1 score : 0.5702306079664569
    AUC-ROC  : 0.8324065585596417
n_estimators = 30
    Train F1 score : 1.0
    Valid F1 score : 0.5738045738045738
    AUC-ROC  : 0.8371068095402696
n_estimators = 40
    Train F1 score : 1.0
    Valid F1 score : 0.5896907216494846
    AUC-ROC  : 0.8423311892273704
n_estimators = 50
    Train F1 score : 1.0
    Valid F1 score : 0.5979381443298969
    AUC-ROC  : 0.8469055741745443
n_estimators = 60
    Train F1 score : 1.0
    Valid F1 score : 0.6004140786749482
    AUC-ROC  : 0.8472380859345646
n_estimators = 70
    Train F1 score : 1.0
    Valid F1 score : 0.5899581589958158
    AUC-ROC  : 0.8489983318054072
n_estimators = 80
    Train F1 score : 1.0
    Valid F1 score : 0.5933609958506225
    AUC-ROC  : 0.8501179759238943
n_estimators = 90
    Train F1 sco

Findings :
- The highest F1 score comes from the RandomForestClassifier with n_estimators = 80
   - F1 score : 0.6088709677419355
   - AUC-ROC  : 0.8399857874591917

#### using downsampling approach

In [38]:
#define downsampling method

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=8080)] + [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=8080)] + [target_ones])

    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=8080)

    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.3)


In [39]:
#counting the target_train data
target_train.value_counts()

0    5750
1    1475
Name: Exited, dtype: int64

In [40]:
#counting the target_downsampled data
target_downsampled.value_counts()

0    1725
1    1475
Name: Exited, dtype: int64
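
The `fraction` argument plays the same role for downsampling. The sketch below, again using only the counts reported above, checks how many class 0 rows a fraction of 0.3 keeps and which fraction would give an exactly even split. The names `negatives_kept` and `balanced_fraction` are hypothetical.

```python
# target_train class counts reported above
n_neg, n_pos = 5750, 1475


def negatives_kept(n_neg, fraction):
    """Number of class 0 rows left after keeping `fraction` of them."""
    return round(n_neg * fraction)


kept = negatives_kept(n_neg, 0.3)

# Fraction that would leave as many class 0 rows as class 1 rows
balanced_fraction = n_pos / n_neg

print('class 0 rows kept at fraction 0.3 :', kept)
print('fraction for an even split        :', round(balanced_fraction, 4))
```

Unlike upsampling, downsampling discards information: at a fraction of 0.3, about 70% of the class 0 training rows are never seen by the model.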

In [41]:
# LogisticRegression with downsampled data

logreg = LogisticRegression(random_state=8080)
logreg.fit(features_downsampled, target_downsampled)

prediction_train = logreg.predict(features_train)
prediction_valid = logreg.predict(features_valid)
probabilities_valid = logreg.predict_proba(features_valid)[:,1]

print('LogisticRegression with downsampling data')

#print ('training accuracy score   : ',accuracy_score (target_train, prediction_train)*100)
#print ('validation accuracy score : ',accuracy_score (target_valid, prediction_valid)*100)
print('   Train F1 score :', f1_score(target_train, prediction_train))
print('   Valid F1 score :', f1_score(target_valid, prediction_valid))
print('   AUC-ROC score :', roc_auc_score(target_valid, probabilities_valid))

LogisticRegression with downsampling data
   Train F1 score : 0.39300783604581074
   Valid F1 score : 0.4056525353283458
   AUC-ROC score : 0.6296720720179143


In [42]:
# Decision tree classifier with downsampled data

for depth in [1,2,4,6,8,10, None]:
    dtree = DecisionTreeClassifier(random_state = 8080, max_depth = depth)
    dtree.fit(features_downsampled, target_downsampled)

    prediction_valid = dtree.predict(features_valid)
    prediction_train = dtree.predict(features_train)
    probabilities_valid = dtree.predict_proba(features_valid)[:,1]

    print('Max_depth =', depth)
    print('    Train F1 score :', f1_score(target_train, prediction_train))
    print('    Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

Max_depth = 1
    Train F1 score : 0.4883720930232558
    Valid F1 score : 0.4971751412429378
    AUC-ROC  : 0.7016956221163528
Max_depth = 2
    Train F1 score : 0.51440329218107
    Valid F1 score : 0.5165745856353591
    AUC-ROC  : 0.7502611250544793
Max_depth = 4
    Train F1 score : 0.5562350438713108
    Valid F1 score : 0.5913818722139672
    AUC-ROC  : 0.8304791175100318
Max_depth = 6
    Train F1 score : 0.592845870594424
    Valid F1 score : 0.5872093023255813
    AUC-ROC  : 0.8470953125234826
Max_depth = 8
    Train F1 score : 0.6223811957077159
    Valid F1 score : 0.5371428571428571
    AUC-ROC  : 0.8084581974481131
Max_depth = 10
    Train F1 score : 0.6625226625226625
    Valid F1 score : 0.5356125356125357
    AUC-ROC  : 0.7448676715910969
Max_depth = None
    Train F1 score : 0.7248157248157248
    Valid F1 score : 0.48108108108108105
    AUC-ROC  : 0.6906757690978224


In [43]:
#RandomForest with downsampling data

for estim in range ( 10, 101, 10):
    rf = RandomForestClassifier(random_state = 8080, n_estimators = estim)

    rf.fit(features_downsampled, target_downsampled)
    prediction_valid = rf.predict(features_valid)
    prediction_train = rf.predict(features_train)
    probabilities_valid = rf.predict_proba(features_valid)[:,1]

    print('n_estimators =', estim)
    print('    Train F1 score :', f1_score(target_train, prediction_train))
    print('    Valid F1 score :', f1_score(target_valid, prediction_valid))
    print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

n_estimators = 10
    Train F1 score : 0.7885933644091034
    Valid F1 score : 0.5794392523364486
    AUC-ROC  : 0.8317490494296578
n_estimators = 20
    Train F1 score : 0.7951153324287653
    Valid F1 score : 0.5858895705521473
    AUC-ROC  : 0.8335299598731571
n_estimators = 30
    Train F1 score : 0.7973962571196095
    Valid F1 score : 0.5914634146341464
    AUC-ROC  : 0.8441722147913254
n_estimators = 40
    Train F1 score : 0.7955711585201188
    Valid F1 score : 0.5869894099848714
    AUC-ROC  : 0.8412322096815401
n_estimators = 50
    Train F1 score : 0.7965414752769522
    Valid F1 score : 0.5877061469265367
    AUC-ROC  : 0.8446644073400562
n_estimators = 60
    Train F1 score : 0.795362631437045
    Valid F1 score : 0.5894736842105264
    AUC-ROC  : 0.8450044334901335
n_estimators = 70
    Train F1 score : 0.7938643702906352
    Valid F1 score : 0.5960665658093798
    AUC-ROC  : 0.8465167044890965
n_estimators = 80
    Train F1 score : 0.7925846319183235
    Valid F1 score 

Findings :
- The highest F1 score comes from the RandomForestClassifier with n_estimators = 100
    - F1 score : 0.6184615384615384
    - AUC-ROC  : 0.8226651328380052

## Model Testing

Based on the previous model fitting, the highest F1 score was obtained with a RandomForestClassifier trained on downsampled data with n_estimators = 100.

In [44]:
#fitting the final model
final_model = RandomForestClassifier(random_state = 8080, n_estimators = 100)
final_model.fit(features_downsampled, target_downsampled)
prediction_valid = final_model.predict(features_valid)
probabilities_valid = final_model.predict_proba(features_valid)[:,1]


print('    F1 score :', f1_score(target_valid, prediction_valid))
print('    AUC-ROC  :', roc_auc_score(target_valid, probabilities_valid))

    F1 score : 0.5948406676783005
    AUC-ROC  : 0.8480759404259156


In [45]:
#run the last test
prediction_test = final_model.predict(features_test)
probabilities_test = final_model.predict_proba(features_test)[:,1]


print('    F1 score :', f1_score(target_test, prediction_test))
print('    AUC-ROC  :', roc_auc_score(target_test, probabilities_test))

    F1 score : 0.5986394557823129
    AUC-ROC  : 0.84627915978602
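
When comparing the two metrics, it helps to keep in mind that F1 is computed from hard predictions at one decision threshold, while AUC-ROC summarizes how well the predicted probabilities rank customers across all thresholds. The sketch below illustrates this on a small hypothetical set of labels and scores (not model output from this notebook), using plain-Python helpers whose names are assumptions.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the hard predictions obtained by cutting scores at `threshold`."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def auc_roc(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80]

print('AUC-ROC:', round(auc_roc(y_true, scores), 4))
for threshold in (0.3, 0.4, 0.5):
    print('threshold', threshold, 'F1:',
          round(f1_at_threshold(y_true, scores, threshold), 4))
```

The AUC-ROC is a single number for the whole ranking, whereas the F1 score changes as the threshold moves, so a model can have a high AUC-ROC and still a modest F1 at the default threshold of 0.5.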


## Conclusion

- Beta Bank's customer base is shrinking as customers exit the bank.
- The best model for predicting whether a customer will exit is a RandomForestClassifier trained on downsampled data. On the test set it achieved:
  - F1 score: 0.5986
  - AUC-ROC: 0.8463
- We suggest that bank employees engage more with loyal existing customers to keep the number of bank customers from declining.