# Business Question: 
## Based on customer data provided by a global bank, provide an Exiting Customer list to the Marketing Team.
## The Exiting Customer list will be used to retain customers who are likely to exit the bank.

# Import Library

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix,accuracy_score,precision_score,recall_score,f1_score

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Data Extraction

In [None]:
df=pd.read_csv('/kaggle/input/churn-modelling/Churn_Modelling.csv')

# EDA

## Checking Missing Value and Data Type

In [None]:
df.info()

In [None]:
df.groupby('Exited').agg({'CustomerId':'count'})

#### 1) There are no missing values
#### 2) Numerical data: Age, EstimatedSalary, Tenure, Balance, CreditScore
#### 3a) Categorical data: Gender, Geography
#### 3b) HasCrCard, IsActiveMember and NumOfProducts should be treated as categorical/binary data rather than int64
#### 4a) Label: Exited should be treated as categorical data
#### 4b) Label: Exited is imbalanced; the proportion of Exit to Not Exit is around 20%:80%
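The checks behind these notes can be reproduced with a few pandas calls. The sketch below uses a tiny made-up frame (`toy`, a hypothetical stand-in, not the bank data) so that it runs on its own; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    'Age': [42, 35, 51, 29, 46],
    'Geography': ['France', 'Spain', 'Germany', 'France', 'Germany'],
    'Exited': [1, 0, 1, 0, 0],
})

# Count missing values per column (all zeros means nothing is missing).
missing_per_column = toy.isnull().sum()

# Share of each label value; a strong skew indicates class imbalance.
label_share = toy['Exited'].value_counts(normalize=True)

print(missing_per_column)
print(label_share)
```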

## Prepare Dataset for EDA

In [None]:
EDA_Data=df.copy()
EDA_Data['Exited']=EDA_Data['Exited'].apply(lambda x: 'Exit' if x==1 else 'Not Exit')
EDA_Data['HasCrCard']=EDA_Data['HasCrCard'].apply(lambda x: 'Has Credit Card' if x==1 else 'Do not have Credit Card')
EDA_Data['IsActiveMember']=EDA_Data['IsActiveMember'].apply(lambda x: 'Active Member' if x==1 else 'Inactive Member')
EDA_Data['NumOfProducts']=EDA_Data['NumOfProducts'].astype('str')

In [None]:
EDA_Data.drop(['RowNumber','CustomerId'],axis=1).describe()

## Part 1: Relationship between Numerical Data and Exited

### 1) Relationship between Age and Exited

In [None]:
sns.boxplot(x='Exited',y='Age',data=EDA_Data)

#### Overall, customers who exited are older than those who did not.
#### The median age of exited customers is around 45.
#### The median age of exited customers is higher than the age of 75% of the customers who did not exit.
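A boxplot draws the quartiles of each group, so the comparison above can also be checked numerically. This is a minimal sketch on invented ages (the `toy` frame is hypothetical); with the real data, `EDA_Data` would take its place.

```python
import pandas as pd

# Hypothetical ages, invented for illustration; not taken from the bank data.
toy = pd.DataFrame({
    'Exited': ['Exit'] * 5 + ['Not Exit'] * 5,
    'Age': [38, 42, 45, 50, 58, 25, 30, 35, 40, 47],
})

# Quartiles of Age within each group: the numbers a boxplot draws.
quartiles = toy.groupby('Exited')['Age'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)

# Compare the exited median with the upper quartile of the other group.
exit_median = quartiles.loc['Exit', 0.5]
stay_q3 = quartiles.loc['Not Exit', 0.75]
print(exit_median > stay_q3)
```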

### 2) Relationship between Tenure and Exited

In [None]:
sns.boxplot(x='Exited',y='Tenure',data=EDA_Data)

#### Compared to customers who did not exit, customers who joined less than 3 years ago or more than 7 years ago are more likely to exit.

### 3) Relationship between Balance and Exited

In [None]:
sns.boxplot(x='Exited',y='Balance',data=EDA_Data)

#### Customers whose balance is below $40000 are less likely to exit
#### The top 25% of exited customers have a higher balance than those who did not exit

### 4) Relationship between Credit Score and Exited

In [None]:
sns.boxplot(x='Exited',y='CreditScore',data=EDA_Data)

#### Credit score is only slightly correlated with Exited.
#### Customers with a lower credit score are slightly more likely to exit.
#### There are some outliers whose scores fall around 350.

### 5) Relationship between EstimatedSalary and Exited

In [None]:
sns.boxplot(x='Exited',y='EstimatedSalary',data=EDA_Data)

#### Estimated salary is not correlated with Exited.

## Part 2: Relationship between Categorical Data and Exited

In [None]:
fig, axis = plt.subplots(3, 2, figsize=(10,15),)
axis[0,0].set_title("Relationship between Gender and Exited")
axis[0,1].set_title("Relationship between Geography and Exited")
axis[1,0].set_title("Relationship between Has Credit Card and Exited")
axis[1,1].set_title("Relationship between Is Active and Exited")
axis[2,0].set_title("Relationship between No. of Product and Exited")

sns.countplot(x='Gender',hue='Exited',data=EDA_Data,ax=axis[0,0])
sns.countplot(x='Geography',hue='Exited',data=EDA_Data,ax=axis[0,1])
sns.countplot(x='HasCrCard',hue='Exited',data=EDA_Data,ax=axis[1,0])
sns.countplot(x='IsActiveMember',hue='Exited',data=EDA_Data,ax=axis[1,1])
sns.countplot(x='NumOfProducts',hue='Exited',data=EDA_Data,ax=axis[2,0])

axis[2,1].remove()

#### From the above charts, we can see that:
#### 1) Female customers are more likely to exit than male customers
#### 2) Customers in Germany are more likely to exit
#### 3) Inactive members have a higher proportion of exits than active members
#### 4) Having a credit card is not correlated with exiting
#### 5) Customers with 2 products have a higher proportion of not exiting

### Correlation between features

In [None]:
plt.figure(figsize=(20,10))
EDA2=df.iloc[:,3:-1]
EDA2=pd.concat([df.iloc[:,-1],EDA2,],axis=1)
EDA2=pd.get_dummies(EDA2,columns=['Geography','Gender','NumOfProducts','HasCrCard','IsActiveMember'])
sns.heatmap(EDA2.corr(),vmin=-1,vmax=1,annot=True)

#### A coefficient close to 0 means no linear correlation.
#### A coefficient close to -1 or 1 means a strong negative or positive relationship, respectively.
#### From the heat map above, we can see that all features have only weak relationships with each other.
#### Features with a relatively stronger (though still slight) relationship with Exited include:
#### Age, NumOfProducts, Geography, IsActiveMember, Balance, Gender
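Reading individual cells off a large heat map is error-prone, so it can help to list the correlations with the label in order of strength. The sketch below does this on synthetic data in which `Age` is constructed to drive the label and `Noise` is not; both columns and the `toy` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: 'Age' drives the label, 'Noise' does not.
age = rng.normal(40, 10, n)
noise = rng.normal(0, 1, n)
exited = (age + rng.normal(0, 10, n) > 45).astype(int)
toy = pd.DataFrame({'Exited': exited, 'Age': age, 'Noise': noise})

# Correlation of each feature with the label.
corr = toy.corr()['Exited'].drop('Exited')

# Reorder by absolute value, strongest first, keeping the sign.
corr_with_label = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_label)
```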

## Train Data without Feature Selection - Set 1

In [None]:
X0=df.iloc[:,3:-1]
X0_c=X0.loc[:,['Geography','Gender','NumOfProducts','HasCrCard','IsActiveMember']]

#### Apply OneHotEncoding to Categorical Data: (Geography,Gender,NumOfProducts,HasCrCard,IsActiveMember)

In [None]:
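# Pass columns= explicitly: by default get_dummies only encodes object/category columns,
# so the integer-coded NumOfProducts, HasCrCard and IsActiveMember would be left unchanged
X0_c=pd.get_dummies(X0_c,columns=['Geography','Gender','NumOfProducts','HasCrCard','IsActiveMember'])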
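`pd.get_dummies` expands only object and category columns unless `columns=` is passed, which matters for integer-coded categories such as `NumOfProducts`. A small sketch of the difference, using a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical rows; only the column types matter here.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany'],   # object column
    'NumOfProducts': [1, 2, 1],                    # integer column
})

# Default call: only the object column is expanded.
default_cols = list(pd.get_dummies(toy).columns)

# Explicit columns=: the integer column is expanded as well.
explicit_cols = list(
    pd.get_dummies(toy, columns=['Geography', 'NumOfProducts']).columns
)

print(default_cols)
print(explicit_cols)
```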

In [None]:
from sklearn.preprocessing import StandardScaler
X0_n=X0.loc[:,['CreditScore','Age','Balance','Tenure','EstimatedSalary']]
scaler = StandardScaler()
X0_n1=scaler.fit_transform(X0_n)

In [None]:
X0_n1=pd.DataFrame(X0_n1,columns=['CreditScore','Age','Balance','Tenure','EstimatedSalary'])
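`StandardScaler` subtracts each column's mean and divides by its population standard deviation, so every scaled column ends up with mean 0 and standard deviation 1. The sketch below checks this by hand on invented values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical balances, invented for illustration.
values = np.array([[0.0], [50000.0], [100000.0], [150000.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Same result computed by hand: (x - mean) / population standard deviation.
by_hand = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, by_hand))
```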

In [None]:
X=pd.concat([X0_c,X0_n1],axis=1)
y=EDA_Data['Exited'].apply(lambda x: 1 if x=='Exit' else 0)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Generating new data by oversampling
#### As mentioned before, the data is imbalanced, so we increase the number of minority-class samples with the SMOTE technique
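SMOTE creates a synthetic minority sample by taking a minority point, choosing one of its nearest minority neighbours, and interpolating between the two. The sketch below illustrates only that interpolation step with NumPy. It is a simplification (it always uses the single nearest neighbour rather than sampling among k neighbours), not the imbalanced-learn implementation, and the points and the `synthetic_point` helper are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthetic_point(points, rng):
    """Interpolate between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances from the base point to every point; exclude the point itself.
    dist = np.linalg.norm(points - base, axis=1)
    dist[i] = np.inf
    neighbour = points[np.argmin(dist)]
    # A random position on the segment between base and neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)

new_points = np.array([synthetic_point(minority, rng) for _ in range(5)])
print(new_points)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by that class.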

In [None]:
#conda install -c conda-forge imbalanced-learn

In [None]:
from imblearn.over_sampling import SMOTE
smk = SMOTE()
# Oversample training data
X_train, y_train = smk.fit_resample(X_train, y_train)

# Oversample validation data
X_test, y_test = smk.fit_resample(X_test, y_test)

In [None]:
print(y_train.value_counts(),'\n',y_test.value_counts())

#### Start to Train Model

In [None]:
Accuracy_Score=[]
Recall_Score=[]
Precision_Score=[]
F1_Score=[]

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=42).fit(X_train, y_train)
y_pred=clf.predict(X_test)
s1_1_y_pred=y_pred
clf.score(X_test, y_test)

In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
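The same four metrics are printed and stored after every model. A small helper, sketched here under the hypothetical name `score_model`, could compute them in one place; the labels below are made up so that the example runs on its own.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def score_model(y_true, y_pred):
    """Return the four evaluation metrics as a dict."""
    return {
        'Accuracy_Score': accuracy_score(y_true, y_pred),
        'Recall_Score': recall_score(y_true, y_pred),        # TP / actual positives
        'Precision_Score': precision_score(y_true, y_pred),  # TP / predicted positives
        'F1_Score': f1_score(y_true, y_pred),
    }

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

scores = score_model(y_true, y_pred)
for name, value in scores.items():
    print('%s: %.3f' % (name, value))
```

Collecting one such dict per model would also give the summary table directly, for example via `pd.DataFrame` with the model names as keys.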

### SVC (Support Vector Classifier)

In [None]:
from sklearn.svm import SVC
clf = SVC(random_state=42)
svc=clf.fit(X_train, y_train)
y_pred=svc.predict(X_test)
s1_2_y_pred=y_pred
svc.score(X_test, y_test)

In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
s1_3_y_pred=y_pred
clf.score(X_test, y_test)

In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

### XGBoost

In [None]:
import xgboost as xgb
dtrain = xgb.DMatrix(data = X_train, label = y_train) 
dtest = xgb.DMatrix(data = X_test, label = y_test) 
# specify parameters via map
param = {'max_depth':6, 'eta':0.3, 'objective':'binary:hinge' } #Use default value in the first time
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
y_pred = bst.predict(dtest)
s1_4_y_pred=y_pred
accuracy_score(y_test,y_pred)

In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

In [None]:
# Set 1 Score:
ind_name=['Accuracy_Score','Recall_Score','Precision_Score','F1_Score']
summary1=pd.DataFrame(np.vstack((Accuracy_Score,Recall_Score,Precision_Score,F1_Score)),columns=['Logistic Reg.','SVC','Random Forest','XGB'],index=ind_name)
summary1

## Train Data with Feature Selection (Based on EDA) - Set 2 
### [Keep all hyperparameters the same as in Set 1]

### Remove EstimatedSalary, HasCrCard, CreditScore and Tenure, based on the EDA

In [None]:
X1=df.iloc[:,3:-1]
X1_c=X1.loc[:,['Geography','Gender','NumOfProducts','IsActiveMember']]
X1_n=X1.loc[:,['Age','Balance']]

### Part 1: Distribution of Numerical Feature

In [None]:
EDA_Data[EDA_Data['Exited']=='Exit'].describe()

In [None]:
EDA_Data[EDA_Data['Exited']=='Not Exit'].describe()

In [None]:
sns.histplot(EDA_Data['Age'])  # histplot replaces the deprecated distplot(kde=False)

In [None]:
Age1=EDA_Data[EDA_Data['Exited']=='Exit']
sns.histplot(Age1['Age'])  # histplot replaces the deprecated distplot(kde=False)

In [None]:
Age1=EDA_Data[EDA_Data['Exited']=='Not Exit']
sns.histplot(Age1['Age'])  # histplot replaces the deprecated distplot(kde=False)

In [None]:
def age_gp(a):
    if a>=18 and a<30:
        return 'Gp1'
    elif a>=30 and a<40:
        return 'Gp2'
    elif a>=40 and a<50:
        return 'Gp3'
    elif a>=50:
        return 'Gp4'

In [None]:
X1_c['Age_group']=X1_n['Age'].apply(age_gp)
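The same age bands can be expressed with `pd.cut`, which keeps the bin edges in one list. This is a sketch on invented ages chosen to hit each boundary; `right=False` makes every bin closed on the left, matching the `>=` / `<` comparisons in `age_gp`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering each boundary.
ages = pd.Series([18, 29, 30, 39, 40, 49, 50, 92])

# right=False makes each bin closed on the left: [18, 30), [30, 40), ...
groups = pd.cut(
    ages,
    bins=[18, 30, 40, 50, np.inf],
    labels=['Gp1', 'Gp2', 'Gp3', 'Gp4'],
    right=False,
)
print(groups.tolist())
```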

In [None]:
X1_c.head(3)

In [None]:
sns.histplot(EDA_Data['Balance'],kde=True,stat='density')  # histplot replaces the deprecated distplot

In [None]:
X1_c['Balance_Group']=X1_n['Balance'].apply(lambda x: 'Without Balance' if x<50000 else 'With Balance')

In [None]:
## As all of the X's are categorical data, we only need to convert them to binary (dummy) columns

In [None]:
X1_c2=pd.get_dummies(X1_c)
y=EDA_Data['Exited'].apply(lambda x: 1 if x=='Exit' else 0)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1_c2, y, test_size=0.2, random_state=42)

#### Generating new data by oversampling
#### As before, because the data is imbalanced, we increase the number of minority-class samples with the SMOTE technique

In [None]:
from imblearn.over_sampling import SMOTE
smk = SMOTE()
# Oversample training data
X_train, y_train = smk.fit_resample(X_train, y_train)

# Oversample validation data
X_test, y_test = smk.fit_resample(X_test, y_test)

In [None]:
print(y_train.value_counts(),'\n',y_test.value_counts())

#### Start to Train Model

In [None]:
Accuracy_Score=[]
Recall_Score=[]
Precision_Score=[]
F1_Score=[]

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=42).fit(X_train, y_train)
y_pred=clf.predict(X_test)
s2_1_y_pred=y_pred
clf.score(X_test, y_test)


In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

### SVC (Support Vector Classifier)

In [None]:
from sklearn.svm import SVC
clf = SVC(random_state=42)
svc=clf.fit(X_train, y_train)
y_pred=svc.predict(X_test)
s2_2_y_pred=y_pred
svc.score(X_test, y_test)

In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
s2_3_y_pred=y_pred
clf.score(X_test, y_test)

In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

### XGBoost

In [None]:
import xgboost as xgb
dtrain = xgb.DMatrix(data = X_train, label = y_train) 
dtest = xgb.DMatrix(data = X_test, label = y_test) 
# specify parameters via map
param = {'max_depth':6, 'eta':0.3, 'objective':'binary:hinge' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
y_pred = bst.predict(dtest)
s2_4_y_pred=y_pred
accuracy_score(y_test,y_pred)

In [None]:
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

In [None]:
# Set 2 Score:
ind_name=['Accuracy_Score','Recall_Score','Precision_Score','F1_Score']
summary2=pd.DataFrame(np.vstack((Accuracy_Score,Recall_Score,Precision_Score,F1_Score)),columns=['Logistic Reg.','SVC','Random Forest','XGB'],index=ind_name)
summary2

In [None]:
import sklearn.metrics as metrics
# calculate the fpr and tpr from the stored Set 1 predictions (hard 0/1 labels) of each model


fpr11, tpr11, threshold = metrics.roc_curve(y_test, s1_1_y_pred)
roc_auc11 = metrics.auc(fpr11, tpr11)
fpr12, tpr12, threshold = metrics.roc_curve(y_test, s1_2_y_pred)
roc_auc12 = metrics.auc(fpr12, tpr12)
fpr13, tpr13, threshold = metrics.roc_curve(y_test, s1_3_y_pred)
roc_auc13 = metrics.auc(fpr13, tpr13)
fpr14, tpr14, threshold = metrics.roc_curve(y_test, s1_4_y_pred)
roc_auc14 = metrics.auc(fpr14, tpr14)


# method I: plt
import matplotlib.pyplot as plt
plt.title('Set 1 Model AUC-ROC')
plt.plot(fpr11, tpr11, 'b', label = 'AUC = %0.2f logistic' % roc_auc11)
plt.plot(fpr12, tpr12, 'r', label = 'AUC = %0.2f svc' % roc_auc12)
plt.plot(fpr13, tpr13, 'y', label = 'AUC = %0.2f RF' % roc_auc13)
plt.plot(fpr14, tpr14, 'g', label = 'AUC = %0.2f XGB' % roc_auc14)

plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
import sklearn.metrics as metrics
# calculate the fpr and tpr from the stored Set 2 predictions (hard 0/1 labels) of each model


fpr21, tpr21, threshold = metrics.roc_curve(y_test, s2_1_y_pred)
roc_auc21 = metrics.auc(fpr21, tpr21)
fpr22, tpr22, threshold = metrics.roc_curve(y_test, s2_2_y_pred)
roc_auc22 = metrics.auc(fpr22, tpr22)
fpr23, tpr23, threshold = metrics.roc_curve(y_test, s2_3_y_pred)
roc_auc23 = metrics.auc(fpr23, tpr23)
fpr24, tpr24, threshold = metrics.roc_curve(y_test, s2_4_y_pred)
roc_auc24 = metrics.auc(fpr24, tpr24)


# method I: plt
import matplotlib.pyplot as plt
plt.title('Set 2 Model AUC-ROC')
plt.plot(fpr21, tpr21, 'b', label = 'AUC = %0.2f logistic' % roc_auc21)
plt.plot(fpr22, tpr22, 'r', label = 'AUC = %0.2f svc' % roc_auc22)
plt.plot(fpr23, tpr23, 'y', label = 'AUC = %0.2f RF' % roc_auc23)
plt.plot(fpr24, tpr24, 'g', label = 'AUC = %0.2f XGB' % roc_auc24)

plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
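The curves above are drawn from hard 0/1 predictions, so each one has a single operating point between (0, 0) and (1, 1). Passing a continuous score, such as the positive-class column of `predict_proba`, gives `roc_curve` a range of thresholds to sweep. A minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the churn data.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC from hard labels: one real threshold plus the trivial end points.
fpr_hard, tpr_hard, _ = roc_curve(y_te, model.predict(X_te))

# ROC from probabilities: many candidate thresholds are evaluated.
scores = model.predict_proba(X_te)[:, 1]
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

print(len(fpr_hard), len(fpr_soft))
print('AUC from probabilities: %.3f' % roc_auc_score(y_te, scores))
```

`SVC` only exposes `predict_proba` when built with `probability=True`; its `decision_function` output is another score that `roc_curve` accepts.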

In [None]:
summary1

In [None]:
summary2

#### We can see that the best model across Set 1 and Set 2 is Random Forest in Set 1.
#### All models score higher in Set 1 than in Set 2.
#### However, for a churn model, correctly identifying the customers who will leave matters more than the overall accuracy of the model.
#### Let's think carefully: 
#### A Type I error (false positive) means predicting that a customer will exit when he/she actually stays.
#### A Type II error (false negative) means predicting that a customer will not exit when he/she actually leaves.
#### Which error is more serious? It should be the Type II error.
#### With a Type I error, we may waste cost/resources trying to retain a customer who would have stayed anyway.
#### With a Type II error, we take no action and the customer leaves.
#### Therefore, Recall Score is much more important than Accuracy Score.
#### Before selecting which model to use, let's fine-tune our models!
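The two error types map directly onto the cells of a confusion matrix, and recall is the metric that penalises Type II errors. The sketch below uses made-up labels (1 = exit, 0 = not exit) to show the link.

```python
from sklearn.metrics import confusion_matrix, recall_score, precision_score

# Hypothetical labels: 1 = exit, 0 = not exit.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print('Type I errors (false positives):', fp)    # retention effort spent on stayers
print('Type II errors (false negatives):', fn)   # leavers who were missed
print('Recall: %.3f' % recall_score(y_true, y_pred))        # tp / (tp + fn)
print('Precision: %.3f' % precision_score(y_true, y_pred))  # tp / (tp + fp)
```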

## Hyperparameter Tuning - Set 1
I will tune the hyperparameters in the hope of obtaining better accuracy and recall:
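As a minimal sketch of the pattern used in the cells that follow, `GridSearchCV` tries each candidate value with cross-validation, ranks the candidates by the chosen `scoring`, and (with the default `refit=True`) refits the best one on all the data it was given. The data here comes from `make_classification` and is only a stand-in for the training split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for the training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    scoring='recall',   # rank candidates by cross-validated recall
    cv=5,
)
search.fit(X_demo, y_demo)

# With refit=True (the default) the best candidate is refitted automatically,
# so search.predict(...) uses the tuned model directly.
print(search.best_params_)
print('Best CV recall: %.3f' % search.best_score_)
```

Note that `search.score(X, y)` reports the metric named in `scoring`, not accuracy, which is worth keeping in mind when comparing the accuracy-tuned and recall-tuned runs.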

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
X=pd.concat([X0_c,X0_n1],axis=1)
y=EDA_Data['Exited'].apply(lambda x: 1 if x=='Exit' else 0)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from imblearn.over_sampling import SMOTE
smk = SMOTE()
# Oversample training data
X_train, y_train = smk.fit_resample(X_train, y_train)

# Oversample validation data
X_test, y_test = smk.fit_resample(X_test, y_test)

In [None]:
y_train.value_counts()  # class counts after oversampling

In [None]:
Accuracy_Score=[]
Recall_Score=[]
Precision_Score=[]
F1_Score=[]

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

clf =  LogisticRegression(random_state=42)
param_grid = [
    {'C':[0.01,0.1,1,10]}]
     #'max_iter':[100,150,200,1000]}]
    #{'solver': ['newton-cg','sag','lbfgs' ],'penalty':['l2']}] 
    #{'solver': ['liblinear','saga'],'penalty':['l1']}]
search = GridSearchCV(clf, param_grid,scoring='accuracy',cv=5)
lr=search.fit(X_train, y_train)

#lr=clf.fit(X_train, y_train)  # keep the tuned GridSearchCV model instead of refitting the untuned clf
y_pred=lr.predict(X_test)
lr.score(X_test, y_test) 

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

#search.cv_results_


In [None]:
from sklearn.linear_model import LogisticRegression

clf =  LogisticRegression(random_state=42)

param_grid = [
    {'C':[0.01,0.1,1,10]}]
  #{'solver': ['newton-cg','sag','lbfgs' ],'penalty':['l2']}]
  #{'solver': ['liblinear','saga'],'penalty':['l1']}]
search = GridSearchCV(clf, param_grid,scoring='recall', cv=5)
lr=search.fit(X_train, y_train)

#lr=clf.fit(X_train, y_train)
y_pred=lr.predict(X_test)
lr.score(X_test, y_test) 

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

#search.cv_results_

### SVC (Support Vector Classifier)

In [None]:
from sklearn.svm import SVC
clf = SVC(random_state=42, kernel='rbf')

param_grid = [
  {'C': [0.01,0.1,1,10]}]

search = GridSearchCV(clf, param_grid,scoring='accuracy', cv=5)
svc=search.fit(X_train, y_train)
y_pred=svc.predict(X_test)
svc.score(X_test, y_test) 

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_

In [None]:
from sklearn.svm import SVC
clf = SVC(random_state=42, kernel='rbf')

param_grid = [
  {'C': [0.01,0.1,1,10],
   'gamma':['scale', 'auto']}]

search = GridSearchCV(clf, param_grid,scoring='recall', cv=5)
svc=search.fit(X_train, y_train)
y_pred=svc.predict(X_test)
svc.score(X_test, y_test)

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
param_grid = [
  {'n_estimators' : [140,150,160,170,180,190,200]}]

search = GridSearchCV(clf, param_grid,scoring='accuracy', cv=5)
clf=search.fit(X_train, y_train)
y_pred=clf.predict(X_test)

clf.score(X_test, y_test)
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_  

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
param_grid = [
  {'n_estimators' : [140,150,160,170,180,190,200]}]

search = GridSearchCV(clf, param_grid, scoring='recall', cv=5)
clf=search.fit(X_train, y_train)
y_pred=clf.predict(X_test)

clf.score(X_test, y_test)
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_ 

### XGBoost

In [None]:
import xgboost as xgb
dtrain = xgb.DMatrix(data = X_train, label = y_train) 
dtest = xgb.DMatrix(data = X_test, label = y_test) 
# specify parameters via map
param = {'max_depth':4, 'eta':0.6, 'objective':'binary:hinge'}
num_round = 20
bst = xgb.train(param, dtrain, num_round)
# make prediction
y_pred = bst.predict(dtest)
accuracy_score(y_test,y_pred)

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

In [None]:
# Tuned Set 1 Score (stored as summary3):
ind_name=['Accuracy_Score','Recall_Score','Precision_Score','F1_Score']
summary3=pd.DataFrame(np.vstack((Accuracy_Score,Recall_Score,Precision_Score,F1_Score)),
                      columns=['Logistic Reg.(Accuracy)','Logistic Reg.(Recall)',
                               'SVC (Accuracy)','SVC (Recall)',
                               'Random Forest (Accuracy)','Random Forest (Recall)'
                               ,'XGB'],index=ind_name)
summary3

In [None]:
summary1

In [None]:
summary2

## Hyperparameter Tuning - Set 2
Likewise, I will tune the hyperparameters in the hope of obtaining better accuracy and recall:

In [None]:
X1_c2=pd.get_dummies(X1_c)
y=EDA_Data['Exited'].apply(lambda x: 1 if x=='Exit' else 0)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1_c2, y, test_size=0.2, random_state=42)

In [None]:
from imblearn.over_sampling import SMOTE
smk = SMOTE()
# Oversample training data
X_train, y_train = smk.fit_resample(X_train, y_train)

# Oversample validation data
X_test, y_test = smk.fit_resample(X_test, y_test)

In [None]:
y_train.value_counts()  # class counts after oversampling

In [None]:
Accuracy_Score=[]
Recall_Score=[]
Precision_Score=[]
F1_Score=[]

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=42)


param_grid = [
  {'C':[0.1,1,10],'max_iter':[1000,10000]}]
  #{'solver': ['newton-cg','sag','lbfgs'],'C':[0.1,1,10],'max_iter':[1000,10000]}] 
  #{'solver': ['newton-cg','sag','lbfgs' ],'penalty':['l2'],'C':[0.1,1,10],'max_iter':[1000,10000]}] #1
  #{'solver': ['liblinear','saga'],'penalty':['l1'],'max_iter':[1000,10000]}]
search = GridSearchCV(clf, param_grid, scoring='accuracy', cv=5)


lr=search.fit(X_train, y_train)
y_pred=lr.predict(X_test)
lr.score(X_test, y_test) 

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_  (Not much change)

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=42)

param_grid = [
    {'C':[0.1,1,10],'max_iter':[1000,10000]}]
  #{'solver': ['newton-cg','sag','lbfgs' ],'C':[0.1,1,10],'max_iter':[1000,10000]}] 
  #{'solver': ['newton-cg','sag','lbfgs' ],'penalty':['l2'],'C':[0.1,1,10],'max_iter':[1000,10000]}]
  #{'solver': ['liblinear','saga'],'penalty':['l1'],'C':[0.1,1,10],'max_iter':[1000,10000]}]
search = GridSearchCV(clf, param_grid, scoring='recall', cv=5)


lr=search.fit(X_train, y_train)
y_pred=lr.predict(X_test)
lr.score(X_test, y_test) 

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_  (Not much change)

### SVC (Support Vector Classifier)

In [None]:
from sklearn.svm import SVC

clf = SVC(random_state=42)#,C=1,kernel='rbf',gamma='scale')


param_grid = [
    {'C': [0.01,0.1,1,10]}]    
   #{'C': [0.01,0.1,1,10], 'kernel': ['rbf'],'gamma':['scale','auto']}]

search = GridSearchCV(clf, param_grid, scoring='accuracy', cv=5)
svc=search.fit(X_train, y_train)
y_pred=svc.predict(X_test)


#svc=clf.fit(X_train, y_train)
#y_pred=svc.predict(X_test)
#s2_2_y_pred=y_pred

svc.score(X_test, y_test)
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_ 

In [None]:
from sklearn.svm import SVC

clf = SVC(random_state=42)#,C=1,kernel='rbf',gamma='scale')


param_grid = [
  {'C': [0.01,0.1,1,10]}]    
  #{'C': [0.01,0.1,1,10], 'kernel': ['rbf'],'gamma':['scale','auto']}]

search = GridSearchCV(clf, param_grid, scoring='recall', cv=5)

svc=search.fit(X_train, y_train)
y_pred=svc.predict(X_test)


#svc=clf.fit(X_train, y_train)
#y_pred=svc.predict(X_test)
#s2_2_y_pred=y_pred

svc.score(X_test, y_test)
print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
#search.cv_results_ 

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)
param_grid = [
  #{'n_estimators' : [100,150,200],'max_depth':[5,10,15]}]
  #{'n_estimators' : [100,110,120,130,140,150],'max_depth':[5,6,7,8,9,10]}]
  #{'n_estimators' : [100,110,120,130,140,150],'max_depth':[9,10]}]
  #{'n_estimators' : [110,111,112,113,114,115,116,117,118,119],'max_depth':[5,10,15]}]
   {'n_estimators' : [117,118,119],'max_depth':[9,10,11],'criterion':['gini','entropy']}]
search = GridSearchCV(clf, param_grid, scoring='accuracy', cv=5)
clf=search.fit(X_train, y_train)
y_pred=clf.predict(X_test)

#clf.fit(X_train, y_train)
#y_pred=clf.predict(X_test)
clf.score(X_test, y_test)

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

#search.cv_results_

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42)
param_grid = [
  #{'n_estimators' : [100,150,200],'max_depth':[5,10,15]}]
  #{'n_estimators' : [100,110,120,130,140,150],'max_depth':[5,6,7,8,9,10]}]
  #{'n_estimators' : [100,110,120,130,140,150],'max_depth':[9,10]}]
  #{'n_estimators' : [105,106,107,108,109,110,111,112,113,114,115,116,117,118,119],'max_depth':[9,10,11]}]
   {'n_estimators' : [108,109],'max_depth':[9]}]
search = GridSearchCV(clf, param_grid, scoring='recall', cv=5)
clf=search.fit(X_train, y_train)
y_pred=clf.predict(X_test)

#clf.fit(X_train, y_train)
#y_pred=clf.predict(X_test)
clf.score(X_test, y_test)

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))

#search.cv_results_

### XGBoost

In [None]:
#scale_pos_weight=sum(negative instances) / sum(positive instances)
import xgboost as xgb
dtrain = xgb.DMatrix(data = X_train, label = y_train) 
dtest = xgb.DMatrix(data = X_test, label = y_test) 
# specify parameters via map
param = {'max_depth':5, 'eta':0.07,'objective':'binary:hinge'}
num_round = 500
bst = xgb.train(param, dtrain, num_round)
# make prediction
y_pred = bst.predict(dtest)
accuracy_score(y_test,y_pred)

print(' Accuracy Score: %.3f' % accuracy_score(y_test, y_pred) ,'\n', 
      'Recall Score: %.3f' % recall_score(y_test, y_pred) ,'\n', #True Positive out of Actual Positive
      'Precision Score: %.3f' % precision_score(y_test, y_pred) ,'\n', #True Positive out of Predicted Positive
      'F1 Score: %.3f' % f1_score(y_test, y_pred) ) #Close to 1 is better; Close to 0 is worse

Accuracy_Score.append(accuracy_score(y_test, y_pred))
Recall_Score.append(recall_score(y_test, y_pred))
Precision_Score.append(precision_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
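
The comment at the top of the XGBoost cell mentions `scale_pos_weight` as the ratio of negative to positive instances, but the cell does not set it. The sketch below shows one way that ratio could be computed from the training labels. The helper name `compute_scale_pos_weight` and the example labels are hypothetical; in the notebook the input would be `y_train`.

```python
def compute_scale_pos_weight(labels):
    """Return sum(negative instances) / sum(positive instances).

    Hypothetical helper: assumes labels are 0 (Not Exit) and 1 (Exit).
    """
    labels = list(labels)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("No positive instances, the ratio is undefined")
    return negatives / positives


# Made-up labels with roughly the 20%:80% Exit:Not Exit split noted in the EDA
example_labels = [1] * 20 + [0] * 80
weight = compute_scale_pos_weight(example_labels)
print('scale_pos_weight: %.2f' % weight)
```

The result could then be added to the `param` dictionary under the key `'scale_pos_weight'`. Whether it actually helps with the `binary:hinge` objective used here would need to be checked against the test scores.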

In [None]:
# Set 4 Score:
ind_name=['Accuracy_Score','Recall_Score','Precision_Score','F1_Score']
summary4=pd.DataFrame(np.vstack((Accuracy_Score,Recall_Score,Precision_Score,F1_Score)),
                      columns=['Logistic Reg.(Accuracy)','Logistic Reg.(Recall)',
                               'SVC (Accuracy)','SVC (Recall)',
                               'Random Forest (Accuracy)','Random Forest (Recall)'
                               ,'XGB'],index=ind_name)
summary4

## Model Selection

In [None]:
summary1

In [None]:
summary2

In [None]:
summary3

In [None]:
summary4

After hyperparameter tuning, we can see that:
1) For accuracy, every model's score improved in set 1, while in set 2 only Logistic Regression, SVC and XGB improved.

2) For recall, Logistic Regression, SVC and Random Forest improved in set 1, while only SVC improved in set 2.

So, which model should be used?
In my opinion, XGB with hyperparameter tuning in set 1 is the recommended model.
As mentioned before, our business question is to find the customers who will exit the bank.
Therefore, recall is more important than accuracy.

Recap:
TP (True Positive): We predict the customer will exit, and the customer really exits
TN (True Negative): We predict the customer will not exit, and the customer really does not exit
FP (False Positive): We predict the customer will exit, but the customer actually does not exit
FN (False Negative): We predict the customer will not exit, but the customer actually exits

Accuracy = (TP+TN)/(TP+TN+FP+FN)
Recall = TP/(TP+FN)
Precision = TP/(TP+FP)
F1 Score = 2TP/(2TP+FP+FN)
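
As a minimal sketch, the four metrics can be computed directly from the confusion-matrix counts. Accuracy is taken here as all correct predictions over all predictions, which is what `accuracy_score` from sklearn returns. The function name `classification_metrics` and the counts are hypothetical and are not results from this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        # Correct predictions (both classes) out of all predictions
        'accuracy': (tp + tn) / total if total else 0.0,
        # True positives out of actual positives
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
        # True positives out of predicted positives
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'f1': 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Made-up counts for a test set of 2,000 customers
metrics = classification_metrics(tp=150, tn=1500, fp=100, fn=250)
for name, value in metrics.items():
    print('%s: %.3f' % (name, value))
```

With these made-up counts the accuracy looks good because most customers do not exit, while the recall shows that most exiting customers are still missed.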

In our case, we want to predict exiting customers for retention.
So which one is more important?
A) Classifying all customers correctly, both exiting and not exiting (accuracy)? OR
B) Finding the exiting customers among everyone who actually exits, i.e. the ones we caught plus 'I guessed the customer would not leave, but the customer actually leaves' (recall)? OR
C) Finding the exiting customers among everyone we predicted to exit, i.e. the ones we caught plus 'I guessed the customer would leave, but the customer actually stays' (precision)?

The answer should be B, right? 
We should minimize the share of 'I guessed the customer would not leave, but the customer actually leaves' (false negatives), i.e. choose the model with the highest recall.
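
Picking the model with the highest recall can be sketched as below. The model names and recall values are placeholders and not the scores from this notebook. With one of the summary tables, something like `summary4.loc['Recall_Score'].idxmax()` should return the same kind of answer.

```python
# Placeholder recall scores per model (hypothetical, NOT the notebook's results)
recall_by_model = {
    'model_a': 0.20,
    'model_b': 0.38,
    'model_c': 0.45,
    'model_d': 0.50,
}

# The model with the highest recall misses the fewest exiting customers
best_model = max(recall_by_model, key=recall_by_model.get)

# Share of exiting customers that the chosen model still fails to flag
miss_rate = 1 - recall_by_model[best_model]

print('Best model by recall:', best_model)
print('Share of exiting customers missed: %.2f' % miss_rate)
```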

Therefore XGBoost in set 1 seems to be the best model to use.
We should use the version with hyperparameter tuning because its F1 score is higher than that of the untuned version.
The F1 score is high only when both precision and recall are high.
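
To illustrate why, F1 can be written as the harmonic mean of precision and recall, 2PR/(P+R), which is equivalent to 2TP/(2TP+FP+FN). The sketch below compares two made-up precision/recall pairs with the same arithmetic mean. The function name and the values are hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Both pairs average to 0.6, but only the balanced pair keeps F1 at that level
balanced = f1_from_precision_recall(0.6, 0.6)
lopsided = f1_from_precision_recall(0.2, 1.0)

print('Balanced (P=0.6, R=0.6): F1 = %.3f' % balanced)
print('Lopsided (P=0.2, R=1.0): F1 = %.3f' % lopsided)
```

A model that flags almost everyone as exiting would get a perfect recall but a poor precision, and the F1 score would drop accordingly.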
