# VODAFONE MINIPROJECT - CLASSIFICATION ON BANK CHURN DATASET

## Vaibhav Jaiswal
## PRN : 17070122071
## Final Year CSE , 2017-2021
## Symbiosis Institute of Technology , Pune

In [1]:
#import required Libraries

import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [2]:
#Read the Bank customer dataset
bank_df = pd.read_csv("Bank_Customer_Churn_dataset.csv")
bank_df

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [3]:
#see the types and counts of columns
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [4]:
#check how many null values are there if any
bank_df.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [5]:
#check how many unique values are present in each columns
bank_df.nunique()

RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64

Based on the unique-value counts and domain insight, RowNumber and CustomerId are unique to each customer, and Surname is a high-cardinality name field (2,932 distinct values). None of them carries information that generalizes to new customers, so they will not contribute to classification and are dropped.
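The uniqueness check can also be done mechanically. The sketch below is only an illustration on a small made-up frame (`toy_df` and its values are assumptions, not rows from the dataset): it flags columns in which every row has its own value. Such a ratio is only a hint; EstimatedSalary is almost fully unique as well, yet it is a legitimate feature, so domain insight is still needed.

```python
import pandas as pd

# Toy frame standing in for bank_df (values are illustrative, not from the dataset)
toy_df = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Exited": [1, 0, 1, 0],
})

# Fraction of distinct values per column; 1.0 means every row has its own value
unique_ratio = toy_df.nunique() / len(toy_df)

# Columns where every value is distinct behave like row identifiers
id_like_columns = unique_ratio[unique_ratio == 1.0].index.tolist()
print(id_like_columns)
```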

In [6]:
#Drop RowNumber CustomerID and Surname
bank_df = bank_df.drop(['RowNumber','CustomerId','Surname'],axis=1)

In [7]:
#see info for remaining columns
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB


Tenure and age are related: a customer's tenure cannot exceed their age, and older customers have had more time to accumulate tenure. So we create a new variable called TenureByAge to normalize tenure by the age of the person.

# DATA PREP FOR MODEL FITTING

In [8]:
bank_df['TenureByAge'] = bank_df.Tenure/(bank_df.Age)
bank_df['TenureByAge']

0       0.047619
1       0.024390
2       0.190476
3       0.025641
4       0.046512
          ...   
9995    0.128205
9996    0.285714
9997    0.194444
9998    0.071429
9999    0.142857
Name: TenureByAge, Length: 10000, dtype: float64

A person's spending behaviour varies with age, so their credit score is likely to vary with age as well: people generally spend more or take out loans when they are young, and as they get older they focus on savings and clearing debt.
So we create a variable called CreditScoreByAge, the credit score divided by age, to take this behaviour into account.

In [9]:
bank_df["CreditScoreByAge"] = bank_df.CreditScore/bank_df.Age
bank_df["CreditScoreByAge"]

0       14.738095
1       14.829268
2       11.952381
3       17.923077
4       19.767442
          ...    
9995    19.769231
9996    14.742857
9997    19.694444
9998    18.380952
9999    28.285714
Name: CreditScoreByAge, Length: 10000, dtype: float64

In [10]:
bank_df.head(10)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,TenureByAge,CreditScoreByAge
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1,0.047619,14.738095
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,0.02439,14.829268
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,0.190476,11.952381
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0.025641,17.923077
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0.046512,19.767442
5,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1,0.181818,14.659091
6,822,France,Male,50,7,0.0,2,1,1,10062.8,0,0.14,16.44
7,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1,0.137931,12.965517
8,501,France,Male,44,4,142051.07,2,0,1,74940.5,0,0.090909,11.386364
9,684,France,Male,27,2,134603.88,1,1,1,71725.73,0,0.074074,25.333333


## Let's min-max scale the numerical columns so they share the same range

In [11]:
numerical_columns = ['CreditScore',  'Age', 'Tenure', 'Balance','NumOfProducts', 'EstimatedSalary',
                   'TenureByAge','CreditScoreByAge']

for column in numerical_columns:
  minimum=bank_df[column].min()
  maximum=bank_df[column].max()
  bank_df[column] = (bank_df[column] - minimum)/(maximum-minimum)

bank_df.head(10)


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,TenureByAge,CreditScoreByAge
0,0.538,France,Female,0.324324,0.2,0.0,0.0,1,1,0.506735,1,0.085714,0.235083
1,0.516,Spain,Female,0.310811,0.1,0.334031,0.0,0,1,0.562709,0,0.043902,0.237252
2,0.304,France,Female,0.324324,0.8,0.636357,0.666667,1,0,0.569654,1,0.342857,0.168807
3,0.698,France,Female,0.283784,0.1,0.0,0.333333,0,0,0.46912,0,0.046154,0.310859
4,1.0,Spain,Female,0.337838,0.2,0.500246,0.0,1,1,0.3954,0,0.083721,0.354739
5,0.59,Spain,Male,0.351351,0.8,0.453394,0.333333,1,0,0.748797,1,0.327273,0.233203
6,0.944,France,Male,0.432432,0.7,0.0,0.333333,1,1,0.050261,0,0.252,0.275574
7,0.052,Germany,Female,0.148649,0.4,0.45854,1.0,1,0,0.596733,1,0.248276,0.192911
8,0.302,France,Male,0.351351,0.4,0.56617,0.333333,0,1,0.37468,0,0.163636,0.15534
9,0.668,France,Male,0.121622,0.2,0.536488,0.0,1,1,0.358605,0,0.133333,0.48716
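The cell above computes each minimum and maximum over the whole dataset, before the train/test split, so the test rows influence the scaling. A minimal sketch of the alternative, assuming scikit-learn's `MinMaxScaler` and a small made-up frame (`toy_df` is hypothetical): the scaler learns its range from the training rows only and then reuses it on the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for two numerical columns (illustrative values only)
toy_df = pd.DataFrame({
    "CreditScore": [350, 500, 619, 700, 850, 608, 502, 699],
    "Age": [18, 25, 42, 39, 92, 41, 42, 39],
})

train_df, test_df = train_test_split(toy_df, test_size=0.25, random_state=42)

scaler = MinMaxScaler()
# Learn min and max from the training rows only
train_scaled = scaler.fit_transform(train_df)
# Reuse the training min and max on the test rows; values may fall outside [0, 1]
test_scaled = scaler.transform(test_df)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

With this approach the test split stays untouched until evaluation, at the cost of test values occasionally landing slightly below 0 or above 1.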


# NOW LET'S ENCODE THE CATEGORICAL VARIABLES

Geography and Gender have no ordinal relationship between their categories, so it is best to one-hot encode them; that way the model doesn't interpret a spurious ordering among the categories.

In [12]:
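categorical_columns = ['Geography', 'Gender']

# One-hot encode each categorical column with pandas, then drop the original column.
# dtype=int keeps the indicators as 0/1 on newer pandas versions, which default to booleans.
for column in categorical_columns:
  dummies = pd.get_dummies(bank_df[column], prefix=column, dtype=int)
  bank_df = pd.concat([bank_df, dummies], axis=1).drop([column], axis=1)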

bank_df

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,TenureByAge,CreditScoreByAge,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,0.538,0.324324,0.2,0.000000,0.000000,1,1,0.506735,1,0.085714,0.235083,1,0,0,1,0
1,0.516,0.310811,0.1,0.334031,0.000000,0,1,0.562709,0,0.043902,0.237252,0,0,1,1,0
2,0.304,0.324324,0.8,0.636357,0.666667,1,0,0.569654,1,0.342857,0.168807,1,0,0,1,0
3,0.698,0.283784,0.1,0.000000,0.333333,0,0,0.469120,0,0.046154,0.310859,1,0,0,1,0
4,1.000,0.337838,0.2,0.500246,0.000000,1,1,0.395400,0,0.083721,0.354739,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.842,0.283784,0.5,0.000000,0.333333,1,0,0.481341,0,0.230769,0.354782,1,0,0,0,1
9996,0.332,0.229730,1.0,0.228657,0.000000,1,1,0.508490,0,0.514286,0.235196,1,0,0,0,1
9997,0.718,0.243243,0.7,0.000000,0.000000,0,1,0.210390,1,0.350000,0.353002,1,0,0,1,0
9998,0.844,0.324324,0.3,0.299226,0.333333,1,0,0.464429,1,0.128571,0.321752,0,1,0,0,1
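In the encoded table, Gender_Female and Gender_Male always sum to 1, so one of them is redundant; the same holds for the three Geography columns. As a small sketch on a made-up frame (`toy_df` is an assumption for illustration), `pd.get_dummies` can encode several columns in one call, and `drop_first=True` drops one indicator per original column, which may matter for linear models such as logistic regression.

```python
import pandas as pd

# Toy frame with the two categorical columns (illustrative rows only)
toy_df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# One call encodes both columns; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy_df, columns=["Geography", "Gender"], dtype=int)
print(encoded.columns.tolist())

# drop_first=True removes one redundant indicator per original column
encoded_reduced = pd.get_dummies(
    toy_df, columns=["Geography", "Gender"], dtype=int, drop_first=True
)
print(encoded_reduced.columns.tolist())
```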


In [13]:
bank_df.columns

Index(['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited', 'TenureByAge',
       'CreditScoreByAge', 'Geography_France', 'Geography_Germany',
       'Geography_Spain', 'Gender_Female', 'Gender_Male'],
      dtype='object')

# Split the dataset

In [14]:
from sklearn.model_selection import train_test_split

independent_column_names = ['CreditScore', 'Age', 'Tenure', 'Balance', 
                            'NumOfProducts', 'HasCrCard','IsActiveMember', 
                            'EstimatedSalary', 'TenureByAge','CreditScoreByAge', 
                            'Geography_France', 'Geography_Germany',
                            'Geography_Spain', 'Gender_Female', 'Gender_Male']
dependent_column_names = ['Exited']

X=bank_df[independent_column_names]
Y=bank_df[dependent_column_names]

x_train, x_test, y_train, y_test = train_test_split(X, Y,test_size=0.20,random_state=42)
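The target is imbalanced (roughly one in five customers in the test split has Exited = 1), so a plain random split can shift the class ratio between train and test. A minimal sketch, using synthetic labels rather than the bank data, of how the `stratify` argument of `train_test_split` keeps the ratio the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an 80/20 imbalance (assumed here for illustration)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# With stratify, both splits keep the 20% positive rate
print(y_tr.mean(), y_te.mean())
```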

## Let's create a variable to store the accuracy of each model we fit

In [15]:
accuracy_models = pd.DataFrame(columns=["Model" , "Accuracy Score"])

# Metric import

In [16]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report


# Logistic Regression

In [17]:
# fit the logistic regression model
from sklearn.linear_model import LogisticRegression 
logistic_model = LogisticRegression() 
logistic_model.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [18]:
# predict using logistic model

y_pred_logistic = logistic_model.predict(x_test) 
len(y_pred_logistic)

2000

In [19]:
#lets see the accuracy of the model

print(f"\nLogistic Classification Model \nACCURACY SCORE  :  {accuracy_score(y_test,y_pred_logistic)}")


Logistic Classification Model 
ACCURACY SCORE  :  0.815


In [20]:
# DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
accuracy_models.loc[len(accuracy_models)] = ["Logistic Regression",
                                             accuracy_score(y_test, y_pred_logistic)]

In [21]:
#LETS PRINT THE CONFUSION MATRIX OF THE MODEL

print(f"\nLogistic Classification Model \nConfusion Matrix :") 
print(confusion_matrix(y_test, y_pred_logistic))



Logistic Classification Model 
Confusion Matrix :
[[1548   59]
 [ 311   82]]


In [22]:
#lets print the classification report of the logistic model

print(classification_report(y_test, y_pred_logistic))


              precision    recall  f1-score   support

           0       0.83      0.96      0.89      1607
           1       0.58      0.21      0.31       393

    accuracy                           0.81      2000
   macro avg       0.71      0.59      0.60      2000
weighted avg       0.78      0.81      0.78      2000
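The figures in the report follow directly from the confusion matrix printed earlier. As a worked check in plain Python (the variable names are only for illustration), the class-1 precision, recall and F1 can be recomputed by hand:

```python
# Confusion matrix printed above for logistic regression:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
tn, fp = 1548, 59
fn, tp = 311, 82

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)   # of the customers predicted to exit, how many did
recall_1 = tp / (tp + fn)      # of the customers who exited, how many were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.3f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The recall of about 0.21 means the model misses roughly four out of five churners, which the headline accuracy of 0.815 hides.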



In [23]:
#print accuracy of all models tested till now
accuracy_models

Unnamed: 0,Model,Accuracy Score
0,Logistic Regression,0.815


# LET'S TRY OUT DECISION TREE CLASSIFIER

In [24]:
from sklearn.tree import DecisionTreeClassifier

decisiontree_model=DecisionTreeClassifier(criterion = 'entropy', splitter = 'best',
                                          random_state=42)
decisiontree_model.fit(x_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [25]:
# predict using Decision tree model

y_pred_decisiontree = decisiontree_model.predict(x_test) 
len(y_pred_decisiontree)

2000

In [26]:
#lets see the accuracy of the Decision Tree model

print(f"\nDecision Tree  Model \nACCURACY SCORE  :  {accuracy_score(y_test,y_pred_decisiontree)}")



Decision Tree  Model 
ACCURACY SCORE  :  0.796


In [27]:
# DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
accuracy_models.loc[len(accuracy_models)] = ["Decision Tree",
                                             accuracy_score(y_test, y_pred_decisiontree)]

In [28]:
#LETS PRINT THE CONFUSION MATRIX OF THE Decision Tree MODEL

print(f"\nDecision Tree Model \nConfusion Matrix :") 
print(confusion_matrix(y_test, y_pred_decisiontree))



Decision Tree Model 
Confusion Matrix :
[[1395  212]
 [ 196  197]]


In [29]:
#lets print the classification report of the Decision Tree  model
print("CLASSIFICATION REPORT - DECISION TREE MODEL\n")
print(classification_report(y_test, y_pred_decisiontree))


CLASSIFICATION REPORT - DECISION TREE MODEL

              precision    recall  f1-score   support

           0       0.88      0.87      0.87      1607
           1       0.48      0.50      0.49       393

    accuracy                           0.80      2000
   macro avg       0.68      0.68      0.68      2000
weighted avg       0.80      0.80      0.80      2000



In [30]:
#print accuracy of all models tested till now
accuracy_models

Unnamed: 0,Model,Accuracy Score
0,Logistic Regression,0.815
1,Decision Tree,0.796


# LET'S TRY OUT RANDOM FOREST CLASSIFIER MODEL

In [31]:
from sklearn.ensemble import RandomForestClassifier

randomforest_model = RandomForestClassifier(n_estimators=100 , criterion = 'entropy',
                                      random_state = 42)

randomforest_model.fit(x_train,y_train)

  


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
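A fitted forest also exposes `feature_importances_`, which gives a rough view of which inputs drive its splits. The sketch below runs on synthetic data from `make_classification`, and the feature names are hypothetical; on the notebook's data one would presumably pass `x_train.columns` as the index instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for x_train / y_train (not the bank data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical names

forest = RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```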

In [32]:
# predict using Random Forest model

y_pred_rendomforest = randomforest_model.predict(x_test) 
len(y_pred_rendomforest)

2000

In [33]:
#lets see the accuracy of the Random Forest model

print(f"\nRandom Forest  Model \nACCURACY SCORE  :  {accuracy_score(y_test,y_pred_rendomforest)}")


Random Forest  Model 
ACCURACY SCORE  :  0.861


In [34]:
# DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
accuracy_models.loc[len(accuracy_models)] = ["Random Forest",
                                             accuracy_score(y_test, y_pred_rendomforest)]

In [35]:
#LETS PRINT THE CONFUSION MATRIX OF THE Random Forest MODEL

print(f"\nRandom Forest Model \nConfusion Matrix :") 
print(confusion_matrix(y_test, y_pred_rendomforest))



Random Forest Model 
Confusion Matrix :
[[1546   61]
 [ 217  176]]


In [36]:
#lets print the classification report of the Random Forest  model

print("CLASSIFICATION REPORT - RANDOM FOREST  MODEL\n")
print(classification_report(y_test, y_pred_rendomforest))


CLASSIFICATION REPORT - RANDOM FOREST  MODEL

              precision    recall  f1-score   support

           0       0.88      0.96      0.92      1607
           1       0.74      0.45      0.56       393

    accuracy                           0.86      2000
   macro avg       0.81      0.70      0.74      2000
weighted avg       0.85      0.86      0.85      2000



In [37]:
#print accuracy of all models tested till now
accuracy_models

Unnamed: 0,Model,Accuracy Score
0,Logistic Regression,0.815
1,Decision Tree,0.796
2,Random Forest,0.861


## K Neighbours Classifier MODEL

In [38]:
from sklearn.neighbors import KNeighborsClassifier

kneighbour_model = KNeighborsClassifier(n_neighbors=1)

# Train the model using the training sets
kneighbour_model.fit(x_train,y_train)

#Predict Output
y_pred_kneighbour= kneighbour_model.predict(x_test) 
print(len(y_pred_kneighbour))

2000
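The model above uses `n_neighbors=1`, which tends to follow noise in the training data. One common way to pick the value is cross-validation. This is only a sketch on synthetic data (the candidate values 1, 5 and 15 are arbitrary assumptions), not a result for the bank dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for x_train / y_train (not the bank data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

cv_accuracy = {}
for k in [1, 5, 15]:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[k] = scores.mean()

# Candidate with the highest mean cross-validated accuracy
best_k = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy, best_k)
```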


  


In [39]:
#lets see the accuracy of the K Neighbour Classifier  model
from sklearn.metrics import accuracy_score

print(f"\nK Neighbour Classifier \nACCURACY SCORE  :  {accuracy_score(y_test,y_pred_kneighbour)}")


K Neighbour Classifier 
ACCURACY SCORE  :  0.7965


In [40]:
# DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
accuracy_models.loc[len(accuracy_models)] = ["K Neighbour Classifier",
                                             accuracy_score(y_test, y_pred_kneighbour)]

In [41]:
#LETS PRINT THE CONFUSION MATRIX OF THE K Neighbour Classifier Model

print(f"\nK Neighbour Classifier Model \nConfusion Matrix :") 
print(confusion_matrix(y_test, y_pred_kneighbour))



K Neighbour Classifier Model 
Confusion Matrix :
[[1416  191]
 [ 216  177]]


In [42]:
#lets print the classification report of the K Neighbour Classifier Model

print("CLASSIFICATION REPORT -  K Neighbour Classifier Model\n")
print(classification_report(y_test, y_pred_kneighbour))


CLASSIFICATION REPORT -  K Neighbour Classifier Model

              precision    recall  f1-score   support

           0       0.87      0.88      0.87      1607
           1       0.48      0.45      0.47       393

    accuracy                           0.80      2000
   macro avg       0.67      0.67      0.67      2000
weighted avg       0.79      0.80      0.79      2000



In [43]:
#print accuracy of all models tested till now
accuracy_models

Unnamed: 0,Model,Accuracy Score
0,Logistic Regression,0.815
1,Decision Tree,0.796
2,Random Forest,0.861
3,K Neighbour Classifier,0.7965


# LET'S TRY XGBOOST MODEL

In [44]:
!pip install xgboost



In [45]:
# LETS FIT XGBOOST CLASSIFIER MODEL 

from xgboost import XGBClassifier

xgboost_model = XGBClassifier()
xgboost_model.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [46]:
# predict using XGBOOST model

y_pred_xgboost = xgboost_model.predict(x_test) 
len(y_pred_xgboost)

2000

In [47]:
#lets see the accuracy of the XGBOOST model

print(f"\nXGBOOST Model \nACCURACY SCORE  :  {accuracy_score(y_test,y_pred_xgboost)}")


XGBOOST Model 
ACCURACY SCORE  :  0.869


In [48]:
# DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
accuracy_models.loc[len(accuracy_models)] = ["XGBOOST",
                                             accuracy_score(y_test, y_pred_xgboost)]

In [49]:
#LETS PRINT THE CONFUSION MATRIX OF THE XGBOOST MODEL

print(f"\nXGBOOST Model \nConfusion Matrix :") 
print(confusion_matrix(y_test, y_pred_xgboost))



XGBOOST Model 
Confusion Matrix :
[[1548   59]
 [ 203  190]]


In [50]:
#lets print the classification report of the XGBOOST  model

print("CLASSIFICATION REPORT - XGBOOST  MODEL\n")
print(classification_report(y_test, y_pred_xgboost))


CLASSIFICATION REPORT - XGBOOST  MODEL

              precision    recall  f1-score   support

           0       0.88      0.96      0.92      1607
           1       0.76      0.48      0.59       393

    accuracy                           0.87      2000
   macro avg       0.82      0.72      0.76      2000
weighted avg       0.86      0.87      0.86      2000



In [51]:
#print accuracy of all models tested till now
accuracy_models

Unnamed: 0,Model,Accuracy Score
0,Logistic Regression,0.815
1,Decision Tree,0.796
2,Random Forest,0.861
3,K Neighbour Classifier,0.7965
4,XGBOOST,0.869


# Let's try to improve XGBOOST with GridSearch and tune its hyperparameters

In [52]:
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import  make_scorer

estimator = XGBClassifier(
    objective= 'binary:logistic',
    nthread=4,
    seed=42
)
parameters = {
    'max_depth': range (2, 10, 1),
    'n_estimators': range(50, 301, 50),
    'learning_rate': [0.01,0.1, 0.2, 0.3]
}

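# random_state only has an effect when shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is left out here
kfold = KFold(n_splits=10)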

grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring = {'AUC':'roc_auc', 'Accuracy':make_scorer(accuracy_score)},
    n_jobs = 10,
    cv = kfold,
    verbose=True,
    refit='AUC'
)
grid_search.fit(X, Y)

Fitting 10 folds for each of 192 candidates, totalling 1920 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:   22.9s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:  3.2min
[Parallel(n_jobs=10)]: Done 430 tasks      | elapsed: 12.8min
[Parallel(n_jobs=10)]: Done 780 tasks      | elapsed: 22.7min
[Parallel(n_jobs=10)]: Done 1230 tasks      | elapsed: 36.1min
[Parallel(n_jobs=10)]: Done 1780 tasks      | elapsed: 53.1min
[Parallel(n_jobs=10)]: Done 1920 out of 1920 | elapsed: 59.8min finished
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=False),
             error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=4, objective='binary:logistic',
                                     rand...=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=42, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=10,
             param_grid={'learning_rate': [0.01, 0.1, 0.2, 0.3],
                         'max_depth': range(2, 10),
                       

In [53]:
grid_search.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.2, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=50, n_jobs=1,
              nthread=4, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
              silent=None, subsample=1, verbosity=1)
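The search above is fit on the full X and Y, which include the rows later used as x_test, so the tuned model's test score is not a fully independent estimate. A minimal sketch of searching on the training split only and reading the winner from `best_params_`; it swaps in a `DecisionTreeClassifier` and synthetic data purely to stay small and fast, and the grid values are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
# Only the training split is searched, so the test split stays unseen
search.fit(x_tr, y_tr)
print(search.best_params_, round(search.best_score_, 3))

# best_estimator_ is already refit on x_tr, so it can be scored directly
test_accuracy = search.best_estimator_.score(x_te, y_te)
print(test_accuracy)
```

Using `search.best_estimator_` directly also avoids copying the printed parameters by hand into a new constructor call.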

In [54]:
# LETS FIT XGBOOST CLASSIFIER MODEL WITH THE BEST PARAMETERS FROM GRID SEARCH

from xgboost import XGBClassifier

xgboost_model_2 = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.2, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=50, n_jobs=1,
              nthread=4, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
              silent=None, subsample=1, verbosity=1)

xgboost_model_2.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.2, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=50, n_jobs=1,
              nthread=4, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
              silent=None, subsample=1, verbosity=1)

In [55]:
# predict using XGBOOST model

y_pred_xgboost_2 = xgboost_model_2.predict(x_test) 
len(y_pred_xgboost_2)

2000

In [56]:
#lets see the accuracy of the XGBOOST Best Params (GridSearch) model

print(f"\nXGBOOST Best Params (GridSearch) \nACCURACY SCORE  :  {accuracy_score(y_test,y_pred_xgboost_2)}")


XGBOOST Best Params (GridSearch) 
ACCURACY SCORE  :  0.8685


In [57]:
# DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
accuracy_models.loc[len(accuracy_models)] = ["XGBOOST Best Params (gridsearch)",
                                             accuracy_score(y_test, y_pred_xgboost_2)]

In [58]:
#LETS PRINT THE CONFUSION MATRIX OF THE XGBOOST Best Params (GridSearch) Model

print(f"\nXGBOOST Best Params (GridSearch) Model \nConfusion Matrix :") 
print(confusion_matrix(y_test, y_pred_xgboost_2))



XGBOOST Best Params (GridSearch) Model 
Confusion Matrix :
[[1549   58]
 [ 205  188]]


In [59]:
#lets print the classification report of the XGBOOST Best Params (GridSearch) Model

print("CLASSIFICATION REPORT - XGBOOST Best Params (GridSearch) Model\n")
print(classification_report(y_test, y_pred_xgboost_2))


CLASSIFICATION REPORT - XGBOOST Best Params (GridSearch) Model

              precision    recall  f1-score   support

           0       0.88      0.96      0.92      1607
           1       0.76      0.48      0.59       393

    accuracy                           0.87      2000
   macro avg       0.82      0.72      0.76      2000
weighted avg       0.86      0.87      0.86      2000



In [60]:
accuracy_models

Unnamed: 0,Model,Accuracy Score
0,Logistic Regression,0.815
1,Decision Tree,0.796
2,Random Forest,0.861
3,K Neighbour Classifier,0.7965
4,XGBOOST,0.869
5,XGBOOST Best Params (gridsearch),0.8685


# ACCURACY SCORES OF ALL MODELS TESTED

In [61]:
accuracy_models.sort_values(by=["Accuracy Score"], ascending=False)

Unnamed: 0,Model,Accuracy Score
4,XGBOOST,0.869
5,XGBOOST Best Params (gridsearch),0.8685
2,Random Forest,0.861
0,Logistic Regression,0.815
3,K Neighbour Classifier,0.7965
1,Decision Tree,0.796
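Because the classes are imbalanced, these accuracies are easier to judge against a majority-class baseline, that is, a model that always predicts "stayed". A short worked check in plain Python, using the test-set counts from the classification reports and the scores from the table above:

```python
# Test-set class counts taken from the support column of the classification reports
stayed, exited = 1607, 393

# Accuracy of always predicting the majority class (0 = stayed)
baseline_accuracy = stayed / (stayed + exited)
print(f"baseline: {baseline_accuracy:.4f}")

# Accuracy scores from the table above
scores = {
    "XGBOOST": 0.869,
    "XGBOOST Best Params (gridsearch)": 0.8685,
    "Random Forest": 0.861,
    "Logistic Regression": 0.815,
    "K Neighbour Classifier": 0.7965,
    "Decision Tree": 0.796,
}

# Positive difference means the model beats the baseline
for name, score in scores.items():
    print(f"{name}: {score - baseline_accuracy:+.4f} vs baseline")
```

The baseline comes out at 0.8035, so the K Neighbour and Decision Tree models score below it on accuracy, and Logistic Regression is only slightly above; the per-class recall in the reports is the more informative comparison.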


# _______________________________ END OF NOTEBOOK _______________________________