# Customer Credit Card Prediction

This notebook has four main parts:

1. [**EDA**](#Exploratory-Data-Analysis---EDA)
2. [**Feature Engineering**](#Feature-Engineering)
    - [Feature Encoding](#Feature/-Data-Encoding)
    - [Scaling](#Feature-Scaling)
    - [Decomposition (PCA)](#Decomposition-(PCA))
3. [**Modelling & Hyperparameter Tuning**](#Modelling-&-Hyperparameter-Tuning)
4. [**Evaluation Metrics**](#Evaluation-Metrics)
    - [Precision-Recall Curve](#Precision-Recall-Curve)
    - [Receiver Operating Characteristic curve (ROC)](#Receiver-Operating-Characteristic-curve-(ROC))
    - [AUC](#AUC)

[**Summary**](#Summary)


P.S. I am new to the data science field, so please feel free to suggest improvements or to comment if you find any mistakes in this notebook. I finally get to apply the knowledge I gained from my Data Science certificate to this dataset. Now, let's dig in!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# import necessary packages

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
plt.style.use('default')


import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Exploratory Data Analysis - EDA

We will explore the data first to get an overall understanding of it.

In [None]:
# read the CSV using pandas
df_cred = pd.read_csv("/kaggle/input/credit-card-customers/BankChurners.csv")

# let's check the shape and the last 5 rows of the data
print(df_cred.shape)
df_cred.tail()

In [None]:
# Drop columns that are unnecessary
# CLIENTNUM and the last two Naive Bayes columns are not relevant, as explained in the data description

df_cred.drop(['CLIENTNUM', 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1', 
              'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], 
             axis=1, inplace=True)

print(df_cred.shape)

In [None]:
# It is recommended to inspect the table in a separate cell after dropping columns, because the drop uses inplace=True
# This avoids errors when the drop cell is run more than once

# List of columns
print(df_cred.columns)
print("")

# Let's see if there are any missing values
print(df_cred.info())

# Let's see the statistics for numerical values
display(df_cred.describe())

### Visualise the frequency of each categorical features

In [None]:
# Plot the frequency of each unique value for all categorical variables in the data

def pltCountplot(category):
    
    fig, axis = plt.subplots(3, 2, figsize=(20,17))  # graph 3 by 2

    index = 0
    for i in range(3):
        for j in range(2):
            
            # the column must be passed as x= (recent seaborn versions reject it as a positional argument)
            ax = sns.countplot(x=category[index], data=df_cred, ax=axis[i][j])
            
            # the x-labels of the education and income categories are long, so rotate them slightly for readability
            if category[index] in ['Education_Level', 'Income_Category']:
                for item in ax.get_xticklabels():
                    item.set_rotation(15)
                
            for p in ax.patches:
                height = p.get_height()
                ax.text(p.get_x()+p.get_width()/2.,
                        height + 3,
                        '{:1.2f}%'.format(height/len(df_cred)*100),
                        ha="center") 
            index += 1
            
            
category = ['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
pltCountplot(category)

#### Observations

1. The dataset has no missing values, and we are left with 20 columns to work with before applying the feature engineering explored later.
2. The dataset is imbalanced: attrited customers make up roughly 16% of the records against roughly 84% for existing customers, a gap of about 68 percentage points.
3. The gender of the customers is quite balanced.
4. Most of the customers hold a Blue card and have an income of less than $40K.
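
The imbalance in observation 2 can be checked numerically rather than read off the plot. The sketch below rebuilds the target column from the class counts quoted later in this notebook (8,500 existing and 1,627 attrited customers); the `attrition` Series is a stand-in for `df_cred.Attrition_Flag`, not the real data.

```python
import pandas as pd

# Stand-in for df_cred.Attrition_Flag, rebuilt from the class counts
# reported in this notebook (8,500 existing, 1,627 attrited)
attrition = pd.Series(
    ["Existing Customer"] * 8500 + ["Attrited Customer"] * 1627,
    name="Attrition_Flag",
)

counts = attrition.value_counts()
shares = attrition.value_counts(normalize=True).mul(100).round(2)

summary = pd.DataFrame({"count": counts, "share_%": shares})
print(summary)

# Imbalance ratio: existing customers per attrited customer
imbalance_ratio = counts["Existing Customer"] / counts["Attrited Customer"]
print(f"Existing : Attrited = {imbalance_ratio:.2f} : 1")
```

On the real notebook data the same two `value_counts` calls can be applied directly to `df_cred.Attrition_Flag`.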

### Visualise how attrited and existing customers are distributed across the categorical features

In [None]:
def pltCountplot_attrited(category, attrition_flag):
    
    fig, axis = plt.subplots(3, 2, figsize=(22,18))  # graph 3 by 2
    
    index = 0
    for i in range(3):
        for j in range(2):
            
            ax = sns.countplot(x=category[index], data=df_cred, hue=attrition_flag, ax=axis[i][j])
            
            if category[index] in ['Education_Level', 'Income_Category']:
                for item in ax.get_xticklabels():
                    item.set_rotation(15)
                
            for p in ax.patches:
                height = p.get_height()
                ax.text(p.get_x()+p.get_width()/2.,
                        height + 3,
                        '{:1.2f}%'.format(height/len(df_cred)*100),
                        ha="center") 
            index += 1
            
            
pltCountplot_attrited(category, 'Attrition_Flag')

#### Observations

1. Attrited customers are clearly far fewer than existing customers across all the categorical features.
2. For the education level, the ratio of customers who attrited is roughly the same from high school up to college level, and slowly decreases from postgraduate to doctorate level.
3. The ratio of attrited customers in the '60K-80K' income category is 6.41, which is higher than in the other income ranges (excluding Unknown).

These observations only hold within each respective category.
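
The percentages drawn on the bars above are relative to the whole dataset. To compare attrition within each category, a row-normalised cross-tabulation is one option. The sketch below uses a small hypothetical `toy` frame in place of `df_cred`; the values are invented for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for df_cred
toy = pd.DataFrame({
    "Income_Category": ["Less than $40K"] * 6 + ["$60K - $80K"] * 4,
    "Attrition_Flag": ["Existing Customer"] * 5 + ["Attrited Customer"]
                      + ["Existing Customer"] * 3 + ["Attrited Customer"],
})

# normalize="index" makes every row sum to 1, i.e. shares within each category
rates = pd.crosstab(toy["Income_Category"], toy["Attrition_Flag"], normalize="index")
attrition_rate = rates["Attrited Customer"].mul(100).round(1)
print(attrition_rate)
```

With the real data, `pd.crosstab(df_cred.Income_Category, df_cred.Attrition_Flag, normalize="index")` would give the attrition share per income band.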

### Now, let's see the correlation between the numerical features

In [None]:
numerical = ['Customer_Age','Dependent_count', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
             'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
             'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']


corr_data = df_cred.loc[:, numerical].corr()

plt.figure(figsize=(15,10))
sns.heatmap(corr_data.abs(), annot=True, fmt='.3f',cmap='viridis',square=True)
plt.show()

#### Observations

Four pairs of numerical features (eight features in total) have a correlation above 0.5:

1. Average Open To Buy (Credit Line) and Credit Limit (on the card) = 0.996, very high positive correlation.
2. Total Transaction Amount and Total Transaction Count = 0.807, high positive correlation.
3. Months on book and Customer age = 0.789, high positive correlation.
4. Total revolving balance and Average Utilization Ratio = 0.624, moderate positive correlation.
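
Instead of reading the pairs off the heatmap, they can also be extracted from the correlation matrix. This is a minimal sketch on synthetic data (the `toy` frame and its columns `a`, `b`, `c` are hypothetical); on the notebook data, `corr_data.abs()` would take the place of `corr`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=n),  # strongly correlated with "a"
    "c": rng.normal(size=n),                 # independent noise
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)

# Pairs whose absolute correlation exceeds the 0.5 cut-off used above
strong_pairs = pairs[pairs > 0.5]
print(strong_pairs)
```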

# Feature Engineering

## Feature/ Data Encoding

I have divided the 6 categorical features into 2 sub-categories: Nominal and Ordinal. Nominal features are encoded with a map function when they are binary and with dummy variables when they are not. Ordinal features are mapped to integers that follow the order of their levels.

* Nominal: Attrition_Flag, Gender, and Marital_Status
* Ordinal: Education_Level, Income_Category, and Card_Category
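
The three encodings can be seen side by side on a tiny example. This is only a sketch: the `toy` frame holds three made-up rows, and the mappings mirror the ones used in the cells that follow.

```python
import pandas as pd

# Three hypothetical customers
toy = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Marital_Status": ["Married", "Single", "Divorced"],
    "Card_Category": ["Blue", "Gold", "Silver"],
})

encoded = pd.DataFrame()

# Binary nominal: a simple 0/1 map
encoded["Gender"] = toy["Gender"].map({"M": 1, "F": 0})

# Ordinal: integers that follow the natural order of the levels
encoded["Card_Category"] = toy["Card_Category"].map(
    {"Blue": 1, "Silver": 2, "Gold": 3, "Platinum": 4}
)

# Non-binary nominal: one indicator column per level;
# drop_first=True removes the first level, which the others already imply
dummies = pd.get_dummies(toy["Marital_Status"], prefix="marital", drop_first=True)
encoded = pd.concat([encoded, dummies], axis=1)
print(encoded)
```

Dropping the first dummy level means "Divorced" is represented by zeros in both remaining marital columns.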

#### Binary categorical (Nominal)

In [None]:
# create new df
df_cred_updated = pd.DataFrame()

# Target variable
# Attrited customers are mapped to 1, existing customers to 0
df_cred_updated["Attrit"] = df_cred.Attrition_Flag.map({"Attrited Customer":1, "Existing Customer":0})

# Gender
df_cred_updated["Gender"] = df_cred.Gender.map({"M":1, "F":0})

#### Dummy variable (Nominal)

In [None]:
dum_marital = pd.get_dummies(df_cred.Marital_Status, prefix='marital', drop_first=True)
dum_marital.head(2)

#### Ordinal Categorical

In [None]:
# ordinal -- these variables have a natural ordering
# note: 'Unknown' has no natural rank; it is simply placed after the highest level here

df_cred_updated['Education_Level'] = df_cred.Education_Level.map({'Uneducated':1, 'High School':2, 'College':3, 
                                                         'Graduate':4, 'Post-Graduate':5, 'Doctorate':6, 'Unknown':7})
df_cred_updated['Income_Category'] = df_cred.Income_Category.map({'Less than $40K':1, '$40K - $60K':2, '$60K - $80K':3, 
                                                          '$80K - $120K':4, '$120K +':5, 'Unknown':6})
df_cred_updated['Card_Category'] = df_cred.Card_Category.map({'Blue':1, 'Silver':2, 'Gold':3, 'Platinum':4})

df_cred_updated = pd.concat([df_cred_updated, dum_marital, df_cred[numerical]], axis=1)

In [None]:
# Let's see the updated credit data in more detail
print("Shape of updated credit dataset:", df_cred_updated.shape)
print("\nList of updated columns: \n\n", df_cred_updated.columns)
display(df_cred_updated.head(5))

## Feature Scaling

Three scalers will be used to scale the data:
   * Standard Scaler
   * MinMax Scaler
   * Power Transformer Scaler
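
To see how the three scalers differ, it helps to apply them to a single skewed column. The sketch below uses a synthetic log-normal feature (a hypothetical stand-in for a skewed column such as a credit limit) and a small helper, `skewness`, which is defined here and is not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

rng = np.random.default_rng(42)
# Hypothetical right-skewed feature
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

x_std = StandardScaler().fit_transform(x)   # zero mean, unit variance
x_mm = MinMaxScaler().fit_transform(x)      # rescaled to the range [0, 1]
x_pt = PowerTransformer().fit_transform(x)  # Yeo-Johnson transform, then standardised

def skewness(values):
    """Sample skewness: third central moment divided by the cubed std."""
    v = values.ravel()
    return float(np.mean((v - v.mean()) ** 3) / v.std() ** 3)

for name, arr in [("raw", x), ("standard", x_std), ("minmax", x_mm), ("power", x_pt)]:
    print(f"{name:>8}: mean={arr.mean():6.2f} min={arr.min():6.2f} "
          f"max={arr.max():6.2f} skew={skewness(arr):5.2f}")
```

StandardScaler and MinMaxScaler only shift and rescale, so the shape of the distribution (and its skew) is unchanged; PowerTransformer also changes the shape to make it closer to a normal distribution.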

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

from sklearn.pipeline import make_pipeline


X = df_cred_updated.drop('Attrit', axis=1)
y = df_cred_updated['Attrit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


# Feature scaling -- Standard Scaler
scaler_S = StandardScaler().fit(X_train)
X_train_s = scaler_S.transform(X_train)
X_test_s = scaler_S.transform(X_test)


# Feature scaling -- MinMax Scaler
scaler_MM = MinMaxScaler().fit(X_train)
X_train_mm = scaler_MM.transform(X_train)
X_test_mm = scaler_MM.transform(X_test)


# Feature scaling -- Power Transformer
scaler_PT = PowerTransformer().fit(X_train)
X_train_pt = scaler_PT.transform(X_train)
X_test_pt = scaler_PT.transform(X_test)

## Decomposition (PCA)

Choose the principal components that preserve as much of the variance in the dataset as possible.

In [None]:
from sklearn.decomposition import PCA

# PCA -- Standard Scaler, MinMax, PT
pca_s = PCA().fit(X_train_s)
pca_mm = PCA().fit(X_train_mm)
pca_pt = PCA().fit(X_train_pt)


# Plot the graph to see the number of components needed to explain the most data variance
fig, axs = plt.subplots(1, 3, figsize=(20,5))
fig.text(0.5, -0.04, 'Number of Components', ha='center', size="x-large")
fig.text(0.08, 0.5, 'Cumulative Explained Variance', va='center', rotation='vertical', size="x-large")

axs[0].plot(np.cumsum(pca_s.explained_variance_ratio_), color="red")
axs[0].set_title("PCA_StandardScaler")
axs[1].plot(np.cumsum(pca_mm.explained_variance_ratio_), color="green")
axs[1].set_title("PCA_MinMax")
axs[2].plot(np.cumsum(pca_pt.explained_variance_ratio_), color="blue")
axs[2].set_title("PCA_PowerTransformer");


# Run PCA first to see how many components carry most of the variance -- we do not want the model to be overly complicated
# Adding irrelevant features to the model would just make it worse (Garbage In, Garbage Out)

#### Observations

This curve quantifies how much of the total variance of the 21-dimensional feature space is contained within the first *N* components.

1. PCA_StandardScaler - With the updated dataset, the first 14 components contain approximately 90% of the variance, while around 17 components are needed to describe close to 100% of it.
    
2. PCA_MinMax - After scaling with MinMax, the first 10 components contain approximately 90% of the variance, while around 16 components are needed to describe close to 100% of it.
    
3. PCA_PowerTransformer - Lastly, after scaling with PowerTransformer, the first 13 components contain approximately 90% of the variance, while around 16 components are needed to describe close to 100% of it.


Hence, by looking at this plot for a high-dimensional dataset, we can choose how many components are needed to train our model. This reduces the redundancy present in the dataset and also helps to improve execution time.
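
The number of components can also be computed from the cumulative curve instead of being read off the plot. The sketch below runs on synthetic data (`X_toy` is a hypothetical stand-in for a scaled training matrix) and compares the manual count with scikit-learn's shortcut of passing a float to `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 features driven by 3 latent factors plus a little noise
latent = rng.normal(size=(300, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))
X_toy = StandardScaler().fit_transform(X_toy)

cum_var = np.cumsum(PCA().fit(X_toy).explained_variance_ratio_)

# First index where the cumulative ratio reaches the target; +1 turns it into a count
n_90 = int(np.argmax(cum_var >= 0.90)) + 1
print("Components needed for 90% of the variance:", n_90)

# Shortcut: a float between 0 and 1 asks PCA to choose the count itself
pca_90 = PCA(n_components=0.90).fit(X_toy)
print("Chosen by PCA(n_components=0.90):", pca_90.n_components_)
```

Applied to this notebook, the same two lines on `pca_s.explained_variance_ratio_` would give the component count for the StandardScaler variant.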

In [None]:
# PCA -- SS
pca1_s = PCA(n_components=14)
X_train_s_pca = pca1_s.fit_transform(X_train_s)
X_test_s_pca = pca1_s.transform(X_test_s)

# PCA -- MM
pca1_mm = PCA(n_components=10)
X_train_mm_pca = pca1_mm.fit_transform(X_train_mm)
X_test_mm_pca = pca1_mm.transform(X_test_mm)

# PCA -- PT
pca1_pt = PCA(n_components=13)
X_train_pt_pca = pca1_pt.fit_transform(X_train_pt)
X_test_pt_pca = pca1_pt.transform(X_test_pt)


# Shape
print("Original shape: ", X_train.shape)
print("\nAfter PCA & SS: ", X_train_s_pca.shape)
print("After PCA & MM: ", X_train_mm_pca.shape)
print("After PCA & PT: ", X_train_pt_pca.shape)

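#### Below is a visualization of the first 14 principal components after scaling with Standard Scaler

Each row of the heatmap is one principal component and each column is an original feature; the colour shows how strongly that feature loads on the component. The components are ordered by explained variance, so the first component captures the most variance and the last one the least.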

In [None]:
plt.figure(figsize=(15,10))
plt.matshow(pca1_s.components_, cmap='viridis', fignum=1)
plt.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 
           ["1st component", "2nd component", "3rd component", "4th component", "5th component", 
            "6th component", "7th component", "8th component", "9th component", "10th component", 
            "11th component", "12th component", "13th component", "14th component"])
plt.colorbar()
plt.xticks(range(21), df_cred_updated.columns[1:], rotation=60, ha='left')  # feature names, skipping the target column
plt.xlabel("Feature")
plt.ylabel("Principal components")
plt.show()

# Modelling & Hyperparameter Tuning

### Preamble -- imbalanced dataset

We have an imbalanced dataset (Existing Customer: 8,500; Attrited Customer: 1,627). We will handle it by combining KFold cross-validation with over-sampling (SMOTE). Our aim is to find a classifier with good recall, i.e. one that finds as many attrited cases as it can. However, recall has to be read with care: a model can reach a high recall simply by labelling most customers as attrited, which is not useful.
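
A quick way to see why accuracy alone is misleading here, and why recall alone is not enough either, is to score two trivial "classifiers" on labels rebuilt from the class counts quoted above. This is an illustration only; neither baseline appears elsewhere in the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Labels rebuilt from the class counts quoted above: 0 = existing, 1 = attrited
y_true = np.array([0] * 8500 + [1] * 1627)

# Baseline 1: always predict the majority class (nobody attrits)
y_majority = np.zeros_like(y_true)
acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority)
print(f"always 'existing': accuracy = {acc:.3f}, recall = {rec:.3f}")

# Baseline 2: flag every customer as attrited
y_all_pos = np.ones_like(y_true)
rec_all = recall_score(y_true, y_all_pos)
prec_all = precision_score(y_true, y_all_pos)
print(f"always 'attrited': recall = {rec_all:.3f}, precision = {prec_all:.3f}")
```

The first baseline reaches about 84% accuracy while finding no attrited customer at all; the second reaches perfect recall with a precision of only about 16%.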



#### Note: We use the dataset scaled with StandardScaler and reduced with PCA(14)

### Let's standardize our splits

Let's make sure that our results are comparable as we try different methods. Passing cv=5 to every grid search and cross-validation call would be a little simpler, but it leaves the splitting strategy implicit.

If we pass cv=kf, where kf is a single KFold object, we make sure that every model is evaluated on the same splits.

The data is split into folds first, and resampling is applied afterwards, only to the training portion of each fold. If we resampled before splitting, synthetic samples derived from validation rows would leak into training and the model could overfit to one particular set of artificial samples. Resampling again inside every fold avoids this overfitting problem.
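
The "split first, resample inside the fold" order can be written out by hand. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification`, and it swaps SMOTE for plain random over-sampling with `sklearn.utils.resample` so that it only depends on scikit-learn. The notebook itself gets the same ordering from the imblearn pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 85% / 15%), a stand-in for the credit card data
X_toy, y_toy = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.85, 0.15], random_state=42)

kf_toy = KFold(n_splits=5)
fold_recalls = []
for train_idx, val_idx in kf_toy.split(X_toy):
    X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]

    # Over-sample the minority class using ONLY the training part of the fold
    # (random duplication here, used as a simple stand-in for SMOTE)
    minority = X_tr[y_tr == 1]
    n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
    extra = resample(minority, replace=True, n_samples=n_extra, random_state=42)
    X_bal = np.vstack([X_tr, extra])
    y_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=int)])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # The validation fold is left untouched, so the score reflects real rows only
    fold_recalls.append(recall_score(y_toy[val_idx], model.predict(X_toy[val_idx])))

print("Recall per fold:", np.round(fold_recalls, 2))
```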

In [None]:
""" From here, we can see our target classes are imbalanced. So, we use the over-sampling (SMOTE) method to 
balance the dataset by increasing the size of the rare class, in our case the Churned customers.

{0: Existing, 1: Churned}"""


from collections import Counter
Counter(y)

In [None]:
# Import necessary packages

from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from imblearn.pipeline import Pipeline, make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


kf = KFold(n_splits=5)

In [None]:
# General function for all model except forest
def imba_pipe(model):
    imba_pipeline = make_pipeline(SMOTE(random_state=42), model)
    return cross_val_score(imba_pipeline, X_train_s_pca, y_train, scoring="recall", cv=kf)

# Function for forest model -- the model does not use scaled data or dimension reduction data
def imba_pipe_forest(model):
    imba_pipeline = make_pipeline(SMOTE(random_state=42), model)
    return cross_val_score(imba_pipeline, X_train, y_train, scoring="recall", cv=kf)

# Function to fit the train and test dataset to models
def imba_pipe_fit(model):
    imba_pipeline = Pipeline(steps=[('smote', SMOTE(random_state=42)), 
                                    ('model', model)
                                   ])
    return imba_pipeline

### 1. KNeighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Before GridSearchCV
knn = KNeighborsClassifier(n_neighbors=100)
knn_recall = imba_pipe(knn)
print("Array of KNN recall score after KFold and resample: ", knn_recall)
print("Mean of array KNN recall score: ", round(knn_recall.mean(), 2))

In [None]:
# Find the best hyperparameters using GridSearchCV
knn1 = KNeighborsClassifier()
knn_params = {"model__n_neighbors":[2,5,10,87,100],
              "model__weights":["uniform", "distance"],
              "model__p":[1,2]
             }

knn_grid = GridSearchCV(imba_pipe_fit(knn1), knn_params, cv=kf, scoring="recall")
knn_grid.fit(X_train_s_pca, y_train)

print("Best parameters for KNN: ", knn_grid.best_params_)
print(f"\nKNN Grid best train recall score: {knn_grid.best_score_ :.2f}")
print(f"KNN Grid best test recall score: {recall_score(y_test, knn_grid.best_estimator_.predict(X_test_s_pca)) :.2f}")

### 2. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Random forests do not require feature scaling, so we use the unscaled data.

# Before GridSearchCV
rfc = RandomForestClassifier(max_depth= None, max_features=5, min_samples_split=2, criterion="gini")
rfc_recall = imba_pipe_forest(rfc)
print("Array of RFC recall score after KFold and resample: ", rfc_recall)
print("Mean of array RFC recall score: ", round(rfc_recall.mean(), 2))

In [None]:
# Find the best hyperparameters using GridSearchCV
rfc1 = RandomForestClassifier(max_depth= None)

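# "auto" is no longer accepted for max_features in recent scikit-learn releases;
# for classifiers it was equivalent to "sqrt", so it is left out of the grid
rfc_params = {"model__criterion":["gini", "entropy"],
              "model__min_samples_split": np.arange(2,6), 
              "model__max_features":[5, "sqrt", "log2"]
             }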

rfc_grid = GridSearchCV(imba_pipe_fit(rfc1), rfc_params, cv=kf, scoring="recall")
rfc_grid.fit(X_train, y_train)

print("Best parameters for RFC: ", rfc_grid.best_params_)
print(f"\nRFC Grid best train recall score: {rfc_grid.best_score_ :.2f}")
print(f"RFC Grid best test recall score: {recall_score(y_test, rfc_grid.best_estimator_.predict(X_test)) :.2f}")

### 3. XGBoost

In [None]:
from xgboost import XGBClassifier

# Tree-based boosting does not require feature scaling, so we use the unscaled data.

# Before GridSearchCV
xgbc = XGBClassifier(booster="gbtree", learning_rate=0.1, max_delta_step=1, verbosity=0, use_label_encoder=False)
xgbc_recall = imba_pipe_forest(xgbc)
print("Array of XGBC recall score after KFold and resample: ", xgbc_recall)
print("Mean of array XGBC recall score: ", round(xgbc_recall.mean(), 2))

In [None]:
# Find the best hyperparameters using GridSearchCV
xgbc1 = XGBClassifier(booster="gbtree", verbosity=0, use_label_encoder=False)

xgbc_params = {"model__learning_rate": [0.05, 0.1, 0.2, 0.3, 0.4, 0.5],
               "model__max_depth": np.arange(2,7),
               "model__max_delta_step": np.arange(1,11)
              }

xgbc_grid = GridSearchCV(imba_pipe_fit(xgbc1), xgbc_params, cv=kf, scoring="recall")
xgbc_grid.fit(X_train, y_train)

print("Best parameters for XGBC: ", xgbc_grid.best_params_)
print(f"\nXGBC Grid best train recall score: {xgbc_grid.best_score_ :.2f}")
print(f"XGBC Grid best test recall score: {recall_score(y_test, xgbc_grid.best_estimator_.predict(X_test)) :.2f}")

### 4. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# Before GridSearchCV
logr = LogisticRegression(fit_intercept=True, solver="saga", penalty="l1", C=0.5)
logr_recall = imba_pipe(logr)
print("Array of LogReg recall score after KFold and resample: ", logr_recall)
print("Mean of array LogReg recall score: ", round(logr_recall.mean(), 2))

In [None]:
# Find the best hyperparameters using GridSearchCV
logr1 = LogisticRegression()

logr_params = {"model__solver": ["lbfgs", "sag", "saga"],
               "model__C": np.arange(0.1,2,0.1), 
               "model__class_weight": ["balanced", None]
              }

logr_grid = GridSearchCV(imba_pipe_fit(logr1), logr_params, cv=kf, scoring="recall")
logr_grid.fit(X_train_s_pca, y_train)

print("Best parameters for LogReg: ", logr_grid.best_params_)
print(f"\nLogreg Grid best train recall score: {logr_grid.best_score_ :.2f}")
print(f"Logreg Grid best test recall score: {recall_score(y_test, logr_grid.best_estimator_.predict(X_test_s_pca)) :.2f}")

### 5. Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

svc = SVC(kernel="rbf", C=0.5, gamma="auto")
svc_recall = imba_pipe(svc)
print("Array of SVC recall score after KFold and resample: ", svc_recall)
print("Mean of array SVC recall score: ", round(svc_recall.mean(), 2))

In [None]:
# Find the best hyperparameters using GridSearchCV
svc1 = SVC()

svc_params = {"model__C": [0.1, 0.2, 0.3, 0.4],
              "model__kernel": ["rbf", "sigmoid"],
              "model__gamma": [0.2, "scale", "auto"]
             }

svc_grid = GridSearchCV(imba_pipe_fit(svc1), svc_params, cv=kf, scoring="recall")
svc_grid.fit(X_train_s_pca, y_train) 

print("Best parameters for SVC: ", svc_grid.best_params_)
print(f"\nSVC Grid best train recall score: {svc_grid.best_score_ :.2f}")
print(f"SVC Grid best test recall score: {recall_score(y_test, svc_grid.best_estimator_.predict(X_test_s_pca)) :.2f}")

# Evaluation Metrics

### Evaluation for classification

So far, we have evaluated the classifiers using recall: the proportion of actual positives that were identified correctly.

Simple accuracy is often not the right goal for a particular machine learning application. For example, with credit card churn or credit card fraud, false positives and false negatives can have very different real-world effects for users or for the organization. So it is important to select an evaluation metric that reflects the needs of the application or the business.
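
The relationship between these metrics and the confusion matrix can be worked through on a handful of made-up predictions. The labels below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predictions: 1 = attrited, 0 = existing
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)             # share of actual attrited customers that were caught
precision = tp / (tp + fp)          # share of flagged customers that really attrited
accuracy = (tp + tn) / len(y_true)  # share of all predictions that were correct

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```

Here 3 of the 4 attrited customers are caught (recall 0.75), but 2 of the 5 flagged customers are false alarms (precision 0.60).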

In [None]:
# import necessary packages for classification evaluation

from sklearn.metrics import confusion_matrix, classification_report

### 1. KNearest Neighbors

In [None]:
knn_pred = knn_grid.best_estimator_.predict(X_test_s_pca)
knn_confusion = confusion_matrix(y_test, knn_pred)
knn_class_report = classification_report(y_test, knn_pred, target_names=["not 1", "1"])

print("KNN confusion matrix\n", knn_confusion)
print("______________________________________\n")
print("KNN classification report\n\n", knn_class_report)

### 2. Random Forest Classifier

In [None]:
rfc_pred = rfc_grid.best_estimator_.predict(X_test)
rfc_confusion = confusion_matrix(y_test, rfc_pred)
rfc_class_report = classification_report(y_test, rfc_pred, target_names=["not 1", "1"])

print("RFC confusion matrix\n", rfc_confusion)
print("______________________________________\n")
print("RFC classification report\n\n", rfc_class_report)

### 3. XGBoost

In [None]:
xgbc_pred = xgbc_grid.best_estimator_.predict(X_test)
xgbc_confusion = confusion_matrix(y_test, xgbc_pred)
xgbc_class_report = classification_report(y_test, xgbc_pred, target_names=["not 1", "1"])

print("XGBC confusion matrix\n", xgbc_confusion)
print("______________________________________\n")
print("XGBC classification report\n\n", xgbc_class_report)

### 4. Logistic Regression

In [None]:
logr_pred = logr_grid.best_estimator_.predict(X_test_s_pca)
logr_confusion = confusion_matrix(y_test, logr_pred)
logr_class_report = classification_report(y_test, logr_pred, target_names=["not 1", "1"])

print("Logistic Regression confusion matrix\n", logr_confusion)
print("______________________________________\n")
print("Logistic Regression classification report\n\n", logr_class_report)

### 5. Support Vector Machine (SVM)

In [None]:
svc_pred = svc_grid.best_estimator_.predict(X_test_s_pca)
svc_confusion = confusion_matrix(y_test, svc_pred)
svc_class_report = classification_report(y_test, svc_pred, target_names=["not 1", "1"])

print("SVC confusion matrix\n", svc_confusion)
print("______________________________________\n")
print("SVC classification report\n\n", svc_class_report)

## Precision-Recall Curve

In [None]:
from sklearn.metrics import precision_recall_curve

y_proba_knn = knn_grid.best_estimator_.fit(X_train_s_pca, y_train).predict_proba(X_test_s_pca)
y_proba_rfc = rfc_grid.best_estimator_.fit(X_train, y_train).predict_proba(X_test)
y_proba_xgbc = xgbc_grid.best_estimator_.fit(X_train, y_train).predict_proba(X_test)
y_scores_logr = logr_grid.best_estimator_.fit(X_train_s_pca, y_train).decision_function(X_test_s_pca)
y_scores_svc = svc_grid.best_estimator_.fit(X_train_s_pca, y_train).decision_function(X_test_s_pca)


knn_precision, knn_recall, knn_thresholds = precision_recall_curve(y_test, y_proba_knn[:,1])
rfc_precision, rfc_recall, rfc_thresholds = precision_recall_curve(y_test, y_proba_rfc[:,1])
xgbc_precision, xgbc_recall, xgbc_thresholds = precision_recall_curve(y_test, y_proba_xgbc[:,1])
logr_precision, logr_recall, logr_thresholds = precision_recall_curve(y_test, y_scores_logr)
svc_precision, svc_recall, svc_thresholds = precision_recall_curve(y_test, y_scores_svc)


plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(knn_recall, knn_precision, lw=1, label='KNeighbors: Precision-Recall Curve', color='blue')
plt.plot(rfc_recall, rfc_precision, lw=1, label='RandForest: Precision-Recall Curve', color='orange')
plt.plot(xgbc_recall, xgbc_precision, lw=1, label='XGBoost: Precision-Recall Curve', color='yellow')
plt.plot(logr_recall, logr_precision, lw=1, label='LogReg: Precision-Recall Curve', color='red')
plt.plot(svc_recall, svc_precision, lw=1, label='SVC: Precision-Recall Curve', color='cyan')
plt.xlabel('Recall', fontsize=16)
plt.ylabel('Precision', fontsize=16)
plt.legend(loc='lower left', fontsize=10)
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() would create a new, empty one
plt.show();

## Receiver Operating Characteristic curve (ROC)

ROC curves, or Receiver Operating Characteristic curves, illustrate the performance of a binary classifier. They are created by plotting the true positive rate (TPR, or recall) against the false positive rate (FPR) at different decision thresholds.

On the X-axis, a ROC curve shows the classifier's False Positive Rate, which goes from 0 to 1.0, and on the Y-axis it shows the classifier's True Positive Rate, which also goes from 0 to 1.0.

ROC curves are very helpful for understanding the balance between the true positive rate and the false positive rate.
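
Each point on a ROC curve corresponds to one decision threshold. The sketch below makes that explicit on four hypothetical scores by computing FPR and TPR by hand with a small helper, `rates_at` (defined here for illustration), and then calling `roc_curve`, which performs the same sweep.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores for four customers
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    pred = scores >= threshold
    tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()
    return float(fpr), float(tpr)

# Lowering the threshold moves the point up and to the right
for t in [0.8, 0.4, 0.35, 0.1]:
    print(f"threshold={t:<4} -> (FPR, TPR) = {rates_at(t)}")

# roc_curve sweeps the thresholds for us
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_points = set(zip(fpr.tolist(), tpr.tolist()))
print(sorted(roc_points))
```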

In [None]:
# import necessary packages
from sklearn.metrics import roc_curve, auc, roc_auc_score

fpr_knn, tpr_knn, _ = roc_curve(y_test, y_proba_knn[:,1])
fpr_rfc, tpr_rfc, _ = roc_curve(y_test, y_proba_rfc[:,1])
fpr_xgbc, tpr_xgbc, _ = roc_curve(y_test, y_proba_xgbc[:,1])
fpr_logr, tpr_logr, _ = roc_curve(y_test, y_scores_logr)
fpr_svc, tpr_svc, _ = roc_curve(y_test, y_scores_svc)


plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_knn, tpr_knn, lw=1, label='KNeighbors: ROC curve', color='blue')
plt.plot(fpr_rfc, tpr_rfc, lw=1, label='RandForest: ROC curve', color='orange')
plt.plot(fpr_xgbc, tpr_xgbc, lw=1, label='XGBoost: ROC curve', color='yellow')
plt.plot(fpr_logr, tpr_logr, lw=1, label='LogRegr: ROC curve', color='red')
plt.plot(fpr_svc, tpr_svc, lw=1, label='SVC: ROC curve', color='cyan')
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (customer attrition classifiers)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() would create a new, empty one
plt.show();

## AUC
Let's see the AUC score for each model.
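
AUC has a useful reading beyond "area under the ROC curve": it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four hypothetical scores by comparing every positive with every negative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores for four customers
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half a win
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_by_ranking = wins / (len(pos) * len(neg))

print("AUC from pairwise ranking:", auc_by_ranking)
print("AUC from roc_auc_score:  ", roc_auc_score(y_true, scores))
```

Three of the four positive/negative pairs are ranked correctly, so both computations give 0.75.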

In [None]:
# AUC

print(f"AUC score for KNN {auc(fpr_knn, tpr_knn) :.2f}")
print(f"AUC score for Random Forest {auc(fpr_rfc, tpr_rfc) :.2f}")
print(f"AUC score for XGBoost {auc(fpr_xgbc, tpr_xgbc) :.2f}")
print(f"AUC score for LogReg {auc(fpr_logr, tpr_logr) :.2f}")
print(f"AUC score for SVC Non-linear {auc(fpr_svc, tpr_svc) :.2f}")

# Summary

XGBoost and Random Forest are the best models compared to the others: all of their metric scores (precision, recall, F1, accuracy) are at least 85%. Both also have an AUC score of 0.99, which shows that they separate the two classes well.

Even though most of the models improve in recall, some of them (KNN, Logistic Regression, and Support Vector Machine) show a significant reduction in precision. This matters: a high recall combined with a low precision can be a sign that the model is simply predicting that most customers will churn, which increases the number of **False Positives**.
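
One way to manage this trade-off is to pick the decision threshold from the precision-recall curve instead of using the default. The sketch below is an illustration on synthetic data, not a result from this notebook: the `target_recall` of 0.80 is a hypothetical requirement, and in practice the threshold should be chosen on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit card data
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, proba)

# precision and recall have one more entry than thresholds; ignore that final point
target_recall = 0.80  # hypothetical business requirement
ok = recall[:-1] >= target_recall

# Among the thresholds that still reach the target recall, take the most precise one
best = int(np.argmax(np.where(ok, precision[:-1], -1.0)))
print(f"threshold={thresholds[best]:.2f} "
      f"precision={precision[best]:.2f} recall={recall[best]:.2f}")
```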