**Introduction**

Digital (online) banking enables more customer-centric strategies and gives users instant access to banking services. However, this digital shift has also opened avenues for banking fraud. Customers who are unaware of online security risks are vulnerable to divulging confidential data that criminal groups can use to authenticate fraudulent transactions. Although banks invest significantly in digital services that save customers time and money, they often fail to allocate sufficient resources to keeping those services secure. Among the many criminal activities in the financial and medical insurance industries, credit card fraud is the most prevalent and costly, owing to the ease and sheer volume at which it occurs. Although many organizations take earnest measures to counter and prevent fraud, it cannot be fully eradicated because criminals eventually find a loophole in those measures. Even so, the benefits of fraud detection technology outweigh the cost of heavy investment and the inconvenience it causes. Continuously upgrading data mining and machine learning methods is therefore one of the main approaches to preventing the losses caused by fraud.

**Problem statement**

The study aims to identify fraudulent credit card transactions in the given data set using data science techniques. We need to train a model on past transactions whose outcome (fraudulent or not) is already known. The model is then used to decide whether a new transaction is fraudulent while minimizing incorrect fraud classifications.

The data set provided by Kaggle is already PCA transformed.

The approach used to solve the problem is as follows:

**Step -1**: Load the data to understand it. Understanding the data, including its types and patterns, is critical because we will select the subset of features used for model training.

**Step -2**: Exploratory data analysis. Study histograms to compare the distribution and skewness of each variable. Make the necessary adjustments to the data to avoid problems when training the model.

**Step -3**: Class imbalance. Check whether the data is highly imbalanced between fraudulent and non-fraudulent transactions. We have learned four techniques to balance the data: under-sampling, over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and ADASYN (Adaptive Synthetic Sampling). We will use ADASYN to lower the bias introduced by class imbalance and to adaptively shift the classification decision boundary towards examples that are difficult to learn.

**Step -4**: Data modeling and model selection. We will use two classification models (RandomForest and XGBoost) for the current study. For XGBoost, we will choose the number of trees based on accuracy. We will not use KNN since the data set has more than 10K rows (KNN computation time grows quickly beyond that size). A decision tree is easy to interpret as a flow chart and is therefore widely used, but it is hard to know when to stop growing the tree, so it tends to overfit. We will evaluate the models with the confusion matrix, accuracy, precision, recall, and F-score, all of which depend on the classification threshold. Since the data is imbalanced, we will also examine the ROC curve to identify the threshold value (a score above the threshold is classified as fraudulent, and a score below it as non-fraudulent). We will calculate the F1 score, precision, and recall.

**Step -5**: Hyperparameter tuning. By this point, we will have a good understanding of the type of data we have and the kind of model we are going to build. After model building, our next step will be hyperparameter tuning using cross-validation (grid-search cross-validation, K-fold, or stratified K-fold) or a train, validation, and test split. This step is essential to improve accuracy and AUC and to lower the misclassification error.

**Step -6**: Model evaluation. Evaluate the model based on the AUC-ROC score, through which we will define the threshold value. We will calculate the F1 score, precision, and recall. These metrics will be key for banks when shaping a business strategy to bring down fraud.
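
To make the threshold-dependent metrics in the steps above concrete, here is a minimal pure-Python sketch of how a threshold turns model scores into labels and how precision, recall, and F1 follow from the resulting confusion counts. The labels, scores, and helper names (`classify`, `confusion_counts`, `precision_recall_f1`) are hypothetical and only for illustration; they are not taken from the data set or from any library.

```python
def classify(scores, threshold):
    """Label a transaction 1 (fraudulent) when its score is at or above the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with class 1 treated as the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical labels and model scores, for illustration only
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.60, 0.40, 0.30, 0.70, 0.90]

for threshold in (0.35, 0.65):
    tp, fp, fn, tn = confusion_counts(y_true, classify(scores, threshold))
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    print(threshold, (tp, fp, fn, tn), round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy example, raising the threshold removes the false positive but misses one fraud, which is the precision versus recall trade-off that the threshold choice controls.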

In [None]:
# Load Required Libraries

import pandas as pd
import numpy as np

# EDA analysis

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pyplot import show
from matplotlib.colors import ListedColormap
from plotnine import *
import matplotlib.pyplot as Pyplot

from scipy.stats import kurtosis
from scipy.stats import skew
from scipy.stats import norm
from scipy.stats import shapiro
import sklearn.metrics as metrics
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score,accuracy_score,roc_auc_score, precision_score, recall_score, confusion_matrix, classification_report, roc_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split,validation_curve, StratifiedShuffleSplit,StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn import over_sampling

In [None]:
# Suppress deprecation warnings so they do not clutter the notebook output
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [None]:
# Create separators to make the printed output of each step easier to read
def Line_Separator():
    print('*'*50, '\n')

def Line_Separator1():
    print('*'*100, '\n')

In [None]:
### Load the data file

creditcard=pd.read_csv('../input/credit-card-fraud-detecion/creditcard.csv', index_col=None)

creditcard.head(2)

In [None]:
# Evaluate number of columns and rows in given dataset

Number_of_row = creditcard.shape[0]
Number_of_column = creditcard.shape[1]

print('Number of rows in creditcard file     :', Number_of_row)
print('Number of columns in creditcard file  :', Number_of_column); Line_Separator()

In [None]:
# Review columns title in given dataset

print("Columns name in  creditcard file :\n",creditcard.columns.values);Line_Separator1()

In [None]:
# Evaluate number of categorical and numerical features

def data_features(data):
    # Use the DataFrame passed in rather than the global creditcard variable
    categorical_features = data.select_dtypes(exclude = [np.number]).columns
    numerical_features = data.select_dtypes(include = [np.number]).columns
    print("Categorical features in  creditcard file :\n",categorical_features);Line_Separator1()
    print("Numerical features in  creditcard file   :\n",numerical_features);Line_Separator1()

# The function prints its results and returns None, so it is called directly
data_features(creditcard)

In [None]:
# Review the datatypes
print("Review the Data Format in  creditcard file :");Line_Separator()
print(creditcard.dtypes);Line_Separator()

In [None]:
# Identify missing value if any

print("Check if there is any missing value in creditcard file :");Line_Separator()
print(round(100*(creditcard.isnull()).sum()/len(creditcard),2).sort_values(ascending=False));Line_Separator()

In [None]:
print("Time variable statistics");Line_Separator()
creditcard['Time'].describe()

In [None]:
# The data spans 48 hours and the values appear to be in seconds, so convert them to hours (1 hour = 3600 seconds)

creditcard['Time'] =creditcard['Time']/3600
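
As a small worked example of the conversion above, the sketch below turns elapsed seconds into elapsed hours and, additionally, into an hour-of-day value by wrapping at 24. The helper names are hypothetical, and the hour-of-day mapping assumes the first transaction happened at the start of a day, which the data set does not confirm.

```python
SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds):
    """Elapsed time in hours, the same arithmetic as dividing the Time column by 3600."""
    return seconds / SECONDS_PER_HOUR


def hour_of_day(seconds):
    """Hour within a 24-hour day, ASSUMING the clock starts at hour 0 of day one."""
    return int(seconds // SECONDS_PER_HOUR) % 24


# Hypothetical elapsed-second values, for illustration only
for seconds in (0, 7200, 86400, 140400):
    print(seconds, seconds_to_hours(seconds), hour_of_day(seconds))
```

Under that assumption, hour 15 and hour 39 of the recording fall at the same time of day, which is useful when comparing activity across the two days.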

In [None]:
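# Map classes zero and one to descriptive labels for visualization

def replace_data_to_binary(x,y):
    # Assign the result back instead of using a chained inplace replace,
    # which newer pandas versions may not apply to the original DataFrame
    creditcard['Class'] = creditcard['Class'].replace(x, y)


replace_data_to_binary(0, 'Non Fraudulent')
replace_data_to_binary(1, 'Fraudulent')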

In [None]:
# Class distribution

fig = plt.figure(figsize=(9,5))
fig.set_facecolor("#F3F3F3")
total = float(len(creditcard))
ax = sns.countplot(x="Class",  data=creditcard)
plt.xlabel('Class', fontsize=14)
plt.ylabel('Number of observations', fontsize=14)
plt.title('Distribution of Class', fontsize =14)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2., height + 3, '{:1.2f}'.format(100*height/total),
            ha="center",va='bottom')
ax.grid(False)
plt.tick_params(labelsize=12)
plt.show();Line_Separator1()

In [None]:
# Evaluate whether data is balanced or not

total_count_combined_calss = creditcard['Class'].value_counts()
imbalance= (total_count_combined_calss['Fraudulent']/creditcard['Class'].count()*100)/(total_count_combined_calss['Non Fraudulent']/creditcard['Class'].count()*100)*100
print('Imbalance Percentage : ' + str(imbalance));Line_Separator()
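
The imbalance figure printed above is the fraudulent share divided by the non-fraudulent share, expressed as a percentage. The short sketch below, using hypothetical counts rather than the real ones, shows that this reduces to the fraud count divided by the non-fraud count, and contrasts it with the plain share of fraudulent transactions. The function name `imbalance_percentage` is introduced here only for illustration.

```python
def imbalance_percentage(minority_count, majority_count):
    """Minority share divided by majority share, as a percentage (same formula as the cell above)."""
    total = minority_count + majority_count
    minority_share = minority_count / total * 100
    majority_share = majority_count / total * 100
    return minority_share / majority_share * 100


# Hypothetical counts, NOT the actual counts in creditcard.csv
fraud_count, non_fraud_count = 50, 9950

ratio = imbalance_percentage(fraud_count, non_fraud_count)
simplified = fraud_count / non_fraud_count * 100             # the totals cancel out
fraud_share = fraud_count / (fraud_count + non_fraud_count) * 100

print(round(ratio, 4), round(simplified, 4), round(fraud_share, 4))
```

The two quantities are close when fraud is rare, but they are not the same number, so it helps to state which one is being reported.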

In [None]:
print("Amount details of fraudulent transaction :");Line_Separator()
print(creditcard[creditcard["Class"] == "Fraudulent"].Amount.describe());Line_Separator()

print('\n')

print("Amount details of non-fraudulent transaction :");Line_Separator()
print(creditcard[creditcard["Class"] == "Non Fraudulent"].Amount.describe());Line_Separator()

This statistical summary shows that the average transaction amount is higher for fraudulent transactions than for non-fraudulent ones.

In [None]:
# Evaluate the time vs. amount transaction between fraudulent and non-fraudulent

print("Evaluate the time vs. amount transaction between fraudulent and non-fraudulent");Line_Separator1()
ax=sns.relplot(x="Time", y="Amount",
                 col="Class", hue="Class",
                 kind="scatter", data=creditcard)
plt.tick_params(labelsize=12)
plt.show(); Line_Separator1()

The visualization above reveals the following:

1. The fraudulent amounts were all below approximately 2.2k.
2. Fraudulent transactions appear concentrated between roughly hours 14 and 20 on both days.
3. We can see a two-peak pattern over time, with the dip between the peaks likely due to night-time.


In [None]:
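# Assign the classes back to zero and one for analysis
def replace_data_to_binary(x,y):
    # Assign the result back instead of using a chained inplace replace,
    # which newer pandas versions may not apply to the original DataFrame
    creditcard['Class'] = creditcard['Class'].replace(x, y)


replace_data_to_binary('Non Fraudulent', 0)
replace_data_to_binary('Fraudulent', 1)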

In [None]:
print("After reassigning class to zero and one, we will evaluate the data type ");Line_Separator1()
creditcard.dtypes

In [None]:
# Evaluate the correlation between different parameters in the dataset if any

creditcard_corr=creditcard.corr()
fig = plt.figure(figsize=(20,10))
fig.set_facecolor("#F3F3F3")
ax=sns.heatmap(creditcard_corr)
plt.tick_params(labelsize=12)
plt.show();Line_Separator1()

The visualization above reveals the following:
1. V7 and V20 are positively correlated with the amount.
2. V2 and V5 are negatively correlated with the amount.

In [None]:
# Evaluate the data distributions of each variable

print(); Line_Separator1()
print("Plotting the Shape of a Distribution of each Variable in 'Non-Fraudulent' & 'Fraudulent': Variables V1 to V12"); Line_Separator1()
fig = plt.figure(figsize=(20,15))
fig.set_facecolor("#F3F3F3")

# Draw one panel per variable on a 4x3 grid instead of repeating the same block twelve times.
# fill=True replaces the deprecated shade=True argument (seaborn 0.11 and later).
for position, column in enumerate(['V%d' % i for i in range(1, 13)], start=1):
    plt.subplot(4, 3, position)
    g=sns.kdeplot(creditcard.loc[(creditcard["Class"] == 0),column] , color='b',fill=True,label='Non Fraudulent')
    g=sns.kdeplot(creditcard.loc[(creditcard['Class'] == 1),column] , color='orange',fill=True, label='Fraudulent')
    g.grid(False)
    plt.ylabel("Frequency Density", fontsize=12)
    plt.xlabel(column, fontsize=12)
    plt.tick_params(labelsize=12)

plt.show(); Line_Separator1()

In [None]:
# Evaluate the data distributions of each variable

print(); Line_Separator1()
print("Plotting the Shape of a Distribution of each Variable in 'Non-Fraudulent' & 'Fraudulent': Variables V13 to V24"); Line_Separator1()

fig = plt.figure(figsize=(20,15))
fig.set_facecolor("#F3F3F3")

# Draw one panel per variable on a 4x3 grid instead of repeating the same block twelve times.
# fill=True replaces the deprecated shade=True argument (seaborn 0.11 and later).
for position, column in enumerate(['V%d' % i for i in range(13, 25)], start=1):
    plt.subplot(4, 3, position)
    g=sns.kdeplot(creditcard.loc[(creditcard["Class"] == 0),column] , color='b',fill=True,label='Non Fraudulent')
    g=sns.kdeplot(creditcard.loc[(creditcard['Class'] == 1),column] , color='orange',fill=True, label='Fraudulent')
    g.grid(False)
    plt.ylabel("Frequency Density", fontsize=12)
    plt.xlabel(column, fontsize=12)
    plt.tick_params(labelsize=12)

plt.show(); Line_Separator1()

In [None]:
# Evaluate the data distributions of each variable

print(); Line_Separator1()
print("Plotting the Shape of a Distribution of each Variable in 'Non-Fraudulent' & 'Fraudulent': Variables V25 to V28, Time & Amount"); Line_Separator1()
fig = plt.figure(figsize=(20,15))
fig.set_facecolor("#F3F3F3")

# Draw one panel per variable on a 4x3 grid instead of repeating the same block six times.
# fill=True replaces the deprecated shade=True argument (seaborn 0.11 and later).
for position, column in enumerate(['V25', 'V26', 'V27', 'V28', 'Time', 'Amount'], start=1):
    plt.subplot(4, 3, position)
    g=sns.kdeplot(creditcard.loc[(creditcard["Class"] == 0),column] , color='b',fill=True,label='Non Fraudulent')
    g=sns.kdeplot(creditcard.loc[(creditcard['Class'] == 1),column] , color='orange',fill=True, label='Fraudulent')
    g.grid(False)
    plt.ylabel("Frequency Density", fontsize=12)
    plt.xlabel(column, fontsize=12)
    plt.tick_params(labelsize=12)

plt.show(); Line_Separator1()

In [None]:
# Separate data of Non-Fraudulent and Fraudulent to check the skewness and kurtosis

Non_Fraudulent= creditcard[creditcard["Class"] == 0]
print ("Non_Fraudulent:", Non_Fraudulent.shape); Line_Separator()
Fraudulent= creditcard[creditcard["Class"] == 1]
print ("Fraudulent:", Fraudulent.shape); Line_Separator()

In [None]:
# Non_Fraudulent = to check the skewness and kurtosis.

print ("Non_Fraudulent = To evaluate Mean, Variance, skewness, and kurtosis:")
print("")
a = Non_Fraudulent.mean(axis = 0, skipna = True)
b = Non_Fraudulent.var(axis = 0, skipna = True)
c = Non_Fraudulent.skew(axis = 0, skipna = True)
d = Non_Fraudulent.kurtosis(axis = 0, skipna = True)

a.index = b.index
a.index = c.index
a.index = d.index

data_Non_Fraudulent = pd.concat([a, b, c, d] ,axis = 1)
data_Non_Fraudulent.columns = ["Mean", "Var", "Skewness", "kurtosis"]
data_Non_Fraudulent=data_Non_Fraudulent.reset_index().rename(index=str, columns={"index": "Variables"})
print(data_Non_Fraudulent); Line_Separator1()

# -------------------------------------------------------------------------------------------------------------

# Fraudulent = to check the skewness and kurtosis

print ("Fraudulent = To evaluate Mean, Variance, skewness, and kurtosis:")
print("")
e = Fraudulent.mean(axis = 0, skipna = True)
f = Fraudulent.var(axis = 0, skipna = True)
g = Fraudulent.skew(axis = 0, skipna = True)
h = Fraudulent.kurtosis(axis = 0, skipna = True)

e.index = f.index
e.index = g.index
e.index = h.index

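# Use the statistics of the fraudulent subset (e, f, g, h), not the non-fraudulent ones (a, b, c, d)
data_Fraudulent = pd.concat([e, f, g, h] ,axis = 1)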
data_Fraudulent.columns = ["Mean", "Var", "Skewness", "kurtosis"]
data_Fraudulent=data_Fraudulent.reset_index().rename(index=str, columns={"index": "Variables"})
print(data_Fraudulent); Line_Separator1()

1. Skewness = 0: the tails on both sides of the mean balance out overall (a normal distribution is one example).
2. Skewness > 0: more weight in the right tail of the distribution.
3. Skewness < 0: more weight in the left tail of the distribution.

A zero value is the case for a symmetric distribution, but it can also be true for an asymmetric distribution where one tail is long and thin, and the other is short but fat.
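
The sign convention for skewness can be checked with a minimal pure-Python sketch. It computes the simple moment-based skewness (third central moment divided by the 1.5 power of the second) on small hypothetical samples. Note that pandas `skew()` applies a bias correction, so its values differ slightly from this formula, although the sign is the same.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population (biased) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical samples, for illustration only
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]           # one large value stretches the right tail
left_tailed = [-20, -3, -2, -2, -1, -1]      # mirror image of right_tailed

for name, sample in (("symmetric", symmetric),
                     ("right_tailed", right_tailed),
                     ("left_tailed", left_tailed)):
    print(name,
          round(sample_skewness(sample), 3),
          "mean =", round(statistics.mean(sample), 3),
          "median =", statistics.median(sample))
```

The right-tailed sample has positive skewness and a mean above its median, while the mirrored sample has negative skewness and a mean below its median.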

**Note:** We will use these statistics to decide whether the data requires a power transformation. Keep in mind that the features are PCA transformed rather than the original variables.

In [None]:
# Non-Fraudulent: Evaluate the number of positive skewness variables

print('Non Fraudulent - Positive skewness:')
left_skewness_Non_Fraudulent= data_Non_Fraudulent[data_Non_Fraudulent.Skewness >0]
print(left_skewness_Non_Fraudulent['Variables'].unique());Line_Separator()

# ---------------------------------------------------------------------------------------

# Fraudulent : Evaluate the number of positive skewness variables

print('Fraudulent - Positive skewness:')
left_skewness_Fraudulent = data_Fraudulent[data_Fraudulent.Skewness >0]
print(left_skewness_Fraudulent['Variables'].unique());Line_Separator()

**Positive Skew:** Mean > median

In [None]:
# Non-Fraudulent: Evaluate the number of negative skewness variables
print('Non Fraudulent - Negative skewness:')
right_skewness_Non_Fraudulent= data_Non_Fraudulent[data_Non_Fraudulent.Skewness <0]
print(right_skewness_Non_Fraudulent['Variables'].unique());Line_Separator()

# ---------------------------------------------------------------------------------------

# Fraudulent: Evaluate the number of negative skewness variables
print('Fraudulent - Negative skewness:')
right_skewness_Fraudulent = data_Fraudulent[data_Fraudulent.Skewness <0]
print(right_skewness_Fraudulent['Variables'].unique());Line_Separator()

1. **Negative Skew:** Median > mean
2. **Zero Skew:** Mean = median (symmetric distribution, for example the normal distribution)

In [None]:
# We will drop the time column

creditcard = creditcard.drop(['Time'],axis=1)
creditcard.head()

In [None]:
from sklearn.preprocessing import StandardScaler

creditcard['Amount'] = StandardScaler().fit_transform(creditcard['Amount'].values.reshape(-1,1))
#creditcard['Time'] = StandardScaler().fit_transform(creditcard['Time'].values.reshape(-1,1))
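
For reference, standardization replaces each value with its z-score, (x - mean) / std. The cell above fits the scaler on the full data set before the train/test split. The sketch below shows one possible variation, not what this notebook does, in which the mean and standard deviation are learned from the training values only and then reused on the test values. The helper names and the amounts are hypothetical.

```python
import math


def fit_standardizer(values):
    """Return the mean and population standard deviation (ddof=0) of the values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def standardize(values, mean, std):
    """Apply the z-score transform with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical transaction amounts, for illustration only
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 100.0]

mean, std = fit_standardizer(train_amounts)        # statistics come from the training data only
train_scaled = standardize(train_amounts, mean, std)
test_scaled = standardize(test_amounts, mean, std)  # the test data reuses the same statistics

print(mean, round(std, 4))
print([round(v, 4) for v in train_scaled])
print([round(v, 4) for v in test_scaled])
```

Fitting on the training portion only keeps information about the test set out of the preprocessing step, at the cost of test values possibly falling outside the scaled training range.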

In [None]:
# Create X and y

X = creditcard.drop('Class',axis=1)
y = creditcard['Class']

In [None]:
# Evaluate the X dataset

X.head(2)

In [None]:
# print the data shape

print(X.shape)
print(y.shape)

In [None]:
# Split the data into train and test with 70-30 division

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.3, train_size=0.7,  random_state=0)
X_train.shape,X_test.shape

In [None]:
# Skewness is observed in the distributions, so apply a power transformation using the 'yeo-johnson' method

creditcard_pt= preprocessing.PowerTransformer(method='yeo-johnson', copy=True)
creditcard_pt.fit(X_train)

X_train_pt = creditcard_pt.transform(X_train)
X_test_pt = creditcard_pt.transform(X_test)

y_train_pt = y_train
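
To illustrate what the Yeo-Johnson method computes, here is a minimal pure-Python sketch of the transform for a single value and a given lambda. It is only an illustration of the formula: `PowerTransformer` additionally estimates a separate lambda for each feature and, by default, standardizes the output. The function name `yeo_johnson` and the sample values are assumptions made for this example.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of one value x for a fixed lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)                       # lambda == 0, non-negative x
        return ((x + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < 1e-12:
        return -math.log1p(-x)                         # lambda == 2, negative x
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# Hypothetical right-skewed values, for illustration only
values = [-1.0, 0.0, 0.5, 2.0, 10.0, 50.0]

for lmbda in (1.0, 0.5, 0.0):
    print(lmbda, [round(yeo_johnson(v, lmbda), 3) for v in values])
```

With lambda equal to 1 the transform leaves the values unchanged, while smaller lambdas compress large positive values, which is how the transform reduces right skew. Unlike Box-Cox, the formula also handles zero and negative inputs, which matters for the PCA components here.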

In [None]:
# Rename the x_train and x_test

X_train = X_train_pt
X_test = X_test_pt

In [None]:
# Balance the data using ADASYN oversampling on the minority class

X_ada, y_ada = over_sampling.ADASYN(sampling_strategy='minority', random_state=42).fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_ada))
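
The core idea behind synthetic oversampling can be sketched in a few lines. The example below is a simplified illustration of the interpolation step that SMOTE and ADASYN share: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. It is not the imbalanced-learn implementation. ADASYN additionally decides how many synthetic points to create around each minority sample from the number of majority-class neighbours it has, and that weighting is omitted here. All names and points are hypothetical.

```python
import math
import random


def interpolate(point, neighbor, rng):
    """Return a random point on the segment between point and neighbor."""
    gap = rng.random()
    return tuple(p + gap * (q - p) for p, q in zip(point, neighbor))


def oversample_minority(minority, n_new, k=2, seed=42):
    """Create n_new synthetic samples by interpolating towards one of the k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        others = [m for m in minority if m is not point]
        neighbors = sorted(others, key=lambda m: math.dist(point, m))[:k]
        synthetic.append(interpolate(point, rng.choice(neighbors), rng))
    return synthetic


# Hypothetical two-feature minority samples, for illustration only
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 2.0)]
new_points = oversample_minority(minority, n_new=6)

for point in new_points:
    print(tuple(round(c, 3) for c in point))
```

Because every synthetic point lies between two existing minority samples, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.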

In [None]:
# Shape of the dataset after balancing it

print(X_ada.shape)
print(y_ada.shape)

In [None]:
# Check the unique values in y_ada

y_ada.unique()

In [None]:
# Evaluate how balanced the dataset is after ADASYN (minority share relative to majority share, in percent)

total_count_combined_calss = y_ada.value_counts()
imbalance= (total_count_combined_calss[1]/y_ada.count()*100)/(total_count_combined_calss[0]/y_ada.count()*100)*100
print('Balance Percentage after ADASYN : ' + str(imbalance));Line_Separator1()

In [None]:
# Evaluate the sum of y_ada, y_train and y_test

print(np.sum(y_ada))
print(np.sum(y_train))
print(np.sum(y_test))

In [None]:
# Rename X_ada and y_ada

X_train = X_ada
y_train = y_ada

In [None]:
# Evaluate the shape of the training dataset after balancing it

print(X_train.shape)
print(y_train.shape)

**Base Model**

In [None]:
# Base model for evaluation: always predict the most frequent class in y_test

b_m=[]

for i in range (y_test.shape[0]):
    b_m.append(y_test.mode()[0])

len(b_m)

In [None]:
y_pred=pd.Series(b_m)

In [None]:
print('Accuracy (Base Model):', accuracy_score(y_test, y_pred)*100);Line_Separator()
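
A majority-class baseline like the one above shows why accuracy alone is a weak yardstick on imbalanced data. The sketch below uses hypothetical labels (2 frauds in 100 transactions, not the real proportions) and hypothetical helper names to show that such a baseline can reach high accuracy while detecting no fraud at all.

```python
from collections import Counter


def majority_baseline(y):
    """Predict the most frequent label for every observation."""
    most_common_label, _ = Counter(y).most_common(1)[0]
    return [most_common_label] * len(y)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual_positives = sum(1 for t in y_true if t == positive)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return true_positives / actual_positives if actual_positives else 0.0


# Hypothetical labels, for illustration only: 98 genuine and 2 fraudulent transactions
y_true = [0] * 98 + [1] * 2
y_pred = majority_baseline(y_true)

print("accuracy:", accuracy(y_true, y_pred))
print("recall:  ", recall(y_true, y_pred))
```

Any candidate model therefore needs to beat the baseline on recall, precision, or F1, not only on accuracy.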

**KNN Model**

1. KNN is best suited to small data sets because its computation time grows quickly with the number of data points, and we were advised not to use KNN in this capstone.
2. We tried it anyway: it ran for more than 8 hours without producing a result, so we will not use this model.

    # To find the K neighbour value

    neighbors = np.arange(1,11)
    train_accuracy = np.empty(len(neighbors))
    test_accuracy = np.empty(len(neighbors))

    for i, k in enumerate(neighbors):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        train_accuracy[i] = knn.score(X_train, y_train)
        test_accuracy[i] = knn.score(X_test, y_test)

    # Generate plot
    fig = plt.figure(figsize=(8,4))
    fig.set_facecolor("#F3F3F3")
    plt.title('k-NN: Varying Number of Neighbors', fontsize =12)
    plt.plot(neighbors, train_accuracy, label = 'Accuracy on training set')
    plt.plot(neighbors, test_accuracy, label = 'Accuracy on testing set')
    plt.legend(loc = "best", prop = {"size" : 14})
    plt.xlabel('Number of Neighbors', fontsize=12)
    plt.ylabel('Accuracy', fontsize=12)
    plt.show()

    # Another approach to identify k: choose the k with the lowest mean error
    error = []

    # Calculating error for K values from 1 to 39
    for i in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors=i)
        knn.fit(X_train, y_train)
        pred_i = knn.predict(X_test)
        error.append(np.mean(pred_i != y_test))

    fig = plt.figure(figsize=(8,4))
    fig.set_facecolor("#F3F3F3")
    plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
             markerfacecolor='blue', markersize=10, label='Mean error')
    plt.title('Error Rate K Value', fontsize=12)
    plt.xlabel('K Value', fontsize=12)
    plt.ylabel('Mean Error', fontsize=12)
    plt.legend(loc = "best", prop = {"size" : 12})
    plt.show()

1. We will obtain results for all five models under consideration without any optimization and check their performance.
2. We will first try the logistic regression and XGBoost models with different parameter values and no optimization to check their performance.
3. We will use logistic regression with stratified K-fold cross-validation.
4. The function below plots the validation curve used to choose a suitable C value for the logistic regression model.

In [None]:
def plot_val_curve(train_scores, val_scores, param_range, plt_title):
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    val_scores_mean = np.mean(val_scores, axis=1)
    val_scores_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(14,6))

    plt.title(plt_title)
    plt.xlabel("C (regularization parameter)")
    plt.ylabel("Score")

    lw = 2
    plt.semilogx(param_range, train_scores_mean, label="Training score",
                 color="darkorange", lw=lw)
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2,
                     color="darkorange", lw=lw)

    plt.semilogx(param_range, val_scores_mean, label="Cross-validation score",
                 color="navy", lw=lw)

    plt.fill_between(param_range, val_scores_mean - val_scores_std,
                     val_scores_mean + val_scores_std, alpha=0.2,
                     color="navy", lw=lw)
    plt.legend(loc="best")

# Number of stratified shuffle splits used for cross-validation is 10

K = 10
stratified_cv = StratifiedShuffleSplit(n_splits = K, random_state = 0)

# liblinear ignores n_jobs, and validation_curve refits the estimator itself, so no prior fit is needed
lgreg = LogisticRegression(penalty='l1', solver='liblinear',random_state=0, max_iter=1000)
param_C_range = [1e-9, 1e-5, 1e-3, 1e-2, 1e-1, 1.0,10,100]
train_scores, val_scores = validation_curve(estimator=lgreg,
                                            X=X_train,
                                            y=y_train,
                                            param_name='C',
                                            param_range=param_C_range,
                                            cv=stratified_cv,
                                            n_jobs=-1,
                                            scoring='f1')
plot_val_curve(train_scores, val_scores, param_C_range, plt_title='L1 Regularization Parameter Sweep')
plt.xscale('log')
plt.show()

In [None]:
# C = 0.01 is selected from the validation curve above
# Predict class labels and probabilities for the test data set

lgreg = LogisticRegression(penalty='l1', C=0.01,solver='liblinear',random_state=0, max_iter=1000).fit(X_train, y_train)
y_pred = lgreg.predict(X_test)
y_prob = lgreg.predict_proba(X_test)


# Compute classification metrics

accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)
# Log loss is defined on predicted probabilities, not on hard class labels
logloss = metrics.log_loss(y_test, y_prob)

# Compute precision, recall, and AUPRC for different levels of thresholds

precisions, recalls, thresholds = metrics.precision_recall_curve(y_test.ravel(), y_prob[:, 1].ravel(), pos_label=1)
prc_auc = metrics.average_precision_score(y_test, y_prob[:,1], average='weighted')

print('Test accuracy: %.3f' % (accuracy))
print('\nTest Precision: %.3f' % (precision))
print('\nTest Recall: %.3f' % (recall))
print('\nTest F1: %.3f' % (f1))
print('\nTest LogLoss: %.3f' % (logloss))
print('\nTest AUPRC: %.3f' % (prc_auc))


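# ROC curve and AUC are computed from the predicted fraud probabilities, not the hard class labels
fpr,tpr,thresholds = roc_curve(y_test,y_prob[:,1])
auc_score = roc_auc_score(y_test,y_prob[:,1])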
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('LR_1 Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")

The test accuracy is about 90.6%, which is not very good, although the area under the ROC curve looks good.
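
Accuracy on its own can be misleading on heavily imbalanced data. The sketch below uses made-up class counts (not the notebook's data) to show that a trivial model predicting "not fraudulent" for every transaction scores a very high accuracy while catching no fraud; `y_true_toy` and `y_pred_toy` are hypothetical names.

```python
# Toy data: 990 legitimate (0) and 10 fraudulent (1) transactions (assumed counts)
y_true_toy = [0] * 990 + [1] * 10
# A trivial "model" that never flags fraud
y_pred_toy = [0] * len(y_true_toy)

correct = sum(t == p for t, p in zip(y_true_toy, y_pred_toy))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true_toy, y_pred_toy))
actual_pos = sum(y_true_toy)

toy_accuracy = correct / len(y_true_toy)
toy_recall = true_pos / actual_pos

print(f"accuracy = {toy_accuracy:.3f}, recall = {toy_recall:.3f}")
```

This is why precision, recall, F1 and the area under the precision-recall curve are reported alongside accuracy for this problem.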

In [None]:
# Plotting the Confusion Matrix

classes = ['Not Fraudulent', 'Fraudulent']
LR_conf_matrix_1 = confusion_matrix(y_test,y_pred)
fig = plt.figure(figsize=(9,5))
fig.set_facecolor("#F3F3F3")
ax= sns.heatmap(LR_conf_matrix_1,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=classes,
                yticklabels=classes,
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("LR_conf_matrix_1", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')

1. The above confusion matrix represents logistic regression with L1 regularization and C=0.01
2. We will use stratified k-fold validation on the XGBoost algorithm next
3. Perform cross-validation using the stratified k-fold method on X_train and y_train
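
The cell below oversamples the minority class inside each fold, after the split, so that resampled rows never reach the held-out part. A minimal sketch of that ordering follows; it swaps in plain random duplication for ADASYN to stay dependency-free, and `oversample_minority` is a hypothetical helper, not part of imbalanced-learn.

```python
import random

def oversample_minority(X, y, seed=42):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy fold: 8 legitimate and 2 fraudulent rows for training, 2 rows held out
X_fold_train = [[float(i)] for i in range(10)]
y_fold_train = [0] * 8 + [1] * 2
X_fold_test, y_fold_test = [[10.0], [11.0]], [0, 1]

# Resample the training part only; the held-out part stays untouched
X_bal, y_bal = oversample_minority(X_fold_train, y_fold_train)
print(sum(y_bal), len(y_bal) - sum(y_bal), len(y_fold_test))
```

Resampling before the split would let copies of the same fraudulent row appear on both sides of a fold, which tends to inflate the cross-validation score.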

In [None]:
from sklearn.model_selection import StratifiedKFold

X_train_p = X_train_pt
y_train_p = y_train_pt
skf = StratifiedKFold(n_splits=3, random_state=None, shuffle=False)

print("XGBOOST Classifier: --------------------------")
for n_estimators in [100,50]:
    for learning_rate in [0.2,0.6]:
        for subsample in [0.3, 0.6, 0.9]:
            print("n_estimators=",n_estimators,"learning_rate=",learning_rate, "subsample=",subsample)
            # Reset the running total so each parameter combination gets its own mean
            cv_score_mean=0
            for train_index, test_index in skf.split(X_train_p, y_train_p):
                print("Train:", train_index, "Test:", test_index)
                X_train_cv, X_test_cv = X_train_p[train_index], X_train_p[test_index]
                y_train_cv, y_test_cv = y_train_p.iloc[train_index], y_train_p.iloc[test_index]

                ros = over_sampling.ADASYN(sampling_strategy='minority', random_state=42)
                X_ros_cv,y_ros_cv = ros.fit_resample(X_train_cv,y_train_cv)

                xgboost_classifier= XGBClassifier(n_estimators=n_estimators,
                                                learning_rate=learning_rate,
                                                subsample=subsample, n_jobs=-1,
                                                eval_metric='logloss',
                                                use_label_encoder=False)
                xgboost_classifier.fit(X_ros_cv, y_ros_cv)

                y_test_pred= xgboost_classifier.predict_proba(X_test_cv)
                cv_score= metrics.roc_auc_score(y_true=y_test_cv,y_score=y_test_pred[:,1])
                cv_score_mean=cv_score_mean+cv_score
            print("Cross Val ROC-AUC Score=", cv_score_mean/3)

1. The results shown above suggest using n_estimators=100, learning_rate=0.2 and subsample=0.3 to obtain the highest area under the curve

2. Similar optimization was performed on the other three models to check their performance, and parameters were selected based on it
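
When fold scores are averaged per parameter combination, the running total has to start from zero for each combination. The sketch below shows that bookkeeping with a stand-in scoring function; `fake_fold_score` and its values are invented for illustration and do not reproduce the notebook's results.

```python
from itertools import product

def fake_fold_score(n_estimators, learning_rate, fold):
    """Stand-in for training and scoring one fold (hypothetical numbers)."""
    return 0.90 + 0.0001 * n_estimators - 0.05 * learning_rate + 0.001 * fold

n_folds = 3
mean_scores = {}
for n_estimators, learning_rate in product([100, 50], [0.2, 0.6]):
    # Collect the scores of this combination only, then average them
    fold_scores = [fake_fold_score(n_estimators, learning_rate, fold)
                   for fold in range(n_folds)]
    mean_scores[(n_estimators, learning_rate)] = sum(fold_scores) / n_folds

best_params = max(mean_scores, key=mean_scores.get)
print(best_params)
```

Keeping the per-combination means in a dictionary also makes it easy to pick the best setting programmatically instead of reading it off the printed log.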

In [None]:
# Finding the best tree in XGB

from xgboost import XGBClassifier

tree_range = range(2, 100, 5)
score1=[]
score2=[]
for tree in tree_range:
    xgb=XGBClassifier(n_estimators=tree,
                      eval_metric='logloss',  # binary problem: use logloss rather than the multiclass mlogloss
                      use_label_encoder=False)
    xgb.fit(X_train,y_train)
    score1.append(xgb.score(X_train,y_train))
    score2.append(xgb.score(X_test,y_test))

In [None]:
# Generate plot for - XGB to obtain the optimum value of trees to be used

%matplotlib inline
fig = plt.figure(figsize=(8,4))
fig.set_facecolor("#F3F3F3")
plt.plot(tree_range,score1,label= 'Accuracy on training set')
plt.plot(tree_range,score2,label= 'Accuracy on testing set')
plt.title('XGboost: Number of trees Vs Accuracy', fontsize =12)
plt.xlabel('Value of number of trees in XGboost', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.grid(None)  # positional form; the b= keyword is not accepted by newer Matplotlib releases
plt.legend(loc = "best",
               prop = {"size" : 12})
plt.show()

In [None]:
# Using the optimized parameters for XGB

xgb=XGBClassifier(n_estimators=100, use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train,y_train)
print('Accuracy of XGB n=100 on the testing dataset is :{:.4f}'.format(xgb.score(X_test,y_test)))

In [None]:
# testing various models

model_summary = []
ML_models = {}

In [None]:
%%time

# ML_models = {}

model_index = ['LR','RF','NN', 'GRB', 'DT','KN', 'XGB']
model_sklearn = [LogisticRegression(penalty = 'l2', C = 1, random_state=0),
                 RandomForestClassifier(n_jobs=3, min_samples_split = 20, min_samples_leaf = 5, random_state=0),
                 MLPClassifier([100]*5,early_stopping=True,learning_rate='adaptive',random_state=0),
                 GradientBoostingClassifier(n_estimators = 50, max_depth = 2, random_state = 0),
                 DecisionTreeClassifier(random_state=42, max_depth=6),
                 KNeighborsClassifier(n_neighbors=7, n_jobs=-1),
                 XGBClassifier(n_estimators=100,learning_rate=0.2,subsample=0.3, n_jobs=-1,
                              eval_metric='logloss',use_label_encoder=False)]

# model_summary = []

for name,model in zip(model_index,model_sklearn):
    ML_models[name] = model.fit(X_train,y_train)
    preds = model.predict(X_test)
    model_summary.append([name,f1_score(y_test,preds,average='weighted'),accuracy_score(y_test,preds),
                          roc_auc_score(y_test,model.predict_proba(X_test)[:,1]), precision_score(y_test,preds, average='weighted'),
                         recall_score(y_test,preds, average='weighted')])


print(ML_models)

In [None]:
# Let's look at the model evaluation parameters

model_summary = pd.DataFrame(model_summary,columns=['Name','F1_score','Accuracy', 'AUC_ROC', 'Precision', 'Recall'])
model_summary = model_summary.reset_index()
display(model_summary)

In [None]:
# Plotting all the evaluation parameters of all the models and comparing them to each other

plt.rcParams['figure.figsize']=(14,10)
plt.subplot(231)
g=sns.barplot(data=model_summary, x="Name", y="F1_score")
plt.title('ML Model Vs Respective model- F1 Score')

plt.subplot(232)
g=sns.barplot(data=model_summary, x="Name", y="Accuracy")
plt.title('ML Model Vs Respective model- Accuracy')

plt.subplot(233)
g=sns.barplot(data=model_summary, x="Name", y="Precision")
plt.title('ML Model Vs Respective model- Precision')

plt.subplot(234)
g=sns.barplot(data=model_summary, x="Name", y="Recall")
plt.title('ML Model Vs Respective model- Recall')

plt.subplot(235)
g=sns.barplot(data=model_summary, x="Name", y="AUC_ROC")
plt.title('ML Model Vs Respective model- AUC_ROC')

plt.show()

In [None]:
# Verifying the validity of our different models

# LR Model

ML_test = ML_models['LR'].predict(X_test)
ML_train = ML_models['LR'].predict(X_train)
LR_Model_Residuals = y_train == ML_train

print(); Line_Separator()
print ("LR model :- Number of values correctly predicted:"); Line_Separator()
print(pd.Series(LR_Model_Residuals).value_counts())

# RF Model
ML_test = ML_models['RF'].predict(X_test)
ML_train = ML_models['RF'].predict(X_train)
RF_Model_Residuals = y_train == ML_train

print(); Line_Separator()
print ("RF model :- Number of values correctly predicted:"); Line_Separator()
print(pd.Series(RF_Model_Residuals).value_counts()); Line_Separator()


# NN Model
ML_test = ML_models['NN'].predict(X_test)
ML_train = ML_models['NN'].predict(X_train)
NN_Model_Residuals = y_train == ML_train

print ("NN model :- Number of values correctly predicted:"); Line_Separator()
print(pd.Series(NN_Model_Residuals).value_counts()); Line_Separator()


# GRB Model
ML_test = ML_models['GRB'].predict(X_test)
ML_train = ML_models['GRB'].predict(X_train)
GRB_Model_Residuals = y_train == ML_train

print(); Line_Separator()
print ("GRB model :- Number of values correctly predicted:"); Line_Separator()
print(pd.Series(GRB_Model_Residuals).value_counts()); Line_Separator()


# DT Model
ML_test = ML_models['DT'].predict(X_test)
ML_train = ML_models['DT'].predict(X_train)
DT_Model_Residuals = y_train == ML_train

print(); Line_Separator()
print ("DT model :- Number of values correctly predicted:"); Line_Separator()
print(pd.Series(DT_Model_Residuals).value_counts()); Line_Separator()


# KN Model
ML_test = ML_models['KN'].predict(X_test)
ML_train = ML_models['KN'].predict(X_train)
KN_Model_Residuals = y_train == ML_train

print(); Line_Separator()
print ("KN model :- Number of values correctly predicted:"); Line_Separator()
print(pd.Series(KN_Model_Residuals).value_counts()); Line_Separator()

# XGB Model
ML_test = ML_models['XGB'].predict(X_test)
ML_train = ML_models['XGB'].predict(X_train)
XGB_Model_Residuals = y_train == ML_train

print(); Line_Separator()
print ("XGB model :- Number of values correctly predicted:"); Line_Separator()
print(pd.Series(XGB_Model_Residuals).value_counts()); Line_Separator()

In [None]:
# Confusion matrix for LR, RF, NN, GRB, DT, KN, XGB models

# LR
LR_conf_matrix = confusion_matrix(y_test,ML_models['LR'].predict(X_test))

# RF
RF_conf_matrix = confusion_matrix(y_test,ML_models['RF'].predict(X_test))

# NN
NN_conf_matrix = confusion_matrix(y_test,ML_models['NN'].predict(X_test))

# GRB
GRB_conf_matrix = confusion_matrix(y_test,ML_models['GRB'].predict(X_test))

# DT
DT_conf_matrix = confusion_matrix(y_test,ML_models['DT'].predict(X_test))

# KN
KN_conf_matrix = confusion_matrix(y_test,ML_models['KN'].predict(X_test))

# XGB
XGB_conf_matrix = confusion_matrix(y_test,ML_models['XGB'].predict(X_test))

fig = plt.figure(figsize=(15,18))
fig.set_facecolor("#F3F3F3")
plt.subplot(431)
ax= sns.heatmap(LR_conf_matrix,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("LR_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)


plt.subplot(432)
ax= sns.heatmap(RF_conf_matrix,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("RF_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)


plt.subplot(433)
ax= sns.heatmap(NN_conf_matrix,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("NN_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.subplot(434)
ax= sns.heatmap(GRB_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("GRB_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.subplot(435)
ax= sns.heatmap(DT_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("DT_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.subplot(436)
ax= sns.heatmap(KN_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("KN_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)


plt.subplot(437)
ax= sns.heatmap(XGB_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("XGB_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.show()

In [None]:
# Classification report

target_names = ["Not Fraudulent","Fraudulent"]

# LR Model
print(); Line_Separator()
print ("                LR Model - Classification Report"); Line_Separator()
print(classification_report(y_test, ML_models["LR"].predict(X_test))); Line_Separator()

#--------------------------------------------------------------------------------------------
# RF Model
print ("                RF Model - Classification Report"); Line_Separator()
print(classification_report(y_test, ML_models["RF"].predict(X_test))); Line_Separator()

#--------------------------------------------------------------------------------------------
# NN Model
print ("                NN Model - Classification Report"); Line_Separator()
print(classification_report(y_test, ML_models["NN"].predict(X_test))); Line_Separator()

#--------------------------------------------------------------------------------------------
# GRB Model
print ("                GRB Model - Classification Report"); Line_Separator()
print(classification_report(y_test, ML_models["GRB"].predict(X_test))); Line_Separator()


#--------------------------------------------------------------------------------------------
# DT Model
print ("                DT Model - Classification Report"); Line_Separator()
print(classification_report(y_test, ML_models["DT"].predict(X_test))); Line_Separator()

#--------------------------------------------------------------------------------------------
# KN Model
print ("                KN Model - Classification Report"); Line_Separator()
print(classification_report(y_test, ML_models["KN"].predict(X_test))); Line_Separator()


#--------------------------------------------------------------------------------------------
# XGB Model
print ("                XGB Model - Classification Report"); Line_Separator()
print(classification_report(y_test, ML_models["XGB"].predict(X_test))); Line_Separator()

In [None]:
# Compute specificity, sensitivity, false positive rate, positive predictive value and negative predictive value

# LR model
tn, fp, fn, tp  = confusion_matrix(y_test,ML_models['LR'].predict(X_test)).ravel()

print("            LR Model:")

# Let's see the sensitivity of our logistic regression model
print('Sensitivity               : ', tp / (tp+fn))

# Let us calculate specificity
print('Specificity               : ',tn /(tn+fp))

# Calculate false positive rate - flagging a transaction as fraudulent when it is not
print('False positive rate       : ',fp/(tn+fp))

# Positive predictive value
print('Positive predictive value : ', tp / (tp+fp))

# Negative predictive value
print('Negative predictive value : ',tn / (tn+ fn)); Line_Separator()

# ------------------------------------------------------------------------------------------------
# RF model
tn, fp, fn, tp  = confusion_matrix(y_test,ML_models['RF'].predict(X_test)).ravel()

print("            RF Model:")

# Sensitivity (recall) of the RF model
print('Sensitivity               : ', tp / (tp+fn))

# Let us calculate specificity
print('Specificity               : ',tn /(tn+fp))

# Calculate false positive rate - flagging a transaction as fraudulent when it is not
print('False positive rate       : ',fp/(tn+fp))

# positive predictive value
print('Positive predictive value : ', tp / (tp+fp))

# Negative predictive value
print('Negative predictive value : ',tn / (tn+ fn)); Line_Separator()

#-----------------------------------------------------------------------------------------------------
# NN model
tn, fp, fn, tp  = confusion_matrix(y_test,ML_models['NN'].predict(X_test)).ravel()


print("            NN Model:")

# Sensitivity (recall) of the NN model
print('Sensitivity               : ', tp / (tp+fn))

# Let us calculate specificity
print('Specificity               : ',tn /(tn+fp))

# Calculate false positive rate - flagging a transaction as fraudulent when it is not
print('False positive rate       : ',fp/(tn+fp))

# positive predictive value
print('Positive predictive value : ', tp / (tp+fp))

# Negative predictive value
print('Negative predictive value : ',tn / (tn+ fn)); Line_Separator()

#-----------------------------------------------------------------------------------------------------

# GRB model
tn, fp, fn, tp  = confusion_matrix(y_test,ML_models['GRB'].predict(X_test)).ravel()


print("            GRB Model:")

# Sensitivity (recall) of the GRB model
print('Sensitivity               : ', tp / (tp+fn))

# Let us calculate specificity
print('Specificity               : ',tn /(tn+fp))

# Calculate false positive rate - flagging a transaction as fraudulent when it is not
print('False positive rate       : ',fp/(tn+fp))

# positive predictive value
print('Positive predictive value : ', tp / (tp+fp))

# Negative predictive value
print('Negative predictive value : ',tn / (tn+ fn)); Line_Separator()


#-----------------------------------------------------------------------------------------------------

# DT model
tn, fp, fn, tp  = confusion_matrix(y_test,ML_models['DT'].predict(X_test)).ravel()


print("            DT Model:")

# Sensitivity (recall) of the DT model
print('Sensitivity               : ', tp / (tp+fn))

# Let us calculate specificity
print('Specificity               : ',tn /(tn+fp))

# Calculate false positive rate - flagging a transaction as fraudulent when it is not
print('False positive rate       : ',fp/(tn+fp))

# positive predictive value
print('Positive predictive value : ', tp / (tp+fp))

# Negative predictive value
print('Negative predictive value : ',tn / (tn+ fn)); Line_Separator()


#-----------------------------------------------------------------------------------------------------

# KN model
tn, fp, fn, tp  = confusion_matrix(y_test,ML_models['KN'].predict(X_test)).ravel()


print("            KN Model:")

# Sensitivity (recall) of the KN model
print('Sensitivity               : ', tp / (tp+fn))

# Let us calculate specificity
print('Specificity               : ',tn /(tn+fp))

# Calculate false positive rate - flagging a transaction as fraudulent when it is not
print('False positive rate       : ',fp/(tn+fp))

# positive predictive value
print('Positive predictive value : ', tp / (tp+fp))

# Negative predictive value
print('Negative predictive value : ',tn / (tn+ fn)); Line_Separator()

#-----------------------------------------------------------------------------------------------------

# XGB model
tn, fp, fn, tp  = confusion_matrix(y_test,ML_models['XGB'].predict(X_test)).ravel()


print("            XGB Model:")

# Sensitivity (recall) of the XGB model
print('Sensitivity               : ', tp / (tp+fn))

# Let us calculate specificity
print('Specificity               : ',tn /(tn+fp))

# Calculate false positive rate - flagging a transaction as fraudulent when it is not
print('False positive rate       : ',fp/(tn+fp))

# positive predictive value
print('Positive predictive value : ', tp / (tp+fp))

# Negative predictive value
print('Negative predictive value : ',tn / (tn+ fn)); Line_Separator()
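
The same five ratios are computed for every model above, so they could be wrapped in one helper. The sketch below is one possible shape for it; `confusion_rates` and the example counts are hypothetical, and it takes the four confusion-matrix counts directly rather than a fitted model.

```python
def confusion_rates(tn, fp, fn, tp):
    """Return common rates derived from binary confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Avoid ZeroDivisionError when a row or column of the matrix is empty
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),                # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),                # true negative rate
        "false_positive_rate": ratio(fp, tn + fp),        # 1 - specificity
        "positive_predictive_value": ratio(tp, tp + fp),  # precision
        "negative_predictive_value": ratio(tn, tn + fn),
    }

# Example counts chosen only for illustration
rates = confusion_rates(tn=90, fp=10, fn=5, tp=45)
for name, value in rates.items():
    print(f"{name:<26}: {value:.3f}")
```

In the notebook, the four arguments would come from `confusion_matrix(y_test, predictions).ravel()`, which for a binary problem yields the counts in the order `tn, fp, fn, tp`.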

In [None]:
# ROC - Curves for all the models
fig = plt.figure(figsize=(15,22))
fig.set_facecolor("#F3F3F3")


# LR Model
plt.subplot(431)
fpr,tpr,thresholds = roc_curve(y_test,ML_models['LR'].predict_proba(X_test)[:,1])
auc_score = roc_auc_score(y_test,ML_models['LR'].predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('LR Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")

# -----------------------------------------------------------------------------------------------------------------------
# RF Model
plt.subplot(432)
fpr,tpr,thresholds = roc_curve(y_test,ML_models['RF'].predict_proba(X_test)[:,1])
auc_score = roc_auc_score(y_test,ML_models['RF'].predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('RF Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")

# -----------------------------------------------------------------------------------------------------------------------
# NN Model
plt.subplot(433)
fpr,tpr,thresholds = roc_curve(y_test,ML_models['NN'].predict_proba(X_test)[:,1])
auc_score = roc_auc_score(y_test,ML_models['NN'].predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('NN Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")

# -----------------------------------------------------------------------------------------------------------------------
# GRB Model
plt.subplot(434)
fpr,tpr,thresholds = roc_curve(y_test,ML_models['GRB'].predict_proba(X_test)[:,1])
auc_score = roc_auc_score(y_test,ML_models['GRB'].predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('GRB Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")

# -----------------------------------------------------------------------------------------------------------------------
# DT Model
plt.subplot(435)
fpr,tpr,thresholds = roc_curve(y_test,ML_models['DT'].predict_proba(X_test)[:,1])
auc_score = roc_auc_score(y_test,ML_models['DT'].predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('DT Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")

# -----------------------------------------------------------------------------------------------------------------------
# KN Model
plt.subplot(436)
fpr,tpr,thresholds = roc_curve(y_test,ML_models['KN'].predict_proba(X_test)[:,1])
auc_score = roc_auc_score(y_test,ML_models['KN'].predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('KN Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")

# -----------------------------------------------------------------------------------------------------------------------
# XGB Model
plt.subplot(437)
fpr,tpr,thresholds = roc_curve(y_test,ML_models['XGB'].predict_proba(X_test)[:,1])
auc_score = roc_auc_score(y_test,ML_models['XGB'].predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, linestyle = "dotted", color = "royalblue",linewidth = 2, label='ROC curve (area = %0.5f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('XGB Model: ROC Curve', fontsize =12)
plt.legend(loc="lower right")
plt.tick_params(labelsize=12)
plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
plt.fill_between(fpr,tpr,alpha = .4)
plt.fill_between([0,1],[0,1],color = "y")
plt.show()
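
ROC AUC can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, which is why the curves above are built from `predict_proba` scores. A small pure-Python sketch of that pairwise definition follows, with made-up scores and the hypothetical name `pairwise_auc`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_toy = [0, 0, 0, 1, 1]
proba_toy = [0.1, 0.2, 0.3, 0.35, 0.9]                  # graded scores (invented)
labels_toy = [1 if p >= 0.5 else 0 for p in proba_toy]  # hard 0/1 predictions

auc_from_scores = pairwise_auc(y_toy, proba_toy)
auc_from_labels = pairwise_auc(y_toy, labels_toy)
print(auc_from_scores, auc_from_labels)
```

With these toy numbers the graded scores rank every fraudulent row above every legitimate one, while thresholding at 0.5 throws part of that ordering away and lowers the computed area.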


In [None]:
# Precision recall curves for all the models

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

# LR_Model

fig = plt.figure(figsize=(15,22))
fig.set_facecolor("#F3F3F3")
plt.subplot(431)

# precision_recall_curve returns (precision, recall, thresholds), in that order
precision,recall,thresholds = precision_recall_curve(y_test,ML_models['LR'].predict_proba(X_test)[:,1])
plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(y_test,ML_models['LR'].predict_proba(X_test)[:,1]),5))))
plt.plot([0,1],[0,0],linestyle = "dashed")
plt.fill_between(recall,precision,alpha = .2)
plt.legend(loc = 'best',
               prop = {"size" : 14})
plt.grid(True,alpha = .15)
plt.title( 'LR Model: precision_recall_curve', fontsize =12)
plt.xlabel("Recall_LR Model",fontsize =12)
plt.ylabel("Precision_LR Model",fontsize =12)
plt.xlim([0.25,1])
plt.yticks(np.arange(0,1,.3))
plt.tick_params(labelsize=12)

#------------------------------------------------------------------------------------------------------
# RF_Model

plt.subplot(432)

# precision_recall_curve returns (precision, recall, thresholds), in that order
precision,recall,thresholds = precision_recall_curve(y_test,ML_models['RF'].predict_proba(X_test)[:,1])
plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(y_test,ML_models['RF'].predict_proba(X_test)[:,1]),5))))
plt.plot([0,1],[0,0],linestyle = "dashed")
plt.fill_between(recall,precision,alpha = .2)
plt.legend(loc = 'best',
               prop = {"size" : 14})
plt.grid(True,alpha = .15)
plt.title( 'RF Model: precision_recall_curve', fontsize =12)
plt.xlabel("Recall_RF Model",fontsize =12)
plt.ylabel("Precision_RF Model",fontsize =12)
plt.xlim([0.25,1])
plt.yticks(np.arange(0,1,.3))
plt.tick_params(labelsize=12)

#------------------------------------------------------------------------------------------------------
# NN_Model

plt.subplot(433)

# precision_recall_curve returns (precision, recall, thresholds), in that order
precision,recall,thresholds = precision_recall_curve(y_test,ML_models['NN'].predict_proba(X_test)[:,1])
plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(y_test,ML_models['NN'].predict_proba(X_test)[:,1]),5))))
plt.plot([0,1],[0,0],linestyle = "dashed")
plt.fill_between(recall,precision,alpha = .2)
plt.legend(loc = 'best',
               prop = {"size" : 14})
plt.grid(True,alpha = .15)
plt.title( 'NN Model: precision_recall_curve', fontsize =12)
plt.xlabel("Recall_NN Model",fontsize =12)
plt.ylabel("Precision_NN Model",fontsize =12)
plt.xlim([0.25,1])
plt.yticks(np.arange(0,1,.3))
plt.tick_params(labelsize=12)

#------------------------------------------------------------------------------------------------------
# GRB_Model

plt.subplot(434)

# precision_recall_curve returns (precision, recall, thresholds), in that order
precision,recall,thresholds = precision_recall_curve(y_test,ML_models['GRB'].predict_proba(X_test)[:,1])
plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(y_test,ML_models['GRB'].predict_proba(X_test)[:,1]),5))))
plt.plot([0,1],[0,0],linestyle = "dashed")
plt.fill_between(recall,precision,alpha = .2)
plt.legend(loc = 'best',
               prop = {"size" : 14})
plt.grid(True,alpha = .15)
plt.title( 'GRB Model: precision_recall_curve', fontsize =12)
plt.xlabel("Recall_GRB Model",fontsize =12)
plt.ylabel("Precision_GRB Model",fontsize =12)
plt.xlim([0.25,1])
plt.yticks(np.arange(0,1,.3))
plt.tick_params(labelsize=12)

#------------------------------------------------------------------------------------------------------
# DT_Model

plt.subplot(435)

# precision_recall_curve returns (precision, recall, thresholds), in that order
DT_scores = ML_models['DT'].predict_proba(X_test)[:,1]
precision,recall,thresholds = precision_recall_curve(y_test,DT_scores)
plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(y_test,DT_scores),5))))
plt.plot([0,1],[0,0],linestyle = "dashed")
plt.fill_between(recall,precision,alpha = .2)
plt.legend(loc = 'best',
               prop = {"size" : 14})
plt.grid(True,alpha = .15)
plt.title( 'DT Model: precision_recall_curve', fontsize =12)
plt.xlabel("Recall_DT Model",fontsize =12)
plt.ylabel("Precision_DT Model",fontsize =12)
plt.xlim([0.25,1])
plt.yticks(np.arange(0,1,.3))
plt.tick_params(labelsize=12)

#------------------------------------------------------------------------------------------------------
# KN_Model

plt.subplot(436)

# precision_recall_curve returns (precision, recall, thresholds), in that order
KN_scores = ML_models['KN'].predict_proba(X_test)[:,1]
precision,recall,thresholds = precision_recall_curve(y_test,KN_scores)
plt.plot(recall,precision,linewidth = 1.5,
            label = ("avg_pcn : " +
                     str(np.around(average_precision_score(y_test,KN_scores),5))))
plt.plot([0,1],[0,0],linestyle = "dashed")
plt.fill_between(recall,precision,alpha = .2)
plt.legend(loc = 'best',
               prop = {"size" : 14})
plt.grid(True,alpha = .15)
plt.title( 'KN Model: precision_recall_curve', fontsize =12)
plt.xlabel("Recall_KN Model",fontsize =12)
plt.ylabel("Precision_KN Model",fontsize =12)
plt.xlim([0.25,1])
plt.yticks(np.arange(0,1,.3))
plt.tick_params(labelsize=12)

#------------------------------------------------------------------------------------------------------
# XGB_Model

plt.subplot(437)

# precision_recall_curve returns (precision, recall, thresholds), in that order
XGB_scores = ML_models['XGB'].predict_proba(X_test)[:,1]
precision,recall,thresholds = precision_recall_curve(y_test,XGB_scores)
plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(y_test,XGB_scores),5))))
plt.plot([0,1],[0,0],linestyle = "dashed")
plt.fill_between(recall,precision,alpha = .2)
plt.legend(loc = 'best',
               prop = {"size" : 14})
plt.grid(True,alpha = .15)
plt.title( 'XGB Model: precision_recall_curve', fontsize =12)
plt.xlabel("Recall_XGB Model",fontsize =12)
plt.ylabel("Precision_XGB Model",fontsize =12)
plt.xlim([0.25,1])
plt.yticks(np.arange(0,1,.3))
plt.tick_params(labelsize=12)

plt.show()
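
The average precision shown in each legend is meant to summarise a ranking of transactions by score, so it matters whether it is fed probabilities or hard 0/1 predictions. The sketch below is a hand-rolled illustration in plain Python, not scikit-learn's implementation. It uses the step-wise sum AP = Σ (R_n − R_{n−1}) · P_n, and the helper name `average_precision` and the toy numbers are assumptions made up for this example.

```python
def average_precision(y_true, y_score):
    """Step-wise average precision: sum of (recall gain) * precision
    over the distinct score thresholds, from highest to lowest."""
    n_pos = sum(y_true)
    ap, prev_recall = 0.0, 0.0
    for t in sorted(set(y_score), reverse=True):
        # Everything scored at or above t is flagged as fraud
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy data (made up): 1 = fraudulent, 0 = not fraudulent
y_true = [0, 0, 1, 0, 1, 1]
proba = [0.10, 0.40, 0.35, 0.20, 0.80, 0.45]   # what predict_proba(...)[:, 1] would give
hard = [1 if p > 0.5 else 0 for p in proba]    # what predict(...) would give

ap_from_proba = average_precision(y_true, proba)
ap_from_hard = average_precision(y_true, hard)

print("AP from probabilities  :", round(ap_from_proba, 4))
print("AP from hard 0/1 labels:", round(ap_from_hard, 4))
```

With hard labels the ranking collapses to two levels, so the two numbers generally differ, and the value computed from probabilities is the one that matches the plotted curve.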

In [None]:
# convert df with label

df= pd.DataFrame(X_train)
df_X_train=df.rename(columns={ 0:'V1', 1:'V2', 2:'V3', 3: 'V4', 4: 'V5', 5: 'V6', 6: 'V7', 7: 'V8',
                              8: 'V9', 9: 'V10', 10: 'V11', 11: 'V12', 12: 'V13', 13: 'V14', 14: 'V15',15: 'V16',
                              16:'V17', 17: 'V18', 18: 'V19', 19: 'V20', 20: 'V21', 21: 'V22',22: 'V23', 23: 'V24',
                              24 :'V25', 25: 'V26', 26: 'V27', 27: 'V28', 28:'Amount'})
df_X_train.head()

logreg = LogisticRegression(penalty = 'l2', C = 1, random_state=0).fit(X_train, y_train)
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)
# One tick label per fitted coefficient (V1..V28, Amount)
creditcard_features = list(df_X_train.columns)
fig = plt.figure(figsize=(10,6))
fig.set_facecolor("#F3F3F3")
plt.plot(logreg.coef_.T, 'o', label="C=1")
plt.plot(logreg100.coef_.T, '^', label="C=100")
plt.plot(logreg001.coef_.T, 'v', label="C=0.01")
plt.xticks(range(len(creditcard_features)), creditcard_features, rotation=90)
plt.hlines(0, 0, len(creditcard_features))
plt.ylim(-5, 5)
plt.grid(False)
plt.xlabel("Features",fontsize=12)
plt.ylabel("Coefficient magnitude", fontsize=12)
plt.tick_params(labelsize=12)
plt.legend(fontsize=12)
plt.savefig('log_coef')
plt.show()

In [None]:
# Variable importance
# Model - RF

importances = ML_models['RF'].feature_importances_

# make importance relative to the max importance
feature_importance = 100.0 * (importances / importances.max())
sorted_idx = np.argsort(feature_importance)
feature_names = list(df_X_train.columns.values)
feature_names_sort = [feature_names[indice] for indice in sorted_idx]
pos = np.arange(sorted_idx.shape[0]) + .5

# plot the result
fig = plt.figure(figsize=(9,8))
fig.set_facecolor("#F3F3F3")
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, feature_names_sort)
plt.title('RF Model: Relative Feature Importance', fontsize=12)
plt.tick_params(labelsize=12)
plt.show()

In [None]:
# Variable importance
# Model - GRB

importances = ML_models['GRB'].feature_importances_

# make importance relative to the max importance
feature_importance = 100.0 * (importances / importances.max())
sorted_idx = np.argsort(feature_importance)
feature_names = list(df_X_train.columns.values)
feature_names_sort = [feature_names[indice] for indice in sorted_idx]
pos = np.arange(sorted_idx.shape[0]) + .5

# plot the result
fig = plt.figure(figsize=(9,8))
fig.set_facecolor("#F3F3F3")
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, feature_names_sort)
plt.title('GRB Model: Relative Feature Importance', fontsize=12)
plt.tick_params(labelsize=12)

plt.show()

In [None]:
# Variable importance
# Model - DT

importances = ML_models['DT'].feature_importances_

# make importance relative to the max importance
feature_importance = 100.0 * (importances / importances.max())
sorted_idx = np.argsort(feature_importance)
feature_names = list(df_X_train.columns.values)
feature_names_sort = [feature_names[indice] for indice in sorted_idx]
pos = np.arange(sorted_idx.shape[0]) + .5

# plot the result
fig = plt.figure(figsize=(9,8))
fig.set_facecolor("#F3F3F3")
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, feature_names_sort)
plt.title('DT Model: Relative Feature Importance', fontsize=12)
plt.tick_params(labelsize=12)

plt.show()

In [None]:
# Variable importance
# Model - XGB
importances = ML_models['XGB'].feature_importances_

# make importance relative to the max importance
feature_importance = 100.0 * (importances / importances.max())
sorted_idx = np.argsort(feature_importance)
feature_names = list(df_X_train.columns.values)
feature_names_sort = [feature_names[indice] for indice in sorted_idx]
pos = np.arange(sorted_idx.shape[0]) + .5

# plot the result
fig = plt.figure(figsize=(9,8))
fig.set_facecolor("#F3F3F3")
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, feature_names_sort)
plt.title('XGB Model: Relative Feature Importance', fontsize=12)
plt.tick_params(labelsize=12)

plt.show()
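
The four importance cells above repeat the same two steps: rescale the importances so the largest is 100, then sort ascending for the horizontal bar chart. As a minimal sketch, those steps can be collected in one helper. The function name `relative_importance` and the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_importance(names, importances):
    """Scale importances so the largest equals 100, then sort ascending
    (the order plt.barh needs to put the top feature at the top)."""
    top = max(importances)
    scaled = [100.0 * value / top for value in importances]
    return sorted(zip(names, scaled), key=lambda pair: pair[1])


# Made-up importances for four of the feature columns
names = ["V1", "V2", "V3", "Amount"]
raw = [0.10, 0.40, 0.20, 0.05]

ranked = relative_importance(names, raw)
for name, score in ranked:
    print(f"{name:>6}: {score:6.1f}")
```

The same helper could be fed `ML_models['RF'].feature_importances_` and `df_X_train.columns`, assuming both have one entry per feature.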

# RF model permutation_importances

from sklearn.metrics import r2_score
from rfpimp import permutation_importances

def r2(model, X_train, y_train):
    return r2_score(y_train, model.predict(X_train))

perm_imp_rfpimp = permutation_importances(ML_models['RF'], df_X_train, y_train, r2).reset_index()

# plot the result
fig = plt.figure(figsize=(9,8))
fig.set_facecolor("#F3F3F3")
g=sns.barplot(data=perm_imp_rfpimp, y="Feature", x="Importance")
plt.title('Permutation_importances', fontsize=12)
g.grid(False)
plt.tick_params(labelsize=12)
plt.show()
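
Permutation importance asks how much a score drops when one column is shuffled, which breaks its link to the target while leaving the fitted model untouched. The sketch below shows the idea in plain Python under stated assumptions: it scores with accuracy rather than the `r2` function used above, `toy_model` stands in for a fitted classifier, and it does not reproduce how `rfpimp` works internally.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance_sketch(predict, rows, labels, seed=0):
    """Drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only this one column permuted
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return drops


def toy_model(row):
    """Stand-in for a fitted classifier: looks at column 0 only."""
    return 1 if row[0] > 0.5 else 0


# Toy data: the label depends on column 0; column 1 is irrelevant
n = 200
rows = [[i / n, (i * 7 % 13) / 13] for i in range(n)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

drops = permutation_importance_sketch(toy_model, rows, labels)
print("accuracy drop per column:", drops)
```

A column the model never uses shows a drop of exactly zero, while shuffling the informative column pushes accuracy toward chance.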

In [None]:
# Convert X_test to a DataFrame with named feature columns
df_X_t =pd.DataFrame(X_test)
df_X_test=df_X_t.rename(columns={0:'V1', 1:'V2', 2:'V3', 3: 'V4', 4: 'V5', 5: 'V6', 6: 'V7', 7: 'V8',
                              8: 'V9', 9: 'V10', 10: 'V11', 11: 'V12', 12: 'V13', 13: 'V14', 14: 'V15',15: 'V16',
                              16:'V17', 17: 'V18', 18: 'V19', 19: 'V20', 20: 'V21', 21: 'V22',22: 'V23', 23: 'V24',
                              24 :'V25', 25: 'V26', 26: 'V27', 27: 'V28', 28:'Amount'})

In [None]:
# Cutoff - target for each model on probability

# LR Model
LR_y_pred_test = ML_models['LR'].predict_proba(X_test)
LR_y_pred_df = pd.DataFrame(LR_y_pred_test)
LR_y_pred = LR_y_pred_df.iloc[:,[1]]

y_test.index =LR_y_pred.index
LR_y_pred_final = pd.concat([y_test,LR_y_pred],axis=1)

LR_y_pred_final= LR_y_pred_final.rename(columns={ 1 : 'LR_Fraudulent_prob'})

# RF Model
RF_y_pred_test = ML_models['RF'].predict_proba(X_test)
RF_y_pred_df = pd.DataFrame(RF_y_pred_test)
RF_y_pred = RF_y_pred_df.iloc[:,[1]]

y_test.index =RF_y_pred.index
RF_y_pred_final = pd.concat([y_test,RF_y_pred],axis=1)

RF_y_pred_final= RF_y_pred_final.rename(columns={ 1 : 'RF_Fraudulent_prob'})

# NN Model
NN_y_pred_test = ML_models['NN'].predict_proba(X_test)
NN_y_pred_df = pd.DataFrame(NN_y_pred_test)
NN_y_pred = NN_y_pred_df.iloc[:,[1]]

y_test.index =NN_y_pred.index
NN_y_pred_final = pd.concat([y_test,NN_y_pred],axis=1)

NN_y_pred_final= NN_y_pred_final.rename(columns={ 1 : 'NN_Fraudulent_prob'})

# GRB Model
GRB_y_pred_test = ML_models['GRB'].predict_proba(X_test)
GRB_y_pred_df = pd.DataFrame(GRB_y_pred_test)
GRB_y_pred = GRB_y_pred_df.iloc[:,[1]]

y_test.index =GRB_y_pred.index
GRB_y_pred_final = pd.concat([y_test,GRB_y_pred],axis=1)

GRB_y_pred_final= GRB_y_pred_final.rename(columns={ 1 : 'GRB_Fraudulent_prob'})

# DT Model

DT_y_pred_test = ML_models['DT'].predict_proba(X_test)
DT_y_pred_df = pd.DataFrame(DT_y_pred_test)
DT_y_pred = DT_y_pred_df.iloc[:,[1]]

y_test.index =DT_y_pred.index
DT_y_pred_final = pd.concat([y_test,DT_y_pred],axis=1)

DT_y_pred_final= DT_y_pred_final.rename(columns={ 1 : 'DT_Fraudulent_prob'})

# KN Model
KN_y_pred_test = ML_models['KN'].predict_proba(X_test)
KN_y_pred_df = pd.DataFrame(KN_y_pred_test)
KN_y_pred = KN_y_pred_df.iloc[:,[1]]

y_test.index =KN_y_pred.index
KN_y_pred_final = pd.concat([y_test,KN_y_pred],axis=1)

KN_y_pred_final= KN_y_pred_final.rename(columns={ 1 : 'KN_Fraudulent_prob'})


# XGB Model
XGB_y_pred_test = ML_models['XGB'].predict_proba(X_test)
XGB_y_pred_df = pd.DataFrame(XGB_y_pred_test)
XGB_y_pred = XGB_y_pred_df.iloc[:,[1]]

y_test.index =XGB_y_pred.index
XGB_y_pred_final = pd.concat([y_test,XGB_y_pred],axis=1)

XGB_y_pred_final= XGB_y_pred_final.rename(columns={ 1 : 'XGB_Fraudulent_prob'})

In [None]:
# Let's create columns with different probability cutoffs
# LR Model
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
     LR_y_pred_final[i]= LR_y_pred_final.LR_Fraudulent_prob.map(lambda x: 1 if x > i else 0)
print(); Line_Separator1()
print ("LR Model -create columns with different probability cutoffs"); Line_Separator1()
print(LR_y_pred_final.head()); Line_Separator1()

# RF Model
for j in numbers:
     RF_y_pred_final[j]= RF_y_pred_final.RF_Fraudulent_prob.map(lambda x: 1 if x > j else 0)
print(); Line_Separator1()
print ("RF Model -create columns with different probability cutoffs"); Line_Separator1()
print(RF_y_pred_final.head()); Line_Separator1()


# NN Model
for k in numbers:
     NN_y_pred_final[k]= NN_y_pred_final.NN_Fraudulent_prob.map(lambda x: 1 if x > k else 0)
print(); Line_Separator1()
print ("NN Model -create columns with different probability cutoffs"); Line_Separator1()
print(NN_y_pred_final.head()); Line_Separator1()

# GRB Model
for l in numbers:
     GRB_y_pred_final[l]= GRB_y_pred_final.GRB_Fraudulent_prob.map(lambda x: 1 if x > l else 0)
print(); Line_Separator1()
print ("GRB Model -create columns with different probability cutoffs"); Line_Separator1()
print(GRB_y_pred_final.head());Line_Separator1()

# DT Model
for m in numbers:
     DT_y_pred_final[m]= DT_y_pred_final.DT_Fraudulent_prob.map(lambda x: 1 if x > m else 0)
print(); Line_Separator1()
print ("DT Model -create columns with different probability cutoffs"); Line_Separator1()
print(DT_y_pred_final.head());Line_Separator1()

#KN Model
for n in numbers:
     KN_y_pred_final[n]= KN_y_pred_final.KN_Fraudulent_prob.map(lambda x: 1 if x > n else 0)
print(); Line_Separator1()
print ("KN Model -create columns with different probability cutoffs"); Line_Separator1()
print(KN_y_pred_final.head());Line_Separator1()

# XGB Model
for o in numbers:
     XGB_y_pred_final[o]= XGB_y_pred_final.XGB_Fraudulent_prob.map(lambda x: 1 if x > o else 0)
print(); Line_Separator1()
print ("XGB Model -create columns with different probability cutoffs"); Line_Separator1()
print(XGB_y_pred_final.head());Line_Separator1()

In [None]:
# Now let's calculate accuracy , sensitivity, and specificity for various probability cutoffs.

# LR model
print(); Line_Separator1()
print ("LR Model -prob, accuracy, sensi, speci"); Line_Separator1()
LR_cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(LR_y_pred_final.Class, LR_y_pred_final[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
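    # confusion_matrix layout: [[TN, FP], [FN, TP]] with Class 1 (fraud) as positive
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])  # TP / (TP + FN)
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])  # TN / (TN + FP)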
    LR_cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(LR_cutoff_df); Line_Separator1()


# RF model
print(); Line_Separator1()
print ("RF Model -prob, accuracy, sensi, speci"); Line_Separator1()
RF_cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(RF_y_pred_final.Class, RF_y_pred_final[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
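    # confusion_matrix layout: [[TN, FP], [FN, TP]] with Class 1 (fraud) as positive
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])  # TP / (TP + FN)
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])  # TN / (TN + FP)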
    RF_cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(RF_cutoff_df); Line_Separator1()


# NN model
print(); Line_Separator1()
print ("NN Model -prob, accuracy, sensi, speci"); Line_Separator1()
NN_cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(NN_y_pred_final.Class, NN_y_pred_final[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
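    # confusion_matrix layout: [[TN, FP], [FN, TP]] with Class 1 (fraud) as positive
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])  # TP / (TP + FN)
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])  # TN / (TN + FP)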
    NN_cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(NN_cutoff_df); Line_Separator1()


# GRB model
print(); Line_Separator1()
print ("GRB Model -prob, accuracy, sensi, speci"); Line_Separator1()
GRB_cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(GRB_y_pred_final.Class, GRB_y_pred_final[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
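    # confusion_matrix layout: [[TN, FP], [FN, TP]] with Class 1 (fraud) as positive
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])  # TP / (TP + FN)
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])  # TN / (TN + FP)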
    GRB_cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(GRB_cutoff_df); Line_Separator1()


# DT model
print(); Line_Separator1()
print ("DT Model -prob, accuracy, sensi, speci"); Line_Separator1()
DT_cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(DT_y_pred_final.Class, DT_y_pred_final[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
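    # confusion_matrix layout: [[TN, FP], [FN, TP]] with Class 1 (fraud) as positive
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])  # TP / (TP + FN)
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])  # TN / (TN + FP)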
    DT_cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(DT_cutoff_df); Line_Separator1()

# KN model
print(); Line_Separator1()
print ("KN Model -prob, accuracy, sensi, speci"); Line_Separator1()
KN_cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(KN_y_pred_final.Class, KN_y_pred_final[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
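    # confusion_matrix layout: [[TN, FP], [FN, TP]] with Class 1 (fraud) as positive
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])  # TP / (TP + FN)
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])  # TN / (TN + FP)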
    KN_cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(KN_cutoff_df); Line_Separator1()

# XGB model
print(); Line_Separator1()
print ("XGB Model -prob, accuracy, sensi, speci"); Line_Separator1()
XGB_cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(XGB_y_pred_final.Class, XGB_y_pred_final[i])
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
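    # confusion_matrix layout: [[TN, FP], [FN, TP]] with Class 1 (fraud) as positive
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])  # TP / (TP + FN)
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])  # TN / (TN + FP)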
    XGB_cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(XGB_cutoff_df); Line_Separator1()
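
To make the cutoff tables easier to read, it helps to see how accuracy, sensitivity, and specificity move on a tiny example. The sketch below counts the four confusion-matrix cells by hand, with class 1 (fraudulent) as the positive class. The helper names and the six made-up transactions are assumptions for illustration, and the code assumes both classes are present so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) with class 1 treated as the positive class."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, y_pred):
        if y == 1 and p == 1:
            tp += 1
        elif y == 1 and p == 0:
            fn += 1
        elif y == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def sweep_cutoffs(y_true, proba, cutoffs):
    """One row of accuracy / sensitivity / specificity per probability cutoff."""
    rows = []
    for c in cutoffs:
        y_pred = [1 if p > c else 0 for p in proba]
        tn, fp, fn, tp = confusion_counts(y_true, y_pred)
        rows.append({
            "prob": c,
            "accuracy": (tp + tn) / len(y_true),
            "sensi": tp / (tp + fn),   # share of frauds that are caught
            "speci": tn / (tn + fp),   # share of genuine transactions left alone
        })
    return rows


# Made-up labels and fraud probabilities
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.05, 0.20, 0.35, 0.60, 0.55, 0.90]

table = sweep_cutoffs(y_true, proba, [0.0, 0.5, 0.8])
for row in table:
    print(row)
```

At a cutoff of 0.0 every transaction is flagged, so sensitivity is 1 and specificity is 0. Raising the cutoff trades sensitivity for specificity.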

In [None]:
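## Finding the optimal cutoff threshold
def Find_Optimal_Cutoff(target, predicted):
    """Return (as a one-element list) the ROC threshold at which
    sensitivity (tpr) and specificity (1 - fpr) are closest to equal."""
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr))
    # tf = tpr - (1 - fpr) is zero where sensitivity equals specificity
    roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
    # Keep the single row whose |tf| is smallest
    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]

    return list(roc_t['threshold'])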

# Find optimal probability threshold
LR_threshold = Find_Optimal_Cutoff(LR_y_pred_final.Class,LR_y_pred_final.LR_Fraudulent_prob)
RF_threshold = Find_Optimal_Cutoff(RF_y_pred_final.Class,RF_y_pred_final.RF_Fraudulent_prob)
NN_threshold = Find_Optimal_Cutoff(NN_y_pred_final.Class,NN_y_pred_final.NN_Fraudulent_prob)
GRB_threshold = Find_Optimal_Cutoff(GRB_y_pred_final.Class,GRB_y_pred_final.GRB_Fraudulent_prob)
DT_threshold = Find_Optimal_Cutoff(DT_y_pred_final.Class,DT_y_pred_final.DT_Fraudulent_prob)
KN_threshold = Find_Optimal_Cutoff(KN_y_pred_final.Class,KN_y_pred_final.KN_Fraudulent_prob)
XGB_threshold = Find_Optimal_Cutoff(XGB_y_pred_final.Class,XGB_y_pred_final.XGB_Fraudulent_prob)

print(); Line_Separator1()
print ("ML_Models: Optimal Cutoff Threshold"); Line_Separator1()

print('LR Model Cutoff threshold: ', LR_threshold)
print('RF Model Cutoff threshold: ', RF_threshold)
print('NN Model Cutoff threshold: ', NN_threshold)
print('GRB Model Cutoff threshold: ', GRB_threshold)
print('DT Model Cutoff threshold: ', DT_threshold)
print('KN Model Cutoff threshold: ', KN_threshold)
print('XGB Model Cutoff threshold: ', XGB_threshold); Line_Separator1()
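
The cutoff chosen above is the point where sensitivity and specificity are as close to equal as possible. The plain-Python sketch below walks through the same criterion on eight made-up transactions and then applies the computed cutoff directly instead of copying a printed value by hand. The helper name `balanced_cutoff` is hypothetical, and scores at or above the cutoff are treated as fraud, which is an assumption of this sketch.

```python
def balanced_cutoff(y_true, proba):
    """Pick the score threshold where |TPR - (1 - FPR)| is smallest,
    i.e. where sensitivity and specificity are closest to equal."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_gap = None, float("inf")
    for t in sorted(set(proba), reverse=True):
        tp = sum(1 for y, p in zip(y_true, proba) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, proba) if p >= t and y == 0)
        gap = abs(tp / n_pos - (1 - fp / n_neg))
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t


# Made-up labels and fraud probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.05, 0.10, 0.30, 0.70, 0.20, 0.60, 0.80, 0.90]

cutoff = balanced_cutoff(y_true, proba)
flags = [1 if p >= cutoff else 0 for p in proba]   # reuse the computed value

print("cutoff:", cutoff)
print("flags :", flags)
```

Reusing the computed value keeps the predictions in step with the model if it is retrained and the optimal threshold moves.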

In [None]:
# Creating new column 'predicted' with 1 if Fraudulent > optimal cutoff threshold of each model else 0
# LR model
# Use the computed threshold (previously hard-coded as 0.6420354633413775)
LR_y_pred_final['pred_Fraudulent'] = LR_y_pred_final.LR_Fraudulent_prob.map(lambda x: 1 if x > LR_threshold[0] else 0)
print(); Line_Separator()
print('LR_Model');Line_Separator()
print (LR_y_pred_final.Class.value_counts())

# RF Model
# Use the computed threshold (previously hard-coded as 0.016428571428571428)
RF_y_pred_final['pred_Fraudulent'] = RF_y_pred_final.RF_Fraudulent_prob.map(lambda x: 1 if x > RF_threshold[0] else 0)
print(); Line_Separator()
print('RF_Model');Line_Separator()
print (RF_y_pred_final.Class.value_counts())

# NN Model
# Use the computed threshold (previously hard-coded as 0.00000000005824619724)
NN_y_pred_final['pred_Fraudulent'] = NN_y_pred_final.NN_Fraudulent_prob.map(lambda x: 1 if x > NN_threshold[0] else 0)
print(); Line_Separator()
print('NN_Model');Line_Separator()
print (NN_y_pred_final.Class.value_counts())

# GRB Model
# Use the computed threshold (previously hard-coded as 0.4782707970396372)
GRB_y_pred_final['pred_Fraudulent'] = GRB_y_pred_final.GRB_Fraudulent_prob.map(lambda x: 1 if x > GRB_threshold[0] else 0)
print(); Line_Separator()
print('GRB_Model');Line_Separator()
print (GRB_y_pred_final.Class.value_counts())

# DT Model
# Use the computed threshold (previously hard-coded as 0.6093012806110987)
DT_y_pred_final['pred_Fraudulent'] = DT_y_pred_final.DT_Fraudulent_prob.map(lambda x: 1 if x > DT_threshold[0] else 0)
print(); Line_Separator()
print('DT_Model');Line_Separator()
print(DT_y_pred_final.Class.value_counts())

# KN Model
# Use the computed threshold (previously hard-coded as 0.14285714285714285)
KN_y_pred_final['pred_Fraudulent'] = KN_y_pred_final.KN_Fraudulent_prob.map(lambda x: 1 if x > KN_threshold[0] else 0)
print();Line_Separator()
print('KN_Model');Line_Separator()
print (KN_y_pred_final.Class.value_counts())

# XGB Model

# Use the computed threshold (previously hard-coded as 0.07481219619512558)
XGB_y_pred_final['pred_Fraudulent'] = XGB_y_pred_final.XGB_Fraudulent_prob.map(lambda x: 1 if x > XGB_threshold[0] else 0)
print();Line_Separator()
print('XGB_Model');Line_Separator()
print (XGB_y_pred_final.Class.value_counts());Line_Separator()

In [None]:
# Confusion matrix for LR, RF, NN, GRB, DT, KN, XGB models after incorporating cutoff target value of each model

# LR
LR_conf_matrix = metrics.confusion_matrix(LR_y_pred_final.Class, LR_y_pred_final.pred_Fraudulent)

# RF
RF_conf_matrix = metrics.confusion_matrix(RF_y_pred_final.Class, RF_y_pred_final.pred_Fraudulent)

# NN
NN_conf_matrix = metrics.confusion_matrix(NN_y_pred_final.Class, NN_y_pred_final.pred_Fraudulent)

# GRB
GRB_conf_matrix = metrics.confusion_matrix(GRB_y_pred_final.Class, GRB_y_pred_final.pred_Fraudulent)

# DT
DT_conf_matrix = metrics.confusion_matrix(DT_y_pred_final.Class, DT_y_pred_final.pred_Fraudulent)

# KN
KN_conf_matrix = metrics.confusion_matrix(KN_y_pred_final.Class, KN_y_pred_final.pred_Fraudulent)

# XGB
XGB_conf_matrix = metrics.confusion_matrix(XGB_y_pred_final.Class, XGB_y_pred_final.pred_Fraudulent)



fig = plt.figure(figsize=(15,18))
fig.set_facecolor("#F3F3F3")
plt.subplot(431)
ax= sns.heatmap(LR_conf_matrix,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("LR_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)


plt.subplot(432)
ax= sns.heatmap(RF_conf_matrix,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("RF_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)


plt.subplot(433)
ax= sns.heatmap(NN_conf_matrix,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("NN_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.subplot(434)
ax= sns.heatmap(GRB_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("GRB_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.subplot(435)
ax= sns.heatmap(DT_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("DT_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.subplot(436)
ax= sns.heatmap(KN_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("KN_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)


plt.subplot(437)
ax= sns.heatmap(XGB_conf_matrix ,annot=True, annot_kws={"fontsize":15}, fmt = "",square = True,
                xticklabels=["Not Fraudulent","Fraudulent"],
                yticklabels=["Not Fraudulent","Fraudulent"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
plt.title("XGB_conf_matrix", fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=90, horizontalalignment='right')
#bottom, top = ax.get_ylim()
#ax.set_ylim(bottom + 0.5, top - 0.5)

plt.show()

In [None]:
# Confusion matrix for LR, RF, NN, GRB, DT, KN, XGB models after incorporating cutoff target value of each model

# LR model

TP_LR = LR_conf_matrix[1,1] # true positive
TN_LR = LR_conf_matrix[0,0] # true negatives
FP_LR = LR_conf_matrix[0,1] # false positives
FN_LR = LR_conf_matrix[1,0] # false negatives

# RF model

TP_RF = RF_conf_matrix[1,1] # true positive
TN_RF = RF_conf_matrix[0,0] # true negatives
FP_RF = RF_conf_matrix[0,1] # false positives
FN_RF = RF_conf_matrix[1,0] # false negatives

# NN model

TP_NN = NN_conf_matrix[1,1] # true positive
TN_NN = NN_conf_matrix[0,0] # true negatives
FP_NN = NN_conf_matrix[0,1] # false positives
FN_NN = NN_conf_matrix[1,0] # false negatives

# GRB model

TP_GRB = GRB_conf_matrix[1,1] # true positive
TN_GRB = GRB_conf_matrix[0,0] # true negatives
FP_GRB = GRB_conf_matrix[0,1] # false positives
FN_GRB = GRB_conf_matrix[1,0] # false negatives

# DT model

TP_DT = DT_conf_matrix[1,1] # true positive
TN_DT = DT_conf_matrix[0,0] # true negatives
FP_DT = DT_conf_matrix[0,1] # false positives
FN_DT = DT_conf_matrix[1,0] # false negatives

# KN model

TP_KN = KN_conf_matrix[1,1] # true positive
TN_KN = KN_conf_matrix[0,0] # true negatives
FP_KN = KN_conf_matrix[0,1] # false positives
FN_KN = KN_conf_matrix[1,0] # false negatives

# XGB  model

TP_XGB = XGB_conf_matrix[1,1] # true positive
TN_XGB = XGB_conf_matrix[0,0] # true negatives
FP_XGB = XGB_conf_matrix[0,1] # false positives
FN_XGB = XGB_conf_matrix[1,0] # false negatives

In [None]:
print();Line_Separator()

# LR model statistics-----------------------------------------------------------
print("LR Model :");Line_Separator()

# Let's see the sensitivity of our logistic regression model
print('Sensitivity / Recall      : ', TP_LR / float(TP_LR+FN_LR))

# Let us calculate specificity (precision is the positive predictive value, not specificity)
print('Specificity               : ',TN_LR / float(TN_LR+FP_LR))

# Calculate false positive rate - predicting Fraudulent when the transaction is not Fraudulent
print('False positive rate       : ',FP_LR/ float(TN_LR+FP_LR))

# positive predictive value
print('Positive predictive value : ', TP_LR / float(TP_LR+FP_LR))

# Negative predictive value
print('Negative predictive value : ', TN_LR / float(TN_LR+ FN_LR))

# F1 Score

LR_Precision  = TP_LR/float(TP_LR + FP_LR)
LR_recall     = TP_LR/float(TP_LR + FN_LR)
LR_F1_Score      = 2*((LR_Precision*LR_recall)/(LR_Precision+LR_recall))
print ('F1 Score                  : ',  LR_F1_Score)

# Accuracy

LR_Accuracy  = (TP_LR + TN_LR)/ float(TP_LR+TN_LR+FP_LR+FN_LR)
print ('Accuracy                  : ',  LR_Accuracy);Line_Separator()



# RF model statistics -------------------------------------------------------------

print("RF Model :");Line_Separator()

# Let's see the sensitivity of our RF model
print('Sensitivity               : ', TP_RF / float(TP_RF+FN_RF))

# Let us calculate specificity
print('Specificity               : ',TN_RF / float(TN_RF+FP_RF))

# Calculate false positive rate - predicting Fraudulent when the transaction is not Fraudulent
print('False positive rate       : ',FP_RF/ float(TN_RF+FP_RF))

# positive predictive value
print('Positive predictive value : ', TP_RF / float(TP_RF+FP_RF))

# Negative predictive value
print('Negative predictive value : ',TN_RF / float(TN_RF+ FN_RF))

# F1 Score

RF_Precision  = TP_RF/float(TP_RF + FP_RF)
RF_recall     = TP_RF/float(TP_RF + FN_RF)
RF_F1_Score      = 2*((RF_Precision*RF_recall)/(RF_Precision+RF_recall))
print ('F1 Score                  : ',  RF_F1_Score)

# Accuracy

RF_Accuracy  = (TP_RF + TN_RF)/ float(TP_RF+TN_RF+FP_RF+FN_RF)
print ('Accuracy                  : ',  RF_Accuracy);Line_Separator()


# NN model statistics -------------------------------------------------------------

print("NN Model :");Line_Separator()

# Let's see the sensitivity of our NN model
print('Sensitivity               : ', TP_NN / float(TP_NN+FN_NN))

# Let us calculate specificity
print('Specificity               : ',TN_NN / float(TN_NN+FP_NN))

# Calculate false positive rate - predicting Fraudulent when the transaction is not Fraudulent
print('False positive rate       : ',FP_NN/ float(TN_NN+FP_NN))

# positive predictive value
print('Positive predictive value : ', TP_NN / float(TP_NN+FP_NN))

# Negative predictive value
print('Negative predictive value : ', TN_NN / float(TN_NN+ FN_NN))

# F1 Score

NN_Precision  = TP_NN/float(TP_NN + FP_NN)
NN_recall     = TP_NN/float(TP_NN + FN_NN)
NN_F1_Score      = 2*((NN_Precision*NN_recall)/(NN_Precision+NN_recall))
print ('F1 Score                  : ',  NN_F1_Score)

# Accuracy

NN_Accuracy  = (TP_NN + TN_NN)/ float(TP_NN+TN_NN+FP_NN+FN_NN)
print ('Accuracy                  : ',  NN_Accuracy);Line_Separator()

# GRB model statistics -------------------------------------------------------------

print("GRB Model :");Line_Separator()

# Let's see the sensitivity of our GRB model
print('Sensitivity               : ', TP_GRB / float(TP_GRB+FN_GRB))

# Let us calculate specificity
print('Specificity               : ',TN_GRB / float(TN_GRB+FP_GRB))

# Calculate false positive rate - predicting Fraudulent when the transaction is not Fraudulent
print('False positive rate       : ',FP_GRB/ float(TN_GRB+FP_GRB))

# positive predictive value
print('Positive predictive value : ', TP_GRB / float(TP_GRB+FP_GRB))

# Negative predictive value
print('Negative predictive value : ', TN_GRB / float(TN_GRB+ FN_GRB))

# F1 Score

GRB_Precision  = TP_GRB/float(TP_GRB + FP_GRB)
GRB_recall     = TP_GRB/float(TP_GRB + FN_GRB)
GRB_F1_Score      = 2*((GRB_Precision*GRB_recall)/(GRB_Precision+GRB_recall))
print ('F1 Score                  : ',  GRB_F1_Score)

# Accuracy

GRB_Accuracy  = (TP_GRB + TN_GRB)/ float(TP_GRB+TN_GRB+FP_GRB+FN_GRB)
print ('Accuracy                  : ',  GRB_Accuracy);Line_Separator()

# DT model statistics -------------------------------------------------------------

print("DT Model :");Line_Separator()

# Let's see the sensitivity of our DT model
print('Sensitivity               : ', TP_DT / float(TP_DT+FN_DT))

# Let us calculate specificity
print('Specificity               : ',TN_DT / float(TN_DT+FP_DT))

# Calculate false positive rate - predicting Fraudulent when the transaction is not Fraudulent
print('False positive rate       : ',FP_DT/ float(TN_DT+FP_DT))

# positive predictive value
print('Positive predictive value : ', TP_DT / float(TP_DT+FP_DT))

# Negative predictive value
print('Negative predictive value : ', TN_DT / float(TN_DT+ FN_DT))

# F1 Score

DT_Precision  = TP_DT/float(TP_DT + FP_DT)
DT_recall     = TP_DT/float(TP_DT + FN_DT)
DT_F1_Score      = 2*((DT_Precision*DT_recall)/(DT_Precision+DT_recall))
print ('F1 Score                  : ',  DT_F1_Score)

# Accuracy

DT_Accuracy  = (TP_DT + TN_DT)/ float(TP_DT+TN_DT+FP_DT+FN_DT)
print ('Accuracy                  : ',  DT_Accuracy);Line_Separator()

# KN model statistics -------------------------------------------------------------

print("KN Model :");Line_Separator()

# Let's see the sensitivity of our KN model
print('Sensitivity               : ', TP_KN / float(TP_KN+FN_KN))

# Let us calculate specificity
print('Specificity               : ',TN_KN / float(TN_KN+FP_KN))

# Calculate false positive rate - predicting Fraudulent when the transaction is not Fraudulent
print('False positive rate       : ',FP_KN/ float(TN_KN+FP_KN))

# positive predictive value
print('Positive predictive value : ', TP_KN / float(TP_KN+FP_KN))

# Negative predictive value
print('Negative predictive value : ',TN_KN / float(TN_KN+ FN_KN))

# F1 Score

KN_Precision  = TP_KN/float(TP_KN + FP_KN)
KN_recall     = TP_KN/float(TP_KN + FN_KN)
KN_F1_Score      = 2*((KN_Precision*KN_recall)/(KN_Precision+KN_recall))
print ('F1 Score                  : ',  KN_F1_Score)

# Accuracy

KN_Accuracy  = (TP_KN + TN_KN)/ float(TP_KN+TN_KN+FP_KN+FN_KN)
print ('Accuracy                  : ', KN_Accuracy);Line_Separator()

# XGB model statistics -------------------------------------------------------------

print("XGB Model :");Line_Separator()

# Sensitivity (recall) of the XGB model
print('Sensitivity               : ', TP_XGB / float(TP_XGB+FN_XGB))

# Let us calculate specificity
print('Specificity               : ',TN_XGB / float(TN_XGB+FP_XGB))

# False positive rate: predicting fraud when the transaction is not fraudulent
print('False positive rate       : ',FP_XGB/ float(TN_XGB+FP_XGB))

# positive predictive value
print('Positive predictive value : ', TP_XGB / float(TP_XGB+FP_XGB))

# Negative predictive value
print('Negative predictive value : ',TN_XGB / float(TN_XGB+ FN_XGB))

# F1 Score

XGB_Precision  = TP_XGB/float(TP_XGB + FP_XGB)
XGB_recall     = TP_XGB/float(TP_XGB + FN_XGB)
XGB_F1_Score      = 2*((XGB_Precision*XGB_recall)/(XGB_Precision+XGB_recall))
print ('F1 Score                  : ',  XGB_F1_Score)

# Accuracy

XGB_Accuracy  = (TP_XGB + TN_XGB)/ float(TP_XGB+TN_XGB+FP_XGB+FN_XGB)
print ('Accuracy                  : ',  XGB_Accuracy);Line_Separator()
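
The per-model blocks above repeat the same arithmetic with different variable names, which is how the "logistic regression" comment ended up copied into other sections. As a minimal sketch, the same statistics could be computed by one helper that takes the four confusion-matrix counts. The names `confusion_metrics` and `print_model_report`, and the counts in the usage line, are hypothetical and not part of the original notebook.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the summary statistics for one confusion matrix as a dict."""

    def ratio(num, den):
        # Guard against an empty denominator, e.g. a model that never predicts fraud.
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "Sensitivity": recall,
        "Specificity": ratio(tn, tn + fp),
        "False positive rate": ratio(fp, tn + fp),
        "Positive predictive value": precision,
        "Negative predictive value": ratio(tn, tn + fn),
        "F1 Score": ratio(2 * precision * recall, precision + recall),
        "Accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


def print_model_report(name, tp, tn, fp, fn):
    """Print the metrics for one model in an aligned, labelled layout."""
    print(f"{name} Model :")
    for label, value in confusion_metrics(tp, tn, fp, fn).items():
        print(f"{label:<26}: {value}")


# Hypothetical counts, for illustration only (not results from this study).
print_model_report("Example", tp=90, tn=95, fp=5, fn=10)
```

With a helper like this, each model's report would reduce to a single call such as `print_model_report("DT", TP_DT, TN_DT, FP_DT, FN_DT)`, assuming those counts are defined as in the code above.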

**CONCLUSION**

1. We created 5 models and evaluated each of them on several evaluation metrics. The area under the ROC curve plays an important role in model selection in this study.

2. Apart from the area under the ROC curve and the precision/recall values, we have to consider computation time when choosing a model here, as the amount of data is very large. Several models performed well on this data, the significant difference between them being computation time. The KNN model is not advisable because the data set contains more than 10k data points, which substantially increases the model's computation time.

3. Other metrics considered were accuracy, sensitivity (recall), specificity, and precision. The selected model should strike the right balance between precision and recall. Identifying fraud correctly is as important as reducing the number of misclassified transactions, because both situations result in a loss for the client. So the selected model should combine well-balanced precision and recall with a high area under the ROC curve.

   Model recommendations:

   1) Based on the area under the ROC curve, the logistic regression (LR) model with an l2 penalty can be used: its area under the ROC curve is around 0.9862, the highest among all the models. The sensitivity/recall for the LR model is 0.9387, and the value reported as specificity/precision stands at 0.9383, which are both quite good values.

   2) The next recommended model is the GRB model. Its area under the ROC curve is 0.977, one of the highest among the models. The sensitivity/recall for the GRB model is 0.9183, and the value reported as specificity/precision is 0.9183 with an accuracy of 0.918, which is very well balanced for our scenario.

4. Other models, such as random forest and XGBoost, also have strong areas under the ROC curve, with values of 0.975 and 0.966 respectively, along with well-balanced precision/recall values, and can be considered for evaluation.
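
Since the area under the ROC curve drives the recommendations above, it may help to see what the number measures. It is the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, with ties counted as half. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the code or data used in this study.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true holds 0/1 labels (1 = fraud); y_score holds model scores,
    where a higher score means "more likely fraud".
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one fraud and one non-fraud example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # fraud correctly ranked above non-fraud
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 fraud/non-fraud pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for illustration. On a data set of this size a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.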