# Exhaustive Framework for Risk Management and Anomaly Detection in Highly Imbalanced Datasets

As technology continues to advance, cybersecurity risks have become more pronounced. The digital landscape gives malicious actors ample opportunity to exploit vulnerabilities in systems through phishing attacks, ransomware, social engineering, and automated programs and scripts designed to evade detection. Machine learning plays an invaluable role in risk management: it can fortify threat detection mechanisms to identify and prevent fraudulent activity, and help maintain the integrity of systems and transactions.

However, balancing robust anomaly detection against user convenience is a challenging task for organizations. Machine learning models also tend to perform well on the majority class and to overlook minority class instances such as rare anomalies. In this tutorial, we will explore the predictive strength of several machine learning techniques (Random Forest, AdaBoost, CatBoost, XGBoost, and LightGBM) on both the majority and minority classes, and compare techniques for addressing class imbalance: oversampling (e.g., SMOTE, ADASYN) and stratified sampling (e.g., Stratified K-Fold Cross Validation).

### Context

The dataset can be found on Kaggle and contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

The dataset contains only numerical inputs:
- __Time__ contains the seconds elapsed between each transaction and the first transaction in the dataset
- __Amount__ contains the transaction amount
- __Class__ is the response variable and it takes value 1 in case of fraud and 0 otherwise
- __V1, V2, ... , V28__ are the principal components obtained from a PCA transformation of the other original features.


### Data Preprocessing

In [2]:
import pandas as pd 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import gc
from datetime import datetime 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier
from sklearn import svm
import lightgbm as lgb
from lightgbm import LGBMClassifier
import xgboost as xgb
import os
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import ADASYN
from skopt import BayesSearchCV
from skopt.callbacks import DeadlineStopper, DeltaYStopper
from skopt.space import Real, Categorical, Integer

%matplotlib inline 
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', 100)

In [4]:
data_df = pd.read_csv('C://Users//wlim129//Desktop//Work//Anomaly Detection//creditcard.csv')

print(data_df.shape)
# data has rows: 284807  columns: 31

data_df.head()

data_df.describe()

(284807, 31)


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,2.239053e-15,1.673327e-15,-1.247012e-15,8.190001e-16,1.207294e-15,4.887456e-15,1.437716e-15,-3.772171e-16,9.564149e-16,1.039917e-15,6.406204e-16,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,1.08885,1.020713,0.9992014,0.9952742,0.9585956,0.915316,0.8762529,0.8493371,0.8381762,0.8140405,0.770925,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,-24.58826,-4.797473,-18.68371,-5.791881,-19.21433,-4.498945,-14.12985,-25.1628,-9.498746,-7.213527,-54.49772,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,-0.5354257,-0.7624942,-0.4055715,-0.6485393,-0.425574,-0.5828843,-0.4680368,-0.4837483,-0.4988498,-0.4562989,-0.2117214,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,-0.09291738,-0.03275735,0.1400326,-0.01356806,0.05060132,0.04807155,0.06641332,-0.06567575,-0.003636312,0.003734823,-0.06248109,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,0.4539234,0.7395934,0.618238,0.662505,0.4931498,0.6488208,0.5232963,0.399675,0.5008067,0.4589494,0.1330408,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,23.74514,12.01891,7.848392,7.126883,10.52677,8.877742,17.31511,9.253526,5.041069,5.591971,39.4209,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


### Exploratory Data Analysis

In [None]:
# check missing data

total = data_df.isnull().sum().sort_values(ascending = False)
percent = (data_df.isnull().sum()/data_df.isnull().count()*100).sort_values(ascending = False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).transpose()

The dataset does not contain any missing data.

In [None]:
# check for data imbalance

temp = data_df["Class"].value_counts()
df = pd.DataFrame({'Class': temp.index,'values': temp.values})

trace = go.Bar(
    x = df['Class'],y = df['values'],
    name="Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)",
    marker=dict(color="Red"),
    text=df['values']
)
data = [trace]
layout = dict(title = 'Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)',
          xaxis = dict(title = 'Class', showticklabels=True), 
          yaxis = dict(title = 'Number of transactions'),
          hovermode = 'closest',width=600
         )
fig = dict(data=data, layout=layout)
iplot(fig, filename='class')

Only 492 (or 0.172%) of transactions are fraudulent. The data is highly imbalanced with respect to the target variable Class.
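To make the imbalance concrete, the short calculation below uses only the two counts quoted above. It also shows the accuracy that a hypothetical classifier would reach by labelling every transaction as legitimate, which is why accuracy alone says little on this dataset. The variable names are illustrative.

```python
# Counts taken from the dataset description above.
n_total = 284_807
n_fraud = 492

fraud_rate = n_fraud / n_total

# Accuracy of a trivial (hypothetical) model that always predicts "not fraud".
majority_baseline_accuracy = (n_total - n_fraud) / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.4%}")
```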

In [None]:
# time density plot for transactions

class_0 = data_df.loc[data_df['Class'] == 0]["Time"]
class_1 = data_df.loc[data_df['Class'] == 1]["Time"]

hist_data = [class_0, class_1]
group_labels = ['Not Fraud', 'Fraud']

fig = ff.create_distplot(hist_data, group_labels, show_hist=False, show_rug=False)
fig['layout'].update(title='Credit Card Transactions Time Density Plot', xaxis=dict(title='Time [s]'))
iplot(fig, filename='dist_only')

Fraudulent transactions are distributed more evenly over time than legitimate transactions.

In [None]:
# statistical summaries of fraud and non-fraud transactions by hour

data_df['Hour'] = data_df['Time'].apply(lambda x: np.floor(x / 3600))

tmp = data_df.groupby(['Hour', 'Class'])['Amount'].aggregate(['min', 'max', 'count', 'sum', 'mean', 'median', 'var']).reset_index()
df = pd.DataFrame(tmp)
df.columns = ['Hour', 'Class', 'Min', 'Max', 'Transactions', 'Sum', 'Mean', 'Median', 'Var']
df.head()

In [None]:
# total amount by class, non-fraud is blue and fraud is red

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Sum", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Sum", data=df.loc[df.Class==1], color="red")
plt.suptitle("Total Amount")
plt.show();

In [None]:
# total number of transactions by class, non-fraud is blue and fraud is red

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Transactions", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Transactions", data=df.loc[df.Class==1], color="red")
plt.suptitle("Total Number of Transactions")
plt.show();

In [None]:
# average amount per transaction by class, non-fraud is blue and fraud is red

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Mean", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Mean", data=df.loc[df.Class==1], color="red")
plt.suptitle("Average Amount per Transaction")
plt.show();

In [None]:
# maximum amount of transaction by class, non-fraud is blue and fraud is red

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
# plot the per-hour maximum (the 'Max' column), matching the plot title
s = sns.lineplot(ax = ax1, x="Hour", y="Max", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Max", data=df.loc[df.Class==1], color="red")
plt.suptitle("Maximum Amount of Transactions")
plt.show();

In [None]:
# median amount of transaction by class, non-fraud is blue and fraud is red

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Median", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Median", data=df.loc[df.Class==1], color="red")
plt.suptitle("Median Amount of Transactions")
plt.show();

In [None]:
# minimum amount of transaction by class, non-fraud is blue and fraud is red

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,6))
s = sns.lineplot(ax = ax1, x="Hour", y="Min", data=df.loc[df.Class==0])
s = sns.lineplot(ax = ax2, x="Hour", y="Min", data=df.loc[df.Class==1], color="red")
plt.suptitle("Minimum Amount of Transactions")
plt.show();

In [None]:
# summary statistics of both classes, with and without outliers

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,6))
s = sns.boxplot(ax = ax1, x="Class", y="Amount", hue="Class",data=data_df, palette="PRGn",showfliers=True)
s = sns.boxplot(ax = ax2, x="Class", y="Amount", hue="Class",data=data_df, palette="PRGn",showfliers=False)
plt.show();

In [None]:
tmp = data_df[['Amount','Class']].copy()
class_0 = tmp.loc[tmp['Class'] == 0]['Amount']
class_1 = tmp.loc[tmp['Class'] == 1]['Amount']
class_0.describe()

In [None]:
class_1.describe()

Using the median as the measure of the average, legitimate transactions have a larger median and Q1, a smaller Q3, and larger outliers than fraudulent transactions, which have a smaller median and Q1, a larger Q3, and smaller outliers.

In [None]:
# plot of fraudulent transactions, amount against time in seconds

fraud = data_df.loc[data_df['Class'] == 1]

trace = go.Scatter(
    x = fraud['Time'],y = fraud['Amount'],
    name="Amount",
     marker=dict(
                color='rgb(238,23,11)',
                line=dict(
                    color='red',
                    width=1),
                opacity=0.5,
            ),
    text= fraud['Amount'],
    mode = "markers"
)
data = [trace]
layout = dict(title = 'Amount of fraudulent transactions',
          xaxis = dict(title = 'Time [s]', showticklabels=True), 
          yaxis = dict(title = 'Amount'),
          hovermode='closest'
         )
fig = dict(data=data, layout=layout)
iplot(fig, filename='fraud-amount')

In [None]:
# features correlation

plt.figure(figsize = (14,14))
plt.title('Credit Card Transactions features correlation plot (Pearson)')
corr = data_df.corr()
sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns,linewidths=.1,cmap="Reds")
plt.show()

Because V1-V28 are obtained from PCA, there is no notable correlation between them. However, some of these features are correlated with Time (inverse correlation with V3) and with Amount (direct correlation with V7 and V20, inverse correlation with V1 and V5).
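One way to read such relationships off the matrix without relying on heatmap colours is to sort a single column of the correlation matrix. The sketch below does this on a small synthetic frame; the column names `A`, `B` and `C` and the frame `demo_df` are placeholders rather than the real features. With the real data, the same expression could be applied to `data_df.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_df: A moves with Amount, B moves against it, C is noise.
rng = np.random.default_rng(0)
n = 1_000
amount = rng.normal(size=n)
demo_df = pd.DataFrame({
    "A": amount + 0.5 * rng.normal(size=n),
    "B": -amount + 0.5 * rng.normal(size=n),
    "C": rng.normal(size=n),
    "Amount": amount,
})

# Pearson correlation of every feature with Amount, strongest inverse correlation first.
corr_with_amount = demo_df.corr()["Amount"].drop("Amount").sort_values()
print(corr_with_amount)
```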

In [None]:
# regression line for Amount against V20 and V7

s = sns.lmplot(x='V20', y='Amount',data=data_df, hue='Class', fit_reg=True,scatter_kws={'s':2})
s = sns.lmplot(x='V7', y='Amount',data=data_df, hue='Class', fit_reg=True,scatter_kws={'s':2})
plt.show()

We can confirm that both pairs of features are correlated: the regression lines for Class = 0 have a positive slope, while the regression lines for Class = 1 have a smaller positive slope.

In [None]:
# regression line for Amount against V2 and V5

s = sns.lmplot(x='V2', y='Amount',data=data_df, hue='Class', fit_reg=True,scatter_kws={'s':2})
s = sns.lmplot(x='V5', y='Amount',data=data_df, hue='Class', fit_reg=True,scatter_kws={'s':2})
plt.show()

We can confirm that both pairs of features are inversely correlated: the regression lines for Class = 0 have a negative slope, while the regression lines for Class = 1 have a very small negative slope.

In [None]:
# Features density plot

var = data_df.columns.values

i = 0
t0 = data_df.loc[data_df['Class'] == 0]
t1 = data_df.loc[data_df['Class'] == 1]

sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(8,4,figsize=(16,28))

for feature in var:
    i += 1
    plt.subplot(8,4,i)
    sns.kdeplot(t0[feature], bw_method=0.5,label="Class = 0", warn_singular=False)
    sns.kdeplot(t1[feature], bw_method=0.5,label="Class = 1", warn_singular=False)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

V4 and V11 have clearly separated distributions for Class values 0 and 1. V12, V14, V18 are partially separated. V1, V2, V3, V10 have a quite distinct profile, whilst V25, V26, V28 have similar profiles for the two values of Class.

In general, with just a few exceptions (Time and Amount), the feature distributions for legitimate transactions are centered around 0, sometimes with a long tail at one of the extremities, while those for fraudulent transactions (Class = 1) are skewed (asymmetric).

### Feature Engineering

In [None]:
# stratified Train/Validation/Test split

target = 'Class'

VALID_SIZE = 0.20 # simple validation using train_test_split
TEST_SIZE = 0.20 # test size using_train_test_split
RANDOM_STATE = 333

X = data_df.drop([target, 'Hour'],axis = 1)
Y = data_df[target]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=TEST_SIZE, stratify=Y, random_state=RANDOM_STATE, shuffle=True)
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train, Y_train, test_size=VALID_SIZE, stratify=Y_train, random_state=RANDOM_STATE, shuffle=True)


We will now proceed with scaling. Min-max normalization rescales each feature to a fixed interval (typically [0, 1]), so a few extreme values squeeze the bulk of the data into a narrow band; it does not handle outliers well. Standardization subtracts the mean and divides by the standard deviation, giving each feature zero mean and unit variance, but both statistics are themselves pulled by outliers, so the result can still be skewed or biased when the input variable contains extreme values.

To overcome this, the median and interquartile range (IQR) can be used instead, a technique referred to as robust scaling. It computes the median and the 25th and 75th percentiles of each variable, then subtracts the median from each value and divides by the IQR:

$$x_{\text{scaled}} = \frac{x - \text{median}}{p_{75} - p_{25}}$$

The scaled variable has a median of 0 and an IQR of 1, and outliers have little influence on either statistic.
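As a quick check of the robust scaling formula, the sketch below applies it by hand to a handful of made-up amounts (one of them an extreme outlier) and compares the result with scikit-learn's `RobustScaler` at its default settings. The values in `amounts` are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy amounts with one extreme outlier (illustrative values, not from the dataset).
amounts = np.array([1.0, 5.0, 20.0, 22.0, 30.0, 77.0, 25000.0]).reshape(-1, 1)

# Manual robust scaling: subtract the median, divide by the interquartile range.
median = np.median(amounts)
p25, p75 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (p75 - p25)

# RobustScaler defaults: centre on the median, scale by the 25th-75th percentile range.
scaled = RobustScaler().fit_transform(amounts)

print(np.allclose(manual, scaled))
print(np.median(scaled))
```

The outlier still maps to a large scaled value, but it no longer determines how the remaining values are spread.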

In [None]:
# robust scaling

rob_scaler_amount = RobustScaler()
rob_scaler_time = RobustScaler()

# fit on the training split only, so no validation/test statistics leak into the scalers
rob_scaler_amount.fit(X_train['Amount'].values.reshape(-1,1))
rob_scaler_time.fit(X_train['Time'].values.reshape(-1,1))

# apply each scaler to its own column in every split
for split in (X_train, X_valid, X_test):
    split['Amount'] = rob_scaler_amount.transform(split['Amount'].values.reshape(-1,1)).ravel()
    split['Time'] = rob_scaler_time.transform(split['Time'].values.reshape(-1,1)).ravel()
X_train.head()
X_valid.head()
X_test.head()

Since our training data is highly imbalanced, with far fewer fraudulent cases (less than 0.2%) than non-fraudulent ones, a model can achieve high accuracy by consistently predicting the majority class while missing the minority class. To tackle this, we apply an oversampling technique to balance the training data, choosing ADASYN over alternatives such as SMOTE. Both create synthetic minority samples by interpolating between a minority sample and one of its minority-class nearest neighbours. The main difference is where they create them: SMOTE treats all minority instances uniformly, while ADASYN generates samples adaptively, producing more around minority instances that have many majority-class neighbours and are therefore harder to learn. For highly imbalanced data, ADASYN is often considered more suitable because it concentrates synthetic samples in those difficult regions of the feature space.
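The adaptive part of ADASYN can be sketched in a few lines. The code below is a simplified illustration of the weighting idea and of the interpolation step, not the imbalanced-learn implementation; the function names `adasyn_weights` and `interpolate` and the toy data are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, k=3, minority_label=1):
    """Share of synthetic samples an ADASYN-style scheme assigns to each minority point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neighbour_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # r_i: fraction of the k neighbours that belong to the majority class.
    r = (y[neighbour_idx] != minority_label).mean(axis=1)
    if r.sum() == 0:
        # No minority point has majority neighbours: fall back to uniform weights.
        return np.full(len(X_min), 1.0 / len(X_min))
    return r / r.sum()

def interpolate(x_i, x_neighbour, rng):
    """Step shared by SMOTE and ADASYN: a random point on the segment between two minority samples."""
    lam = rng.random()
    return x_i + lam * (x_neighbour - x_i)

# Toy data: a tight minority cluster plus one minority point surrounded by majority points.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5],
              [5.1, 5], [4.9, 5], [5, 5.1], [5, 4.9], [10, 10]])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

weights = adasyn_weights(X, y, k=3)
print(weights)
```

In this toy example all of the weight goes to the isolated minority point at (5, 5), whereas a SMOTE-style scheme would spread new samples evenly over the five minority points.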

In [None]:
# oversampling train data with ADASYN

X_train, Y_train = ADASYN(random_state=RANDOM_STATE).fit_resample(X_train, Y_train)
Y_train.value_counts()

### Modelling

#### Random Forest Classifier

In [None]:
# fitting the RFC model

RFC_METRIC = 'gini'
NUM_ESTIMATORS = 100
NO_JOBS = -1

clf = RandomForestClassifier(n_jobs=NO_JOBS, 
                             random_state=RANDOM_STATE,
                             criterion=RFC_METRIC,
                             n_estimators=NUM_ESTIMATORS,
                             verbose=False)

clf.fit(X_train, Y_train.values)

preds = clf.predict(X_valid)

In [None]:
# feature importance

tmp = pd.DataFrame({'Feature': X_train.columns.to_numpy(), 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show() 

The most important features are V4, V14, V12, V17, V18, V10.

In [None]:
# confusion matrix

cm = pd.crosstab(Y_valid.values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# AUC-ROC

roc_auc_score(Y_valid.values, preds)

The ROC-AUC score obtained with RandomForestClassifier is 0.87.
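Note that `roc_auc_score` is given hard 0/1 predictions in the cell above. With binary predictions the ROC curve has a single operating point and the score reduces to balanced accuracy, whereas passing predicted probabilities evaluates the full ranking. The sketch below illustrates the difference on synthetic data; `make_classification`, the small forest and all variable names are stand-ins chosen for this example, not the tutorial's model or data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); a stand-in for the credit card set.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

hard_labels = model.predict(X_va)               # 0/1 predictions
fraud_proba = model.predict_proba(X_va)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_va, hard_labels)
auc_from_proba = roc_auc_score(y_va, fraud_proba)
balanced_acc = balanced_accuracy_score(y_va, hard_labels)

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"Balanced accuracy:      {balanced_acc:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```

The first two numbers coincide by construction. Whether to report the label-based or the probability-based score is a design choice, but the two should not be compared with each other.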

#### AdaBoost Classifier

In [None]:
# fitting the AdaBoost model

clf = AdaBoostClassifier(random_state=RANDOM_STATE,
                         algorithm='SAMME.R',
                         learning_rate=0.8,
                         n_estimators=NUM_ESTIMATORS)

clf.fit(X_train, Y_train.values)

preds = clf.predict(X_valid)

In [None]:
# feature importance

tmp = pd.DataFrame({'Feature': X_train.columns.to_numpy(), 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()   

In [None]:
# confusion matrix

cm = pd.crosstab(Y_valid.values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# AUC-ROC

roc_auc_score(Y_valid.values, preds)

The ROC-AUC score obtained with AdaBoostClassifier is 0.91.

#### CatBoost Classifier

In [None]:
# CatBoostClassifier

VERBOSE_EVAL = 50

clf = CatBoostClassifier(iterations=500,
                         learning_rate=0.02,
                         depth=12,
                         eval_metric='AUC',
                         random_seed = RANDOM_STATE,
                         bagging_temperature = 0.2,
                         od_type='Iter',
                         metric_period = VERBOSE_EVAL,
                         od_wait=100)

clf.fit(X_train, Y_train.values, eval_set=[(X_train, Y_train), (X_valid, Y_valid)])

preds = clf.predict(X_test)

In [None]:
# feature importance

tmp = pd.DataFrame({'Feature': X_train.columns.to_numpy(), 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()   

In [None]:
# confusion matrix
cm = pd.crosstab(Y_test.values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# AUC-ROC

roc_auc_score(Y_test.values, preds)

The ROC-AUC score obtained with CatBoostClassifier is 0.92.

#### XGBoost

In [None]:
# prepare the train, validation and test datasets
dtrain = xgb.DMatrix(X_train, Y_train.values)
dvalid = xgb.DMatrix(X_valid, Y_valid.values)
dtest = xgb.DMatrix(X_test, Y_test.values)

watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

# set parameters
params = {}
params['objective'] = 'binary:logistic'
params['eta'] = 0.01
params['verbosity'] = 0  # replaces the deprecated 'silent' flag
params['max_depth'] = 2
params['subsample'] = 0.8
params['colsample_bytree'] = 0.9
params['eval_metric'] = 'auc'
params['random_state'] = RANDOM_STATE

In [None]:
# training the model

MAX_ROUNDS = 1000
EARLY_STOP = 50

model = xgb.train(params, 
                  dtrain, 
                  MAX_ROUNDS, 
                  watchlist, 
                  early_stopping_rounds=EARLY_STOP, 
                  maximize=True, 
                  verbose_eval=VERBOSE_EVAL)

In [None]:
# feature importance

fig, (ax) = plt.subplots(ncols=1, figsize=(8,5))
xgb.plot_importance(model, height=0.8, title="Features importance (XGBoost)", ax=ax, color="green") 
plt.show()

In [None]:
# confusion matrix
preds = model.predict(dtest)
preds = [round(value) for value in preds]

cm = pd.crosstab(Y_test.values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# AUC-ROC

roc_auc_score(Y_test.values, preds)

The AUC score for the prediction of fresh data (test set) for XGBoost is 0.92.
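Rounding the XGBoost outputs, as done in the confusion matrix cell above, amounts to applying a probability threshold of about 0.5. On imbalanced data the threshold is itself a tuning knob that trades precision against recall. The sketch below shows the effect on a few made-up labels and probabilities; the arrays and the `results` dictionary are illustrative, not model output.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative labels and predicted fraud probabilities (made-up values).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.80, 0.95])

results = {}
for threshold in (0.25, 0.50, 0.75):
    # Label a transaction as fraud when its probability reaches the threshold.
    labels = (proba >= threshold).astype(int)
    precision = precision_score(y_true, labels, zero_division=0)
    recall = recall_score(y_true, labels)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches more fraud at the cost of more false alarms, which is the balance between detection and user convenience discussed in the introduction.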

#### LightGBM

In [None]:
# training the model

lgbm = LGBMClassifier(boosting_type= 'gbdt',
                      objective= 'binary',
                      metric='auc',
                      learning_rate= 0.01,
                      num_leaves= 7,  
                      max_depth= 4,  
                      min_child_samples= 100, 
                      max_bin= 100,  
                      subsample= 0.9, 
                      subsample_freq= 1, 
                      colsample_bytree= 0.7,  
                      min_child_weight= 0,  
                      min_split_gain= 0,  
                      n_jobs=os.cpu_count(),
                      verbose= 1,)

In [None]:
# fitting the model

lgbm.fit(X_train, Y_train.values, eval_set=[(X_train, Y_train), (X_valid, Y_valid)],
        eval_metric='auc')

preds = lgbm.predict(X_test)

In [None]:
# feature importance

tmp = pd.DataFrame({'Feature': X_train.columns.to_numpy(), 'Feature importance': lgbm.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()   

In [None]:
# confusion matrix

cm = pd.crosstab(Y_test.values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# AUC-ROC

roc_auc_score(Y_test.values, preds)

The AUC-ROC score obtained with LightGBM is 0.93.

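We observe that, among all the models, LightGBM produces the best results, with the highest ROC-AUC score. Note that the Random Forest and AdaBoost scores were computed on the validation set, while the CatBoost, XGBoost and LightGBM scores were computed on the test set.

To further improve our model, we will tune its hyperparameters with Bayesian optimization, using stratified K-Fold cross-validation for the training and validation sets.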
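Stratified K-Fold keeps the class proportions of the full dataset in every fold, so no fold ends up without fraud cases. The sketch below illustrates this on toy labels; `y_toy`, `X_toy` and the 2% positive rate are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 20 positives out of 1,000 samples (2%).
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positives that land in the validation part of each fold.
positives_per_fold = [int(y_toy[valid_idx].sum()) for _, valid_idx in skf.split(X_toy, y_toy)]
print(positives_per_fold)
```

Each validation fold receives the same number of positives, which a plain `KFold` split does not guarantee.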

#### Bayesian optimization, stratified K-Fold cross validation

In [None]:
# set initial parameters

X = data_df.drop([target, 'Hour'],axis = 1)
Y = data_df[target]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=TEST_SIZE, stratify=Y, random_state=RANDOM_STATE, shuffle=True)

lgbm = LGBMClassifier(boosting_type='dart',
                      objective='binary',
                      metric='auc',
                      n_jobs=os.cpu_count(), 
                      verbose=-1,
                      random_state=RANDOM_STATE,
                      min_split_gain= 0, # or can regularize next step by searching L1/L2 
                      )

In [None]:
# specifying search space

search_spaces = {
    'learning_rate': Real(0.001, 0.01, 'log-uniform'),   
    'n_estimators' : Integer(50, 200),
    'num_leaves': Integer(10, 100),                    
    'max_depth': Integer(15, 100),                       
    'subsample': Real(0.1, 1.0, 'uniform'),           
    'subsample_freq': Integer(0, 10),                   
    'min_child_samples': Integer(10, 200),  
    #'reg_lambda': Real(1e-9, 100.0, 'log-uniform'),      # L2 regularization
    #'reg_alpha': Real(1e-9, 100.0, 'log-uniform'),       # L1 regularization
    }

In [None]:
# setting up the optimizer

opt = BayesSearchCV(estimator=lgbm,                                    
                    search_spaces=search_spaces,                                              
                    cv=5, # stratified K-Fold automatically used here (refer to documentation)
                    n_iter=30,                                   
                    n_points=3,                                       
                    n_jobs=-1,                                       
                    return_train_score=False,                         
                    refit=False,                                      
                    optimizer_kwargs={'base_estimator': 'GP'},       
                    random_state=RANDOM_STATE)

In [None]:
# running the optimizer

np.int = np.int64 # skopt still references np.int, which newer NumPy versions no longer provide, so we alias it manually
opt.fit(X_train, Y_train)

In [None]:
# results

best_param = opt.best_params_

# OrderedDict([('learning_rate', 0.008334804024808152),
#             ('max_depth', 24),
#             ('min_child_samples', 18),
#             ('n_estimators', 188),
#             ('num_leaves', 58),
#             ('subsample', 0.9953948691683815),
#             ('subsample_freq', 2)])

In [None]:
opt.best_score_

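The best cross-validated score is 0.9994118808839343. Because no `scoring` argument is passed to `BayesSearchCV`, this is the estimator's default score (accuracy) rather than AUC-ROC; the `metric='auc'` setting only controls LightGBM's internal evaluation metric.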
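The score that a cross-validated search optimizes matters a great deal on imbalanced data. scikit-learn style search and cross-validation utilities use the estimator's own `score` method (accuracy for classifiers) unless a `scoring` argument is given. The sketch below contrasts the two on toy labels, using a `DummyClassifier` that always predicts the majority class; the data and names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Toy imbalanced labels (2% positives); the features carry no information.
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.zeros((len(y_toy), 1))

# A classifier that always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent")

# No scoring argument: falls back to estimator.score, i.e. accuracy.
default_scores = cross_val_score(dummy, X_toy, y_toy, cv=5)
# Explicit ROC-AUC scoring.
auc_scores = cross_val_score(dummy, X_toy, y_toy, cv=5, scoring="roc_auc")

print(default_scores.mean(), auc_scores.mean())
```

A model with no discriminative power scores very highly on the default metric and at chance level on ROC-AUC. Passing `scoring='roc_auc'` to the optimizer would be one way to make the search target the metric reported in this tutorial.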

In [None]:
# refit on the training set with the best parameters, then predict on the test set

lgbm = LGBMClassifier(boosting_type='dart',
                      objective='binary',
                      metric='auc',
                      n_jobs=os.cpu_count(), 
                      verbose=-1,
                      random_state=RANDOM_STATE,
                      min_split_gain= 0,
                      learning_rate=best_param['learning_rate'],
                      max_depth=best_param['max_depth'],
                      min_child_samples=best_param['min_child_samples'],
                      n_estimators=best_param['n_estimators'],
                      num_leaves=best_param['num_leaves'],
                      subsample=best_param['subsample'],
                      subsample_freq=best_param['subsample_freq'],
                      )

lgbm.fit(X_train, Y_train.values)

preds = lgbm.predict(X_test)

In [None]:
# feature importance

tmp = pd.DataFrame({'Feature': X_train.columns.to_numpy(), 'Feature importance': lgbm.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()   

In [None]:
# confusion matrix

cm = pd.crosstab(Y_test.values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# AUC-ROC

roc_auc_score(Y_test.values, preds)

The AUC-ROC score obtained with hyperparameter tuned LightGBM is 0.85.

As we observe, even with hyperparameter tuning, the model performs worse than our previous untuned model. This suggests that the highly imbalanced nature of our dataset can skew results drastically, and that stratified K-Fold without oversampling techniques is not enough to address the imbalance, although the two runs also differ in other respects (boosting type, feature scaling, and the score optimized during the search). It is better to oversample the minority cases in the training data before training the model.

A further improvement would be to oversample the training portion of each stratified fold individually, before training the model and validating it on that fold's untouched validation portion. As this method is computationally intensive, I will leave it to readers to attempt it and compare the results.
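As a starting point for that exercise, here is a minimal sketch of per-fold oversampling. It makes several assumptions: the data is synthetic, the model is a logistic regression chosen for speed, and `random_oversample` is a simple stand-in that duplicates minority rows rather than ADASYN. The helper names `random_oversample` and `cv_auc_with_oversampling` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

def random_oversample(X, y, random_state=0):
    """Stand-in resampler: duplicate minority (label 1) rows until both classes are the same size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=random_state)
    X_bal = np.vstack([X_maj, X_up])
    y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int), np.ones(len(X_up), dtype=int)])
    return X_bal, y_bal

def cv_auc_with_oversampling(model, X, y, resample_fn, n_splits=5, random_state=0):
    """Oversample inside each training fold only; validate on the untouched fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, valid_idx in skf.split(X, y):
        X_bal, y_bal = resample_fn(X[train_idx], y[train_idx])
        model.fit(X_bal, y_bal)
        proba = model.predict_proba(X[valid_idx])[:, 1]
        scores.append(roc_auc_score(y[valid_idx], proba))
    return np.array(scores)

# Synthetic imbalanced data (about 5% positives) as NumPy arrays.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=0)
fold_scores = cv_auc_with_oversampling(LogisticRegression(max_iter=1000),
                                       X_demo, y_demo, random_oversample)
print(fold_scores.mean())
```

Because the validation rows are never resampled, each fold's score reflects the original class balance. To follow the tutorial more closely, `ADASYN(random_state=RANDOM_STATE).fit_resample` could be passed as `resample_fn` and an `LGBMClassifier` as `model`; with pandas inputs the row indexing would need `.iloc`.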