## Credit Card Fraud Detection

Case study for Data Mining and Business Intelligence Course.

**Objectives**
- Apply Data Mining techniques to develop smarter Fraud Detection systems.
- Understand the effect of PCA on the choice of classification model.
- Analyze different classifier performance metrics.
- Preprocess imbalanced data and validate classifier performance on it.
- Apply sampling methods to prevent overfitting.

**Observations**
- Data is heavily imbalanced: fraudulent transactions make up only 0.17% of all transactions.
- Undersampling gave markedly better results than oversampling here. The information lost by discarding majority-class rows hurt performance less than the heavy duplication of minority-class rows that oversampling requires.
- Accuracy is not an appropriate performance metric for this problem; AUC and miss rate give better insight into classifier performance.
- Features V1-V28 are the output of PCA, so they are uncorrelated with one another. This fits the feature-independence assumption of Naive Bayes reasonably well, making it a good choice for the model.
- To do: test whether cross-validation reduces false negatives, i.e. fraudulent transactions classified as not fraud.
- Bernoulli Naive Bayes was expected to perform better than Gaussian Naive Bayes in practice, because its decision rule explicitly penalizes the absence of a feature that is indicative of a class. On this data set, however, the difference is only marginal.
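
The observation about PCA features can be checked numerically. The following is a minimal sketch on synthetic data (not the Kaggle data set), with PCA done by hand through NumPy's SVD; the helper `max_abs_offdiag_corr` and all variable names are made up for illustration. It only shows that principal components are uncorrelated, which is weaker than the full independence Naive Bayes assumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strongly correlated features (NOT the Kaggle data).
n_samples = 1000
base = rng.normal(size=(n_samples, 1))
raw = np.hstack([
    base + 0.1 * rng.normal(size=(n_samples, 1)),
    2.0 * base + 0.1 * rng.normal(size=(n_samples, 1)),
    rng.normal(size=(n_samples, 1)),
])

# PCA via SVD of the centred data: the rows of vt are the principal axes.
centred = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
components = centred @ vt.T


def max_abs_offdiag_corr(data):
    """Largest absolute correlation between two distinct columns."""
    corr = np.corrcoef(data, rowvar=False)
    off_diag = corr - np.diag(np.diag(corr))
    return np.abs(off_diag).max()


print("raw features:  ", max_abs_offdiag_corr(raw))
print("PCA components:", max_abs_offdiag_corr(components))
```

On the real data, the same helper could be applied to the `V` columns of `cc_data` to confirm how close to zero their pairwise correlations are.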

**Tasks done**
- Implemented random undersampling to counter overfitting to the majority class caused by the highly imbalanced data (only 0.17% of transactions are fraudulent).
- Implemented Bernoulli Naive Bayes model to classify fraudulent transactions.
- Studied and analysed different performance metrics (precision, miss rate, F1 score, ROC AUC and the confusion matrix) to build an intuitive picture of classifier performance.
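
As a quick reference for the metrics listed above, here is a small sketch that derives them from confusion-matrix counts. The function name `classification_metrics` and the counts are made up for illustration; the counts mirror the 0.17% fraud rate and describe a "classifier" that labels every transaction as legitimate, which is why accuracy alone is misleading.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics for the positive (fraud) class from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    miss_rate = fn / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "miss_rate": miss_rate,
        "f1": f1,
    }


# Made-up counts: 10,000 transactions, 17 of them fraudulent (0.17%),
# and every transaction predicted as legitimate.
always_legit = classification_metrics(tp=0, fp=0, tn=9983, fn=17)
print(always_legit)
```

The accuracy comes out at 99.83% even though every fraudulent transaction is missed (miss rate 1.0), which is the reason the miss rate and AUC are tracked alongside it.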

[Dataset - Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud)

In [None]:
import pandas as pd

cc_data = pd.read_csv('data/creditcard.csv')

maj_class = cc_data[cc_data.Class == 0]
min_class = cc_data[cc_data.Class == 1]

## Class distribution
maj_class_count = maj_class.shape[0]
min_class_count = min_class.shape[0]
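## Share of fraudulent transactions among ALL transactions (not relative to the majority class)
fraud_pct = round(min_class_count / len(cc_data) * 100, 4)
print(f"Class distribution: {fraud_pct}% fraudulent transactions | minority class")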

In [None]:
from sklearn.model_selection import train_test_split


## Pre-process data

## Random under sampling on majority class (seeded so the subsample is reproducible)
maj_class_us = maj_class.sample(n=min_class_count, random_state=1)
cc_us = pd.concat([maj_class_us, min_class], axis=0)
#cc_us = cc_data

## Split training and testing data
X = cc_us.drop(columns='Class')
Y = cc_us['Class']

## stratify=Y keeps the class ratio the same in the training and testing sets
trainX, testX, trainY, testY = train_test_split(X, Y, stratify=Y, random_state=1)


In [None]:
## TESTING: Random oversampling (draw minority rows with replacement until the classes are balanced)
# min_class_os = min_class.sample(n=maj_class_count, replace=True)
# cc_us = pd.concat([maj_class, min_class_os], axis=0)

# X = cc_us[[c for c in cc_us.columns if c != 'Class']]
# Y = cc_us['Class']

# trainX, testX, trainY, testY = train_test_split(X, Y, random_state=1)

In [None]:
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix,  roc_auc_score, f1_score, average_precision_score

model = BernoulliNB()

model.fit(trainX, trainY)

pred = model.predict(testX)


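## scikit-learn metrics expect (y_true, y_pred); rows of the matrix are true labels, columns are predictions
#print(confusion_matrix(testY, pred))

cm = confusion_matrix(testY, pred)
## With labels [0, 1], ravel() yields TN, FP, FN, TP (fraud = 1 is the positive class)
tn, fp, fn, tp = cm.ravel()

## Predicted probability of fraud, needed by the ranking metrics (ROC AUC, AP)
fraud_score = model.predict_proba(testX)[:, 1]

print(f"TP = {tp} | FP = {fp} | TN = {tn} | FN = {fn}")
print("accuracy: ", accuracy_score(testY, pred))
print("ROC AUC:", roc_auc_score(testY, fraud_score))
print("precision: ", (tp/(tp+fp)))
print("Miss rate: ", (fn/(tp+fn)))
print("F1 score:", f1_score(testY, pred))
print("AP:", average_precision_score(testY, fraud_score))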
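
One of the observations above proposes testing the effect of cross-validation on false negatives. A possible starting point is sketched below, assuming stratified 5-fold cross-validation and synthetic data from `make_classification` (not the Kaggle data); the names `X_demo`, `y_demo` and `folds` are illustrative. Note that cross-validation by itself mainly gives a more stable estimate of the miss rate rather than changing the model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic imbalanced data (roughly 5% positives); NOT the Kaggle data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=1
)

# Stratified folds keep the class ratio roughly constant in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# scoring="recall" is the recall of the positive class; miss rate = 1 - recall.
recall_per_fold = cross_val_score(
    BernoulliNB(), X_demo, y_demo, cv=folds, scoring="recall"
)
miss_rate_per_fold = 1.0 - recall_per_fold

print("miss rate per fold:", miss_rate_per_fold.round(3))
print("mean miss rate:    ", miss_rate_per_fold.mean().round(3))
```

To apply this to the notebook, `X_demo` and `y_demo` would be replaced with `X` and `Y` from the undersampled data; the spread across folds indicates how much a single train/test split can be trusted.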
