# **Fraud Detection Using Supervised Learning**
#### DTSA 5509 - Introduction to Machine Learning: Supervised Learning - Final Project ####
<hr>
<hr>

> **1. Project Overview**
> 
> **2. Project Setup**
>
> **3. Exploratory Data Analysis (EDA)**
>
> **4. Feature Selection**
>
> **5. Feature Engineering**
>
> **6. Classification Models**
>
>>
>> *6.1 Logistic Regression*
>>
>> *6.2 Decision Tree*
>>
>> *6.3 Random Forest*
>>
>> *6.4 Naive Bayes*
>>
>> *6.5 Support Vector Machines (SVM)*
>>
> **7. Classification Performance**
>
> **8. Area Under the Precision-Recall Curve (AUPRC)**
>
> **9. Results and Conclusion**

## **1. Project Overview**

### *About the Data:* ###
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 cases of fraud out of 284,807 transactions. The dataset is highly unbalanced: the positive class (fraud) accounts for 0.172% of all transactions.

The dataset contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, neither the original features nor more background information about the data can be provided. Features `V1`, `V2`, … `V28` are the principal components obtained with PCA. 

The only features which have not been transformed with PCA are `Time` and `Amount`. Feature `Time` contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature `Amount` is the transaction amount – this feature can be used for example-dependent cost-sensitive learning. Feature `Class` is the response variable which takes value 1 in case of fraud and 0 otherwise.

### *Objective:* ###
Find the most accurate supervised learning model for classifying fraudulent credit card transactions.

*Based on [this Kaggle dataset.](https://www.kaggle.com/datasets/whenamancodes/fraud-detection?datasetId=2472961&sortBy=voteCount)*
<hr>

## **2. Project Setup**

In [None]:
# SUPPRESS WARNINGS
import warnings
warnings.filterwarnings('ignore')

In [None]:
# IMPORT LIBRARIES
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from scipy.stats import zscore

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import time

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, average_precision_score, precision_recall_curve, auc

In [None]:
# SET NOTEBOOK OPTIONS
%matplotlib inline
plt.rcParams['figure.figsize'] = [15, 15]
np.random.seed(31415)

In [None]:
# READ IN DATA
data = pd.read_csv('/kaggle/input/dtsa-5509-final-project/creditcard.csv')

<hr>

## **3. Exploratory Data Analysis (EDA)**
<hr>

In [None]:
# Preview data
display(data)

# Check data types
data.dtypes

All variables are the correct data type.<br>
30 numerical features and 1 binary target variable (`Class`).

In [None]:
# Check for missing values
any(data.isnull().sum())

No missing values.

In [None]:
# Check for blank values
np.where(data.applymap(lambda x: x == ''))

No blank values.

In [None]:
# Count frequency of 'Class' labels
# 0 = Not Fraud; 1 = Fraud
print('Non-Fraudulent Transactions:', len(data[data['Class'] == 0]))
print('Fraudulent Transactions:', len(data[data['Class'] == 1]))

In [None]:
# 'Class' labels countplot with percentages
total = data.Class.count() # total count

ax = sns.countplot(x='Class', data=data)
ax.bar_label(ax.containers[0], fmt=lambda x: f'{(x/total)*100:0.2f}%')
ax.set(xlabel = 'Class', ylabel = 'Count')
plt.show()

Only 0.17% of the data is fraud data.
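
To make the effect of this imbalance concrete, the short sketch below uses the counts from the dataset description to work out what a trivial classifier would score. The "always predict non-fraud" baseline is a hypothetical reference point, not one of the models trained in this notebook.

```python
# Counts taken from the dataset description in the Project Overview.
n_total = 284_807
n_fraud = 492

fraud_rate = n_fraud / n_total

# Hypothetical baseline: label every transaction as non-fraud (Class = 0).
baseline_accuracy = (n_total - n_fraud) / n_total
baseline_recall = 0 / n_fraud  # no fraudulent transaction is ever flagged

print(f"Fraud rate:        {fraud_rate:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

Such a baseline would be correct on almost every row while catching no fraud at all, which is why accuracy alone is a weak yardstick for this dataset.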

## **4. Feature Selection**
<hr>

### Outlier Detection
<hr>

Reduce the number of features by selecting the most important features using the interquartile range (IQR) method for outlier detection.<br>
Features with a greater number of outliers that are also labelled as fraud *(`Class` = 1)* will be selected for the classification algorithms.
<hr>
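
As a quick illustration of the IQR rule itself, here is a minimal sketch on a made-up array (the values are illustrative only and are not taken from the dataset). A point counts as an outlier when it falls below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`.

```python
import numpy as np

# Toy data (hypothetical): nine ordinary values and one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # 1st and 3rd quartiles
iqr = q3 - q1                             # interquartile range

lower = q1 - 1.5 * iqr                    # lower outlier bound
upper = q3 + 1.5 * iqr                    # upper outlier bound

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

print("Bounds:", lower, upper)
print("Outliers:", outliers)
```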

In [None]:
# IQR outlier detection function
def detect_outliers(df):
    # Indicator frame with the same shape as df: 1 = outlier, 0 = not an outlier
    out_data = pd.DataFrame(0.0, index=df.index, columns=df.columns)

    # Standardize each feature (z-score) so all columns share a common scale
    df = df.apply(zscore)

    for feature in df.columns:
        q1 = np.percentile(df[feature], 25) # 1st quartile (25%)
        q3 = np.percentile(df[feature], 75) # 3rd quartile (75%)
        iqr = q3 - q1 # IQR

        # Outlier bounds
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr

        # Consider any data point LESS than lower bound or GREATER than upper bound as an outlier
        outlier_index = df[(df[feature] < lower) | (df[feature] > upper)].index # outlier row indices for this feature
        out_data.loc[outlier_index, feature] = 1 # mark outlier values as 1, otherwise 0

    return out_data

In [None]:
# Feature dataframe
features = [col for col in data if col.startswith('V')] # select feature variables only
feature_data = data[features]
feature_data

In [None]:
# Apply IQR outlier detection function
out_data = detect_outliers(feature_data)
out_data

In [None]:
# Get indices of rows labelled as fraud
idx = data[data['Class']==1].index

# Count number of fraud outliers for each feature (idx holds index labels, so select with .loc)
out_count = out_data.loc[idx, :].sum().sort_values()
out_count = pd.DataFrame(data=out_count).T

out_count

In [None]:
sns.barplot(out_count).set(xlabel='Feature', ylabel='Outlier Count')
plt.yticks(np.arange(0, 500, 50))
plt.show()

In [None]:
# Get Q3 (75th percentile) of outlier counts for feature selection
round(np.percentile(out_count, 75))

In [None]:
# Select features where number of outliers is greater than 314
selected_features = out_count.columns[out_count.loc[0].gt(314)].tolist()

selected_data = data[selected_features].copy() # copy so the new column is not written to a view of 'data'
selected_data['Class'] = data['Class'] # add in 'Class' variable
selected_data

### Feature Plots
<hr>

In [None]:
# Distribution plots
fig, axes = plt.subplots(4, 2, figsize=(15, 15))
ax = axes.flatten()

for i, col in enumerate(selected_features):
    sns.histplot(selected_data[col], ax=ax[i], kde=True, element='poly').set_ylabel('', labelpad=0)
fig.tight_layout(w_pad=6, h_pad=4)
fig.delaxes(ax[7])
plt.show()

In [None]:
# Pairplot of 1000 rows, ordered by fraud cases
sns.pairplot(selected_data.sort_values('Class', ascending=False).head(1000), hue='Class');

## **5. Feature Engineering**
<hr>
Standardize features to have a mean of 0 and variance of 1.
<hr>
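
For reference, standardization replaces each value `x` with `z = (x - mean) / std`, computed per feature. The sketch below reproduces that formula with NumPy on a made-up matrix; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses. The array `X_toy` is hypothetical.

```python
import numpy as np

# Hypothetical toy matrix: 3 rows, 2 features on very different scales
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

col_mean = X_toy.mean(axis=0)  # per-feature mean
col_std = X_toy.std(axis=0)    # per-feature population std (ddof=0)

X_scaled = (X_toy - col_mean) / col_std

print(X_scaled)
print("Means after scaling:", X_scaled.mean(axis=0))
print("Stds after scaling: ", X_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and variance 1 even though their original ranges differ by two orders of magnitude.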

In [None]:
standardizer = StandardScaler()

In [None]:
# Standardize features (X)
X = selected_data.drop('Class', axis=1)
X = standardizer.fit_transform(X)
X

In [None]:
# Target (y)
y = selected_data['Class']
y = np.array(y)
y

### Training and Testing Data
<hr>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

## **6. Classification Models**
<hr>

1. Logistic Regression

2. Decision Tree

3. Random Forest

4. Naive Bayes

5. Support Vector Machines (SVM)

<hr>

In [None]:
models = {}
times = {}

### **6.1. Logistic Regression** ###
<hr>

Logistic regression uses the logistic function to classify the inputs into two categories by calculating the probability (between 0 and 1) of an input belonging to the positive class (class 1, fraud).

New inputs are classified based on their calculated probabilities using a decision threshold of 0.5. If the probability is greater than 0.5, the prediction is class 1 (fraud); otherwise the prediction is class 0 (not fraud).
<hr>
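
The logistic function and the 0.5 threshold can be sketched in a few lines. This is an illustration only: the weights and bias below are made up rather than learned from the data, and the sketch follows the scikit-learn convention in which the modeled probability is that of class 1 (fraud).

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one input."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p_fraud = sigmoid(z)
    return p_fraud, int(p_fraud > threshold)

# Hypothetical coefficients, chosen only for illustration
weights = [1.5, -2.0]
bias = -1.0

p_a, label_a = predict([2.0, 0.5], weights, bias)  # z = 1.0
p_b, label_b = predict([0.0, 1.0], weights, bias)  # z = -3.0

print(f"Input A: p = {p_a:.4f}, predicted class = {label_a}")
print(f"Input B: p = {p_b:.4f}, predicted class = {label_b}")
```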

In [None]:
start_time = time.time() # code execution start time

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

end_time = time.time() # code execution end time
# print("Total execution time: {} seconds".format(end_time - start_time))

In [None]:
models['Logistic Regression'] = y_pred_lr
times['Logistic Regression'] = end_time - start_time

### **6.2. Decision Tree** ###
<hr>

Decision tree classification recursively partitions the data based on input features to make predictions. The tree structure consists of nodes representing feature tests and branches representing possible outcomes. 

The goal is to split the data in a way that maximizes the separation of classes (0 or 1) or minimizes prediction errors. New data is classified or predicted by following the path from the root to a leaf node, which provides the final output.

<hr>
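
To show what "maximizing the separation of classes" can mean in practice, the sketch below scores candidate split thresholds on a single made-up feature using Gini impurity, the default criterion of `DecisionTreeClassifier`. The data and the helper names (`gini`, `split_impurity`) are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical single feature with binary labels
values = [0.1, 0.4, 0.5, 2.0, 2.5, 3.0]
labels = [0, 0, 0, 1, 1, 0]

# Try every observed value except the largest as a candidate threshold
candidates = sorted(values)[:-1]
best_threshold = min(candidates, key=lambda t: split_impurity(values, labels, t))

print("Impurity before split:", round(gini(labels), 4))
print("Best threshold:", best_threshold)
print("Impurity after split: ", round(split_impurity(values, labels, best_threshold), 4))
```

A real tree repeats this search over every feature and then recurses into each side of the chosen split.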

In [None]:
start_time = time.time() # code execution start time

# dt = DecisionTreeClassifier(criterion='gini')
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

end_time = time.time() # code execution end time
# print("Total execution time: {} seconds".format(end_time - start_time))

In [None]:
models['Decision Tree'] = y_pred_dt
times['Decision Tree'] = end_time - start_time

### **6.3. Random Forest** ###
<hr>
Random Forest classification assembles multiple decision trees to enhance predictive accuracy. It reduces overfitting by training each tree on different data subsets. The final prediction is a combination of outputs from individual trees, leading to more reliable and accurate results.
<hr>
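
The two ideas in this description, training each tree on a different data subset and combining the trees' outputs, can be sketched as follows. The per-tree predictions are made up for illustration. Note also that this sketch uses a simple majority vote, whereas scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities.

```python
import random
from collections import Counter

random.seed(0)

n_rows = 10   # hypothetical training set size
n_trees = 3   # hypothetical number of trees

# Bootstrap sampling: each tree draws row indices WITH replacement,
# so every tree sees a slightly different version of the training data.
bootstrap_samples = [
    [random.randrange(n_rows) for _ in range(n_rows)]
    for _ in range(n_trees)
]

# Hypothetical predictions of each tree for four new transactions
tree_predictions = [
    [0, 1, 0, 1],  # tree 1
    [0, 1, 1, 0],  # tree 2
    [0, 0, 0, 1],  # tree 3
]

def majority_vote(votes):
    """Return the most common label among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

# zip(*...) groups the votes per transaction instead of per tree
forest_prediction = [majority_vote(votes) for votes in zip(*tree_predictions)]

print("Forest prediction:", forest_prediction)
```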

In [None]:
start_time = time.time() # code execution start time

# rf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

end_time = time.time() # code execution end time
# print("Total execution time: {} seconds".format(end_time - start_time))

In [None]:
models['Random Forest'] = y_pred_rf
times['Random Forest'] = end_time - start_time

### **6.4. Naive Bayes** ###
<hr>
Naive Bayes is based on Bayes' theorem and assumes that the input features are independent of each other given the class. This assumption keeps the model simple, which makes it particularly effective for dealing with high-dimensional data.

The algorithm calculates the probability of each class for an input using Bayes' theorem, and assigns the most probable class to that input.
<hr>
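
As a rough sketch of that calculation for the Gaussian variant used here, the code below multiplies a class prior by one Gaussian likelihood per feature (this product is where the independence assumption enters) and normalizes the result. All priors, means, and variances are made up; `GaussianNB` would estimate them from the training data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class parameters for two features
params = {
    0: {"prior": 0.998, "mean": [0.0, 0.0], "var": [1.0, 1.0]},   # not fraud
    1: {"prior": 0.002, "mean": [-5.0, 4.0], "var": [4.0, 4.0]},  # fraud
}

def posterior(x):
    """Posterior probability of each class for one input vector x."""
    joint = {}
    for label, p in params.items():
        likelihood = 1.0
        for xi, mean, var in zip(x, p["mean"], p["var"]):
            likelihood *= gaussian_pdf(xi, mean, var)  # features treated as independent
        joint[label] = p["prior"] * likelihood
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}

post_typical = posterior([0.0, 0.0])
post_unusual = posterior([-5.0, 4.0])

print("Typical input ->", max(post_typical, key=post_typical.get))
print("Unusual input ->", max(post_unusual, key=post_unusual.get))
```

Even with a very small prior for fraud, an input that is far more likely under the fraud distribution is still assigned to class 1.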

In [None]:
start_time = time.time() # code execution start time

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

end_time = time.time() # code execution end time
# print("Total execution time: {} seconds".format(end_time - start_time))

In [None]:
models['Naive Bayes'] = y_pred_nb
times['Naive Bayes'] = end_time - start_time

### **6.5. Support Vector Machines (SVM)** ###
<hr>

Support Vector Machine (SVM) aims to find the optimal boundary that best separates the classes in the data (class 0 and 1) by identifying support vectors, which are data points closest to the decision boundary. 

By utilizing kernel functions, SVM accommodates complex patterns in the data, enabling it to handle both linear and non-linear relationships.

<hr>
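
The RBF kernel used below measures similarity as `exp(-gamma * ||a - b||^2)`, and a fitted SVM classifies a point by the sign of a weighted sum of kernel values against its support vectors. The sketch illustrates both pieces with made-up support vectors, coefficients, and `gamma`; none of these values come from the fitted `SVC`.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF similarity: 1.0 for identical points, approaching 0 as they move apart."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Hypothetical fitted quantities, for illustration only
support_vectors = [[0.0, 0.0], [3.0, 3.0]]
dual_coefs = [-1.0, 1.0]  # signed weights: negative pulls toward class 0, positive toward class 1
intercept = 0.0
gamma = 0.5

def decision_value(x):
    """Weighted sum of kernel similarities to the support vectors, plus the intercept."""
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for coef, sv in zip(dual_coefs, support_vectors)
    ) + intercept

def classify(x):
    return int(decision_value(x) > 0)

near_class_1 = [2.9, 3.1]
near_class_0 = [0.1, -0.1]

print(classify(near_class_1), round(decision_value(near_class_1), 4))
print(classify(near_class_0), round(decision_value(near_class_0), 4))
```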

In [None]:
start_time = time.time() # code execution start time

svm = SVC(kernel='rbf')
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

end_time = time.time() # code execution end time
# print("Total execution time: {} seconds".format(end_time - start_time))

In [None]:
models['Support Vector Machines'] = y_pred_svm
times['Support Vector Machines'] = end_time - start_time

## **7. Classification Performance**
<hr>
To compare model performance we will look at the following metrics:

1. **Accuracy:** The proportion of correctly classified instances over the total number of instances.

2. **Precision:** The ratio of true positive predictions to the total predicted positives.

3. **Recall:** Also known as Sensitivity or True Positive Rate, it's the ratio of true positive predictions to the total actual positives.

4. **F1-Score:** A single value that balances a model's precision and recall, effectively summarizing its ability to correctly identify positive instances while minimizing false positives and false negatives.

5. **Average Precision Score:** A metric used to evaluate the quality of binary classification models, measuring the average precision of positive class predictions across different levels of classification thresholds.

6. **Runtime (seconds):** The amount of time (in seconds) an algorithm takes to complete its operations. It's a critical performance metric that indicates the efficiency and speed of a process or computation.

<hr>
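
The first four metrics above can all be derived from the four cells of a confusion matrix. The sketch below works through the arithmetic on made-up counts (they are not results from this notebook).

```python
# Hypothetical confusion-matrix counts for 10,000 transactions
tp = 80    # fraud correctly flagged
fp = 20    # legitimate transactions wrongly flagged
fn = 40    # fraud that was missed
tn = 9860  # legitimate transactions correctly passed

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

In this example accuracy stays above 0.99 even though a third of the fraud cases are missed, which is why precision, recall, and F1 are reported alongside it.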

In [None]:
# Calculate performance metrics
accuracy, precision, recall, f1score, avg_precision, runtime = [], [], [], [], [], []

for model_name, y_pred in models.items():
    accuracy.append(accuracy_score(y_test, y_pred))
    precision.append(precision_score(y_test, y_pred))
    recall.append(recall_score(y_test, y_pred))
    f1score.append(f1_score(y_test, y_pred))
    avg_precision.append(average_precision_score(y_test, y_pred))
    runtime.append(times[model_name])

In [None]:
# Performance metrics dataframe
metrics = pd.DataFrame([accuracy, precision, recall, f1score, avg_precision, runtime]).T
metrics.columns = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'Average Precision Score', 'Runtime']
metrics.index = models.keys()
metrics

In [None]:
# Performance metrics plot
metrics.drop('Runtime', axis=1).plot.bar(rot=360, color=['powderblue', 'salmon', 'rosybrown', 'lavender', 'palegoldenrod'])
plt.legend(ncol= len(models.keys()), loc='upper center', fontsize=12)
plt.tight_layout();

In [None]:
# Highest Accuracy
print('Highest Accuracy:', metrics['Accuracy'].idxmax(), '= {:.4f}'.format(metrics['Accuracy'].max()))

# Highest Precision
print('Highest Precision:', metrics['Precision'].idxmax(), '= {:.4f}'.format(metrics['Precision'].max()))

# Highest Recall
print('Highest Recall:', metrics['Recall'].idxmax(), '= {:.4f}'.format(metrics['Recall'].max()))

# Highest F1-Score
print('Highest F1-Score:', metrics['F1-Score'].idxmax(), '= {:.4f}'.format(metrics['F1-Score'].max()))

# Highest Average Precision Score
print('Highest Average Precision Score:', metrics['Average Precision Score'].idxmax(), '= {:.4f}'.format(metrics['Average Precision Score'].max()))

# Fastest Runtime 
print('Fastest Runtime:', metrics['Runtime'].idxmin(), '= {:.4f} seconds'.format(metrics['Runtime'].min()))

## **8. Area Under the Precision-Recall Curve (AUPRC)**
<hr>

Given the class imbalance in the data, we will also evaluate the models using the Area Under the Precision-Recall Curve (AUPRC).

The AUPRC evaluates the quality of binary classification models, and is a graphical representation of the trade-off between precision and recall as the classification threshold varies. A high AUPRC value indicates that the model maintains high precision across different levels of recall, making it effective at correctly identifying positive instances while minimizing false positives.

<hr>
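
A precision-recall curve traces one point per classification threshold, so it is usually computed from continuous scores (for example `predict_proba(...)[:, 1]`, or `decision_function(...)` for an SVM) rather than from hard 0/1 predictions. The sketch below shows one possible way to do that. It runs on synthetic data from `make_classification` instead of the credit card dataset, and every name ending in `_demo` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 2% positives); NOT the fraud dataset
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Continuous scores: estimated probability of the positive class
scores_demo = clf_demo.predict_proba(X_te)[:, 1]

# One (precision, recall) point per distinct score threshold
precision_pts, recall_pts, _ = precision_recall_curve(y_te, scores_demo)

auprc_trapezoid = auc(recall_pts, precision_pts)            # trapezoidal area under the curve
avg_precision = average_precision_score(y_te, scores_demo)  # step-wise summary of the same curve

print(f"Points on the curve: {len(precision_pts)}")
print(f"AUPRC (trapezoid):   {auprc_trapezoid:.4f}")
print(f"Average precision:   {avg_precision:.4f}")
```

The two summaries are usually close but not identical, because the trapezoidal rule interpolates linearly between points while average precision does not.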

In [None]:
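# Calculate AUPRC
auprc = {}

for model_name in metrics.index:
    model_precision = metrics.loc[model_name, 'Precision']
    model_recall = metrics.loc[model_name, 'Recall']
    # NOTE: 'Precision' and 'Recall' hold a single value per model, so the isinstance
    # check is False and the value stored here is the precision score computed above;
    # auc() would only apply to lists of points along a precision-recall curve
    auprc_value = auc(model_recall, model_precision) if isinstance(model_precision, list) else model_precision
    auprc[model_name] = auprc_value
    
auprc    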

In [None]:
# Plot AUPRC
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
ax = axes.flatten()

for i in list(range(0, len(models))):
    precision, recall, _ = precision_recall_curve(y_test, list(models.values())[i])
    
    ax[i].step(recall, precision, color='b', alpha=0.2, where='post')
    ax[i].fill_between(recall, precision, alpha=0.2, color='b')
    ax[i].set_xlabel('Recall')
    ax[i].set_ylabel('Precision')
    ax[i].set_title('Precision-Recall Curve: ' + list(models.keys())[i])
    ax[i].text(0.2, 0.3, f'AUPRC = {list(auprc.values())[i]:.4f}', fontsize=12, color='red')
fig.tight_layout(w_pad=6, h_pad=4)
fig.delaxes(ax[5])
plt.show()

In [None]:
# Highest AUPRC (select the model by its value, not by its name)
best_model = max(auprc, key=auprc.get)
print('Highest AUPRC:', best_model, '=', '{:.4f}'.format(auprc[best_model]))

## **9. Results and Conclusion**
<hr>

### Results
In summary, we compared five supervised classification models (Logistic Regression, Decision Tree, Random Forest, Naive Bayes, and Support Vector Machines (SVM)) for credit card fraud detection. The objective was to identify the most effective model for accurately identifying fraudulent transactions while minimizing false positives.

The performance of each model was evaluated using two critical metrics: the F1-score and the Area Under the Precision-Recall Curve (AUPRC).

The Random Forest model demonstrated the highest F1-score among the evaluated models (0.8603). Because the F1-score weights precision and recall equally, this suggests that the Random Forest model strikes the best balance between the two, making it well-suited for cases where false positives and false negatives are both costly. It classifies fraudulent transactions while keeping errors in both directions low.

On the other hand, the Support Vector Machines (SVM) model exhibited the highest AUPRC (0.9612). This indicates that the SVM model excels at correctly identifying and ranking fraudulent transactions with high precision. Given the focus on identifying the positive class and the imbalanced nature of fraud detection data, the SVM model's strong AUPRC performance is particularly noteworthy.


### Conclusion
In credit card fraud detection, accurate identification of fraudulent transactions is of paramount importance, and so is precision: false positives can disrupt legitimate transactions and erode user trust. The SVM model's high AUPRC score signifies its ability to identify and rank fraudulent transactions with precision. This mitigates the risk of false positives and highlights the model's capability to accurately classify rare instances, such as fraudulent transactions, by concentrating on the positive class.

Since credit card fraud data is inherently imbalanced, with a vast majority of transactions being legitimate, the SVM model's adeptness at handling such scenarios makes it a pragmatic choice. By emphasizing the positive class, it exhibits the ability to correctly identify rare instances of fraudulent activity in large and noisy data.

In conclusion, the Support Vector Machines (SVM) model stands as the optimal choice for credit card fraud detection among the models compared. Its strong AUPRC performance on highly imbalanced data underscores its suitability for a critical and dynamic domain.
<hr>

<hr>
<hr>

#### [GitHub Project Repository](https://github.com/yevi7113/Supervised-Learning-Project/tree/main)