<a href="https://colab.research.google.com/github/zcwisc/GB657/blob/main/Module_1_FraudDetectionViaLOF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fraud Detection via Local Outlier Factor Anomaly Detection

One area of application for *anomaly detection* is identifying **fraud**, e.g. in financial or insurance transactions. There are several key reasons why anomaly detection, as opposed to or in addition to supervised learning / classification (i.e., predicting the "fraud" label), is useful here:
- Often only a small number of transactions are known to be fraudulent; many fraud cases may never have been detected.
- Fraudsters may change their strategies; just "learning" what characterized detected fraud in the past may not be appropriate.

## Local Outlier Factor for Anomaly Detection

We consider fraud detection via **Local Outlier Factor** Anomaly Detection. From [this Kaggle Codebook](https://www.kaggle.com/code/vijeetnigam26/credit-card-fraud-detection):

- LOF (Local Outlier Factor) is an unsupervised anomaly detection algorithm that assesses the local density deviation of a data point compared to its neighbors. It quantifies the degree of abnormality of a data point based on its relative density.

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F1341832%2F27b93ecfd56afa80040f9b1ebecccbed%2F1_217TN2_-cgZ1d7hZUWhYUA.png?generation=1600286513104059&alt=media" height=25% width=35% style="text-align:center;">

- The LOF algorithm works by calculating a local reachability density for each data point, which represents how isolated or tightly grouped the point is compared to its neighbors. Anomaly scores are assigned based on the degree to which a point's density deviates from the density of its neighbors. Points with significantly lower density compared to their neighbors are considered outliers with higher LOF scores.

- LOF is effective in identifying anomalies in datasets with varying densities or clusters of different sizes. It can handle data with complex structures and does not rely on strict assumptions about the data distribution. LOF provides a local perspective on anomalies, allowing for more fine-grained anomaly detection in the dataset.
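
To make the idea of "density relative to the neighbors" concrete, here is a minimal sketch on synthetic data. The two toy clusters, the isolated point, and names such as `toy_X` are made up for illustration and are not part of the fraud dataset. The sketch uses scikit-learn's `LocalOutlierFactor` and its `negative_outlier_factor_` attribute, which stores the negated LOF score of each training point:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Two synthetic clusters with very different densities (illustrative only)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
sparse = rng.normal(loc=8.0, scale=1.5, size=(100, 2))

# One point that sits well outside the dense cluster
isolated = np.array([[0.0, 3.0]])

toy_X = np.vstack([dense, sparse, isolated])

toy_lof = LocalOutlierFactor(n_neighbors=20)
toy_lof.fit(toy_X)

# scikit-learn stores the negated score, so flip the sign to get the LOF
toy_scores = -toy_lof.negative_outlier_factor_

print("median LOF over all points:", round(float(np.median(toy_scores)), 2))
print("LOF of the isolated point: ", round(float(toy_scores[-1]), 2))
```

Scores close to 1 mean that a point is about as dense as its neighbors, so points inside the sparse cluster are not penalized for being far apart. The isolated point should receive a clearly larger score because its neighbors all live in a much denser region.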

So let's take a look.

## Data

### Preparatory Steps

As usual, let's clone our git repository to get access to the data.

In [None]:
!git clone https://github.com/zcwisc/GB657.git

And let's import the relevant libraries.

In [None]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.neighbors import LocalOutlierFactor

## Kaggle Credit Card Fraud Dataset

We use a sample of a Kaggle dataset of credit card transactions, some of which are known to be fraudulent. The full dataset has about 285K transactions, but we use a small sample for ease of computation (you can try it out on the larger dataset!):

In [None]:
ccdata = pd.read_csv('GB657/Module_1_CreditCardFraud_sample.csv', index_col=0)

In [None]:
ccdata.head()

The data contains a sample of credit card transactions made by European cardholders over two days in September 2013. It contains only numerical features V1-V28, which appear to result from transformations of the original variables (details were protected by the company). We also have 'Time' and 'Amount': 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset, and 'Amount' is the transaction amount. Importantly, 'Class' is the response variable; it takes the value 1 if the transaction was fraudulent and 0 otherwise.

Since we have little interpretable information, we won't dive into a detailed *exploratory data analysis* (you should in an actual application, though) and simply jump into the anomaly detection. Our feature matrix 'X' contains all columns except 'Class', and 'y' records whether each transaction is fraudulent:

In [None]:
X = ccdata.drop('Class', axis=1)
y = ccdata['Class']



## Anomaly Detection via Local Outlier Factor

We now apply the local outlier factor analysis. As an input, we need to set the expected share of outliers in the data (the `contamination` parameter). We use a value based on experience: around 500 of the roughly 285K transactions in the full dataset are known to be fraudulent, which is on the order of one in 500. However, we may have a sense that we miss about one in two fraudulent transactions, so we double that rate and set:

In [None]:
outlier_fraction = 1/250
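
The back-of-the-envelope reasoning behind this value can be written out as a short sketch. The exact counts below are the ones published for the full Kaggle dataset (the text rounds them), and `assumed_detection_share` is a hypothetical name for the "one in two" guess:

```python
# Published counts for the full Kaggle credit card fraud dataset
n_transactions = 284_807
n_known_fraud = 492

known_rate = n_known_fraud / n_transactions

# Assumption taken from the text: only about one in two frauds is detected
assumed_detection_share = 0.5
implied_rate = known_rate / assumed_detection_share

print(f"known fraud rate:   1 in {1 / known_rate:.0f}")
print(f"implied fraud rate: 1 in {1 / implied_rate:.0f}")
```

The implied rate comes out somewhat below the rounded 1/250 used here, so the rounded value errs on the side of flagging slightly more transactions.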

We rely on the default parameters otherwise (check the [documentation](https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.LocalOutlierFactor.html) for details):

In [None]:
LOF_detect = LocalOutlierFactor(contamination=outlier_fraction)

We fit the detector and obtain a prediction for each sample:

In [None]:
y_pred = LOF_detect.fit_predict(X)

The output here is '1' for non-outliers and '-1' for outliers, so we convert it to a boolean outlier flag:

In [None]:
outliers = (y_pred == -1)
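
It may help to see what `contamination` does under the hood. The following sketch uses synthetic Gaussian data as a stand-in for the feature matrix (`demo_X` and the other `demo_*` names are illustrative and not taken from the notebook's data). It checks that the `-1` labels correspond to the points whose `negative_outlier_factor_` lies below the fitted threshold `offset_`:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
demo_X = rng.normal(size=(1000, 5))       # synthetic stand-in for the features
demo_contamination = 0.01                 # ask for roughly 1% outliers

demo_lof = LocalOutlierFactor(contamination=demo_contamination)
demo_pred = demo_lof.fit_predict(demo_X)  # 1 = inlier, -1 = outlier
demo_outliers = (demo_pred == -1)

# The labels follow from thresholding the negated LOF scores at offset_
same_rule = np.array_equal(
    demo_outliers,
    demo_lof.negative_outlier_factor_ < demo_lof.offset_,
)

print("flagged:", int(demo_outliers.sum()), "of", len(demo_X))
print("matches the offset_ rule:", same_rule)
```

So the detector flags roughly the requested share of points whether or not they are truly anomalous: `contamination` determines how many points are flagged, not how the scores are computed.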

Let's visualize via a confusion matrix with the known fraud cases:

In [None]:
# Rows are the known labels (y), columns are the LOF outlier flags
conf_matrix = confusion_matrix(y, outliers.astype(int))

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Not Outlier', 'Outlier'],
            yticklabels=['Not Fraud', 'Fraud'])
plt.xlabel('Predicted (LOF)')
plt.ylabel('Actual (Class)')
plt.title('Confusion Matrix')
plt.show()
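
As a reminder of how to read such a matrix, here is a tiny sketch with made-up labels (`toy_actual` and `toy_flagged` are hypothetical and unrelated to the credit card data). In scikit-learn's `confusion_matrix`, rows correspond to the actual classes and columns to the predicted ones, so for a binary problem the four cells can be unpacked with `ravel()`:

```python
from sklearn.metrics import confusion_matrix

toy_actual = [0, 0, 0, 0, 1, 1]    # 1 = known fraud
toy_flagged = [0, 1, 0, 0, 1, 0]   # 1 = flagged as outlier

toy_cm = confusion_matrix(toy_actual, toy_flagged)

# Row-major order for binary labels: TN, FP, FN, TP
tn, fp, fn, tp = toy_cm.ravel()

print(toy_cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The bottom-right cell (fraud cases that were flagged) and the bottom-left cell (fraud cases that were missed) are the ones of most interest here.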

So, 7 of the 166 detected outliers are fraudulent, and we are missing 45 fraud cases. It doesn't perform very well on this sample. However, it appears from the [work by others](https://www.kaggle.com/code/vijeetnigam26/credit-card-fraud-detection) that this approach gives better results on the full dataset. We can also work on tuning the approach some more.
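
To put these counts in more familiar terms, the following sketch converts them into precision and recall. It uses only the three numbers reported above; the variable names are illustrative:

```python
# Counts as reported in the text
flagged_total = 166     # transactions flagged as outliers
true_positives = 7      # flagged and actually fraudulent
false_negatives = 45    # fraudulent but not flagged

false_positives = flagged_total - true_positives

# Precision: share of flagged transactions that are fraudulent
precision = true_positives / (true_positives + false_positives)
# Recall: share of known fraud cases that were flagged
recall = true_positives / (true_positives + false_negatives)

print(f"precision: {precision:.1%}")
print(f"recall:    {recall:.1%}")
```

That is, only about 4% of the flagged transactions are known fraud, and about 13% of the known fraud cases are caught.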