## Fraud Detection with SMOTEENN

Anomaly detection is one of the most popular machine learning use cases. Fraud detection poses a unique challenge: the model has to learn what fraud looks like from a tiny number of actual fraud transactions. Most datasets contain far more legit transactions than fraudulent ones, so training a model without any sampling method can yield dangerous results: the model ends up biased towards the majority class because its training data is biased.

In this example, we demonstrate fraud detection using a random forest classifier, with SMOTEENN to balance our dataset.


### Table of Contents
- [Import data](#Import-Data)
- [Classify using Random Forest](#Classify-using-Random-Forest)
- [Oversampling using SMOTEENN](#Oversampling-using-SMOTEENN)
- [Conclusion](#...and-we-achieve-100%-precision/recall.-wait-wha-)

### Reading and References
- [Detecting Finance Frauds with SMOTEENN](https://towardsdatascience.com/detecting-financial-fraud-using-machine-learning-three-ways-of-winning-the-war-against-imbalanced-a03f8815cce9)
- [Comparison between different sampling methods](https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html)
- [Dealing with Imbalance Data](https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18)


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

### Import Data

We're using credit card data from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud). Fraud transactions are marked as `1` in the `Class` column. As you can see, only about 0.17% of the transactions in this dataset are fraudulent.

In [2]:
raw_data = pd.read_csv("data/creditcard.csv")
raw_data["Class"].value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64
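
To see why this ratio is a problem, consider a hypothetical baseline that labels every transaction as legit. The sketch below is not part of the original notebook: it only uses the class share printed above and the dataset's 284807 rows, and the variable names are illustrative.

```python
# Illustrative numbers taken from this notebook's outputs
n_rows = 284807          # total transactions in creditcard.csv
fraud_share = 0.001727   # share of Class 1 from value_counts above

n_fraud = round(n_rows * fraud_share)
n_legit = n_rows - n_fraud

# A baseline that always predicts Class 0 gets every legit row right
# and every fraud row wrong.
baseline_accuracy = n_legit / n_rows
baseline_fraud_recall = 0 / n_fraud

print(f"accuracy: {baseline_accuracy:.4f}, fraud recall: {baseline_fraud_recall:.2f}")
```

Such a baseline scores above 99.8% accuracy while catching no fraud at all, which is why accuracy alone tells us very little here.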

Next, we split the data into training and test sets with a 70:30 ratio and feed the training data to our random forest classifier.

In [3]:
def feature_label_split(data, label_name):
    """Split dataset to features and labels."""
    
    labels = np.array(data[label_name])
    features = np.array(data.drop(label_name, axis=1))
    
    return features, labels

features, labels = feature_label_split(raw_data, label_name="Class")
train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.3)

print("training data: {} rows".format(train_x.shape[0]))
print("test data: {} rows".format(test_x.shape[0]))

training data: 199364 rows
test data: 85443 rows


### Classify using Random Forest

I've written a wrapper function that initialises a Random Forest classifier and takes the training features `train_x`, the training labels `train_y`, and the test features `test_x`. It fits the model on the training data (`train_x` and `train_y`), then predicts whether each transaction in `test_x` is legit or fraud.

In [4]:
def random_forest(train_x, test_x, train_y):
    """Execute random forest classifier."""
    
    forest = RandomForestClassifier(
        n_estimators=100,
        criterion="gini",
        max_depth=5,
        min_samples_split=2,
        min_samples_leaf=1
    )
    
    forest.fit(train_x, train_y)
    predictions = forest.predict(test_x)
    
    return predictions

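from sklearn.metrics import roc_curve, auc

def compute_roc_auc(clf, X, y, index):
    """Compute the ROC curve and AUC of a fitted classifier on the rows selected by index."""
    # X and y are expected to be a pandas DataFrame/Series; column 1 is P(Class == 1).
    y_predict = clf.predict_proba(X.iloc[index])[:, 1]
    fpr, tpr, thresholds = roc_curve(y.iloc[index], y_predict)
    auc_score = auc(fpr, tpr)
    return fpr, tpr, auc_score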

I will then use the predictions from the `random_forest` function to report the model's performance by comparing them with the actual labels `test_y`.

We'll use a confusion matrix to count the false positives and false negatives. For a fraud detection model, it's crucial to **reduce the number of false negatives**: we don't want a model that flags actual fraud transactions as legit. We also measure the model's performance using precision and recall.

In [5]:
def report(y_true, y_predict):
    """Show model performance report."""

    cm = confusion_matrix(y_true, y_predict)
    print("[CONFUSION MATRIX]")
    # Fraud (Class 1) is the positive class; rows are true labels, columns are predictions.
    print("True Negative: {}\tFalse Positive: {}".format(cm[0][0], cm[0][1]))
    print("False Negative: {}\tTrue Positive: {}".format(cm[1][0], cm[1][1]))

    # recall/precision.
    print("\n[PRECISION/RECALL]")
    print(classification_report(y_true, y_predict))
    
    print("\nEND REPORT")

In [6]:
%%time
predictions = random_forest(train_x, test_x, train_y)

report(test_y, predictions)

[CONFUSION MATRIX]
True Negative: 85273	False Positive: 14
False Negative: 42	True Positive: 114

[PRECISION/RECALL]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85287
           1       0.89      0.73      0.80       156

    accuracy                           1.00     85443
   macro avg       0.95      0.87      0.90     85443
weighted avg       1.00      1.00      1.00     85443


END REPORT
Wall time: 56.5 s


From the result, our model performed really well on legit transactions (Class `0`). It didn't do as well on fraud transactions (Class `1`) because of the false negatives (Type II errors) in our confusion matrix: 42 of the 156 fraud transactions in the test set were missed. The precision/recall report shows the same thing (73% recall for Class `1`).

This isn't a good model because *it predicted many fraud transactions as legit*, and Type II errors are exactly what we want to minimize in our case.
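
The Class `1` row of the precision/recall report can be reproduced by hand from the four counts in the confusion matrix above. This is a small worked check rather than part of the original notebook; it treats fraud (Class `1`) as the positive class, and the variable names are illustrative.

```python
# Counts from the first confusion matrix above (fraud = positive class)
tn, fp = 85273, 14    # true class 0: predicted legit / predicted fraud
fn, tp = 42, 114      # true class 1: predicted legit / predicted fraud

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual fraud transactions, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.89 recall=0.73 f1=0.80
```

Recall is the number to watch here, since every false negative is a fraud transaction that slipped through.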

### Oversampling using SMOTEENN

#### Power of "Fake Data".

To combat class imbalance, we can apply oversampling to our fraud transactions (Class `1`). Oversampling creates synthetic datapoints based on the attributes of existing fraud transactions, so that both classes end up with a similar amount of data. This gives our Random Forest model more fraud examples to train on, and thus less bias towards predicting only legit transactions.

We're gonna use **SMOTEENN** as our oversampling technique.

#### What is SMOTEENN

It's a resampling technique that combines two methods: it creates synthetic samples using **SMOTE**, then cleans up the data using **ENN**.

SMOTE **(Synthetic Minority Oversampling TEchnique)** works in a few steps:
- For a fraud transaction **x**, find its [K-Nearest Neighbours](https://medium.com/@chiragsehra42/k-nearest-neighbors-explained-easily-c26706aa5c7f) among the other fraud transactions
- Draw a line between **x** and one of those neighbours, picked at random
- Create a synthetic sample at a random point along that line
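
The steps above can be sketched in plain Python. This is a minimal illustration of the interpolation idea, not imbalanced-learn's implementation; `smote_sample`, its parameters, and the toy `fraud_points` are hypothetical names made up for this example.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point from a list of minority-class points."""
    rng = rng or random.Random()
    x = rng.choice(minority)
    # k nearest minority neighbours of x (excluding x itself)
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: math.dist(x, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    # Random position along the segment between x and the chosen neighbour
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbour))

# Toy fraud transactions with two features each (corners of a unit square)
fraud_points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
rng = random.Random(0)
synthetic = [smote_sample(fraud_points, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point lands somewhere on a segment between two existing fraud points, so no new point falls outside the region the real fraud data already spans.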

This technique tends to introduce noisy data into the dataset. Because it creates synthetic datapoints along the lines between fraud transactions and their nearest neighbours, some of them can end up borderline close to legit transactions. This may hinder our model's performance, but that's where ENN comes in.

ENN **(Edited Nearest Neighbours)** removes any datapoint whose class differs from the majority class among its nearest neighbours. This undersampling method cleans noisy data (especially borderline datapoints between classes) by removing it, which can mean removing both legit and fraud transactions, so that legit and fraud datapoints are easier to tell apart.
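
The ENN rule described above can be sketched in a few lines of plain Python. This is an illustration of the majority-vote rule only; library implementations may differ in details, and `edited_nearest_neighbours` plus the toy data are hypothetical names for this example.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, k=3):
    """Keep only points whose label matches the majority of their k nearest neighbours."""
    keep = []
    for i, p in enumerate(points):
        # Indices of all other points, closest first
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        votes = Counter(labels[j] for j in others[:k])
        majority_label = votes.most_common(1)[0][0]
        if labels[i] == majority_label:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy data: a legit cluster, a fraud cluster, and one fraud point
# sitting in the middle of the legit cluster (the "noisy" point).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (0.5, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

clean_points, clean_labels = edited_nearest_neighbours(points, labels, k=3)
print(len(points), "->", len(clean_points))
```

In this toy example only the fraud point at `(0.5, 0.5)` is dropped, because all three of its nearest neighbours are legit.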

Let's go ahead and apply SMOTEENN to our dataset. Here `x` holds the features and `y` the labels of our original credit card dataset.

In [7]:
%%time
x = raw_data.drop("Class", axis=1).copy()
y = raw_data["Class"].copy()

smote = SMOTEENN(sampling_strategy="auto", random_state=42)
sx, sy = smote.fit_resample(x, y)  # fit_sample was removed in newer imbalanced-learn releases

print("Feature size before SMOTEENN: ", len(x))
print("Feature size after SMOTEENN: ", len(sx))

Feature size before SMOTEENN:  284807
Feature size after SMOTEENN:  541135
Wall time: 15.6 s


In [8]:
s_train_features, s_test_features, s_train_labels, s_test_labels = train_test_split(sx, sy, test_size=0.3)

print("training data: {} rows".format(s_train_features.shape[0]))
print("test data: {} rows".format(s_test_features.shape[0]))

training data: 378794 rows
test data: 162341 rows


...now let's retrain the model and see our results.

In [9]:
%%time

smoteenn_predictions = random_forest(s_train_features, s_test_features, s_train_labels)

report(s_test_labels, smoteenn_predictions)

[CONFUSION MATRIX]
True Negative: 79381	False Positive: 385
False Negative: 4474	True Positive: 78101

[PRECISION/RECALL]
              precision    recall  f1-score   support

           0       0.95      1.00      0.97     79766
           1       1.00      0.95      0.97     82575

    accuracy                           0.97    162341
   macro avg       0.97      0.97      0.97    162341
weighted avg       0.97      0.97      0.97    162341


END REPORT
Wall time: 1min 35s
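
One thing to keep in mind when reading this report: SMOTEENN was applied to the whole dataset before the split, so the test set above also contains resampled data and is roughly balanced (79766 legit vs 82575 fraud rows), unlike the original 0.17% fraud share. A common alternative is to split first and resample only the training portion. The sketch below illustrates that order of operations in plain Python. `split_then_oversample` is a hypothetical helper, and simple random duplication of fraud rows stands in for SMOTEENN, so this is not the notebook's method.

```python
import random

def split_then_oversample(labels, test_size=0.3, seed=42):
    """Split row indices first, then balance only the training indices."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Stand-in for SMOTEENN (assumption): duplicate random fraud rows
    # until both classes have the same number of training rows.
    fraud = [i for i in train_idx if labels[i] == 1]
    legit = [i for i in train_idx if labels[i] == 0]
    if fraud:
        extra = [rng.choice(fraud) for _ in range(len(legit) - len(fraud))]
        train_idx = train_idx + extra

    return train_idx, test_idx

# Toy labels: 980 legit rows and 20 fraud rows
toy_labels = [0] * 980 + [1] * 20
train_idx, test_idx = split_then_oversample(toy_labels)

train_fraud = sum(toy_labels[i] for i in train_idx)
test_fraud = sum(toy_labels[i] for i in test_idx)
print("train rows:", len(train_idx), "of which fraud:", train_fraud)
print("test rows:", len(test_idx), "of which fraud:", test_fraud)
```

With this ordering the test rows are all original, unduplicated transactions, so the report would reflect the class ratio the model is likely to face in practice.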


### END