## Fraud Detection with SMOTEENN

Anomaly detection is one of the most popular machine learning use cases. One of its challenges is classifying fraud when only a tiny number of transactions are actually fraudulent, which leaves the dataset heavily imbalanced between fraudulent and legit transactions.

Training a model on such a dataset can bias it towards one class: because there are so many legit transactions, the model is likely to predict the majority class only. One way to overcome that is oversampling.
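
To see why a majority-only model is a problem even though it looks accurate, here is a small sketch. The toy labels (1,000 transactions, 2 of them fraud) are made up for illustration and are not taken from the credit card data.

```python
# Toy labels: 1,000 transactions, only 2 of them fraud (class 1).
labels = [0] * 998 + [1] * 2

# A "model" that always predicts the majority class (legit, class 0).
predictions = [0] * len(labels)

correct = sum(1 for t, p in zip(labels, predictions) if t == p)
accuracy = correct / len(labels)

frauds_caught = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 1)
fraud_recall = frauds_caught / sum(labels)

print("accuracy:", accuracy)          # 0.998
print("fraud recall:", fraud_recall)  # 0.0
```

Accuracy looks great, yet not a single fraud transaction is caught, which is why accuracy alone isn't a useful score on data like this.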


### Reading and References
- [Detecting Finance Frauds with SMOTEENN](https://towardsdatascience.com/detecting-financial-fraud-using-machine-learning-three-ways-of-winning-the-war-against-imbalanced-a03f8815cce9)
- [Comparison between different sampling methods](https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html)
- [Dealing with Imbalance Data](https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18)


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

In this exercise, we're using credit card data from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud). Fraud transactions are marked as `1` in the `Class` column. As you can see, only about 0.17% of the transactions in this dataset are fraud.

In [2]:
raw_data = pd.read_csv("data/creditcard.csv")
raw_data["Class"].value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64

Let's split the data into training and test sets with the standard 80:20 ratio. Why? Seems like everyone is doing it.

In [10]:
# extract features and labels
features = np.array(raw_data.drop(["Time", "Class", "Amount"], axis=1))
labels = np.array(raw_data["Class"])

train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.2)

print("training data: {} rows".format(train_x.shape[0]))
print("test data: {} rows".format(test_x.shape[0]))

training data: 227845 rows
test data: 56962 rows


I'm going to use only `RandomForestClassifier`. Maybe some other day I'll do algorithm selection with other models, e.g. Logistic Regression.

In [11]:
def random_forest(train_x, test_x, train_y):
    """Execute random forest classifier."""
    
    forest = RandomForestClassifier(
        n_estimators=100,
        criterion="gini",
        max_depth=5,
        min_samples_split=2,
        min_samples_leaf=1
    )
    
    forest.fit(train_x, train_y)
    predictions = forest.predict(test_x)
    
    return predictions

We'll use a confusion matrix to map out the false positives and false negatives. For a fraud detection model, it's crucial to **reduce the number of False Negatives**: we don't want a model that flags actual fraud transactions as legit.

As a second opinion, we also measure the model performance using precision and recall.
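
To pin down what the four cells mean when fraud (class `1`) is the positive class, here is a minimal hand-rolled tally. The function name `confusion_counts` and the toy labels are made up for illustration. For 0/1 labels, scikit-learn's `confusion_matrix` returns the same counts laid out as `[[TN, FP], [FN, TP]]`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four confusion-matrix cells, treating `positive` as fraud."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            # Flagged as fraud: right if it really was fraud, else a false alarm.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Flagged as legit: a missed fraud is a false negative.
            counts["FN" if truth == positive else "TN"] += 1
    return counts


y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
```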

In [12]:
def report(y_true, y_predict):
    """Show model performance report."""

    # For labels 0/1, scikit-learn returns [[TN, FP], [FN, TP]],
    # with fraud (class 1) as the positive class.
    cm = confusion_matrix(y_true, y_predict)
    print("[CONFUSION MATRIX]")
    print("True Negative: {}\tFalse Positive: {}".format(cm[0][0], cm[0][1]))
    print("False Negative: {}\tTrue Positive: {}".format(cm[1][0], cm[1][1]))

    # recall/precision.
    print("\n[PRECISION/RECALL]")
    print(classification_report(y_true, y_predict))
    
    print("\nEND REPORT")

Let's give this a go.

In [13]:
%%time
predictions = random_forest(train_x, test_x, train_y)

report(test_y, predictions)

[CONFUSION MATRIX]
True Negative: 56866	False Positive: 3
False Negative: 25	True Positive: 68

[PRECISION/RECALL]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56869
           1       0.96      0.73      0.83        93

    accuracy                           1.00     56962
   macro avg       0.98      0.87      0.91     56962
weighted avg       1.00      1.00      1.00     56962


END REPORT
Wall time: 53.4 s


There you go. The model performed really well on legit transactions (Class `0`), but it missed 25 of the 93 fraud transactions (Class `1`), a recall of only 0.73. *It predicted many fraud transactions as legit*, and that makes it a pretty bad fraud model, because we want to minimize Type II errors (false negatives).
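
The Class `1` row of the report can be reproduced by hand from the counts in the run above: 93 fraud transactions in the test set, 68 of them caught, 25 missed, plus 3 legit transactions wrongly flagged. A quick sketch of the arithmetic:

```python
# Counts for the fraud class (1), read off the baseline report.
tp = 68  # fraud predicted as fraud
fp = 3   # legit predicted as fraud
fn = 25  # fraud predicted as legit (the Type II errors)

precision = tp / (tp + fp)  # of everything flagged as fraud, how much was fraud
recall = tp / (tp + fn)     # of all actual fraud, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print("precision: {:.2f}".format(precision))  # 0.96
print("recall:    {:.2f}".format(recall))     # 0.73
print("f1-score:  {:.2f}".format(f1))         # 0.83
```

Recall is the number that suffers from false negatives, so it's the one to watch here.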

### Oversampling using SMOTEENN

#### Power of "Fake Data"

To combat class imbalance, we can apply oversampling to our fraud transactions (Class `1`). Oversampling creates synthetic datapoints based on the attributes of existing fraud transactions, so that both classes end up with roughly the same amount of data. This gives our Random Forest model more fraud examples to train on, and thus less bias towards predicting only legit transactions.

We're gonna use **SMOTEENN** as our oversampling technique.

#### What is SMOTEENN

It's a resampling technique that combines two methods: it creates synthetic samples using **SMOTE**, then cleans up the data using **ENN**.

SMOTE **(Synthetic Minority Oversampling TEchnique)** works in a few steps:
- For each fraud transaction **x**, find its [K-Nearest Neighbours](https://medium.com/@chiragsehra42/k-nearest-neighbors-explained-easily-c26706aa5c7f) among the other fraud transactions
- Draw a line between **x** and one of those neighbours, picked at random
- Create a synthetic sample at a random point along that line
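
The steps above can be sketched in plain NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the function name `smote_sketch`, the brute-force neighbour search, and the toy `fraud` array are all assumptions made for the example.

```python
import numpy as np


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples (brute force).
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))    # pick a fraud sample x
        other = rng.choice(neighbours[base])  # pick one of its neighbours
        gap = rng.random()                    # position along the line, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic


# Toy fraud samples with two features each (illustrative values only).
fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.3], [0.9, 0.8], [1.3, 1.1]])
new_points = smote_sketch(fraud, n_new=4, k=3)
print(new_points.shape)  # (4, 2)
```

Every synthetic point sits on a line segment between two real fraud samples, so it never leaves the region the fraud data already spans.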

This technique tends to introduce noisy data into the dataset. Because it creates synthetic datapoints along the lines between neighbouring fraud transactions, it can create data that sits borderline close to legit transactions. This may hinder our model performance, but that's where ENN comes in.

ENN **(Edited Nearest Neighbours)** removes any datapoint whose class differs from the majority class among its nearest neighbours. This undersampling method cleans noisy data (especially borderline datapoints between classes) by removing it, which means it can remove both legit and fraud transactions, so that legit and fraud datapoints become easier to tell apart.
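
Here is a minimal sketch of that majority-vote rule, assuming binary labels and an odd `k`. The function name `enn_sketch` and the toy data are made up for illustration; imbalanced-learn's own implementation has more options and may behave differently in the details.

```python
import numpy as np


def enn_sketch(X, y, k=3):
    """Drop every row whose label disagrees with the majority of its k neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Brute-force pairwise distances; a point is not its own neighbour.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    keep = np.empty(len(X), dtype=bool)
    for i, idx in enumerate(neighbours):
        # Share of the k neighbours carrying the same label as row i.
        agree = np.mean(y[idx] == y[i])
        keep[i] = agree > 0.5
    return X[keep], y[keep]


X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2], [-0.2, 0.1],  # legit cluster
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2], [4.8, 5.1],  # fraud cluster
              [0.1, 0.1]])                                                  # stray fraud point
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

X_clean, y_clean = enn_sketch(X, y, k=3)
print(len(X), "->", len(X_clean))  # 11 -> 10
```

The stray fraud point sitting inside the legit cluster is the only row removed, because all of its nearest neighbours are legit.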

Let's go ahead and apply SMOTEENN to our dataset, where `x` holds the features and `y` holds the labels of our original credit card dataset.

In [15]:
%%time
# extract features and labels
x = raw_data.drop(["Time", "Class", "Amount"], axis=1).copy()
y = raw_data["Class"].copy()

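smote = SMOTEENN(sampling_strategy="auto", random_state=42)
# fit_resample replaces the older fit_sample, which recent imbalanced-learn releases no longer provide
sx, sy = smote.fit_resample(x, y)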

print("Feature size before SMOTEENN: ", len(x))
print("Feature size after SMOTEENN: ", len(sx))

Feature size before SMOTEENN:  284807
Feature size after SMOTEENN:  568222
Wall time: 9min 22s


In [16]:
s_train_features, s_test_features, s_train_labels, s_test_labels = train_test_split(sx, sy, test_size=0.3)

print("training data: {} rows".format(s_train_features.shape[0]))
print("test data: {} rows".format(s_test_features.shape[0]))

training data: 397755 rows
test data: 170467 rows


...now let's retrain the model and see our results.

In [17]:
%%time

smoteenn_predictions = random_forest(s_train_features, s_test_features, s_train_labels)

report(s_test_labels, smoteenn_predictions)

[CONFUSION MATRIX]
True Negative: 85100	False Positive: 330
False Negative: 9122	True Positive: 75915

[PRECISION/RECALL]
              precision    recall  f1-score   support

           0       0.90      1.00      0.95     85430
           1       1.00      0.89      0.94     85037

    accuracy                           0.94    170467
   macro avg       0.95      0.94      0.94    170467
weighted avg       0.95      0.94      0.94    170467


END REPORT
Wall time: 1min 40s


### END