## Fraud Detection with SMOTEENN sampling

Anomaly detection is one of the most popular machine learning use cases. A unique challenge in fraud detection is learning to classify fraud from a tiny sample of actual fraud cases. Most datasets contain a heavily imbalanced number of fraudulent and legitimate transactions, so training a model without any resampling can yield dangerous results: the model becomes biased towards the majority class because its training data is biased.

In this example, we demonstrate fraud detection using a random forest classifier, with SMOTEENN to resample our data.

### Table of Contents
- [Data import and pre-process](#Import-Data-and-Pre-Process)
- Classifying without over-sampling
- Classifying after over-sampling
- Conclusion

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

from imblearn.over_sampling import SMOTE

### Import Data and Pre-Process

We're using credit card data from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud). Fraudulent transactions are marked as `1` in the `Class` column. As the counts below show, fraud makes up well under 1% of the transactions in this dataset (492 of 284,807, roughly 0.17%).

In [2]:
raw_data = pd.read_csv("data/creditcard.csv")
raw_data["Class"].value_counts()

0    284315
1       492
Name: Class, dtype: int64
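
To see why accuracy alone is misleading on data like this, consider a hypothetical baseline that labels every transaction as legitimate. The sketch below uses only the counts printed above; the variable names are ours, for illustration.

```python
# Class counts taken from the value_counts() output above.
legit_count = 284315
fraud_count = 492
total = legit_count + fraud_count

# Share of fraudulent transactions in the dataset.
fraud_share = fraud_count / total

# A hypothetical "always predict legit" baseline is right on every legit
# transaction and wrong on every fraud, so its accuracy equals the legit share.
baseline_accuracy = legit_count / total
baseline_fraud_recall = 0 / fraud_count  # it never flags a single fraud

print("fraud share: {:.4%}".format(fraud_share))
print("baseline accuracy: {:.4%}".format(baseline_accuracy))
print("baseline fraud recall: {:.0%}".format(baseline_fraud_recall))
```

Such a baseline would score above 99.8% accuracy while catching no fraud at all, which is why the rest of this example looks at precision and recall for class `1`.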

Next, we split the data into training and test sets with a 70:30 ratio.

In [3]:
def feature_label_split(data, label_name):
    """Split dataset to features and labels."""
    
    labels = np.array(data[label_name])
    features = np.array(data.drop(label_name, axis=1))
    
    return features, labels

features, labels = feature_label_split(raw_data, label_name="Class")
train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.3)

print("training data: {} rows".format(train_x.shape[0]))
print("test data: {} rows".format(test_x.shape[0]))

training data: 199364 rows
test data: 85443 rows
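
The split above is purely random, so with so few fraud cases the fraud rate in the training and test sets can drift apart by chance. One option, not used in this notebook, is to pass `stratify=` (and a fixed `random_state`) to `train_test_split`. The sketch below shows this on a small synthetic array; the `demo_*` names and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 1% positives.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(10_000, 5))
demo_labels = np.zeros(10_000, dtype=int)
demo_labels[:100] = 1

# stratify keeps the class ratio (almost) identical in both splits;
# random_state makes the split reproducible.
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_features, demo_labels, test_size=0.3, stratify=demo_labels, random_state=42
)

print("train positive rate: {:.4f}".format(demo_train_y.mean()))
print("test positive rate:  {:.4f}".format(demo_test_y.mean()))
```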


In [4]:
def random_forest(train_x, test_x, train_y):
    """Execute random forest classifier."""
    
    forest = RandomForestClassifier(n_estimators=500, random_state=69)
    
    forest.fit(train_x, train_y)
    predictions = forest.predict(test_x)
    
    return predictions

    
def report(y_true, y_predict):
    """Show model performance report."""

    cm = confusion_matrix(y_true, y_predict)
    print("[CONFUSION MATRIX]")
    # scikit-learn puts true labels on rows and predicted labels on columns,
    # so with labels (0, 1) the layout is [[TN, FP], [FN, TP]].
    print("True Negative: {}\tFalse Positive: {}".format(cm[0][0], cm[0][1]))
    print("False Negative: {}\tTrue Positive: {}".format(cm[1][0], cm[1][1]))

    # recall/precision.
    print("\n[PRECISION/RECALL]")
    print(classification_report(y_true, y_predict))

    print("\nEND REPORT")

### Classifying without oversampling

In [5]:
%%time
predictions = random_forest(train_x, test_x, train_y)

report(test_y, predictions)

[CONFUSION MATRIX]
True Negative: 85286	False Positive: 8
False Negative: 30	True Positive: 119

[PRECISION/RECALL]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85294
           1       0.94      0.80      0.86       149

    accuracy                           1.00     85443
   macro avg       0.97      0.90      0.93     85443
weighted avg       1.00      1.00      1.00     85443


END REPORT
Wall time: 11min 17s


From the result, our model performed really well on legitimate transactions (class `0`), which is unsurprising since they make up over 99% of the dataset. It did less well on fraudulent transactions: the confusion matrix shows 30 false negatives (Type II errors) out of 149 actual frauds, which matches the 80% recall for class `1` in the precision/recall report.

This isn't a good model because *it labeled about one in five fraudulent transactions as legit*.
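
As a sanity check, the class `1` row of the report can be recomputed by hand from the confusion matrix, taking fraud as the positive class: 119 frauds caught, 30 frauds missed, and 8 legitimate transactions flagged as fraud. The variable names below are ours.

```python
# Counts for the fraud class (label 1), taken from the results above.
true_positive = 119   # frauds correctly flagged
false_negative = 30   # frauds predicted as legit
false_positive = 8    # legit transactions flagged as fraud

# Precision: of everything flagged as fraud, how much really was fraud.
precision = true_positive / (true_positive + false_positive)
# Recall: of all real frauds, how many were flagged.
recall = true_positive / (true_positive + false_negative)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1-score:  {:.2f}".format(f1))
```

These round to the 0.94, 0.80, and 0.86 shown for class `1` in the report.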

### Oversampling using SMOTE

Here `x` holds our features and `y` holds our labels.

In [13]:
# `raw_data` is the DataFrame loaded in the import step above.
x = raw_data.drop("Class", axis=1).copy()
y = raw_data["Class"].copy()

print(y.value_counts())

0    284315
1       492
Name: Class, dtype: int64


In [14]:
smote = SMOTE(random_state=42)
# Newer imbalanced-learn releases replaced `fit_sample` with `fit_resample`.
sx, sy = smote.fit_resample(x, y)
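
Conceptually, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and interpolating somewhere along the line between the two. The NumPy sketch below illustrates only that idea; it is a simplified toy, not imbalanced-learn's actual implementation, and `smote_sketch` and its parameters are hypothetical names.

```python
import numpy as np


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples (O(n^2), toy sizes only).
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every point.
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))        # random minority sample
        neighbour = rng.choice(neighbours[base])  # one of its k nearest neighbours
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic


# Toy minority class: the four corners of the unit square.
toy_minority = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
new_points = smote_sketch(toy_minority, n_new=10, k=2)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two existing minority points, SMOTE fills in the region the minority class already occupies rather than duplicating rows.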


In [15]:
s_train_features, s_test_features, s_train_labels, s_test_labels = train_test_split(sx, sy, test_size=0.3)

print("Train Dataset: ", s_train_features.shape, s_train_labels.shape)
print("Test Dataset: ", s_test_features.shape, s_test_labels.shape)

Train Dataset:  (398041, 30) (398041,)
Test Dataset:  (170589, 30) (170589,)


### Retrain random forest model on the SMOTE-resampled dataset

In [17]:
%%time

# create the model.
smote_forest = RandomForestClassifier(n_estimators=500, random_state=42)

# train the model.
smote_forest.fit(s_train_features, s_train_labels)

# make predictions.
s_predictions = smote_forest.predict(s_test_features)

In [18]:
# Compare the true test labels with the predictions from the previous cell.
report(s_test_labels, s_predictions)