## Fraud Detection
This exercise is a follow-up to the previous one (Fraud Detection with SMOTEENN).

Today, I want to focus on **hyperparameter tuning**: searching over a model's settings to find the combination that gives the best results.

In [13]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

from imblearn.combine import SMOTEENN

For the sake of testing, I'm going to reuse the original dataset with no SMOTEENN over-sampling. So I have a smaller dataset, and hopefully this reduces total processing time.

In [2]:
# read data from csv.
raw_data = pd.read_csv("data/creditcard.csv")

# extract features and labels
features = np.array(raw_data.drop(["Time", "Amount", "Class"], axis=1))
labels = np.array(raw_data["Class"])

X = features
y = labels

# hold off SMOTEENN over-sampling for the time being.

# smoteenn = SMOTEENN(sampling_strategy="auto", random_state=42)
# X, y = smoteenn.fit_resample(features, labels)

# print("Feature size after SMOTEENN: ", len(X))

### Apply PCA

Normally I wouldn't apply PCA to a dataset with only 28 features, but it might reduce computation time, and every second counts. I checked earlier and found I can reduce the dimensionality to 18 components and still retain ~87% of the variance, so let's go ahead with that.

In [3]:
pca = PCA(n_components=18)

reduced_X = pca.fit_transform(X)
print("Sum of Explained Variance Ratio: {}".format(sum(pca.explained_variance_ratio_)))


Sum of Explained Variance Ratio: 0.874525524402183
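
For reference, here is a minimal sketch of how a component count like this could be picked: fit PCA with all components, accumulate `explained_variance_ratio_`, and take the smallest count that reaches a target. The data is synthetic and the names `X_demo`, `target` and `n_needed` are made up for illustration, so the number it prints says nothing about the credit card data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 28 features (assumption: NOT the real dataset).
rng = np.random.default_rng(42)
scales = np.linspace(3.0, 0.2, 28)           # each column gets a smaller spread
X_demo = rng.normal(size=(2000, 28)) * scales

# Fit with all components, then accumulate the explained variance ratio.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.87
n_needed = int(np.searchsorted(cumulative, target) + 1)

print("components needed for {:.0%}: {}".format(target, n_needed))
```
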


### RandomizedSearchCV

Here's the list of hyperparameters I would like to tune. I could use either **GridSearchCV** or **RandomizedSearchCV**. GridSearchCV tries every combination, so it would find the best combination in the grid, but it would take ages. Since we don't have ages and want a reasonable trade-off between model performance and computation time, let's use RandomizedSearchCV, which only tries a random sample of the combinations.

In [4]:
forest = RandomForestClassifier(random_state=42)

param_distribution = {
    "n_estimators": [100, 200],
    # "max_features": ["auto", "sqrt", "log2"],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
    "max_depth": [4, 5, 6],
    "criterion": ["gini", "entropy"],
}

# split dataset to training and test set with 80:20 ratio.
# note: this splits the raw `features`, so `reduced_X` from the PCA step is not used below.
train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.2)
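
To get a feel for the cost before pressing run, the number of model fits can be counted up front. This is a rough sketch that assumes `cv=5` and `n_iter=50` (the values used for the search in this notebook) and ignores the one extra refit each search does at the end.

```python
from math import prod

# Same candidate values as param_distribution in the notebook.
param_distribution = {
    "n_estimators": [100, 200],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
    "max_depth": [4, 5, 6],
    "criterion": ["gini", "entropy"],
}

cv_folds = 5   # assumption: matches cv=5 in the search call
n_iter = 50    # assumption: matches n_iter=50 in the search call

grid_size = prod(len(values) for values in param_distribution.values())
grid_fits = grid_size * cv_folds     # fits an exhaustive grid search would need
random_fits = n_iter * cv_folds      # fits the randomized search needs

print("combinations in the grid:", grid_size)    # 108
print("GridSearchCV fits:", grid_fits)           # 540
print("RandomizedSearchCV fits:", random_fits)   # 250
```

So the randomized search covers a bit under half of the grid for a bit under half of the work.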

And away we go...

In [5]:
%%time

cv_forest = RandomizedSearchCV(estimator=forest, param_distributions=param_distribution, cv=5, n_iter=50)
cv_forest.fit(train_x, train_y)

Wall time: 5h 6min


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None

...oh god. I finished 2 episodes of Chernobyl, and it's still running!!
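
In hindsight, a few settings might have cut this down. The sketch below is not what I ran: it uses a small synthetic dataset from `make_classification`, a trimmed-down parameter list, and illustrative values for `n_iter` and the number of folds. It shows three options: `n_jobs=-1` on the forest so trees are built on all cores, a smaller `n_iter`, and `scoring="recall"` so candidates are ranked by how much fraud they catch instead of by accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small synthetic imbalanced dataset (assumption: stands in for the real data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

# n_jobs=-1 lets the forest build its trees on all available cores.
forest = RandomForestClassifier(random_state=42, n_jobs=-1)

param_distribution = {
    "n_estimators": [20, 40],   # kept small so the demo runs quickly
    "max_depth": [4, 5, 6],
    "criterion": ["gini", "entropy"],
}

n_iter = 4      # illustrative: fewer sampled candidates means fewer fits
n_splits = 3    # illustrative: fewer folds means fewer fits per candidate

search = RandomizedSearchCV(
    estimator=forest,
    param_distributions=param_distribution,
    n_iter=n_iter,
    cv=StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42),
    scoring="recall",           # rank candidates by recall on class 1
    random_state=42,            # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("total fits:", n_iter * n_splits + 1)  # candidates x folds, plus the refit
```
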

Here are the best hyperparameters the search found.

In [6]:
cv_forest.best_params_

{'n_estimators': 200,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 6,
 'criterion': 'entropy'}

In [7]:
cv_forest.best_score_

0.9994864930105993

Let's create a new RandomForestClassifier with the best hyperparameters, and retrain the model.

In [8]:
best_forest = RandomForestClassifier(
    n_estimators=200,
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=6,
    criterion="entropy"
)

In [15]:
%%time

best_forest.fit(train_x, train_y)

predictions = best_forest.predict(test_x)

Wall time: 3min 5s


In [16]:
cm = confusion_matrix(test_y, predictions)

# rows are actual classes, columns are predicted classes;
# class 1 (fraud) is treated as the positive class.
print("[CONFUSION MATRIX]")
print("True Negative: {}\tFalse Positive: {}".format(cm[0][0], cm[0][1]))
print("False Negative: {}\tTrue Positive: {}".format(cm[1][0], cm[1][1]))

# recall/precision.
print("\n[PRECISION/RECALL]")
print(classification_report(test_y, predictions))

[CONFUSION MATRIX]
True Negative: 56858	False Positive: 3
False Negative: 27	True Positive: 74

[PRECISION/RECALL]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56861
           1       0.96      0.73      0.83       101

    accuracy                           1.00     56962
   macro avg       0.98      0.87      0.92     56962
weighted avg       1.00      1.00      1.00     56962
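
To see where the numbers in the report come from, here is the arithmetic done by hand. It reads the matrix with fraud (class 1) as the positive class: 56858 legit transactions passed, 3 legit transactions flagged, 27 fraud cases missed, 74 fraud cases caught. The last line is the accuracy of a hypothetical model that labels everything as legit, for comparison.

```python
# Counts from the confusion matrix, with fraud (class 1) as the positive class.
tn, fp, fn, tp = 56858, 3, 27, 74

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that predicts "legit" for every transaction.
baseline_accuracy = (tn + fp) / total

print("accuracy:          {:.4f}".format(accuracy))           # 0.9995
print("precision (fraud): {:.4f}".format(precision))          # 0.9610
print("recall (fraud):    {:.4f}".format(recall))             # 0.7327
print("f1 (fraud):        {:.4f}".format(f1))                 # 0.8315
print("all-legit accuracy: {:.4f}".format(baseline_accuracy)) # 0.9982
```

Accuracy barely moves between a useless model and this one, which is why recall on the fraud class is the number to watch.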



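Well, shit... accuracy rounds to 1.00, but the model only catches 74 of the 101 fraud cases in the test set (recall 0.73).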

### Notes

I suspect the tuning was slow mostly because of how much work the search does. With `cv=5` and a classifier, RandomizedSearchCV runs **(Stratified) K-Fold CV** for every sampled candidate, so `n_iter=50` means 50 × 5 = 250 forest fits plus one final refit, and `n_jobs` was left at its default, which normally means one job at a time. At first the K-Fold step seemed unnecessary to me, since RandomForest already applies bootstrapping (each decision tree is trained on a random sample of the training rows, drawn with replacement). But bootstrapping happens inside a single model, while K-Fold is what scores each candidate on data it was not trained on, so the search can't really do without it.

...and given that we're using a class-imbalanced dataset, the stratified part matters too: Stratified K-Fold keeps roughly the same proportion of legit and fraud transactions in every fold (so I won't end up with a fold that contains only Class `0` / legit transactions).
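
The effect of stratification is easy to see on a toy example. This sketch uses made-up labels (990 legit rows followed by 10 fraud rows, an assumption chosen to make the problem obvious) and counts how many fraud rows land in each validation fold with plain `KFold` versus `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up labels: 990 legit rows followed by 10 fraud rows (assumption).
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features don't matter for the split itself


def positives_per_fold(splitter):
    """Count fraud labels that land in each validation fold."""
    return [
        int(y_demo[val_idx].sum())
        for _, val_idx in splitter.split(X_demo, y_demo)
    ]


plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)        # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified)   # [2, 2, 2, 2, 2]
```

With unshuffled `KFold`, four of the five validation folds contain no fraud at all, so any recall computed on them would be meaningless.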

### END