# Dataset context
- Dataset Context: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
- This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
- It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, … V28 are the principal components obtained with PCA.
- The only features that have not been transformed with PCA are 'Time' and 'Amount'.
- Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.
- Feature 'Amount' is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.
- Models used: decision trees, deep learning, random forest.
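
To make the imbalance concrete, the short sketch below recomputes the fraud rate from the counts quoted in the list above and shows the accuracy that a trivial classifier always predicting "not fraud" would reach. It is only an arithmetic illustration, and the variable names are illustrative.

```python
# Counts quoted in the dataset description above.
n_fraud = 492
n_total = 284_807

fraud_rate = n_fraud / n_total
# A classifier that always predicts class 0 is right on every non-fraud row.
always_negative_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting 0: {always_negative_accuracy:.4%}")
```

Because that trivial accuracy is already above 99.8%, accuracy alone says little here; precision, recall, and F1 on the fraud class are more informative.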


# EDA

In [1]:
import pandas as pd

# Load the dataset
df_original = pd.read_csv("./data/creditcard.csv")
df_original.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [2]:
df_original.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [3]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df_original, title="Profiling Report")

In [None]:
profile.to_file("eda_report.html")

From the profiling report:
- Dataset has 773 (0.3%) duplicate rows
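- `Class` is highly correlated with 7 other attributes: V10, V11, V12, V14, V16, V17, V18.
- `Class` is highly imbalanced. The 98.2% figure is the report's imbalance score, not the share of class 1: only about 0.17% of rows have value 1 (fraud), as the `Class` mean of 0.001727 above shows.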
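
The three findings above can also be checked directly in pandas, without the profiling report. The sketch below runs on a small synthetic frame; the toy data, the two-feature subset, and the choice of Spearman correlation are assumptions for illustration. The same calls should apply to `df_original`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_original (assumption: same 'Class' target column).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "V14": rng.normal(size=200),
    "V17": rng.normal(size=200),
    "Class": (rng.random(200) < 0.05).astype(int),
})
toy = pd.concat([toy, toy.iloc[:3]], ignore_index=True)  # inject 3 duplicate rows

# 1. Rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# 2. Absolute correlation of each feature with the target, strongest first.
#    Spearman is an assumption; the report may use a different measure.
corr_with_class = (
    toy.corr(method="spearman")["Class"].drop("Class").abs().sort_values(ascending=False)
)

# 3. Share of each class.
class_share = toy["Class"].value_counts(normalize=True)

print("duplicate rows:", n_duplicates)
print(corr_with_class)
print(class_share)
```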

In [5]:
# Clean duplicated rows
df_cleaned = df_original.drop_duplicates()
df_cleaned.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,...,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0,283726.0
mean,94811.0776,0.005917,-0.004135,0.001613,-0.002966,0.001828,-0.001139,0.001801,-0.000854,-0.001596,...,-0.000371,-1.5e-05,0.000198,0.000214,-0.000232,0.000149,0.001763,0.000547,88.472687,0.001667
std,47481.047891,1.948026,1.646703,1.508682,1.414184,1.377008,1.331931,1.227664,1.179054,1.095492,...,0.723909,0.72455,0.623702,0.605627,0.52122,0.482053,0.395744,0.328027,250.399437,0.040796
min,0.0,-56.40751,-72.715728,-48.325589,-5.683171,-113.743307,-26.160506,-43.557242,-73.216718,-13.434066,...,-34.830382,-10.933144,-44.807735,-2.836627,-10.295397,-2.604551,-22.565679,-15.430084,0.0,0.0
25%,54204.75,-0.915951,-0.600321,-0.889682,-0.850134,-0.68983,-0.769031,-0.552509,-0.208828,-0.644221,...,-0.228305,-0.5427,-0.161703,-0.354453,-0.317485,-0.326763,-0.070641,-0.052818,5.6,0.0
50%,84692.5,0.020384,0.063949,0.179963,-0.022248,-0.053468,-0.275168,0.040859,0.021898,-0.052596,...,-0.029441,0.006675,-0.011159,0.041016,0.016278,-0.052172,0.001479,0.011288,22.0,0.0
75%,139298.0,1.316068,0.800283,1.02696,0.739647,0.612218,0.396792,0.570474,0.325704,0.595977,...,0.186194,0.528245,0.147748,0.439738,0.350667,0.240261,0.091208,0.078276,77.51,0.0
max,172792.0,2.45493,22.057729,9.382558,16.875344,34.801666,73.301626,120.589494,20.007208,15.594995,...,27.202839,10.50309,22.528412,4.584549,7.519589,3.517346,31.612198,33.847808,25691.16,1.0


# Baseline model

In [34]:
# Simple baseline model with decision tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Split the data into features and target variable
X = df_cleaned[["V10", "V11", "V12", "V14", "V16", "V17", "V18"]]
y = df_cleaned["Class"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
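
With so few fraud rows, an unstratified split can leave the test set with a different fraud share than the training set. A hedged sketch of a stratified variant follows, using the `stratify` argument of `train_test_split` on made-up labels. This is an alternative to, not a description of, the split used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 20 positives out of 10,000 rows.
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = np.arange(10_000).reshape(-1, 1)

# stratify=y_demo keeps the positive share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positives in train:", int(y_tr.sum()))
print("positives in test: ", int(y_te.sum()))
```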

In [None]:
# Create a Decision Tree classifier
clf = DecisionTreeClassifier()

print("Start Training...")
# Train the classifier
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

In [24]:
# randomly undersampling the majority class to balance the dataset and quickly test the model
from sklearn.utils import resample

df_undersampled = resample(df_cleaned[df_cleaned.Class == 0], replace=False, n_samples=492, random_state=123)
df_undersampled = pd.concat([df_undersampled, df_cleaned[df_cleaned.Class == 1]])
df_undersampled.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,965.0,965.0,965.0,965.0,965.0,965.0,965.0,965.0,965.0,965.0,...,965.0,965.0,965.0,965.0,965.0,965.0,965.0,965.0,965.0,965.0
mean,86560.905699,-2.200307,1.651245,-3.29894,2.148513,-1.525798,-0.730773,-2.540876,0.433975,-1.209436,...,0.215339,0.049894,-0.032659,-0.051498,0.041935,0.035321,0.11695,0.036297,112.076218,0.490155
std,48121.41655,5.295259,3.523269,5.974497,3.181847,4.050289,1.647392,5.5196,4.032161,2.311864,...,2.003625,0.976913,1.167593,0.564974,0.670954,0.488635,0.904484,0.40449,254.650928,0.500162
min,406.0,-30.55238,-10.979679,-31.103685,-3.538179,-22.105532,-6.406267,-43.557242,-41.044261,-13.434066,...,-22.797604,-8.887017,-19.254328,-2.6597,-4.781606,-1.189179,-7.263482,-1.86929,0.0,0.0
25%,45501.0,-2.785102,-0.230857,-4.865184,-0.276189,-1.69078,-1.532462,-2.918287,-0.235083,-2.294535,...,-0.164481,-0.516242,-0.247452,-0.380576,-0.289125,-0.293871,-0.069161,-0.060246,1.75,0.0
50%,79668.0,-0.769172,0.951569,-1.22281,1.212136,-0.503492,-0.644602,-0.650245,0.13819,-0.695263,...,0.142675,0.038631,-0.03656,0.007684,0.058504,-0.018853,0.047434,0.036574,18.65,0.0
75%,131948.0,1.084043,2.664274,0.358117,4.090177,0.425877,-0.018103,0.268049,0.871446,0.256297,...,0.641211,0.576358,0.193731,0.382801,0.408814,0.35647,0.451243,0.208858,99.99,1.0
max,172288.0,2.366845,22.057729,3.236927,12.114672,11.095089,9.146401,12.699095,20.007208,4.87455,...,27.202839,8.361985,8.225555,1.196869,2.395505,2.745261,3.052358,1.779364,3684.62,1.0
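
Two details of the undersampling above are worth noting. The summary shows 965 rows rather than 984, so `df_cleaned` holds 473 fraud rows after deduplication (965 − 492 = 473) and the two classes are close to balanced but not exactly equal. Also, because the sample is drawn from all of `df_cleaned`, some of its rows can also appear in the full-data `X_test` used later. The sketch below shows one alternative, assuming the data is split first and only the training part is undersampled; `undersample` is a hypothetical helper and the frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def undersample(df, target="Class", random_state=123):
    """Downsample the majority class (0) to the size of the minority class (1)."""
    minority = df[df[target] == 1]
    majority = resample(
        df[df[target] == 0],
        replace=False,
        n_samples=len(minority),  # match the actual minority count
        random_state=random_state,
    )
    # Shuffle so the two classes are interleaved.
    return pd.concat([majority, minority]).sample(frac=1, random_state=random_state)


# Synthetic stand-in for df_cleaned (assumption, for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"V14": rng.normal(size=5_000), "Class": 0})
demo.loc[:49, "Class"] = 1  # 50 positives

# Split first, then undersample only the training part.
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Class"]
)
train_under = undersample(train_df)

print(train_under["Class"].value_counts())
print("rows shared with test set:", len(train_under.index.intersection(test_df.index)))
```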


In [43]:
# Select only the features V10, V11, V12, V14, V16, V17, V18 to predict Class
X_under = df_undersampled[["V10", "V11", "V12", "V14", "V16", "V17", "V18"]]
y_under = df_undersampled["Class"]

# Split the data into training and testing sets
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under, test_size=0.2, random_state=42)

# Create a Decision Tree classifier
clf = DecisionTreeClassifier()

print("Start Training...")
# Train the classifier
clf.fit(X_train_under, y_train_under)
# Make predictions
y_pred = clf.predict(X_test_under)

# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test_under, y_pred))
print("Classification Report:\n", classification_report(y_test_under, y_pred))

Start Training...
Confusion Matrix:
 [[83 12]
 [ 9 89]]
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.87      0.89        95
           1       0.88      0.91      0.89        98

    accuracy                           0.89       193
   macro avg       0.89      0.89      0.89       193
weighted avg       0.89      0.89      0.89       193
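
The figures in the report follow directly from the confusion matrix. `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so the matrix above reads as TN = 83, FP = 12, FN = 9, TP = 89. A small sketch that recomputes the fraud-class metrics by hand (variable names are illustrative):

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 83, 12
fn, tp = 9, 89

precision = tp / (tp + fp)  # of the rows flagged as fraud, how many were fraud
recall = tp / (tp + fn)     # of the actual fraud rows, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```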



# Undersampled Training and Cross-Validation with Decision Tree, Random Forest, and PyTorch Neural Network

In [44]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


classifiers = {
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
}

In [45]:
from sklearn.model_selection import cross_val_score
import time


for key, classifier in classifiers.items():
    classifier.fit(X_train_under, y_train_under)
    training_score = cross_val_score(classifier, X_train_under, y_train_under, cv=5)
    print(
        "Classifiers: ",
        classifier.__class__.__name__,
        "Has a training score of",
        round(training_score.mean(), 2) * 100,
        "% accuracy score",
    )

Classifiers:  DecisionTreeClassifier Has a training score of 89.0 % accuracy score
Classifiers:  RandomForestClassifier Has a training score of 93.0 % accuracy score


In [46]:
# Use GridSearchCV to find the best parameters.
from sklearn.model_selection import GridSearchCV


# DecisionTree Classifier
tree_params = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(2, 4, 1)),
    "min_samples_leaf": list(range(5, 7, 1)),
}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train_under, y_train_under)

# tree best estimator
tree_clf = grid_tree.best_estimator_

# RandomForest Classifier
forest_params = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(2, 4, 1)),
    "min_samples_leaf": list(range(5, 7, 1)),
}
grid_forest = GridSearchCV(RandomForestClassifier(), forest_params)
grid_forest.fit(X_train_under, y_train_under)

# forest best estimator
forest_clf = grid_forest.best_estimator_
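
Note that `range(2, 4, 1)` yields only 2 and 3, and `range(5, 7, 1)` only 5 and 6, so each search above covers a small grid. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the same grid and count its candidates; the name `params` is illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as tree_params / forest_params above.
params = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(2, 4, 1)),         # -> [2, 3]
    "min_samples_leaf": list(range(5, 7, 1)),  # -> [5, 6]
}

grid = ParameterGrid(params)
print(len(grid), "candidate settings")
for candidate in grid:
    print(candidate)
```

With the default 5-fold cross-validation, `GridSearchCV` fits each candidate five times. It also ranks candidates by the estimator's default score (accuracy for these classifiers) unless a `scoring` argument such as `"recall"` or `"roc_auc"` is passed.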

In [47]:

from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

# Get out-of-fold predictions for each tuned classifier using 5-fold cross-validation.
tree_pred = cross_val_predict(tree_clf, X_train_under, y_train_under, cv=5)
forest_pred = cross_val_predict(forest_clf, X_train_under, y_train_under, cv=5)

In [48]:
from sklearn.metrics import roc_auc_score


print("Decision Tree Classifier ROC AUC validation score: ", roc_auc_score(y_train_under, tree_pred))
print("Random Forest Classifier ROC AUC validation Score: ", roc_auc_score(y_train_under, forest_pred))

Decision Tree Classifier ROC AUC validation score:  0.913329974811083
Random Forest Classifier ROC AUC validation Score:  0.9193316540722082
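
The scores above are computed from hard 0/1 predictions, so each ROC curve has a single operating point. `roc_auc_score` also accepts continuous scores, which rank the samples instead. A hedged sketch follows, using `cross_val_predict(..., method="predict_proba")` on synthetic data from `make_classification`; the data and the hyperparameters are assumptions, not the tuned models above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic balanced data standing in for X_train_under / y_train_under.
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=0)

clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)

# Hard labels: a single operating point on the ROC curve.
labels = cross_val_predict(clf, X_demo, y_demo, cv=5)
# Class-1 probabilities: a full ranking of the samples.
proba = cross_val_predict(clf, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]

auc_from_labels = roc_auc_score(y_demo, labels)
auc_from_proba = roc_auc_score(y_demo, proba)
print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```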


In [49]:
# Test with undersampled data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix


tree_pred_test = tree_clf.predict(X_test_under)
forest_pred_test = forest_clf.predict(X_test_under)

print("Confusion Matrix for Decision Tree:\n", confusion_matrix(y_test_under, tree_pred_test))
print("Classification Report for Decision Tree:\n", classification_report(y_test_under, tree_pred_test))
print("Confusion Matrix for Random Forest:\n", confusion_matrix(y_test_under, forest_pred_test))
print("Classification Report for Random Forest:\n", classification_report(y_test_under, forest_pred_test))

Confusion Matrix for Decision Tree:
 [[93  2]
 [11 87]]
Classification Report for Decision Tree:
               precision    recall  f1-score   support

           0       0.89      0.98      0.93        95
           1       0.98      0.89      0.93        98

    accuracy                           0.93       193
   macro avg       0.94      0.93      0.93       193
weighted avg       0.94      0.93      0.93       193

Confusion Matrix for Random Forest:
 [[94  1]
 [11 87]]
Classification Report for Random Forest:
               precision    recall  f1-score   support

           0       0.90      0.99      0.94        95
           1       0.99      0.89      0.94        98

    accuracy                           0.94       193
   macro avg       0.94      0.94      0.94       193
weighted avg       0.94      0.94      0.94       193



In [50]:
# Test on the held-out split of the full (deduplicated) data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix


tree_pred_test = tree_clf.predict(X_test)
forest_pred_test = forest_clf.predict(X_test)

print("Confusion Matrix for Decision Tree:\n", confusion_matrix(y_test, tree_pred_test))
print("Classification Report for Decision Tree:\n", classification_report(y_test, tree_pred_test))
print("Confusion Matrix for Random Forest:\n", confusion_matrix(y_test, forest_pred_test))
print("Classification Report for Random Forest:\n", classification_report(y_test, forest_pred_test))

Confusion Matrix for Decision Tree:
 [[54826  1830]
 [   13    77]]
Classification Report for Decision Tree:
               precision    recall  f1-score   support

           0       1.00      0.97      0.98     56656
           1       0.04      0.86      0.08        90

    accuracy                           0.97     56746
   macro avg       0.52      0.91      0.53     56746
weighted avg       1.00      0.97      0.98     56746

Confusion Matrix for Random Forest:
 [[55617  1039]
 [   14    76]]
Classification Report for Random Forest:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99     56656
           1       0.07      0.84      0.13        90

    accuracy                           0.98     56746
   macro avg       0.53      0.91      0.56     56746
weighted avg       1.00      0.98      0.99     56746
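
The fall in fraud precision, from about 0.98 on the balanced test set to 0.04 here, follows from the class ratio. The decision tree's false-positive rate stays small, but it now applies to 56,656 legitimate rows, so the false alarms far outnumber the 90 real fraud cases. A short arithmetic sketch using the decision tree matrix printed above (variable names are illustrative):

```python
# Decision tree on the full test set (confusion matrix printed above).
tn, fp = 54826, 1830
fn, tp = 13, 77

false_positive_rate = fp / (fp + tn)  # share of legitimate rows flagged as fraud
recall = tp / (tp + fn)               # share of fraud rows that were caught
precision = tp / (tp + fp)            # share of flagged rows that are fraud

print(f"FPR={false_positive_rate:.4f} recall={recall:.4f} precision={precision:.4f}")
```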



In [51]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size, dropout_prob):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size1)
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.fc3 = nn.Linear(hidden_size2, output_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_prob)
        self.batchnorm1 = nn.BatchNorm1d(hidden_size1)
        self.batchnorm2 = nn.BatchNorm1d(hidden_size2)

    def forward(self, x):
        x = self.fc1(x)
        x = self.batchnorm1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.batchnorm2(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc3(x)
        return x


# Split the data into features and target variable
X = df_cleaned[["V10", "V11", "V12", "V14", "V16", "V17", "V18"]]
y = df_cleaned["Class"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
n_inputs = X_train.shape[1]


X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_val_tensor = torch.tensor(X_val.values, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.long)


hidden_size1 = 64
hidden_size2 = 32
output_size = 2
dropout_prob = 0.3
learning_rate = 0.001




# Create the model, loss function, and optimizer
model = Model(n_inputs, hidden_size1, hidden_size2, output_size, dropout_prob)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


# Training loop
epochs = 500
patience = 10  # Number of epochs without improvement before early stopping
best_val_accuracy = 0.0
best_val_f1 = 0.0
counter = 0
early_stop_threshold = 0.001

for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()

    # Validation
    with torch.no_grad():
        val_outputs = model(X_val_tensor)
        val_loss = criterion(val_outputs, y_val_tensor)
        val_accuracy = accuracy_score(y_val_tensor, val_outputs.argmax(dim=1).numpy())
        val_f1 = f1_score(y_val_tensor, val_outputs.argmax(dim=1).numpy())

    print(
        f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}, Validation Loss: {val_loss.item():.4f}, Validation Accuracy: {val_accuracy:.4f}, Validation F1: {val_f1:.4f}"
    )

    if val_f1 - best_val_f1 > early_stop_threshold:
        best_val_f1 = val_f1
        counter = 0  # Reset counter
    else:
        counter += 1

    if counter >= patience:
        print("Early stopping. No improvement in validation F1 for {} epochs.".format(patience))
        break  # Early stopping

Epoch [1/500], Loss: 0.7228, Validation Loss: 0.7038, Validation Accuracy: 0.5337, Validation F1: 0.0021
Epoch [2/500], Loss: 0.7047, Validation Loss: 0.6869, Validation Accuracy: 0.5644, Validation F1: 0.0023
Epoch [3/500], Loss: 0.6850, Validation Loss: 0.6665, Validation Accuracy: 0.6060, Validation F1: 0.0040
Epoch [4/500], Loss: 0.6680, Validation Loss: 0.6508, Validation Accuracy: 0.6392, Validation F1: 0.0027
Epoch [5/500], Loss: 0.6503, Validation Loss: 0.6329, Validation Accuracy: 0.6777, Validation F1: 0.0042
Epoch [6/500], Loss: 0.6341, Validation Loss: 0.6179, Validation Accuracy: 0.7072, Validation F1: 0.0045
Epoch [7/500], Loss: 0.6171, Validation Loss: 0.6005, Validation Accuracy: 0.7436, Validation F1: 0.0056
Epoch [8/500], Loss: 0.6018, Validation Loss: 0.5870, Validation Accuracy: 0.7743, Validation F1: 0.0064
Epoch [9/500], Loss: 0.5863, Validation Loss: 0.5699, Validation Accuracy: 0.8018, Validation F1: 0.0086
Epoch [10/500], Loss: 0.5718, Validation Loss: 0.5588, 

In [52]:
import torch
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
)

# Convert the test data from pandas objects to PyTorch tensors (via NumPy arrays)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

# Pass the test data through the trained model to obtain predictions
with torch.no_grad():
    test_outputs = model(X_test_tensor)

# Calculate the predicted labels
predicted_labels = test_outputs.argmax(dim=1).numpy()

# Calculate the accuracy on the test data
test_accuracy = accuracy_score(y_test_tensor, predicted_labels)
print(f"Test Accuracy: {test_accuracy:.4f}")

# Calculate precision, recall, and F1 score
precision = precision_score(y_test_tensor, predicted_labels)
recall = recall_score(y_test_tensor, predicted_labels)
f1 = f1_score(y_test_tensor, predicted_labels)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Generate a classification report
report = classification_report(y_test_tensor, predicted_labels)
print("Classification Report:")
print(report)

# Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test_tensor, predicted_labels)
print("Confusion Matrix:")
print(conf_matrix)

Test Accuracy: 0.9989
Precision: 0.7885
Recall: 0.4556
F1 Score: 0.5775
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56656
           1       0.79      0.46      0.58        90

    accuracy                           1.00     56746
   macro avg       0.89      0.73      0.79     56746
weighted avg       1.00      1.00      1.00     56746

Confusion Matrix:
[[56645    11]
 [   49    41]]
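
The model contains `nn.Dropout` and `nn.BatchNorm1d`, which behave differently in training and evaluation mode. The cells above never call `model.eval()`, so the validation and test passes run with dropout active and with batch statistics, and the reported numbers reflect that. Below is a minimal sketch of the difference, using a small stand-in network on random input; `TinyNet` is hypothetical and is not the model trained above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyNet(nn.Module):
    """Hypothetical stand-in with the same layer types as the model above."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(7, 16)
        self.bn = nn.BatchNorm1d(16)
        self.drop = nn.Dropout(0.3)
        self.fc2 = nn.Linear(16, 2)

    def forward(self, x):
        return self.fc2(self.drop(torch.relu(self.bn(self.fc1(x)))))


net = TinyNet()
x = torch.randn(32, 7)

# Training mode: dropout masks are resampled, so two passes differ.
net.train()
with torch.no_grad():
    train_a, train_b = net(x), net(x)

# Evaluation mode: dropout is off and BatchNorm uses its running statistics,
# so repeated passes give the same output.
net.eval()
with torch.no_grad():
    eval_a, eval_b = net(x), net(x)

print("train-mode passes identical:", torch.equal(train_a, train_b))
print("eval-mode passes identical: ", torch.equal(eval_a, eval_b))
```

In the training loop this would mean calling `model.train()` at the start of each epoch and `model.eval()` before the validation and test passes. Doing so would change the metrics printed above.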
