# Credit Card Fraud Detection

Dataset source (Kaggle):
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download



### Import packages

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest #iForest
from sklearn.neighbors import LocalOutlierFactor #LOF
from sklearn.covariance import EllipticEnvelope #Robust Covariance
from sklearn.svm import OneClassSVM #OCSVM
from sklearn.metrics import classification_report, accuracy_score

import warnings
warnings.filterwarnings('ignore')

### EDA

Explore the data and report any interesting facts about this well-known dataset.

In [2]:
# Load the dataset (adjust the path as necessary)
data = pd.read_csv('E:/Nexus/Nexus_Assignments/Assignments/a14/creditcard.csv')

# EDA
print(data.shape)
print(data.info())
print(data.columns)


(284807, 31)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 2
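One fact worth surfacing during EDA is how rare the positive class is. The classification reports further down list 394 fraud cases in the training split and 98 in the test split, which gives 492 frauds among the 284,807 rows. The sketch below checks that ratio; `class_balance` is a hypothetical helper, and the label list is rebuilt from those counts rather than read from the CSV.

```python
from collections import Counter

def class_balance(labels):
    """Return (row count per class, fraction of rows labelled 1)."""
    counts = Counter(int(v) for v in labels)
    total = sum(counts.values())
    return dict(counts), counts.get(1, 0) / total

# Synthetic labels rebuilt from the supports in the classification reports
n_rows = 284807
n_fraud = 394 + 98
labels = [1] * n_fraud + [0] * (n_rows - n_fraud)

counts, fraud_rate = class_balance(labels)
print(counts)
print(f"fraud rate: {fraud_rate:.4%}")
```

On the real data the same helper could be called as `class_balance(data['Class'])`, since it accepts any iterable of 0/1 labels.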

### Preprocessing Phase

In [3]:
# Assuming 'Class' is the column indicating fraud (1) or non-fraud (0)
X = data.drop('Class', axis=1)
y = data['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
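With so few positives, a purely random split can shift the fraud share between the training and test parts. `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both. The sketch below illustrates this on synthetic data; the array shape and the 1% positive rate are assumptions chosen for the example. Using it in the cell above would change which rows land in the test set, and therefore the numbers reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 5))   # synthetic features
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1                        # 1% positives (assumed rate)

# stratify=y_demo keeps the positive share identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```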

### Prepare the models

In [4]:
# Define the models
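models = {
    "Isolation Forest": IsolationForest(n_estimators=100, contamination='auto', random_state=42),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination='auto'),
    # contamination=0.1 makes the model label about 10% of the training rows as outliers
    "Robust Covariance": EllipticEnvelope(support_fraction=1., contamination=0.1),
    # gamma has no effect with kernel='linear'; it only applies to 'rbf', 'poly' and 'sigmoid'
    "One-Class SVM": OneClassSVM(kernel='linear', gamma=0.001, nu=0.05),
}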
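When `contamination` is given as a float, it fixes the share of training points that a detector labels as outliers, so it has a direct effect on precision and recall. The sketch below shows this relationship with `IsolationForest` on synthetic Gaussian data (the data and the two contamination values are assumptions for illustration, not settings tuned for the credit card dataset).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5_000, 4))    # synthetic data, no real outliers

flagged_share = {}
for contamination in (0.1, 0.01):
    model = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    )
    # fit_predict returns -1 for outliers and 1 for inliers
    flagged_share[contamination] = float((model.fit_predict(X_demo) == -1).mean())

for contamination, share in flagged_share.items():
    print(f"contamination={contamination}: flagged share = {share:.3f}")
```

The flagged share tracks the contamination value even though the data contain no planted outliers, which is why a value far above the true fraud rate tends to produce many false positives.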

### Training the models

In [5]:
# Loop over the models dict and train each model in turn

for name, model in models.items():
    print(f"\n=== {name} ===")
    
    # LOF requires special handling (no .predict() on test set by default)
    if name == "Local Outlier Factor":
        # LOF predicts during training using fit_predict
        y_pred_train = model.fit_predict(X_train)
        # LOF outputs -1 for anomalies, 1 for normal → convert to (0,1)
        y_pred_train = (y_pred_train == -1).astype(int)
        print("LOF does not support test-set prediction unless novelty=True.")
        print(classification_report(y_train, y_pred_train))
        continue

    # For all other models: fit → predict
    model.fit(X_train)
    y_pred = model.predict(X_test)

    # Convert output (-1 = anomaly, 1 = normal) to (1 = anomaly, 0 = normal)
    y_pred = (y_pred == -1).astype(int)

    print(classification_report(y_test, y_pred))



=== Isolation Forest ===
              precision    recall  f1-score   support

           0       1.00      0.96      0.98     56864
           1       0.04      0.83      0.07        98

    accuracy                           0.96     56962
   macro avg       0.52      0.89      0.53     56962
weighted avg       1.00      0.96      0.98     56962


=== Local Outlier Factor ===
LOF does not support test-set prediction unless novelty=True.
              precision    recall  f1-score   support

           0       1.00      0.98      0.99    227451
           1       0.02      0.16      0.03       394

    accuracy                           0.98    227845
   macro avg       0.51      0.57      0.51    227845
weighted avg       1.00      0.98      0.99    227845


=== Robust Covariance ===
              precision    recall  f1-score   support

           0       1.00      0.90      0.95     56864
           1       0.01      0.73      0.02        98

    accuracy                         
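The loop scores LOF on the training split because `fit_predict` is its only prediction route in the default mode. `LocalOutlierFactor` also offers a `novelty=True` mode, in which the model is fitted on one set and `predict` can then be called on unseen data. The sketch below shows that mode on synthetic data (the data are assumptions for illustration; it has not been run on the credit card dataset, so no results are implied).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(500, 2))
X_test_demo = np.vstack([
    rng.normal(size=(50, 2)),   # points that resemble the training data
    np.full((5, 2), 8.0),       # points far away from the training data
])

# novelty=True enables predict() on new data; fit_predict is unavailable in this mode
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train_demo)

# Map -1 (outlier) / 1 (inlier) to 1 (anomaly) / 0 (normal)
pred = (lof.predict(X_test_demo) == -1).astype(int)
print("flags for the far-away points:", pred[-5:])
```

Scoring LOF this way would put it on the same test split as the other three models.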

### Evaluate the models
You can use the following setup as a guideline to evaluate and compare the performance of the models.

In [6]:
# Reshape the prediction values to 0 (normal) and 1 (fraud)
# y_pred[y_pred == 1] = 0
# y_pred[y_pred == -1] = 1

# Calculate accuracy and other metrics
# print(f"{name}:")
# print(classification_report(y_test, y_pred))
# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("-" * 30)
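The commented setup above can be turned into a small reusable function. This is a minimal sketch: `evaluate` is a hypothetical helper, and the toy labels are made up for illustration. It maps the detectors' -1/1 output to 1/0 without modifying the input array and reports the fraud-class metrics next to accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(name, y_true, raw_pred):
    """Score a detector whose predict() returns -1 (outlier) or 1 (inlier)."""
    y_hat = np.where(np.asarray(raw_pred) == -1, 1, 0)  # 1 = fraud, 0 = normal
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat, zero_division=0),
        "recall": recall_score(y_true, y_hat, zero_division=0),
        "f1": f1_score(y_true, y_hat, zero_division=0),
    }

# Toy labels and raw detector outputs (made up for illustration)
y_true_demo = np.array([0, 0, 0, 0, 1, 1])
raw_demo = np.array([1, 1, -1, 1, -1, 1])

row = evaluate("demo", y_true_demo, raw_demo)
print(row)
```

Collecting one such dictionary per model, for example in a list passed to `pd.DataFrame`, gives a side-by-side comparison table.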

### Interpretation
Highlight the best-performing model and explain why it outperforms the others.

Best Model Performance (using ChatGPT): Isolation Forest

From the results, Isolation Forest stands out as the best-performing model based on its precision, recall, and f1-score for the fraud class (label 1). Note that LOF was scored on the training split, so its numbers are not directly comparable with those of the other models. The metrics compare as follows:

Precision (Anomalies = 1):

Isolation Forest: 0.04, which is quite low, but typical in anomaly detection due to the imbalanced nature of the dataset.

LOF: 0.02 (lower than Isolation Forest).

Robust Covariance: 0.01 (even lower).

One-Class SVM: 0.00 (worst precision).

Recall (Anomalies = 1):

Isolation Forest: 0.83 (fairly high for anomaly detection).

LOF: 0.16.

Robust Covariance: 0.73.

One-Class SVM: 0.07 (very low recall).

f1-score (Anomalies = 1):

Isolation Forest: 0.07 (the highest among the models for anomalies).

LOF: 0.03.

Robust Covariance: 0.02.

One-Class SVM: 0.00.

Accuracy:

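Isolation Forest: 0.96 (highest among the models scored on the test set).

LOF: 0.98 (highest value, but measured on the training split via fit_predict, so not directly comparable).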

Robust Covariance: 0.90.

One-Class SVM: 0.95.
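The accuracy values above need a reference point. A short calculation from the test-set supports in the classification reports (56,864 normal and 98 fraud transactions) gives the accuracy of a trivial rule that never flags anything:

```python
# Test-set supports taken from the classification reports above
n_normal, n_fraud = 56864, 98

# A "detector" that labels every transaction as normal gets all normal rows right
baseline_accuracy = n_normal / (n_normal + n_fraud)
print(f"always-normal baseline accuracy: {baseline_accuracy:.4f}")
```

That baseline is about 0.998, which is above every accuracy listed here, so accuracy alone cannot rank these detectors. Precision, recall, and f1-score for the fraud class are the more informative numbers.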

Explanation of Outperformance:

Isolation Forest has the highest recall for anomalies (0.83): it catches 83% of the fraud cases in the test set, more than any other model. Its precision is low (0.04), so roughly 96% of the transactions it flags are false alarms. The high accuracy (0.96) mostly reflects correct predictions on the majority class, which is typical for imbalanced anomaly detection and says little about how well fraud is detected.

LOF shows the highest accuracy (0.98), but its recall (0.16) and precision (0.02) for anomalies are both low: it misses most fraud cases, and most of the transactions it flags are false positives. Its f1-score (0.03) is less than half that of Isolation Forest (0.07). Because LOF was evaluated on the training split, this comparison is only approximate.

Robust Covariance reaches a fairly high recall (0.73) but a precision of only 0.01, which is consistent with its contamination=0.1 setting flagging about 10% of all transactions. One-Class SVM performs worst, with a recall of 0.07 and a precision and f1-score of 0.00. Both have lower accuracy than Isolation Forest (0.90 and 0.95 versus 0.96).