# Fraud Detection - Modeling (Refactored)

This notebook takes a modular approach to fraud-detection modeling. The `ModelTrainer` class handles both training and evaluation, so the fraud and credit card datasets go through the same steps and their results are directly comparable.

In [1]:
import sys
import os
import pandas as pd
import numpy as np
sys.path.append(os.path.abspath('../'))

from sklearn.model_selection import train_test_split
from scripts.imbalance_handler import ImbalanceHandler
from scripts.data_clean import DataCleaner
from scripts.modeling_utils import ModelTrainer



## 1. Data Preparation & SMOTE

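We load the processed fraud data and the raw credit card data, make a stratified 80/20 train/test split of each, and apply SMOTE to the training split only. The test sets keep their original class balance.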

In [2]:
cleaner = DataCleaner()
handler = ImbalanceHandler()
trainer = ModelTrainer()

# --- Fraud Data ---
fraud_df = pd.read_csv("../data/processed/processed_data.csv")
fraud_df_ml = cleaner.prepare_for_modeling(fraud_df, target_col='class')
X_f = fraud_df_ml.drop('class', axis=1)
y_f = fraud_df_ml['class']
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_f, y_f, test_size=0.2, random_state=42, stratify=y_f)
X_train_f_s, y_train_f_s = handler.resample_smote(X_train_f, y_train_f)

# --- Credit Card Data ---
cc_df = pd.read_csv("../data/raw/creditcard.csv")
X_cc = cc_df.drop('Class', axis=1)
y_cc = cc_df['Class']
X_train_cc, X_test_cc, y_train_cc, y_test_cc = train_test_split(X_cc, y_cc, test_size=0.2, random_state=42, stratify=y_cc)
X_train_cc_s, y_train_cc_s = handler.resample_smote(X_train_cc, y_train_cc)

print("Data Preparation Complete.")

Original shape: (120889, 196)
Resampled shape: (219136, 196)
Original shape: (227845, 30)
Resampled shape: (454902, 30)
Data Preparation Complete.
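
To make the resampling step concrete, here is a minimal NumPy sketch of the SMOTE idea: each synthetic row is an interpolation between a minority sample and one of its nearest minority neighbours. `smote_sketch` is a hypothetical helper written for illustration only. It is not the code behind `ImbalanceHandler.resample_smote`, whose implementation is not shown in this notebook, and it assumes a binary target with at least two minority samples and a 1:1 target ratio. The resampled row counts above (219136 and 454902) are both even, which is consistent with 1:1 balancing, though the handler's actual settings are not shown.

```python
import numpy as np


def smote_sketch(X, y, minority_label=1, k=5, seed=42):
    """Oversample the minority class to 1:1 by interpolating between
    minority samples and their nearest minority neighbours (illustrative)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")

    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y  # already balanced

    # Brute-force pairwise squared distances among minority samples.
    # Fine for a demo; memory grows quadratically with the minority count.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base point, one of its k neighbours, and a random gap in [0, 1).
    base = rng.integers(0, len(X_min), size=n_needed)
    picked = neighbours[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])

    X_res = np.vstack([X, synthetic])
    y_res = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_res, y_res


# Synthetic stand-in data: 90 majority rows, 10 minority rows.
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(100, 3))
y_demo = np.array([0] * 90 + [1] * 10)

X_res, y_res = smote_sketch(X_demo, y_demo)
print(X_res.shape, np.bincount(y_res))  # (180, 3) [90 90]
```

Because every synthetic row is a convex combination of two real minority rows, resampling the test set would leak interpolated information into evaluation, which is why only the training split is resampled.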


## 2. Baseline Model: Logistic Regression

In [3]:
lr_f = trainer.train_logistic_regression(X_train_f_s, y_train_f_s)
res_lr_f = trainer.evaluate_model(lr_f, X_test_f, y_test_f, "LR Fraud")

lr_cc = trainer.train_logistic_regression(X_train_cc_s, y_train_cc_s)
res_lr_cc = trainer.evaluate_model(lr_cc, X_test_cc, y_test_cc, "LR Credit Card")

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
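
The warning suggests two remedies: more iterations and feature scaling. The message reports `max_iter=1000`, which appears to be the limit that was already hit, so scaling is likely the more useful of the two. The sketch below shows one way to combine both with scikit-learn's `StandardScaler` in a pipeline. It runs on synthetic stand-in data, and since the internals of `ModelTrainer.train_logistic_regression` are not shown in this notebook, it is an illustration under those assumptions rather than a patch.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the notebook's datasets).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_demo[:, 0] *= 10_000  # put one feature on a much larger scale

# Keeping the scaler inside the pipeline means it is fitted on whatever
# data the pipeline is fitted on, and reused unchanged at predict time.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed.
    warnings.simplefilter("error", ConvergenceWarning)
    scaled_lr.fit(X_demo, y_demo)

n_iter = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
print(f"lbfgs iterations used: {n_iter}")
```

If SMOTE is kept, the scaler would need to be fitted on the training split as well, not on the full dataset.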



--- LR Fraud Evaluation ---
AUC-PR: 0.6002
F1-Score: 0.6080

Confusion Matrix:
[[26706   687]
 [ 1294  1536]]

--- LR Credit Card Evaluation ---
AUC-PR: 0.7825
F1-Score: 0.2083

Confusion Matrix:
[[56205   659]
 [   10    88]]
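
The gap between the two F1 scores is easier to read once precision and recall are derived from the confusion matrices. The sketch below does that arithmetic, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]`. That assumption is supported by the fact that the resulting F1 values match the printed ones. `scores_from_confusion` is a hypothetical helper, not part of `ModelTrainer`.

```python
def scores_from_confusion(tn, fp, fn, tp):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the confusion matrices printed above.
lr_fraud_scores = scores_from_confusion(tn=26706, fp=687, fn=1294, tp=1536)
lr_cc_scores = scores_from_confusion(tn=56205, fp=659, fn=10, tp=88)

for name, (p, r, f1) in [("LR Fraud", lr_fraud_scores),
                         ("LR Credit Card", lr_cc_scores)]:
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

On the credit card data the model catches 88 of 98 fraud cases (recall about 0.90) but raises 659 false alarms, so precision is about 0.12 and F1 drops to 0.2083. On the fraud data precision is about 0.69 and recall about 0.54.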




## 3. Ensemble Model: LightGBM

In [4]:
lgb_f = trainer.train_lightgbm(X_train_f_s, y_train_f_s)
res_lgb_f = trainer.evaluate_model(lgb_f, X_test_f, y_test_f, "LGBM Fraud")

lgb_cc = trainer.train_lightgbm(X_train_cc_s, y_train_cc_s)
res_lgb_cc = trainer.evaluate_model(lgb_cc, X_test_cc, y_test_cc, "LGBM Credit Card")


--- LGBM Fraud Evaluation ---
AUC-PR: 0.6682
F1-Score: 0.6074

Confusion Matrix:
[[26399   994]
 [ 1162  1668]]

--- LGBM Credit Card Evaluation ---
AUC-PR: 0.7823
F1-Score: 0.6357

Confusion Matrix:
[[56786    78]
 [   16    82]]


## 4. Cross-Validation

In [5]:
# X_f/y_f and X_cc/y_cc are the full datasets, before the train/test split and SMOTE
trainer.perform_cross_validation(X_f, y_f, lgb_f, "LGBM Fraud")
trainer.perform_cross_validation(X_cc, y_cc, lgb_cc, "LGBM Credit Card")


--- 5-Fold CV Results for LGBM Fraud ---
F1: 0.6982 (+/- 0.0041)
AUC-PR: 0.7180 (+/- 0.0067)

--- 5-Fold CV Results for LGBM Credit Card ---
F1: 0.5535 (+/- 0.1570)
AUC-PR: 0.5511 (+/- 0.1416)


{'fit_time': array([7.94089913, 7.9468863 , 7.58429241, 7.50310397, 9.87208033]),
 'score_time': array([2.226331  , 1.55785227, 1.48911047, 1.09989929, 1.27319121]),
 'test_f1': array([0.46101695, 0.60909091, 0.51301115, 0.36103152, 0.82352941]),
 'test_auc_pr': array([0.44305804, 0.64254258, 0.48210284, 0.40437722, 0.78345813])}
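
The printed summary for LGBM Credit Card can be reproduced from the per-fold arrays in the dictionary above. The `+/-` figure matches one population standard deviation (the equivalent of NumPy's default `ddof=0`). This is inferred from the numbers, since the code of `perform_cross_validation` is not shown here.

```python
from statistics import fmean, pstdev

# Per-fold scores copied from the dictionary printed above.
test_f1 = [0.46101695, 0.60909091, 0.51301115, 0.36103152, 0.82352941]
test_auc_pr = [0.44305804, 0.64254258, 0.48210284, 0.40437722, 0.78345813]

cv_summary = {
    name: (fmean(scores), pstdev(scores))  # mean, population std
    for name, scores in [("F1", test_f1), ("AUC-PR", test_auc_pr)]
}

for name, (mean, std) in cv_summary.items():
    print(f"{name}: {mean:.4f} (+/- {std:.4f})")
# F1: 0.5535 (+/- 0.1570)
# AUC-PR: 0.5511 (+/- 0.1416)
```

Fold F1 scores range from 0.36 to 0.82, so the credit card estimate is far less stable than the fraud one, where the spread is 0.0041.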

## 5. Model Comparison

In [6]:
results = [res_lr_f, res_lr_cc, res_lgb_f, res_lgb_cc]
comparison_df = trainer.compare_models(results)
comparison_df

| | Model | AUC-PR | F1-Score |
|---|---|---|---|
| 1 | LR Credit Card | 0.782549 | 0.208284 |
| 3 | LGBM Credit Card | 0.782299 | 0.635659 |
| 2 | LGBM Fraud | 0.668179 | 0.607429 |
| 0 | LR Fraud | 0.600223 | 0.607956 |
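
The row order in the comparison output above (index 1, 3, 2, 0) is consistent with sorting the results by AUC-PR in descending order. The sketch below reproduces that ordering with pandas. The dictionary keys are taken from the column names in the output, but the actual structure returned by `evaluate_model` and the logic inside `compare_models` are assumptions.

```python
import pandas as pd

# Values copied from the comparison output, in the order of the `results` list.
results_demo = [
    {"Model": "LR Fraud", "AUC-PR": 0.600223, "F1-Score": 0.607956},
    {"Model": "LR Credit Card", "AUC-PR": 0.782549, "F1-Score": 0.208284},
    {"Model": "LGBM Fraud", "AUC-PR": 0.668179, "F1-Score": 0.607429},
    {"Model": "LGBM Credit Card", "AUC-PR": 0.782299, "F1-Score": 0.635659},
]

# Sort by AUC-PR, best first; the original list positions survive as the index.
comparison_demo = pd.DataFrame(results_demo).sort_values("AUC-PR", ascending=False)
print(comparison_demo.index.tolist())  # [1, 3, 2, 0]
```

Ranking on AUC-PR alone puts LR Credit Card first despite its F1 of 0.21, so it is worth reading both columns together.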


## 6. Save Models

Finally, we save the LightGBM models, along with the Logistic Regression baselines, for future use or deployment.

In [7]:
# Save LGBM models: higher AUC-PR on the fraud data; near-identical AUC-PR and much higher F1 on the credit card data
trainer.save_model(lgb_f, "../models/lgbm_fraud_model.joblib")
trainer.save_model(lgb_cc, "../models/lgbm_creditcard_model.joblib")

# Optional: Save Logistic Regression models as baselines
trainer.save_model(lr_f, "../models/lr_fraud_model.joblib")
trainer.save_model(lr_cc, "../models/lr_creditcard_model.joblib")

Model saved to: ../models/lgbm_fraud_model.joblib
Model saved to: ../models/lgbm_creditcard_model.joblib
Model saved to: ../models/lr_fraud_model.joblib
Model saved to: ../models/lr_creditcard_model.joblib
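
The `.joblib` extension suggests the models are persisted with `joblib`, although the body of `save_model` is not shown here. Under that assumption, a save-and-reload round trip could look like the sketch below, which uses a tiny stand-in model and a temporary directory in place of `../models/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (not one of the notebook's trained models).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model_demo = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model_demo, path)   # write the fitted estimator to disk
    restored = joblib.load(path)    # read it back as a new object

# The reloaded model should predict exactly like the original.
same_predictions = bool(
    (restored.predict(X_demo) == model_demo.predict(X_demo)).all()
)
print(same_predictions)  # True
```

Files written this way are generally meant to be loaded with the same library versions that produced them, so recording the scikit-learn and LightGBM versions next to the saved models is a sensible precaution.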
