In [36]:
import os
import sys
import pandas as pd

# 1️⃣ Add the project root (one level above /notebooks) to the very beginning of sys.path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# 2️⃣ Import after the path has been updated
from ccfd_utils.preprocessing import preprocess_data

# 3️⃣ Load and preprocess the data
df = pd.read_csv('../data/creditcard.csv')
X_train, X_test, y_train, y_test = preprocess_data(df)


LOGISTIC REGRESSION

To establish a baseline, a Logistic Regression classifier was trained on the preprocessed dataset, using class weighting to address the imbalance between fraudulent and non-fraudulent transactions. The model achieved an overall accuracy of ~98%, but this number is misleading given the severe class imbalance: only 98 of the 56,962 test transactions are fraudulent. More importantly, it achieved a recall of 0.92 for the fraud class, correctly detecting 90 of the 98 fraudulent transactions.
On the other hand, the precision for the fraud class was only 0.06: the model flagged 1,476 transactions as fraud, and 1,386 of those were false positives.
This makes Logistic Regression a good baseline for sensitivity (recall), but further models are needed to improve precision without significantly reducing recall.
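
As a rough illustration of what class weighting does, the sketch below reproduces the formula scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * count_of_class). The helper name balanced_class_weights is hypothetical, and the counts are the test-set support figures from this notebook, used only as a stand-in because the training-set counts are not reported here.

```python
def balanced_class_weights(class_counts):
    """Return per-class weights using the 'balanced' heuristic:
    n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Illustrative counts only (assumption): test-set support from this notebook,
# standing in for the unreported training-set class counts.
counts = {0: 56864, 1: 98}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

With these counts, an error on a fraud sample is weighted roughly 580 times more heavily than an error on a non-fraud sample, which helps explain why the model leans toward flagging transactions and produces many false positives.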

In [40]:
from sklearn.metrics import classification_report, confusion_matrix

# Logistic Regression Model
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
print("Logistic Regression:\n", classification_report(y_test, lr_pred))
print(confusion_matrix(y_test, lr_pred))



Logistic Regression:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99     56864
           1       0.06      0.92      0.11        98

    accuracy                           0.98     56962
   macro avg       0.53      0.95      0.55     56962
weighted avg       1.00      0.98      0.99     56962

[[55478  1386]
 [    8    90]]
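
To make the "accuracy is misleading" point concrete, the following sketch recomputes the fraud-class metrics directly from the confusion matrix printed above and compares the accuracy with a hypothetical baseline that labels every transaction as non-fraud. The baseline is an assumption added for illustration, not a model trained in this notebook.

```python
# Confusion matrix reported for Logistic Regression:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 55478, 1386
fn, tp = 8, 90
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of fraud alerts that are real frauds
recall = tp / (tp + fn)     # share of real frauds that were caught

# Hypothetical baseline: predict "non-fraud" for every transaction.
majority_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"always-predict-0 accuracy={majority_accuracy:.4f}")
```

The do-nothing baseline scores a higher accuracy than the trained model while catching no fraud at all, which is why precision and recall on the fraud class are the metrics to watch here.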


DECISION TREES
The Decision Tree model traded recall for precision compared to the baseline Logistic Regression. It achieved an overall accuracy of ~100% and a much higher F1-score on the fraud class (0.72 versus 0.11).
It reached a precision of 0.72 and a recall of 0.71 for fraudulent transactions: false positives fell from 1,386 to 27, but missed frauds rose from 8 to 28.
This suggests that the tree can learn non-linear relationships that the linear Logistic Regression model could not capture.
However, decision trees overfit easily, and this tree was grown without a depth limit.

In [41]:
from sklearn.tree import DecisionTreeClassifier
# Decision Tree Model
dt_model = DecisionTreeClassifier(class_weight='balanced')
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print("Decision Tree:\n", classification_report(y_test, dt_pred))
print(confusion_matrix(y_test, dt_pred))



Decision Tree:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.72      0.71      0.72        98

    accuracy                           1.00     56962
   macro avg       0.86      0.86      0.86     56962
weighted avg       1.00      1.00      1.00     56962

[[56837    27]
 [   28    70]]


RANDOM FOREST

The Random Forest model achieved the best performance so far.
It reached an overall accuracy close to 100%, with a precision of 0.96 and recall of 0.74 for the fraud class.
This means that while the model still misses 25 of the 98 fraudulent transactions, nearly all of the detected frauds are genuine (only 3 false positives).

Because Random Forest averages the predictions of many decision trees, it reduces the overfitting of a single tree while still capturing non-linear patterns that Logistic Regression cannot.
Overall, it combines very high precision with moderate recall, a trade-off that matters in fraud detection scenarios.

Note: The n_jobs=-1 parameter tells the Random Forest to use all available CPU cores, which speeds up training significantly compared to using a single core.

In [None]:


from sklearn.ensemble import RandomForestClassifier
# Random Forest Model
rf_model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    n_jobs=-1,          # use all available CPU cores for parallel training
    random_state=42
)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("Random Forest:\n", classification_report(y_test, rf_pred))
print(confusion_matrix(y_test, rf_pred))


Random Forest:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.96      0.74      0.84        98

    accuracy                           1.00     56962
   macro avg       0.98      0.87      0.92     56962
weighted avg       1.00      1.00      1.00     56962

[[56861     3]
 [   25    73]]


XGBOOST 

XGBoost delivered the best overall performance among all models, with the highest fraud-class F1-score (0.86).
It achieved 0.84 recall and 0.89 precision for the fraud class: 82 of the 98 fraudulent transactions were detected, and only 10 of its 92 fraud predictions were false alarms.
Compared with Random Forest, it gives up some precision (0.89 versus 0.96) in exchange for higher recall (0.84 versus 0.74).

XGBoost works by building trees sequentially, each one correcting the errors of the ensemble built so far.
Because of this boosting approach, it is able to capture subtle patterns in the data that the other models may miss.
For problems like credit card fraud detection, where fraud patterns are rare and complex, XGBoost often performs better than single models or bagging methods like Random Forest.
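
The sequential idea can be sketched in a few lines. This is plain gradient boosting with squared error and one-split "stumps" on a toy one-dimensional dataset, not XGBoost's actual algorithm (which also uses second-order gradients and regularization). All names (fit_stump, predict_stump, boost) and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (squared error).
    Assumes xs contains at least two distinct values."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        losses.append(sum(r * r for r in residuals) / len(xs))
        stump = fit_stump(xs, residuals)  # new learner targets remaining errors
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, losses


# Toy data (assumption): a noisy step function.
xs = list(range(10))
ys = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
base, stumps, losses = boost(xs, ys)
print(f"loss before boosting: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Each round fits a new learner to what the current ensemble still gets wrong, so the training loss cannot increase from one round to the next in this setup.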

In [46]:
from xgboost import XGBClassifier

# XGBoost Model
xgb_model = XGBClassifier(
    n_estimators=100,
    # Adjust for class imbalance. This is total / positives, which equals
    # the conventional negatives / positives ratio plus 1.
    scale_pos_weight=len(y_train) / sum(y_train == 1),
    eval_metric='logloss',
    random_state=42
)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)
print("XGBoost:\n", classification_report(y_test, xgb_pred))
print(confusion_matrix(y_test, xgb_pred))



XGBoost:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.89      0.84      0.86        98

    accuracy                           1.00     56962
   macro avg       0.95      0.92      0.93     56962
weighted avg       1.00      1.00      1.00     56962

[[56854    10]
 [   16    82]]
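
As a cross-check on the comparison between models, the sketch below recomputes the fraud-class F1-score and error counts from the four confusion matrices printed in this notebook and ranks the models by F1. Ranking by F1 is an assumption made for illustration; a real deployment would weigh missed frauds and false alarms by their actual costs.

```python
# Confusion matrices as printed above: [[TN, FP], [FN, TP]].
reported = {
    "Logistic Regression": [[55478, 1386], [8, 90]],
    "Decision Tree": [[56837, 27], [28, 70]],
    "Random Forest": [[56861, 3], [25, 73]],
    "XGBoost": [[56854, 10], [16, 82]],
}


def fraud_f1(cm):
    """F1-score for the positive (fraud) class: 2*TP / (2*TP + FP + FN)."""
    (_, fp), (fn, tp) = cm
    return 2 * tp / (2 * tp + fp + fn)


f1_scores = {name: fraud_f1(cm) for name, cm in reported.items()}
ranking = sorted(f1_scores, key=f1_scores.get, reverse=True)

for name in ranking:
    (_, fp), (fn, _) = reported[name]
    print(f"{name:<20} F1={f1_scores[name]:.2f}  "
          f"missed frauds={fn:<3} false alarms={fp}")
```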


In [49]:
# --- after all evaluations ---

# Save the two best models
import os
import joblib

os.makedirs("../models", exist_ok=True)  # make sure the target directory exists
joblib.dump(rf_model,  "../models/random_forest.pkl")
joblib.dump(xgb_model, "../models/xgboost_best.pkl")

print("Models saved ✅")


Models saved ✅
