# Fraud Detection System - V1: Baseline Models

**Date:** 2026-01-30  
**Author:** *Luis Renteria Lezano*  
[LinkedIn](https://www.linkedin.com/in/renteria-luis) | [GitHub](https://github.com/renteria-luis)

## Executive Summary
- **Goal:** Build and evaluate **baseline classification models** to detect **fraudulent credit card transactions**. Focus on **high recall** to catch as many frauds as possible, while balancing precision to reduce false positives.  
- **Source:** This analysis uses the Credit Card Fraud Detection dataset published on [Kaggle by MLG ULB](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data).  
- **Data:** [`../data/raw/creditcard.csv`](../data/raw/creditcard.csv). 
- **Feature Engineering:** Applied with `FeatureEngineering` from [`../src/features.py`](../src/features.py). It is fitted on `X_train`, then used to transform both the train and test sets. Engineered features: `amount_log`, `is_micro_transaction`, `is_large_transaction`, `hour`, `is_night`, `hour_sin`, `hour_cos`.  
- **Target variable:** `Class`:
    - 0 = legitimate transaction  
    - 1 = fraudulent transaction  
- **Evaluation focus:** Recall is prioritized; F1, ROC-AUC, and PR-AUC are also reported. Accuracy is not reported because, with so few fraud cases, a model that labels every transaction as legitimate would still look highly accurate.  
- **Imbalance handling:** `class_weight='balanced'` or SMOTE, applied to the training data.
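
To make the imbalance argument concrete, here is a minimal sketch on made-up label counts (998 legitimate, 2 fraudulent; not the real dataset). It shows how a classifier that always predicts the majority class reaches high accuracy with zero recall. It also computes per-class weights with the `n_samples / (n_classes * class_count)` heuristic that scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` is hypothetical and exists only for this illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (illustrative only): 998 legitimate (0) and 2 fraudulent (1).
y_toy = [0] * 998 + [1] * 2

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_toy)

accuracy = sum(p == t for p, t in zip(y_pred, y_toy)) / len(y_toy)
true_positives = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_toy))
recall = true_positives / sum(t == 1 for t in y_toy)

weights = balanced_class_weights(y_toy)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
print(f"class weights: {weights}")
```

On these toy labels the always-legitimate baseline is 99.8% accurate yet catches no fraud, and the fraud class receives a weight of 250 against roughly 0.5 for the legitimate class.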


## 1. Reproducibility & Environment Setup
- Pin versions in [`../requirements.txt`](../requirements.txt).
- Keep raw data immutable in [`../data/raw`](../data/raw).
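
One way to check what should be pinned is to list the versions installed in the current environment. The sketch below uses the standard-library `importlib.metadata`; the package list is an assumption based on the imports in this notebook, and `pinned_requirements` is a hypothetical helper, not part of the project. The contents of `requirements.txt` are not shown here, so treat the output as something to compare against it.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Skip anything that is not installed in this environment.
            continue
    return lines


# Distribution names for the libraries imported in this notebook (assumed list).
packages = ["pandas", "numpy", "matplotlib", "seaborn", "scikit-learn", "xgboost"]
for line in pinned_requirements(packages):
    print(line)
```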

In [3]:
%reload_ext autoreload
%autoreload 2

import sys
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

from sklearn.pipeline import Pipeline
# Make the parent directory importable so that `src` resolves from this notebook
sys.path.append('..')
from src.features import FeatureEngineering

# 1. Global Reproducibility
SEED = 42
np.random.seed(SEED)

# 2. Path Management
BASE_DIR = Path("..")
DATA_RAW = BASE_DIR / "data" / "raw"
DATA_PROCESSED = BASE_DIR / "data" / "processed"
MODELS_DIR = BASE_DIR / "models"

# 3. Plotting Style
sns.set_theme(style='whitegrid', context='notebook', palette='viridis')
plt.rcParams["figure.figsize"] = (10, 6)

# 4. Global Settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Load & Split Data

In [5]:
raw_file = DATA_RAW / "creditcard.csv"
df = pd.read_csv(raw_file)  # reuse the path defined above instead of a hard-coded string
df.head(3)

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.36,-0.073,2.536,1.378,-0.338,0.462,0.24,0.099,0.364,0.091,-0.552,-0.618,-0.991,-0.311,1.468,-0.47,0.208,0.026,0.404,0.251,-0.018,0.278,-0.11,0.067,0.129,-0.189,0.134,-0.021,149.62,0
1,0.0,1.192,0.266,0.166,0.448,0.06,-0.082,-0.079,0.085,-0.255,-0.167,1.613,1.065,0.489,-0.144,0.636,0.464,-0.115,-0.183,-0.146,-0.069,-0.226,-0.639,0.101,-0.34,0.167,0.126,-0.009,0.015,2.69,0
2,1.0,-1.358,-1.34,1.773,0.38,-0.503,1.8,0.791,0.248,-1.515,0.208,0.625,0.066,0.717,-0.166,2.346,-2.89,1.11,-0.121,-2.262,0.525,0.248,0.772,0.909,-0.689,-0.328,-0.139,-0.055,-0.06,378.66,0


In [22]:
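# Features: every column except the target
X = df.drop('Class', axis=1)
# Keep y as a one-column DataFrame so it can be written with to_parquet below
y = df[['Class']]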

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)

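# Create the output folder if it does not exist yet, then persist the splits
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
X_train.to_parquet(DATA_PROCESSED / 'X_train.parquet', index=False)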
X_test.to_parquet(DATA_PROCESSED / 'X_test.parquet', index=False)
y_train.to_parquet(DATA_PROCESSED / 'y_train.parquet', index=False)
y_test.to_parquet(DATA_PROCESSED / 'y_test.parquet', index=False)

print(f'''> Training set has: {X_train.shape[0]} samples
> Test set has: {X_test.shape[0]} samples
> # of features: {len(X_test.columns)}''')

> Training set has: 227845 samples
> Test set has: 56962 samples
> # of features: 30
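
The split above passes `stratify=y`, which keeps the fraud rate roughly equal in the train and test sets. The sketch below is a simplified pure-Python illustration of that idea on made-up labels: each class is shuffled and split separately with the same test fraction. It is not scikit-learn's implementation, and `stratified_split_indices` is a hypothetical helper introduced only for this example.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, seed=42):
    """Split indices so each class sends about test_size of its rows to test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels (illustrative only): 990 legitimate (0) and 10 fraudulent (1).
y_toy = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split_indices(y_toy)

train_rate = sum(y_toy[i] for i in train_idx) / len(train_idx)
test_rate = sum(y_toy[i] for i in test_idx) / len(test_idx)
print(f"train fraud rate={train_rate:.3f}, test fraud rate={test_rate:.3f}")
```

A similar check on the real splits (for example, comparing `y_train['Class'].mean()` with `y_test['Class'].mean()`) would confirm that both sets carry about the same share of fraud.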


## Preprocessing