# Credit Card Fraud Detection - Model Development

## Objectives
- Build robust classical ML models for fraud detection
- Handle severe class imbalance
- Optimize for business metrics
- Compare multiple algorithms: Logistic Regression, Decision Tree, Random Forest, SVM
- Deploy the best model with proper evaluation

## Challenges identified from EDA
- Severe class imbalance
- PCA-transformed features limit interpretability
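
Severe class imbalance is the main reason plain accuracy is a poor yardstick for this problem. The sketch below uses made-up labels at roughly the 0.17% fraud rate reported by the EDA insights (10,000 toy transactions, 17 of them fraudulent; both counts are illustrative assumptions) and scores a trivial model that never flags fraud.

```python
import numpy as np

# Toy labels: 10,000 transactions, 17 frauds (~0.17%); illustrative counts only
n_total, n_fraud = 10_000, 17
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A trivial "model" that labels every transaction as legitimate
y_pred = np.zeros(n_total, dtype=int)

# Accuracy: share of all predictions that are correct
accuracy = np.mean(y_pred == y_true)
# Recall on the fraud class: share of true frauds that were flagged
recall = np.mean(y_pred[y_true == 1] == 1)

print(f"Accuracy: {accuracy:.4f}")      # Accuracy: 0.9983
print(f"Fraud recall: {recall:.4f}")    # Fraud recall: 0.0000
```

The model looks almost perfect on accuracy while catching no fraud at all, which is why recall, precision, and similar class-aware metrics are the more useful targets here.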

In [61]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import json
import warnings
warnings.filterwarnings("ignore")

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Metric libraries

# Libraries to handle imbalanced datasets
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

print("Required libraries imported successfully!")

Required libraries imported successfully!


In [22]:
# Load dataset and insights from EDA
df = pd.read_csv('../data/creditcard.csv')
# Use a context manager so the JSON file handle is closed after reading
with open('../data/eda_insights.json', 'r') as f:
    insights = json.load(f)

print("Dataset and insights loaded successfully!")

Dataset and insights loaded successfully!


In [23]:
# Display the key insights loaded in the previous cell
print("=== KEY INSIGHTS FROM EDA ===")
for key, value in insights.items():
    print(f"{key}: {value}")

=== KEY INSIGHTS FROM EDA ===
memory_usage_mb: 67.36
dataset_shape: [284807, 32]
imbalance_ratio: 577.88
fraud_rate: 0.17
missing_values: {}
duplicates_percentage: 0.379555277784605
top_15_features: ['V17', 'V14', 'V12', 'V10', 'V16', 'V3', 'V7', 'V11', 'V4', 'V18', 'V1', 'V9', 'V5', 'V2', 'V6']
significant_features: ['V14', 'V4', 'V12', 'V11', 'V10', 'V3', 'V2', 'V16', 'V9', 'V7', 'V17', 'V1', 'V6', 'V21', 'V18', 'V5', 'V27', 'V8', 'V19', 'V20', 'V28', 'Time', 'V24', 'Amount']
amount_stats: {'legitimate_mean': 88.29102242231328, 'fraudulent_mean': 122.21132113821139, 'max_transaction_amount': 25691.16, 'min_transaction_amount': 0.0, 'amount_std': 250.1201092401885}
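
The `imbalance_ratio` and `fraud_rate` values above can be reproduced with simple arithmetic. The sketch below is a check under one assumption: the fraud count of 492 is not printed anywhere in this notebook and is taken from the public version of the dataset (it agrees with the 417 training frauds and the 0.13% test fraud rate reported further down). Note that `fraud_rate` is expressed in percent, not as a fraction.

```python
n_total = 284_807   # number of rows, from dataset_shape
n_fraud = 492       # assumed fraud count (not printed in this notebook)
n_legit = n_total - n_fraud

# Legitimate transactions per fraudulent transaction
imbalance_ratio = n_legit / n_fraud
# Share of fraudulent transactions, in percent
fraud_rate_pct = n_fraud / n_total * 100

print(f"imbalance_ratio: {imbalance_ratio:.2f}")   # imbalance_ratio: 577.88
print(f"fraud_rate: {fraud_rate_pct:.2f}%")        # fraud_rate: 0.17%
```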


In [24]:
# Feature preparation
print("=== FEATURE PREPARATION ===")

# Split dataset into features and target
X = df.drop(columns=['Class'])
y = df['Class']
print("\nFeatures and target variable separated successfully!")

# Consider only top 15 features based on EDA insights
top_features = insights['top_15_features']
X_selected = X[top_features]
print(f"\nSelected top {len(top_features)} features for modelling.")
print(f"\nSelected features: {', '.join(top_features)}")

=== FEATURE PREPARATION ===

Features and target variable separated successfully!

Selected top 15 features for modelling.

Selected features: V17, V14, V12, V10, V16, V3, V7, V11, V4, V18, V1, V9, V5, V2, V6


In [30]:
# Time based train-test split
print("\n=== TIME BASED TRAIN-TEST SPLIT ===")

# sort by 'Time' to ensure temporal order
df_sorted = df.sort_values(by='Time').reset_index(drop=True)
X_sorted = df_sorted[top_features]
y_sorted = df_sorted['Class']
print("\nDataset sorted by 'Time' for temporal split.")

# Split into train and test sets (First 80% train, Last 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_sorted, y_sorted, test_size=0.2, shuffle=False)
print("\nTrain-test split completed successfully!")

# print shapes of train and test sets
print(f"\nTrain set shape: {X_train.shape}, Test set shape: {X_test.shape}")
print(f"Train fraud rate: {y_train.mean() * 100:.2f}%, Test fraud rate: {y_test.mean() * 100:.2f}%")


=== TIME BASED TRAIN-TEST SPLIT ===

Dataset sorted by 'Time' for temporal split.

Train-test split completed successfully!

Train set shape: (227845, 15), Test set shape: (56962, 15)
Train fraud rate: 0.18%, Test fraud rate: 0.13%
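
A time-based split is only useful if no test transaction precedes a training one. The sketch below shows one way to check that on a toy frame; the frame, its values, and the names `toy`, `split_idx`, and `no_leakage` are hypothetical, and plain positional slicing stands in for `train_test_split(..., shuffle=False)`.

```python
import numpy as np
import pandas as pd

# Toy frame with a 'Time' column in shuffled order (hypothetical data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Time": rng.permutation(100),
    "Class": rng.integers(0, 2, size=100),
})

# Sort chronologically, then take the first 80% as train and the rest as test
toy_sorted = toy.sort_values(by="Time").reset_index(drop=True)
split_idx = int(len(toy_sorted) * 0.8)
train, test = toy_sorted.iloc[:split_idx], toy_sorted.iloc[split_idx:]

# Leakage check: the latest training timestamp must not exceed the earliest test one
no_leakage = train["Time"].max() <= test["Time"].min()
print(f"Train rows: {len(train)}, test rows: {len(test)}, no leakage: {no_leakage}")
```

The same comparison could be run on `df_sorted['Time']` at the real split point. The gap between the train and test fraud rates (0.18% vs. 0.13%) is a normal side effect of splitting by time rather than stratifying.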


In [33]:
# Feature scaling
print("\n=== FEATURE SCALING ===")
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("\nFeature scaling completed using RobustScaler!")


=== FEATURE SCALING ===

Feature scaling completed using RobustScaler!
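
With its default settings, `RobustScaler` centers each feature on its median and divides by its interquartile range (25th to 75th percentile), so a few extreme values have little effect on the scale. The NumPy sketch below reproduces that arithmetic on a made-up column; the arrays are illustrative, and the statistics come from the training values only, mirroring the `fit_transform` / `transform` pattern in the cell above.

```python
import numpy as np

# Toy feature column with one extreme outlier (made-up values)
train_col = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
test_col = np.array([2.5, 50.0])

# Statistics are estimated on the training data only
median = np.median(train_col)
q1, q3 = np.percentile(train_col, [25, 75])
iqr = q3 - q1

# Apply the same training statistics to both sets
train_scaled = (train_col - median) / iqr
test_scaled = (test_col - median) / iqr

print(train_scaled)
print(test_scaled)
```

The outlier at 100 stays large after scaling, but it does not distort the scale applied to the bulk of the values, which is what a mean and standard deviation would do.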


In [62]:
# Handling class imbalance
print("\n=== HANDLING CLASS IMBALANCE ===")

# Strategy 1: Original class distribution
print("\nStrategy 1: Original class distribution")
print(f"Legitimate: {len(y_train[y_train==0])} ({len(y_train[y_train==0]) / len(y_train) * 100:.2f}%)")
print(f"Fraudulent: {len(y_train[y_train==1])} ({len(y_train[y_train==1]) / len(y_train) * 100:.2f}%)")

# Strategy 2: SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
print("\nStrategy 2: SMOTE applied")
print(f"Legitimate: {len(y_train_smote[y_train_smote==0])} ({len(y_train_smote[y_train_smote==0]) / len(y_train_smote) * 100:.2f}%)")
print(f"Fraudulent: {len(y_train_smote[y_train_smote==1])} ({len(y_train_smote[y_train_smote==1]) / len(y_train_smote) * 100:.2f}%)")

# Strategy 3: Borderline SMOTE
borderline_smote = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_train_borderline_smote, y_train_borderline_smote = borderline_smote.fit_resample(X_train_scaled, y_train)
# Report success only after resampling has actually run
print("\nStrategy 3: Borderline SMOTE applied")
print(f"Legitimate: {len(y_train_borderline_smote[y_train_borderline_smote==0])} ({len(y_train_borderline_smote[y_train_borderline_smote==0]) / len(y_train_borderline_smote) * 100:.2f}%)")
print(f"Fraudulent: {len(y_train_borderline_smote[y_train_borderline_smote==1])} ({len(y_train_borderline_smote[y_train_borderline_smote==1]) / len(y_train_borderline_smote) * 100:.2f}%)")


=== HANDLING CLASS IMBALANCE ===

Strategy 1: Original class distribution
Legitimate: 227428 (99.82%)
Fraudulent: 417 (0.18%)

Strategy 2: SMOTE applied
Legitimate: 227428 (50.00%)
Fraudulent: 227428 (50.00%)

Strategy 3: Borderline SMOTE applied
Legitimate: 227428 (50.00%)
Fraudulent: 227428 (50.00%)
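
To reach the 50/50 balance shown above, SMOTE has to create 227,011 synthetic fraud rows (227,428 minus the 417 real ones). Each synthetic row is an interpolation between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that core step in NumPy only; it is a simplified illustration, not imbalanced-learn's implementation, and the points, `k`, and the helper `smote_like_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2D (hypothetical values, no duplicates)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
k = 2  # number of nearest minority neighbours to choose from


def smote_like_sample(points, k, rng):
    """Create one synthetic point by interpolating toward a near neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the argsort is the point itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority, k, rng) for _ in range(10)])
print(synthetic.shape)  # (10, 2)
```

Every synthetic point lies on a line segment between two real minority points. Borderline SMOTE follows the same interpolation idea but, broadly, seeds new samples only from minority points that sit close to the class boundary.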
