# AI for Real-Time Credit Card Fraud Detection

**Copyright (c) 2026 Shrikara Kaudambady. All rights reserved.**

This notebook implements a machine learning model to detect fraudulent credit card transactions. The key challenge is the **severe class imbalance**: fraudulent transactions are extremely rare. We address this with **SMOTE (Synthetic Minority Over-sampling Technique)** and evaluate the model using metrics suited to imbalanced data, such as the Area Under the Precision-Recall Curve (AUPRC).
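To see why plain accuracy is a poor guide under this kind of imbalance, consider a minimal sketch with made-up counts (the numbers below are hypothetical and are not taken from the dataset). A "model" that labels every transaction as legitimate scores very high accuracy while catching no fraud at all.

```python
# Hypothetical counts chosen for illustration; not taken from the dataset.
n_legit = 99_830
n_fraud = 170

# Ground truth: 0 = legitimate, 1 = fraud.
y_true = [0] * n_legit + [1] * n_fraud

# A "model" that ignores its input and always predicts legitimate.
y_pred = [0] * len(y_true)

# Accuracy: share of transactions labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: caught frauds / all frauds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / n_fraud

print(f"Accuracy: {accuracy:.4f}")
print(f"Fraud recall: {fraud_recall:.4f}")
```

Precision, recall, and AUPRC focus on the rare class, which is why the evaluation later in the notebook relies on them.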

### 1. Setup and Library Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc
from imblearn.over_sampling import SMOTE
import lightgbm as lgb

sns.set_theme(style="whitegrid")

### 2. Load and Explore the Data
We will use a standard, anonymized credit card fraud dataset. The first step is to load the data and examine the class distribution to understand the imbalance.

In [None]:
# Load the dataset from a public source
df = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv')

print("Dataset shape:", df.shape)
print("\nClass Distribution:")
class_dist = df['Class'].value_counts()
print(class_dist)
print(f"\nProportion of Fraudulent Transactions: {class_dist[1] / class_dist.sum() * 100:.4f}%")

# Visualize the imbalance
sns.countplot(x='Class', data=df)
plt.title('Class Distribution (0: Legitimate, 1: Fraud)')
plt.show()

### 3. Data Preprocessing
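Most features in this dataset are already anonymized principal components from PCA, but the `Time` and `Amount` columns are raw values on much larger scales. We standardize these two columns so they do not dominate the distance calculations that SMOTE relies on; LightGBM itself, being tree-based, is largely insensitive to feature scaling.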

In [None]:
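# Use one scaler per column so each fitted scaler can be reused on new transactions.
# Note: fitting on the full dataset (before the split) lets test-set statistics
# influence preprocessing; fit on the training rows only for a stricter evaluation.
amount_scaler = StandardScaler()
time_scaler = StandardScaler()

df['scaled_Amount'] = amount_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_Time'] = time_scaler.fit_transform(df['Time'].values.reshape(-1, 1))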

df_processed = df.drop(['Time', 'Amount'], axis=1)

print("Data scaled successfully.")
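The scaling cell above computes its statistics on every row, including rows that later end up in the test set. A stricter variant learns the mean and standard deviation from the training rows only and then applies them to both splits. The sketch below shows the idea with the standard library on hypothetical amounts (the data, the 80/20 positional split, and the `standardize` helper are all assumptions for illustration, not part of the notebook).

```python
import random
import statistics

random.seed(42)

# Hypothetical transaction amounts standing in for the Amount column.
amounts = [round(random.expovariate(1 / 80), 2) for _ in range(1000)]

# Split first: the last 20% of rows play the role of the held-out test set.
split = int(len(amounts) * 0.8)
train_amounts, test_amounts = amounts[:split], amounts[split:]

# Fit: learn mean and population standard deviation from training rows only.
mu = statistics.fmean(train_amounts)
sigma = statistics.pstdev(train_amounts)

def standardize(values):
    """Apply the training-set mean and std to any list of values."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_amounts)
test_scaled = standardize(test_amounts)

print(f"train mean={statistics.fmean(train_scaled):.3f}, "
      f"std={statistics.pstdev(train_scaled):.3f}")
print(f"test  mean={statistics.fmean(test_scaled):.3f}, "
      f"std={statistics.pstdev(test_scaled):.3f}")
```

The training split comes out with mean 0 and standard deviation 1 by construction, while the test split is only approximately standardized. With scikit-learn, the same pattern is calling `fit` on the training rows and `transform` on both splits.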

### 4. Handling Class Imbalance with SMOTE
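This is the most critical step. We first split the data into training and test sets, then apply SMOTE *only to the training set* to create a balanced dataset for the model to learn from. Resampling before the split would let synthetic points derived from test rows leak into training, and the test set must keep the real class distribution.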

In [None]:
X = df_processed.drop('Class', axis=1)
y = df_processed['Class']

# Split data BEFORE applying SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Original training set shape:", X_train.shape)
print("Original training set fraud count:", y_train.sum())

# Apply SMOTE
smote = SMOTE(random_state=42)
print("\nApplying SMOTE to the training data...")
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("\nResampled training set shape:", X_train_resampled.shape)
print("Resampled training set class distribution:", pd.Series(y_train_resampled).value_counts().to_dict())
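For intuition about what SMOTE produces, here is a simplified sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the segment between them. This is not imblearn's implementation; the points, the helper names `nearest_neighbors` and `synthesize`, and the choice of `k=3` are assumptions made for illustration.

```python
import math
import random

random.seed(42)

# Hypothetical 2-D minority-class points (e.g. fraud samples after scaling).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.3), (1.1, 1.4), (0.9, 0.7)]

def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to point, excluding point itself."""
    others = [c for c in candidates if c != point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]

def synthesize(points, n_new, k=3):
    """Create n_new synthetic points by interpolating toward random neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = random.choice(points)
        neighbor = random.choice(nearest_neighbors(base, points, k))
        gap = random.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

new_points = synthesize(minority, n_new=4)
for p in new_points:
    print(tuple(round(v, 3) for v in p))
```

Because every synthetic point lies between two existing minority points, the neighbor search depends on distances, which is one reason the unscaled columns are standardized before resampling.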

### 5. Train the Classification Model
We'll use a LightGBM classifier, which is a fast and powerful gradient boosting model, trained on our new, balanced dataset.

In [None]:
model = lgb.LGBMClassifier(objective='binary', random_state=42)

print("Training LightGBM model...")
model.fit(X_train_resampled, y_train_resampled)
print("Training complete.")

### 6. Model Evaluation
We evaluate the model on the original, imbalanced test set, as this represents the real-world scenario. We will focus on precision, recall, and the area under the precision-recall curve (AUPRC).

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("--- Classification Report ---")
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

# Calculate Precision-Recall curve and AUPRC
precision, recall, _ = precision_recall_curve(y_test, y_prob)
auprc = auc(recall, precision)
print(f"Area Under the Precision-Recall Curve (AUPRC): {auprc:.4f}")

# Plot Precision-Recall Curve
plt.figure(figsize=(10, 7))
plt.plot(recall, precision, label=f'LightGBM (AUPRC = {auprc:.4f})')
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()

# Plot Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Legitimate', 'Fraud'], yticklabels=['Legitimate', 'Fraud'])
plt.title('Confusion Matrix')
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.show()
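To make the precision-recall curve and its area more concrete, the following sketch computes both by hand on a tiny hypothetical example (the labels, scores, and the helper `precision_recall_at` are assumptions for illustration). It sweeps the decision threshold over each distinct score and integrates precision over recall with the trapezoidal rule, in the same spirit as the `precision_recall_curve` and `auc` calls above.

```python
# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_score = [0.05, 0.10, 0.35, 0.40, 0.80, 0.20, 0.15, 0.90]

def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    flagged = [s >= threshold for s in y_score]
    tp = sum(f and t == 1 for f, t in zip(flagged, y_true))
    fp = sum(f and t == 0 for f, t in zip(flagged, y_true))
    fn = sum((not f) and t == 1 for f, t in zip(flagged, y_true))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# One (recall, precision) point per distinct score, lowest threshold first,
# plus the conventional end point (recall=0, precision=1).
points = []
for threshold in sorted(set(y_score)):
    p, r = precision_recall_at(threshold)
    points.append((r, p))
points.append((0.0, 1.0))

# Trapezoidal rule over recall; recall decreases along the list,
# so each segment's width is r0 - r1.
auprc = sum((r0 - r1) * (p0 + p1) / 2
            for (r0, p0), (r1, p1) in zip(points, points[1:]))

for threshold in (0.3, 0.5, 0.85):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
print(f"AUPRC (trapezoidal): {auprc:.4f}")
```

Raising the threshold trades recall for precision. The classification report and confusion matrix reflect only the single default threshold used by `model.predict`, whereas AUPRC summarizes the whole range.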