# Fraud Detection System - V1: Exploratory Data Analysis (EDA)

**Date:** 2026-01-24    
**Author:** *Luis Renteria Lezano*  
[LinkedIn](https://www.linkedin.com/in/renteria-luis) | [GitHub](https://github.com/renteria-luis)

## Executive Summary
- **Goal:** Understand the key factors influencing **fraudulent transactions** in credit card operations and prepare **clean, structured data** suitable for building **baseline and advanced classification models**. The focus is on **detecting anomalies**, **identifying patterns of fraud**, and creating a **robust dataset** that can support **machine learning algorithms** for **real-time fraud detection**. Special attention is given to the **highly imbalanced target class**, ensuring proper handling of **rare fraudulent cases** during model training and evaluation.
- **Source:** This analysis uses the Credit Card Fraud Detection dataset published on [Kaggle by MLG ULB](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data).
- **Data:** [`../data/raw/creditcard.csv`](../data/raw/creditcard.csv).
- **Target variable:** `Class`:
    - 0 = legitimate transaction
    - 1 = fraudulent transaction

## 1. Reproducibility & Environment Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp
from pathlib import Path

# 1. Global Reproducibility
SEED = 42
np.random.seed(SEED)

# 2. Path Management
BASE_DIR = Path("..")
DATA_RAW = BASE_DIR / "data" / "raw"
DATA_PROCESSED = BASE_DIR / "data" / "processed"
MODELS_DIR = BASE_DIR / "models"

# 3. Plotting Style
sns.set_theme(style='whitegrid', context='notebook', palette='viridis')
plt.rcParams["figure.figsize"] = (10, 6)

# 4. Global Settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## 2. Data Loading & Overview
### 2.1 Load Data

In [None]:
raw_file = DATA_RAW / "creditcard.csv"
df = pd.read_csv(raw_file)  # reuse the path built from DATA_RAW instead of a hard-coded string

### 2.2 Dataset Shape & Info

In [None]:
df.info(verbose=False)

### 2.3 First Rows Preview

In [None]:
df.head(3)

In [None]:
df.tail(3)

**Findings:**
- Features V1-V28 are PCA-transformed (anonymized for privacy purposes)
- Time: seconds elapsed since first transaction
- Amount: transaction value
- Class: 0 (legitimate) / 1 (fraud)

## 3. Data Quality Assessment
### 3.1 Missing Values

In [None]:
missing_per_col = df.isna().sum()

if missing_per_col.sum() > 0:
    # A bare expression inside an if-block is not displayed by the notebook, so print it
    print(missing_per_col[missing_per_col > 0].sort_values(ascending=False).head())
else:
    print('There are no missing values in the dataset.')

### 3.2 Duplicates

In [None]:
duplicates = df.duplicated().sum()

if duplicates > 0:
    print(f'There are {duplicates} duplicated rows in the dataset.')
    df = df.drop_duplicates()
else:
    print('There are no duplicates in the dataset.')

## 4. Target Variable Analysis (Class Imbalance)
### 4.1 Class Distribution

In [None]:
counts = df['Class'].value_counts()
percent = df['Class'].value_counts(normalize=True) * 100

summary = pd.DataFrame({
    'Count': counts,
    'Percentage': percent
}).reset_index()

summary.rename(columns={'index': 'Class'}, inplace=True)
summary

In [None]:
plt.figure(figsize=(6,4))
ax = sns.barplot(x='Class', y='Count', data=summary)
plt.title('Number of Transactions per Class')

for i, row in summary.iterrows():
    ax.text(i, row['Count'] + 100, f"{row['Percentage']:.2f}%", ha='center')

plt.show()

**Key Finding:**
- Class 0: 283,253 (99.83%)
- Class 1: 473 (0.17%)
- **Severe imbalance → resampling (e.g., SMOTE) or class weights will be needed during training**
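
To make the class-weight option concrete, here is a minimal sketch that derives weights from the counts listed above. It assumes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is one reasonable choice rather than the weighting this project necessarily adopts; the variable names are illustrative.

```python
# Class counts taken from the findings above (after duplicates were dropped).
N_LEGIT, N_FRAUD = 283_253, 473

n_samples = N_LEGIT + N_FRAUD
n_classes = 2

# "Balanced" heuristic (an assumption here): weight is inversely
# proportional to class frequency, so each class contributes equally overall.
class_weight = {
    0: n_samples / (n_classes * N_LEGIT),
    1: n_samples / (n_classes * N_FRAUD),
}

for label, weight in class_weight.items():
    print(f"Class {label}: weight = {weight:.3f}")

# The ratio of the two weights equals the imbalance ratio legitimate / fraud.
print(f"Weight ratio (fraud / legitimate): {class_weight[1] / class_weight[0]:.1f}")
```

A dictionary of this shape can typically be passed to estimators that accept per-class weights.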

### 4.2 Imbalance Ratio

In [None]:
# Imbalance ratio = legitimate / fraud (index `counts` by class label, not by row position)
imbalance_ratio = counts[0] / counts[1]
print(f"For every fraudulent transaction, there are about {imbalance_ratio:.0f} legitimate transactions.")

**Implication for modeling:**
- Accuracy is a useless metric here (predicting all 0 already yields ~99.8%)
- Focus on Precision, Recall, F1, ROC-AUC, AUPRC
- A stratified split is essential so that both folds keep the same fraud rate
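
The two points above can be illustrated together. The sketch below builds a label vector from the class counts reported earlier (synthetic labels only, no features), performs a hand-rolled stratified split, and scores the trivial "always legitimate" baseline. The helper `stratified_split` and the 20% test size are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

# Class counts reported in the findings above (after dropping duplicates).
N_LEGIT, N_FRAUD = 283_253, 473
y = np.concatenate([np.zeros(N_LEGIT, dtype=int), np.ones(N_FRAUD, dtype=int)])

def stratified_split(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling each class separately so
    both folds keep roughly the same class proportions."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(idx) * test_size))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

train_idx, test_idx = stratified_split(y)
y_train, y_test = y[train_idx], y[test_idx]

# Trivial baseline: predict "legitimate" (0) for every test transaction.
y_pred = np.zeros_like(y_test)
accuracy = (y_pred == y_test).mean()
recall = ((y_pred == 1) & (y_test == 1)).sum() / (y_test == 1).sum()

print(f"Fraud rate: train={y_train.mean():.4%}, test={y_test.mean():.4%}")
print(f"All-zero baseline: accuracy={accuracy:.4f}, recall={recall:.1f}")
```

The baseline reaches very high accuracy while catching no fraud at all, which is why recall-oriented metrics are preferred here.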

## 5. Statistical Analysis
### 5.1 Summary

In [None]:
df.describe().loc[:, ['Time', 'Amount']]

**Observations:**
- Time: [0, 172792] seconds (~48 hours of data)
- Amount: highly right-skewed (mean << max)
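- V1-V28: PCA components, centered at zero (mean ≈ 0) but not scaled to unit variance; check the `std` row of `df.describe()` before assuming they are standardized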

### 5.2 Feature Types Classification

In [None]:
# 'number' already covers int64/float64; drop the target so it is not listed as a feature
num_cols = df.select_dtypes('number').columns.drop('Class')
cat_cols = df.select_dtypes(['object', 'category', 'string']).columns
bool_cols = df.select_dtypes(['bool']).columns

print(f"Continuous numerical features: {', '.join(num_cols)}")
print(f"Categorical features: {', '.join(cat_cols) or 'none'}")
print("Boolean/Binary: Class (target)")

**Note:** Will create categorical features in FE (hour bins, amount ranges)
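
As a rough sketch of the amount-range idea in the note, the snippet below maps amounts to labeled bins with `np.digitize`. The bin edges and labels are hypothetical placeholders chosen for illustration; the actual ranges would be decided during feature engineering.

```python
import numpy as np

# Hypothetical bin edges in USD; not taken from the original analysis.
AMOUNT_EDGES = np.array([1, 10, 100, 1000])
AMOUNT_LABELS = np.array(["<1", "1-10", "10-100", "100-1000", ">=1000"])

def amount_range(amounts):
    """Map each amount to the label of the half-open bin [low, high) it falls in."""
    idx = np.digitize(np.asarray(amounts, dtype=float), AMOUNT_EDGES)
    return AMOUNT_LABELS[idx]

example = amount_range([0.5, 1.0, 9.99, 10.0, 150.0, 5000.0])
print(example.tolist())
```

With a DataFrame, the same function could be applied as `amount_range(df['Amount'])` to produce a categorical column.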

## 6. Feature Analysis
### 6.1 Time Feature

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
sns.histplot(data=df, x='Time', hue='Class', element='step', stat='density', common_norm=False, bins=48, alpha=0.5, palette='colorblind')
sns.kdeplot(data=df[df['Class']==0], x='Time', color='blue')
ax.set(title='Time Feature [48hr range]', xlabel='Elapsed time in seconds', ylabel='Density')
plt.tight_layout()
plt.show()

In [None]:
hour_48 = (df['Time'] // 3600).astype(int)  # integer hour index 0-47 for clean axis labels
heatmap_data = pd.crosstab(hour_48, df['Class'], normalize='index')

fig, ax = plt.subplots(figsize=(12, 3))
sns.heatmap(heatmap_data[[1]].T, cmap='viridis', annot=False, ax=ax)
ax.set(title='Fraud rate by hour (48h window)', xlabel='Hour (0–47)', ylabel='Fraud')
plt.tight_layout()
plt.show()

**Insights**:  
- The peaks of the blue line (KDE of legitimate transactions) correspond to hours of commercial activity (daytime/afternoon), and the valleys to the overnight lull, which is standard human behavior. Assuming T = 0 is midnight, transaction volume drops between roughly 2:00 AM and 6:00 AM.  
- Although fraudulent transactions peak at around 40,000 seconds (daytime), the plots suggest that fraud is relatively more common during the early-morning valleys, around 10,000–20,000 seconds and 90,000–100,000 seconds.  
- **Potential FE:** `hour_of_day` (sine/cosine encoded) and `is_night`, under the assumption that T = 0 → 00:00
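
A minimal sketch of the time features proposed above, assuming T = 0 corresponds to 00:00 as stated. The function name `add_time_features` is illustrative, and the 2:00-6:00 AM window for `is_night` is an assumption taken from the volume drop noted in the insights.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 3600

def add_time_features(time_seconds):
    """Derive hour-of-day features from elapsed seconds (assumes T = 0 is midnight)."""
    t = np.asarray(time_seconds, dtype=float)
    seconds_of_day = t % SECONDS_PER_DAY
    hour_of_day = (seconds_of_day // 3600).astype(int)

    # Sine/cosine encoding places 23:59 next to 00:00 on the unit circle.
    angle = 2 * np.pi * seconds_of_day / SECONDS_PER_DAY
    return {
        "hour_of_day": hour_of_day,
        "hour_sin": np.sin(angle),
        "hour_cos": np.cos(angle),
        # Assumed night window: 02:00 <= hour < 06:00.
        "is_night": (hour_of_day >= 2) & (hour_of_day < 6),
    }

# 00:00 on day 1, 03:00 on day 1, 03:00 on day 2, and the last second in the data.
features = add_time_features([0, 3 * 3600, SECONDS_PER_DAY + 3 * 3600, 172792])
print(features["hour_of_day"].tolist())
print(features["is_night"].tolist())
```

The cyclic encoding gives the same `hour_sin`/`hour_cos` values for the same clock time on both days, which a raw `Time` value cannot express.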

### 6.2 Amount Feature

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(data=df, x='Amount', kde=True, ax=ax[0])  # Right-skewed feature
sns.histplot(data=df, 
             x=np.log1p(df['Amount']), 
             ax=ax[1], hue='Class', 
             element='step', 
             stat='density', 
             common_norm=False, 
             alpha=0.5, 
             palette='colorblind')  # Transformed feature

ax[0].set(title='Original Skewed Feature (Amount)', xlabel='Amount (USD)', ylabel='')
ax[1].set(title='Log-Transformed Feature by Class', xlabel='log1p(Amount)', ylabel='')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=df, x='Class', y=np.log1p(df['Amount']), hue='Class', palette='colorblind', legend=False, ax=ax)
ax.set(title='Boxplot fraud vs legitimate', xlabel='Class', ylabel='log1p(Amount)')
plt.show()

**Insights**:  
- Class 1 median << Class 0 median.  
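- Class 0 has many high-value outliers (log1p(Amount) > 8, i.e. Amount above roughly 2,980 USD).  
- A tree-based model such as XGBoost may flag legitimate high-value transactions as fraud.  
- Peak at log1p(Amount) ≈ 0.7 → Amount ≈ 1 USD (e^0.7−1 ≈ 1.01).  
  - Typical carding (card-testing) behavior: minimal charge to test the card.  
- Peak at log1p(Amount) ≈ 4.5 → Amount ≈ 89 USD (e^4.5−1 ≈ 89.02).  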
  - Sweet spot: profitable but under automatic alert limits.
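
The conversions between the log scale and USD quoted above can be checked with `math.expm1`, the inverse of `log1p`. This is a small arithmetic sketch; treating 8 as a value on the log1p scale is an assumption based on the boxplot axis.

```python
import math

# Positions read off the log1p(Amount) plots in the insights above.
# The 8.0 cutoff is assumed to be on the log1p scale, like the boxplot axis.
log_positions = {"low peak": 0.7, "high peak": 4.5, "outlier cutoff": 8.0}

# expm1(v) = e^v - 1 undoes log1p(x) = ln(1 + x).
amounts = {name: math.expm1(value) for name, value in log_positions.items()}

for name, usd in amounts.items():
    print(f"{name}: log1p(Amount) = {log_positions[name]} -> Amount ≈ {usd:,.2f} USD")
```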

### 6.3 PCA Features (V1 - V28) & Top Discriminative V Features
A variable is discriminative if its values allow a clear separation between class 0 and class 1.  
The higher the K-S statistic, the more distinct the distributions between classes → the better the discriminative power.
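
To show what the K-S statistic measures, the sketch below computes it by hand as the largest vertical gap between the two empirical CDFs. It runs on synthetic Gaussian samples, not on the dataset; the helper `ks_statistic` is illustrative and should agree with the statistic that `scipy.stats.ks_2samp` reports.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample K-S statistic: maximum absolute gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(42)
legit = rng.normal(loc=0.0, scale=1.0, size=5000)          # stand-in for class 0
fraud_shifted = rng.normal(loc=-3.0, scale=1.0, size=200)  # clearly separated class 1
fraud_same = rng.normal(loc=0.0, scale=1.0, size=200)      # same distribution as class 0

ks_shifted = ks_statistic(legit, fraud_shifted)
ks_same = ks_statistic(legit, fraud_same)

print(f"Shifted distribution:   K-S = {ks_shifted:.2f}")
print(f"Identical distribution: K-S = {ks_same:.2f}")
```

A statistic near 1 means the class distributions barely overlap, while a value near 0 means the feature carries little class information on its own.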

In [None]:
# 1. Identify the most discriminative variables using the K-S Test
v_features = [f'V{i}' for i in range(1, 29)]
ks_stats = {}

for col in v_features:
    stat, _ = ks_2samp(df[df['Class'] == 0][col], df[df['Class'] == 1][col])
    ks_stats[col] = stat

# Select Top 10
top_10_features = sorted(ks_stats, key=ks_stats.get, reverse=True)[:10]
print(f"Top 10 discriminative features: {top_10_features}")

# 2. Visualization: Density Plots
fig, axes = plt.subplots(2, 5, figsize=(20, 10))
axes = axes.flatten()

for i, col in enumerate(top_10_features):
    sns.kdeplot(data=df, x=col, hue='Class', common_norm=False, ax=axes[i], fill=True)
    axes[i].set_title(f'{col} (K-S: {ks_stats[col]:.2f})')

plt.tight_layout()
plt.show()

In [None]:
# 3. Scatter Plot of critical variables (e.g., V14 vs V10, V12, V4)
fig, ax = plt.subplots(1, 3, figsize=(13, 4))
sns.scatterplot(data=df, x='V14', y='V10', hue='Class', s=8, palette='colorblind', ax=ax[0])
sns.scatterplot(data=df, x='V14', y='V12', hue='Class', s=8, palette='colorblind', ax=ax[1])
sns.scatterplot(data=df, x='V14', y='V4', hue='Class', s=8, palette='colorblind', ax=ax[2])

fig.suptitle('Detected Anomaly: V14 vs V10, V12, V4')
plt.tight_layout()
plt.show()

**Insight:** V14, V10, V12, V4, V11, V17, V3, V16, V7, and V2 show strong separation between classes, with V14 being the most discriminative.