# Fraud Detection System - V1: Exploratory Data Analysis (EDA)

**Date:** 2026-01-24    
**Author:** *Luis Renteria Lezano*  
[LinkedIn](https://www.linkedin.com/in/renteria-luis) | [GitHub](https://github.com/renteria-luis)

## Executive Summary
- **Goal:** Understand the key factors behind **fraudulent transactions** and prepare **clean, structured data** suitable for building **baseline and advanced classification models**. The focus is on **detecting anomalies**, **identifying patterns of fraud**, and creating a **robust dataset** that can support **machine learning algorithms** for **real-time fraud detection**. Special attention is given to the **highly imbalanced target class**, so that **rare fraudulent cases** are handled properly during model training and evaluation.
- **Source:** This analysis uses the PaySim synthetic mobile money dataset, published on Kaggle as [Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/datasets/ealaxi/paysim1).
- **Data:** [`../data/raw/PS_20174392719_1491204439457_log.csv`](../data/raw/PS_20174392719_1491204439457_log.csv).
- **Target variable:** `isFraud`:
    - 0 = legitimate transaction
    - 1 = fraudulent transaction

## 1. Reproducibility & Environment Setup
- Pin versions in [`../requirements.txt`](../requirements.txt).
- Keep raw data immutable in [`../data/raw/`](../data/raw/).
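
One way to check that the raw file really stays immutable is to record a checksum once and compare against it on later runs. The sketch below is a minimal illustration using only the standard library; the helper names `sha256_of_file` and `raw_file_unchanged` are hypothetical and not part of this project's `src` package.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def raw_file_unchanged(path: Path, expected_digest: str) -> bool:
    """Compare the current digest with one recorded earlier (hypothetical workflow)."""
    return sha256_of_file(path) == expected_digest
```

A possible usage is to store the digest next to the data (for example in a small text file) the first time the dataset is downloaded, then call `raw_file_unchanged` at the top of each notebook.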

In [2]:
%reload_ext autoreload
%autoreload 2

import sys
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp

from sklearn.model_selection import train_test_split

sys.path.append('..')
from src.features import FeatureEngineering

# 1. Global Reproducibility
SEED = 42
np.random.seed(SEED)

# 2. Path Management
BASE_DIR = Path("..")
ASSETS_DIR = BASE_DIR / "assets" / "figures"
DATA_RAW = BASE_DIR / "data" / "raw"
DATA_PROCESSED = BASE_DIR / "data" / "processed"
MODELS_DIR = BASE_DIR / "models"

# 3. Plotting Style
sns.set_theme(style='whitegrid', context='notebook', palette='viridis')
plt.rcParams["figure.figsize"] = (10, 6)

# 4. Global Settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## 2. Data Loading & Overview
### 2.1 Load Data

In [3]:
raw_file = DATA_RAW / 'PS_20174392719_1491204439457_log.csv'
df = pd.read_csv(raw_file)

### 2.2 Dataset Shape & Info

In [4]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
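
With more than 6 million rows and over 534 MB reported by `df.info()`, narrower dtypes may be worth considering. The sketch below shows one possible approach on a tiny hand-built sample: the low-cardinality `type` column becomes a categorical and the integer columns are downcast. The helper `shrink_dtypes` is hypothetical, and float columns are deliberately left as `float64` here to avoid losing precision on amounts and balances.

```python
import pandas as pd


def shrink_dtypes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a categorical `type` and downcast integer columns."""
    out = frame.copy()
    # `type` has only a handful of distinct values, so category codes are compact
    out["type"] = out["type"].astype("category")
    # Downcast integer columns to the smallest integer type that fits their values
    for col in ["step", "isFraud", "isFlaggedFraud"]:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Tiny illustrative sample (not the full dataset)
sample = pd.DataFrame({
    "step": [1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER"],
    "amount": [9839.64, 1864.28, 181.0],
    "isFraud": [0, 0, 1],
    "isFlaggedFraud": [0, 0, 0],
})

small = shrink_dtypes(sample)
print(small.dtypes)
# Compare memory before and after; savings only become meaningful on large frames
print(sample.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```

On the full dataset the same mapping could also be passed to `pd.read_csv` through its `dtype` argument, so the wide types are never materialized.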


### 2.3 First Rows Preview

In [5]:
df.head(3)

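|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|------|------|--------|----------|---------------|----------------|----------|----------------|----------------|---------|----------------|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |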
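
In the three preview rows, `newbalanceOrig` equals `oldbalanceOrg - amount`. The sketch below checks that relationship on those same rows; whether it holds across the full dataset is not established here and would need to be verified on `df`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The three rows shown in the preview above
preview = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.0],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0],
})

# Balance the origin account would have if the amount were simply debited
expected_new = preview["oldbalanceOrg"] - preview["amount"]

# Compare with a small tolerance to absorb floating-point rounding
balance_matches = np.isclose(preview["newbalanceOrig"], expected_new, atol=0.01)
share_consistent = balance_matches.mean()
print(f"Share of rows with a consistent origin balance: {share_consistent:.2%}")
```

Running the same comparison on the full frame, split by `type` and by `isFraud`, would show where the post-transaction balances deviate from this simple rule.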


### 2.4 Feature Exploration

In [8]:
for col in df.columns:
    # Skip float columns: unique-value counts are not informative for continuous amounts and balances
    if pd.api.types.is_float_dtype(df[col]):
        continue
    n_unique = df[col].nunique()
    
    if n_unique > 10:
        print(f"{col}: high cardinality ({n_unique} unique values)")
    else:
        print(f"{col}: {n_unique} unique values -> {df[col].unique()}")

step: high cardinality (743 unique values)
type: 5 unique values -> ['PAYMENT' 'TRANSFER' 'CASH_OUT' 'DEBIT' 'CASH_IN']
nameOrig: high cardinality (6353307 unique values)
nameDest: high cardinality (2722362 unique values)
isFraud: 2 unique values -> [0 1]
isFlaggedFraud: 2 unique values -> [0 1]
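
Since `type` has only five values and `isFraud` is binary, a natural follow-up is the fraud rate overall and per transaction type. The sketch below uses a small toy frame whose values are invented purely for illustration; it makes no claim about the actual rates in the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT"],
    "isFraud": [0, 0, 1, 0, 1, 0],
})

# Overall share of fraudulent rows (the mean of a 0/1 column)
overall_rate = toy["isFraud"].mean()

# Per-type row count, fraud count, and fraud rate
by_type = (
    toy.groupby("type")["isFraud"]
    .agg(n="size", n_fraud="sum", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)

print(f"Overall fraud rate: {overall_rate:.4f}")
print(by_type)
```

Applied to `df`, the same aggregation gives a quick view of how imbalanced the target is and whether fraud concentrates in particular transaction types.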


**Findings (PaySim 1):**

* `step`: simulation time step, in hours since the first transaction (743 distinct values).
* `type`: transaction type (`CASH_IN`, `CASH_OUT`, `TRANSFER`, `DEBIT`, `PAYMENT`).
* `amount`: transaction amount.
* `nameOrig`: ID of the customer who initiates the transaction (high cardinality, about 6.35 million unique values -> dropped).
* `nameDest`: ID of the recipient; in PaySim, IDs prefixed with `C` are customers and IDs prefixed with `M` are merchants (high cardinality, about 2.72 million unique values -> dropped).
* `oldbalanceOrg`: balance of the origin account before the transaction.
* `newbalanceOrig`: balance of the origin account after the transaction (post-transaction value, treated as data leakage -> dropped).
* `oldbalanceDest`: balance of the destination account before the transaction.
* `newbalanceDest`: balance of the destination account after the transaction (post-transaction value, treated as data leakage -> dropped).
* `isFraud`: target variable, 0 (legitimate) / 1 (fraud).
* `isFlaggedFraud`: 0/1, automatically flagged by business rules (rare, mostly 0 -> dropped).
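
The drop decisions listed above can be sketched as a small selection step followed by a stratified split, which keeps the fraud share equal in train and test. This is an assumption-laden illustration: `DROP_COLS` and `select_features` are hypothetical names, the toy frame is invented, and none of it is meant to describe what the project's `FeatureEngineering` class does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42
TARGET = "isFraud"
# Columns marked as dropped in the findings above
DROP_COLS = ["nameOrig", "nameDest", "newbalanceOrig", "newbalanceDest", "isFlaggedFraud"]


def select_features(frame: pd.DataFrame):
    """Split a frame into features X (without dropped columns) and target y."""
    X = frame.drop(columns=DROP_COLS + [TARGET])
    y = frame[TARGET]
    return X, y


# Toy frame with the dataset's schema; all values are invented for illustration
toy = pd.DataFrame({
    "step": range(1, 21),
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"] * 4,
    "amount": [100.0 * i for i in range(1, 21)],
    "nameOrig": [f"C{i}" for i in range(20)],
    "oldbalanceOrg": [1000.0] * 20,
    "newbalanceOrig": [900.0] * 20,
    "nameDest": [f"M{i}" for i in range(20)],
    "oldbalanceDest": [0.0] * 20,
    "newbalanceDest": [0.0] * 20,
    "isFraud": [1, 0, 0, 0, 0] * 4,
    "isFlaggedFraud": [0] * 20,
})

X, y = select_features(toy)

# stratify=y preserves the class proportions in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)
print(y_train.mean(), y_test.mean())
```

Stratification matters most when the positive class is rare: without it, a random split can leave the test set with too few fraud cases to evaluate on.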