# Data Exploration - Fraud Detection Dataset

This notebook demonstrates loading and exploring the fraud detection dataset from S3.
We analyze feature distributions, dataset statistics, and class imbalance to inform
feature engineering and model training decisions.

**Requirements covered:** 2.1 (Load Parquet from S3), 2.2 (Visualization utilities), 2.3 (Statistical summaries), 2.4 (Schema and sample records)

## 1. Setup and Imports

In [None]:
import sys
import io

import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add project src to path for ExperimentTracker
sys.path.insert(0, '../src')
from experiment_tracking import ExperimentTracker

sns.set_theme(style='whitegrid')
%matplotlib inline

## 2. Load Data from S3

Load the train, validation, and test Parquet splits from the `fraud-detection-data` S3 bucket.

**Requirement 2.1**: Load Parquet datasets from S3 (train, validation, test splits)

In [None]:
BUCKET_NAME = 'fraud-detection-data'
DATA_PREFIX = 'processed'

s3_client = boto3.client('s3')


def load_parquet_from_s3(bucket: str, key: str) -> pd.DataFrame:
    """Load a Parquet file from S3 into a pandas DataFrame."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_parquet(io.BytesIO(response['Body'].read()))


train_df = load_parquet_from_s3(BUCKET_NAME, f'{DATA_PREFIX}/train.parquet')
val_df = load_parquet_from_s3(BUCKET_NAME, f'{DATA_PREFIX}/validation.parquet')
test_df = load_parquet_from_s3(BUCKET_NAME, f'{DATA_PREFIX}/test.parquet')

print(f'Train set:      {train_df.shape[0]:>8,} rows, {train_df.shape[1]} columns')
print(f'Validation set: {val_df.shape[0]:>8,} rows, {val_df.shape[1]} columns')
print(f'Test set:       {test_df.shape[0]:>8,} rows, {test_df.shape[1]} columns')

## 3. Dataset Schema and Sample Records

**Requirement 2.4**: Display dataset schema and sample records

In [None]:
print('=== Dataset Schema ===')
print(f'{"Column":<20} {"Dtype":<15} {"Non-Null Count"}')
print('-' * 55)
for col in train_df.columns:
    non_null = train_df[col].notna().sum()
    print(f'{col:<20} {str(train_df[col].dtype):<15} {non_null}/{len(train_df)}')

In [None]:
print('=== Sample Records (first 5 rows) ===')
train_df.head()
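
The schema printed above covers only the train split. A minimal sketch of a cross-split check follows, assuming all three splits are expected to share column names and dtypes. `compare_schemas` is a hypothetical helper, not part of the project code, and the toy frames stand in for the real splits.

```python
import pandas as pd


def compare_schemas(reference: pd.DataFrame, other: pd.DataFrame) -> dict:
    """Report column and dtype differences between two DataFrames."""
    ref_types = reference.dtypes.astype(str).to_dict()
    other_types = other.dtypes.astype(str).to_dict()
    return {
        # Columns present in the reference but absent from the other frame
        'missing_columns': sorted(set(ref_types) - set(other_types)),
        # Columns present only in the other frame
        'extra_columns': sorted(set(other_types) - set(ref_types)),
        # Shared columns whose dtypes differ: {column: (reference, other)}
        'dtype_mismatches': {
            col: (ref_types[col], other_types[col])
            for col in ref_types
            if col in other_types and ref_types[col] != other_types[col]
        },
    }


# Toy frames for illustration only; in the notebook these would be
# train_df and val_df (or test_df).
toy_train = pd.DataFrame({'Amount': [1.0, 2.0], 'Class': [0, 1]})
toy_val = pd.DataFrame({'Amount': [1, 2], 'Extra': [0, 0]})
schema_report = compare_schemas(toy_train, toy_val)
print(schema_report)
```

An empty report (no missing, extra, or mismatched columns) suggests the splits can be fed through the same preprocessing code.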

## 4. Statistical Summary

**Requirement 2.3**: Statistical summary functions for dataset characteristics (record counts, missing values, feature ranges)

In [None]:
def dataset_summary(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Generate a statistical summary for a dataset split.

    Returns a DataFrame indexed by column name with count, missing,
    missing_pct, min, max, mean, std, and median. The numeric statistics
    are NaN for non-numeric columns.
    """
    summary = pd.DataFrame({
        'count': df.count(),
        'missing': df.isnull().sum(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2),
        'min': df.min(numeric_only=True),
        'max': df.max(numeric_only=True),
        'mean': df.mean(numeric_only=True).round(4),
        'std': df.std(numeric_only=True).round(4),
        'median': df.median(numeric_only=True).round(4)
    })
    print(f'\n=== {name} Summary ({len(df):,} records) ===')
    return summary


train_summary = dataset_summary(train_df, 'Train')
train_summary

In [None]:
val_summary = dataset_summary(val_df, 'Validation')
val_summary

In [None]:
test_summary = dataset_summary(test_df, 'Test')
test_summary
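
The three summaries are easier to compare when the differences are computed directly. The sketch below expresses each feature's mean shift between two splits in units of the train standard deviation. `compare_split_means` is a hypothetical helper and the toy frames are illustrative, not taken from the dataset.

```python
import pandas as pd


def compare_split_means(train: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Compare numeric column means of `other` against `train`.

    `std_diff` is (other_mean - train_mean) / train_std, so a value of 1.0
    means the split mean sits one train standard deviation above the train
    mean. Columns with zero train variance yield inf or NaN.
    """
    train_mean = train.mean(numeric_only=True)
    train_std = train.std(numeric_only=True)
    other_mean = other.mean(numeric_only=True)
    result = pd.DataFrame({
        'train_mean': train_mean,
        'other_mean': other_mean,
        'std_diff': (other_mean - train_mean) / train_std,
    })
    # Largest absolute shifts first
    return result.sort_values('std_diff', key=lambda s: s.abs(), ascending=False)


# Toy data for illustration only
toy_train = pd.DataFrame({'a': [0.0, 2.0, 4.0], 'b': [1.0, 1.0, 4.0]})
toy_other = pd.DataFrame({'a': [4.0, 4.0, 4.0], 'b': [2.0, 2.0, 2.0]})
shift_report = compare_split_means(toy_train, toy_other)
print(shift_report)
```

In the notebook, `compare_split_means(train_df, val_df)` would be the equivalent call. Large shifts may indicate that the splits were not drawn from the same distribution.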

## 5. Feature Distribution Visualizations

**Requirement 2.2**: Visualization utilities for feature distributions, correlations, and class imbalance analysis

In [None]:
def plot_feature_distributions(df: pd.DataFrame, features: list, ncols: int = 4) -> None:
    """Plot histograms for the given features, coloured by Class label."""
    nrows = int(np.ceil(len(features) / ncols))
    # squeeze=False keeps `axes` a 2-D array even for a 1x1 grid, so flatten() always works
    fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 4 * nrows), squeeze=False)
    axes = axes.flatten()

    for idx, feat in enumerate(features):
        ax = axes[idx]
        for label, colour in [(0, 'steelblue'), (1, 'crimson')]:
            subset = df[df['Class'] == label][feat]
            ax.hist(subset, bins=50, alpha=0.6, label=f'Class {label}', color=colour)
        ax.set_title(feat)
        ax.legend(fontsize=8)

    # Hide unused axes
    for idx in range(len(features), len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.show()


# Plot PCA features V1-V28
pca_features = [f'V{i}' for i in range(1, 29)]
plot_feature_distributions(train_df, pca_features)

In [None]:
# Distribution of Time and Amount
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(train_df['Time'], bins=50, color='steelblue', edgecolor='white')
axes[0].set_title('Transaction Time Distribution')
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('Count')

axes[1].hist(train_df['Amount'], bins=50, color='darkorange', edgecolor='white')
axes[1].set_title('Transaction Amount Distribution')
axes[1].set_xlabel('Amount')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
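
If the Amount histogram turns out to be dominated by a few large values, as transaction amounts often are, a log transform can make the bulk of the distribution visible. The sketch below uses synthetic log-normal data as a stand-in, since the real shape of `Amount` is not known here. `skewness` is a hypothetical helper.

```python
import numpy as np


def skewness(values) -> float:
    """Population skewness: third central moment over variance ** 1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Synthetic right-skewed amounts, for illustration only
rng = np.random.default_rng(seed=0)
amounts = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

# log1p(x) = log(1 + x) is defined at zero, so zero-value amounts are safe
log_amounts = np.log1p(amounts)

print(f'Skewness before: {skewness(amounts):.2f}')
print(f'Skewness after:  {skewness(log_amounts):.2f}')
```

In the notebook, `np.log1p(train_df['Amount'])` could be passed to `ax.hist` in place of the raw column. `np.expm1` inverts the transform.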

### 5.1 Feature Correlation Heatmap

In [None]:
def plot_correlation_heatmap(df: pd.DataFrame, title: str = 'Feature Correlation Matrix') -> None:
    """Plot a correlation heatmap for all numeric features."""
    corr = df.select_dtypes(include=[np.number]).corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))

    plt.figure(figsize=(16, 14))
    # fmt only applies when annot=True, so it is omitted here
    sns.heatmap(corr, mask=mask, cmap='coolwarm', center=0,
                linewidths=0.5, square=True)
    plt.title(title)
    plt.tight_layout()
    plt.show()


plot_correlation_heatmap(train_df)

### 5.2 Top Correlated Features with Target

In [None]:
# Absolute Pearson correlation of each numeric feature with the target
target_corr = (
    train_df.select_dtypes(include=[np.number])
    .corr()['Class']
    .drop('Class')
    .abs()
    .sort_values(ascending=False)
)

plt.figure(figsize=(10, 6))
target_corr.head(15).plot(kind='barh', color='teal')
plt.title('Top 15 Features Correlated with Class (absolute value)')
plt.xlabel('|Correlation|')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 6. Class Imbalance Analysis

Fraud detection datasets are typically highly imbalanced. Understanding the class
distribution is critical for choosing the right evaluation metrics and sampling strategies.

In [None]:
def analyze_class_imbalance(df: pd.DataFrame, name: str) -> dict:
    """Print the class distribution for a dataset split.

    Returns a dict with the legitimate count, the fraud count, and the
    legitimate-to-fraud imbalance ratio (inf if the split has no fraud rows).
    """
    counts = df['Class'].value_counts().sort_index()
    total = len(df)
    legit = int(counts.get(0, 0))
    fraud = int(counts.get(1, 0))
    ratio = legit / fraud if fraud > 0 else float('inf')

    print(f'=== {name} Class Distribution ===')
    print(f'  Legitimate (0): {legit:>8,}  ({legit/total*100:.2f}%)')
    print(f'  Fraud      (1): {fraud:>8,}  ({fraud/total*100:.2f}%)')
    print(f'  Imbalance ratio: {ratio:.1f}:1')

    return {'legitimate': legit, 'fraud': fraud, 'ratio': ratio}


train_imbalance = analyze_class_imbalance(train_df, 'Train')
val_imbalance = analyze_class_imbalance(val_df, 'Validation')
test_imbalance = analyze_class_imbalance(test_df, 'Test')
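
To see why the imbalance ratio matters for metric choice, consider a classifier that labels every transaction as legitimate. The sketch below computes its accuracy and fraud recall. The counts are hypothetical, not taken from the dataset, and `baseline_metrics` is an illustrative helper.

```python
def baseline_metrics(n_legit: int, n_fraud: int) -> dict:
    """Metrics for a classifier that predicts 'legitimate' for every row."""
    total = n_legit + n_fraud
    true_negatives = n_legit   # every legitimate row is labelled correctly
    true_positives = 0         # no fraud row is ever flagged
    accuracy = (true_negatives + true_positives) / total
    recall = true_positives / n_fraud if n_fraud else float('nan')
    return {'accuracy': accuracy, 'fraud_recall': recall}


# Hypothetical counts for illustration only
metrics = baseline_metrics(n_legit=9_980, n_fraud=20)
print(f"Accuracy:     {metrics['accuracy']:.3f}")
print(f"Fraud recall: {metrics['fraud_recall']:.3f}")
```

The baseline reaches high accuracy while catching no fraud, which is why precision, recall, and precision-recall curves are usually more informative than accuracy on data like this.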

In [None]:
# Visualize class distribution across splits
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, (df, name) in zip(axes, [(train_df, 'Train'), (val_df, 'Validation'), (test_df, 'Test')]):
    # reindex guarantees both classes are present, even if a split has no fraud rows
    counts = df['Class'].value_counts().reindex([0, 1], fill_value=0)
    colors = ['steelblue', 'crimson']
    bars = ax.bar(['Legitimate (0)', 'Fraud (1)'], counts.values, color=colors)
    ax.set_title(f'{name} Set Class Distribution')
    ax.set_ylabel('Count')
    for bar, count in zip(bars, counts.values):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                f'{count:,}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Amount distribution by class
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, label, title in zip(axes, [0, 1], ['Legitimate Transactions', 'Fraudulent Transactions']):
    subset = train_df[train_df['Class'] == label]['Amount']
    ax.hist(subset, bins=50, color='steelblue' if label == 0 else 'crimson', edgecolor='white')
    ax.set_title(f'{title} - Amount Distribution')
    ax.set_xlabel('Amount')
    ax.set_ylabel('Count')

plt.tight_layout()
plt.show()

## 7. Log Exploration to ExperimentTracker

Record the data exploration session so it is tracked alongside model experiments.

In [None]:
tracker = ExperimentTracker(region_name='us-east-1')

experiment_id = tracker.start_experiment(
    experiment_name='fraud-detection-data-exploration',
    algorithm='data-exploration',
    user='data-scientist',
    dataset_version='v1.0',
    code_version='notebook-01'
)

tracker.log_parameters(experiment_id, {
    'train_rows': int(train_df.shape[0]),
    'val_rows': int(val_df.shape[0]),
    'test_rows': int(test_df.shape[0]),
    # Counts every column, including the Class target
    'num_features': int(train_df.shape[1]),
    'bucket': BUCKET_NAME
})

tracker.log_metrics(experiment_id, {
    # Legitimate-to-fraud ratio from analyze_class_imbalance (inf if no fraud rows)
    'train_fraud_ratio': train_imbalance['ratio'],
    'train_fraud_count': float(train_imbalance['fraud']),
    'train_legit_count': float(train_imbalance['legitimate']),
    'missing_values_total': float(train_df.isnull().sum().sum())
})

tracker.close_experiment(experiment_id)

print(f'Exploration logged as experiment: {experiment_id}')
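
Because the imbalance ratio is `inf` when a split contains no fraud rows, it may be worth filtering non-finite values before logging. How `ExperimentTracker.log_metrics` handles such values is not known here, so the sketch below is a precaution rather than a documented requirement. `sanitize_metrics` is a hypothetical helper.

```python
import math


def sanitize_metrics(metrics: dict) -> tuple:
    """Split metrics into finite values and the names of non-finite ones."""
    clean, dropped = {}, []
    for name, value in metrics.items():
        value = float(value)
        if math.isfinite(value):
            clean[name] = value
        else:
            # inf and NaN often fail to serialize to JSON or numeric stores
            dropped.append(name)
    return clean, dropped


# Hypothetical metric values for illustration only
clean_metrics, dropped_names = sanitize_metrics({
    'train_fraud_ratio': float('inf'),
    'train_fraud_count': 0,
})
print(clean_metrics, dropped_names)
```

The cleaned dict could then be passed to `tracker.log_metrics`, with the dropped names printed so that missing metrics do not go unnoticed.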

## Summary

Key observations from data exploration:

1. **Schema**: The dataset contains Time, Amount, V1-V28 (PCA features), and Class columns.
2. **Class imbalance**: Fraud cases are a small minority; consider oversampling, SMOTE, or class-weight adjustments during training.
3. **Feature correlations**: Several V-features show meaningful correlation with the target, which can guide feature selection.
4. **Amount distribution**: Fraudulent transactions tend to have different amount patterns than legitimate ones.
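
As a starting point for the class-weight option mentioned in the list above, the sketch below computes inverse-frequency weights with the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight='balanced'`. The labels are synthetic and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Synthetic labels for illustration only: 98 legitimate, 2 fraud
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

In the notebook, `balanced_class_weights(train_df['Class'])` would give weights for the real split. The rarer the fraud class, the larger its weight relative to the legitimate class.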

Next steps: proceed to feature engineering (notebook 04) or hyperparameter tuning (notebook 02).