# Credit Card Fraud Detection: Data Exploration Tutorial

## 🎯 Learning Objectives
By the end of this tutorial, you will:
- Understand how to explore a fraud detection dataset
- Learn to identify class imbalance and its implications
- Discover which features are most important for fraud detection
- Create visualizations to understand data patterns
- Handle missing values and outliers

## 📊 What is This File About?
The `data_exploration.py` file is our starting point for understanding the credit card fraud dataset. It performs **Exploratory Data Analysis (EDA)** - the crucial first step in any machine learning project.

**Why is EDA important?**
- Helps us understand our data before building models
- Reveals patterns and anomalies
- Identifies data quality issues
- Guides feature engineering decisions

## 📁 Dataset Overview
- **Source**: European cardholders (September 2013)
- **Size**: 284,807 transactions over 2 days
- **Fraud Rate**: ~0.172% (highly imbalanced!)
- **Features**: 
  - V1-V28: PCA-transformed features (anonymized for privacy)
  - Time: Seconds elapsed from first transaction
  - Amount: Transaction amount
  - Class: Target variable (0=Normal, 1=Fraud)

## 1. Setting Up Our Environment

First, let's import all the necessary libraries for data exploration:

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import warnings

# Configure visualization settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set display options for better viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

print("✅ Libraries imported successfully!")

## 2. Loading the Dataset

Let's load the credit card fraud dataset and examine its structure:

In [None]:
# Load the dataset
file_path = '../creditcard.csv'
df = pd.read_csv(file_path)

# Display basic information
print("🔍 Dataset Overview")
print("=" * 50)
print(f"📊 Shape: {df.shape[0]:,} transactions × {df.shape[1]} features")
print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"📅 Time span: {df['Time'].max() / 3600:.1f} hours")
print("\n📋 Column names:")
print(df.columns.tolist())

In [None]:
# Display first few rows
print("\n📊 First 5 transactions:")
df.head()

## 3. Understanding the Target Variable

The most important aspect of fraud detection is understanding the class distribution:

In [None]:
# Analyze class distribution
print("🎯 Target Variable Analysis (Class)")
print("=" * 50)

# Count and percentage
class_counts = df['Class'].value_counts()
class_percentages = df['Class'].value_counts(normalize=True) * 100

# Display results
print(f"✅ Normal transactions (0): {class_counts[0]:,} ({class_percentages[0]:.3f}%)")
print(f"🚨 Fraudulent transactions (1): {class_counts[1]:,} ({class_percentages[1]:.3f}%)")
print(f"\n⚠️  Imbalance ratio: {class_counts[0]/class_counts[1]:.0f}:1")
print(f"📊 This means for every fraud transaction, there are ~{class_counts[0]/class_counts[1]:.0f} normal ones!")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot
ax1.bar(['Normal', 'Fraud'], class_counts.values, color=['#2ecc71', '#e74c3c'])
ax1.set_title('Transaction Counts by Class', fontsize=14, fontweight='bold')
ax1.set_ylabel('Number of Transactions')
for i, v in enumerate(class_counts.values):
    ax1.text(i, v + 1000, f'{v:,}', ha='center', fontweight='bold')

# Pie chart
ax2.pie(class_counts.values, labels=['Normal', 'Fraud'], autopct='%1.3f%%', 
        colors=['#2ecc71', '#e74c3c'], startangle=90)
ax2.set_title('Class Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

### 💡 Key Insight: Extreme Class Imbalance

The fraud rate is only ~0.172%! This extreme imbalance presents challenges:
- **Accuracy Paradox**: A model that predicts all transactions as normal would achieve about 99.83% accuracy while catching zero fraud!
- **Need for Special Metrics**: We'll need precision, recall, and F1-score instead of just accuracy
- **Sampling Strategies**: We may need to use SMOTE, undersampling, or class weights

## 4. Missing Values and Data Quality

In [None]:
# Check for missing values
print("❓ Missing Values Check")
print("=" * 50)

missing_values = df.isnull().sum()
if missing_values.sum() == 0:
    print("✅ Great news! No missing values found in the dataset.")
else:
    print("⚠️  Missing values found:")
    print(missing_values[missing_values > 0])

# Check data types
print("\n📊 Data Types:")
print(df.dtypes.value_counts())

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\n🔍 Duplicate rows: {duplicates}")

# Statistical summary
print("\n📈 Statistical Summary of Key Features:")
df[['Time', 'Amount', 'V1', 'V2', 'V3', 'Class']].describe()

## 5. Exploring Transaction Amounts

Transaction amounts can be a strong indicator of fraud. Let's analyze the distribution:

In [None]:
# Amount analysis
print("💰 Transaction Amount Analysis")
print("=" * 50)

# Overall statistics
print(f"Range: ${df['Amount'].min():.2f} - ${df['Amount'].max():,.2f}")
print(f"Mean: ${df['Amount'].mean():.2f}")
print(f"Median: ${df['Amount'].median():.2f}")
print(f"Std Dev: ${df['Amount'].std():.2f}")

# Compare amounts by class
print("\n💰 Amount Statistics by Class:")
amount_by_class = df.groupby('Class')['Amount'].agg(['mean', 'median', 'std', 'min', 'max'])
amount_by_class.index = ['Normal', 'Fraud']
print(amount_by_class.round(2))

# Visualize amount distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Overall amount distribution
ax1 = axes[0, 0]
df['Amount'].hist(bins=100, ax=ax1, edgecolor='black', alpha=0.7)
ax1.set_title('Transaction Amount Distribution', fontsize=12, fontweight='bold')
ax1.set_xlabel('Amount ($)')
ax1.set_ylabel('Frequency')
ax1.set_yscale('log')

# 2. Amount distribution by class (box plot)
ax2 = axes[0, 1]
df.boxplot(column='Amount', by='Class', ax=ax2)
ax2.set_title('Amount Distribution by Class', fontsize=12, fontweight='bold')
ax2.set_xlabel('Class (0=Normal, 1=Fraud)')
ax2.set_ylabel('Amount ($)')
ax2.set_yscale('log')
plt.suptitle('')  # Remove default title

# 3. Amount distribution (log scale) for better visibility
ax3 = axes[1, 0]
# Add 1 before taking the log so that zero amounts map to log10(1) = 0
normal_amount_log = np.log10(df[df['Class']==0]['Amount'] + 1)
fraud_amount_log = np.log10(df[df['Class']==1]['Amount'] + 1)

ax3.hist([normal_amount_log, fraud_amount_log], bins=50, label=['Normal', 'Fraud'], 
         color=['#2ecc71', '#e74c3c'], alpha=0.7)
ax3.set_title('Log10(Amount+1) Distribution by Class', fontsize=12, fontweight='bold')
ax3.set_xlabel('Log10(Amount + 1)')
ax3.set_ylabel('Frequency')
ax3.legend()

# 4. Kernel Density Estimation
ax4 = axes[1, 1]
df[df['Class']==0]['Amount'].plot.density(ax=ax4, label='Normal', color='#2ecc71')
df[df['Class']==1]['Amount'].plot.density(ax=ax4, label='Fraud', color='#e74c3c')
ax4.set_title('Amount Density Distribution', fontsize=12, fontweight='bold')
ax4.set_xlabel('Amount ($)')
ax4.set_xlim(0, 500)  # Focus on smaller amounts for clarity
ax4.legend()

plt.tight_layout()
plt.show()

### 💡 Key Insights from Amount Analysis:
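- Amounts are heavily right-skewed: most transactions are small and a few are very large, which is why the plots above need log scales
- Compare the mean and median per class in the table above; for a skewed variable the median is the more robust summary
- The class-wise histograms and density curves show where fraud and normal amounts differ; whether `Amount` separates the classes on its own is something to verify, not assume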
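
To see why the plots use `log10(Amount + 1)`, here is a small sketch on synthetic, right-skewed amounts (generated values, not the real dataset). It shows that a plain log breaks on zero amounts, while the shifted log stays finite and pulls the mean back toward the median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "amounts" plus a few zero-value transactions
amounts = np.concatenate([
    rng.lognormal(mean=3.0, sigma=1.5, size=1000),
    np.zeros(5),
])

# Plain log10 is undefined at zero (NumPy returns -inf)
with np.errstate(divide="ignore"):
    log_plain = np.log10(amounts)

# Shifting by 1 maps zero amounts to exactly 0
log_shifted = np.log10(amounts + 1)

# Mean/median ratio as a rough skew indicator (1.0 means symmetric)
ratio_raw = amounts.mean() / np.median(amounts)
ratio_log = log_shifted.mean() / np.median(log_shifted)

print(f"-inf values with plain log: {np.isinf(log_plain).sum()}")
print(f"mean/median, raw amounts:   {ratio_raw:.2f}")
print(f"mean/median, log10(x + 1):  {ratio_log:.2f}")
```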

## 6. Time Pattern Analysis

Understanding when fraud occurs can help with real-time detection:

In [None]:
# Time analysis
print("⏰ Time Pattern Analysis")
print("=" * 50)

# Convert time from seconds to hours
df['Hour'] = df['Time'] / 3600

print(f"Time range: {df['Time'].min():.0f} - {df['Time'].max():.0f} seconds")
print(f"Duration: {df['Hour'].max():.1f} hours ({df['Hour'].max()/24:.1f} days)")

# Create time-based visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Transaction volume over time
ax1 = axes[0, 0]
df['Hour'].hist(bins=48, ax=ax1, edgecolor='black', alpha=0.7)
ax1.set_title('Transaction Volume Over Time', fontsize=12, fontweight='bold')
ax1.set_xlabel('Hours from Start')
ax1.set_ylabel('Number of Transactions')

# 2. Fraud rate over time
ax2 = axes[0, 1]
# Create hourly bins
hourly_bins = pd.cut(df['Hour'], bins=48)
fraud_rate_by_hour = df.groupby(hourly_bins)['Class'].agg(['mean', 'count'])
fraud_rate_by_hour['mean'].plot(ax=ax2, color='red', marker='o', markersize=4)
ax2.set_title('Fraud Rate Over Time', fontsize=12, fontweight='bold')
ax2.set_xlabel('Hours from Start')
ax2.set_ylabel('Fraud Rate')
ax2.grid(True, alpha=0.3)

# 3. Transaction patterns by class
ax3 = axes[1, 0]
normal_time = df[df['Class']==0]['Hour']
fraud_time = df[df['Class']==1]['Hour']
ax3.hist([normal_time, fraud_time], bins=48, label=['Normal', 'Fraud'], 
         color=['#2ecc71', '#e74c3c'], alpha=0.7, density=True)
ax3.set_title('Time Distribution by Class (Normalized)', fontsize=12, fontweight='bold')
ax3.set_xlabel('Hours from Start')
ax3.set_ylabel('Density')
ax3.legend()

# 4. Scatter plot: Time vs Amount
ax4 = axes[1, 1]
# Sample normal transactions for visibility
normal_sample = df[df['Class']==0].sample(1000, random_state=42)
fraud_all = df[df['Class']==1]

ax4.scatter(normal_sample['Hour'], normal_sample['Amount'], 
           alpha=0.5, s=10, label='Normal (sample)', color='#2ecc71')
ax4.scatter(fraud_all['Hour'], fraud_all['Amount'], 
           alpha=0.7, s=20, label='Fraud (all)', color='#e74c3c')
ax4.set_title('Time vs Amount Pattern', fontsize=12, fontweight='bold')
ax4.set_xlabel('Hours from Start')
ax4.set_ylabel('Amount ($)')
ax4.set_yscale('log')
ax4.legend()

plt.tight_layout()
plt.show()

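# Calculate some interesting time-based statistics
# Note: Time counts seconds since the first transaction, so Hour % 24 only matches
# clock time if recording started at midnight; treat the period labels as approximate.
print("\n📊 Fraud Distribution by Time Period:")
# right=False gives bins [0, 6), [6, 12), ...; the default right-closed bins
# would leave transactions with Hour % 24 == 0 unlabeled (NaN)
df['TimeOfDay'] = pd.cut(df['Hour'] % 24, bins=[0, 6, 12, 18, 24], right=False,
                         labels=['Night', 'Morning', 'Afternoon', 'Evening'])
time_fraud_stats = pd.crosstab(df['TimeOfDay'], df['Class'], normalize='index') * 100
time_fraud_stats.columns = ['Normal %', 'Fraud %']
# Format explicitly: the global two-decimal float format would hide the third decimal
print(time_fraud_stats.to_string(float_format='{:.3f}'.format))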
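
A detail worth knowing when binning hours with `pd.cut`: intervals are closed on the right by default, so a value equal to the lowest edge falls outside every bin. The toy series below (hypothetical values) compares the default with `right=False`.

```python
import pandas as pd

hours = pd.Series([0.0, 3.0, 6.0, 23.9])
bins = [0, 6, 12, 18, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening']

# Default: intervals (0, 6], (6, 12], ... so 0.0 is not covered
right_closed = pd.cut(hours, bins=bins, labels=labels)

# right=False: intervals [0, 6), [6, 12), ... so 0.0 is 'Night' and 6.0 is 'Morning'
left_closed = pd.cut(hours, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'hour': hours,
                    'default': right_closed,
                    'right=False': left_closed}))
```

Since `Hour % 24` is always below 24, left-closed bins cover every transaction.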

## 7. PCA Feature Analysis

The V1-V28 features are PCA-transformed. Let's analyze which ones are most important for fraud detection:

In [None]:
# Analyze PCA features
print("🔬 PCA Feature Analysis")
print("=" * 50)

# Get PCA feature columns
pca_features = [col for col in df.columns if col.startswith('V')]
print(f"Number of PCA features: {len(pca_features)}")

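# Calculate effect sizes (Cohen's d) for each feature
def calculate_effect_size(feature):
    """Cohen's d-style effect size for a feature.

    Divides the absolute mean difference by the square root of the unweighted
    average of the two class variances, so the tiny fraud class counts as much
    as the normal class in the denominator.
    """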
    normal_data = df[df['Class'] == 0][feature]
    fraud_data = df[df['Class'] == 1][feature]
    
    mean_diff = abs(normal_data.mean() - fraud_data.mean())
    pooled_std = np.sqrt((normal_data.std()**2 + fraud_data.std()**2) / 2)
    
    return mean_diff / pooled_std if pooled_std > 0 else 0

# Calculate effect sizes for all PCA features
effect_sizes = []
for feature in pca_features:
    effect_size = calculate_effect_size(feature)
    effect_sizes.append({
        'Feature': feature,
        'Effect_Size': effect_size,
        'Normal_Mean': df[df['Class'] == 0][feature].mean(),
        'Fraud_Mean': df[df['Class'] == 1][feature].mean()
    })

# Create DataFrame and sort by effect size
effect_df = pd.DataFrame(effect_sizes).sort_values('Effect_Size', ascending=False)

print("\n🎯 Top 10 Most Discriminative Features:")
print("(Effect Size: Cohen's d - larger values indicate better separation)")
print("-" * 60)
print(effect_df.head(10).to_string(index=False))
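
The effect size above averages the two class variances without weighting by group size, whereas the textbook pooled standard deviation weights each variance by `n - 1`. The sketch below compares both on toy data (hypothetical values; the helper names are illustrative, not part of the tutorial code). With one large, tight group and one small, spread-out group, the weighted version is dominated by the large group.

```python
import numpy as np

def d_average_variance(a, b):
    """|mean difference| / sqrt of the unweighted average of the two variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / sd

def d_pooled(a, b):
    """|mean difference| / pooled SD weighted by (n - 1) per group."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return abs(a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy groups: 1000 tightly clustered values vs. 4 spread-out values
big = np.tile([-1.0, 1.0], 500)           # mean 0
small = np.array([1.0, 3.0, 5.0, 7.0])    # mean 4

d_avg = d_average_variance(big, small)
d_pool = d_pooled(big, small)
print(f"average-variance d: {d_avg:.3f}")
print(f"pooled (weighted) d: {d_pool:.3f}")
```

Neither choice is wrong, but feature rankings can differ between them, so it helps to state which one is used.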

In [None]:
# Visualize top discriminative features
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
axes = axes.ravel()

top_features = effect_df.head(10)['Feature'].tolist()

for i, feature in enumerate(top_features):
    ax = axes[i]
    
    # Get data for each class
    normal_data = df[df['Class'] == 0][feature]
    fraud_data = df[df['Class'] == 1][feature]
    
    # Create distributions
    ax.hist(normal_data, bins=50, alpha=0.6, density=True, label='Normal', color='#2ecc71')
    ax.hist(fraud_data, bins=50, alpha=0.6, density=True, label='Fraud', color='#e74c3c')
    
    # Add title with effect size
    effect_size = effect_df[effect_df['Feature'] == feature]['Effect_Size'].iloc[0]
    ax.set_title(f'{feature}\n(d = {effect_size:.3f})', fontsize=10, fontweight='bold')
    ax.set_xlabel('Value')
    ax.set_ylabel('Density')
    ax.legend(fontsize=8)
    
plt.suptitle('Top 10 Most Discriminative PCA Features', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## 8. Feature Correlations

Understanding feature relationships can help with feature selection and engineering:

In [None]:
# Feature correlation analysis
print("🔗 Feature Correlation Analysis")
print("=" * 50)

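# Calculate correlations with the target using only the original columns.
# Helper columns added earlier are excluded: Hour duplicates Time, and the
# categorical TimeOfDay can make df.corr() raise in pandas >= 2.0.
corr_columns = pca_features + ['Time', 'Amount', 'Class']
target_corr = df[corr_columns].corr()['Class'].sort_values(ascending=False)
feature_corr = target_corr.drop('Class')  # Class trivially correlates 1.0 with itself
print("Features most positively correlated with fraud (Class):")
print(feature_corr.head(10).to_string())
print("\nFeatures most negatively correlated with fraud:")
print(feature_corr.tail(10).to_string())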

# Create correlation heatmaps
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# 1. Top correlated features heatmap
top_corr_features = target_corr.abs().sort_values(ascending=False).head(15).index.tolist()
corr_matrix_top = df[top_corr_features].corr()

sns.heatmap(corr_matrix_top, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, ax=ax1, cbar_kws={'shrink': 0.8})
ax1.set_title('Correlation Matrix: Top 15 Features', fontsize=14, fontweight='bold')

# 2. Feature correlation with Class (bar plot)
# Rank by absolute correlation so strong negative correlations appear too,
# and drop Class itself, which trivially correlates 1.0 with the target
corr_no_target = target_corr.drop('Class', errors='ignore')
top20_corr = corr_no_target.loc[corr_no_target.abs().sort_values(ascending=False).head(20).index]

# Color bars based on positive/negative correlation
colors = ['#e74c3c' if x > 0 else '#3498db' for x in top20_corr.values]
ax2.barh(range(len(top20_corr)), top20_corr.values, color=colors)
ax2.set_yticks(range(len(top20_corr)))
ax2.set_yticklabels(top20_corr.index)
ax2.set_xlabel('Correlation with Fraud')
ax2.set_title('Top 20 Features by |Correlation| with Fraud', fontsize=14, fontweight='bold')
ax2.axvline(x=0, color='black', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()

# Check multicollinearity among PCA features
print("\n🔍 Multicollinearity Check:")
print("PCA features should have low correlation with each other...")
pca_corr = df[pca_features].corr()
high_corr_pairs = []
for i in range(len(pca_features)):
    for j in range(i+1, len(pca_features)):
        if abs(pca_corr.iloc[i, j]) > 0.8:
            high_corr_pairs.append((pca_features[i], pca_features[j], pca_corr.iloc[i, j]))

if high_corr_pairs:
    print("⚠️  High correlation pairs found:")
    for pair in high_corr_pairs:
        print(f"{pair[0]} - {pair[1]}: {pair[2]:.3f}")
else:
    print("✅ Good news! No high correlations (>0.8) found among PCA features.")
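
Why should PCA features be nearly uncorrelated? Principal component scores are uncorrelated by construction on the data the PCA was fit to. The sketch below demonstrates this with NumPy's SVD on synthetic correlated features (hypothetical data; the dataset's real PCA pipeline is not public, so this only illustrates the property).

```python
import numpy as np

rng = np.random.default_rng(42)

# Four synthetic features that all share one latent factor, so they correlate strongly
latent = rng.normal(size=(500, 1))
X = np.hstack([latent + 0.3 * rng.normal(size=(500, 1)) for _ in range(4)])

def max_offdiag_corr(data):
    """Largest absolute correlation between two different columns."""
    corr = np.corrcoef(data, rowvar=False)
    return float(np.abs(corr - np.diag(np.diag(corr))).max())

# PCA via SVD of the centered data: rows of Vt are the principal axes
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T   # principal component scores

print(f"max |corr| between raw features:  {max_offdiag_corr(X):.3f}")
print(f"max |corr| between PCA scores:    {max_offdiag_corr(scores):.2e}")
```

Note that this holds over the full fitted dataset; within a subset such as the fraud class alone, the components can still be correlated.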

## 9. Outlier Detection

Fraudulent transactions often appear as outliers. Let's identify them:

In [None]:
# Outlier analysis
print("🔍 Outlier Detection Analysis")
print("=" * 50)

# Calculate IQR for Amount
Q1 = df['Amount'].quantile(0.25)
Q3 = df['Amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find outliers
amount_outliers = df[(df['Amount'] < lower_bound) | (df['Amount'] > upper_bound)]
print(f"Amount outliers: {len(amount_outliers)} ({len(amount_outliers)/len(df)*100:.2f}%)")
print(f"Fraud rate in amount outliers: {amount_outliers['Class'].mean()*100:.3f}%")
print(f"Overall fraud rate: {df['Class'].mean()*100:.3f}%")

# Use Isolation Forest for multivariate outlier detection
from sklearn.ensemble import IsolationForest

# Prepare features for outlier detection
outlier_features = pca_features + ['Amount']
X_outlier = df[outlier_features].values

# Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.01, random_state=42, n_estimators=100)
outlier_predictions = iso_forest.fit_predict(X_outlier)

# Analyze results
df['Outlier'] = outlier_predictions
outlier_fraud_rate = df[df['Outlier'] == -1]['Class'].mean()
normal_fraud_rate = df[df['Outlier'] == 1]['Class'].mean()

print(f"\n🌲 Isolation Forest Results:")
print(f"Outliers detected: {(outlier_predictions == -1).sum()} ({(outlier_predictions == -1).sum()/len(df)*100:.2f}%)")
print(f"Fraud rate in outliers: {outlier_fraud_rate*100:.2f}%")
print(f"Fraud rate in inliers: {normal_fraud_rate*100:.3f}%")
print(f"Lift (outlier vs. inlier fraud rate): {outlier_fraud_rate/normal_fraud_rate:.1f}x")

# Visualize outliers
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# 1. Amount outliers visualization
ax1.scatter(df[df['Outlier'] == 1]['Hour'], df[df['Outlier'] == 1]['Amount'], 
           alpha=0.5, s=1, label='Normal', color='#3498db')
ax1.scatter(df[df['Outlier'] == -1]['Hour'], df[df['Outlier'] == -1]['Amount'], 
           alpha=0.8, s=10, label='Outlier', color='#e74c3c')
ax1.set_xlabel('Time (hours)')
ax1.set_ylabel('Amount ($)')
ax1.set_yscale('log')
ax1.set_title('Outliers in Time-Amount Space', fontsize=12, fontweight='bold')
ax1.legend()

# 2. Outlier vs Fraud comparison
outlier_fraud_crosstab = pd.crosstab(df['Outlier'], df['Class'], normalize='index') * 100
outlier_fraud_crosstab.plot(kind='bar', ax=ax2, color=['#2ecc71', '#e74c3c'])
ax2.set_xlabel('Outlier Status (-1: Outlier, 1: Normal)')
ax2.set_ylabel('Percentage')
ax2.set_title('Fraud Rate by Outlier Status', fontsize=12, fontweight='bold')
ax2.legend(['Normal Transaction', 'Fraud'])
ax2.set_xticklabels(['Outlier', 'Normal'], rotation=0)

plt.tight_layout()
plt.show()
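
The IQR rule applied to `Amount` above can be checked by hand on a tiny array. This sketch uses toy values (not the dataset) and NumPy's default linear-interpolation percentiles, which match the default of pandas `quantile`.

```python
import numpy as np

amounts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)  # toy values

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {outliers}")
```

For a non-negative, right-skewed variable like a transaction amount, the lower fence is often negative, so in practice only the upper fence flags anything.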

## 10. Summary and Key Insights

Let's summarize our findings from the data exploration:

In [None]:
# Summary of findings
print("📊 DATA EXPLORATION SUMMARY")
print("=" * 60)

print("\n1️⃣ Dataset Characteristics:")
print(f"   • Total transactions: {len(df):,}")
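# df now also holds helper columns (Hour, TimeOfDay, Outlier), so count the original ones
print(f"   • Columns: {len(pca_features) + 3} ({len(pca_features)} PCA + Time + Amount + Class)")
print(f"   • Time span: {df['Hour'].max():.1f} hours")
print(f"   • Missing values: {missing_values.sum()}")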

print("\n2️⃣ Class Imbalance:")
print(f"   • Normal transactions: {class_counts[0]:,} ({class_percentages[0]:.3f}%)")
print(f"   • Fraudulent transactions: {class_counts[1]:,} ({class_percentages[1]:.3f}%)")
print(f"   • Imbalance ratio: {class_counts[0]/class_counts[1]:.0f}:1")

print("\n3️⃣ Transaction Amounts:")
print(f"   • Range: ${df['Amount'].min():.2f} - ${df['Amount'].max():,.2f}")
print(f"   • Mean (Normal): ${df[df['Class']==0]['Amount'].mean():.2f}")
print(f"   • Mean (Fraud): ${df[df['Class']==1]['Amount'].mean():.2f}")
print(f"   • Fraud transactions tend to have different amount patterns")

print("\n4️⃣ Most Discriminative Features:")
top_5_features = effect_df.head(5)
for idx, row in top_5_features.iterrows():
    print(f"   • {row['Feature']}: Effect size = {row['Effect_Size']:.3f}")

print("\n5️⃣ Outlier Analysis:")
print(f"   • Outliers detected: {(df['Outlier'] == -1).sum()} ({(df['Outlier'] == -1).sum()/len(df)*100:.2f}%)")
print(f"   • Fraud rate in outliers: {outlier_fraud_rate*100:.2f}%")
print(f"   • Fraud rate in outliers is {outlier_fraud_rate/df['Class'].mean():.1f}x the overall fraud rate")

print("\n6️⃣ Recommendations for Modeling:")
print("   • Use stratified sampling for train/test split")
print("   • Consider SMOTE or class weights for imbalance")
print("   • Focus on Precision-Recall instead of accuracy")
print("   • Use ensemble methods to combine different approaches")
print("   • Consider time-based validation splits")

# Create a final comprehensive visualization
fig = plt.figure(figsize=(20, 12))

# 1. Class distribution
ax1 = plt.subplot(3, 3, 1)
df['Class'].value_counts().plot(kind='pie', ax=ax1, autopct='%1.3f%%', 
                                colors=['#2ecc71', '#e74c3c'])
ax1.set_title('Class Distribution', fontweight='bold')
ax1.set_ylabel('')

# 2. Top 5 discriminative features
ax2 = plt.subplot(3, 3, 2)
top_5_features.plot(x='Feature', y='Effect_Size', kind='bar', ax=ax2, color='#3498db')
ax2.set_title('Top 5 Discriminative Features', fontweight='bold')
ax2.set_xlabel('Feature')
ax2.set_ylabel('Effect Size (Cohen\'s d)')

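# 3. Amount distribution comparison
ax3 = plt.subplot(3, 3, 3)
# Shared bins limited to $0-$500: with default bins spanning the full amount
# range, a single bin would cover almost the whole visible window.
# Densities are therefore normalized over transactions within this range.
amount_bins = np.linspace(0, 500, 51)
df[df['Class']==0]['Amount'].plot.hist(bins=amount_bins, alpha=0.7, label='Normal', 
                                       ax=ax3, color='#2ecc71', density=True)
df[df['Class']==1]['Amount'].plot.hist(bins=amount_bins, alpha=0.7, label='Fraud', 
                                       ax=ax3, color='#e74c3c', density=True)
ax3.set_xlabel('Amount ($)')
ax3.set_ylabel('Density')
ax3.set_title('Amount Distribution by Class', fontweight='bold')
ax3.legend()
ax3.set_xlim(0, 500)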

# 4. Time patterns
ax4 = plt.subplot(3, 3, 4)
hourly_fraud = df.groupby(df['Hour'].astype(int))['Class'].agg(['sum', 'count'])
hourly_fraud['rate'] = (hourly_fraud['sum'] / hourly_fraud['count']) * 100
hourly_fraud['rate'].plot(ax=ax4, color='red', marker='o')
ax4.set_xlabel('Hour')
ax4.set_ylabel('Fraud Rate (%)')
ax4.set_title('Fraud Rate Over Time', fontweight='bold')
ax4.grid(True, alpha=0.3)

# 5. Feature correlation with fraud
ax5 = plt.subplot(3, 3, 5)
top_corr = target_corr.drop('Class').abs().sort_values(ascending=False).head(10)
top_corr.plot(kind='barh', ax=ax5, color='#9b59b6')
ax5.set_xlabel('|Correlation| with Fraud')
ax5.set_title('Top 10 Feature Correlations', fontweight='bold')

# 6. Outlier analysis
ax6 = plt.subplot(3, 3, 6)
outlier_stats = pd.DataFrame({
    'Normal': [normal_fraud_rate*100, (1-normal_fraud_rate)*100],
    'Outlier': [outlier_fraud_rate*100, (1-outlier_fraud_rate)*100]
}, index=['Fraud', 'Normal'])
outlier_stats.T.plot(kind='bar', stacked=True, ax=ax6, color=['#e74c3c', '#2ecc71'])
ax6.set_xlabel('Point Type')
ax6.set_ylabel('Percentage')
ax6.set_title('Fraud Rate: Normal vs Outliers', fontweight='bold')
ax6.set_xticklabels(['Normal Points', 'Outliers'], rotation=0)

# 7. PCA feature 1 vs 2
ax7 = plt.subplot(3, 3, 7)
sample_normal = df[df['Class']==0].sample(1000, random_state=42)  # fixed seed for reproducibility
ax7.scatter(sample_normal['V1'], sample_normal['V2'], alpha=0.5, s=5, 
           label='Normal', color='#2ecc71')
ax7.scatter(df[df['Class']==1]['V1'], df[df['Class']==1]['V2'], alpha=0.7, s=10, 
           label='Fraud', color='#e74c3c')
ax7.set_xlabel('V1')
ax7.set_ylabel('V2')
ax7.set_title('V1 vs V2 Feature Space', fontweight='bold')
ax7.legend()

# 8. Transaction amount percentiles
ax8 = plt.subplot(3, 3, 8)
percentiles = [10, 25, 50, 75, 90, 95, 99]
normal_percentiles = [df[df['Class']==0]['Amount'].quantile(p/100) for p in percentiles]
fraud_percentiles = [df[df['Class']==1]['Amount'].quantile(p/100) for p in percentiles]
x = np.arange(len(percentiles))
width = 0.35
ax8.bar(x - width/2, normal_percentiles, width, label='Normal', color='#2ecc71')
ax8.bar(x + width/2, fraud_percentiles, width, label='Fraud', color='#e74c3c')
ax8.set_xlabel('Percentile')
ax8.set_ylabel('Amount ($)')
ax8.set_title('Amount Percentiles by Class', fontweight='bold')
ax8.set_xticks(x)
ax8.set_xticklabels(percentiles)
ax8.legend()
ax8.set_yscale('log')

# 9. Class counts under each balancing strategy
# (undersampling shrinks Normal to the fraud count, SMOTE grows Fraud to the
# normal count, class weights leave the counts unchanged)
ax9 = plt.subplot(3, 3, 9)
strategies = ['Original', 'Undersample', 'SMOTE', 'Class Weights']
fraud_counts_viz = [class_counts[1], class_counts[1], class_counts[0], class_counts[1]]
normal_counts_viz = [class_counts[0], class_counts[1], class_counts[0], class_counts[0]]
x = np.arange(len(strategies))
width = 0.35
# Side-by-side bars: stacked segments are hard to read on a log axis
ax9.bar(x - width/2, normal_counts_viz, width, label='Normal', color='#2ecc71', alpha=0.7)
ax9.bar(x + width/2, fraud_counts_viz, width, label='Fraud', color='#e74c3c', alpha=0.7)
ax9.set_xlabel('Strategy')
ax9.set_ylabel('Sample Count')
ax9.set_title('Class Balancing Strategies', fontweight='bold')
ax9.set_xticks(x)
ax9.set_xticklabels(strategies, rotation=45)
ax9.legend()
ax9.set_yscale('log')

plt.suptitle('Credit Card Fraud Detection - Data Exploration Summary', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
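
The first modeling recommendation above, a stratified train/test split, keeps the fraud rate identical in both parts. The usual tool is scikit-learn's `train_test_split(..., stratify=y)`; the pandas-only sketch below shows the same idea on a toy frame (hypothetical data with a 2% positive class).

```python
import pandas as pd

# Toy frame: 1,000 rows, 20 of them positive (2%)
toy = pd.DataFrame({
    'Amount': range(1000),
    'Class': [1] * 20 + [0] * 980,
})

# Draw 20% of each class separately for the test set; the rest is the training set
test = toy.groupby('Class').sample(frac=0.2, random_state=42)
train = toy.drop(test.index)

print(f"Overall positive rate: {toy['Class'].mean():.3%}")
print(f"Train positive rate:   {train['Class'].mean():.3%}")
print(f"Test positive rate:    {test['Class'].mean():.3%}")
```

A plain random split of data this imbalanced can leave the test set with very few fraud cases, which makes recall estimates noisy.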

## 11. Next Steps

Now that we've thoroughly explored the data, we're ready to build fraud detection models! Here's what comes next:

### 🚀 In the Next Tutorial (fraud_detection_models.ipynb):
1. **Model Building**: Implement various ML algorithms (Logistic Regression, Random Forest, XGBoost, etc.)
2. **Handling Imbalance**: Apply SMOTE, undersampling, and class weights
3. **Model Evaluation**: Use appropriate metrics (Precision, Recall, F1, AUC-ROC)
4. **Model Comparison**: Compare performance across different algorithms
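
As a preview of the class-weight option, the sketch below computes "balanced" weights with the formula scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * count_per_class)`. The class counts are assumptions consistent with the dataset overview; substitute `class_counts` from this notebook.

```python
import numpy as np

# Assumed class counts (normal, fraud); replace with class_counts.values
counts = np.array([284_315, 492])

n_samples = counts.sum()
n_classes = len(counts)

# Each class receives the same total weight: weight * count is constant
weights = n_samples / (n_classes * counts)

for label, w in zip(['Normal (0)', 'Fraud (1)'], weights):
    print(f"{label}: weight = {w:.3f}")
```

Each fraud example ends up weighted a few hundred times more heavily than a normal one, which is how the weighted loss compensates for the imbalance without resampling.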

### 📚 Additional Resources:
- **Feature Engineering**: Create new features based on our exploration insights
- **Advanced Techniques**: Deep learning, ensemble methods, and graph neural networks
- **Production Deployment**: API development and real-time scoring

### 💡 Key Takeaways from Data Exploration:
1. **Extreme class imbalance** requires special handling techniques
2. **PCA features** (especially V14, V4, V12, V10) are highly discriminative
3. **Transaction amounts** show different patterns for fraud vs normal
4. **Outliers** are much more likely to be fraudulent
5. **Time patterns** exist but are subtle

## 🎉 Congratulations!

You've completed the data exploration tutorial! You now understand:
- The characteristics of credit card fraud data
- How to identify and visualize class imbalance
- Which features are most important for fraud detection
- How to prepare for the modeling phase

### 📝 Practice Exercise:
Try exploring the data with different visualization techniques or statistical tests. Can you find any other interesting patterns?

### 🔗 Continue to the next tutorial:
Open `fraud_detection_models.ipynb` to start building your first fraud detection models!