```xml
<VSCode.Cell id="eda_001" language="markdown">
# 01 - Exploratory Data Analysis (EDA)

This notebook performs comprehensive exploratory analysis of loan application data to understand fraud patterns, feature distributions, and relationships.

## Objectives
1. Load and validate the dataset
2. Analyze fraud distribution and class imbalance
3. Explore feature distributions for legitimate vs fraudulent applications
4. Identify missing values and data quality issues
5. Compute correlation matrices and examine feature relationships
6. Generate visualizations for business stakeholders
</VSCode.Cell>

<VSCode.Cell id="eda_002" language="python">
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)
</VSCode.Cell>

<VSCode.Cell id="eda_003" language="python">
# Load the dataset
# Replace 'data/raw/loan_applications.csv' with your actual data path
DATA_PATH = 'data/raw/loan_applications.csv'

try:
    df = pd.read_csv(DATA_PATH)
    print("✓ Dataset loaded successfully")
    print(f"Shape: {df.shape}")
except FileNotFoundError:
    print(f"⚠ Data file not found at {DATA_PATH}")
    print("Please ensure the CSV file exists in the data/raw/ directory")
    print("\nExpected columns:")
    print("income, loan_amount, credit_score, employment_years, age,")
    print("education_level, marital_status, fraud_label")
    raise  # Stop here: every later cell depends on df
</VSCode.Cell>

<VSCode.Cell id="eda_004" language="python">
# Display basic information about the dataset
print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"\nDataset Shape: {df.shape[0]} rows × {df.shape[1]} columns (including fraud_label)")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nFirst few rows:\n{df.head()}")
</VSCode.Cell>

<VSCode.Cell id="eda_005" language="python">
# Check for missing values
print("\n" + "="*60)
print("MISSING VALUES ANALYSIS")
print("="*60)
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum().values,
    'Missing_Percentage': (df.isnull().sum().values / len(df) * 100).round(2)
})
print(missing_data)

# Visualize missing values
fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis', ax=ax)
plt.title('Missing Values Heatmap')
plt.tight_layout()
plt.show()
</VSCode.Cell>

<VSCode.Cell id="eda_006" language="python">
# Fraud Distribution Analysis
print("\n" + "="*60)
print("FRAUD DISTRIBUTION ANALYSIS")
print("="*60)

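# value_counts() sorts by frequency, not by class label, so reindex to a
# fixed [0, 1] order to keep counts aligned with the labels used below
fraud_counts = df['fraud_label'].value_counts().reindex([0, 1], fill_value=0)
fraud_percentages = fraud_counts / fraud_counts.sum() * 100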

summary_data = pd.DataFrame({
    'Class': ['Legitimate (0)', 'Fraudulent (1)'],
    'Count': [fraud_counts[0], fraud_counts[1]],
    'Percentage': [fraud_percentages[0], fraud_percentages[1]]
})
print("\n" + summary_data.to_string(index=False))

# Visualize fraud distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
fraud_labels = ['Legitimate', 'Fraudulent']
colors = ['#2ecc71', '#e74c3c']
axes[0].bar(fraud_labels, fraud_counts, color=colors, alpha=0.7, edgecolor='black')
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Fraud Class Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Pie chart
axes[1].pie(fraud_counts, labels=fraud_labels, autopct='%1.1f%%', 
            colors=colors, startangle=90, textprops={'fontsize': 12})
axes[1].set_title('Fraud Class Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\n⚠ Class Imbalance Ratio: {fraud_counts[0]/fraud_counts[1]:.2f}:1 (Legitimate:Fraudulent)")
print("Note: A heavily skewed ratio calls for special handling in model training (SMOTE, class weights, etc.)")
</VSCode.Cell>

<VSCode.Cell id="eda_007" language="python">
# Statistical Summary for Numerical Features
print("\n" + "="*60)
print("NUMERICAL FEATURES STATISTICS")
print("="*60)

numerical_features = ['income', 'loan_amount', 'credit_score', 'employment_years', 'age']
print("\n" + df[numerical_features].describe().round(2).to_string())
</VSCode.Cell>

<VSCode.Cell id="eda_008" language="python">
# Compare numerical features between legitimate and fraudulent applications
print("\n" + "="*60)
print("COMPARISON: LEGITIMATE vs FRAUDULENT")
print("="*60)

legitimate = df[df['fraud_label'] == 0]
fraudulent = df[df['fraud_label'] == 1]

comparison_stats = pd.DataFrame({
    'Feature': numerical_features,
    'Legit_Mean': [legitimate[f].mean() for f in numerical_features],
    'Fraud_Mean': [fraudulent[f].mean() for f in numerical_features],
    'Legit_Std': [legitimate[f].std() for f in numerical_features],
    'Fraud_Std': [fraudulent[f].std() for f in numerical_features]
})

print("\n" + comparison_stats.round(2).to_string(index=False))
</VSCode.Cell>

<VSCode.Cell id="eda_009" language="python">
# Distribution comparison: Legitimate vs Fraudulent
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

colors = ['#2ecc71', '#e74c3c']  # Green for legitimate, red for fraud

for idx, feature in enumerate(numerical_features):
    ax = axes[idx]
    
    # Overlay normalized histograms (density=True scales each to unit area)
    # so the minority class stays visible despite the class imbalance
    ax.hist(legitimate[feature], bins=30, alpha=0.6, density=True, label='Legitimate', color=colors[0])
    ax.hist(fraudulent[feature], bins=30, alpha=0.6, density=True, label='Fraudulent', color=colors[1])
    
    ax.set_xlabel(feature.capitalize(), fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(f'{feature.capitalize()} Distribution', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)

# Hide the extra subplot
axes[-1].axis('off')

plt.tight_layout()
plt.show()
</VSCode.Cell>

<VSCode.Cell id="eda_010" language="python">
# Box plots for outlier detection and distribution comparison
fig, axes = plt.subplots(1, 5, figsize=(18, 5))

for idx, feature in enumerate(numerical_features):
    ax = axes[idx]
    
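    # Drop NaNs first: a series with missing values produces an empty box
    data_to_plot = [legitimate[feature].dropna(), fraudulent[feature].dropna()]
    bp = ax.boxplot(data_to_plot, patch_artist=True)
    # Set tick labels explicitly; the boxplot `labels` keyword is deprecated in recent Matplotlib
    ax.set_xticklabels(['Legitimate', 'Fraudulent'])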
    
    # Color the boxes
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.6)
    
    ax.set_ylabel('Value', fontsize=11)
    ax.set_title(f'{feature.capitalize()}', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)

plt.suptitle('Box Plots: Feature Distributions by Class', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
</VSCode.Cell>

<VSCode.Cell id="eda_011" language="python">
# Categorical Features Analysis
print("\n" + "="*60)
print("CATEGORICAL FEATURES ANALYSIS")
print("="*60)

categorical_features = ['education_level', 'marital_status']

for feature in categorical_features:
    print(f"\n{feature.upper()}:")
    print(df[feature].value_counts())
    print(f"\nFraud Rate by {feature}:")
    fraud_by_category = df.groupby(feature)['fraud_label'].agg(['sum', 'count'])
    fraud_by_category['fraud_rate'] = (fraud_by_category['sum'] / fraud_by_category['count'] * 100).round(2)
    print(fraud_by_category)
</VSCode.Cell>

<VSCode.Cell id="eda_012" language="python">
# Categorical features distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, feature in enumerate(categorical_features):
    ax = axes[idx]
    
    # Create fraud rate by category
    fraud_rate = df.groupby(feature)['fraud_label'].mean() * 100
    fraud_rate = fraud_rate.sort_values(ascending=False)
    
    bars = ax.bar(range(len(fraud_rate)), fraud_rate.values, color=colors[1], alpha=0.7, edgecolor='black')
    ax.set_xticks(range(len(fraud_rate)))
    ax.set_xticklabels(fraud_rate.index, rotation=45, ha='right')
    ax.set_ylabel('Fraud Rate (%)', fontsize=11)
    ax.set_title(f'Fraud Rate by {feature.capitalize()}', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.1f}%', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()
</VSCode.Cell>

<VSCode.Cell id="eda_013" language="python">
# Correlation Analysis
print("\n" + "="*60)
print("CORRELATION ANALYSIS")
print("="*60)

# Create correlation matrix for numerical features + fraud_label
corr_data = df[numerical_features + ['fraud_label']].corr()
print("\n" + corr_data.round(3).to_string())

# Visualize correlation matrix
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, ax=ax, cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

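# Print strongest correlations with fraud_label, ranked by absolute value
# so that strong negative relationships are not pushed to the bottom
print("\nTop Correlations with Fraud Label:")
fraud_corr = corr_data['fraud_label'].drop('fraud_label')  # Exclude self-correlation
fraud_corr = fraud_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(fraud_corr)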
</VSCode.Cell>

<VSCode.Cell id="eda_014" language="python">
# Derived Features Analysis
print("\n" + "="*60)
print("DERIVED FEATURES (Manual Calculation)")
print("="*60)

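# Calculate important ratios
# Treat a zero loan amount as missing so the ratio never becomes infinite
df['income_to_loan_ratio'] = df['income'] / df['loan_amount'].replace(0, np.nan)
# Credit score scaled to roughly [0, 1], assuming an 850 maximum (FICO-style scores);
# despite its name, this column does not involve income
df['credit_income_ratio'] = df['credit_score'] / 850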

print("\nIncome-to-Loan Ratio Statistics:")
print("Legitimate Applications:")
print(f"  Mean: {df[df['fraud_label']==0]['income_to_loan_ratio'].mean():.4f}")
print(f"  Median: {df[df['fraud_label']==0]['income_to_loan_ratio'].median():.4f}")
print("\nFraudulent Applications:")
print(f"  Mean: {df[df['fraud_label']==1]['income_to_loan_ratio'].mean():.4f}")
print(f"  Median: {df[df['fraud_label']==1]['income_to_loan_ratio'].median():.4f}")

# Visualize derived features
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Income to loan ratio
ax = axes[0]
ax.hist(df[df['fraud_label']==0]['income_to_loan_ratio'], bins=30, alpha=0.6, label='Legitimate', color=colors[0])
ax.hist(df[df['fraud_label']==1]['income_to_loan_ratio'], bins=30, alpha=0.6, label='Fraudulent', color=colors[1])
ax.set_xlabel('Income-to-Loan Ratio', fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.set_title('Income-to-Loan Ratio Distribution', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

# Normalized credit score (stored in the 'credit_income_ratio' column)
ax = axes[1]
ax.hist(df[df['fraud_label']==0]['credit_income_ratio'], bins=30, alpha=0.6, label='Legitimate', color=colors[0])
ax.hist(df[df['fraud_label']==1]['credit_income_ratio'], bins=30, alpha=0.6, label='Fraudulent', color=colors[1])
ax.set_xlabel('Normalized Credit Score', fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.set_title('Credit Score Distribution (Normalized)', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()
</VSCode.Cell>

<VSCode.Cell id="eda_015" language="python">
# Key Insights and Recommendations
print("\n" + "="*60)
print("KEY INSIGHTS & RECOMMENDATIONS")
print("="*60)

insights = """
1. CLASS IMBALANCE:
   - Fraud is usually the minority class; confirm with the imbalance ratio reported above
   - Recommendation: Use SMOTE or class weights during model training
   - Focus on precision/recall and ROC-AUC, not just accuracy

2. FRAUD PATTERNS:
   - Compare statistical differences between fraud and legitimate applications
   - Identify which features show strongest discrimination power
   - Consider feature interactions and non-linear relationships

3. MISSING VALUES:
   - Handle appropriately (median/mode imputation, missing-value indicator flags, etc.)
   - Consider dropping features with >30% missing data
   - Note any patterns in missingness (MCAR, MAR, MNAR)

4. FEATURE ENGINEERING OPPORTUNITIES:
   - Income-to-loan ratio: Debt burden indicator
   - Credit history score: Payment reliability
   - Employment stability: Job continuity indicator
   - Age-credit score interaction: Maturity vs. creditworthiness

5. OUTLIERS:
   - Investigate extreme values in box plots
   - Decide: Keep (legitimate extreme cases) or remove (data errors)
   - Use robust scaling if outliers are kept

6. CATEGORICAL VARIABLES:
   - Analyze fraud rates across categories
   - Consider one-hot encoding for nominal features and ordinal encoding for ordered ones
   - Watch for rare categories

7. MODEL STRATEGY:
   - Ensemble methods (Random Forest, XGBoost, Voting Classifier)
   - Hyperparameter tuning focused on business metrics
   - Cross-validation with stratification to maintain fraud distribution
   - Threshold tuning to balance precision vs recall trade-off
"""

print(insights)
</VSCode.Cell>

<VSCode.Cell id="eda_016" language="python">
# Summary Statistics Export
print("\n" + "="*60)
print("FINAL SUMMARY")
print("="*60)

summary = {
    'Total Records': len(df),
    'Fraudulent Records': (df['fraud_label'] == 1).sum(),
    'Legitimate Records': (df['fraud_label'] == 0).sum(),
    'Fraud Rate': f"{(df['fraud_label'].mean() * 100):.2f}%",
    'Features Count': len(df.columns) - 1,
    'Missing Data': df.isnull().sum().sum(),
    'Complete Records': len(df.dropna())
}

for key, value in summary.items():
    print(f"{key:.<40} {value}")

print("\n✓ EDA Complete! Ready for preprocessing and feature engineering.")
</VSCode.Cell>
```
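
The insights cell recommends class weights as one way to handle the fraud/legitimate imbalance. As a minimal sketch of what that means, the snippet below computes "balanced" weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight="balanced"`. The function name `balanced_class_weights` and the 95/5 label split are hypothetical, chosen only for illustration; they do not come from the notebook's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count).

    Rare classes receive proportionally larger weights, so each class
    contributes the same total weight to a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 95 legitimate (0) and 5 fraudulent (1) applications
example_labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(example_labels)

for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
```

In the notebook itself, the same function could be called on `df['fraud_label'].dropna().tolist()`, and the resulting dictionary passed as the `class_weight` argument of a scikit-learn classifier that accepts one.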