# 📊 Exploratory Data Analysis (EDA)
## Insurance Claims Cost Prediction

**Objective**: Understand the training dataset to inform feature engineering and model selection for predicting `UltimateIncurredClaimCost`.

---

### Table of Contents
1. [Data Loading & Overview](#1-data-loading--overview)
2. [Missing Values Analysis](#2-missing-values-analysis)
3. [Target Variable Analysis](#3-target-variable-analysis)
4. [Numerical Features Distribution](#4-numerical-features-distribution)
5. [Categorical Features Analysis](#5-categorical-features-analysis)
6. [Correlation Analysis](#6-correlation-analysis)
7. [Feature vs Target Relationships](#7-feature-vs-target-relationships)
8. [Outlier Detection](#8-outlier-detection)
9. [Key Insights & Recommendations](#9-key-insights--recommendations)

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

---
## 1. Data Loading & Overview

In [None]:
# Load the training dataset
df = pd.read_csv('data/train.csv')

print(f"Dataset Shape: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
# First 5 rows
df.head()

In [None]:
# Data Types and Info
df.info()

In [None]:
# Statistical Summary
df.describe()

### 💡 Initial Observations
- **ClaimNumber**: Unique identifier (should be dropped for modeling)
- **Date columns**: `DateTimeOfAccident`, `DateReported` - potential for time-based features
- **Target**: `UltimateIncurredClaimCost` - this is what we need to predict
- **Categorical**: `Gender`, `MaritalStatus`, `PartTimeFullTime`
- **Numerical**: `Age`, `WeeklyWages`, `InitialIncurredCalimsCost`, etc.
- **Text**: `ClaimDescription` - potential for NLP features

---
## 2. Missing Values Analysis

In [None]:
# Missing values summary
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing %', ascending=False)

missing_df[missing_df['Missing Count'] > 0]

In [None]:
# Visualize missing values
if missing_df[missing_df['Missing Count'] > 0].shape[0] > 0:
    plt.figure(figsize=(10, 4))
    missing_cols = missing_df[missing_df['Missing Count'] > 0]
    sns.barplot(x=missing_cols.index, y='Missing %', data=missing_cols, color='coral')
    plt.title('Missing Values by Column (%)', fontsize=14, fontweight='bold')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Percentage Missing')
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values in the dataset!")

---
## 3. Target Variable Analysis

Understanding the distribution of `UltimateIncurredClaimCost` is critical for choosing the right model and loss function.

In [None]:
target = 'UltimateIncurredClaimCost'

print("Target Variable Statistics:")
print(df[target].describe())
print(f"\nSkewness: {df[target].skew():.2f}")
print(f"Kurtosis: {df[target].kurtosis():.2f}")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw Distribution
axes[0].hist(df[target], bins=100, color='steelblue', edgecolor='white')
axes[0].set_title('Distribution of Ultimate Incurred Claim Cost (Raw)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Cost ($)')
axes[0].set_ylabel('Frequency')
axes[0].axvline(df[target].mean(), color='red', linestyle='--', label=f'Mean: ${df[target].mean():,.0f}')
axes[0].axvline(df[target].median(), color='orange', linestyle='--', label=f'Median: ${df[target].median():,.0f}')
axes[0].legend()

# Log-Transformed Distribution
log_target = np.log1p(df[target])
axes[1].hist(log_target, bins=100, color='seagreen', edgecolor='white')
axes[1].set_title('Distribution of log(1 + Cost)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Log-Transformed Cost')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

### 💡 Target Insights
- **Extreme Right Skew**: The raw distribution has a heavy right tail; most claims are low value and a few are very large.
- **Log Transformation Helps**: The log-transformed distribution is far more symmetric and closer to normal.
- **Recommendation**: Use `log1p(target)` for training and `expm1(prediction)` for inference.
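
As a minimal sketch of that round trip, the snippet below applies `np.log1p` and then `np.expm1` to a handful of made-up cost values (they are illustrative, not taken from the dataset) and checks that the original scale is recovered.

```python
import numpy as np

# Hypothetical claim costs spanning several orders of magnitude (not from the dataset)
costs = np.array([0.0, 150.0, 2_500.0, 48_000.0, 1_200_000.0])

# Forward transform, applied to the target before training
log_costs = np.log1p(costs)

# Inverse transform, applied to model predictions at inference time
recovered = np.expm1(log_costs)

print(log_costs.round(2))
print(np.allclose(recovered, costs))
```

`log1p` is used rather than a plain `log` so that a zero cost maps to zero instead of negative infinity.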

---
## 4. Numerical Features Distribution

In [None]:
numerical_cols = ['Age', 'WeeklyWages', 'InitialIncurredCalimsCost', 
                  'HoursWorkedPerWeek', 'DaysWorkedPerWeek', 
                  'DependentChildren', 'DependentsOther']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()

for i, col in enumerate(numerical_cols):
    if col in df.columns:
        axes[i].hist(df[col].dropna(), bins=50, color='steelblue', edgecolor='white')
        axes[i].set_title(f'Distribution of {col}', fontsize=11, fontweight='bold')
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Frequency')

# Hide unused subplots
for j in range(len(numerical_cols), len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()

### 💡 Numerical Feature Observations
- **Age**: Relatively uniform distribution, typical working age range.
- **WeeklyWages**: Right-skewed with some high earners.
- **InitialIncurredCalimsCost** (the spelling used in the code cells): Highly skewed, similar to the target (expected correlation).
- **DependentChildren/DependentsOther**: Mostly low values (0-2).

---
## 5. Categorical Features Analysis

In [None]:
categorical_cols = ['Gender', 'MaritalStatus', 'PartTimeFullTime']

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, col in enumerate(categorical_cols):
    if col in df.columns:
        value_counts = df[col].value_counts()
        axes[i].bar(value_counts.index, value_counts.values, color='teal', edgecolor='white')
        axes[i].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Count')
        
        # Add percentage labels
        for idx, (cat, count) in enumerate(zip(value_counts.index, value_counts.values)):
            pct = count / len(df) * 100
            axes[i].text(idx, count + len(df)*0.01, f'{pct:.1f}%', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Categorical vs Target (Box plots)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, col in enumerate(categorical_cols):
    if col in df.columns:
        df.boxplot(column=target, by=col, ax=axes[i])
        axes[i].set_title(f'{col} vs {target}', fontsize=12, fontweight='bold')
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Claim Cost ($)')
        axes[i].set_ylim(0, df[target].quantile(0.95))  # Focus on 95th percentile

plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()

### 💡 Categorical Insights
- **Gender**: Imbalanced (more males, typical in workers' comp data).
- **MaritalStatus**: Married (M) is the largest group.
- **PartTimeFullTime**: Full-time workers dominate.

---
## 6. Correlation Analysis

In [None]:
# Compute correlation matrix for numerical columns
corr_cols = numerical_cols + [target]
corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r', 
            center=0, linewidths=0.5, square=True)
plt.title('Correlation Matrix (Numerical Features)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlation with Target
target_corr = corr_matrix[target].drop(target).sort_values(ascending=False)

plt.figure(figsize=(8, 5))
colors = ['green' if x > 0 else 'red' for x in target_corr.values]
plt.barh(target_corr.index, target_corr.values, color=colors)
plt.xlabel('Correlation Coefficient')
plt.title('Feature Correlation with Target', fontsize=14, fontweight='bold')
plt.axvline(0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()

### 💡 Correlation Insights
- **InitialIncurredCalimsCost**: Strong positive correlation with target (expected - initial estimate drives final cost).
- **WeeklyWages**: Moderate positive correlation (higher wages → higher claims?).
- Other features have weak correlations. Pearson correlation only measures linear association, so these features may still carry non-linear signal or become useful after feature engineering.
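
A weak linear correlation does not rule out a strong non-linear relationship. The sketch below uses synthetic data (not the claims dataset) in which `y` grows exponentially with `x`, and compares the Pearson coefficient with a rank-based (Spearman) coefficient, computed here as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic data (not from the dataset): y grows exponentially with x,
# so the relationship is perfectly monotonic but far from linear.
x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 2)

# Pearson measures linear association only
pearson = x.corr(y)

# Spearman correlation is the Pearson correlation of the ranks
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

On the notebook's data, the same comparison could be run per feature against the target to spot monotonic relationships that the heatmap understates.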

---
## 7. Feature vs Target Relationships

In [None]:
# Scatter plots for key features vs target
key_features = ['InitialIncurredCalimsCost', 'WeeklyWages', 'Age']

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, col in enumerate(key_features):
    axes[i].scatter(df[col], df[target], alpha=0.3, s=10, color='steelblue')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel(target)
    axes[i].set_title(f'{col} vs Target', fontsize=12, fontweight='bold')
    axes[i].set_ylim(0, df[target].quantile(0.99))
    
plt.tight_layout()
plt.show()

---
## 8. Outlier Detection

In [None]:
# Box plots for outlier detection
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

all_numeric = numerical_cols + [target]

for i, col in enumerate(all_numeric):
    if col in df.columns and i < len(axes):
        axes[i].boxplot(df[col].dropna(), vert=True)
        axes[i].set_title(f'{col}', fontsize=11, fontweight='bold')
        axes[i].set_ylabel('Value')

plt.tight_layout()
plt.show()

In [None]:
# Quantile analysis for target
print("Target Variable Percentiles:")
for q in [0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1.0]:
    val = df[target].quantile(q)
    print(f"  {round(q * 100):3d}th percentile: ${val:,.2f}")

### 💡 Outlier Observations
- **Target**: Top 1% of claims have extremely high values (long tail).
- **InitialIncurredCalimsCost**: Similar pattern to target.
- **Consideration**: Tree-based models are robust to outliers in the input features, but extreme target values still dominate a squared-error loss. Log-transforming the target reduces their influence for any model trained by gradient descent.
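
The box plots above mark points beyond 1.5 × IQR from the quartiles, which is matplotlib's default whisker rule. To count such points rather than eyeball them, a small helper along these lines could be used. The function name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical costs with one extreme value (not from the dataset)
costs = pd.Series([100, 120, 130, 150, 160, 180, 200, 5000])
mask = iqr_outlier_mask(costs)

print(costs[mask].tolist())
print(f"Share flagged: {mask.mean():.1%}")
```

Applied to the notebook's data, `iqr_outlier_mask(df[target]).mean()` would give the share of claims flagged by this rule. For a heavily skewed target that share can be large, so the rule is better read as a description of the tail than as a list of rows to drop.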

---
## 9. Key Insights & Recommendations

### 📌 Summary of Findings

| Aspect | Finding | Implication |
|--------|---------|-------------|
| **Target Distribution** | Highly right-skewed | Use `log1p` transformation for training |
| **Key Predictor** | `InitialIncurredCalimsCost` | Strong correlation with target, most important feature |
| **Missing Values** | Minimal/None | No complex imputation needed |
| **Categorical Features** | Imbalanced classes | Use label encoding; tree models handle this well |
| **Text Feature** | `ClaimDescription` available | Use TF-IDF + SVD for dimensionality reduction |
| **Date Features** | Accident and Report dates | Engineer `ReportLag`, `AccidentYear`, `AccidentMonth`, `DayOfWeek` |
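
For the text row in the table, a minimal sketch of TF-IDF followed by truncated SVD is shown below. It assumes scikit-learn is installed, which the notebook does not otherwise import. The claim descriptions are invented, and the sizes are shrunk to fit the toy corpus (`max_features=50`, `n_components=2`); they are not tuned values.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical claim descriptions (not from the dataset)
descriptions = [
    "LIFTING BOX STRAINED LOWER BACK",
    "SLIPPED ON WET FLOOR SPRAINED ANKLE",
    "CUT FINGER WITH KNIFE",
    "LIFTING PALLET STRAINED SHOULDER",
    "FELL FROM LADDER FRACTURED WRIST",
]

# Cap the vocabulary size; a larger corpus would allow a larger cap
vectorizer = TfidfVectorizer(max_features=50)
tfidf = vectorizer.fit_transform(descriptions)  # sparse matrix: (n_docs, n_terms)

# Project the sparse TF-IDF matrix onto a few dense components;
# n_components must not exceed the number of terms
svd = TruncatedSVD(n_components=2, random_state=0)
text_features = svd.fit_transform(tfidf)  # dense array: (n_docs, n_components)

print(tfidf.shape, text_features.shape)
```

To avoid leakage, the vectorizer and the SVD would be fitted on the training split only and then applied to validation and test data with `transform`.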

### 🎯 Recommended Feature Engineering
1. **Date Features**: `ReportLag = DateReported - DateTimeOfAccident`
2. **Log Transform**: `LogInitialCost = log1p(InitialIncurredCalimsCost)`
3. **Interaction**: `Age_Wage_Interaction = Age * WeeklyWages`
4. **NLP Features**: TF-IDF (1000 features) + TruncatedSVD (30 components) on `ClaimDescription`
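
Items 1 to 3 can be sketched with pandas alone. The two-row frame below is hypothetical and only reuses the notebook's column names. The timestamp format is an assumption, and expressing `ReportLag` in whole days is one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows using the notebook's column names (values are made up)
sample = pd.DataFrame({
    "DateTimeOfAccident": ["2002-04-09 07:00:00", "1999-01-07 11:00:00"],
    "DateReported": ["2002-07-05 00:00:00", "1999-01-20 00:00:00"],
    "Age": [48, 43],
    "WeeklyWages": [500.0, 509.34],
    "InitialIncurredCalimsCost": [1500, 5500],
})

# Parse the date columns before doing any date arithmetic
accident = pd.to_datetime(sample["DateTimeOfAccident"])
reported = pd.to_datetime(sample["DateReported"])

# 1. Date features: reporting delay in whole days plus calendar parts
sample["ReportLag"] = (reported - accident).dt.days
sample["AccidentYear"] = accident.dt.year
sample["AccidentMonth"] = accident.dt.month
sample["DayOfWeek"] = accident.dt.dayofweek  # Monday = 0

# 2. Log transform of the initial estimate
sample["LogInitialCost"] = np.log1p(sample["InitialIncurredCalimsCost"])

# 3. Interaction between age and wages
sample["Age_Wage_Interaction"] = sample["Age"] * sample["WeeklyWages"]

print(sample[["ReportLag", "AccidentYear", "LogInitialCost", "Age_Wage_Interaction"]])
```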

### 🧠 Model Recommendations
- **Primary**: Gradient Boosting (XGBoost, LightGBM) - handles skewed data, non-linear relationships
- **Ensemble**: Stacking with Ridge meta-learner for robust predictions
- **Evaluation**: RMSE on original scale (after `expm1` transformation)
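
The evaluation step can be sketched as follows, assuming the model was trained on `log1p(target)` and therefore predicts on the log scale. The helper name `rmse_original_scale` and the sample numbers are hypothetical.

```python
import numpy as np

def rmse_original_scale(y_true, y_pred_log):
    """RMSE in currency units, given predictions made on the log1p scale."""
    y_true = np.asarray(y_true, dtype=float)
    # Undo the log1p transform before measuring the error
    y_pred = np.expm1(np.asarray(y_pred_log, dtype=float))
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true costs and log-scale predictions (not from the dataset)
y_true = np.array([100.0, 1_000.0, 10_000.0])
y_pred_log = np.log1p(np.array([120.0, 900.0, 14_000.0]))

print(f"RMSE: ${rmse_original_scale(y_true, y_pred_log):,.2f}")
```

On the original scale the metric is dominated by the largest claims, so a model that looks accurate on the log scale can still score poorly here.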

In [None]:
print("\n✅ EDA Complete!")
print("Proceed to feature engineering and model training.")