# Medical Insurance Cost - EDA Summary

This notebook provides a lightweight summary of the exploratory data analysis for the Medical Insurance Cost dataset.

## Dataset Overview

- **Task**: Regression
- **Target**: `charges` (continuous)
- **Features**: `age`, `sex`, `bmi`, `children`, `smoker`, `region`
- **Dataset location**: `data/raw/insurance.csv`
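
A minimal sketch of how a file like this might be loaded and tidied with pandas: normalize the column names, cast `sex`, `smoker`, and `region` to the categorical dtype, and count missing values. The helper name `load_insurance`, the mixed-case headers, and the inline rows are hypothetical stand-ins for `data/raw/insurance.csv`; the actual script may do this differently.

```python
import io

import pandas as pd

# Made-up rows standing in for data/raw/insurance.csv (illustration only).
SAMPLE_CSV = """Age,Sex,BMI,Children,Smoker,Region,Charges
25,female,24.5,0,yes,southwest,15000.0
40,male,31.2,2,no,southeast,6000.0
33,male,28.0,1,no,northwest,4500.0
"""


def load_insurance(source):
    """Load the CSV, standardize column names, and cast categorical columns."""
    df = pd.read_csv(source)
    # Standardize column names: trim whitespace, lowercase, spaces to underscores.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Cast the three categorical features to the pandas category dtype.
    for col in ("sex", "smoker", "region"):
        df[col] = df[col].astype("category")
    return df


df = load_insurance(io.StringIO(SAMPLE_CSV))
missing_per_column = df.isna().sum()  # one count per column
print(df.dtypes)
print(missing_per_column)
```

To run this against the real file, pass the path instead of the `StringIO` buffer, for example `load_insurance("data/raw/insurance.csv")`.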

## Key Steps Performed

1. **Data Loading and Preprocessing**
   - Standardized column names
   - Cast categorical variables (sex, smoker, region)
   - No missing values found

2. **Univariate Analysis**
   - Histograms and boxplots for numeric features
   - Count plots for categorical features
   - See figures: `reports/figures/insurance_univariate_*.png`

3. **Log Transformation**
   - Created `charges_log = log1p(charges)` for modeling stability
   - Reduces skewness in target distribution
   - See figure: `reports/figures/insurance_log_transformation.png`

4. **Outlier Detection**
   - Applied IQR method only to features (age, bmi), **not** to target
   - Preserves natural variation in insurance charges

5. **Bivariate Analysis**
   - Charges by smoker status (boxplot)
   - Age vs. charges scatter plot (colored by smoker)
   - BMI vs. charges scatter plot (colored by smoker)
   - Charges by sex and region
   - See figures: `reports/figures/insurance_bivariate_*.png`

6. **Correlation Analysis**
   - Spearman correlation with dummy variables
   - **Key findings**:
     - Region variables show low correlation with charges
     - Children shows low correlation but may have interaction effects
   - **Decision**: Dropped `region` to simplify model (low predictive value)
   - See figure: `reports/figures/insurance_correlation_matrix.png`

7. **Train/Test Split**
   - 80/20 split with stratification by `charges` quantiles (q=5)
   - Ensures representative distribution of target values in both sets
   - Outputs: `data/processed/insurance_train.csv`, `data/processed/insurance_test.csv`
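
Steps 3, 4, and 7 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/eda_insurance.py`: the data is synthetic, the helper `iqr_outlier_mask` is hypothetical, outliers are only flagged (the summary does not say whether they were removed or capped), and the split uses scikit-learn's `train_test_split`, which the summary does not name.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned insurance data (not the real dataset).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30.0, 6.0, size=n),
    "charges": rng.lognormal(mean=9.0, sigma=0.9, size=n),
})

# Step 3: log1p compresses the long right tail of the target.
df["charges_log"] = np.log1p(df["charges"])


def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Step 4: flag outliers in the features only; the target is left untouched.
outlier_flags = pd.DataFrame(
    {col: iqr_outlier_mask(df[col]) for col in ("age", "bmi")}
)

# Step 7: bin the target into quintiles and stratify the 80/20 split on the bins.
charge_bins = pd.qcut(df["charges"], q=5, labels=False)
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=charge_bins, random_state=42
)

print(df[["charges", "charges_log"]].skew())
print(outlier_flags.sum())
print(len(train_df), len(test_df))
```

Stratifying on binned `charges` is what makes a stratified split possible for a continuous target: each quintile contributes the same share of rows to the train and test sets.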

## Key Insights

- Smoking status is the strongest predictor of insurance charges
- Age and BMI show positive correlation with charges, especially for smokers
- Region has minimal impact on charges and was removed
- Children count has low direct correlation but may have complex interactions
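
For readers who want to reproduce this kind of check, here is a small sketch of a Spearman correlation over dummy-encoded variables. It assumes pandas and uses synthetic data in which charges depend on age and smoking but not on region by construction, so the printed numbers only illustrate the method and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: charges depend on age and smoking, not on region.
rng = np.random.default_rng(1)
n = 400
smoker = rng.choice(["yes", "no"], size=n, p=[0.2, 0.8])
region = rng.choice(
    ["northeast", "northwest", "southeast", "southwest"], size=n
)
age = rng.integers(18, 65, size=n)
charges = (
    2000 + 100 * age + 20000 * (smoker == "yes") + rng.normal(0, 2000, size=n)
)
df = pd.DataFrame(
    {"age": age, "smoker": smoker, "region": region, "charges": charges}
)

# Dummy-encode the categorical columns, dropping one level per variable.
encoded = pd.get_dummies(
    df, columns=["smoker", "region"], drop_first=True, dtype=float
)

# Rank-based (Spearman) correlation of every column with the target.
corr = encoded.corr(method="spearman")
charges_corr = (
    corr["charges"].drop("charges").sort_values(key=np.abs, ascending=False)
)
print(charges_corr)
```

A low pairwise correlation, as reported for `children`, does not rule out interaction effects; those would need to be checked separately, for example by comparing groups or fitting a model with interaction terms.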

## How to Run

```bash
cd sleep-insurance-eda
python src/eda_insurance.py
```

## Figures

All generated figures are saved in `reports/figures/` with the prefix `insurance_`.
