# Insurance Data Analysis Pipeline

This notebook demonstrates how to clean the raw insurance data with the `DataCleaner` class and then explore it with the inline EDA functions from `eda_inline`.

## 1. Setup and Imports

In [4]:
import sys
from pathlib import Path

# Add scripts directory to path
sys.path.insert(0, str(Path.cwd().parent / 'scripts'))

from clean_data import DataCleaner
from eda_inline import (
    show_data_summary,
    show_missing_values,
    show_descriptive_stats,
    show_distributions,
    show_outliers,
    show_correlation,
    show_scatter,
    show_group_analysis,
    show_temporal_trends,
    run_full_inline_analysis
)

import pandas as pd
import warnings
# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# For better display in notebooks
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## 2. Data Cleaning

Use the `DataCleaner` class to load and clean the raw data.

The cleaned data will be saved in the same directory as the raw data with a `_cleaned` suffix.

In [5]:
# Initialize the DataCleaner
cleaner = DataCleaner()

# Process the data file
# This saves to data/MachineLearningRating_v3_cleaned.csv
df = cleaner.process_file(
    input_path="../data/MachineLearningRating_v3.txt",
    save_to_cleaned=True
)

print(f"\nCleaned data shape: {df.shape}")
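
The `_cleaned` naming convention described above can be sketched with `pathlib`. This is only an illustration of the convention: `cleaned_path_for` is a hypothetical helper, and it is an assumption that `DataCleaner` derives its output path in a similar way.

```python
from pathlib import Path


def cleaned_path_for(raw_path: str) -> Path:
    """Hypothetical helper: sibling file with a `_cleaned` suffix and a .csv extension."""
    raw = Path(raw_path)
    # Keep the directory, append `_cleaned` to the stem, and switch to .csv
    return raw.with_name(f"{raw.stem}_cleaned.csv")


cleaned_path = cleaned_path_for("../data/MachineLearningRating_v3.txt")
print(cleaned_path.name)
```

Building the path from the raw file name keeps the raw and cleaned files side by side, which matches the location used in the rest of this notebook.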

### Alternative: Load already cleaned data

In [6]:
# If you've already cleaned the data, you can load it directly
# df = pd.read_csv("../data/MachineLearningRating_v3_cleaned.csv")

## 3. Exploratory Data Analysis (EDA)

All EDA results will be displayed inline in this notebook. No files will be saved.

### Option 1: Run Full Analysis (Automated)

In [7]:
# Run complete EDA pipeline - displays everything inline
run_full_inline_analysis(df)

# This will:
# 1. Summarize data structure
# 2. Assess missing values
# 3. Generate descriptive statistics
# 4. Create distribution plots
# 5. Detect outliers
# 6. Perform correlation analysis
# 7. Create scatter plots
# 8. Analyze group-level KPIs
# 9. Perform temporal trend analysis

### Option 2: Run Individual Analyses

#### Group-Level KPI Analysis

In [None]:
# Analyze KPIs by different groups
show_group_analysis(df, group_cols=['Province', 'VehicleType', 'Gender'])

#### Temporal Trend Analysis

In [None]:
# Analyze trends over time
temporal_df = show_temporal_trends(df, date_col='TransactionMonth')
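
For readers who want to see what a monthly trend table can look like, here is a minimal sketch in plain pandas. It uses a few synthetic rows, and `monthly_totals` is a hypothetical helper; it is not the implementation of `show_temporal_trends`, whose return value may be structured differently.

```python
import pandas as pd

# Synthetic rows for illustration only; not taken from the real dataset
sample = pd.DataFrame({
    "TransactionMonth": ["2015-01-01", "2015-01-01", "2015-02-01", "2015-02-01"],
    "TotalPremium": [100.0, 200.0, 150.0, 50.0],
    "TotalClaims": [0.0, 90.0, 0.0, 40.0],
})


def monthly_totals(frame, date_col="TransactionMonth"):
    """Hypothetical helper: sum premium and claims per calendar month."""
    out = frame.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    # Group by calendar month, then total the two monetary columns
    monthly = (
        out.groupby(out[date_col].dt.to_period("M"))[["TotalPremium", "TotalClaims"]]
        .sum()
    )
    monthly["LossRatio"] = monthly["TotalClaims"] / monthly["TotalPremium"]
    return monthly


monthly = monthly_totals(sample)
print(monthly)
```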

## 4. Custom Analysis

You can perform additional custom analysis using the cleaned DataFrame.

In [None]:
# Example: Analyze loss ratio by province
province_analysis = df.groupby('Province').agg({
    'TotalPremium': 'sum',
    'TotalClaims': 'sum',
    'PolicyID': 'count'  # non-null rows per province; use 'nunique' for distinct policies
})

province_analysis['LossRatio'] = (
    province_analysis['TotalClaims'] / province_analysis['TotalPremium']
)

province_analysis.sort_values('LossRatio', ascending=False)
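
One caveat with the ratio above: a province whose summed premium is zero would produce `inf` and sort to the top. The sketch below shows one possible guard on synthetic data, replacing zero premiums with `NaN` before dividing. The province labels and values are made up for illustration, and whether this case occurs in the real data is not something the notebook establishes.

```python
import numpy as np
import pandas as pd

# Synthetic rows for illustration only; province "B" has zero premium
sample = pd.DataFrame({
    "Province": ["A", "A", "B", "C"],
    "TotalPremium": [100.0, 100.0, 0.0, 50.0],
    "TotalClaims": [50.0, 0.0, 10.0, 25.0],
})

totals = sample.groupby("Province")[["TotalPremium", "TotalClaims"]].sum()

# Treat zero premium as missing so the ratio becomes NaN instead of inf
totals["LossRatio"] = totals["TotalClaims"] / totals["TotalPremium"].replace(0, np.nan)

# sort_values places NaN rows last by default, so undefined ratios do not lead the ranking
ranked = totals.sort_values("LossRatio", ascending=False)
print(ranked)
```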

## Summary

This notebook demonstrated:
1. **Data Cleaning**: Using `DataCleaner` to process and save cleaned data
2. **Analysis**: Using the `eda_inline` functions to display EDA results directly in the notebook

The cleaned data is saved to: `data/MachineLearningRating_v3_cleaned.csv`