# Task 1: Exploratory Data Analysis (EDA)
## Insurance Risk Analytics & Predictive Modeling

This notebook performs comprehensive EDA on the insurance claim data to:
- Understand data structure and quality
- Discover patterns in risk and profitability
- Answer key business questions
- Prepare for hypothesis testing and modeling


## 1. Setup and Imports


In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

from src.data_loader import load_insurance_data, prepare_data_for_analysis
from src.utils import calculate_loss_ratio, detect_outliers_iqr, get_data_summary

# Suppress all warnings to keep notebook output readable (this can also hide deprecation notices)
warnings.filterwarnings('ignore')

# Set plotting style, falling back to styles available in older matplotlib versions
for style_name in ('seaborn-v0_8-darkgrid', 'seaborn-darkgrid', 'ggplot'):
    try:
        plt.style.use(style_name)
        break
    except OSError:
        continue
sns.set_palette("husl")

%matplotlib inline
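
The setup cell imports `calculate_loss_ratio` from the project's `src.utils`, whose implementation is not shown in this notebook. As a rough illustration of the metric, the sketch below assumes the conventional definition of loss ratio (total claims divided by total premium) and uses the `TotalClaims` and `TotalPremium` column names that appear in the output. The function name `loss_ratio_sketch` and the toy data are hypothetical; the project's helper may differ.

```python
import pandas as pd


def loss_ratio_sketch(frame: pd.DataFrame,
                      claims_col: str = "TotalClaims",
                      premium_col: str = "TotalPremium") -> float:
    """Aggregate loss ratio: sum of claims divided by sum of premium.

    Hypothetical stand-in for the project's calculate_loss_ratio,
    assuming the conventional definition of the metric.
    """
    total_premium = frame[premium_col].sum()
    if total_premium == 0:
        # The ratio is undefined when no premium was collected
        return float("nan")
    return float(frame[claims_col].sum() / total_premium)


# Toy data (not from the real dataset): 100 in claims against 200 in premium
toy = pd.DataFrame({
    "TotalPremium": [100.0, 50.0, 50.0],
    "TotalClaims": [0.0, 80.0, 20.0],
})
print(loss_ratio_sketch(toy))  # 0.5
```

A ratio above 1 means claims exceeded premium for the group being measured. Summing before dividing, rather than averaging per-row ratios, avoids division by zero on rows with no premium.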


## 2. Load Data

Note: Pass `sample_size` to `load_insurance_data` to work with a subset of rows for faster initial exploration. Omit it to load the full dataset.


In [2]:
# Load a 50,000-row sample for faster exploration; omit sample_size to load the full dataset
df = load_insurance_data(sample_size=50000)
df = prepare_data_for_analysis(df)

print(f"Data shape: {df.shape}")
print(f"Date range: {df['TransactionMonth'].min()} to {df['TransactionMonth'].max()}")


Loading data from: D:\My Projects\Kifiya AI Mastery Training\week3\notebooks\..\data\MachineLearningRating_v3.txt
Loaded 1,000,098 rows and 52 columns
Sampled to 50,000 rows
Data shape: (50000, 56)
Date range: 2013-10-01 00:00:00 to 2015-08-01 00:00:00
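
How `load_insurance_data` draws its sample is not shown in this notebook. One common approach, sketched here as an assumption rather than the loader's actual behavior, is a seeded random sample with pandas `DataFrame.sample`, so that repeated runs explore the same rows. The helper name `sample_rows`, the `random_state` default, and the `row_id` column are hypothetical.

```python
import pandas as pd


def sample_rows(frame: pd.DataFrame, sample_size=None, random_state=42) -> pd.DataFrame:
    """Return a reproducible random subset of rows, or the frame itself.

    Hypothetical helper: the real loader may sample differently.
    """
    if sample_size is None or sample_size >= len(frame):
        # No sampling requested, or the frame is already small enough
        return frame
    # A fixed random_state makes the same rows come back on every run
    return frame.sample(n=sample_size, random_state=random_state)


# Toy frame standing in for the loaded data
toy = pd.DataFrame({"row_id": range(1000)})
subset = sample_rows(toy, sample_size=50)
print(len(subset))  # 50
```

Fixing the seed matters during exploration: without it, summary statistics shift slightly between runs, which makes it harder to tell whether a change came from the code or from the sample.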


## 3. Run Full EDA

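Run the complete EDA pipeline with the `InsuranceEDA` class. As the output below shows, it loads the full dataset (1,000,098 rows) itself rather than reusing the 50,000-row sample `df` from the previous step: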


In [3]:
from src.eda import InsuranceEDA

# Initialize and run full EDA
eda = InsuranceEDA()
eda.run_full_eda()


INSURANCE RISK ANALYTICS - EXPLORATORY DATA ANALYSIS
Loading data from: D:\My Projects\Kifiya AI Mastery Training\week3\notebooks\..\data\MachineLearningRating_v3.txt
Loaded 1,000,098 rows and 52 columns

Data shape: (1000098, 56)
Date range: 2013-10-01 00:00:00 to 2015-08-01 00:00:00

RUNNING COMPLETE EDA PIPELINE

1. DATA SUMMARIZATION

1.1 Data Structure:
   - Total rows: 1,000,098
   - Total columns: 56

1.2 Data Types:
   - object: 34 columns
   - float64: 13 columns
   - int64: 4 columns
   - int32: 2 columns
   - datetime64[ns]: 1 columns
   - bool: 1 columns
   - period[M]: 1 columns

1.3 Descriptive Statistics (Numerical Variables):
       TotalPremium  TotalClaims   SumInsured  CustomValueEstimate  \
count    1000098.00   1000098.00   1000098.00            220456.00   
mean          61.91        64.86    604172.73            225531.13   
std          230.28      2384.07   1508331.84            564515.75   
min         -782.58    -12002.41         0.01             20000.00   
