# Insurance Risk Hypothesis Testing

This notebook statistically tests key risk-driver hypotheses for insurance segmentation, using modular helper code from `src/utils`. The workflow includes:
- Data loading and preprocessing
- Metric calculation (claim frequency, severity, margin)
- Data segmentation for A/B testing
- Statistical testing
- Analysis and business recommendations
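
As a rough illustration of the metric-calculation step, the sketch below computes the three metrics on toy data. The definitions are assumptions made for this example (frequency as the share of rows with a positive claim, severity as the mean claim amount among rows with a claim, margin as total premium minus total claims), and `toy_metrics` is a hypothetical helper, not the implementation in `utils.metrics`.

```python
import pandas as pd

# Toy policy-level data; column names mirror the dataset used in this notebook.
toy = pd.DataFrame({
    "Province": ["A", "A", "A", "B", "B"],
    "TotalPremium": [100.0, 120.0, 80.0, 200.0, 150.0],
    "TotalClaims": [0.0, 300.0, 0.0, 50.0, 0.0],
})

def toy_metrics(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Per-group claim frequency, claim severity, and margin (assumed definitions)."""
    has_claim = df["TotalClaims"] > 0
    grouped = df.groupby(group_col)
    # Frequency: share of rows in each group with a positive claim amount.
    frequency = has_claim.groupby(df[group_col]).mean()
    # Severity: mean claim amount, restricted to rows that have a claim.
    severity = df[has_claim].groupby(group_col)["TotalClaims"].mean()
    # Margin: total premium collected minus total claims paid.
    margin = grouped["TotalPremium"].sum() - grouped["TotalClaims"].sum()
    return pd.DataFrame({"frequency": frequency, "severity": severity, "margin": margin})

metrics = toy_metrics(toy, "Province")
print(metrics)
```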

In [51]:
import sys
import os

# Add the src directory (one level up from notebooks/) to sys.path
src_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'src'))
if src_path not in sys.path:
    sys.path.insert(0, src_path)

import pandas as pd
from utils.data_prep import load_data, preprocess_data
from utils.metrics import compute_claim_frequency, compute_claim_severity, compute_margin
from utils.segmentation import split_groups
from utils.stat_tests import run_chi2_test, run_ttest
from utils.reporting import interpret_result, business_recommendation


In [52]:
# Load and preprocess data
DATA_PATH = os.path.abspath(os.path.join('..', 'data', 'raw', 'MachineLearningRating_v3.txt'))
df_raw = load_data(DATA_PATH)
print('Rows after loading:', len(df_raw))
print('Rows with missing values:', df_raw.isnull().any(axis=1).sum())
# Only drop rows with missing values in required columns
required_cols = ['TotalClaims', 'TotalPremium', 'Province', 'Gender', 'PostalCode']
df = df_raw.drop_duplicates()
df = df.dropna(subset=required_cols)
print('Rows after preprocessing:', len(df))
print(df.columns.tolist())
df.head()



Rows after loading: 1000098
Rows with missing values: 1000098
Rows after preprocessing: 990562
['UnderwrittenCoverID', 'PolicyID', 'TransactionMonth', 'IsVATRegistered', 'Citizenship', 'LegalType', 'Title', 'Language', 'Bank', 'AccountType', 'MaritalStatus', 'Gender', 'Country', 'Province', 'PostalCode', 'MainCrestaZone', 'SubCrestaZone', 'ItemType', 'mmcode', 'VehicleType', 'RegistrationYear', 'make', 'Model', 'Cylinders', 'cubiccapacity', 'kilowatts', 'bodytype', 'NumberOfDoors', 'VehicleIntroDate', 'CustomValueEstimate', 'AlarmImmobiliser', 'TrackingDevice', 'CapitalOutstanding', 'NewVehicle', 'WrittenOff', 'Rebuilt', 'Converted', 'CrossBorder', 'NumberOfVehiclesInFleet', 'SumInsured', 'TermFrequency', 'CalculatedPremiumPerTerm', 'ExcessSelected', 'CoverCategory', 'CoverType', 'CoverGroup', 'Section', 'Product', 'StatutoryClass', 'StatutoryRiskType', 'TotalPremium', 'TotalClaims']

,UnderwrittenCoverID,PolicyID,TransactionMonth,IsVATRegistered,Citizenship,LegalType,Title,Language,Bank,AccountType,...,ExcessSelected,CoverCategory,CoverType,CoverGroup,Section,Product,StatutoryClass,StatutoryRiskType,TotalPremium,TotalClaims
0,145249,12827,2015-03-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
1,145249,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
2,145249,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0
3,145255,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,512.84807,0.0
4,145255,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0
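
The loading and cleaning steps can be sketched on a tiny in-memory file. This assumes the raw file is pipe-delimited. The sample text is invented for illustration, and `low_memory=False` is one common way to have pandas infer column types from the whole file rather than chunk by chunk. The actual `load_data` helper may behave differently.

```python
import io
import pandas as pd

# Hypothetical miniature of a pipe-delimited raw file (not real data).
raw_text = (
    "PolicyID|Province|Gender|PostalCode|TotalPremium|TotalClaims\n"
    "1|Gauteng|Male|2000|21.9|0.0\n"
    "1|Gauteng|Male|2000|21.9|0.0\n"      # exact duplicate of the previous row
    "2|Western Cape||122|512.8|0.0\n"     # Gender is missing
    "3|Gauteng|Female|2000|0.0|150.0\n"
)

toy_raw = pd.read_csv(io.StringIO(raw_text), delimiter="|", low_memory=False)

# Drop exact duplicates, then drop rows missing any column the tests rely on.
required = ["TotalClaims", "TotalPremium", "Province", "Gender", "PostalCode"]
toy_clean = toy_raw.drop_duplicates().dropna(subset=required)

print("Rows after loading:", len(toy_raw))
print("Rows after preprocessing:", len(toy_clean))
```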


## Hypothesis 1: Risk Differences Across Provinces

- **Null Hypothesis (H₀):** There are no risk differences across provinces.
- **Metrics:** Claim Frequency, Claim Severity, Margin
- **Test:** Chi-squared (frequency), t-test (severity, margin)
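
The null hypothesis covers all provinces, so one option is a single chi-squared test on a province-by-claim contingency table rather than a comparison of two provinces. The sketch below uses invented counts and a hypothetical `HasClaim` column. It is an illustration under those assumptions, not the procedure this notebook runs.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented data: three provinces with 40 rows each and different claim rates.
toy = pd.DataFrame({
    "Province": ["A"] * 40 + ["B"] * 40 + ["C"] * 40,
    "TotalClaims": [0.0] * 30 + [100.0] * 10
                 + [0.0] * 36 + [100.0] * 4
                 + [0.0] * 38 + [100.0] * 2,
})
toy["HasClaim"] = toy["TotalClaims"] > 0

# Rows are provinces, columns are no-claim / claim counts.
table = pd.crosstab(toy["Province"], toy["HasClaim"])
chi2, p, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
```

With k provinces the test has k - 1 degrees of freedom. A significant result says only that at least one province differs, so pairwise comparisons would still be needed to find which.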

In [53]:
import numpy as np
from scipy.stats import chi2_contingency
# Compute metrics for provinces
# Note: astype(int) truncates fractional claim amounts toward zero
df['TotalClaims'] = df['TotalClaims'].astype(int)
# Claim frequency: share of rows with a positive claim amount
province_freq = df.groupby('Province')['TotalClaims'].apply(lambda x: (x > 0).mean())
# Severity proxy: mean TotalPremium over rows with a claim (not the mean claim amount)
def compute_claim_severity_fixed(df, group_col):
    return df[df['TotalClaims'] > 0].groupby(group_col)['TotalPremium'].mean()
province_sev = compute_claim_severity_fixed(df, 'Province')
province_margin = compute_margin(df, 'Province')

# Select two provinces for A/B testing
prov_a, prov_b = 'Gauteng', 'Western Cape'
group_a, group_b = split_groups(df, 'Province', prov_a, prov_b)

# Robust chi-squared test for claim frequency
def claim_frequency_contingency(group_a, group_b, metric):
    a_no_claims = (group_a[metric] == 0).sum()
    a_claims = (group_a[metric] > 0).sum()
    b_no_claims = (group_b[metric] == 0).sum()
    b_claims = (group_b[metric] > 0).sum()
    return np.array([[a_no_claims, a_claims], [b_no_claims, b_claims]])

contingency = claim_frequency_contingency(group_a, group_b, 'TotalClaims')
chi2, p, dof, expected = chi2_contingency(contingency)
freq_test = {'test': 'chi2', 'statistic': chi2, 'p_value': p}

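# Severity proxy: t-test on TotalPremium among rows with a positive claim
sev_test = run_ttest(group_a[group_a['TotalClaims']>0], group_b[group_b['TotalClaims']>0], 'TotalPremium')
# Note: this compares TotalPremium between groups, not TotalPremium - TotalClaims
margin_test = run_ttest(group_a, group_b, 'TotalPremium')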

print('Claim Frequency:', interpret_result(freq_test))
print('Claim Severity:', interpret_result(sev_test))
print('Margin:', interpret_result(margin_test))
print(business_recommendation('provinces', freq_test, prov_a, prov_b))



Claim Frequency: Reject the null hypothesis (p = 0.0000).
Claim Severity: Reject the null hypothesis (p = 0.0000).
Margin: Reject the null hypothesis (p = 0.0005).
We reject the null hypothesis for provinces. There is a significant difference between Gauteng and Western Cape. Consider adjusting segmentation or pricing accordingly.
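
The printed decisions follow a simple p-value threshold rule. A minimal sketch of such a rule is shown below. `describe_decision` is a hypothetical stand-in, and the significance level of 0.05 is an assumption, since the actual `interpret_result` in `utils.reporting` is not shown here.

```python
def describe_decision(test_result: dict, alpha: float = 0.05) -> str:
    """Map a test result to a reject / fail-to-reject message (assumed alpha)."""
    p = test_result["p_value"]
    if p < alpha:
        return f"Reject the null hypothesis (p = {p:.4f})."
    return f"Fail to reject the null hypothesis (p = {p:.4f})."

# Example p-values chosen for illustration.
print(describe_decision({"p_value": 0.0005}))
print(describe_decision({"p_value": 0.0579}))
```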


## Hypothesis 2: Risk Differences Between Zip Codes

- **Null Hypothesis (H₀):** There are no risk differences between zip codes.
- **Metrics:** Claim Frequency, Claim Severity, Margin
- **Test:** Chi-squared (frequency), t-test (severity, margin)
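
For the severity comparison, one possible approach is a Welch t-test on claim amounts among rows that actually have a claim. The sketch below uses simulated data, and treating `TotalClaims` as the claim amount is an assumption of this example. The notebook cell itself compares `TotalPremium` as a proxy.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated rows for two postal codes; roughly 30% of rows carry a claim.
toy = pd.DataFrame({
    "PostalCode": [2000] * 100 + [122] * 100,
    "TotalClaims": np.where(
        rng.random(200) < 0.3,
        rng.gamma(2.0, 5000.0, size=200),
        0.0,
    ),
})

# Severity is defined only where a claim occurred.
claimants = toy[toy["TotalClaims"] > 0]
sev_a = claimants.loc[claimants["PostalCode"] == 2000, "TotalClaims"]
sev_b = claimants.loc[claimants["PostalCode"] == 122, "TotalClaims"]

# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = ttest_ind(sev_a, sev_b, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```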

In [54]:
# Compute metrics for zip codes
postal_counts = df['PostalCode'].value_counts()
if len(postal_counts) < 2:
    print("Not enough unique postal codes for A/B testing.")
else:
    zip_a, zip_b = postal_counts.index[:2]
    group_a, group_b = split_groups(df, 'PostalCode', zip_a, zip_b)
    # Robust chi-squared test for claim frequency
    contingency = claim_frequency_contingency(group_a, group_b, 'TotalClaims')
    chi2, p, dof, expected = chi2_contingency(contingency)
    freq_test = {'test': 'chi2', 'statistic': chi2, 'p_value': p}
    # Use TotalPremium as a proxy for claim severity
    sev_test = run_ttest(group_a[group_a['TotalClaims']>0], group_b[group_b['TotalClaims']>0], 'TotalPremium')
    margin_test = run_ttest(group_a, group_b, 'TotalPremium')

    print('Claim Frequency:', interpret_result(freq_test))
    print('Claim Severity:', interpret_result(sev_test))
    print('Margin:', interpret_result(margin_test))
    print(business_recommendation('zip codes', freq_test, zip_a, zip_b))

Claim Frequency: Fail to reject the null hypothesis (p = 0.0579).
Claim Severity: Fail to reject the null hypothesis (p = 0.7958).
Margin: Reject the null hypothesis (p = 0.0000).
No significant difference found for zip codes between 2000 and 122. No change recommended.


## Hypothesis 3: Margin Differences Between Zip Codes

- **Null Hypothesis (H₀):** There are no significant margin (profit) differences between zip codes.
- **Metric:** Margin
- **Test:** t-test
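
If margin is taken to be premium minus claims per row, the t-test would be run on that derived column. The sketch below uses simulated data and a hypothetical `Margin` column to show the idea. It is an illustration under that assumption, not the computation performed in the cell that follows.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n = 150

# Simulated premiums and occasional claims for two postal codes.
toy = pd.DataFrame({
    "PostalCode": [2000] * n + [122] * n,
    "TotalPremium": rng.uniform(50.0, 500.0, size=2 * n),
    "TotalClaims": np.where(
        rng.random(2 * n) < 0.05,
        rng.gamma(2.0, 4000.0, size=2 * n),
        0.0,
    ),
})

# Row-level margin: premium received minus claims paid (assumed definition).
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]

margin_a = toy.loc[toy["PostalCode"] == 2000, "Margin"]
margin_b = toy.loc[toy["PostalCode"] == 122, "Margin"]

result = ttest_ind(margin_a, margin_b, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```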

In [55]:
# Margin difference test for zip codes
postal_counts = df['PostalCode'].value_counts()
if len(postal_counts) < 2:
    print("Not enough unique postal codes for margin difference test.")
else:
    zip_a, zip_b = postal_counts.index[:2]
    group_a, group_b = split_groups(df, 'PostalCode', zip_a, zip_b)
    margin_test = run_ttest(group_a, group_b, 'TotalPremium')
    print('Margin:', interpret_result(margin_test))
    print(business_recommendation('zip code margin', margin_test, zip_a, zip_b))

Margin: Reject the null hypothesis (p = 0.0000).
We reject the null hypothesis for zip code margin. There is a significant difference between 2000 and 122. Consider adjusting segmentation or pricing accordingly.


## Hypothesis 4: Risk Differences Between Women and Men

- **Null Hypothesis (H₀):** There are no significant risk differences between women and men.
- **Metrics:** Claim Frequency, Claim Severity, Margin
- **Test:** Chi-squared (frequency), t-test (severity, margin)

In [56]:
# Compute metrics for gender
group_a, group_b = split_groups(df, 'Gender', 'Female', 'Male')
# Robust chi-squared test for claim frequency
# (claim_frequency_contingency is defined in the Hypothesis 1 cell)
contingency = claim_frequency_contingency(group_a, group_b, 'TotalClaims')
chi2, p, dof, expected = chi2_contingency(contingency)
freq_test = {'test': 'chi2', 'statistic': chi2, 'p_value': p}
# Use TotalPremium as a proxy for claim severity
sev_test = run_ttest(group_a[group_a['TotalClaims']>0], group_b[group_b['TotalClaims']>0], 'TotalPremium')
margin_test = run_ttest(group_a, group_b, 'TotalPremium')

print('Claim Frequency:', interpret_result(freq_test))
print('Claim Severity:', interpret_result(sev_test))
print('Margin:', interpret_result(margin_test))
print(business_recommendation('gender', freq_test, 'Female', 'Male'))

Claim Frequency: Fail to reject the null hypothesis (p = 0.9515).
Claim Severity: Fail to reject the null hypothesis (p = 0.1995).
Margin: Reject the null hypothesis (p = 0.0000).
No significant difference found for gender between Female and Male. No change recommended.
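
The recommendation above is driven by the frequency test alone, while the margin test rejects the null. One way to surface that is to summarise all three results together. The sketch below uses a hypothetical `summarize_tests` helper and an assumed significance level of 0.05. The p-values are the rounded figures printed above, so the margin value of 0.0 stands in for a very small p-value.

```python
def summarize_tests(results: dict, alpha: float = 0.05) -> str:
    """List the metrics whose p-value falls below alpha (assumed threshold)."""
    significant = [name for name, res in results.items() if res["p_value"] < alpha]
    if not significant:
        return "No metric shows a significant difference."
    return "Significant difference in: " + ", ".join(significant) + "."

# Rounded p-values as printed for the gender comparison.
gender_results = {
    "claim frequency": {"p_value": 0.9515},
    "claim severity": {"p_value": 0.1995},
    "margin": {"p_value": 0.0},
}
print(summarize_tests(gender_results))
```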
