# A/B Hypothesis Testing

### Import Libraries
We will need libraries such as pandas, scipy, and statsmodels to conduct statistical tests.

In [11]:
import sys
sys.path.append('../src')
from data_loader import DataLoader
from data_quality_check import missing_values_summary
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from scipy import stats
import statsmodels.api as sm

### Load the DataFrame (Cleaned Data)

In [2]:
df=pd.read_csv('../data/cleaned_data.csv')

In [8]:
missing_values_summary(df)

Column,Missing Values,Percentage
UnderwrittenCoverID,0,0.0
PolicyID,0,0.0
TransactionMonth,0,0.0
IsVATRegistered,0,0.0
Citizenship,0,0.0
LegalType,0,0.0
Title,0,0.0
Language,0,0.0
Bank,0,0.0
AccountType,0,0.0


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999546 entries, 0 to 999545
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   UnderwrittenCoverID       999546 non-null  int64  
 1   PolicyID                  999546 non-null  int64  
 2   TransactionMonth          999546 non-null  object 
 3   IsVATRegistered           999546 non-null  bool   
 4   Citizenship               999546 non-null  object 
 5   LegalType                 999546 non-null  object 
 6   Title                     999546 non-null  object 
 7   Language                  999546 non-null  object 
 8   Bank                      999546 non-null  object 
 9   AccountType               999546 non-null  object 
 10  MaritalStatus             999546 non-null  object 
 11  Gender                    999546 non-null  object 
 12  Country                   999546 non-null  object 
 13  Province                  999546 non-null  o

### Formulate Null Hypotheses
1. There are no risk differences across provinces.
2. There are no risk differences between zip codes.
3. There is no significant margin (profit) difference between zip codes.
4. There is no significant risk difference between women and men.


### Select Metrics (KPIs)
- For risk differences (Hypotheses 1, 2, and 4): use claim frequency (the share of policies with TotalClaims > 0) or loss ratio (TotalClaims / TotalPremium) as the KPI.
- For profit margin differences (Hypothesis 3): use profit (TotalPremium - TotalClaims) as the KPI.
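
As a rough illustration of these KPIs, the sketch below computes claim frequency, aggregate loss ratio, and profit per group on a small invented table. The column names match the dataset, but the values and the `toy` and `kpis` names are assumptions made for the example.

```python
import pandas as pd

# Toy data for illustration only; the values are invented, not taken from the dataset.
toy = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Gauteng", "Western Cape", "Western Cape"],
    "TotalPremium": [100.0, 200.0, 100.0, 150.0, 50.0],
    "TotalClaims": [0.0, 300.0, 0.0, 0.0, 100.0],
})

# A row counts as "claimed" when TotalClaims is positive.
toy["ClaimMade"] = toy["TotalClaims"] > 0

kpis = toy.groupby("Province").agg(
    claim_frequency=("ClaimMade", "mean"),  # share of rows with a claim
    total_claims=("TotalClaims", "sum"),
    total_premium=("TotalPremium", "sum"),
)

# Aggregate loss ratio: total claims divided by total premium within the group.
kpis["loss_ratio"] = kpis["total_claims"] / kpis["total_premium"]
# Profit: total premium minus total claims within the group.
kpis["profit"] = kpis["total_premium"] - kpis["total_claims"]
print(kpis)
```

Note that this aggregate loss ratio (sum of claims over sum of premiums) is not the same quantity as the mean of per-row loss ratios used in the t-tests in this notebook.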

### Data Segmentation
We need to split the data into Group A (Control) and Group B (Test) for each hypothesis.

1. Hypothesis 1: Risk Differences Across Provinces
- Group A: Choose one province (e.g., Gauteng).
- Group B: Choose another province (e.g., Western Cape).
2. Hypothesis 2: Risk Differences Between Zip Codes
- Group A: Policies from one zip code (e.g., PostalCode == 2000).
- Group B: Policies from another zip code (e.g., PostalCode == 3000).
3. Hypothesis 3: Profit Margin Differences Between Zip Codes
- Group A: Zip codes with lower profit margins.
- Group B: Zip codes with higher profit margins.
4. Hypothesis 4: Risk Differences Between Women and Men
- Group A: Female policyholders.
- Group B: Male policyholders.

In [12]:
# For Hypothesis 1 (Provinces)
group_a = df[df['Province'] == 'Gauteng']
group_b = df[df['Province'] == 'Western Cape']
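
The same two-group split is needed for every hypothesis, so it could be wrapped in a small helper. The function below is a hypothetical convenience (`split_groups` is not part of the notebook or of pandas), shown on invented data.

```python
import pandas as pd

def split_groups(data, column, value_a, value_b, kpi):
    """Return the KPI values for Group A and Group B, dropping missing values."""
    group_a = data.loc[data[column] == value_a, kpi].dropna()
    group_b = data.loc[data[column] == value_b, kpi].dropna()
    return group_a, group_b

# Invented rows for illustration; None becomes NaN in the numeric column.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Other"],
    "LossRatio": [0.0, 1.5, None, 0.2, 0.0],
})

women, men = split_groups(toy, "Gender", "Female", "Male", "LossRatio")
print(len(women), len(men))
```

Rows that belong to neither group are ignored, and rows with a missing KPI are dropped before testing.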

### Statistical Testing
For each hypothesis, we will conduct statistical tests. Based on the data types:

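- t-test to compare the mean of a numerical KPI (such as LossRatio or profit) between two groups.
- Chi-squared test to check for an association between categorical variables (such as Gender and whether a claim was made).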

#### 1. Hypothesis 1: Risk Differences Across Provinces
We can use the t-test to compare the mean of LossRatio between provinces.

In [34]:
import numpy as np

# Step 1: Calculate the 'LossRatio' column
df['LossRatio'] = df['TotalClaims'] / df['TotalPremium']

# Step 2: Replace infinite values (zero premium) with NaN.
# Assigning the result back avoids the chained in-place call
# df['LossRatio'].replace(..., inplace=True), which pandas warns will stop working in pandas 3.0.
df['LossRatio'] = df['LossRatio'].replace([np.inf, -np.inf], np.nan)

# Drop rows with NaN values in 'LossRatio'
df_cleaned = df.dropna(subset=['LossRatio'])

# Step 3: Select the two provinces to compare
group_a = df_cleaned[df_cleaned['Province'] == 'Gauteng']  # Group A: Gauteng
group_b = df_cleaned[df_cleaned['Province'] == 'Northern Cape']  # Group B: Northern Cape

# Step 4: Perform t-test on the cleaned data
t_stat, p_value = stats.ttest_ind(group_a['LossRatio'], group_b['LossRatio'])

# Step 5: Print the results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

T-statistic: 1.31443426226578, P-value: 0.18870134052691323
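
To turn a p-value into a decision, it is compared against a significance level. The notebook does not state one, so the sketch below assumes the conventional alpha = 0.05; `decide` is a hypothetical helper, not a library function.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# P-value printed above for Gauteng vs. Northern Cape.
p_value_provinces = 0.18870134052691323
decision = decide(p_value_provinces)
print(decision)
```

Under that assumed threshold, the data give no evidence of a difference in mean LossRatio between these two provinces. This does not show that the risk is equal, and it says nothing about the other provinces.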


#### 2. Hypothesis 2: Risk Differences Between Zip Codes
- Null Hypothesis 2: There are no risk differences between zip codes.
- Test: We'll use the t-test to compare LossRatio across two different zip codes.

In [43]:
# Find zip codes with most entries
df.groupby('PostalCode')['LossRatio'].count().sort_values(ascending=False).head(10)


PostalCode
2000    90852
122     27898
299     16731
7784    13563
7405    10731
8000     8751
458      8392
2196     7277
470      7052
4360     6555
Name: LossRatio, dtype: int64

In [45]:
from scipy import stats

# Assume you want to compare two specific zip codes
group_zip_a = df[df['PostalCode'] == 2000]['LossRatio']
group_zip_b = df[df['PostalCode'] == 299]['LossRatio']

group_zip_a = group_zip_a.dropna()
group_zip_b = group_zip_b.dropna()

# Check the number of valid entries again
print(f"Group A (PostalCode 2000) count after filtering: {group_zip_a.shape[0]}")
print(f"Group B (PostalCode 299) count after filtering: {group_zip_b.shape[0]}")

# Perform t-test
t_stat_zip, p_value_zip = stats.ttest_ind(group_zip_a, group_zip_b, nan_policy='omit')

# Print results
print(f"T-statistic (zip codes): {t_stat_zip}, P-value (zip codes): {p_value_zip}")


Group A (PostalCode 2000) count after filtering: 90852
Group B (PostalCode 299) count after filtering: 16731
T-statistic (zip codes): 2.3012593882638934, P-value (zip codes): 0.021378875770699284
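
The two groups differ greatly in size (90852 vs. 16731 rows), and `stats.ttest_ind` pools the variances by default (`equal_var=True`). If the group variances also differ, Welch's t-test (`equal_var=False`) may be the safer choice. The sketch below uses synthetic data, not the insurance dataset, to show how the two variants can diverge; whether the variances of the real groups differ is not checked in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for two groups of very different size and spread.
large_group = rng.normal(loc=0.30, scale=1.0, size=5000)
small_group = rng.normal(loc=0.30, scale=3.0, size=500)

# equal_var=True is SciPy's default: Student's t-test with a pooled variance.
student = stats.ttest_ind(large_group, small_group, equal_var=True)

# equal_var=False switches to Welch's t-test, which does not pool the variances.
welch = stats.ttest_ind(large_group, small_group, equal_var=False)

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.3f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
```

When the smaller group has the larger variance, as in this synthetic case, the pooled test understates the standard error, so its t-statistic is larger in magnitude than Welch's.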


#### 3. Hypothesis 3: Margin (Profit) Differences Between Zip Codes
- Null Hypothesis 3: There are no significant margin (profit) differences between zip codes.
- Test: Use a t-test with a Margin or Profit metric. If you don't have a margin or profit column, calculate it (e.g., Profit = TotalPremium - TotalClaims).

In [46]:
# Calculate profit per row (the column is named 'ProfitMargin', but it holds
# the absolute profit TotalPremium - TotalClaims, not a ratio)
df['ProfitMargin'] = df['TotalPremium'] - df['TotalClaims']

# Select two zip codes to compare: 2000 and 299 (two of the largest groups)
profit_group_a = df[df['PostalCode'] == 2000]['ProfitMargin']
profit_group_b = df[df['PostalCode'] == 299]['ProfitMargin']

# Perform t-test
t_stat_profit, p_value_profit = stats.ttest_ind(profit_group_a, profit_group_b, nan_policy='omit')

# Print results
print(f"T-statistic (profit margins): {t_stat_profit}, P-value (profit margins): {p_value_profit}")


T-statistic (profit margins): -2.1221643230589944, P-value (profit margins): 0.03382548489498472
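
The metric tested above is absolute profit per row. A relative margin (profit as a fraction of premium) is a different quantity and could lead to a different result. The sketch below shows both on invented values; `Profit` and `MarginRatio` are illustrative column names, not columns of the dataset.

```python
import numpy as np
import pandas as pd

# Invented values for illustration, including a zero-premium row.
toy = pd.DataFrame({
    "TotalPremium": [100.0, 200.0, 0.0],
    "TotalClaims": [20.0, 250.0, 0.0],
})

# Absolute profit per row, the quantity used in the t-test above.
toy["Profit"] = toy["TotalPremium"] - toy["TotalClaims"]

# Relative margin: profit as a fraction of premium.
# Division by a zero premium yields inf or NaN; both are mapped to NaN.
toy["MarginRatio"] = (toy["Profit"] / toy["TotalPremium"]).replace(
    [np.inf, -np.inf], np.nan
)
print(toy)
```

Rows with a zero premium have no defined margin and would need to be dropped before testing.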


#### 4. Hypothesis 4: Risk Differences Between Women and Men
- Null Hypothesis 4: There are no significant risk differences between women and men.
- Test: Use a t-test to compare the LossRatio between the two gender groups.

In [47]:
# Split data by gender
group_men = df[df['Gender'] == 'Male']['LossRatio']
group_women = df[df['Gender'] == 'Female']['LossRatio']

# Perform t-test
t_stat_gender, p_value_gender = stats.ttest_ind(group_men, group_women, nan_policy='omit')

# Print results
print(f"T-statistic (gender): {t_stat_gender}, P-value (gender): {p_value_gender}")


T-statistic (gender): -0.9117797544015281, P-value (gender): 0.3618943414070056


#### Chi-Squared Test for Categorical Data

In [48]:
from scipy.stats import chi2_contingency

# Create a contingency table for Gender and whether a claim was made (e.g., based on non-zero TotalClaims)
df['ClaimMade'] = df['TotalClaims'] > 0
contingency_table = pd.crosstab(df['Gender'], df['ClaimMade'])

# Perform chi-squared test
chi2_stat, p_val_chi, dof, ex = chi2_contingency(contingency_table)

# Print results
print(f"Chi-squared statistic: {chi2_stat}, P-value: {p_val_chi}")


Chi-squared statistic: 6.7602110506096444, P-value: 0.03404386205612263
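
`chi2_contingency` also returns the degrees of freedom and the expected counts under independence, which the cell above unpacks into `dof` and `ex` but does not print. The sketch below shows how they might be inspected, using an invented 3x2 table with generic group labels rather than the real crosstab.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts for illustration; rows are groups, columns are claim outcomes.
table = pd.DataFrame(
    {"NoClaim": [900, 950, 480], "Claim": [100, 50, 20]},
    index=["Group 1", "Group 2", "Group 3"],
)

chi2_stat, p_val, dof, expected = chi2_contingency(table)

# Degrees of freedom = (rows - 1) * (columns - 1).
print(f"dof: {dof}")

# Expected counts if group and claim outcome were independent.
expected_df = pd.DataFrame(expected, index=table.index, columns=table.columns)
print(expected_df.round(1))
```

A common rule of thumb is that the chi-squared approximation is reliable when every expected count is at least 5, so it is worth checking cells for rare outcomes such as claims.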


In [16]:
df.columns

Index(['UnderwrittenCoverID', 'PolicyID', 'TransactionMonth',
       'IsVATRegistered', 'Citizenship', 'LegalType', 'Title', 'Language',
       'Bank', 'AccountType', 'MaritalStatus', 'Gender', 'Country', 'Province',
       'PostalCode', 'MainCrestaZone', 'SubCrestaZone', 'ItemType', 'mmcode',
       'VehicleType', 'RegistrationYear', 'make', 'Model', 'Cylinders',
       'cubiccapacity', 'kilowatts', 'bodytype', 'NumberOfDoors',
       'VehicleIntroDate', 'AlarmImmobiliser', 'TrackingDevice',
       'CapitalOutstanding', 'NewVehicle', 'WrittenOff', 'Rebuilt',
       'Converted', 'CrossBorder', 'SumInsured', 'TermFrequency',
       'CalculatedPremiumPerTerm', 'ExcessSelected', 'CoverCategory',
       'CoverType', 'CoverGroup', 'Section', 'Product', 'StatutoryClass',
       'StatutoryRiskType', 'TotalPremium', 'TotalClaims', 'LossRatio'],
      dtype='object')