# A/B Hypothesis Testing 

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np



In [3]:
# Add the project root to the import path so the `scripts` package can be found
import sys
import os
sys.path.append(os.path.abspath('..'))

In [4]:
from scripts.ab_hypothesis_testing import ABHypothesisTesting

In [5]:
# Load the insurance dataset
df = pd.read_csv('../data/insurance_data.csv')

In [6]:
# Initialize the hypothesis testing class
hypothesis_testing = ABHypothesisTesting(df)


## Risk Differences Across Provinces

### Analysis Objective
This test examines whether there are significant differences in **risk levels** (measured by `Total Claims`) across provinces. The goal is to understand how risk varies regionally, which can inform province-specific policies or risk management strategies.

### Hypotheses
- **Null Hypothesis (H₀):** No risk differences across provinces.
- **Alternative Hypothesis (H₁):** Risk differences exist across provinces.

In [7]:
# Test if there are risk differences across provinces
province_test_result = hypothesis_testing.test_risk_across_provinces()
province_test_result


{'Test': 'ANOVA',
 'Null Hypothesis': 'No risk differences across provinces',
 'F-Statistic': 8.626482565483611,
 'p-Value': 0.0001930246778774567,
 'Reject Null': True}


### Results
- **Test Type:** ANOVA (Analysis of Variance)
  - Compares the mean `Total Claims` across multiple provinces.
- **F-Statistic:** 8.626
  - The ratio of between-province to within-province variance in `Total Claims`. Values well above 1 indicate that province means differ by more than within-province variation alone would suggest.
- **p-Value:** 0.000193
  - This value is well below the common significance level of 0.05: differences this large would be very unlikely if all provinces had the same mean `Total Claims`.
- **Decision:** Reject the null hypothesis (H₀).
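The internals of `test_risk_across_provinces()` are not shown in this notebook. As a rough sketch of how a one-way ANOVA of this kind can be run, the example below uses `scipy.stats.f_oneway` on synthetic data. The column names `Province` and `TotalClaims`, the toy values, and the 0.05 threshold are assumptions made for illustration and may not match the real dataset or class.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data. "Province" and "TotalClaims" are assumed column
# names, not necessarily those used in insurance_data.csv.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Province": np.repeat(["A", "B", "C"], 200),
    "TotalClaims": np.concatenate([
        rng.normal(100, 20, 200),
        rng.normal(105, 20, 200),
        rng.normal(120, 20, 200),
    ]),
})

# Build one array of claim values per province.
groups = [g["TotalClaims"].to_numpy() for _, g in toy.groupby("Province")]

# One-way ANOVA: compares between-group variance to within-group variance.
f_stat, p_value = stats.f_oneway(*groups)

alpha = 0.05  # assumed significance level
result = {
    "Test": "ANOVA",
    "F-Statistic": float(f_stat),
    "p-Value": float(p_value),
    "Reject Null": bool(p_value < alpha),
}
print(result)
```

The returned dictionary mirrors the shape of the output shown above, but the numbers come from the toy data, not from the insurance dataset.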

### Conclusion
There is strong evidence to conclude that **significant risk differences exist across provinces**.

### Implications
1. **Risk Management:** Provinces with higher average claims may require stricter risk mitigation measures, while lower-risk provinces could benefit from premium reductions.
2. **Pricing Strategy:** Develop province-specific premium structures to reflect the risk profile of each province.
3. **Further Analysis:** Investigate the factors contributing to risk differences, such as demographic, geographic, or economic factors.



## Risk Differences Between Zip Codes

### Analysis Objective
This test examines whether there are significant differences in **risk levels** (measured by `Total Claims`) between zip codes. The goal is to evaluate how risk varies geographically at a finer level, providing insights for localized strategies.

### Hypotheses
- **Null Hypothesis (H₀):** No risk differences between zip codes.
- **Alternative Hypothesis (H₁):** Risk differences exist between zip codes.

In [8]:
# Test if there are risk differences between zip codes
zipcode_test_result = hypothesis_testing.test_risk_between_zipcodes()
zipcode_test_result


{'Test': 'ANOVA',
 'Null Hypothesis': 'No risk differences between zip codes',
 'F-Statistic': 1.6365345978982229,
 'p-Value': 0.1951758838795849,
 'Reject Null': False}

### Results
- **Test Type:** ANOVA (Analysis of Variance)
  - Compares the mean `Total Claims` across multiple zip codes.
- **F-Statistic:** 1.637
  - The variance in `Total Claims` between zip codes is only modestly larger than the variance within zip codes.
- **p-Value:** 0.195
  - This value is greater than the common significance level of 0.05, so the observed differences are consistent with random variation.
- **Decision:** Fail to reject the null hypothesis (H₀).

### Conclusion
There is **insufficient evidence** to conclude that significant risk differences exist between zip codes. Any observed differences in `Total Claims` are consistent with random variation.

### Implications
1. **Uniform Risk Strategy:** This test gives no statistical basis for differentiating risk management or pricing by zip code. Note that failing to reject H₀ does not prove that risk is identical across zip codes.
2. **Further Exploration:** Investigate other variables (e.g., demographics, policy types) that may reveal meaningful differences between zip codes.
3. **Combination Analysis:** Consider analyzing zip code risk differences alongside other factors, such as region or gender, for a more comprehensive understanding.

### Next Steps
- Visualize `Total Claims` by zip code to check whether any meaningful trends were missed by the test.
- Conduct subgroup analyses (e.g., zip code and gender combinations) to uncover any hidden patterns.
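A subgroup breakdown like the one suggested above could start from a grouped summary. The sketch below uses a small made-up table, and the column names `PostalCode`, `Gender`, and `TotalClaims` are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample rows; the column names are assumed, not taken from the dataset.
sample = pd.DataFrame({
    "PostalCode": [1000, 1000, 1000, 2000, 2000, 2000],
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "TotalClaims": [0.0, 200.0, 100.0, 50.0, 150.0, 0.0],
})

# Count and mean of claims for each zip code and gender combination.
summary = (
    sample.groupby(["PostalCode", "Gender"])["TotalClaims"]
    .agg(["count", "mean"])
    .reset_index()
)
print(summary)
```

Small subgroup counts make means unstable, so the `count` column is worth checking before reading anything into a difference.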


## Margin Differences Between Zip Codes

### Analysis Objective
This test examines whether **profit margins** (calculated as `Premium - Total Claims`) differ significantly across zip codes. Profit margin reflects the profitability of policies in each zip code, making this analysis critical for identifying underperforming or highly profitable areas.

### Key Difference from Risk Differences Test
- **Test Risk Between Zip Codes:**
  - Focuses on analyzing **Total Claims** to determine if there are significant risk differences between zip codes.
  - Guides **risk management** strategies by identifying regions with higher or lower claims.
- **Test Margin Differences Between Zip Codes:**
  - Focuses on analyzing **Profit Margins** (`Premium - Total Claims`) to determine if there are significant profitability differences between zip codes.
  - Guides **pricing and profitability optimization** by highlighting regions that are overperforming or underperforming.
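To make the margin definition concrete, here is a minimal sketch of computing `Premium - Total Claims` per policy and averaging it by zip code. The column names `PostalCode`, `TotalPremium`, and `TotalClaims` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical policies; column names and values are illustrative only.
policies = pd.DataFrame({
    "PostalCode": [1000, 1000, 2000, 2000],
    "TotalPremium": [500.0, 450.0, 300.0, 320.0],
    "TotalClaims": [100.0, 0.0, 400.0, 50.0],
})

# Profit margin per policy: premium collected minus claims paid.
policies["Margin"] = policies["TotalPremium"] - policies["TotalClaims"]

# Average margin per zip code; negative per-policy margins pull the mean down.
mean_margin = policies.groupby("PostalCode")["Margin"].mean()
print(mean_margin)
```

A zip code can have low claims and still show a low margin if its premiums are also low, which is why the margin test can disagree with the risk test.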

### Hypotheses
- **Null Hypothesis (H₀):** No significant margin differences between zip codes.
- **Alternative Hypothesis (H₁):** Significant margin differences exist between zip codes.

In [9]:
# Test if there are significant margin differences between zip codes
margin_test_result = hypothesis_testing.test_margin_difference_between_zipcodes()
margin_test_result


{'Test': 'ANOVA',
 'Null Hypothesis': 'No significant margin differences between zip codes',
 'F-Statistic': 14.131693475190742,
 'p-Value': 8.872893794662912e-07,
 'Reject Null': True}

### Results
- **Test Type:** ANOVA (Analysis of Variance)
  - Compares the mean profit margin across multiple zip codes.
- **F-Statistic:** 14.132
  - Indicates that profit margins vary much more between zip codes than within them.
- **p-Value:** 8.87e-07
  - This value is far below the common significance level of 0.05: differences this large would be very unlikely if all zip codes had the same mean margin.
- **Decision:** Reject the null hypothesis (H₀).

### Conclusion
There is strong evidence to conclude that **significant profit margin differences exist between zip codes**.

### Implications
1. **Profit Optimization:** Tailor strategies to specific zip codes based on their profitability. For example:
   - Focus on enhancing profitability in underperforming zip codes.
   - Sustain or expand in highly profitable zip codes.
2. **Pricing Adjustments:** Evaluate zip codes with lower profit margins and consider revising premiums or reducing costs.
3. **Further Exploration:** Investigate factors driving these differences, such as demographics, policy mix, or claim frequency.



## Risk Differences Between Genders

### Analysis Objective
This test examines whether there are significant differences in **risk levels** (measured by `Total Claims`) between genders. Understanding risk differences by gender can help insurers design gender-specific policies or adjust premiums based on claims data.

### Hypotheses
- **Null Hypothesis (H₀):** No significant risk differences between women and men.
- **Alternative Hypothesis (H₁):** Significant risk differences exist between women and men.

In [10]:
# Test if there are significant risk differences between genders
gender_test_result = hypothesis_testing.test_risk_difference_gender()
gender_test_result


{'Test': 'T-Test',
 'Null Hypothesis': 'No significant risk differences between women and men',
 'T-Statistic': 3.5693545588418787,
 'p-Value': 0.00037512158401762627,
 'Reject Null': True}

### Results
- **Test Type:** T-Test (Independent Samples)
  - Compares the mean `Total Claims` between two independent groups: women and men.
- **T-Statistic:** 3.569
  - Indicates the magnitude of the difference between the means relative to the variability within groups.
- **p-Value:** 0.000375
  - This value is well below the common significance level of 0.05: a difference this large would be very unlikely if women and men had the same mean `Total Claims`.
- **Decision:** Reject the null hypothesis (H₀).
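For reference, an independent-samples t-test of this kind can be sketched with `scipy.stats.ttest_ind`. The data below are synthetic, and the choice of Welch's variant (`equal_var=False`) is an assumption; the notebook does not show which variant `test_risk_difference_gender()` uses.

```python
import numpy as np
from scipy import stats

# Synthetic claim amounts for two groups; values are illustrative only.
rng = np.random.default_rng(1)
claims_women = rng.normal(100, 25, 300)
claims_men = rng.normal(120, 25, 300)

# Welch's t-test does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(claims_men, claims_women, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = bool(p_value < alpha)
print({"T-Statistic": float(t_stat), "p-Value": float(p_value), "Reject Null": reject_null})
```

The sign of the t-statistic depends on the order of the arguments: here a positive value means the first group (`claims_men`) has the higher sample mean.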

### Conclusion
There is strong evidence to conclude that **significant risk differences exist between women and men**.

### Implications
1. **Gender-Specific Strategies:**
   - If men or women exhibit consistently higher claims, tailor policies, premiums, or risk mitigation strategies accordingly.
2. **Premium Adjustments:**
   - For the gender with lower average claims, consider offering reduced premiums to attract more clients.
3. **Further Analysis:**
   - Investigate underlying factors contributing to the differences, such as claim frequency, type of coverage, or demographic influences.
