### A/B testing
This notebook loads the cleaned insurance data from a CSV file (cleaned_data_v4.csv) into a pandas DataFrame named data, then explores it by printing the first few rows, the column data types, and descriptive statistics.

The A/B hypothesis tests are implemented in the ABTesting class, which is imported from scripts.AB_Testing. Later cells wrap the DataFrame in this class and call one method per test.

In [1]:
import os
import sys

In [2]:
sys.path.append(os.path.abspath('..'))

In [3]:
import pandas as pd

In [4]:
data = pd.read_csv("C:\\Users\\nadew\\10x\\week3\\ACIS\\data\\cleaned_data\\cleaned_data_v4.csv", low_memory=False)

In [5]:
data.head()

Unnamed: 0,UnderwrittenCoverID,PolicyID,TransactionMonth,IsVATRegistered,Citizenship,LegalType,Title,Language,Bank,AccountType,...,ExcessSelected,CoverCategory,CoverType,CoverGroup,Section,Product,StatutoryClass,StatutoryRiskType,TotalPremium,TotalClaims
0,145249,12827,2015-03-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
1,145249,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
2,145249,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0
3,145255,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,512.84807,0.0
4,145255,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000098 entries, 0 to 1000097
Data columns (total 52 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   UnderwrittenCoverID       1000098 non-null  int64  
 1   PolicyID                  1000098 non-null  int64  
 2   TransactionMonth          1000098 non-null  object 
 3   IsVATRegistered           1000098 non-null  bool   
 4   Citizenship               1000098 non-null  object 
 5   LegalType                 1000098 non-null  object 
 6   Title                     1000098 non-null  object 
 7   Language                  1000098 non-null  object 
 8   Bank                      854137 non-null   object 
 9   AccountType               959866 non-null   object 
 10  MaritalStatus             991839 non-null   object 
 11  Gender                    990562 non-null   object 
 12  Country                   1000098 non-null  object 
 13  Province                  1

In [7]:
data.describe()

Unnamed: 0,UnderwrittenCoverID,PolicyID,PostalCode,mmcode,RegistrationYear,Cylinders,cubiccapacity,kilowatts,NumberOfDoors,CustomValueEstimate,NumberOfVehiclesInFleet,SumInsured,CalculatedPremiumPerTerm,TotalPremium,TotalClaims
count,1000098.0,1000098.0,1000098.0,999546.0,1000098.0,999546.0,999546.0,999546.0,999546.0,220456.0,0.0,1000098.0,1000098.0,1000098.0,1000098.0
mean,104817.5,7956.682,3020.601,54877700.0,2010.225,4.046642,2466.743258,97.207919,4.01925,225531.1,,604172.7,117.8757,61.9055,64.86119
std,63293.71,5290.039,2649.854,13603810.0,3.261391,0.29402,442.80064,19.393256,0.468314,564515.7,,1508332.0,399.7017,230.2845,2384.075
min,1.0,14.0,1.0,4041200.0,1987.0,0.0,0.0,0.0,0.0,20000.0,,0.01,0.0,-782.5768,-12002.41
25%,55143.0,4500.0,827.0,60056920.0,2008.0,4.0,2237.0,75.0,4.0,135000.0,,5000.0,3.2248,0.0,0.0
50%,94083.0,7071.0,2000.0,60058420.0,2011.0,4.0,2694.0,111.0,4.0,220000.0,,7500.0,8.4369,2.178333,0.0
75%,139190.0,11077.0,4180.0,60058420.0,2013.0,4.0,2694.0,111.0,4.0,280000.0,,250000.0,90.0,21.92982,0.0
max,301175.0,23246.0,9870.0,65065350.0,2015.0,10.0,12880.0,309.0,6.0,26550000.0,,12636200.0,74422.17,65282.6,393092.1


In [8]:
from scripts.AB_Testing import ABTesting

The ABTesting class takes the insurance DataFrame as input and provides methods for the following A/B hypothesis tests:

1. Test for risk differences across provinces
2. Test for risk differences between zip codes
3. Test for a significant margin (profit) difference between zip codes
4. Test for a significant risk difference between women and men

For each hypothesis, the corresponding method computes a test statistic (e.g., chi-square or t-test) and reports whether the null hypothesis is rejected, based on the p-value.
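The implementation of ABTesting is not shown in this notebook, so the following is only a minimal sketch of how a p-value-based decision might look for the province test. It assumes that "risk" is measured as whether a policy row has a claim (`TotalClaims > 0`), uses a chi-square test of independence from SciPy, and runs on made-up toy data. The helper name `chi_square_decision` and the 0.05 significance level are illustrative assumptions, not taken from the original code.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy rows standing in for the insurance data (values are made up).
toy = pd.DataFrame({
    "Province": ["A"] * 6 + ["B"] * 6,
    "TotalClaims": [0, 0, 0, 0, 0, 120.0, 0, 300.0, 50.0, 0, 75.0, 10.0],
})


def chi_square_decision(df, group_col, alpha=0.05):
    """Hypothetical helper: chi-square test of independence between
    a grouping column and whether a row has a claim."""
    has_claim = df["TotalClaims"] > 0
    # Contingency table: one row per group, columns = no claim / claim.
    table = pd.crosstab(df[group_col], has_claim)
    chi2, p_value, dof, _expected = chi2_contingency(table)
    decision = "Reject" if p_value < alpha else "Fail to reject"
    return chi2, p_value, decision


chi2, p_value, decision = chi_square_decision(toy, "Province")
print(f"chi2={chi2:.3f}, p={p_value:.3f}: {decision} the null hypothesis")
```

The same helper could be pointed at another categorical column, such as a postal code, by changing `group_col`. With many sparse categories the chi-square approximation becomes less reliable, so a real analysis might need to group or filter categories first.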

In [9]:
data = ABTesting(data)

In [10]:
data.test_risk_across_provinces()


Testing risk differences across provinces...
Reject the null hypothesis. There are significant risk differences across provinces.


### Observations

#### Testing hypothesis: There are no risk differences across provinces

Result: Reject the null hypothesis. There are significant risk differences.

-> The analysis found statistically significant differences in risk (total claims) across provinces. In other words, the province where the insured person lives is significantly associated with insurance risk.


In [11]:
data.test_risk_between_zipcodes()


Testing risk differences between zip codes...
Reject the null hypothesis. There are significant risk differences between zip codes.


#### Testing hypothesis: There are no risk differences between zip codes

Result: Reject the null hypothesis. There are significant risk differences.

-> As with the previous test, this result indicates that the zip code of the insured person is significantly associated with insurance risk.

In [13]:
data.test_margin_difference_zipcodes()


Testing margin differences between zip codes...
Fail to reject the null hypothesis. There are no significant margin differences between zip codes.


#### Testing hypothesis: There is no significant margin (profit) difference between zip codes

Result: Fail to reject the null hypothesis. There are no significant margin differences between zip codes.

-> In this case, the analysis did not find statistically significant differences in profit margin (total premium - total claims) between zip codes. The data therefore provide no evidence that the location of the insured person affects the insurance company's profitability, although failing to reject the null hypothesis does not prove that no difference exists.
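To make the margin definition concrete, here is a small sketch that computes margin as total premium minus total claims and compares it across zip codes. It assumes the `PostalCode` column identifies the zip code and uses a one-way ANOVA (`scipy.stats.f_oneway`) on made-up toy data. The test actually used inside `test_margin_difference_zipcodes` is not shown in this notebook and may differ.

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy rows standing in for the insurance data (values are made up).
toy = pd.DataFrame({
    "PostalCode": [1000, 1000, 1000, 2000, 2000, 2000, 3000, 3000, 3000],
    "TotalPremium": [20.0, 25.0, 22.0, 21.0, 24.0, 23.0, 22.0, 26.0, 20.0],
    "TotalClaims": [0.0, 5.0, 0.0, 0.0, 4.0, 2.0, 1.0, 6.0, 0.0],
})

# Margin as described in the text: total premium minus total claims.
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]

# Build one sample of margins per zip code.
samples = [group["Margin"].to_numpy() for _, group in toy.groupby("PostalCode")]

# One-way ANOVA: the null hypothesis is that all zip codes share the same mean margin.
f_stat, p_value = f_oneway(*samples)

alpha = 0.05  # assumed significance level
decision = "Reject" if p_value < alpha else "Fail to reject"
print(f"F={f_stat:.3f}, p={p_value:.3f}: {decision} the null hypothesis")
```

In this toy data the mean margins of the three zip codes are nearly identical, so the test fails to reject the null hypothesis, mirroring the kind of outcome reported above.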


In [15]:
data.test_risk_by_gender()


Testing risk differences between genders...
Reject the null hypothesis. There are significant risk differences between genders.



#### Testing hypothesis: There is no significant risk difference between women and men

Result: Reject the null hypothesis. There are significant risk differences between genders.

-> The analysis found statistically significant differences in total claims between women and men, so the gender of the insured person is significantly associated with insurance risk.
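A two-group comparison like this one could be sketched with Welch's t-test, which does not assume equal variances in the two groups. The example below is an illustration on made-up toy data, not the notebook's actual method. It assumes the `Gender` column holds the labels "Female" and "Male" (the real labels are not shown in the output above) and drops rows with a missing gender, since `data.info()` reports fewer non-null values for `Gender` than there are rows.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy rows standing in for the insurance data (values and labels are made up).
toy = pd.DataFrame({
    "Gender": ["Female"] * 5 + ["Male"] * 5 + [None],
    "TotalClaims": [0.0, 10.0, 0.0, 5.0, 0.0,
                    100.0, 120.0, 90.0, 110.0, 130.0,
                    40.0],
})

# Rows without a recorded gender cannot be assigned to either group.
known = toy.dropna(subset=["Gender"])

women = known.loc[known["Gender"] == "Female", "TotalClaims"]
men = known.loc[known["Gender"] == "Male", "TotalClaims"]

# Welch's t-test: equal_var=False relaxes the equal-variance assumption.
t_stat, p_value = ttest_ind(women, men, equal_var=False)

alpha = 0.05  # assumed significance level
decision = "Reject" if p_value < alpha else "Fail to reject"
print(f"t={t_stat:.3f}, p={p_value:.5f}: {decision} the null hypothesis")
```

Claim amounts in insurance data are typically dominated by zeros and a few large values, so a real analysis may prefer to compare claim frequency or use a non-parametric test instead.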