# Hypothesis Testing

The dataset offers a good opportunity to practise real-world hypothesis testing. For example, what do you do when traditional techniques such as the Z test, t-test, or ANOVA cannot be applied because the data is not normally distributed? And how do you handle groups with unequal variances, step by step?

I have divided the hypothesis sections into:
- Two Sample Tests
- Multiple Sample Tests
- Non Numeric Tests (chi2)

This notebook should help you get started with these kinds of tests.

There are a few things to keep in mind though:
- I have not treated outliers, so the results might differ somewhat after outlier treatment. Outlier treatment also cannot be applied directly to the insurance charge variable: the data has many levels (groups), and you would first need to work out the right level at which to remove outliers.
- The code in every section could be converted into a function; I have deliberately not done that, for ease of understanding.
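
For readers who do want a function, here is a minimal sketch of how the two-sample workflow used in this notebook (normality check, variance check, then the choice of test) could be wrapped up. The helper name `two_sample_report`, its `alpha` parameter, and the synthetic samples are assumptions made for illustration; they are not part of the notebook's own code.

```python
import numpy as np
from scipy import stats


def two_sample_report(a, b, alpha=0.05):
    """Hypothetical helper: check assumptions, then pick a two-sample test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
    normal = all(stats.shapiro(s)[1] >= alpha for s in (a, b))
    # Levene (median-centred): the null hypothesis is equal variances.
    equal_var = bool(stats.levene(a, b, center='median')[1] >= alpha)

    if normal:
        name = "Student's t-test" if equal_var else "Welch's t-test"
        result = stats.ttest_ind(a, b, equal_var=equal_var)
    else:
        name = 'Mann-Whitney U'
        result = stats.mannwhitneyu(a, b, alternative='two-sided')

    return {'normal': normal, 'equal_var': equal_var,
            'test': name, 'pvalue': float(result[1])}


# Synthetic, clearly skewed samples (not the insurance data).
rng = np.random.default_rng(0)
skewed_a = rng.exponential(scale=1.0, size=200)
skewed_b = rng.exponential(scale=3.0, size=200)
report = two_sample_report(skewed_a, skewed_b)
print(report)
```

With the insurance data, the same call would take two `charges` series, for example the smoker and non-smoker charges.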

In [None]:
### Importing all the relevant libraries ###

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

from scipy.stats import shapiro, levene
import scipy.stats as stats
import statistics

In [None]:
### Read in the data ###
data = pd.read_csv('../input/insurance/insurance (2).csv')
print(data.shape)
data.head()

In [None]:
data.info()

In [None]:
### Convert BMI to proper encodings ###
def bmi_encoder(x):
  if x < 18.5:
    return 'underweight'
  elif x < 25:
    return 'normal'
  elif x < 30:
    return 'overweight'
  else:
    return 'obese'

In [None]:
data['bmi'] = data['bmi'].apply(bmi_encoder)
data.head()
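
As an alternative sketch, the same binning can be expressed with `pd.cut`. The BMI values below are made up to sit on and around the category boundaries; `right=False` is used so that the intervals match the `<` comparisons in `bmi_encoder`.

```python
import numpy as np
import pandas as pd

# Made-up BMI values chosen to sit on and around the category boundaries.
bmi_values = pd.Series([18.4, 18.5, 24.9, 25.0, 29.9, 30.0])

bmi_groups = pd.cut(
    bmi_values,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=['underweight', 'normal', 'overweight', 'obese'],
    right=False,  # intervals are [low, high), matching the `<` comparisons
)
print(bmi_groups.tolist())
```

One practical difference is that `pd.cut` returns an ordered categorical, so group-bys and crosstabs list the categories from underweight to obese rather than alphabetically.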

In [None]:
### Mean insurance charge by each variable ###

fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(15, 8))

# (column, panel title, bar colour) for each subplot, in row-major order
panels = [
  ('age', 'Age', 'g'),
  ('sex', 'Gender', 'y'),
  ('bmi', 'BMI', 'r'),
  ('children', 'Number of Children', 'b'),
  ('smoker', 'Smokers vs Non Smokers', 'grey'),
  ('region', 'Region', 'm'),
]

for axis, (column, title, color) in zip(ax.flat, panels):
  # Compute the group means once per panel instead of once per argument
  mean_charges = data.groupby(by=[column]).agg({'charges': 'mean'}).reset_index()
  axis.bar(x=mean_charges[column], height=mean_charges['charges'], color=color)
  axis.title.set_text(title)

plt.show()

# Hypotheses we could test after observing the data:


**Two Sample Tests:**

- Smokers have higher insurance charges than non-smokers
- Insurance charges are similar across genders
- Having children does not affect insurance charges

**Multiple Populations:**

- All 4 regions have similar insurance charges
- Insurance charges differ across BMI groups
- The number of children does not affect insurance charges

**Non-Numeric Tests:**

- Males and females have different BMI distributions
- The male/female distribution is the same across regions
- Several more combinations are covered in that section


## Two Sample Tests

### Smokers vs Non Smoker Insurance Charge

In [None]:
smoker_insurance_charges = data[data['smoker'] == 'yes']['charges']
non_smoker_insurance_charges = data[data['smoker'] == 'no']['charges']

In [None]:
### Step 1 - Check whether the data is normally distributed ###
### The Shapiro-Wilk test tests the null hypothesis that the ###
### data was drawn from a normal distribution ###
### If the p-value is < 0.05, we reject normality ###

smoker_dist = shapiro(smoker_insurance_charges)
non_smoker_dist = shapiro(non_smoker_insurance_charges)

print('pvalue for smoker Distribution: ', smoker_dist[1])
print('pvalue for non smoker Distribution: ', non_smoker_dist[1])

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,5))

ax[0].hist(x=smoker_insurance_charges, bins=15, color = 'g')
ax[0].title.set_text('Smoker Insurance Charge Distribution')

ax[1].hist(x=non_smoker_insurance_charges, bins=15, color = 'grey')
ax[1].title.set_text('Non Smoker Insurance Charge Distribution')

plt.show()

print('Here we clearly see that neither variable follows a normal distribution')

In [None]:
### Step 2: Test whether both distributions have equal variance ###
### The Levene test tests the null hypothesis that all input samples ###
### are from populations with equal variances. Levene's test is an ###
### alternative to Bartlett's test (scipy.stats.bartlett) when ###
### there are significant deviations from normality. ###

lavene_test = levene(smoker_insurance_charges, non_smoker_insurance_charges)

print('pvalue for equal variance: ', lavene_test[1])
print('Variance of Smokers Insurance Charges', statistics.variance(smoker_insurance_charges))
print('Variance of Non Smokers Insurance Charges', statistics.variance(non_smoker_insurance_charges))
print('Var Smoker / Var Non Smoker', statistics.variance(smoker_insurance_charges)/statistics.variance(non_smoker_insurance_charges))

In [None]:
### Step 3: Since the distributions are not normal ###
### and do not have equal variance, we use the Mann-Whitney U test ###

different = stats.mannwhitneyu(smoker_insurance_charges, non_smoker_insurance_charges, alternative='two-sided')
sm_charge_lt_nsm = stats.mannwhitneyu(smoker_insurance_charges, non_smoker_insurance_charges, alternative='less')
sm_charge_gt_nsm = stats.mannwhitneyu(smoker_insurance_charges, non_smoker_insurance_charges, alternative='greater')

if different[1] < 0.05:
  print('The 2 distributions are Different')

if sm_charge_lt_nsm[1] < 0.05:
  print('Smokers have lower charges than Non Smokers')

if sm_charge_gt_nsm[1] < 0.05:
  print('Smokers have higher charges than Non Smokers')
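
A p-value says nothing about how large the difference is. One effect size that pairs naturally with the Mann-Whitney U test is the probability that a randomly chosen value from the first sample exceeds a randomly chosen value from the second (counting ties as one half), which equals the first sample's U statistic divided by the number of pairs. The sketch below computes it directly from all pairs; the helper name `prob_superiority` and the numbers are assumptions for illustration only.

```python
import numpy as np


def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X == Y) over all pairs (hypothetical helper)."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float(np.mean(x > y) + 0.5 * np.mean(x == y))


# Made-up numbers, not the insurance data.
separated = prob_superiority([10, 11, 12], [1, 2, 3])  # every x exceeds every y
identical = prob_superiority([1, 2, 3], [1, 2, 3])     # no tendency either way
print(separated, identical)
```

A value near 0.5 means neither group tends to have higher values, while values near 0 or 1 indicate strong separation.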

### Male vs Female Insurance Charge

In [None]:
male_insurance_charges = data[data['sex'] == 'male']['charges']
female_insurance_charges = data[data['sex'] == 'female']['charges']

In [None]:
male_dist = shapiro(male_insurance_charges)
female_dist = shapiro(female_insurance_charges)

print('pvalue for Male Distribution: ', male_dist[1])
print('pvalue for Female Distribution: ', female_dist[1])

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2,figsize =(15,8))

ax[0].hist(male_insurance_charges, bins =15, color='g')
ax[0].title.set_text('Male Insurance Charge Distribution')

ax[1].hist(female_insurance_charges, bins =15, color='y')
ax[1].title.set_text('Female Insurance Charge Distribution')

plt.show()

print('The distributions of Male and Female Insurance charges approximately follow a Chi Sq distribution')

In [None]:
lavene_test_mf = levene(male_insurance_charges, female_insurance_charges, center='median')

print('pvalue for equal variance: ', lavene_test_mf[1])
print('Variance of Male Insurance Charges', statistics.variance(male_insurance_charges))
print('Variance of Female Insurance Charges', statistics.variance(female_insurance_charges))
print('Var Male / Var Female', statistics.variance(male_insurance_charges)/statistics.variance(female_insurance_charges))

In [None]:
### Here one of the two main assumptions of the t/Z test is violated: ###
### the assumption of a normal distribution ###
### Had the distributions been normal, we could have used a ###
### t-test with unequal variances; instead we use the Mann-Whitney U test ###

different_mf = stats.mannwhitneyu(male_insurance_charges, female_insurance_charges, alternative='two-sided')
m_charge_lt_f = stats.mannwhitneyu(male_insurance_charges, female_insurance_charges, alternative='less')
m_charge_gt_f = stats.mannwhitneyu(male_insurance_charges, female_insurance_charges, alternative='greater')

if different_mf[1] < 0.05:
  print('The 2 distributions are Different')

if m_charge_lt_f[1] < 0.05:
  print('Males have lower charges than Females')

if m_charge_gt_f[1] < 0.05:
  print('Males have higher charges than Females')

In [None]:
mf_t_diff = stats.ttest_ind(male_insurance_charges, female_insurance_charges, equal_var=False)
print('pvalue for T test with unequal variances (for comparison only): ', mf_t_diff[1])

### Having vs Not Having Children

In [None]:
c_insurance_charges = data[data['children'] > 0]['charges']
nc_insurance_charges = data[data['children'] == 0]['charges']

In [None]:
c_dist = shapiro(c_insurance_charges)
nc_dist = shapiro(nc_insurance_charges)

print('pvalue for With Children Distribution: ', c_dist[1])
print('pvalue for Without Children Distribution: ', nc_dist[1])

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2,figsize =(15,8))

ax[0].hist(c_insurance_charges, bins =15, color='g')
ax[0].title.set_text('With Child Insurance Charge Distribution')

ax[1].hist(nc_insurance_charges, bins =15, color='y')
ax[1].title.set_text('Without Child Insurance Charge Distribution')

plt.show()

print('The distributions of Insurance charges approximately follow a Chi Sq distribution')

In [None]:
lavene_test_c = levene(c_insurance_charges, nc_insurance_charges, center='median')

print('pvalue for equal variance: ', lavene_test_c[1])
print('Variance of With Children Insurance Charges', statistics.variance(c_insurance_charges))
print('Variance of Without Children Insurance Charges', statistics.variance(nc_insurance_charges))
print('Var With Children / Var Without Children', statistics.variance(c_insurance_charges)/statistics.variance(nc_insurance_charges))

In [None]:
### Here one of the two main assumptions of the t/Z test is violated: ###
### the assumption of a normal distribution ###
### Had the distributions been normal, we could have used a ###
### t-test with unequal variances; instead we use the Mann-Whitney U test ###

different_c = stats.mannwhitneyu(c_insurance_charges, nc_insurance_charges, alternative='two-sided')
c_charge_lt_nc = stats.mannwhitneyu(c_insurance_charges, nc_insurance_charges, alternative='less')
c_charge_gt_nc = stats.mannwhitneyu(c_insurance_charges, nc_insurance_charges, alternative='greater')

if different_c[1] < 0.05:
  print('The 2 distributions are Different')

if c_charge_lt_nc[1] < 0.05:
  print('With Children have lower charges than Without Children')

if c_charge_gt_nc[1] < 0.05:
  print('With Children have higher charges than Without Children')

## Multiple Populations

### Regions vs Insurance

In [None]:
se = data[data['region'] == 'southeast']['charges']
nw = data[data['region'] == 'northwest']['charges']
sw = data[data['region'] == 'southwest']['charges']
ne = data[data['region'] == 'northeast']['charges']

In [None]:
se_dist = shapiro(se)
nw_dist = shapiro(nw)
sw_dist = shapiro(sw)
ne_dist = shapiro(ne)

print('pvalue for se Distribution: ', se_dist[1])
print('pvalue for nw Distribution: ', nw_dist[1])
print('pvalue for sw Distribution: ', sw_dist[1])
print('pvalue for ne Distribution: ', ne_dist[1])

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2,figsize =(15,8))

ax[0,0].hist(se, bins =15, color='g')
ax[0,0].title.set_text('SE Insurance Charge Distribution')

ax[0,1].hist(nw, bins =15, color='y')
ax[0,1].title.set_text('NW Insurance Charge Distribution')

ax[1,0].hist(sw, bins =15, color='b')
ax[1,0].title.set_text('SW Insurance Charge Distribution')

ax[1,1].hist(ne, bins =15, color='r')
ax[1,1].title.set_text('NE Insurance Charge Distribution')

plt.show()

print('The distributions of Insurance charges approximately follow a Chi Sq distribution')

In [None]:
lavene_test_region = levene(se, nw, sw, ne, center='median')

print('pvalue for equal variance: ', lavene_test_region[1])
print('Variance for se', statistics.variance(se))
print('Variance for nw', statistics.variance(nw))
print('Variance for sw', statistics.variance(sw))
print('Variance for ne', statistics.variance(ne))

# print('Var Smoker / Var Non Smoker', statistics.variance(c_insurance_charges)/statistics.variance(nc_insurance_charges))

In [None]:
### The Kruskal-Wallis test is the non-parametric counterpart of ANOVA ###
### ANOVA assumes normality and homogeneity of variance ###
### Here our distributions are not normal but the variances are homogeneous ###
### We'll use the Kruskal-Wallis test to check the ###
### null hypothesis that the medians of all the groups are equal ###

region_test = stats.kruskal(se, nw, sw, ne)
print('pvalue for the Kruskal Test = ', region_test[1])
if region_test[1] < 0.05:
  print('Region Insurance Charge Distributions are Different')
else:
  print('Region Insurance Charges have Similar Distributions')


### BMI vs Charges

In [None]:
uw = data[data['bmi'] == 'underweight']['charges']
n = data[data['bmi'] == 'normal']['charges']
ow = data[data['bmi'] == 'overweight']['charges']
o = data[data['bmi'] == 'obese']['charges']

In [None]:
uw_dist = shapiro(uw)
n_dist = shapiro(n)
ow_dist = shapiro(ow)
o_dist = shapiro(o)

print('pvalue for uw Distribution: ', uw_dist[1])
print('pvalue for n Distribution: ', n_dist[1])
print('pvalue for ow Distribution: ', ow_dist[1])
print('pvalue for o Distribution: ', o_dist[1])

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2,figsize =(15,8))

ax[0,0].hist(uw, bins =15, color='g')
ax[0,0].title.set_text('UW Insurance Charge Distribution')

ax[0,1].hist(n, bins =15, color='y')
ax[0,1].title.set_text('N Insurance Charge Distribution')

ax[1,0].hist(ow, bins =15, color='b')
ax[1,0].title.set_text('OW Insurance Charge Distribution')

ax[1,1].hist(o, bins =15, color='r')
ax[1,1].title.set_text('O Insurance Charge Distribution')

plt.show()

# print('The distribution of Insurance charges follow approximate Chi Sq distribution')

In [None]:
### Null hypothesis that all input samples are from populations with equal variances ###

lavene_test_bmi = levene(uw, n, ow, o, center='median')

print('pvalue for equal variance: ', lavene_test_bmi[1])
print('Variance for uw', statistics.variance(uw))
print('Variance for n', statistics.variance(n))
print('Variance for ow', statistics.variance(ow))
print('Variance for o', statistics.variance(o))

# print('Var Smoker / Var Non Smoker', statistics.variance(c_insurance_charges)/statistics.variance(nc_insurance_charges))

In [None]:
### The Kruskal-Wallis test is the non-parametric counterpart of ANOVA ###
### ANOVA assumes normality and homogeneity of variance ###
### Here our distributions are not normal but the variances are homogeneous ###
### We'll use the Kruskal-Wallis test to check the ###
### null hypothesis that the medians of all the groups are equal ###

bmi_test = stats.kruskal(uw, n, ow, o)
print('pvalue for the Kruskal Test = ', bmi_test[1])
if bmi_test[1] < 0.05:
  print('Insurance Charge Distributions are Different across BMI')
else:
  print('Insurance Charges have Similar Distributions across BMI')
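
The Kruskal-Wallis test only says whether at least one group differs, not which one. A common follow-up, sketched here as an assumption rather than something this notebook does, is to run pairwise Mann-Whitney U tests and apply a Bonferroni correction for the number of comparisons. The helper name `pairwise_mannwhitney` and the synthetic groups are made up for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_mannwhitney(groups, alpha=0.05):
    """Hypothetical helper: all pairwise two-sided tests, Bonferroni-adjusted."""
    pairs = list(combinations(groups, 2))
    rows = []
    for name_a, name_b in pairs:
        p = stats.mannwhitneyu(groups[name_a], groups[name_b],
                               alternative='two-sided')[1]
        # Bonferroni: multiply each p-value by the number of comparisons.
        p_adj = min(1.0, float(p) * len(pairs))
        rows.append((name_a, name_b, p_adj, p_adj < alpha))
    return rows


# Synthetic groups (not the insurance data): 'c' is shifted upwards.
rng = np.random.default_rng(42)
demo_groups = {
    'a': rng.normal(0, 1, 100),
    'b': rng.normal(0, 1, 100),
    'c': rng.normal(3, 1, 100),
}
posthoc = pairwise_mannwhitney(demo_groups)
for name_a, name_b, p_adj, significant in posthoc:
    print(name_a, name_b, round(p_adj, 4), significant)
```

For the BMI groups, the dictionary would map labels such as `'underweight'` to the corresponding `charges` series. Bonferroni is conservative; other corrections exist, but it is the simplest to reason about.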

### Number of Children vs Insurance Charge

In [None]:
c0 = data[data['children'] == 0]['charges']
c1 = data[data['children'] == 1]['charges']
c2 = data[data['children'] == 2]['charges']
c3 = data[data['children'] == 3]['charges']
c4 = data[data['children'] == 4]['charges']
c5 = data[data['children'] == 5]['charges']

In [None]:
c0_dist = shapiro(c0)
c1_dist = shapiro(c1)
c2_dist = shapiro(c2)
c3_dist = shapiro(c3)
c4_dist = shapiro(c4)
c5_dist = shapiro(c5)

print('pvalue for c0 Distribution: ', c0_dist[1])
print('pvalue for c1 Distribution: ', c1_dist[1])
print('pvalue for c2 Distribution: ', c2_dist[1])
print('pvalue for c3 Distribution: ', c3_dist[1])
print('pvalue for c4 Distribution: ', c4_dist[1])
print('pvalue for c5 Distribution: ', c5_dist[1])


In [None]:
fig, ax = plt.subplots(nrows=2, ncols=3,figsize =(15,10))

ax[0,0].hist(c0, bins =15, color='g')
ax[0,0].title.set_text('c0 Insurance Charge Distribution')

ax[0,1].hist(c1, bins =15, color='y')
ax[0,1].title.set_text('c1 Insurance Charge Distribution')

ax[0,2].hist(c2, bins =15, color='b')
ax[0,2].title.set_text('c2 Insurance Charge Distribution')

ax[1,0].hist(c3, bins =15, color='r')
ax[1,0].title.set_text('c3 Insurance Charge Distribution')

ax[1,1].hist(c4, bins =15, color='grey')
ax[1,1].title.set_text('c4 Insurance Charge Distribution')

ax[1,2].hist(c5, bins =15, color='brown')
ax[1,2].title.set_text('c5 Insurance Charge Distribution')

plt.show()

# print('The distribution of Insurance charges follow approximate Chi Sq distribution')

In [None]:
### Null hypothesis that all input samples are from populations with equal variances ###

lavene_test_children = levene(c0, c1, c2, c3, c4, c5, center='median')

print('pvalue for equal variance: ', lavene_test_children[1])
print('Variance for c0', statistics.variance(c0))
print('Variance for c1', statistics.variance(c1))
print('Variance for c2', statistics.variance(c2))
print('Variance for c3', statistics.variance(c3))
print('Variance for c4', statistics.variance(c4))
print('Variance for c5', statistics.variance(c5))

# print('Var Smoker / Var Non Smoker', statistics.variance(c_insurance_charges)/statistics.variance(nc_insurance_charges))

In [None]:
### The Kruskal-Wallis test is the non-parametric counterpart of ANOVA ###
### ANOVA assumes normality and homogeneity of variance ###
### Here our distributions are not normal but the variances are homogeneous ###
### We'll use the Kruskal-Wallis test to check the ###
### null hypothesis that the medians of all the groups are equal ###

children_test = stats.kruskal(c0, c1, c2, c3, c4, c5)
print('pvalue for the Kruskal Test = ', children_test[1])
if children_test[1] < 0.05:
  print('Insurance Charge Distributions are Different across # of Children')
else:
  print('Insurance Charges have Similar Distributions across # of Children')

## Non Numeric Tests

### Gender Vs BMI

In [None]:
contigency_gn_bmi = pd.crosstab(data['sex'], data['bmi']) 
contigency_gn_bmi

In [None]:
# Chi-square test of independence. 
c, p_gn_bmi, dof, expected_gn_bmi = stats.chi2_contingency(contigency_gn_bmi) 

# Print the p-value
print(p_gn_bmi)
print(expected_gn_bmi)

### The p-value is not less than 0.05, so there is no statistically ###
### significant difference in BMI distribution across genders ###
### The expected counts printed above show the same picture ###
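
To go beyond the p-value, the strength of association in a contingency table can be summarised with Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The sketch below is an illustration under stated assumptions: the helper name `cramers_v` and both tables are made up, and Yates' continuity correction is switched off so that the plain Pearson chi-square statistic is used.

```python
import numpy as np
from scipy import stats


def cramers_v(table):
    """Cramér's V for an r x c table of counts (hypothetical helper)."""
    table = np.asarray(table, dtype=float)
    # correction=False: plain Pearson chi-square, without Yates' correction.
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Made-up tables, not the insurance data.
perfect = cramers_v([[10, 0], [0, 10]])    # rows fully determine columns
none = cramers_v([[10, 10], [10, 10]])     # rows and columns unrelated
print(perfect, none)
```

A crosstab such as `contigency_gn_bmi` could be passed in the same way, since `np.asarray` accepts a DataFrame of counts.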

### Gender and Region

In [None]:
contigency_gen_regions= pd.crosstab(data['sex'], data['region']) 
contigency_gen_regions

In [None]:
# Chi-square test of independence. 
c, p_gen_regions, dof, expected_gen_regions = stats.chi2_contingency(contigency_gen_regions) 

# Print the p-value
print(p_gen_regions)
print(expected_gen_regions)

### Smoker vs Gender

In [None]:
contigency_gen_smoker= pd.crosstab(data['sex'], data['smoker']) 
contigency_gen_smoker

In [None]:
# Chi-square test of independence. 
c, p_gen_smoker, dof, expected_gen_smoker = stats.chi2_contingency(contigency_gen_smoker) 

# Print the p-value
print(p_gen_smoker)
print(expected_gen_smoker)

### Smoker vs Regions

In [None]:
contigency_reg_smoker= pd.crosstab(data['region'], data['smoker']) 
contigency_reg_smoker

In [None]:
# Chi-square test of independence. 
c, p_reg_smoker, dof, expected_reg_smoker = stats.chi2_contingency(contigency_reg_smoker) 

# Print the p-value
print(p_reg_smoker)
print(expected_reg_smoker)

### Smoker vs BMI

In [None]:
contigency_bmi_smoker= pd.crosstab(data['bmi'], data['smoker']) 
contigency_bmi_smoker

In [None]:
# Chi-square test of independence. 
c, p_bmi_smoker, dof, expected_bmi_smoker = stats.chi2_contingency(contigency_bmi_smoker) 

# Print the p-value
print(p_bmi_smoker)
print(expected_bmi_smoker)

### Smoker vs Gender vs BMI

In [None]:
contigency_reg_smoker_bmi = pd.crosstab(index= data['smoker'], columns = [data['sex'], data['bmi']]) 
contigency_reg_smoker_bmi

In [None]:
# Chi-square test of independence. 
c, p_reg_smoker_bmi, dof, expected_reg_smoker_bmi = stats.chi2_contingency(contigency_reg_smoker_bmi) 

# Print the p-value
print(p_reg_smoker_bmi)
print(expected_reg_smoker_bmi)

### Smoker vs Gender vs BMI vs Region

In [None]:
contigency_reg_smoker_bmi_region = pd.crosstab(index= data['smoker'], columns = [data['sex'], data['bmi'], data['region']]) 
contigency_reg_smoker_bmi_region

In [None]:
# Chi-square test of independence. 
c, p_reg_smoker_bmi_region, dof, expected_reg_smoker_bmi_region = stats.chi2_contingency(contigency_reg_smoker_bmi_region) 

# Print the p-value
print(p_reg_smoker_bmi_region)
print(expected_reg_smoker_bmi_region)
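
The more variables are crossed, the sparser the table becomes, and the chi-square approximation gets less reliable when many expected counts are small. A common rule of thumb is to look at how many cells have an expected count below 5. The sketch below shows one way to measure that; the helper name `small_expected_share`, the threshold default, and both tables are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def small_expected_share(table, threshold=5):
    """Share of cells whose expected count is below `threshold` (hypothetical helper)."""
    expected = stats.chi2_contingency(np.asarray(table))[3]
    return float(np.mean(expected < threshold))


# Made-up tables, not the insurance data.
sparse_share = small_expected_share([[1, 2], [3, 4]])      # all expected counts < 5
dense_share = small_expected_share([[30, 20], [20, 30]])   # all expected counts = 25
print(sparse_share, dense_share)
```

If a large share of cells falls below the threshold for a table like `contigency_reg_smoker_bmi_region`, merging categories or testing fewer variables at a time may give more trustworthy results.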