## Import the necessary libraries

In [12]:
import numpy as np
import pandas as pd

## Reading the data into the DataFrame

In [13]:
df = pd.read_csv('insurance.csv')

In [14]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


## Test of proportions 

* 'sex' and 'smoker' are two categorical variables
* We want to test whether the proportion of smokers in the female population differs significantly from the proportion in the male population

**$H_0:$ The proportion of smokers in the female population is equal to the proportion of smokers in the male population.**

**$H_a:$ The proportion of smokers in the female population is not equal to the proportion of smokers in the male population.**
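
For reference, these hypotheses can be written in symbols, together with the usual pooled form of the two-sample z statistic. The symbols $\hat{p}_f$, $\hat{p}_m$ (sample proportions of smokers), $x_f$, $x_m$ (smoker counts) and $n_f$, $n_m$ (group sizes) are notation introduced here for illustration and do not appear in the original notebook.

$$
H_0: p_f = p_m, \qquad H_a: p_f \neq p_m
$$

$$
z = \frac{\hat{p}_f - \hat{p}_m}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_f} + \dfrac{1}{n_m}\right)}}, \qquad \hat{p} = \frac{x_f + x_m}{n_f + n_m}
$$

Under $H_0$, $z$ is approximately standard normal for large samples, so the two-sided p-value is $2\,(1 - \Phi(|z|))$.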

In [15]:
female_smokers = df[df['sex'] == 'female'].smoker.value_counts()['yes']  # number of female smokers
male_smokers = df[df['sex'] == 'male'].smoker.value_counts()['yes']  # number of male smokers
print('The numbers of female and male smokers are {0} and {1} respectively'.format(female_smokers, male_smokers))
n_females = df.sex.value_counts()['female']  # number of females in the data
n_males = df.sex.value_counts()['male']  # number of males in the data
print('The total numbers of females and males are {0} and {1} respectively'.format(n_females, n_males))

The numbers of female and male smokers are 115 and 159 respectively
The total numbers of females and males are 662 and 676 respectively


In [16]:
print(f'The proportions of smokers among females and males are {round(female_smokers/n_females, 2)} and {round(male_smokers/n_males, 2)} respectively')

The proportions of smokers among females and males are 0.17 and 0.24 respectively


**The proportions are different, but is the difference statistically significant?**

Let's perform a two-sample proportion z-test to find out.

In [18]:
# import the required function
from statsmodels.stats.proportion import proportions_ztest

# compute the test statistic and the p-value
stat, pval = proportions_ztest([female_smokers, male_smokers], [n_females, n_males], alternative='two-sided')

# print the conclusion based on p-value
if pval < 0.05:
    print(f'With a p-value of {round(pval,4)} the difference is significant. So, we reject the null.')
else:
    print(f'With a p-value of {round(pval,4)} the difference is not significant. So, we fail to reject the null.')

With a p-value of 0.0053 the difference is significant. So, we reject the null.
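
To see where this p-value comes from, the test can be reproduced by hand. The following is a minimal sketch that assumes the pooled-variance form of the two-sample z-test and hard-codes the counts printed earlier (115 of 662 females and 159 of 676 males smoke). The names `p_pooled`, `std_err`, `z` and `p_value` are illustrative and not part of the original notebook.

```python
import math

# Counts taken from the notebook output above
female_smokers, n_females = 115, 662
male_smokers, n_males = 159, 676

# Sample proportion of smokers in each group
p_female = female_smokers / n_females
p_male = male_smokers / n_males

# Pooled proportion: the common smoking rate assumed under H0
p_pooled = (female_smokers + male_smokers) / (n_females + n_males)

# Standard error of the difference in proportions under H0
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_females + 1 / n_males))

# z statistic: observed difference measured in standard errors
z = (p_female - p_male) / std_err

# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f'z = {z:.3f}, p-value = {p_value:.4f}')
```

The p-value printed here should come out close to the 0.0053 reported by `proportions_ztest`. The negative sign of `z` reflects that the sample proportion of smokers is lower among females than among males.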


## Insight
At the 5% significance level, we have enough statistical evidence to conclude that the proportion of smokers in the female population differs from the proportion of smokers in the male population (about 0.17 versus 0.24 in this sample).