# Health Insurance Analysis by Implementing Probability

## Import library and load dataset

In [32]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

data = pd.read_csv('insurance.csv')

data.head()

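| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |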


This notebook analyzes a health insurance dataset. The dataset contains the following variables:

- age: age of the primary beneficiary
- sex: gender of the insurance contractor (female, male)
- bmi: body mass index, an index of body weight relative to height (kg/m²) that indicates whether weight is relatively high or low; the ideal range is 18.5 to 24.9
- children: number of children covered by health insurance (number of dependents)
- smoker: whether the beneficiary smokes (yes, no)
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
- charges: individual medical costs billed by health insurance

In [33]:
# check data info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


There are no missing values, but sex, smoker, and region are stored with the generic object dtype. Since they are categorical variables, we convert them to the category dtype.

In [34]:
data['sex'] = data['sex'].astype('category')
data['smoker'] = data['smoker'].astype('category')
data['region'] = data['region'].astype('category')

In [35]:
# check duplicate
data.duplicated().sum()

1

In [36]:
# drop duplicate
data = data.drop_duplicates()
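
Before dropping duplicates it can help to look at the rows involved. The following is a minimal sketch on a made-up toy frame (the values are not taken from insurance.csv); it only uses `DataFrame.duplicated` and `DataFrame.drop_duplicates`.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up
toy = pd.DataFrame({
    "age": [19, 19, 28, 33],
    "sex": ["male", "male", "female", "male"],
    "charges": [1639.56, 1639.56, 4449.46, 21984.47],
})

# keep=False marks every member of a duplicate group, not only the later copies
duplicate_rows = toy[toy.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates keeps the first occurrence of each duplicated row by default
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduplicated))
```

The `reset_index(drop=True)` call is optional; the cell above does not reset the index, so the original row labels are kept there.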

The data is now clean and ready for analysis.

# Research Question

The research questions are grouped into the following sections:
- Continuous Variable Analysis
- Variable Correlation Analysis
- Descriptive Statistical Analysis
- Discrete Variable Analysis
- Hypothesis Testing

## Continuous Variable Analysis
1. Which is more likely to happen  
a. A person with BMI above 25 gets a health bill above 16.7k,  
or  
b. A person with a BMI below 25 gets a health bill above 16.7k


In [37]:
# A person with BMI above 25 gets a health bill above 16.7k
proba_1a = data[(data['bmi'] > 25) & (data['charges'] > 16700)]
print("probability of event A : ", round(len(proba_1a)/len(data),4))

probability of event A :  0.2117


In [38]:
# A person with BMI below 25 gets a health bill above 16.7k
proba_1b = data[(data['bmi'] < 25) & (data['charges'] > 16700)]
print("probability of event B : ", round(len(proba_1b)/len(data),4))

probability of event B :  0.0381


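Both values are joint probabilities computed over all 1,337 rows: P(BMI > 25 and charges > 16.7k) = 0.2117, while P(BMI < 25 and charges > 16.7k) = 0.0381. A randomly chosen beneficiary is therefore more likely to have a BMI above 25 together with a bill above 16.7k than a BMI below 25 together with a bill above 16.7k. Because the two BMI groups differ in size, these joint probabilities alone do not show how likely a large bill is within each group; that comparison requires the conditional probabilities P(charges > 16.7k | BMI > 25) and P(charges > 16.7k | BMI < 25).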
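
The cells above divide by `len(data)`, which gives joint probabilities. The sketch below contrasts a joint probability with a conditional one. It uses only the five rows printed by `data.head()`, so the resulting numbers illustrate the mechanics and are not the dataset's probabilities.

```python
import pandas as pd

# The five rows shown by data.head(); far too few for real conclusions
sample = pd.DataFrame({
    "bmi":     [27.9, 33.77, 33.0, 22.705, 28.88],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

high_bill = sample["charges"] > 16700
over_25 = sample["bmi"] > 25
under_25 = sample["bmi"] < 25

# Joint probability: share of ALL rows with BMI > 25 and a bill above 16.7k
p_joint_over = (over_25 & high_bill).mean()

# Conditional probability: share of high bills WITHIN each BMI group
p_high_given_over = high_bill[over_25].mean()
p_high_given_under = high_bill[under_25].mean()

print(p_joint_over, p_high_given_over, p_high_given_under)
```

The mean of a boolean Series is the fraction of True values, which is why `.mean()` can stand in for a count divided by a length.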

2. Which is more likely to happen  
a. A smoker with a BMI above 25 gets a health bill above 16.7k,  
or  
b. A non-smoker with a BMI above 25 gets a health bill above 16.7k

In [39]:
# A smoker with a BMI above 25 gets a health bill of above 16.7k
proba_2a = data[(data["bmi"] > 25) & (data["smoker"] == 'yes') & (data['charges'] > 16700)]
print("probability of event A : ", round(len(proba_2a)/len(data[data['smoker']=="yes"]),4))

probability of event A :  0.7847


In [40]:
# A non-smoker with a BMI above 25 gets a health bill above 16.7k
proba_2b = data[(data["bmi"] > 25) & (data["smoker"] == 'no') & (data['charges'] > 16700)]
print("probability of event B : ", round(len(proba_2b)/len(data[data['smoker']=="no"]),4))

probability of event B :  0.064


Given that a person smokes, the probability of having a BMI above 25 and a bill above 16.7k is 0.7847. Given that a person does not smoke, the corresponding probability is 0.064. Note that the denominators are all smokers and all non-smokers respectively, not only those with a BMI above 25. Even so, the gap is large: smokers with a BMI above 25 are far more likely to get a bill above 16.7k than non-smokers with a BMI above 25.
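
To condition on both smoking status and BMI, the denominator has to be restricted to the people who satisfy both conditions. A minimal sketch follows, with a hypothetical helper `conditional_probability` (not part of the original notebook) applied to the five rows printed by `data.head()`; the numbers are illustrative only.

```python
import pandas as pd

def conditional_probability(event: pd.Series, given: pd.Series) -> float:
    """Return P(event | given) for two boolean Series on the same index."""
    return float(event[given].mean())

# The five rows shown by data.head(); used only to demonstrate the mechanics
sample = pd.DataFrame({
    "bmi":     [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker":  ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

high_bill = sample["charges"] > 16700
smoker_over_25 = (sample["smoker"] == "yes") & (sample["bmi"] > 25)
nonsmoker_over_25 = (sample["smoker"] == "no") & (sample["bmi"] > 25)

# P(charges > 16.7k | smoker and BMI > 25)
p_smoker = conditional_probability(high_bill, smoker_over_25)
# P(charges > 16.7k | non-smoker and BMI > 25)
p_nonsmoker = conditional_probability(high_bill, nonsmoker_over_25)

print(p_smoker, p_nonsmoker)
```

If the condition matches no rows, the helper returns NaN, which is a reasonable signal that the conditional probability is undefined for that sample.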

## Variable Correlation Analysis
We will look at the correlation between BMI and Age.

In [41]:
data[['bmi','age']].corr()

| | bmi | age |
|---|---|---|
| bmi | 1.0 | 0.109344 |
| age | 0.109344 | 1.0 |


BMI and age have a Pearson correlation of 0.1093, which indicates a weak positive linear relationship: BMI tends to increase only slightly with age.
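
`DataFrame.corr()` computes the Pearson correlation coefficient by default. As a sketch of what that number means, the block below computes it by hand on the five rows printed by `data.head()` and compares the result with pandas. With only five rows, the value is not expected to match the 0.1093 of the full dataset.

```python
import numpy as np
import pandas as pd

# The five rows shown by data.head(); for demonstration only
sample = pd.DataFrame({
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "age": [19, 18, 28, 33, 32],
})

# Pearson r = sum of products of deviations / sqrt(product of sums of squared deviations)
dx = sample["bmi"] - sample["bmi"].mean()
dy = sample["age"] - sample["age"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method unless another method is requested
r_pandas = sample["bmi"].corr(sample["age"])

print(r_manual, r_pandas)
```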

## Descriptive Statistical Analysis
1. What is the average age in the data?

In [42]:
avg_age = data['age'].mean()
print("Average age in this data is :", round(avg_age,2))

Average age in this data is : 39.22


The average age in the health insurance data is about 39 years, so the typical beneficiary in this dataset is a middle-aged adult.

2. What is the average BMI value of those who smoke?

In [43]:
data_smokers = data[data['smoker'] == "yes"]
avg_bmi = data_smokers['bmi'].mean()
avg_bmi

30.70844890510949

The average BMI of smokers is about 30.7. This is well above the ideal range of 18.5 to 24.9 and just past the threshold of 30 that is commonly used to classify obesity, so the average smoker in this dataset is overweight.

3. Is the standard deviation of the health bills of smokers and non-smokers the same?

In [44]:
data_smokers.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 274.0 | 38.514599 | 13.923186 | 18.0 | 27.0 | 38.0 | 49.0 | 64.0 |
| bmi | 274.0 | 30.708449 | 6.318644 | 17.195 | 26.08375 | 30.4475 | 35.2 | 52.58 |
| children | 274.0 | 1.113139 | 1.157066 | 0.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| charges | 274.0 | 32050.231832 | 11541.547176 | 12829.4551 | 20826.244213 | 34456.34845 | 41019.207275 | 63770.42801 |


In [45]:
data_nonsmokers = data[data['smoker'] == "no"]
data_nonsmokers.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 1063.0 | 39.404516 | 14.076133 | 18.0 | 27.0 | 40.0 | 52.0 | 64.0 |
| bmi | 1063.0 | 30.651853 | 6.045956 | 15.96 | 26.315 | 30.305 | 34.43 | 53.13 |
| children | 1063.0 | 1.091251 | 1.21825 | 0.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| charges | 1063.0 | 8440.660307 | 5992.9738 | 1121.8739 | 3988.8835 | 7345.7266 | 11363.0191 | 36910.60803 |


No. The standard deviation of smokers' bills is USD 11,541.55, nearly twice that of non-smokers' bills (USD 5,992.97). Smokers' bills therefore vary more widely around their mean and cover a wider range of healthcare expenses than non-smokers' bills.
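
The two standard deviations can also be obtained in one step with `groupby`. The sketch below uses made-up toy values (not the dataset) and adds the coefficient of variation (std divided by mean) as an optional extra that the original analysis does not use; it expresses spread relative to the size of the bills.

```python
import pandas as pd

# Toy values for illustration only; not taken from insurance.csv
toy = pd.DataFrame({
    "smoker":  ["yes", "yes", "yes", "no", "no", "no"],
    "charges": [20000.0, 30000.0, 40000.0, 4000.0, 8000.0, 12000.0],
})

# Sample standard deviation (ddof=1), the same default that describe() reports
spread = toy.groupby("smoker")["charges"].agg(["mean", "std"])

# Coefficient of variation: spread relative to the group mean
spread["cv"] = spread["std"] / spread["mean"]

std_ratio = spread.loc["yes", "std"] / spread.loc["no", "std"]
print(spread)
print("std ratio (smokers / non-smokers):", std_ratio)
```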

4. Is the average age of women and men who smoke the same?

In [46]:
avg_smokers = data_smokers[['age','sex']].groupby('sex').mean()
avg_smokers

| sex | age |
|---|---|
| female | 38.608696 |
| male | 38.446541 |


The average age of women and men who smoke is around 38 years old. The mean difference is only 0.16 years, indicating that women and men who smoke have almost the same mean age.

5. Which is higher, the average health bill of smokers or non-smokers?

In [47]:
data_smokers.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 274.0 | 38.514599 | 13.923186 | 18.0 | 27.0 | 38.0 | 49.0 | 64.0 |
| bmi | 274.0 | 30.708449 | 6.318644 | 17.195 | 26.08375 | 30.4475 | 35.2 | 52.58 |
| children | 274.0 | 1.113139 | 1.157066 | 0.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| charges | 274.0 | 32050.231832 | 11541.547176 | 12829.4551 | 20826.244213 | 34456.34845 | 41019.207275 | 63770.42801 |


In [48]:
data_nonsmokers.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 1063.0 | 39.404516 | 14.076133 | 18.0 | 27.0 | 40.0 | 52.0 | 64.0 |
| bmi | 1063.0 | 30.651853 | 6.045956 | 15.96 | 26.315 | 30.305 | 34.43 | 53.13 |
| children | 1063.0 | 1.091251 | 1.21825 | 0.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| charges | 1063.0 | 8440.660307 | 5992.9738 | 1121.8739 | 3988.8835 | 7345.7266 | 11363.0191 | 36910.60803 |


The average smoker's bill is USD 32,050.23, compared to USD 8,440.66 for the average non-smoker. Smokers' bills are therefore about 3.8 times higher on average.

## Discrete Variable Analysis
1. Which gender has the highest bill?

In [49]:
data.pivot_table(values="charges", index="sex", aggfunc=["max", "mean", "median", "std"])

| sex | max charges | mean charges | median charges | std charges |
|---|---|---|---|---|
| female | 63770.42801 | 12569.578844 | 9412.9625 | 11128.703801 |
| male | 62592.87309 | 13974.998864 | 9377.9047 | 12971.958663 |


The highest female bill (USD 63,770.43) is higher than the highest male bill (USD 62,592.87). The median female bill (USD 9,412.96) is also slightly higher than the median male bill (USD 9,377.90). The mean, however, is higher for males (USD 13,975.00) than for females (USD 12,569.58).

2. Distribution of charges in each region

In [50]:
data.pivot_table(values="charges", index="region", aggfunc=["mean", "median"])

| region | mean charges | median charges |
|---|---|---|
| northeast | 13406.384516 | 10057.652025 |
| northwest | 12450.840844 | 8976.97725 |
| southeast | 14735.411438 | 9294.13195 |
| southwest | 12346.937377 | 8798.593 |


Several insights can be drawn:
- The "southeast" region has the highest mean charges, approximately USD 14,735.41, followed by the "northeast" region at around USD 13,406.38.

- The "southwest" region has the lowest mean charges, approximately USD 12,346.94.

- The "northeast" region has the highest median charges, around USD 10,057.65, which is still well below its mean.

- The "northwest" region shows the same pattern, with a median of approximately USD 8,976.98 against a mean of USD 12,450.84.

- The "southeast" region has the largest gap between mean and median (about USD 5,441), while the gaps in the other three regions lie between roughly USD 3,350 and USD 3,550.

- In every region the mean exceeds the median, which indicates right-skewed distributions in which some very large bills pull the mean upward.
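
The gap between mean and median per region can be checked directly from the values in the pivot table above. This sketch rebuilds that table by hand rather than recomputing it from insurance.csv; the column name `gap` is introduced here for illustration.

```python
import pandas as pd

# Mean and median charges per region, copied from the pivot table above
by_region = pd.DataFrame(
    {
        "mean":   [13406.384516, 12450.840844, 14735.411438, 12346.937377],
        "median": [10057.652025, 8976.97725, 9294.13195, 8798.593],
    },
    index=["northeast", "northwest", "southeast", "southwest"],
)

# A positive gap (mean above median) points to a right-skewed distribution
by_region["gap"] = by_region["mean"] - by_region["median"]

print(by_region.sort_values("gap", ascending=False).round(2))
```

On the full data, the same comparison could be made with `data.groupby("region")["charges"].agg(["mean", "median"])`.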

3. Does each region have the same proportion of people?

In [51]:
data.pivot_table(values="charges", index="region", aggfunc="count")/len(data)*100

| region | share of rows (%) |
|---|---|
| northeast | 24.233358 |
| northwest | 24.233358 |
| southeast | 27.225131 |
| southwest | 24.308153 |


The southeast region has the highest share of beneficiaries, approximately 27.23% of the total. The northeast and northwest regions each account for about 24.23%, and the southwest for about 24.31%. The regions are therefore not represented in exactly equal proportions, although the differences are small. These figures describe the composition of this dataset, not the population of each region, so they should not be read as evidence of population density.
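
Whether the regional shares differ by more than chance could be examined with a chi-square goodness-of-fit test, which the original analysis does not include. The sketch below is one possible way to do it. The counts are an assumption: they are back-calculated from the percentages in the table above (percentage × 1337 / 100), not read from the data.

```python
import numpy as np
from scipy.stats import chisquare

# Assumed counts, back-calculated from the percentages above (total 1337 rows)
region_counts = {"northeast": 324, "northwest": 324, "southeast": 364, "southwest": 325}

observed = np.array(list(region_counts.values()))
shares = observed / observed.sum() * 100

# With no expected frequencies given, chisquare tests against equal counts
stat, p_value = chisquare(observed)

print(dict(zip(region_counts, shares.round(2))))
print(f"chi-square = {stat:.3f}, p-value = {p_value:.3f}")
```

A p-value above the chosen alpha would mean the data give no strong evidence against equal regional proportions.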

4. Which proportion is higher, smokers or non-smokers?

In [52]:
data.pivot_table(values="charges", index="smoker", aggfunc="count")/len(data)*100

| smoker | share of rows (%) |
|---|---|
| no | 79.506358 |
| yes | 20.493642 |


Non-smokers account for approximately 79.51% of the total and smokers for about 20.49%, so a large majority of individuals in the dataset are non-smokers.

5. What is the probability that a person is a woman, given that the person is a smoker?

In [53]:
data.pivot_table(values="charges", index="sex", columns="smoker" ,aggfunc="count")

| sex | smoker: no | smoker: yes |
|---|---|---|
| female | 547 | 115 |
| male | 516 | 159 |


There are 115 female smokers, and the total number of smokers is 115 women + 159 men = 274. The probability that a person is a woman, given that the person smokes, is therefore P(female | smoker) = 115/274 = 0.4197.
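
The direction of the conditioning matters here: P(female | smoker) and P(smoker | female) use the same numerator but different denominators. A short sketch using the counts from the table above:

```python
import pandas as pd

# Counts copied from the sex-by-smoker table above
counts = pd.DataFrame(
    {"no": [547, 516], "yes": [115, 159]},
    index=["female", "male"],
)

# P(female | smoker): divide by the column total (all smokers)
p_female_given_smoker = counts.loc["female", "yes"] / counts["yes"].sum()

# P(smoker | female): divide by the row total (all females)
p_smoker_given_female = counts.loc["female", "yes"] / counts.loc["female"].sum()

print(round(p_female_given_smoker, 4))  # 0.4197
print(round(p_smoker_given_female, 4))  # 0.1737
```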

## Hypothesis Testing
1. Smokers' health bills are higher than non-smokers' health bills

In [54]:
data[['smoker', 'charges']].groupby("smoker").mean()

| smoker | mean charges |
|---|---|
| no | 8440.660307 |
| yes | 32050.231832 |


The mean bill of smokers (USD 32,050.23) is much greater than the mean bill of non-smokers (USD 8,440.66). To check whether this difference is statistically significant, we test the following hypotheses:  
H0 : μsmokers <= μnon-smokers  
H1 : μsmokers > μnon-smokers

We use alpha = 0.05 and a right-tailed two-sample t-test. Because the code sets equal_var=False, this is Welch's t-test, which does not assume equal variances in the two groups.



In [55]:
alpha = 0.05
statistics, p = ttest_ind(a = data_smokers["charges"], b = data_nonsmokers["charges"], equal_var=False, alternative='greater')

print('Statistics = %.3f, p-value = %.3f' % (statistics, p))

Statistics = 32.742, p-value = 0.000


The test statistic is 32.742 and the p-value is below 0.001 (printed as 0.000). Since the p-value < alpha, we reject H0 and conclude that the average bill of smokers is greater than the average bill of non-smokers.
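
The statistic can be reproduced from the summary statistics alone. The sketch below applies the Welch t-test formula, t = (mean1 − mean2) / sqrt(s1²/n1 + s2²/n2), to the means, standard deviations, and counts reported by `describe()` above, with the Welch-Satterthwaite approximation for the degrees of freedom. The variable names are chosen here for illustration.

```python
import math
from scipy import stats

# Summary statistics of charges, copied from the describe() outputs above
mean_s, std_s, n_s = 32050.231832, 11541.547176, 274    # smokers
mean_n, std_n, n_n = 8440.660307, 5992.9738, 1063       # non-smokers

# Per-group variance of the sample mean
var_term_s = std_s ** 2 / n_s
var_term_n = std_n ** 2 / n_n

# Welch t statistic (no equal-variance assumption)
t_stat = (mean_s - mean_n) / math.sqrt(var_term_s + var_term_n)

# Welch-Satterthwaite degrees of freedom
df = (var_term_s + var_term_n) ** 2 / (
    var_term_s ** 2 / (n_s - 1) + var_term_n ** 2 / (n_n - 1)
)

# Right-tailed p-value: probability of a t value at least this large under H0
p_value = stats.t.sf(t_stat, df)

print(f"t = {t_stat:.3f}, df = {df:.1f}, p = {p_value:.3g}")
```

The p-value is far smaller than three decimals can show, which is why the notebook prints it as 0.000.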

2. The proportion of male smokers is greater than that of females

In [60]:
gender_smokers = data.pivot_table(values="charges", index="sex", columns="smoker", aggfunc="count")
gender_smokers

| sex | smoker: no | smoker: yes |
|---|---|---|
| female | 547 | 115 |
| male | 516 | 159 |


The proportion of males who smoke is 159/(516+159) = 0.2356, while the proportion of females who smoke is 115/(115+547) = 0.1737. In this sample, a larger proportion of men than women smoke. Next, we test the difference between the two population proportions using a z-test. The hypotheses are:  
H0 : Pmale <= Pfemale  
H1 : Pmale > Pfemale  
We use a right-tailed test with alpha = 0.05.

In [58]:
total_smokers = np.array([gender_smokers.loc["male","yes"], gender_smokers.loc["female","yes"]])
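# Group sizes taken from the table instead of hard-coded: males (516+159), females (547+115)
total_samples = gender_smokers.loc[["male", "female"]].sum(axis=1).to_numpy()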

alpha = 0.05
(z_stat, p_value) = proportions_ztest(total_smokers, total_samples,alternative='larger')
print("Z statistics : ",z_stat)
print("P-value : ", p_value)

Z statistics :  2.800728081362614
P-value :  0.0025493731085728284


The z statistic is 2.8007 and the p-value is 0.0025. Since the p-value < alpha, we reject H0. Hence the proportion of smokers among males is greater than the proportion of smokers among females.
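
As a cross-check, the z statistic can be computed by hand with the pooled two-proportion formula, z = (p1 − p2) / sqrt(p(1 − p)(1/n1 + 1/n2)), where p is the pooled proportion. The sketch uses the counts from the table above; the variable names are chosen here for illustration.

```python
import math
from scipy.stats import norm

# Counts copied from the sex-by-smoker table above
smokers_male, n_male = 159, 516 + 159
smokers_female, n_female = 115, 547 + 115

p_male = smokers_male / n_male
p_female = smokers_female / n_female

# Pooled proportion under H0 (both groups share one smoking rate)
p_pool = (smokers_male + smokers_female) / (n_male + n_female)

# Standard error of the difference under the pooled proportion
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_male + 1 / n_female))

z_manual = (p_male - p_female) / se

# Right-tailed p-value from the standard normal distribution
p_manual = norm.sf(z_manual)

print(f"z = {z_manual:.4f}, p-value = {p_manual:.4f}")
```

The result agrees with the output of `proportions_ztest` above to the printed precision.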

3. Health bills of people with a BMI above 25 are higher than those of people with a BMI below 25

In [59]:
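# data_with_bmi_category refers to the same object as data, so the new column is added to data too
data_with_bmi_category = data
# >= 25 places a BMI of exactly 25 in "above_25"
data_with_bmi_category["bmi_category"] = np.where(data_with_bmi_category["bmi"] >= 25, "above_25", "below_25")
data_with_bmi_category.head()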

| | age | sex | bmi | children | smoker | region | charges | bmi_category |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | above_25 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | above_25 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | above_25 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | below_25 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 | above_25 |
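
The labelling step assigns a BMI of exactly 25 to "above_25" because it uses `>=`. The sketch below shows two equivalent ways to express that boundary on a few made-up BMI values: `np.where`, and `pd.cut` with `right=False`, which makes each bin include its left edge. It is offered as an alternative, not as the method used in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up BMI values chosen to sit on and around the boundary
bmi = pd.Series([24.99, 25.0, 25.01, 18.5, 33.77])

# Vectorised two-way split; exactly 25 goes to "above_25"
labels_where = np.where(bmi >= 25, "above_25", "below_25")

# Same split with pd.cut: right=False gives the bins [-inf, 25) and [25, inf)
labels_cut = pd.cut(
    bmi,
    bins=[-np.inf, 25, np.inf],
    right=False,
    labels=["below_25", "above_25"],
)

print(list(labels_where))
print(list(labels_cut))
```

`pd.cut` returns a categorical column, which extends naturally to more than two BMI bands if needed.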


In [61]:
data.pivot_table(values="charges", index="bmi_category", aggfunc=["mean", "median"])

| bmi_category | mean charges | median charges |
|---|---|---|
| above_25 | 13951.502227 | 9573.46115 |
| below_25 | 10282.224474 | 8582.3023 |


The average bill for BMI >= 25 (USD 13,951.50) is greater than the average bill for BMI < 25 (USD 10,282.22). Next, we carry out a statistical test with the following hypotheses:  
H0 : μabove25 <= μbelow25  
H1 : μabove25 > μbelow25  
We use a right-tailed test with alpha = 0.05.



In [62]:
alpha = 0.05
data_bmi_above_25 = data_with_bmi_category[data_with_bmi_category["bmi_category"] == "above_25"]
data_bmi_below_25 = data_with_bmi_category[data_with_bmi_category["bmi_category"] == "below_25"]
stat, p = ttest_ind(a = data_bmi_above_25["charges"], b = data_bmi_below_25["charges"], equal_var=False, alternative='greater')


print('Statistics = %.3f, p-value = %.3f' % (stat, p))

Statistics = 5.941, p-value = 0.000


The test statistic is 5.941 and the p-value is below 0.001 (printed as 0.000). Since the p-value < alpha, we reject H0 and conclude that people with a BMI of 25 or above have higher average bills than people with a BMI below 25.

# Conclusion
The analysis shows that a larger proportion of men than women smoke, and that smokers have higher bills than non-smokers. Men and women who smoke have nearly the same average age, around 38 years. The average BMI of smokers is about 30.7. BMI and age have a correlation of 0.1093, which is a weak positive relationship. Finally, people with a BMI of 25 or above have higher average bills than people with a BMI below 25.

# Further Research
Future analysis could examine the correlations among the other variables in the health insurance data. Factors such as age, body mass index (BMI), region, smoking habits, and number of children could be studied in more depth to understand how they relate to health insurance costs. Such research would give more detailed insight into the influence of each factor on insurance bills, helping insurance companies make more informed decisions and helping potential customers understand what affects their health insurance bills.


# Reference
Probability Material 1 & 2 Pacmann