In [2]:
import pandas as pd

df = pd.read_csv("insurance2.csv")
df.head()

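| Unnamed: 0 | age | sex | bmi | children | smoker | region | charges | insuranceclaim |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 | 16884.924 | 1 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 | 1725.5523 | 1 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 | 4449.462 | 0 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 | 21984.47061 | 0 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 | 3866.8552 | 1 |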


## Step 2: Define the Real-World Question

**Business Question:**  
Are smokers more likely to file an insurance claim than non-smokers?

Answering it means comparing two proportions of a binary outcome (claim filed or not):
- 𝑝₁: proportion of **smokers** who filed a claim
- 𝑝₂: proportion of **non-smokers** who filed a claim

### Hypotheses

**Null Hypothesis (H₀):**  
There is no difference in the proportion of claims filed between smokers and non-smokers.  
→ 𝑝₁ = 𝑝₂

**Alternative Hypothesis (H₁):**  
Smokers are more likely to file a claim than non-smokers.  
→ 𝑝₁ > 𝑝₂

In [6]:
print("Smoker counts:")
print(df["smoker"].value_counts())

print("\nSmoker vs. Claim Crosstab:")
print(pd.crosstab(df["smoker"], df["insuranceclaim"]))

Smoker counts:
smoker
0    1064
1     274
Name: count, dtype: int64

Smoker vs. Claim Crosstab:
insuranceclaim    0    1
smoker                  
0               530  534
1                25  249


## Step 3: Data Exploration Summary

There are 274 smokers and 1064 non-smokers in the dataset.  
Out of the smokers, 249 filed insurance claims.  
Out of the non-smokers, 534 filed claims.

This suggests a higher proportion of smokers file claims (about 90.9% versus 50.2% for non-smokers), but a hypothesis test is needed to check whether the difference is statistically significant.
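The claim rates behind that comparison can be read off a row-normalised crosstab. The sketch below is only an illustration: `demo` is a hypothetical stand-in for `df`, rebuilt from the crosstab counts printed above so the cell runs on its own, and `rates`, `smoker_rate` and `nonsmoker_rate` are names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`, rebuilt from the crosstab
# counts above (only the two columns needed here are reproduced).
rows = (
    [(0, 0)] * 530 + [(0, 1)] * 534   # non-smokers: no claim / claim
    + [(1, 0)] * 25 + [(1, 1)] * 249  # smokers: no claim / claim
)
demo = pd.DataFrame(rows, columns=["smoker", "insuranceclaim"])

# normalize="index" makes each row sum to 1, so column 1 is the claim rate
rates = pd.crosstab(demo["smoker"], demo["insuranceclaim"], normalize="index")
print(rates.round(3))

smoker_rate = rates.loc[1, 1]
nonsmoker_rate = rates.loc[0, 1]
print("Difference in claim rates:", round(smoker_rate - nonsmoker_rate, 3))
```

With the real data, passing `df["smoker"]` and `df["insuranceclaim"]` to `pd.crosstab` in the same way should give the same table.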

## Step 4: Two-Proportion Z-Test

I want to test if smokers are more likely to file insurance claims than non-smokers.

### Hypotheses:
- **Null Hypothesis (H₀):** There is no difference in proportions. (𝑝₁ = 𝑝₂)
- **Alternative Hypothesis (H₁):** Smokers have a higher proportion of claims. (𝑝₁ > 𝑝₂)

I'll use a one-sided two-proportion z-test, with a significance level of 0.05, to check if the observed difference is statistically significant.
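For reference, the statistic for this kind of test is commonly written in its pooled form, shown below. The symbols are notation introduced here, not taken from the dataset: $x_1, x_2$ are the numbers of claims among smokers and non-smokers, $n_1, n_2$ are the group sizes, and $\hat{p}_i = x_i / n_i$. This assumes the pooled-variance version of the test, which is my understanding of the default behaviour of `proportions_ztest`.

$$
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

Under H₀, $z$ is approximately standard normal, so the one-sided p-value is the upper-tail probability $P(Z \ge z)$.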

In [8]:
from statsmodels.stats.proportion import proportions_ztest

# Claim counts for each group
claim_counts = [249, 534]  # [smokers who claimed, non-smokers who claimed]

# Total people in each group
group_sizes = [274, 1064]  # [total smokers, total non-smokers]

# Run one-sided (smoker > non-smoker) z-test
z_stat, p_value = proportions_ztest(count=claim_counts, nobs=group_sizes, alternative='larger')

print("Z-statistic:", z_stat)
print("P-value:", p_value)

Z-statistic: 12.190247084419937
P-value: 1.751787457205559e-34
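As a sanity check, the same statistic can be recomputed by hand with only the standard library. This is a minimal sketch assuming the pooled-variance form of the test; the names `z_manual` and `p_manual` are introduced here for illustration.

```python
import math

# Counts taken from the crosstab output earlier in the notebook
x1, n1 = 249, 274     # smokers: claims, group size
x2, n2 = 534, 1064    # non-smokers: claims, group size

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled claim proportion under H0

# Standard error of the difference in proportions under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z_manual = (p1 - p2) / se

# Upper-tail normal probability; erfc avoids rounding to 0 for a z this large
p_manual = 0.5 * math.erfc(z_manual / math.sqrt(2))

print(f"z = {z_manual:.2f}, p = {p_manual:.2e}")
```

If the pooled assumption holds, both numbers should agree with the `proportions_ztest` output above up to floating-point rounding.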


## Step 5: Final Summary & Conclusion

In this project, I looked at whether smokers are more likely to file insurance claims compared to non-smokers. I used a one-sided two-proportion z-test to compare the proportion of each group that filed a claim.

### Key Findings:
- 249 out of 274 smokers filed claims (~90.9%)
- 534 out of 1064 non-smokers filed claims (~50.2%)
- Z-statistic: 12.19  
- P-value: 1.75e-34 (far below the 0.05 significance level)

### Conclusion:
Because the p-value is so small, I can confidently reject the null hypothesis. There’s strong evidence that smokers are more likely to file insurance claims than non-smokers.

This analysis could be useful for insurance providers when evaluating risk based on lifestyle choices like smoking.
