## **The Dataset:** Medical Cost Dataset 

### **Potential Factors:** Age, Smoker (smoker/non-smoker)

### **Response Variable:** Charges

### **Hypotheses**
##### H01: Age is not associated with differences in medical charges.
##### HA1: Age is associated with differences in medical charges.

##### H02: Smoking status (smoker vs. non-smoker) is not associated with differences in medical charges.
##### HA2: Smoking status is associated with differences in medical charges.

##### H03: There is no interaction effect between age and smoking status on medical charges.
##### HA3: There is an interaction effect between age and smoking status on medical charges (meaning the impact of one factor on charges depends on the level of the other factor).
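The three hypothesis pairs can be read against the standard two-way ANOVA model with interaction. The notation below is an assumption introduced for illustration (it does not appear in the notebook), and it treats age as a categorical factor, which matches the `C(age)` term used in the model fitted later:

$$
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
$$

Here $Y_{ijk}$ is the charge for observation $k$ in age level $i$ and smoking level $j$, $\mu$ is the overall mean, $\alpha_i$ is the age effect, $\beta_j$ is the smoking effect, $(\alpha\beta)_{ij}$ is the interaction effect, and $\varepsilon_{ijk}$ is the random error. In these terms:

$$
H_{01}: \alpha_i = 0 \ \text{for all } i, \qquad
H_{02}: \beta_j = 0 \ \text{for all } j, \qquad
H_{03}: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
$$

Each alternative hypothesis states that at least one of the corresponding effects is nonzero.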

### 1. Import necessary libraries

In [155]:
# Import the libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

### 2. Load the data

In [156]:
df = pd.read_csv('medical_cost.csv')

### 3. Data Preprocessing
1. Selecting the relevant columns
2. Handling missing values


In [157]:
# Keep only the columns used in the analysis
df = df[['age', 'smoker', 'charges']]
# Handle missing values (if any)
df = df.dropna()
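Before dropping rows, it can help to see how many values are actually missing, since `dropna()` removes any row with at least one missing entry. The following is a minimal sketch on a made-up toy frame (the `toy` data and variable names are hypothetical, not taken from `medical_cost.csv`):

```python
import pandas as pd

# Hypothetical toy data with the same three columns as the notebook
toy = pd.DataFrame({
    "age": [19, 18, None, 33],
    "smoker": ["yes", "no", "no", None],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47],
})

# Count missing values per column before removing anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows containing any missing value and report how many were lost
rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

Running the same two checks on the real `df` would show whether the `dropna()` step changes the sample at all.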

### 4. Data Exploration

In [158]:
print(df.head())

   age smoker      charges
0   19    yes  16884.92400
1   18     no   1725.55230
2   28     no   4449.46200
3   33     no  21984.47061
4   32     no   3866.85520


### 5. Performing Stratified Sampling

In [None]:
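# Perform stratified sampling: draw up to 100 rows from each smoker group.
# No random_state is set, so the sampled rows differ between runs.
df = pd.concat([group.sample(min(len(group), 100)) for _, group in df.groupby('smoker')])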
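The sampling step caps each smoker group at 100 rows rather than sampling in proportion to group size. A reproducible variant might look like the sketch below, which is an illustration on synthetic data and not the notebook's own method. The helper name `sample_per_group`, the `seed` argument, and the toy frame are assumptions.

```python
import pandas as pd

def sample_per_group(frame, column, cap, seed):
    """Draw at most `cap` rows from each group of `column`, reproducibly."""
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in frame.groupby(column)
    ]
    return pd.concat(parts)

# Hypothetical toy data: 150 non-smokers and 40 smokers
toy = pd.DataFrame({
    "smoker": ["no"] * 150 + ["yes"] * 40,
    "charges": list(range(190)),
})

first = sample_per_group(toy, "smoker", cap=100, seed=0)
second = sample_per_group(toy, "smoker", cap=100, seed=0)

# Groups larger than the cap are reduced to 100; smaller groups are kept whole
group_sizes = first["smoker"].value_counts().to_dict()
print(group_sizes)

# The same seed selects the same rows on every run
print(first.index.equals(second.index))
```

Fixing the seed makes the downstream ANOVA table repeatable. Without it, every run analyses a different subsample.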

In [160]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 367 to 296
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   age      200 non-null    int64  
 1   smoker   200 non-null    object 
 2   charges  200 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.2+ KB
None


### 6. Fitting the model and running ANOVA

In [None]:
# Fit the model; C(age) treats each distinct age as its own factor level
model = ols('charges ~ C(age) * C(smoker)', data=df).fit()

# Perform the two-way ANOVA with Type II sums of squares
anova_table = sm.stats.anova_lm(model, typ=2)

In [162]:
print(anova_table)

                        sum_sq     df           F        PR(>F)
C(age)            6.530507e+09   46.0    1.946522  3.690067e-03
C(smoker)         1.980275e+10    1.0  271.516245  4.717964e-32
C(age):C(smoker)  3.740502e+09   46.0    1.114916  3.198056e-01
Residual          8.387405e+09  115.0         NaN           NaN
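To connect the table back to the three hypotheses, the sketch below recomputes each F statistic as the term's mean square divided by the residual mean square, then compares the reported p-values with a significance level. The numbers are copied from the table above. The 0.05 threshold is an assumption, since the analysis does not state one.

```python
# Values copied from the ANOVA table above: (sum_sq, df, reported p-value)
anova_rows = {
    "C(age)": (6.530507e09, 46.0, 3.690067e-03),
    "C(smoker)": (1.980275e10, 1.0, 4.717964e-32),
    "C(age):C(smoker)": (3.740502e09, 46.0, 3.198056e-01),
}
residual_sum_sq, residual_df = 8.387405e09, 115.0

ALPHA = 0.05  # assumed significance level; not stated in the analysis

# Residual mean square is the denominator of every F statistic
residual_mean_sq = residual_sum_sq / residual_df

f_values = {}
decisions = {}
for term, (sum_sq, df_term, p_value) in anova_rows.items():
    # F = (term sum of squares / term df) / residual mean square
    f_values[term] = (sum_sq / df_term) / residual_mean_sq
    decisions[term] = "reject H0" if p_value < ALPHA else "fail to reject H0"
    print(f"{term:<18} F = {f_values[term]:8.3f}  p = {p_value:.3e}  -> {decisions[term]}")
```

Under the assumed 0.05 level, H01 (age) and H02 (smoking status) are rejected, while H03 (interaction) is not. The recomputed F values match the `F` column of the table up to rounding.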
