# **Estimate medical insurance cost from a client's information**
@author: Nam Nguyen


## **Dataset Description**
We are using the [Medical Cost Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance) from Kaggle. This dataset contains 7 attributes:

1. **age**: age of the primary beneficiary
2. **sex**: gender of the insurance contractor (female or male)
3. **bmi**: body mass index, an objective measure of body weight relative to height (kg / m ^ 2) that indicates whether weight is high or low for a given height; the ideal range is 18.5 to 24.9
4. **children**: number of children covered by health insurance (number of dependents)
5. **smoker**: whether the beneficiary smokes (yes or no)
6. **region**: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
7. **charges**: individual medical costs billed by health insurance

## **Objectives**
Build a regression model to estimate the medical cost from an individual's information. The tasks break down as follows:

1. **Preprocess** the dataset
2. **Split** the dataset into train and test sets
3. **Fit** the regression model on the train set
4. **Evaluate** the model's performance on the test set using common regression metrics (MAE and coefficient of determination)

In [1]:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## **Preprocess the dataset**

### **1. Load the dataset**

In [2]:
insurance_df = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance_df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


### **2. Split between input data (X) and output values (y)**

In [3]:
X, y = insurance_df.drop(columns="charges"), insurance_df["charges"]

In [4]:
X.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,19,female,27.9,0,yes,southwest
1,18,male,33.77,1,no,southeast
2,28,male,33.0,3,no,southeast
3,33,male,22.705,0,no,northwest
4,32,male,28.88,0,no,northwest


In [5]:
y.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

### **3. Convert continuous variables to categorical variables**

BMI will be converted into the following groups:
1. Underweight: below 18.5
2. Normal weight: 18.5–24.9
3. Overweight: 25–29.9
4. Obesity: 30 or greater

Age will be converted into the following groups:
1. young: 0-30
2. middle_age: 31-60
3. old: 61 and greater
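How values that fall exactly on a boundary are assigned depends on which side of each bin is closed. The sketch below uses a few made-up BMI values (not rows from the dataset) to show how `pd.cut` with `right=False` places edge values such as 18.5, 25 and 30 in the higher group, which matches the ranges listed above.

```python
import pandas as pd

# Toy BMI values chosen to sit on or near the bin edges (illustrative only).
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0])

# right=False makes each bin closed on the left and open on the right:
# [0, 18.5), [18.5, 25), [25, 30), [30, 1000)
bmi_group = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, 1000],
    right=False,
    labels=["underweight", "normal", "overweight", "obesity"],
)
print(bmi_group.tolist())
```

With the default `right=True`, a BMI of exactly 25 would instead land in the "normal" bin, so the flag matters for values on an edge.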


In [6]:
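# right=False gives left-closed bins, e.g. [18.5, 25), so edge values go to the higher group
X["bmi_group"] = pd.cut(X["bmi"],
                        bins=[0, 18.5, 25, 30, 1000], right=False,
                        labels=["underweight", "normal", "overweight", "obesity"])
X["age_group"] = pd.cut(X["age"],
                        bins=[0, 31, 61, 500], right=False,
                        labels=["young", "middle_age", "old"])
# The raw continuous columns are replaced by their grouped versions
X.drop(columns=["age", "bmi"], inplace=True)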

In [7]:
X.head()

Unnamed: 0,sex,children,smoker,region,bmi_group,age_group
0,female,0,yes,southwest,overweight,young
1,male,1,no,southeast,obesity,young
2,male,3,no,southeast,obesity,young
3,male,0,no,northwest,normal,middle_age
4,male,0,no,northwest,overweight,middle_age


## **Split between training and testing**



In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                          test_size=0.20, random_state=42)
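As a quick sanity check on the split, the sketch below runs `train_test_split` on stand-in arrays rather than the real data. It assumes 1338 rows, which is the size of the public `insurance.csv`; the names `demo_X` and `demo_y` are hypothetical. It prints the resulting sizes and confirms that fixing `random_state` reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 1338 rows, the assumed size of insurance.csv.
demo_X = np.arange(1338).reshape(-1, 1)
demo_y = np.arange(1338)

split_a = train_test_split(demo_X, demo_y, test_size=0.20, random_state=42)
split_b = train_test_split(demo_X, demo_y, test_size=0.20, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("train rows:", len(X_tr), "test rows:", len(X_te))

# The same random_state yields the same partition on every call.
print("reproducible:", np.array_equal(X_te, split_b[1]))
```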

## **Train the regression model**

In [9]:
# TO-DO
## Import and initialize a scikit-learn-compatible regression model
## Store the model in the variable "model"



##### Example solution ######
import py566
model = py566.ensemble.BaggingRegressor(ntrees=100,
                    get_learner_func=lambda: py566.tree.StumpRegressor())



print() # Without another statement at the end of the cell, we may have an AttributeError
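`py566` appears to be a course-specific package, so it may not be installable everywhere. As a hedged alternative, the sketch below builds a roughly comparable model from scikit-learn parts only: bagging over depth-1 decision trees (stumps). It assumes the string columns need one-hot encoding before fitting, and it runs on a tiny made-up frame (`toy_X`, `toy_y` are hypothetical), not on the notebook's data. It is not the author's solution.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import BaggingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up frame with the same kind of columns as X (illustrative only).
toy_X = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male", "female"],
    "children": [0, 1, 3, 0, 2, 1],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
})
toy_y = pd.Series([16000.0, 1700.0, 4400.0, 2200.0, 21000.0, 3100.0])

# One-hot encode the string columns; pass numeric columns through unchanged.
encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker"]),
    remainder="passthrough",
)

# Bagging over depth-1 trees (stumps); the first positional argument of
# BaggingRegressor is the base learner.
sk_model = make_pipeline(
    encoder,
    BaggingRegressor(DecisionTreeRegressor(max_depth=1),
                     n_estimators=100, random_state=42),
)
sk_model.fit(toy_X, toy_y)
toy_pred = sk_model.predict(toy_X)
print(toy_pred.shape)
```

Wrapping the encoder and the regressor in one pipeline means the object still exposes `fit` and `predict`, so it could be stored in `model` and used by the later cells unchanged, provided every categorical column of `X` is listed in the encoder.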




In [10]:
# TO-DO
## Fit the model on the training data (X_train, y_train)

##### Example solution ######
model.fit(X_train,y_train)




print() # Without another statement at the end of the cell, we may have an AttributeError




## **Evaluate the regression model**



### **1. Run the model on the test set**

In [11]:
y_pred = model.predict(X_test)

### **2. Calculate MAE and Coefficient of Determination**

In [12]:
from sklearn.metrics import mean_absolute_error, r2_score
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2 score:", r2_score(y_test, y_pred))

MAE: 5624.0478704886655
R2 score: 0.6601589593406862
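To make the two metrics concrete: MAE is the mean of the absolute errors, and the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes both by hand with NumPy on four made-up values (not the model's predictions).

```python
import numpy as np

# Made-up targets and predictions (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# MAE: average absolute error, (1 + 0 + 1 + 2) / 4
mae = np.mean(np.abs(y_true - y_hat))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # 1 + 0 + 1 + 4 = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("R2 score:", r2)
```

MAE is in the same units as `charges`, so the value reported above reads as an average error in billed cost, while R2 is unitless and compares the model against always predicting the mean.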


## **Save the model to a file**


In [17]:
import dill

In [18]:
with open('model.pkl', 'wb') as file:
  dill.dump(model, file)

## **Reload the model from file, and run it on the test data**

In [23]:
with open('model.pkl', 'rb') as file:
  model_test = dill.load(file)
  y_pred_2 = model_test.predict(X_test)
  print("MAE:", mean_absolute_error(y_test, y_pred_2))
  print("R2 score:", r2_score(y_test, y_pred_2))

MAE: 5624.0478704886655
R2 score: 0.6601589593406862
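Matching metrics suggest the reload worked, but comparing the predictions directly is a stricter check. The sketch below illustrates that idea under some assumptions: it uses the standard-library `pickle` (whose `dump`/`load` interface `dill` mirrors), an in-memory buffer instead of `model.pkl`, and a small scikit-learn tree on made-up data rather than the notebook's model.

```python
import io
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up regression problem (illustrative only).
features = np.array([[0.0], [1.0], [2.0], [3.0]])
targets = np.array([1.0, 3.0, 5.0, 7.0])
original = DecisionTreeRegressor(max_depth=2, random_state=0).fit(features, targets)

# Serialize to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump(original, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# A faithful round trip reproduces the predictions exactly.
same = np.array_equal(original.predict(features), restored.predict(features))
print("Predictions identical:", same)
```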
