![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.tail()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1333,50.0,male,30.97,3.0,no,Northwest,$10600.5483
1334,-18.0,female,31.92,0.0,no,Northeast,2205.9808
1335,18.0,female,36.85,0.0,no,southeast,$1629.8335
1336,21.0,female,25.8,0.0,no,southwest,2007.945
1337,61.0,female,29.07,0.0,yes,northwest,29141.3603


In [2]:
insurance.isna().sum()

age         66
sex         66
bmi         66
children    66
smoker      66
region      66
charges     54
dtype: int64

In [3]:
insurance = insurance.dropna()
insurance.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [4]:
insurance["charges"].head()

0       16884.924
1       1725.5523
2       $4449.462
3    $21984.47061
4      $3866.8552
Name: charges, dtype: object

In [5]:
insurance["region"].unique()

array(['southwest', 'Southeast', 'southeast', 'northwest', 'Northwest',
       'Northeast', 'northeast', 'Southwest'], dtype=object)

In [6]:
insurance["sex"].unique()

array(['female', 'male', 'woman', 'F', 'man', 'M'], dtype=object)

In [7]:
insurance["smoker"].unique()

array(['yes', 'no'], dtype=object)

In [8]:
def clean_data():
    global insurance
    
    def clean_target_variable(value):
        if isinstance(value, str):
            value = value.replace('$', '')
        return float(value)
    
    def handle_negative(value):
        if value < 0 or isinstance(value, str):
            value = 0.0
        return float(value)
    
    def handle_gender(value):
        value = value.replace("F", "female").replace("woman", "female")
        value = value.replace("M", "male").replace("man", "male")
        return value
    
    insurance['charges'] = insurance['charges'].apply(clean_target_variable)
    insurance["region"] = insurance["region"].apply(lambda x: x.lower())
    insurance = insurance[insurance["age"] > 0]
    insurance["children"] = insurance["children"].apply(handle_negative)
    insurance["sex"] = insurance["sex"].apply(handle_gender)

clean_data()
insurance.shape

(1149, 7)

In [9]:
X = insurance.iloc[:, 1:-1]
y = insurance.iloc[:, -1]
X.head()

Unnamed: 0,sex,bmi,children,smoker,region
0,female,27.9,0.0,yes,southwest
1,male,33.77,1.0,no,southeast
2,male,33.0,3.0,no,southeast
3,male,22.705,0.0,no,northwest
4,male,28.88,0.0,no,northwest


In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

categorical_columns = ['sex', 'region', 'smoker']
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_columns) 
    ],
    remainder='passthrough'
)

In [12]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor

lr = LinearRegression()
lasso = Lasso()
ridge = Ridge()
en = ElasticNet()
svr = SVR()
dt = DecisionTreeRegressor(max_depth=5)
br = BaggingRegressor(estimator=dt, n_estimators=300)
rf = RandomForestRegressor(n_estimators=300)
et = ExtraTreesRegressor(n_estimators=300)

regressors = [('linear_regression', lr),
             ('lasso_regression', lasso),
             ('ridge_regression', ridge),
             ('elasticnet_regression', en),
             ('support_vector_regression', svr),
             ('decision_tree_regression', dt),
             ('bagging_regression', br),
             ('random_forest_regression', rf),
             ('extra_trees_regression', et)]

In [13]:
from sklearn.metrics import r2_score as r2, mean_squared_error as MSE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

regressor_names = []
r2_scores = []
cv_scores = []
rmse_scores = []

for reg_name, reg in regressors:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', reg)   
    ])
    cv_score = np.mean(cross_val_score(pipeline, X, y, cv=10))
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    r2_score = r2(y_test, y_pred)
    rmse = (MSE(y_test, y_pred)) ** 1/2
    
    regressor_names.append(reg_name)
    r2_scores.append(r2_score)
    cv_scores.append(cv_score)
    rmse_scores.append(rmse)
    
    print(f"CV Score for {reg_name}: {cv_score}")
    print(f"R2 Score for {reg_name}: {r2_score}")    
    print(f"RMSE Score for {reg_name}: {rmse}\n")    

CV Score for linear_regression: 0.6526004947458592
R2 Score for linear_regression: 0.6953306367174052
RMSE Score for linear_regression: 23155597.766960245

CV Score for lasso_regression: 0.6526013228626449
R2 Score for lasso_regression: 0.695302395959726
RMSE Score for lasso_regression: 23157744.131853685

CV Score for ridge_regression: 0.6526217309302373
R2 Score for ridge_regression: 0.6953252446137506
RMSE Score for ridge_regression: 23156007.57968969

CV Score for elasticnet_regression: 0.42626683526975934
R2 Score for elasticnet_regression: 0.457297533720067
RMSE Score for elasticnet_regression: 41246680.9295142

CV Score for support_vector_regression: -0.10452491296824648
R2 Score for support_vector_regression: -0.08632381255596933
RMSE Score for support_vector_regression: 82563198.92881654

CV Score for decision_tree_regression: 0.7408849539113602
R2 Score for decision_tree_regression: 0.7814682499297196
RMSE Score for decision_tree_regression: 16608933.860027472

CV Score for b

In [14]:
performance_df = pd.DataFrame({'Algorithm':regressor_names,'R2':r2_scores, 'RMSE': rmse_scores, 
                               'CV Score':cv_scores}).sort_values('R2',ascending=False)
performance_df

Unnamed: 0,Algorithm,R2,RMSE,CV Score
6,bagging_regression,0.796557,15462120.0,0.762166
5,decision_tree_regression,0.781468,16608930.0,0.740885
7,random_forest_regression,0.735112,20132110.0,0.714517
0,linear_regression,0.695331,23155600.0,0.6526
2,ridge_regression,0.695325,23156010.0,0.652622
1,lasso_regression,0.695302,23157740.0,0.652601
8,extra_trees_regression,0.677808,24487340.0,0.639649
3,elasticnet_regression,0.457298,41246680.0,0.426267
4,support_vector_regression,-0.086324,82563200.0,-0.104525


In [15]:
from sklearn.ensemble import VotingRegressor

vc = VotingRegressor(estimators=regressors)
pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('vc', vc)   
])

cv_score = np.mean(cross_val_score(pipeline, X, y, cv=10))
    
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
r2_score = r2(y_test, y_pred)
rmse = (MSE(y_test, y_pred)) ** 1/2
    
print(f"CV Score for Voting Regressor: {cv_score}")
print(f"R2 Score for Voting Regressor: {r2_score}")   
print(f"RMSE Score for Voting Regressor: {rmse}\n")    

CV Score for Voting Regressor: 0.7095121199771564
R2 Score for Voting Regressor: 0.7509952471875672
RMSE Score for Voting Regressor: 18924954.698638227



In [16]:
final_model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('br', br)   
])

final_model.fit(X_train, y_train)
y_pred = final_model.predict(X_test)
r2_score = r2(y_test, y_pred)

print(f"R2 Score for Bagging Regressor (Final Model): {r2_score}")   

R2 Score for Bagging Regressor (Final Model): 0.7952270496136279


In [17]:
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)
validation_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,18.0,female,24.09,1.0,no,southeast
1,39.0,male,26.41,0.0,yes,northeast
2,27.0,male,29.15,0.0,yes,southeast
3,71.0,male,65.502135,13.0,yes,southeast
4,28.0,male,38.06,0.0,no,southeast


In [18]:
validation_data.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
dtype: int64

In [19]:
y_pred = final_model.predict(validation_data)
y_pred

array([ 8015.63535416, 22583.97188572, 20947.3402259 , 56566.20579447,
        8526.49678315, 56291.69880382, 10939.16804046,  9704.11950308,
        8590.37418201, 10434.9023138 ,  7605.84160185, 12057.14533   ,
        8001.0319509 ,  7819.12690538,  7677.19943585, 10101.11161476,
        9479.54792112, 56291.69880382, 56249.09928316,  8452.80596499,
        8261.88650706, 12342.4318098 , 43576.73854539,  9266.63871633,
       10331.18606744,  7734.70215809, 56460.79892123,  7206.68635976,
        9502.60263425,  8379.81341839,  9753.67286946, 39445.29165329,
       17655.4851531 ,  7353.60044542, 22163.77804918,  9085.02413528,
       56785.57662799,  8530.35286918,  8887.00252601,  9549.17056285,
       41404.68162847, 10954.73108889,  8082.37622732, 56785.57662799,
        9569.15891186, 43440.41938278, 56688.62921827, 43067.95098298,
        8818.37552176, 43037.27703749])

In [20]:
validation_data["predicted_charges"] = y_pred
validation_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,predicted_charges
0,18.0,female,24.09,1.0,no,southeast,8015.635354
1,39.0,male,26.41,0.0,yes,northeast,22583.971886
2,27.0,male,29.15,0.0,yes,southeast,20947.340226
3,71.0,male,65.502135,13.0,yes,southeast,56566.205794
4,28.0,male,38.06,0.0,no,southeast,8526.496783
