# Wassim Mecheri - Lab 5
# **Regression Model Comparison**

Objective:
In this lab, you will fit and compare several regression models that predict insurance charges (the outcome variable) using the Insurance Dataset available on Kaggle. The exercise shows how model performance differs across regression approaches.

# **0 Loading Libraries**

In [1]:
from IPython.display import display, Markdown

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error

# **1 Data Preparation**

In [2]:
display(Markdown('## **1.1 Load Dataset**'))
df = pd.read_csv('./insurance.csv')
display(Markdown('**Sample of df**'))
display(df.head())
display(Markdown(f'**Initial shape**: {df.shape}'))

display(Markdown('## **1.2 Handling missing values**'))
clean_df = df.dropna()
display(Markdown(f'**Missing values removed**: {clean_df.shape}'))

display(Markdown('## **1.3 Encoding Categorical Variable**'))
# Map the binary categories to integers. Using .map with .assign builds new
# columns instead of assigning into a possible copy, and keeps them numeric.
clean_df = clean_df.assign(
    sex=clean_df['sex'].map({'male': 0, 'female': 1}),
    smoker=clean_df['smoker'].map({'no': 0, 'yes': 1}),
)
encoded_df = pd.get_dummies(clean_df, columns=['region'])
display(encoded_df.head())

display(Markdown('## **1.4 Data Splitting**'))
y = np.array(encoded_df['charges'])
X = np.array(encoded_df.drop(columns='charges'))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
display(Markdown('**Shapes after splitting:**'))
display(Markdown(f'- X_train shape: {X_train.shape}'))
display(Markdown(f'- X_test shape: {X_test.shape}'))
display(Markdown(f'- y_train shape: {y_train.shape}'))
display(Markdown(f'- y_test shape: {y_test.shape}'))

## **1.1 Load Dataset**

**Sample of df**

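|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |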


**Initial shape**: (1338, 7)

## **1.2 Handling missing values**

**Missing values removed**: (1338, 7)
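
The shape is unchanged after `dropna()`, which suggests the dataset has no missing values. A minimal sketch of how this could be checked per column is shown below. The `toy_df` frame and its values are illustrative stand-ins, not the real `insurance.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for insurance.csv (values are made up)
toy_df = pd.DataFrame({
    'age': [19, 18, np.nan, 33],
    'bmi': [27.9, 33.77, 33.0, np.nan],
    'smoker': ['yes', 'no', 'no', 'no'],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy_df.isna().sum()

# Compare row counts before and after dropping incomplete rows
rows_before = len(toy_df)
rows_after = len(toy_df.dropna())

print(missing_per_column)
print(f'Rows dropped: {rows_before - rows_after}')
```

Running `df.isna().sum()` on the real data would make the "nothing was removed" result explicit instead of leaving it to be inferred from the shape.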

## **1.3 Encoding Categorical Variable**

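|   | age | sex | bmi | children | smoker | charges | region_northeast | region_northwest | region_southeast | region_southwest |
|---|-----|-----|-----|----------|--------|---------|------------------|------------------|------------------|------------------|
| 0 | 19 | 1 | 27.9 | 0 | 1 | 16884.924 | False | False | False | True |
| 1 | 18 | 0 | 33.77 | 1 | 0 | 1725.5523 | False | False | True | False |
| 2 | 28 | 0 | 33.0 | 3 | 0 | 4449.462 | False | True | False | False |
| 3 | 33 | 0 | 22.705 | 0 | 0 | 21984.47061 | False | True | False | False |
| 4 | 32 | 0 | 28.88 | 0 | 0 | 3866.8552 | False | True | False | False |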
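
As a hedged alternative to the encoding step, the sketch below maps the two binary columns with `Series.map` and asks `pd.get_dummies` for integer dummy columns, so that every column ends up with a numeric dtype. The `toy_df` frame is a hypothetical three-row stand-in that borrows values from the sample above.

```python
import pandas as pd

# Hypothetical toy frame using a few rows from the sample output
toy_df = pd.DataFrame({
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'northwest'],
    'charges': [16884.924, 1725.5523, 21984.47061],
})

# Binary categories become 0/1 integers
encoded = toy_df.assign(
    sex=toy_df['sex'].map({'male': 0, 'female': 1}),
    smoker=toy_df['smoker'].map({'no': 0, 'yes': 1}),
)

# One-hot encode region; dtype=int yields 0/1 instead of True/False
encoded = pd.get_dummies(encoded, columns=['region'], dtype=int)

print(encoded.dtypes)
```

Numeric dtypes throughout mean that `np.array(encoded)` produces a numeric array rather than an object array.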


## **1.4 Data Splitting**

**Shapes after splitting:**

- X_train shape: (1070, 9)

- X_test shape: (268, 9)

- y_train shape: (1070,)

- y_test shape: (268,)
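
The 1070/268 split follows from `test_size=0.2` applied to 1338 rows, with the test count rounded up. A small sketch with placeholder arrays of the same shape (zeros, not the real data) reproduces the sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the encoded data (contents are zeros)
X_demo = np.zeros((1338, 9))
y_demo = np.zeros(1338)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (1070, 9) (268, 9)
```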

# **2 Fitting the Models**

In [3]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred_lr = model.predict(X_test)

lasso = Lasso(alpha = 10)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)

mlp_reg = MLPRegressor(hidden_layer_sizes=(50, 30), max_iter=5000, random_state=0)
mlp_reg.fit(X_train, y_train)
y_pred_mlp = mlp_reg.predict(X_test)

svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)

gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
gb_reg.fit(X_train, y_train)
y_pred_gb = gb_reg.predict(X_test)
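
The models above are fit on unscaled features. RBF-kernel SVR and MLPs are generally sensitive to feature scale, so one option worth trying is to wrap them in a pipeline with `StandardScaler`. The sketch below is an assumption about a possible improvement, not part of the original lab. It uses synthetic data (`X_demo`, `y_demo`) and makes no claim about the resulting error on the insurance data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic features on different scales (illustrative, not the insurance data)
X_demo = np.column_stack([
    rng.uniform(18, 64, 300),   # age-like column
    rng.uniform(16, 53, 300),   # bmi-like column
    rng.integers(0, 2, 300),    # smoker-like 0/1 flag
])
y_demo = (250 * X_demo[:, 0] + 300 * X_demo[:, 1]
          + 20000 * X_demo[:, 2] + rng.normal(0, 1000, 300))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fit on the training split only, then applied before SVR
scaled_svr = make_pipeline(
    StandardScaler(),
    SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1),
)
scaled_svr.fit(X_tr, y_tr)
pred = scaled_svr.predict(X_te)

print(pred[:5])
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fit, which matters when the same pipeline is later cross-validated.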

# **3 Evaluation Metrics**

In [4]:
display(Markdown('**MSE results**'))
mse_lr = mean_squared_error(y_test, y_pred_lr)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
mse_mlp = mean_squared_error(y_test, y_pred_mlp)
mse_svr = mean_squared_error(y_test, y_pred_svr)
mse_gb = mean_squared_error(y_test, y_pred_gb)
display(Markdown(f'- **Linear Regression:** {mse_lr:.2f}'))
display(Markdown(f'- **Lasso Regression:** {mse_lasso:.2f}'))
display(Markdown(f'- **MLP Regressor:** {mse_mlp:.2f}'))
display(Markdown(f'- **Support Vector Regression:** {mse_svr:.2f}'))
display(Markdown(f'- **Gradient Boosting Regression:** {mse_gb:.2f}'))

**MSE results**

- **Linear Regression:** 31827950.23

- **Lasso Regression:** 31884809.23

- **MLP Regressor:** 22572526.54

- **Support Vector Regression:** 173060721.83

- **Gradient Boosting Regression:** 16419381.05
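
MSE is expressed in squared units of `charges`, which makes the raw numbers hard to read. A minimal sketch that converts the reported values to RMSE (same units as `charges`) and ranks the models:

```python
import math

# MSE values copied from the results above
reported_mse = {
    'Linear Regression': 31827950.23,
    'Lasso Regression': 31884809.23,
    'MLP Regressor': 22572526.54,
    'Support Vector Regression': 173060721.83,
    'Gradient Boosting Regression': 16419381.05,
}

# RMSE is the square root of MSE
rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}

# Print models from lowest to highest error
for name, value in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f'{name}: RMSE = {value:,.0f}')

best = min(rmse, key=rmse.get)
worst = max(rmse, key=rmse.get)
```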

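**Comparison and Analysis**  
None of the models predicted insurance charges precisely. The MSE values look very large partly because MSE is measured in squared units of `charges`; taking the square root gives a typical error of about 4,050 for the best model. Gradient Boosting performed best (MSE ≈ 16.4 million), followed by the MLP Regressor (≈ 22.6 million). Linear Regression and Lasso Regression were nearly identical (≈ 31.8 and ≈ 31.9 million), and Support Vector Regression was by far the worst (≈ 173.1 million).  
These errors may be due in part to important features missing from the data, which could otherwise help the models capture more complex patterns and relationships. The SVR result in particular may reflect that the features were not scaled before fitting, since RBF-kernel SVR is sensitive to feature scale. Additionally, we could tune the hyperparameters of our models, as we did in Lab 2, to find better settings and reduce the MSE.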
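
As one possible way to carry out the parameter adjustment mentioned above, the sketch below runs a small cross-validated grid search over `GradientBoostingRegressor`. The grid values and the synthetic data (`X_demo`, `y_demo`) are assumptions chosen for illustration; they are not the settings used in Lab 2.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative, not the insurance data)
X_demo = rng.normal(size=(200, 4))
y_demo = (3 * X_demo[:, 0] - 2 * X_demo[:, 1] ** 2
          + rng.normal(scale=0.5, size=200))

# Hypothetical grid; real ranges would depend on the dataset
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3],
}

# scikit-learn maximizes scores, so MSE is exposed as its negative
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Cross-validated MSE: {-search.best_score_:.3f}')
```

On the real data, the search would be fit on `X_train` and `y_train` only, with the test set held back for the final MSE.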