# **Practice Project: Insurance Cost Analysis**

In this project, you will analyze an insurance dataset that uses the parameters listed below.

| Parameter |Description| Content type |
|---|----|---|
|age| Age in years| integer |
|gender| Male or Female|integer (1 or 2)|
| bmi | Body mass index | float |
|no_of_children| Number of children | integer|
|smoker| Whether smoker or not | integer (0 or 1)|
|region| Which US region - NW, NE, SW, SE | integer (1,2,3 or 4 respectively)| 
|charges| Annual Insurance charges in USD | float|
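
Because `region` and `smoker` are stored as integer codes, it can help to decode them when reading results. The following is a minimal sketch on a made-up sample: the `REGION_LABELS` dictionary and the `region_label` and `is_smoker` column names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical lookup built from the parameter table above (1..4 -> NW, NE, SW, SE).
REGION_LABELS = {1: "NW", 2: "NE", 3: "SW", 4: "SE"}

# Tiny made-up sample, not rows from the real dataset.
sample = pd.DataFrame({"region": [1, 4, 2, 3], "smoker": [0, 1, 0, 0]})

# Add readable columns next to the integer codes.
sample["region_label"] = sample["region"].map(REGION_LABELS)
sample["is_smoker"] = sample["smoker"].astype(bool)
print(sample)
```

The integer codes are kept alongside the labels, since the regression models later in the project need numeric inputs.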

## Objectives 
In this project, you will:
 - Load the data as a `pandas` dataframe
 - Clean the data, taking care of the blank entries
 - Run exploratory data analysis (EDA) and identify the attributes that most affect the `charges`
 - Develop single-variable and multi-variable linear regression models for predicting the `charges`
 - Use Ridge regression to refine the performance of the linear regression models
 


# Setup


For this lab, we will be using the following libraries:
*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data.
*   [`numpy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations.
*   [`sklearn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for machine learning and machine-learning-pipeline related functions.
*   [`seaborn`](https://seaborn.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for visualizing the data.
*   [`matplotlib`](https://matplotlib.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for additional plotting tools.


### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Download the dataset to this lab environment

Run the cell below to load the dataset to this lab environment.


In [None]:
filepath = 'medical_insurance_dataset.csv'
df = pd.read_csv(filepath, header=None)

# Task 1 : Import the dataset

Import the dataset into a `pandas` dataframe. Note that there are currently no headers in the CSV file. 

Print the first 10 rows of the dataframe to confirm successful loading.


In [None]:
df.head(10)

Add the headers to the dataframe, as mentioned in the project scenario. 


In [None]:
headers=['age','gender','bmi','no_of_children','smoker','region','charges']
df.columns=headers

In [None]:
df.head()
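
As an alternative sketch, the column names can be supplied while reading the file through the `names` parameter of `pd.read_csv`. The two CSV rows below are made up for illustration and are read from an in-memory string rather than from the project file.

```python
import io
import pandas as pd

# Two made-up rows in the same column order as the project table.
raw_csv = "19,1,27.9,0,1,3,16884.924\n18,2,33.77,1,0,4,1725.5523\n"

headers = ["age", "gender", "bmi", "no_of_children", "smoker", "region", "charges"]

# header=None says the file has no header row; names= assigns the column labels.
demo = pd.read_csv(io.StringIO(raw_csv), header=None, names=headers)
print(demo.head())
```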

Now, replace the '?' entries with `NaN` values.
The '?' placeholders are unwanted entries; converting them to `NaN` lets `pandas` and `numpy` detect and handle them as missing data.

In [None]:
df.replace('?',np.nan,inplace=True)
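
To see the effect of this replacement, here is a small sketch on made-up values (not the project data) that counts the missing entries per column afterwards.

```python
import numpy as np
import pandas as pd

# Made-up values; '?' stands in for a missing entry.
demo = pd.DataFrame({"age": ["19", "?", "28"], "smoker": ["1", "0", "?"]})

# Swap every '?' for NaN so pandas treats it as missing.
cleaned = demo.replace("?", np.nan)

# Count missing entries per column.
missing_counts = cleaned.isna().sum()
print(missing_counts)
```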

# Task 2 : Data Wrangling


Use `dataframe.info()` to identify the columns that have some 'Null' (or NaN) information.


In [None]:
df.info()

Handle missing data:

- For continuous attributes (e.g., age), replace missing values with the mean.
- For categorical attributes (e.g., smoker), replace missing values with the most frequent value.
- Update the data types of the respective columns.
- Verify the update using `df.info()`.


In [None]:
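# Replace missing values in 'age' with the mean age.
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
mean_age=df['age'].astype('float').mean()
df['age']=df['age'].replace(np.nan, mean_age)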

In [None]:
# Replace missing values in 'smoker' with the most frequent value.
is_smoker=df['smoker'].value_counts().idxmax()
df['smoker']=df['smoker'].replace(np.nan,is_smoker)

In [None]:
# Update the data types of the 'age' and 'smoker' columns to int.
df[['age','smoker']] = df[['age','smoker']].astype('int')
# df.info() prints its report directly and returns None, so no print() is needed.
df.info()
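
The same imputation steps can be sketched with `fillna` on a made-up frame where the expected result is easy to check by hand. The values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up sample with one gap in each column.
demo = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "smoker": [0.0, 0.0, np.nan],
})

# Continuous column: fill with the mean of the observed values (30.0 here).
demo["age"] = demo["age"].fillna(demo["age"].mean())

# Categorical column: fill with the most frequent observed value (0 here).
demo["smoker"] = demo["smoker"].fillna(demo["smoker"].mode()[0])

# With no gaps left, both columns can be stored as integers.
demo = demo.astype({"age": "int", "smoker": "int"})
print(demo)
```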

Also note that some values in the `charges` column have more than 2 decimal places. Update the `charges` column so that all values are rounded to 2 decimal places. Verify the conversion by printing the first 5 rows of the updated dataframe.


In [None]:
df['charges'] = df['charges'].round(2)
print(df.head())

# Task 3 : Exploratory Data Analysis (EDA)

Implement the regression plot for `charges` with respect to `bmi`. 


In [None]:
sns.regplot(x="bmi", y="charges", data=df, line_kws={"color": "red"})
plt.ylim(0,)

Implement the box plot for `charges` with respect to `smoker`.


In [None]:
sns.boxplot(x='smoker',y='charges',data=df,color='red')

Print the correlation matrix for the dataset.


In [None]:
df.corr()
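
To identify which attributes relate most strongly to the target, one option is to take the `charges` column of the correlation matrix and sort it by absolute value. The sketch below uses synthetic data constructed so that `smoker` dominates; the coefficients and noise level are assumptions chosen for illustration, not estimates from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: charges depend strongly on smoker and weakly on age.
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.binomial(1, 0.2, size=n),
})
toy["charges"] = 250 * toy["age"] + 24000 * toy["smoker"] + rng.normal(0, 2000, size=n)

# Correlation of each attribute with the target, strongest first.
ranked = toy.corr()["charges"].drop("charges").abs().sort_values(ascending=False)
print(ranked)
```

Correlation only measures linear association, so a low value does not rule out a non-linear relationship.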

Implement the `heatmap` of the correlation matrix.

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()

# Task 4 : Model Development

Fit a linear regression model that may be used to predict the `charges` value, just by using the `smoker` attribute of the dataset. Print the $ R^2 $ score of this model.


In [None]:
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

In [None]:
x=df[['smoker']]
y=df[['charges']]
lm=LinearRegression()
lm.fit(x,y)
print(f"R-squared score: {lm.score(x,y):.4f}")

_The R-squared of 0.62 indicates that approximately 62% of the variance in charges can be explained by the smoker attribute alone._
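
The "variance explained" reading follows from the definition $ R^2 = 1 - SS_{res} / SS_{tot} $, where $ SS_{res} $ is the sum of squared residuals and $ SS_{tot} $ is the sum of squared deviations of the target from its mean. A minimal sketch on synthetic data (a made-up binary predictor and noisy target, not the project data) checks this against `LinearRegression.score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-in for a binary predictor and a noisy target.
x = rng.binomial(1, 0.3, size=(100, 1))
y = 8000 + 23000 * x[:, 0] + rng.normal(0, 6000, size=100)

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"manual: {r2_manual:.4f}, score(): {model.score(x, y):.4f}")
```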

Fit a linear regression model that may be used to predict the `charges` value using all the other attributes of the dataset. Print the $ R^2 $ score of this model. You should see an improvement in the performance.


In [None]:
z=df[["age", "gender", "bmi", "no_of_children", "smoker", "region"]]
lm.fit(z,y)
print(f"R-squared score: {lm.score(z,y):.4f}")

_The R-squared of 0.75 indicates that approximately 75% of the variance in charges can be explained by all attributes combined._

Create a training pipeline that uses `StandardScaler()`, `PolynomialFeatures()` and `LinearRegression()` to create a model that can predict the `charges` value using all the other attributes of the dataset. There should be even further improvement in the performance.


__StandardScaler prevents features with a large range (like age) from dominating the model over those with a smaller range (like smoker). Note that charges is the target, so it is not scaled here.__

__Default polynomial degree here is 2.__

In [None]:
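# Name the step list pipeline_steps rather than input, which shadows a Python built-in.
pipeline_steps=[('scale',StandardScaler()),('poly',PolynomialFeatures(include_bias=False)),('model',LinearRegression())]
pipe=Pipeline(pipeline_steps)
z=z.astype("float")
pipe.fit(z,y)
# Predict on the same data used for fitting, then score those predictions.
ypipe=pipe.predict(z)
print(f"R-squared score: {r2_score(y,ypipe):.4f}")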

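_The R-squared of 0.84 indicates that approximately 84% of the variance in charges can be explained by the scaled degree-2 polynomial features of all attributes combined. Like the two scores before it, this value is computed on the same data used to fit the model._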

# Task 5 : Model Refinement



Split the data into training and testing subsets, assuming that 20% of the data will be reserved for testing.
__This helps you perform unbiased model evaluation and validation.__

In [None]:
x_train,x_test,y_train,y_test = train_test_split(z,y,test_size=0.2,random_state=1)
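
The split also makes it possible to score a pipeline on data it has not seen. The following sketch uses synthetic data with a made-up interaction term (an assumption for illustration, not the project data) and compares the training and testing $ R^2 $ of a scaled polynomial pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)

# Synthetic features and a target containing an interaction term.
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0, 0.5, size=300)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

# Fit on the training rows only, then score on both subsets.
pipe.fit(X_train, y_train)
train_r2 = pipe.score(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(f"train R^2: {train_r2:.4f}, test R^2: {test_r2:.4f}")
```

Only the testing score estimates performance on new data; a large gap between the two values is a common sign of overfitting.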

Initialize a `Ridge` regressor that uses hyperparameter $ \alpha = 0.1 $. Fit the model using the training data subset. Print the $ R^2 $ score for the testing data.


__Ridge regression helps to prevent overfitting by adding a penalty on the size of the model's coefficients to the loss function.__
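
In scikit-learn, `Ridge` minimizes $ \|y - Xw\|_2^2 + \alpha \|w\|_2^2 $, so a larger $ \alpha $ pulls the coefficients toward zero. The sketch below illustrates this shrinkage on synthetic data; the true coefficients and the list of $ \alpha $ values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)

# Synthetic data with known coefficients.
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(0, 1.0, size=50)

# Fit one Ridge model per alpha and record the length of its coefficient vector.
alphas = [0.1, 10.0, 1000.0]
norms = [float(np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_)) for a in alphas]

for a, norm in zip(alphas, norms):
    print(f"alpha={a:>7}: coefficient norm = {norm:.3f}")
```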

In [None]:
ridgeModel=Ridge(alpha=0.1)
ridgeModel.fit(x_train,y_train)
yhat=ridgeModel.predict(x_test)
print(f"R-squared score: {r2_score(y_test,yhat):.4f}")

_The R-squared of 0.67 indicates that, on the testing subset, approximately 67% of the variance in charges is explained by the Ridge model using all attributes combined._

Apply `polynomial` transformation to the training parameters with `degree=2`. Use this transformed feature set to fit the same regression model, as above, using the training subset. Print the $ R^2 $ score for the testing subset.


In [None]:
pr=PolynomialFeatures(degree=2)
# Fit the transformer on the training data only, then reuse it for the test data.
x_train_pr=pr.fit_transform(x_train)
x_test_pr=pr.transform(x_test)
ridgeModel.fit(x_train_pr,y_train)
yhat=ridgeModel.predict(x_test_pr)
print(f"R-squared score: {r2_score(y_test,yhat):.4f}")

_The R-squared of 0.78 indicates that, on the testing subset, approximately 78% of the variance in charges is explained by the Ridge model fitted on degree-2 polynomial features of all attributes._
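
The value $ \alpha = 0.1 $ is fixed in this project. One common way to choose it instead is a cross-validated grid search. The sketch below is an illustration on synthetic data: the candidate grid, the step names, and the data-generating formula are all assumptions, not part of the original lab.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(4)

# Synthetic data with a quadratic term, standing in for the real features.
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.5, size=200)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge", Ridge()),
])

# Hypothetical candidate values; "ridge__alpha" targets the alpha of the "ridge" step.
alpha_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(pipe, param_grid={"ridge__alpha": alpha_grid}, cv=4, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print(f"best alpha: {best_alpha}, mean CV R^2: {search.best_score_:.4f}")
```

If the search is run on the training subset only, the testing subset remains untouched for a final evaluation.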

Use the `cross_val_score` method to estimate the $ R^2 $ score with 4-fold cross-validation.

In [None]:
lre=LinearRegression()
Rcross = cross_val_score(lre,z,y,cv=4)
print(f"R-squared score mean: {Rcross.mean():.4f}")
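
Looking at the individual fold scores, not only their mean, shows how stable the estimate is. For a regressor, an integer `cv` uses consecutive, unshuffled folds; the sketch below passes a shuffled `KFold` instead, which can matter when rows are ordered. The data here is synthetic and the `random_state` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(5)

# Synthetic linear data standing in for the real features and target.
X = rng.normal(size=(120, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 1.0, size=120)

# Four folds, shuffled once with a fixed seed for reproducibility.
folds = KFold(n_splits=4, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=folds)

print("per-fold R^2:", np.round(scores, 4))
print(f"mean: {scores.mean():.4f}, std: {scores.std():.4f}")
```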

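## R-squared score summary for different models with all the attributes

| MLR | Normalized Polynomial Model | Ridge Model (test size = 0.2, $ \alpha = 0.1 $) | Polynomial Ridge Model (degree = 2) | Cross-Validation (cv = 4) |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| 0.75 | 0.84 | 0.67 | 0.78 | 0.75 |

_The first two scores are computed on the data used for fitting, the two Ridge scores on the 20% testing subset, and the last is the mean over 4 cross-validation folds._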

# Project Summary

This project aimed to predict __insurance charges__ using various machine learning models. The initial Exploratory Data Analysis (EDA) suggested that the relationship between the features and the target may not be purely __linear__, which motivated trying more flexible models.

We began with a __Multiple Linear Regression (MLR)__ model, which served as our baseline and achieved an R-squared score of __0.75__. To capture the non-linear relationships, we then used a __Normalized Polynomial Model__, which significantly improved performance to an R-squared of __0.84__.

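To address potential overfitting, we explored __Ridge Regularization__. The basic Ridge model scored lower (R-squared = 0.67), but this score is measured on the held-out testing subset and not on the fitting data, so it is not directly comparable with the 0.75 above; the drop more likely reflects evaluation on unseen data than the penalty, which is small at $ \alpha = 0.1 $. The __Polynomial Ridge Model__ still showed a strong R-squared of __0.78__ on the same testing subset.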

Ultimately, the __Normalized Polynomial Model__ achieved the highest R-squared, although that score was measured on the fitting data. Among the scores measured on held-out data, the __Polynomial Ridge Model__ was the strongest. This analysis demonstrates the importance of __feature engineering__ (using polynomial features) and of evaluating on unseen data; __hyperparameter tuning__ (adjusting the alpha value in the Ridge model) is a natural next step toward a more robust predictive model.

_Note: Only Project Summary is AI generated._

<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-09-16|0.1|Abhishek Gagneja|Initial Version Created|
|2023-09-19|0.2|Vicky Kuo|Reviewed and Revised|
-->
