<a href="https://colab.research.google.com/github/rutripathi96/Predicting_Health_Insurance_Charges/blob/main/Multiple_Linear_Regression_Project_Predicting_Health_Insurance_Charges.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression Project: Predicting Health Insurance Charges

## Introduction

In this project, we aim to build a Multiple Linear Regression model to predict individual medical costs billed by health insurance. The dataset contains various factors that may influence health insurance charges, including age, gender, BMI (Body Mass Index), number of children, smoking status, and region.

### Objective

The primary objective of this project is to analyze the relationships between the independent variables (age, gender, BMI, children, smoker, region) and the dependent variable (health insurance charges). By leveraging multiple linear regression, we seek to create a predictive model that can estimate insurance charges based on these factors.

### Dataset

The dataset used for this project is the [Medical Cost Personal Datasets](https://www.kaggle.com/mirichoi0218/insurance) available on Kaggle. It includes information about individuals, their medical charges, and various attributes that may impact insurance costs.

### Variables

1. **Age:** The age of the individual.
2. **Gender:** Gender of the individual (male/female), stored in the `sex` column.
3. **BMI (Body Mass Index):** Body weight in kilograms divided by the square of height in meters.
4. **Children:** Number of children/dependents covered by the insurance.
5. **Smoker:** Whether the individual is a smoker (yes/no).
6. **Region:** The geographic region of the individual (northeast, northwest, southeast, southwest).
7. **Charges (Target Variable):** Individual medical costs billed by health insurance.

### Methodology

1. **Data Exploration:** Explore the dataset to understand its structure, check for missing values, and gain insights into the distribution of variables.
2. **Data Preprocessing:** Handle categorical variables, encode them, and scale numerical features if needed.
3. **Exploratory Data Analysis (EDA):** Analyze the relationships between variables through visualizations and statistical summaries.
4. **Feature Engineering:** If necessary, create new features or transform existing ones to improve model performance.
5. **Model Training:** Implement a Multiple Linear Regression model using the selected features.
6. **Model Evaluation:** Evaluate the model's performance on a held-out test set and interpret the results.
7. **Prediction:** Make predictions on new data points to estimate health insurance charges.

Let's begin the exploration and analysis!


# Data Exploration and Preprocessing

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [25]:
dataset = pd.read_csv('insurance.csv')


In [26]:
# Features: every column except the last; target: the last column (charges)
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

In [27]:
print(X)

      age     sex     bmi  children smoker     region
0      19  female  27.900         0    yes  southwest
1      18    male  33.770         1     no  southeast
2      28    male  33.000         3     no  southeast
3      33    male  22.705         0     no  northwest
4      32    male  28.880         0     no  northwest
...   ...     ...     ...       ...    ...        ...
1333   50    male  30.970         3     no  northwest
1334   18  female  31.920         0     no  northeast
1335   18  female  36.850         0     no  southeast
1336   21  female  25.800         0     no  southwest
1337   61  female  29.070         0    yes  northwest

[1338 rows x 6 columns]


In [28]:
print(y)

0       16884.92400
1        1725.55230
2        4449.46200
3       21984.47061
4        3866.85520
           ...     
1333    10600.54830
1334     2205.98080
1335     1629.83350
1336     2007.94500
1337    29141.36030
Name: charges, Length: 1338, dtype: float64


**Checking for missing values**

In [29]:
print(dataset.isnull().sum())

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


No column contains missing values, so no imputation is needed.
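
Beyond the null check, it can help to see which columns are categorical and how their levels are distributed before encoding. The sketch below is illustrative only: `sample` holds six rows copied from the printed output above rather than the full `insurance.csv`, and `summarize` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Six rows copied from the printed X and y above (indices 0-4 and 1334);
# the notebook itself works on all 1338 rows of insurance.csv.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 18],
    "sex": ["female", "male", "male", "male", "male", "female"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88, 31.92],
    "children": [0, 1, 3, 0, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "northeast"],
    "charges": [16884.924, 1725.5523, 4449.462,
                21984.47061, 3866.8552, 2205.9808],
})


def summarize(df):
    """Return missing-value counts, non-numeric column names, and level frequencies."""
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "missing": df.isnull().sum().to_dict(),
        "categorical": categorical,
        "levels": {col: df[col].value_counts().to_dict() for col in categorical},
    }


summary = summarize(sample)
print(summary["categorical"])       # columns that will need encoding
print(summary["levels"]["smoker"])  # frequency of each smoker level
print(sample.describe())            # basic statistics for the numeric columns
```

Running `summarize(dataset)` on the full data would give the same kind of overview for all 1338 rows.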

# Encoding the independent variables

**Encoding the sex feature**

In [30]:
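from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 1 (sex); the encoded columns are placed first,
# followed by the untouched (passthrough) columns
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))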

In [31]:
print(X)

[[1.0 0.0 19 ... 0 'yes' 'southwest']
 [0.0 1.0 18 ... 1 'no' 'southeast']
 [0.0 1.0 28 ... 3 'no' 'southeast']
 ...
 [1.0 0.0 18 ... 0 'no' 'southeast']
 [1.0 0.0 21 ... 0 'no' 'southwest']
 [1.0 0.0 61 ... 0 'yes' 'northwest']]


**Encoding the smoker feature**

In [32]:
# After the previous step the columns are [sex (2 dummies), age, bmi, children, smoker, region],
# so smoker is now at index 5
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [5])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [33]:
print(X)

[[0.0 1.0 1.0 ... 27.9 0 'southwest']
 [1.0 0.0 0.0 ... 33.77 1 'southeast']
 [1.0 0.0 0.0 ... 33.0 3 'southeast']
 ...
 [1.0 0.0 1.0 ... 36.85 0 'southeast']
 [1.0 0.0 1.0 ... 25.8 0 'southwest']
 [0.0 1.0 1.0 ... 29.07 0 'northwest']]


**Encoding the region feature**

In [35]:
# The columns are now [smoker (2 dummies), sex (2 dummies), age, bmi, children, region],
# so region is at index 7
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [36]:
print(X)

[[0.0 0.0 0.0 ... 19 27.9 0]
 [0.0 0.0 1.0 ... 18 33.77 1]
 [0.0 0.0 1.0 ... 28 33.0 3]
 ...
 [0.0 0.0 1.0 ... 18 36.85 0]
 [0.0 0.0 0.0 ... 21 25.8 0]
 [0.0 1.0 0.0 ... 61 29.07 0]]


# Splitting the data into training and test sets

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model Training

In [40]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)

# Model Evaluation

In [41]:
y_pred = lr.predict(X_test)

In [43]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)

In [44]:
print(mse)

31827950.22952382


In [45]:
# RMSE is the square root of the MSE, so it is in the same units as charges
rmse = np.sqrt(mse)
rmse

5641.626558850188
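
MSE and RMSE depend on the scale of the target, so it can be useful to report a scale-free measure alongside them. The sketch below computes MSE, RMSE, MAE, and R² directly from their definitions with NumPy. `regression_metrics` is a hypothetical helper, and the three-value arrays are toy numbers chosen only to make the arithmetic easy to follow, not results from this model.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(residuals))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Toy example: one prediction is off by 1, the others are exact
toy = regression_metrics([1, 2, 3], [1, 2, 4])
print(toy)

# The RMSE reported above is the square root of the reported MSE
rmse_from_mse = float(np.sqrt(31827950.22952382))
print(rmse_from_mse)
```

In the notebook, `regression_metrics(y_test, y_pred)` would give the same MSE and RMSE as the cells above, plus MAE and R².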

In [46]:
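# Coefficient order follows the final column layout: region (4), smoker (2), sex (2), age, bmi, children
print(lr.coef_)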

[ 4.83840068e+02  2.23707336e+02 -4.29438766e+02 -2.78108638e+02
 -1.18025086e+04  1.18025086e+04  7.73186394e+00 -7.73186394e+00
  2.53700500e+02  3.35962814e+02  4.36910121e+02]


In [47]:
print(lr.intercept_)

-517.136835842577
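
To see how the fitted model turns these numbers into a prediction, the sketch below applies the printed intercept and coefficients by hand to the first row of the dataset. It assumes the column layout implied by the encoding steps (region, smoker, and sex dummies followed by age, bmi, children) and that each feature's dummies are in alphabetical order, which is `OneHotEncoder`'s default. `one_hot`, `encode_row`, and `predict_charge` are hypothetical helpers.

```python
import numpy as np

# Coefficients and intercept copied from the printed output above
coef = np.array([
    4.83840068e+02, 2.23707336e+02, -4.29438766e+02, -2.78108638e+02,  # region
    -1.18025086e+04, 1.18025086e+04,                                   # smoker: no, yes
    7.73186394e+00, -7.73186394e+00,                                   # sex: female, male
    2.53700500e+02, 3.35962814e+02, 4.36910121e+02,                    # age, bmi, children
])
intercept = -517.136835842577

# Assumed dummy order: alphabetical within each feature
REGIONS = ["northeast", "northwest", "southeast", "southwest"]
SMOKER = ["no", "yes"]
SEX = ["female", "male"]


def one_hot(value, levels):
    """Return a 0/1 list with a 1 at the position of `value`."""
    return [1.0 if value == level else 0.0 for level in levels]


def encode_row(age, sex, bmi, children, smoker, region):
    """Build the 11-element feature vector in the model's column order."""
    return np.array(
        one_hot(region, REGIONS) + one_hot(smoker, SMOKER) + one_hot(sex, SEX)
        + [age, bmi, children],
        dtype=float,
    )


def predict_charge(age, sex, bmi, children, smoker, region):
    """Linear prediction: intercept plus the dot product of coefficients and features."""
    return float(intercept + coef @ encode_row(age, sex, bmi, children, smoker, region))


# First row of the dataset: 19-year-old female smoker from the southwest
first_row = predict_charge(age=19, sex="female", bmi=27.9, children=0,
                           smoker="yes", region="southwest")
print(round(first_row, 2))
```

Under these assumptions the prediction for the first row comes to about 25,200, compared with an actual charge of 16,884.92. The two smoker coefficients are equal and opposite, so switching smoking status alone shifts the prediction by roughly 23,600.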
