<a href="https://colab.research.google.com/github/tejatanush/Medical-Charges-Prediction/blob/main/Medical_Charges_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Description
This model is capable of predicting if Medical Chrges prediction based on features like age,BMI, Gender, Smoking, No.of Children..etc . This helps people to get a visualization of how much they will charge lifetime for their health condition. Hospitals and clinics also be helpful by providing information about patient charges quickly.

# Steps to build a model:
1. Import required libraries
2. Import dataset
3. Data Preprocessing
* Find and fill missing values
* Encoding data
* Splitting into training and testing set
* Feature Scaling
4. Selection of model
5. Build a Model
6. Predict Results
7. Evaluate R-Squared score

# 1. Import libraries

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

# 2. Import dataset
This dataset has many independent features to predict medical charges required for a patient for lifetime.

Reference:
https://www.kaggle.com/datasets/mirichoi0218/insurance
Lets split our data into two parts x (dependent variables) and y (independent variable).

In [24]:
dataset=pd.read_csv("Health_Price_Prediction.csv")
x=dataset.iloc[:,0:6].values
y=dataset.iloc[:,6].values
print(dataset.head())

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


# 3. Data Preprocessing

# Find and filling missing values

In [25]:
missing_values = dataset.isnull().sum()
print(missing_values)

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


Hence we have no missing values we can skip this step.

# Encoding data
From our data we have categorical features and some label features. So that we should encode categorcal and label encoding in 2 steps.

In [26]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
le=LabelEncoder()
columns_to_encode=[1,4]
for column in columns_to_encode:
    x[:,column]=le.fit_transform(x[:,column])
ct=ColumnTransformer(transformers=[("encoder1",OneHotEncoder(),[3]),("encoder2",OneHotEncoder(),[5])],remainder='passthrough')
x=np.array(ct.fit_transform(x))
x=x.astype('float32')
y=y.astype('float32')

#Splitting into training set and test set
Let's split data so that 80% of data will be training set and remaining 20% will be testing set.

In [27]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
#Let's check how data splitted
print(x_train)
print(y_train)

[[ 0.    0.    0.   ...  1.   34.1   1.  ]
 [ 1.    0.    0.   ...  1.   34.43  0.  ]
 [ 0.    0.    1.   ...  0.   36.67  1.  ]
 ...
 [ 1.    0.    0.   ...  1.   25.08  0.  ]
 [ 1.    0.    0.   ...  1.   35.53  0.  ]
 [ 0.    1.    0.   ...  0.   18.5   0.  ]]
[40182.246   1137.4697 38511.63   ...  5415.661   1646.4297  4766.022 ]


We can see that data is not in certain order.... it mean the data is splitted in random.

Feature Scaling
Let's Normalize BMI, Age, Charges.Because these are having unique numerical values and by normalizing them model may understand better patterns between them.

In [28]:
from sklearn.preprocessing import MinMaxScaler
sc=MinMaxScaler()
x_train[:,10]=sc.fit_transform((x_train[:,10]).reshape(-1,1)).flatten()
x_test[:,10]=sc.fit_transform((x_test[:,10]).reshape(-1,1)).flatten()
x_train[:,12]=sc.fit_transform((x_train[:,12]).reshape(-1,1)).flatten()
x_test[:,12]=sc.fit_transform((x_test[:,12]).reshape(-1,1)).flatten()
y_train=sc.fit_transform(y_train.reshape(-1,1))
y_test=sc.transform(y_test.reshape(-1,1))
print(x_train[2])
print(y_train[2])

[0.         0.         1.         0.         0.         0.
 1.         0.         0.         0.         0.10869566 0.
 0.5571697  1.        ]
[0.5968175]


So, the Age,BMI,Charges were normalized into values between 0 and 1.

# 4. Selection of model
Hence the prediction of charges will be many unique values which predicted from many different features. So let's use regression model to predict charges.

#5. Build a model
**Create model:** In regressio problems we make our model from baseline machine learning models. I have already evaluated many models out of them Randonm Forest Regressor model is the best for predicting charges as it showing the good R-Swuared values.

In [29]:
from sklearn.ensemble import RandomForestRegressor
regressor=RandomForestRegressor(n_estimators=40,random_state=0)


We have constructed the outline of model in a robust manner.

**Fit Model:** Lets fit our model with x_train and y_train data.

In [33]:
regressor.fit(x_train,y_train)

  regressor.fit(x_train,y_train)


We have fit our model with x_train and y_train and we can see summary of model.

# 6. Predict Results
Let's Predict the results of x_test and compare them with y_test.

In [31]:
y_pred=regressor.predict(x_test)
y_pred=y_pred.reshape(-1,1)
y_pred=sc.inverse_transform(y_pred)
y_test=sc.inverse_transform(y_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

[[10071.37810058  9724.52929688]
 [ 9033.61665596  8547.69140625]
 [49230.05044556 45702.0234375 ]
 [12984.69801217 12950.07128906]
 [ 9651.93545445  9644.25292969]
 [10776.09485658  4500.33935547]
 [ 2880.08498146  2198.18994141]
 [11554.78783755 11436.73730469]
 [ 6756.42886154  7537.1640625 ]
 [ 8477.98674812  5425.0234375 ]
 [ 8101.11640078  6753.03808594]
 [17590.93737446 10493.94433594]
 [ 7883.71860327  7337.74804688]
 [ 5066.70865124  4185.09765625]
 [34090.50384141 18310.7421875 ]
 [13567.3917953  10702.64257812]
 [13387.85017062 12523.60351562]
 [10556.84428718  3490.54907227]
 [ 6627.63718487  6457.84326172]
 [34900.35391572 33475.81640625]
 [23933.56602302 23967.3828125 ]
 [19560.07405222 12643.37792969]
 [10003.36695056 23045.56640625]
 [26753.73577355 23065.41992188]
 [ 2348.19216409  1674.63232422]
 [ 9618.94176272  4667.60742188]
 [ 8114.80911962  3732.625     ]
 [ 7689.37623975  7682.66992188]
 [ 3743.88049934  3756.62158203]
 [ 9738.3459622   8413.46289062]
 [ 7342.10

Thus the results are very good as they are predicting charges which are closer to original charges. This seems that our model trains good and providing good results.**bold text**

#7. Evaluate model score

In [32]:
from sklearn.metrics import r2_score
result=r2_score(y_test,y_pred)
print(result)

0.855152260746799


We are getting a good results that having R-Squared value with nearly 85.5% which mean the model performs very well in predicting charges.