
# Project Name: Health Insurance Premium Prediction

* Business Objective: Predict the health insurance premium (the 'charges' column) from a policyholder's age, sex, BMI, number of children, and smoking status.

In [1]:
# setting the project folder on Google Drive as the working directory
import os
root = '/content/drive/MyDrive/Portfolio/Data Analytics/Regression/Health Insurance Premium Prediction'
os.chdir(root)

In [2]:
# importing essential libraries
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# importing dataset and viewing the first few rows
health_df = pd.read_csv(os.path.join(root, 'Health_insurance.csv'))
print(health_df.head())

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


In [4]:
# information about the dataset
health_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [6]:
# description of the dataset
health_df.describe().transpose()

           count          mean           std        min         25%       50%           75%          max
age       1338.0     39.207025      14.04996       18.0        27.0      39.0          51.0         64.0
bmi       1338.0     30.663397      6.098187      15.96    26.29625      30.4      34.69375        53.13
children  1338.0      1.094918      1.205493        0.0         0.0       1.0           2.0          5.0
charges   1338.0  13270.422265  12110.011237  1121.8739  4740.28715  9382.033  16639.912515  63770.42801


In [7]:
# Checking for 'nan' values
health_df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

No null values found.

In [8]:
# Checking for duplicate values
health_df.duplicated().sum()

1

One duplicate row was found; it is removed in the next step.
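Before dropping anything, it can help to look at the rows that `duplicated()` flags. The sketch below is only an illustration: `demo_df` is a small hypothetical frame, not the insurance dataset, and its values are made up. It shows how `keep=False` lists every member of a duplicate group and how `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy frame standing in for health_df; the values are illustrative only.
demo_df = pd.DataFrame({
    "age": [19, 19, 28, 33],
    "bmi": [30.59, 30.59, 33.0, 22.705],
    "charges": [1639.56, 1639.56, 4449.46, 21984.47],
})

# keep=False marks every row of a duplicate group, not only the repeats.
dup_rows = demo_df[demo_df.duplicated(keep=False)]
print(dup_rows)

# drop_duplicates() keeps the first occurrence of each group by default.
deduped = demo_df.drop_duplicates()
print(deduped.shape)
```

On the real data, the same pattern would be `health_df[health_df.duplicated(keep=False)]`.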

In [9]:
# removing duplicate value
health_df = health_df.drop_duplicates()
health_df.shape

(1337, 7)

In [14]:
# Data Visualization: smoker counts split by sex
figure = px.histogram(health_df, x = 'sex', color='smoker', title='Number of Smokers by Sex')
figure.show()

Smokers make up a slightly larger share of male than of female policyholders. The chart shows counts only, so it cannot by itself show how smoking relates to the premium; the correlation table later in the notebook covers that.
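The histogram gives counts; to read off the smoker share within each sex directly, a row-normalised cross-tabulation is one option. This is a minimal sketch on a hypothetical frame (`demo_df`, with made-up rows), assuming 'sex' and 'smoker' are still stored as strings as they are at this point in the notebook.

```python
import pandas as pd

# Hypothetical sample; NOT rows from the insurance dataset.
demo_df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female", "female", "male"],
    "smoker": ["yes", "no", "yes", "no", "no", "yes", "no", "no"],
})

# normalize="index" converts each row (one per sex) into fractions that sum to 1.
smoker_share = pd.crosstab(demo_df["sex"], demo_df["smoker"], normalize="index")
print(smoker_share)
```

Applied to the notebook's data, this would be `pd.crosstab(health_df['sex'], health_df['smoker'], normalize='index')`.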

In [15]:
# Feature Engineering: encode the binary categorical columns as 0/1
health_df['sex'] = health_df['sex'].map({'female': 0, 'male': 1})
health_df['smoker'] = health_df['smoker'].map({'no': 0, 'yes': 1})
print(health_df.tail(10))

      age  sex     bmi  children  smoker     region      charges
1328   23    0  24.225         2       0  northeast  22395.74424
1329   52    1  38.600         2       0  southwest  10325.20600
1330   57    0  25.740         2       0  southeast  12629.16560
1331   23    0  33.400         0       0  southwest  10795.93733
1332   52    0  44.700         3       0  southwest  11411.68500
1333   50    1  30.970         3       0  northwest  10600.54830
1334   18    0  31.920         0       0  northeast   2205.98080
1335   18    0  36.850         0       0  southeast   1629.83350
1336   21    0  25.800         0       0  southwest   2007.94500
1337   61    0  29.070         0       1  northwest  29141.36030


In [18]:
# 'region' distribution of population
pie = health_df['region'].value_counts()
regions = pie.index
population = pie.values
fig = px.pie(health_df, values=population, names=regions)
fig.show()

The policyholders are spread almost evenly across the four regions.

In [19]:
# Correlation between the numeric columns ('region' is non-numeric and is excluded)
print(health_df.corr(numeric_only=True))

               age       sex       bmi  children    smoker   charges
age       1.000000 -0.019814  0.109344  0.041536 -0.025587  0.298308
sex      -0.019814  1.000000  0.046397  0.017848  0.076596  0.058044
bmi       0.109344  0.046397  1.000000  0.012755  0.003746  0.198401
children  0.041536  0.017848  0.012755  1.000000  0.007331  0.067389
smoker   -0.025587  0.076596  0.003746  0.007331  1.000000  0.787234
charges   0.298308  0.058044  0.198401  0.067389  0.787234  1.000000


'smoker' has by far the strongest correlation with 'charges' (0.79), followed by 'age' (0.30) and 'bmi' (0.20); 'sex' and 'children' are only weakly correlated with it.
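When only the relationship with the target matters, the 'charges' column of the correlation matrix can be pulled out and sorted. The sketch below assumes pandas 1.5 or newer (for `numeric_only`) and uses a six-row sample (`demo_df`) assembled from the rows printed earlier in this notebook, so its coefficients will not match the full-dataset table.

```python
import pandas as pd

# Small sample for illustration; too few rows to draw conclusions from.
demo_df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 61],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88, 29.07],
    "smoker": [1, 0, 0, 0, 0, 1],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462,
                21984.47061, 3866.8552, 29141.3603],
})

# numeric_only=True skips the string column 'region'.
corr_with_target = (
    demo_df.corr(numeric_only=True)["charges"]
    .drop("charges")               # the target's correlation with itself is always 1
    .sort_values(ascending=False)  # strongest positive correlation first
)
print(corr_with_target)
```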

In [20]:
# Split the data into training (80%) and test (20%) sets; 'region' is not used as a feature
x = np.array(health_df[['age','sex','bmi','smoker','children']])
y = np.array(health_df['charges'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [21]:
# Model Training
forest = RandomForestRegressor()
forest.fit(x_train, y_train)

RandomForestRegressor()

In [22]:
# Predict premiums for the held-out test set and show the first ten
pred = forest.predict(x_test)
data = pd.DataFrame(data = {'Predicted Premium Amount': pred})
print(data.head(10))

   Predicted Premium Amount
0               9088.343497
1              12465.864249
2              12020.150999
3              42702.028583
4               6249.182989
5              10601.991451
6              38363.343624
7               2634.936461
8              10329.784552
9              10767.613645
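The predictions above are not compared with the true test charges. One common way to do that is with the mean absolute error and the R² score from `sklearn.metrics`. The sketch below is self-contained and runs on synthetic stand-in data (`x_demo`, `y_demo`, generated from an assumed formula), so the numbers it prints say nothing about the insurance model itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the insurance dataset): two features and a noisy target.
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 65, size=n)
smoker = rng.integers(0, 2, size=n)
x_demo = np.column_stack([age, smoker])
y_demo = 250 * age + 20000 * smoker + rng.normal(0, 1000, size=n)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# Fixing random_state makes the fitted forest reproducible between runs.
model = RandomForestRegressor(random_state=42)
model.fit(x_tr, y_tr)
pred_demo = model.predict(x_te)

# MAE is in the units of the target; R^2 is at most 1, and higher is better.
mae = mean_absolute_error(y_te, pred_demo)
r2 = r2_score(y_te, pred_demo)
print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.3f}")
```

For the model trained in this notebook, the equivalent calls would be `mean_absolute_error(y_test, pred)` and `r2_score(y_test, pred)`.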
