# Health Insurance Premium Prediction

Health insurance is a type of insurance that covers medical expenses. A policyholder receives this cover in exchange for paying a premium, and many factors determine how large that premium is. If you want to learn how machine learning can be used to predict health insurance premiums, this article is for you. In it, I will take you through the task of health insurance premium prediction with machine learning using Python.

The premium for a health insurance policy varies from person to person because many factors affect it. Take age: a young person is much less likely to have major health problems than an older person, so treating an older person is expected to cost more. That is why an older person is required to pay a higher premium than a young person.

Just like age, many other factors affect the premium for a health insurance policy. I hope you now understand what health insurance is and how its premium is determined. In the section below, I will build the premium prediction model step by step using Python.

Dataset: https://www.kaggle.com/code/sleymananl/medical-insurance-price-prediction/data

In [91]:
#import libraries
import numpy as np #to perform mathematical operations on arrays
import pandas as pd #for data analysis

from sklearn.ensemble import RandomForestRegressor #averages many decision trees, which reduces overfitting
from sklearn.model_selection import train_test_split #splits the data into training and testing sets

In [92]:
#load the dataset into a pandas data frame for manipulating the data
data = pd.read_csv('Health_insurance.csv', encoding = 'latin-1')

#replace null values with an empty string so that missing entries do not cause errors later
#the result is stored in a variable called "medical_premium_data"
medical_premium_data = data.where((pd.notnull(data)), '')

#let's check the shape of the dataset
medical_premium_data.shape

(1338, 7)
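
The cell above replaces nulls with an empty string. This dataset reports no missing values, so the replacement changes nothing here, but on data that does have gaps the same pattern would turn a numeric column into a mixed one. The following is a minimal sketch of that effect on a tiny made-up frame (`demo`, `filled` and `filled_median` are illustrative names, not part of the notebook), together with one possible numeric-friendly alternative.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing BMI value (assumption: illustrative data only)
demo = pd.DataFrame({"age": [19, 18, 28], "bmi": [27.9, np.nan, 33.0]})

# Same pattern as the notebook: keep non-null cells, put '' where a value is missing
filled = demo.where(pd.notnull(demo), "")
print(filled.dtypes)   # 'bmi' is no longer float64 because it now holds a string

# One numeric-friendly alternative: fill missing values with the column median
filled_median = demo.fillna({"bmi": demo["bmi"].median()})
print(filled_median.dtypes)
```

Whether a median fill, dropping the rows, or something else is appropriate depends on the data; the point of the sketch is only that an empty string is not a neutral placeholder for numeric features.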

In [93]:
#printing the dataset
medical_premium_data

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [94]:
#let's see a sample of this dataset in the pandas data frame
#first 10 rows of the dataset
medical_premium_data.head(10)

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


In [95]:
#last 10 rows of the dataset
medical_premium_data.tail(10)

,age,sex,bmi,children,smoker,region,charges
1328,23,female,24.225,2,no,northeast,22395.74424
1329,52,male,38.6,2,no,southwest,10325.206
1330,57,female,25.74,2,no,southeast,12629.1656
1331,23,female,33.4,0,no,southwest,10795.93733
1332,52,female,44.7,3,no,southwest,11411.685
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603


In [96]:
#dataset information
medical_premium_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [97]:
#data preprocessing: check whether there are any empty values
#checking the number of missing values in each column
medical_premium_data.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [98]:
#statistical measures of the numeric columns of the dataset
medical_premium_data.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [99]:
#encoding the categorical features "sex" and "smoker" as numbers
medical_premium_data["sex"] = medical_premium_data["sex"].map({"female": 0, "male": 1})
medical_premium_data["smoker"] = medical_premium_data["smoker"].map({"no": 0, "yes": 1})
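
The mapping above works for the two binary columns. The `region` column has four unordered categories and is simply dropped from the features in this notebook. If one wanted to keep it, one-hot encoding is a common option. The sketch below shows the idea on four rows copied from the dataset preview; `sample` and `encoded` are illustrative names, and this step is an assumption, not part of the original pipeline.

```python
import pandas as pd

# A few rows copied from the dataset preview (illustrative subset only)
sample = pd.DataFrame({
    "age": [19, 18, 33, 37],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Binary columns: same 0/1 mapping as in the notebook
sample["sex"] = sample["sex"].map({"female": 0, "male": 1})
sample["smoker"] = sample["smoker"].map({"no": 0, "yes": 1})

# Multi-category column: one 0/1 indicator column per region
# (assumption: not part of the original pipeline)
encoded = pd.get_dummies(sample, columns=["region"], dtype=int)
print(encoded.columns.tolist())
```

A plain integer mapping such as southwest = 0, southeast = 1 would impose an order on the regions that does not exist, which is why indicator columns are usually preferred for this kind of feature.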

In [100]:
#assigning features as X
#we drop the target column 'charges' and the 'region' column, which is not used as a feature here
#passing columns= tells pandas to drop columns rather than rows, so axis=1 is not needed
X = medical_premium_data.drop(columns=['charges', 'region'])

#assigning targets as Y
Y = medical_premium_data['charges']

In [101]:
print(X) #printing the features
print("---------------------------------------------------------------------------------------------")
print(Y) #printing the target

      age  sex     bmi  children  smoker
0      19    0  27.900         0       1
1      18    1  33.770         1       0
2      28    1  33.000         3       0
3      33    1  22.705         0       0
4      32    1  28.880         0       0
...   ...  ...     ...       ...     ...
1333   50    1  30.970         3       0
1334   18    0  31.920         0       0
1335   18    0  36.850         0       0
1336   21    0  25.800         0       0
1337   61    0  29.070         0       1

[1338 rows x 5 columns]
---------------------------------------------------------------------------------------------
0       16884.92400
1        1725.55230
2        4449.46200
3       21984.47061
4        3866.85520
           ...     
1333    10600.54830
1334     2205.98080
1335     1629.83350
1336     2007.94500
1337    29141.36030
Name: charges, Length: 1338, dtype: float64


## Dividing data into train and test data using sklearn's train_test_split()

In [102]:
#splitting the dataset into training and testing sets

#test_size --> the fraction of the data held out for testing ==> 0.2 ==> 20%

#random_state --> seeds the shuffling; each value of random_state splits the data differently, and any integer works
#we need to pass the same random_state every time if we want to split the data the same way every time
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)

In [103]:
#let's see how many examples there are in each set
#checking dimensions of the features
print(X.shape, X_train.shape, X_test.shape)

(1338, 5) (1070, 5) (268, 5)


In [104]:
#let's see how many examples there are in each set
#checking dimensions of the targets
print(Y.shape, Y_train.shape, Y_test.shape)

(1338,) (1070,) (268,)
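
The sizes printed above follow directly from `test_size = 0.2`: 20% of 1338 rows is 267.6, which scikit-learn rounds up to 268 test rows, leaving 1070 for training. The sketch below reproduces that arithmetic on stand-in arrays with the same number of rows (`X_demo` and `y_demo` are placeholders, not the insurance data).

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1338      # number of rows reported by .shape for the dataset
test_size = 0.2

# Stand-in arrays with the same number of rows (the values are irrelevant here)
X_demo = np.zeros((n_rows, 5))
y_demo = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=test_size, random_state=2
)

# The test set size is the fraction of rows rounded up
expected_test = math.ceil(n_rows * test_size)
print(len(X_te), expected_test, len(X_tr))
```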


In [105]:
# creating the model
# training the model with the training data
model = RandomForestRegressor().fit(X_train, Y_train)
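
The notebook trains the model but does not report how well it does on the held-out data. A minimal sketch of how that could be measured is given here, using mean absolute error and R². It runs on synthetic stand-in data so that it works without the CSV file; `X_demo`, `y_demo` and `demo_model` are illustrative names, and in the notebook one would pass `model`, `X_test` and `Y_test` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the insurance dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(500, 5))
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 4] + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# random_state fixes the forest's random choices so reruns give the same model
demo_model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

mae = mean_absolute_error(y_te, y_pred)   # average absolute error, in target units
r2 = r2_score(y_te, y_pred)               # 1.0 would be a perfect fit
print(f"MAE: {mae:.3f}  R^2: {r2:.3f}")
```

Note that `RandomForestRegressor()` without a `random_state` gives slightly different predictions on every run, so a fixed seed is useful whenever results need to be compared.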

In [106]:
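#no reshaping is needed: X_train and X_test are already 2-D (n_samples, n_features),
#which is the layout scikit-learn expects; flattening them into a single column
#would break the row-for-row match with Y_train and Y_test
assert X_train.shape[0] == Y_train.shape[0]
assert X_test.shape[0] == Y_test.shape[0]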

In [107]:
#predicting the premium for a single customer from keyboard input
print("***Price of Health Insurance***")
print('---------------------------------')

a = int(input("Enter the Age of the Customer : "))
b = int(input("Enter the Gender of the Customer (1 = Male, 0 = Female) : "))
c = float(input("Enter BMI of Customer : "))
d = int(input("Enter no of children : "))
e = int(input("Does the Customer smoke (1 = Yes, 0 = No) : "))

#one row with the features in the same order as the columns of X;
#the double brackets already give the 2-D shape (1, 5) that predict() expects
input_data = np.array([[a, b, c, d, e]])
print(input_data)

#wrap the row in a data frame with the training column names so that
#scikit-learn does not warn about missing feature names
input_df = pd.DataFrame(input_data, columns=X.columns)

#prediction
prediction = model.predict(input_df)
print("Predicted Price : ", prediction)

***Price of Health Insurance***
---------------------------------
Enter the Age of the Customer : 19
Enter the Gender of the Customer (1 = Male, 0 = Female) : 1
Enter BMI of Customer : 29.7
Enter no of children : 0
Does the Customer smoke (1 = Yes, 0 = No) : 0
[[19.   1.  29.7  0.   0. ]]
Predicted Price :  [1423.03839869]


# Summary
So this is how we can build a health insurance premium prediction model using machine learning and the Python programming language. I hope you liked this project on how to build a model with machine learning.