# Medical Insurance Premium Prediction

The premium for a health insurance policy varies from person to person because many factors affect it. Take age: a young person is much less likely to have major health problems than an older person, so treating an older person is typically more expensive. That is why an older person is required to pay a higher premium than a young person.

Just like age, many other factors affect the premium for a health insurance policy. I hope you now understand how the premium for a health insurance policy is determined. In the section below, I will take you through the task of health insurance premium prediction with machine learning using Python.

Dataset: https://www.kaggle.com/tejashvi14/medical-insurance-premium-prediction

In [73]:
# import libraries
import numpy as np # for mathematical operations on arrays
import pandas as pd # for data analysis

from xgboost import XGBRegressor # gradient boosting model for regression
from sklearn.model_selection import train_test_split # splits the data into training and test sets
from sklearn.model_selection import cross_val_score # estimates model performance with cross-validation

In [74]:
# load the dataset into a pandas DataFrame for manipulation
data = pd.read_csv('Medicalpremium.csv', encoding = 'latin-1')

# replace any null values with an empty string and store the result in "medical_premium_data"
# (this dataset has no nulls, so the values are unchanged)
medical_premium_data = data.where((pd.notnull(data)), '')

# check the shape of the dataset
medical_premium_data.shape

(986, 11)

In [75]:
#printing the dataset
print(medical_premium_data)

     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
0     45         0                      0               0                   0   
1     60         1                      0               0                   0   
2     36         1                      1               0                   0   
3     52         1                      1               0                   1   
4     38         0                      0               0                   1   
..   ...       ...                    ...             ...                 ...   
981   18         0                      0               0                   0   
982   64         1                      1               0                   0   
983   56         0                      1               0                   0   
984   47         1                      1               0                   0   
985   21         0                      0               0                   0   

     Height  Weight  KnownA

### Printing the head of the dataset to have a look at the dataframe

In [76]:
# let's look at a sample of the DataFrame
# first 10 rows of the dataset
medical_premium_data.head(10)

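| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 0 | 0 | 0 | 0 | 155 | 57 | 0 | 0 | 0 | 25000 |
| 1 | 60 | 1 | 0 | 0 | 0 | 180 | 73 | 0 | 0 | 0 | 29000 |
| 2 | 36 | 1 | 1 | 0 | 0 | 158 | 59 | 0 | 0 | 1 | 23000 |
| 3 | 52 | 1 | 1 | 0 | 1 | 183 | 93 | 0 | 0 | 2 | 28000 |
| 4 | 38 | 0 | 0 | 0 | 1 | 166 | 88 | 0 | 0 | 1 | 23000 |
| 5 | 30 | 0 | 0 | 0 | 0 | 160 | 69 | 1 | 0 | 1 | 23000 |
| 6 | 33 | 0 | 0 | 0 | 0 | 150 | 54 | 0 | 0 | 0 | 21000 |
| 7 | 23 | 0 | 0 | 0 | 0 | 181 | 79 | 1 | 0 | 0 | 15000 |
| 8 | 48 | 1 | 0 | 0 | 0 | 169 | 74 | 1 | 0 | 0 | 23000 |
| 9 | 38 | 0 | 0 | 0 | 0 | 182 | 93 | 0 | 0 | 0 | 23000 |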


### Printing the tail of the dataset to have a look at the dataframe

In [77]:
#last 10 rows of the dataset
medical_premium_data.tail(10)

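| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 976 | 21 | 0 | 1 | 0 | 0 | 155 | 74 | 0 | 0 | 0 | 39000 |
| 977 | 45 | 0 | 1 | 0 | 1 | 157 | 67 | 0 | 0 | 1 | 25000 |
| 978 | 40 | 0 | 1 | 1 | 0 | 168 | 70 | 0 | 0 | 0 | 17000 |
| 979 | 24 | 0 | 0 | 0 | 0 | 161 | 71 | 0 | 0 | 0 | 15000 |
| 980 | 40 | 0 | 1 | 1 | 0 | 171 | 74 | 0 | 0 | 0 | 38000 |
| 981 | 18 | 0 | 0 | 0 | 0 | 169 | 67 | 0 | 0 | 0 | 15000 |
| 982 | 64 | 1 | 1 | 0 | 0 | 153 | 70 | 0 | 0 | 3 | 28000 |
| 983 | 56 | 0 | 1 | 0 | 0 | 155 | 71 | 0 | 0 | 1 | 29000 |
| 984 | 47 | 1 | 1 | 0 | 0 | 158 | 73 | 1 | 0 | 1 | 39000 |
| 985 | 21 | 0 | 0 | 0 | 0 | 158 | 75 | 1 | 0 | 1 | 15000 |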


In [78]:
# dataset information
medical_premium_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Age                      986 non-null    int64
 1   Diabetes                 986 non-null    int64
 2   BloodPressureProblems    986 non-null    int64
 3   AnyTransplants           986 non-null    int64
 4   AnyChronicDiseases       986 non-null    int64
 5   Height                   986 non-null    int64
 6   Weight                   986 non-null    int64
 7   KnownAllergies           986 non-null    int64
 8   HistoryOfCancerInFamily  986 non-null    int64
 9   NumberOfMajorSurgeries   986 non-null    int64
 10  PremiumPrice             986 non-null    int64
dtypes: int64(11)
memory usage: 84.9 KB


In [79]:
# data preprocessing: check whether there are any empty values
# counting the missing values in each column
medical_premium_data.isnull().sum()

Age                        0
Diabetes                   0
BloodPressureProblems      0
AnyTransplants             0
AnyChronicDiseases         0
Height                     0
Weight                     0
KnownAllergies             0
HistoryOfCancerInFamily    0
NumberOfMajorSurgeries     0
PremiumPrice               0
dtype: int64

In [80]:
#statistical Measures of the dataset
medical_premium_data.describe()

| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 |
| mean | 41.745436 | 0.419878 | 0.46856 | 0.055781 | 0.180527 | 168.182556 | 76.950304 | 0.21501 | 0.117647 | 0.667343 | 24336.713996 |
| std | 13.963371 | 0.493789 | 0.499264 | 0.229615 | 0.384821 | 10.098155 | 14.265096 | 0.411038 | 0.322353 | 0.749205 | 6248.184382 |
| min | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 145.0 | 51.0 | 0.0 | 0.0 | 0.0 | 15000.0 |
| 25% | 30.0 | 0.0 | 0.0 | 0.0 | 0.0 | 161.0 | 67.0 | 0.0 | 0.0 | 0.0 | 21000.0 |
| 50% | 42.0 | 0.0 | 0.0 | 0.0 | 0.0 | 168.0 | 75.0 | 0.0 | 0.0 | 1.0 | 23000.0 |
| 75% | 53.0 | 1.0 | 1.0 | 0.0 | 0.0 | 176.0 | 87.0 | 0.0 | 0.0 | 1.0 | 28000.0 |
| max | 66.0 | 1.0 | 1.0 | 1.0 | 1.0 | 188.0 | 132.0 | 1.0 | 1.0 | 3.0 | 40000.0 |


## Splitting the Features and Target

In [81]:
# assigning the features as X
# drop the target column; columns= already says we are dropping a column, so axis=1 is not needed
X = medical_premium_data.drop(columns='PremiumPrice')

# assigning the target as Y
Y = medical_premium_data['PremiumPrice']

In [82]:
print(X) #printing the features
print("---------------------------------------------------------------------------------------------")
print(Y) #printing the target

     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
0     45         0                      0               0                   0   
1     60         1                      0               0                   0   
2     36         1                      1               0                   0   
3     52         1                      1               0                   1   
4     38         0                      0               0                   1   
..   ...       ...                    ...             ...                 ...   
981   18         0                      0               0                   0   
982   64         1                      1               0                   0   
983   56         0                      1               0                   0   
984   47         1                      1               0                   0   
985   21         0                      0               0                   0   

     Height  Weight  KnownA

## Dividing data into train and test data using sklearn's train_test_split()

In [83]:
# splitting the dataset into training and testing sets

# test_size --> fraction of the data held out for testing ==> 0.2 ==> 20%

# random_state --> seed for the shuffle; each value splits the data differently, and any value can be used
# reuse the same random_state every time to get the same split every time
X_train, X_test, Y_train, Y_test = train_test_split(X.values, Y.values, test_size = 0.2, random_state = 2)

In [84]:
# let's see how many examples there are in each set
# checking the dimensions of the features
print(X.shape, X_train.shape, X_test.shape)

(986, 10) (788, 10) (198, 10)


In [85]:
# let's see how many examples there are in each set
# checking the dimensions of the targets
print(Y.shape, Y_train.shape, Y_test.shape)

(986,) (788,) (198,)
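
The 788 / 198 split can be reproduced by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remaining rows go to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 986   # rows in the dataset, from medical_premium_data.shape
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 788 198
```

So "20%" of 986 rows is 197.2, which becomes 198 test rows rather than 197.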


## Our data is ready for a machine learning algorithm

## XGBoost Regressor

XGBoost is a decision-tree-based ensemble model: it combines many trees rather than relying on a single one, building them one after another so that each new tree corrects the errors of the trees before it.
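
To make the ensemble idea concrete, here is a minimal sketch of boosting with one-split trees ("stumps") on a single feature. It is not XGBoost's actual algorithm, which adds regularization, uses second-order gradient information, and grows deeper trees over many features. The function names and the toy data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Toy data (made up for illustration): age versus premium.
ages = [20, 25, 30, 40, 50, 60]
premiums = [15000, 15000, 21000, 23000, 28000, 29000]

predict = boost(ages, premiums)
mean_premium = sum(premiums) / len(premiums)
baseline_error = sum((y - mean_premium) ** 2 for y in premiums)
boosted_error = sum((y - predict(x)) ** 2 for x, y in zip(ages, premiums))
print(boosted_error < baseline_error)  # True
```

Each stump on its own is a weak predictor; the sum of many small corrections is what fits the data.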

In [86]:
# loading the model
# training the model with training data
model = XGBRegressor().fit(X_train, Y_train)



## Evaluation

### Prediction on training data

In [87]:
# predictions on the training data
training_data_prediction = model.predict(X_train)
training_data_prediction

array([29594.854 , 28790.062 , 28575.408 , 14265.516 , 24974.87  ,
       14866.936 , 15979.672 , 15508.825 , 28971.584 , 23272.492 ,
       27680.537 , 28327.404 , 28387.502 , 16189.781 , 27890.223 ,
       28906.924 , 29084.477 , 33600.04  , 22952.07  , 36981.25  ,
       27370.951 , 16967.479 , 16505.916 , 24951.338 , 28358.611 ,
       17292.85  , 15808.93  , 33647.188 , 22481.506 , 15833.407 ,
       22544.473 , 23704.086 , 22095.467 , 16837.512 , 27478.145 ,
       24720.068 , 22841.309 , 23293.637 , 14816.766 , 26300.713 ,
       24439.582 , 25707.887 , 31443.729 , 16326.456 , 25050.812 ,
       24029.496 , 28536.186 , 29882.512 , 23715.158 , 14631.632 ,
       15564.138 , 15540.994 , 28192.957 , 21967.516 , 23483.951 ,
       17900.924 , 22759.176 , 23926.418 , 15534.99  , 27719.408 ,
       27470.807 , 27957.756 , 29904.611 , 14680.438 , 14599.995 ,
       31734.125 , 15595.4795, 15079.911 , 25533.389 , 23991.129 ,
       34069.76  , 32060.379 , 27748.629 , 33380.62  , 32340.8

### Prediction on test data

In [88]:
# predictions on the test data
test_data_prediction = model.predict(X_test)
test_data_prediction

array([32148.107, 33382.824, 24283.004, 28490.77 , 30655.836, 29757.732,
       15356.   , 28581.29 , 29510.426, 25947.83 , 28088.967, 23661.596,
       16456.482, 24898.75 , 27752.828, 22673.695, 28533.89 , 24902.42 ,
       30016.426, 14801.767, 35391.02 , 24670.521, 25422.438, 17252.31 ,
       24168.35 , 23456.85 , 15710.571, 23499.516, 25762.557, 28590.564,
       28560.42 , 16167.913, 28095.635, 26826.12 , 23477.008, 28015.035,
       23519.332, 14386.603, 22797.162, 23371.994, 15025.9  , 22894.39 ,
       19832.057, 24581.752, 27138.969, 15147.097, 24502.463, 28419.4  ,
       27712.793, 36215.31 , 24715.69 , 14690.018, 28447.785, 26687.717,
       28402.205, 22025.68 , 22500.623, 27868.105, 27924.584, 25304.625,
       24283.004, 14714.933, 28689.883, 24715.69 , 22673.695, 33620.324,
       28419.432, 27232.092, 27264.125, 33433.492, 26775.686, 33608.285,
       24288.885, 36305.812, 38642.996, 16081.326, 17285.172, 28095.723,
       24250.533, 25226.373, 15763.019, 24470.91 , 

## XGBRegressor ML model Score

In [89]:
# coefficient of determination (R^2); score() on a regressor is not an accuracy
# measuring the model's R^2 on the training data
model.score(X_train, Y_train)

0.8791766224293489

In [90]:
# coefficient of determination (R^2); score() on a regressor is not an accuracy
# measuring the model's R^2 on the test data
model.score(X_test, Y_test)

0.796115496644966
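
For a regressor, `score()` reports the coefficient of determination R², not a classification accuracy. A minimal sketch of how R² is computed is shown here; the helper name `r_squared` and the toy values are made up for illustration and are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot


# Toy values (made up for illustration).
actual = [15000, 23000, 28000, 38000]
perfect = list(actual)
always_mean = [sum(actual) / len(actual)] * len(actual)

print(r_squared(actual, perfect))      # 1.0
print(r_squared(actual, always_mean))  # 0.0
```

Read this way, a test score of about 0.80 means the model explains roughly 80% of the variation in premium on unseen rows, compared with about 88% on the rows it was trained on.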

Let's cross-validate and check how the model performs.

In [91]:
# cross validation
# it gives a more reliable estimate of performance on unseen data and helps detect overfitting,
# particularly in a case where the amount of data may be limited. In cross-validation,
# you make a fixed number of folds (or partitions) of the data, hold out each fold in turn for testing, and then average the scores.
# cv = 5 ==> each round trains on 4 folds and tests on the remaining 1
print(cross_val_score(model, X, Y, cv = 5))

[0.78072397 0.72508556 0.80267643 0.87316903 0.64958212]
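
The five numbers are easier to read once summarised. The sketch below assumes the folds are contiguous, unshuffled blocks in which the first `n_samples % n_folds` folds receive one extra row; the helper `fold_sizes` is hypothetical and only illustrates the partitioning.

```python
from statistics import mean, stdev


def fold_sizes(n_samples, n_folds):
    """Sizes of contiguous folds; the first n_samples % n_folds folds get one extra row."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(986, 5)
print(sizes)  # [198, 197, 197, 197, 197]

# Fold scores copied from the cross_val_score output.
scores = [0.78072397, 0.72508556, 0.80267643, 0.87316903, 0.64958212]
print(round(mean(scores), 3), round(stdev(scores), 3))
```

The mean fold score is about 0.77, slightly below the single test-split score, and the spread between the best fold (0.87) and the worst (0.65) shows how much the estimate depends on which rows are held out.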


## Making a Predictive System

In [93]:
# predicting from user input
print("***Price of Medical Insurance***")
print('---------------------------------')

a = int(input("Enter the Age of the Customer : "))
b = int(input("Does the Customer have Diabetes (1 = Yes, 0 = No) : "))
c = int(input("Does the Customer have Blood Pressure Problems (1 = Yes, 0 = No) : "))
d = int(input("Does the Customer have Any Transplants (1 = Yes, 0 = No) : "))
e = int(input("Does the Customer have Any Chronic Diseases (1 = Yes, 0 = No) : "))
f = float(input("Height of Customer (in cm) : "))
g = float(input("Weight of Customer (in kg) : "))
h = int(input("Does the Customer have Known Allergies (1 = Yes, 0 = No) : "))
i = int(input("Does the Customer have a History Of Cancer In Family (1 = Yes, 0 = No) : "))
j = int(input("Number Of Major Surgeries for Customer : "))

# a single row (already a 2-D array) with the features in the same order as the training columns
input_data = np.array([[a, b, c, d, e, f, g, h, i, j]])

prediction = model.predict(input_data)
print("-----------------------------------------------------------------------------------")

print('The Medical Insurance cost : $', prediction[0])

***Price of Medical Insurance***
---------------------------------
Enter the Age of the Customer : 19
Does the Customer have Diabetes (1 = Yes, 0 = No) : 0
Does the Customer have Blood Pressure Problems (1 = Yes, 0 = No) : 0
Does the Customer have Any Transplants (1 = Yes, 0 = No) : 0
Does the Customer have Any Chronic Diseases (1 = Yes, 0 = No) : 0
Height of Customer (in cm) : 186
Weight of Customer (in kg) : 100
Does the Customer have Known Allergies (1 = Yes, 0 = No) : 1
Does the Customer have a History Of Cancer In Family (1 = Yes, 0 = No) : 0
Number Of Major Surgeries for Customer : 0
-----------------------------------------------------------------------------------
The Medical Insurance cost : $ 16104.737
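
Because the model was trained on plain arrays (`X.values`), it relies entirely on the position of each value, so the inputs must arrive in the same order as the training columns. One way to guard against mistakes is a small helper like the hypothetical `build_feature_row` sketched here; the column names come from the dataset listing, while the helper and its validation rules are assumptions for illustration.

```python
# Column order taken from the dataset listing; the helper itself is hypothetical.
FEATURE_ORDER = [
    "Age", "Diabetes", "BloodPressureProblems", "AnyTransplants",
    "AnyChronicDiseases", "Height", "Weight", "KnownAllergies",
    "HistoryOfCancerInFamily", "NumberOfMajorSurgeries",
]
BINARY_FEATURES = {
    "Diabetes", "BloodPressureProblems", "AnyTransplants",
    "AnyChronicDiseases", "KnownAllergies", "HistoryOfCancerInFamily",
}


def build_feature_row(answers):
    """Return the answers as a list in training-column order, validating as it goes."""
    missing = [name for name in FEATURE_ORDER if name not in answers]
    if missing:
        raise ValueError(f"missing features: {missing}")
    for name in BINARY_FEATURES:
        if answers[name] not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {answers[name]!r}")
    return [answers[name] for name in FEATURE_ORDER]


# Same customer as in the session shown above.
customer = {
    "Age": 19, "Diabetes": 0, "BloodPressureProblems": 0, "AnyTransplants": 0,
    "AnyChronicDiseases": 0, "Height": 186, "Weight": 100, "KnownAllergies": 1,
    "HistoryOfCancerInFamily": 0, "NumberOfMajorSurgeries": 0,
}
row = build_feature_row(customer)
print(row)  # [19, 0, 0, 0, 0, 186, 100, 1, 0, 0]
```

The resulting list could then be wrapped as `np.array([row])` and passed to `model.predict`, exactly like `input_data` in the cell above.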


# Summary
So this is how we can build a Medical Insurance Premium Prediction model using machine learning and the Python programming language. I hope you liked this project on how to build a model with machine learning.