Step 1: Gathering Data

Load the dataset, then inspect its first rows, column types, and summary statistics

In [1]:
#importing necessary packages
import numpy as np
import pandas as pd

In [2]:
data=pd.read_csv("/content/insurance.csv")

In [3]:
data.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [5]:
data.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801
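Before encoding, it can help to check which columns hold text and which values they take. The following is a minimal sketch, assuming only the five rows printed by data.head() above; the `sample` frame is a hypothetical stand-in rebuilt inline so the cell runs without the CSV file.

```python
import pandas as pd

# The five rows printed by data.head(), rebuilt inline (stand-in for the full dataset).
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Non-numeric columns are the ones that need encoding before modelling.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Count how often each category appears in every text column.
category_counts = {col: sample[col].value_counts().to_dict() for col in categorical_cols}

# Total number of missing cells across the frame.
missing_cells = int(sample.isna().sum().sum())

print(categorical_cols)
print(category_counts)
print(missing_cells)
```

On the full dataset the same calls would be applied to `data` instead of `sample`.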


Step 2: Data Preprocessing

Label Encoding - Converts each categorical variable into integer codes (one-hot encoding would instead create a separate 0/1 column per category)

In [6]:
data['sex']=data['sex'].map({'male':0,'female':1})

In [7]:
data.head()

,age,sex,bmi,children,smoker,region,charges
0,19,1,27.9,0,yes,southwest,16884.924
1,18,0,33.77,1,no,southeast,1725.5523
2,28,0,33.0,3,no,southeast,4449.462
3,33,0,22.705,0,no,northwest,21984.47061
4,32,0,28.88,0,no,northwest,3866.8552


In [8]:
data['smoker']=data['smoker'].map({'no':0,'yes':1})

In [9]:
data['region']=data['region'].map({'northeast':0,'northwest':1,'southeast':2,'southwest':3})

In [10]:
data.head()

,age,sex,bmi,children,smoker,region,charges
0,19,1,27.9,0,1,3,16884.924
1,18,0,33.77,1,0,2,1725.5523
2,28,0,33.0,3,0,2,4449.462
3,33,0,22.705,0,0,1,21984.47061
4,32,0,28.88,0,0,1,3866.8552
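The integer codes 0 to 3 for region impose an order that a linear model treats as a numeric scale, even though the regions have no natural ranking. One common alternative is true one-hot encoding. The sketch below uses pd.get_dummies on the region values from the first five rows; the `regions` frame is a hypothetical stand-in, and this is an illustration rather than the approach used in this notebook.

```python
import pandas as pd

# Region values of the first five rows, before any encoding (stand-in data).
regions = pd.DataFrame(
    {"region": ["southwest", "southeast", "southeast", "northwest", "northwest"]}
)

# One 0/1 indicator column per category that occurs in the data.
one_hot = pd.get_dummies(regions, columns=["region"], dtype=int)

print(one_hot)
```

Each row has exactly one indicator set to 1. With a linear model, passing drop_first=True to pd.get_dummies is often used to avoid a redundant column.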


Separating the independent variables (features) from the dependent variable (target)

In [11]:
X=data[['age','sex','bmi','children','smoker','region']] #independent columns

y=data['charges'] #dependent column

In [12]:
X.head()

,age,sex,bmi,children,smoker,region
0,19,1,27.9,0,1,3
1,18,0,33.77,1,0,2
2,28,0,33.0,3,0,2
3,33,0,22.705,0,0,1
4,32,0,28.88,0,0,1


In [13]:
y.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

Splitting the data into a training set (60%) and a testing set (40%)

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test =train_test_split(X,y, test_size=0.4)

In [29]:
len(X_train)

802

In [30]:
len(y_train)

802

In [31]:
len(X_test)

536

In [32]:
len(y_test)

536
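The split above is random, so the rows in each set (and every number computed from them later) change from run to run. A minimal sketch of a reproducible split follows, assuming a small synthetic array in place of the insurance data; the seed value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

# Both calls return identical train/test arrays.
splits_match = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te), splits_match)
```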

Step 3: Model Building using Linear Regression

In [33]:
from sklearn.linear_model import LinearRegression

In [34]:
model=LinearRegression()

In [35]:
model.fit(X_train,y_train) #Fit (train) the model on the training set

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [36]:
predictions=model.predict(X_test) #Predictions made by the machine learning model on the test set

In [37]:
predictions[0:5]

array([27481.15779973, 12191.35053862, 10941.64214317,  5500.43350621,
       11343.46135554])

In [38]:
model.score(X,y) #R2 score on the full dataset, training rows included; model.score(X_test,y_test) gives the held-out estimate

0.7503534146588591
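A score computed on rows the model was trained on tends to be optimistic, so the usual check is to score only the held-out test set. The sketch below illustrates this on synthetic data (an assumption for illustration, not the insurance dataset) and also reports the mean absolute error, which is expressed in the units of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: a known linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + 5.0 + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)
demo_model = LinearRegression().fit(X_tr, y_tr)

# Evaluate on rows the model has never seen.
test_predictions = demo_model.predict(X_te)
r2_train = demo_model.score(X_tr, y_tr)
r2_test = r2_score(y_te, test_predictions)
mae_test = mean_absolute_error(y_te, test_predictions)

print("R2 (train):", r2_train)
print("R2 (test): ", r2_test)
print("MAE (test):", mae_test)
```

For the notebook's model, the equivalent held-out check would be model.score(X_test, y_test).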

Prediction for a new customer

In [54]:
data_new={'age':67,'sex':0,'bmi':33.7,'children':2,'smoker':1,'region':3}

index=[1]

new_cust_data=pd.DataFrame(data_new,index)

In [55]:
new_cust_data

,age,sex,bmi,children,smoker,region
1,67,0,33.7,2,1,3


In [56]:
prediction_new=model.predict(new_cust_data)

print("Insurance cost for new customer is", prediction_new)

Insurance cost for new customer is [39801.73415197]
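The new customer above was entered with the integer codes typed by hand, which is easy to get wrong. A small helper can apply the same mappings used in Step 2 to a raw record. This is a sketch only: `encode_customer`, the constant names, and `raw_customer` are hypothetical names introduced for illustration, while the mappings and column order are copied from the cells above.

```python
import pandas as pd

# Mappings copied from the preprocessing cells in Step 2.
SEX_CODES = {"male": 0, "female": 1}
SMOKER_CODES = {"no": 0, "yes": 1}
REGION_CODES = {"northeast": 0, "northwest": 1, "southeast": 2, "southwest": 3}

# Column order used when the feature matrix X was built.
FEATURE_ORDER = ["age", "sex", "bmi", "children", "smoker", "region"]


def encode_customer(raw):
    """Return a one-row DataFrame with text fields replaced by their integer codes.

    Raises KeyError if a category was never seen in the mappings.
    """
    encoded = dict(raw)
    encoded["sex"] = SEX_CODES[raw["sex"]]
    encoded["smoker"] = SMOKER_CODES[raw["smoker"]]
    encoded["region"] = REGION_CODES[raw["region"]]
    return pd.DataFrame([encoded], columns=FEATURE_ORDER)


# The same customer as above, written with readable category names.
raw_customer = {
    "age": 67, "sex": "male", "bmi": 33.7,
    "children": 2, "smoker": "yes", "region": "southwest",
}
encoded_customer = encode_customer(raw_customer)
print(encoded_customer)
```

The resulting frame has the same values as new_cust_data and could be passed to model.predict in the same way. Note that age 67 lies above the maximum age of 64 seen in the data, so the prediction is an extrapolation.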
