In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest,chi2,f_regression
import seaborn as sns

In [None]:
dataset = pd.read_csv("../input/insurance/insurance.csv")


In [None]:
dataset.shape

In [None]:
dataset.head()

Before moving forward, we have to clean the data. Cleaning is one of the most important parts of data analysis, because the accuracy of the results depends heavily on the quality of the data. Without good-quality data, accurate predictions are not possible.

Here we'll check for missing values and, if there are any, handle them either by imputing them (for example with an imputer) or by deleting the affected data.

Whether to remove data depends on how much is missing for a feature: if we remove a substantial amount of data, we won't be left with much to work with, and the result will be a less accurate model.
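
As a rough sketch of that check, the snippet below measures the share of missing values per column and fills the gaps with the column median. The toy frame and its values are made up for illustration, and `SimpleImputer` with a median strategy is just one possible choice, not necessarily the best one for every feature.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a few gaps; the column names only mimic the insurance data
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "bmi": [27.9, np.nan, 33.0, 25.7],
    "children": [0, 1, 3, 2],
})

# Share of missing values per column, to help decide between imputing and dropping
missing_share = toy.isna().mean()

# Fill the gaps with each column's median
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
```

If a column turned out to be mostly empty, dropping it would probably be safer than imputing it.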

In [None]:
dataset.info()

From the basic info of the dataset, we can see that there are 2 float features, 2 integer features and 3 features of object data type. Since these statistical models work on numeric data, the object-type features need to be converted to numeric values at a later stage.

Charges is our target variable, so we don't need to worry much about it.

The info also shows that there are no null values, so we can proceed to the next stages of data processing without worrying about them.

In [None]:
X_train,X_test,y_train,y_test = train_test_split(dataset.drop(['charges'],axis=1),dataset['charges'],test_size=0.25,random_state=1000)

In [None]:
X_train.head()

First of all, the categorical features will be changed to numeric ones. This step is required because most statistical models cannot process categorical data directly. Here we'll use the basic Label Encoder and One Hot Encoding to turn the categorical features into numeric values.
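
To make the two encodings concrete, here is a minimal sketch on made-up rows (the `train` and `test` frames are hypothetical). It assumes we want the mapping learned on the training split to be reused unchanged on the test split, so that both splits end up with the same codes and the same columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test rows standing in for the 'smoker' and 'region' columns
train = pd.DataFrame({"smoker": ["yes", "no", "no"],
                      "region": ["southwest", "southeast", "northwest"]})
test = pd.DataFrame({"smoker": ["no", "yes"],
                     "region": ["southeast", "southeast"]})

# Label encoding: learn the mapping on train, then reuse it on test
smoker_encoder = LabelEncoder()
train_smoker = smoker_encoder.fit_transform(train["smoker"])
test_smoker = smoker_encoder.transform(test["smoker"])

# One-hot encoding: reindex so the test split gets the same columns as train
train_region = pd.get_dummies(train["region"], dtype=int)
test_region = pd.get_dummies(test["region"], dtype=int).reindex(
    columns=train_region.columns, fill_value=0
)
```

`LabelEncoder` sorts the classes alphabetically, so here 'no' becomes 0 and 'yes' becomes 1.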

In [None]:
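#Fit each encoder on the training split only and reuse the fitted mapping on the test split
sex_encoder = LabelEncoder()
X_train['sex'] = sex_encoder.fit_transform(X_train['sex'])
X_test['sex'] = sex_encoder.transform(X_test['sex'])

smoker_encoder = LabelEncoder()
X_train['smoker'] = smoker_encoder.fit_transform(X_train['smoker'])
X_test['smoker'] = smoker_encoder.transform(X_test['smoker'])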


In [None]:
X_train.region.unique()

In [None]:

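#One-hot encode region and make sure the test split gets the same columns as the training split
new_cols_train = pd.get_dummies(X_train['region'])
new_cols_test = pd.get_dummies(X_test['region']).reindex(columns=new_cols_train.columns,fill_value=0)
X_train.drop(columns=['region'],inplace=True)
X_test.drop(columns=['region'],inplace=True)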


In [None]:
X_train = pd.concat([X_train,new_cols_train],axis=1)
X_test = pd.concat([X_test,new_cols_test],axis=1)
X_train.head()

So here we have the dataset in its most basic form, with all the available features. Now I'll create a predictive model using Linear Regression without any further changes to the dataset. After this we'll move to the feature selection stage to narrow the features down to a more focused dataset. This will show us how much we can improve by enhancing the given data.

You'll be able to clearly see the change in results after the enhancements. So let's move on to creating the basic Linear Regression model.

In [None]:
l_reg = LinearRegression()
l_reg.fit(X_train,y_train)

In [None]:
l_reg.predict(X_test).shape

In [None]:
l_reg.score(X_test,y_test)

The score here is the R² (coefficient of determination) on the test set: the model explains about 79% of the variance in charges, which is quite good for a model on raw data. Now it's time to see if we can improve this by applying feature selection and feature engineering.
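
For a `LinearRegression`, the value returned by `score` is the coefficient of determination R². The sketch below recomputes it by hand on synthetic data (the data and variable names are assumptions for illustration, not the insurance dataset) to show what the number measures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot
```

`manual_r2` matches both `model.score(X, y)` and `r2_score(y, predictions)`, so a score of 0.79 reads as "79% of the variance explained" rather than "79% of predictions correct".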

In [None]:
sns.heatmap(pd.concat([X_train,y_train],axis=1).corr())
plt.show()

Here the last 4 feature columns (the region columns) show noticeable negative correlation with each other. That is expected, because these 4 columns were derived from a single feature: whenever one of them is 1, the other three must be 0.
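
A small sketch shows why columns derived from one categorical feature are negatively correlated with each other. The series below is made up, with the four region names repeated equally often; the real data is not perfectly balanced, so its values will differ somewhat.

```python
import pandas as pd

# Hypothetical, perfectly balanced region column
regions = pd.Series(["northeast", "northwest", "southeast", "southwest"] * 25)

# One-hot columns are mutually exclusive: a 1 in one column forces 0 in the others
dummies = pd.get_dummies(regions, dtype=int)
corr = dummies.corr()
```

With four equally frequent categories, every off-diagonal entry of `corr` is -1/3.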

Apart from them, the correlation between smoker and charges is quite high. Insurance premiums are higher for smokers since they have a higher chance of dying early, which makes it more likely that the insurance will be claimed by the family of the insured. This means that insurance charges depend very strongly on whether a person is a smoker.

Age also has a fairly high correlation with charges, because a person is much less likely to die at an early age than at a later age.

In [None]:
k_best = SelectKBest(f_regression,k=5)
k_best_transformed = k_best.fit_transform(X_train,y_train)
k_best.scores_

In [None]:
X_train.head()

From the scores we can see that the 1st and 5th features (age and smoker) have the highest scores, which means they influence the target variable more than the others. We saw similar behavior for these features in the correlation heatmap.

Based on the scores, we're going to use only the first 5 columns and the southeast column for prediction. I'll train the model again on these columns and see how much difference these enhancements make.
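
Reading raw scores by position is error-prone, so it can help to pair each score with its column name. This is a minimal sketch on synthetic data (the `signal` and `noise_*` columns are hypothetical), using `get_support` to list which columns the selector keeps.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only 'signal' drives the target
rng = np.random.default_rng(42)
features = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
target = 5.0 * features["signal"] + rng.normal(scale=0.5, size=300)

selector = SelectKBest(f_regression, k=1).fit(features, target)

# Pair each F-score with its column name, highest first
scores = pd.Series(selector.scores_, index=features.columns).sort_values(ascending=False)

# Names of the columns the selector would keep
selected_columns = list(features.columns[selector.get_support()])
```

The same two lines at the end should work on `X_train` with the fitted `k_best` object from this notebook.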

In [None]:
sns.pairplot(X_train,hue='sex',vars=['age','bmi','children'])


In [None]:
cols_to_drop = ['northeast','northwest','southwest']
X_train.drop(columns=cols_to_drop,inplace=True)
X_test.drop(columns=cols_to_drop,inplace=True)

In [None]:
l_reg.fit(X_train,y_train)
l_reg.score(X_test,y_test)

Here we've seen that the results have improved, but not significantly, so we can try some feature engineering and introduce features that help us get the most out of the available data.

Feature engineering is the creation of new features from existing ones, so that we get more information out of the data to help with prediction. This process requires business understanding: the better we know the business, the better we can create features that help. We have to try multiple feature combinations and see how well each one works.

As a result of trying feature combinations and my understanding of the insurance domain, I've been able to produce the following few features, which have improved the result.

In [None]:
def age_transform(ages):
    transformed_list = []
    #Here 1 means 'Young', 2 means 'Middle Aged' and 3 means 'Old Age'
    for age in ages:
        if age <= 30:
            transformed_list.append(1)
        elif age < 60:
            transformed_list.append(2)
        else:
            transformed_list.append(3)
    
    return transformed_list


In [None]:
#Adding a new feature 'life_stage' based on persons age
X_train['life_stage'] = age_transform(X_train.age.values)
X_test['life_stage'] = age_transform(X_test.age.values)

In [None]:
def bmi_category(bmi):
    transformed_list = []
    #Here 1 means 'Under weight', 2 means 'Normal' , 3 means 'Over Weight' and 4 means 'Obese'
    for index in bmi:
        if index < 18.5:
            transformed_list.append(1)
        elif index < 25:
            #Upper bounds are exclusive so that values such as 24.95 don't fall through to 'Obese'
            transformed_list.append(2)
        elif index < 30:
            transformed_list.append(3)
        else:
            transformed_list.append(4)
    
    return transformed_list


In [None]:
#We'll replace the bmi values with their corresponding category
X_train['bmi'] = bmi_category(X_train.bmi.values)
X_test['bmi'] = bmi_category(X_test.bmi.values)
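
The banding done by `bmi_category` can also be written without a Python loop. The sketch below uses `pd.cut` on made-up values and assumes the bands are meant to be [18.5, 25) for normal, [25, 30) for overweight and 30 and above for obese; adjust the edges if your bands differ.

```python
import numpy as np
import pandas as pd

# Hypothetical bmi values, including some close to the band edges
bmi = pd.Series([17.0, 18.5, 24.95, 25.0, 29.95, 30.0, 41.2])

# right=False makes each band closed on the left: [18.5, 25), [25, 30), [30, inf)
bmi_band = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=[1, 2, 3, 4],
    right=False,
).astype(int)
```

Because the bins share their edges, no value can fall between two bands.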

In [None]:
def calculate_risk(life_stage,smoker,bmi):
    #Risk goes from 1 to 6 for smokers (smoker == 1), increasing with life stage and bmi category
    #Keys are (life_stage, bmi category); every other combination gets a risk of 0
    risk_levels = {(1,2):1,(1,3):2,(2,2):3,(2,3):4,(3,2):5,(3,3):6}
    transformed_list = []
    for stage,smoke,category in zip(life_stage,smoker,bmi):
        if smoke == 1:
            transformed_list.append(risk_levels.get((stage,category),0))
        else:
            transformed_list.append(0)
    
    return transformed_list

In [None]:
X_train['life_risk'] = calculate_risk(X_train.life_stage.values,X_train.smoker.values,X_train.bmi.values)
X_test['life_risk'] = calculate_risk(X_test.life_stage.values,X_test.smoker.values,X_test.bmi.values)

In [None]:
l_reg.fit(X_train,y_train)
l_reg.score(X_test,y_test)

As you can see, after feature engineering the score of our existing model has improved by about 7.4 percentage points: earlier the R² was 79.36%, and now we're getting 86.75%. This is a huge difference. As you've seen, even with basic Linear Regression we can improve the results drastically if we analyse the data properly. I haven't tried any other model on this dataset. You can try one and see how other models perform on it.

Hope you liked it. Thank you!
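
A single train/test split can flatter or penalise a model by chance, so one way to double-check a gain like this is cross-validation. The sketch below is only an illustration on synthetic data: the cost formula, the `smoker_high_bmi` feature and all the numbers are assumptions, chosen so that the target contains an interaction a plain linear model cannot capture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption): cost jumps only for smokers with a high bmi
rng = np.random.default_rng(7)
n = 500
data = pd.DataFrame({
    "smoker": rng.integers(0, 2, size=n),
    "bmi": rng.uniform(18, 40, size=n),
})
high_bmi = (data["bmi"] > 30).astype(int)
cost = (1000 + 200 * data["bmi"]
        + 15000 * data["smoker"] * high_bmi
        + rng.normal(scale=500, size=n))

# Engineered feature: flag for the high-risk combination
engineered = data.assign(smoker_high_bmi=data["smoker"] * high_bmi)

# Mean R^2 over 5 folds, without and with the engineered feature
baseline_r2 = cross_val_score(LinearRegression(), data, cost, cv=5, scoring="r2").mean()
engineered_r2 = cross_val_score(LinearRegression(), engineered, cost, cv=5, scoring="r2").mean()
```

Comparing `baseline_r2` with `engineered_r2` on the real features would show whether the improvement holds across folds and not just on one split.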