In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Abstract
An exploratory data analysis is carried out to find the variables with the greatest impact on insurance premium charges. A linear regression model and a second-order polynomial regression model are then fitted to see how much the fit improves when moving from a simple linear model to a slightly more complex polynomial one. A significant improvement is observed.

Subsequently, Ridge regression is used to investigate higher-order polynomials, with a regularization parameter that penalizes large coefficients to limit overfitting. A 5th-order polynomial model paired with a regularization parameter of 100 gives the best R2 value.

In [None]:
df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
df.head()

In [None]:
#checking for missing values
dfna = df.isna()
for column in dfna.columns.values.tolist():
    print(column)
    print(dfna[column].value_counts(dropna = False))

**From the above step we know that the data contains no missing values**

# **Now we create box plots, a correlation matrix and regression plots to assess which variables have a significant impact on premium cost**

First we investigate the relationship between sex and premium charge.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [None]:
#plotting a boxplot to assess if gender has an impact on insurance cost
sns.boxplot(x = 'sex',y = 'charges', data = df)

**As seen in the above boxplot, although male patients reach slightly higher charges, both sexes have the same median charge and their interquartile ranges largely overlap. This gives a preliminary indication that sex will not have a significant effect on the premium charge.**

To test whether there is a significant difference between male and female premium charges, we use Analysis of Variance (ANOVA) to carry out an F-test on the data.

Analysis of variance compares the mean charge across two or more groups of a categorical variable. The F statistic is the ratio of the variation between the group means to the variation within the groups; a large F-value means the group means differ by more than the spread inside each group would explain. The accompanying p-value is compared with a significance level of 0.05.

**Our null hypothesis is that the mean charge is the same for every category of the variable. If the p-value is lower than the significance level, we reject the null hypothesis and conclude that there is in fact a significant difference in charges between the categories.**
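The F statistic described above can be reproduced by hand. The following is a minimal sketch on synthetic data (the arrays `group_a` and `group_b` are made-up stand-ins, not the insurance data), comparing a manual calculation with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption: NOT the insurance dataset)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=15.0, size=50)
group_b = rng.normal(loc=110.0, scale=15.0, size=60)
groups = [group_a, group_b]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: distance of each group mean from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group mean square) / (within-group mean square)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
# p-value: upper-tail probability of the F distribution with (k-1, n-k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

Both routes should agree up to floating-point error, which makes it easier to see what the F-value in the cells below is actually measuring.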

In [None]:
from scipy import stats
df_ftestsex = df[['sex','charges']].groupby('sex')
df_ftestsex.head()

In [None]:
ftest_val,p_val = stats.f_oneway(df_ftestsex.get_group('female')['charges'],df_ftestsex.get_group('male')['charges'])
print('F-Value : {} , pvalue : {}'.format(ftest_val,p_val))

**According to the above result, the p-value of 0.036 is lower than the significance level of 0.05, so there is a statistically significant difference between the charges incurred by males and females at that level. However, the low F-value shows that the difference between the two group means is small relative to the variation within each group, so sex explains little of the variation in the premium charged. The evidence is also not strong: a p-value of 0.036 would not be significant at a stricter level such as 0.01.**

**Now we shall investigate the relationship between region and premium charged.**
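One common way to express how much of the variation in the response a grouping variable accounts for is eta squared, the between-group sum of squares divided by the total sum of squares. The sketch below uses synthetic data (`group_f` and `group_m` are hypothetical names and values) and is only meant to illustrate that a small p-value and a small effect size can occur together.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption): two large groups with a small mean difference
rng = np.random.default_rng(7)
group_f = rng.normal(loc=12500.0, scale=11000.0, size=660)
group_m = rng.normal(loc=14000.0, scale=13000.0, size=680)
groups = [group_f, group_m]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), all_values.size

ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()

# Eta squared: share of the total variation explained by group membership
eta_sq = ss_between / ss_total

f_val, p_val = stats.f_oneway(*groups)
print('F = {}, p = {}, eta squared = {}'.format(f_val, p_val, eta_sq))
```

Eta squared can also be recovered from the F-value alone as `F*(k-1) / (F*(k-1) + (n-k))`, so a low F-value on a large sample implies a small share of explained variation.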

In [None]:
sns.boxplot(x = 'region', y = 'charges', data = df)

In [None]:
df_ftestregion = df[['region','charges']].groupby('region')
df_ftestregion.head()

In [None]:
ftest_valregion, p_valregion = stats.f_oneway(df_ftestregion.get_group('southeast')['charges'],df_ftestregion.get_group('southwest')['charges'],\
                                             df_ftestregion.get_group('northeast')['charges'],df_ftestregion.get_group('northwest')['charges'])
print('F-Value : {} , pvalue : {}'.format(ftest_valregion,p_valregion))

Here too, the F-value is quite low, although the p-value is lower than the significance level. This suggests that region has only a weak effect on the premium charge. Nevertheless, it is worth investigating whether including both Sex and Region as variables has a significant impact on the linear regression model. **However, the expected observation is that the coefficients assigned to these categorical variables will carry a lower weight.**

**Now we investigate how the premium charge varies depending on whether an individual smokes or not**

In [None]:
sns.boxplot(x = 'smoker', y = 'charges', data = df)

We immediately observe from the above boxplot that there is a large difference between the premium charges incurred by smokers and non-smokers. We expect the F-test to support this observation.

In [None]:
df_ftestsmoker = df[['smoker','charges']].groupby('smoker')
df_ftestsmoker.head()

In [None]:
ftest_valsmoker, p_valsmoker = stats.f_oneway(df_ftestsmoker.get_group('yes')['charges'],df_ftestsmoker.get_group('no')['charges'])
print('F-Value : {} , pvalue : {}'.format(ftest_valsmoker,p_valsmoker))

**The above F-test strongly supports our preliminary observation. A very high F-value coupled with a p-value of almost 0 shows that there is a statistically significant difference between the charges incurred by smokers and non-smokers. Furthermore, there is a clear association between smoking and the premium charged. *Hence, we should include this variable in our model.***

Now we concentrate on the numerical variables. To do that we calculate the correlation between all numerical variables.

In [None]:
#correlation matrix
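#numeric_only = True skips the string columns (sex, smoker, region), which recent pandas versions refuse to correlate
dfcorr = df.corr(numeric_only = True)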
#correlation with premium charges
dfcorrcharges = dfcorr[['charges']]
dfcorrcharges

The correlation coefficients are quite weak for all numerical variables. To investigate further, we compute the Pearson correlation coefficient for each variable together with its p-value.
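For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch on synthetic data (`x_demo` and `y_demo` are hypothetical, not columns of the insurance data) checks the formula against `scipy.stats.pearsonr`:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption): an "age"-like predictor and a noisy linear response
rng = np.random.default_rng(1)
x_demo = rng.uniform(18, 64, size=200)
y_demo = 250.0 * x_demo + rng.normal(0, 10000, size=200)

# Centre both variables, then divide the summed cross-products
# by the square root of the product of the summed squares
x_c = x_demo - x_demo.mean()
y_c = y_demo - y_demo.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_scipy, p_scipy = stats.pearsonr(x_demo, y_demo)
print(r_manual, r_scipy, p_scipy)
```

The p-value returned alongside the coefficient tests the null hypothesis of zero correlation, so with a large sample even a weak coefficient can be statistically significant.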

# Age vs Charges

In [None]:
corrcoeff_age,pval_age = stats.pearsonr(df['age'],df['charges'])
print("The Pearson Correlation Coefficient is", corrcoeff_age, " with a P-value of P =", pval_age)  

The linear relation between Age and Charge incurred is quite weak, but the P-Value is almost 0, hence the correlation is statistically significant. We can draw a regression plot to see if this observation is justified. 

In [None]:
sns.regplot(x = 'age', y = 'charges', data = df)

As expected, the relationship is positive yet weak.

# BMI vs Charges

In [None]:
corrcoeff_bmi,pval_bmi = stats.pearsonr(df['bmi'],df['charges'])
print("The Pearson Correlation Coefficient is", corrcoeff_bmi, " with a P-value of P =", pval_bmi)  

The linear relationship is weak; however, the correlation is statistically significant.

In [None]:
sns.regplot(x = 'bmi', y = 'charges', data = df)

# Children vs Charges

In [None]:
corrcoeff_child,pval_child = stats.pearsonr(df['children'],df['charges'])
print("The Pearson Correlation Coefficient is", corrcoeff_child, " with a P-value of P =", pval_child) 

The linear relationship is extremely weak with a value of 0.068. The P-value only suggests moderate certainty. Hence this variable can be ignored in model development.

In [None]:
sns.regplot(x = 'children', y = 'charges', data = df)

From the above Exploratory Data Analysis, the following variables are chosen to be included in the model for initial investigation:
1. Smoker
2. Age
3. Sex
4. BMI

As Smoker and Sex are categorical variables, we use **one-hot encoding** to convert them into numerical variables.

In [None]:
#smokerdf is the one-hot encoded dataframe
smokerdf = pd.get_dummies(df['smoker']) 
smokerdf.rename(columns = {'no':'non-smoker','yes':'smokes'}, inplace = True)
smokerdf.head()

In [None]:
#we are dropping the smoker column to replace it with the one-hot encoded dataframe "smokerdf"
df = pd.concat([df,smokerdf],axis = 1)
df.drop(['smoker'],axis = 1, inplace = True)
df.head()

In [None]:
sexdf = pd.get_dummies(df['sex'])
sexdf.head()

In [None]:
df = pd.concat([df,sexdf],axis = 1)
df.drop(['sex'],axis = 1, inplace = True)
df.head()
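As an aside, the encoding steps above can be condensed into a single `pd.get_dummies` call. The sketch below uses a small hypothetical frame (`demo`) rather than the real data, and it deliberately differs from the cells above by passing `drop_first=True`, which keeps one indicator column per variable instead of two.

```python
import pandas as pd

# Small hypothetical frame (assumption) reusing the column names of the insurance data
demo = pd.DataFrame({
    'age': [19, 33, 45, 52],
    'sex': ['female', 'male', 'male', 'female'],
    'smoker': ['yes', 'no', 'no', 'yes'],
})

# Encode both categorical columns in one call.
# drop_first=True drops the first level of each variable ('female' and 'no'),
# because that column is fully implied by the remaining one.
encoded = pd.get_dummies(demo, columns=['sex', 'smoker'], drop_first=True, dtype=int)
print(encoded)
```

Keeping both indicator columns, as the notebook does, still works with `LinearRegression`; the single-column form simply avoids carrying redundant features into the polynomial expansion later on.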

In [None]:
#independent variable dataframe 
x = df[['age','bmi','non-smoker','smokes','female','male']]
y = df[['charges']]
x.head()

To understand whether each numerical variable has a linear or non-linear relationship with the premium charged, residual plots are used.

# Age vs. Charge

In [None]:
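#residual plot for a linear relationship
#x and y are passed as keyword arguments, as required by recent seaborn versions
sns.residplot(x = 'age', y = 'charges', data = df)

The above residual plot suggests that there is a linear relationship between Age and Premium Charge, as the residuals are randomly distributed for all values of Age.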

# BMI vs Charge

In [None]:

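sns.residplot(x = 'bmi', y = 'charges', data = df)

The above residual plot shows that the spread of the residuals increases with BMI, which suggests that a straight line does not describe the relationship well and that it may be non-linear. Nevertheless, we shall continue with linear regression for now.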
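To make the idea behind a residual plot concrete, here is a minimal sketch on synthetic data (`x_demo` and `y_demo` are hypothetical, with noise that is constructed to grow with the predictor). It fits a straight line, computes the residuals by hand and compares their spread in the lower and upper half of the predictor range.

```python
import numpy as np

# Synthetic stand-in data (assumption): noise grows in proportion to the predictor
rng = np.random.default_rng(3)
x_demo = rng.uniform(15, 50, size=300)
y_demo = 400.0 * x_demo + rng.normal(0, 1, size=300) * x_demo * 20.0

# Least-squares straight line and its residuals (observed minus fitted)
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
residuals = y_demo - (slope * x_demo + intercept)

# Compare the residual spread below and above the median of the predictor
median_x = np.median(x_demo)
low_half = residuals[x_demo < median_x]
high_half = residuals[x_demo >= median_x]
print(low_half.std(), high_half.std())
```

A clearly larger spread in one half is the numerical counterpart of the fan shape seen in a residual plot.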

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Scaling the x variables
Only Age and BMI have been scaled, as the other variables are either 1s or 0s due to the one-hot encoding which was previously done

In [None]:
SCALE = StandardScaler()
xforscaling = x[['age','bmi']]
SCALE.fit(xforscaling)

In [None]:
print('The mean values of Age and BMI are {} and {} respectively'.format(SCALE.mean_[0],SCALE.mean_[1]))

In [None]:
scaledxdata = pd.DataFrame(SCALE.transform(xforscaling))
scaledxdata.rename(columns = {0:'Age_scaled',1:'BMI_scaled'},inplace = True)
scaledxdata.head()

In [None]:
xtemp = x.drop(['age','bmi'],axis = 1)

In [None]:
xscaleddata = pd.concat([scaledxdata,xtemp],axis = 1)
xscaleddata.head()
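The scaling steps above can also be expressed with scikit-learn's `ColumnTransformer`. The sketch below is an alternative arrangement, not the notebook's method: it uses a hypothetical frame (`demo_x`) and, unlike the cells above, fits the scaler on the training rows only so that no information from the test rows enters the scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (assumption): two numeric columns and one 0/1 indicator
rng = np.random.default_rng(2)
demo_x = pd.DataFrame({
    'age': rng.integers(18, 65, size=100),
    'bmi': rng.normal(30, 6, size=100),
    'smokes': rng.integers(0, 2, size=100),
})
demo_y = rng.normal(13000, 5000, size=100)

train_x, test_x, train_y, test_y = train_test_split(demo_x, demo_y, random_state=0)

# Scale only 'age' and 'bmi'; the indicator column is passed through unchanged
scale_numeric = ColumnTransformer(
    [('scale', StandardScaler(), ['age', 'bmi'])],
    remainder='passthrough',
)
train_scaled = scale_numeric.fit_transform(train_x)  # learn mean/std from the training rows
test_scaled = scale_numeric.transform(test_x)        # reuse the training mean/std
print(train_scaled[:5])
```

The output columns are ordered as scaled `age`, scaled `bmi`, then the untouched indicator.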

**This step is to split the data into a training set and test set**

In [None]:
x_train,x_test,y_train,y_test = train_test_split(xscaleddata,y,random_state = 0)

# Linear regression using a linear model

In [None]:
linreg = LinearRegression()
linreg.fit(x_train,y_train)
print(linreg.coef_,linreg.intercept_)


In [None]:
ypredict_test = linreg.predict(x_test)
ypredict_test[0:10,:]

# R2 Value of the Linear Regression

In [None]:
print(r2_score(y_test,ypredict_test))

# Linear regression using polynomial features

In [None]:
poly = PolynomialFeatures(2, include_bias = False)
xpoly_train = poly.fit_transform(x_train)
xpoly_train = pd.DataFrame(xpoly_train)
xpoly_train.head()

In [None]:
linregpoly = LinearRegression()
linregpoly.fit(xpoly_train,y_train)

In [None]:
#the test set is only transformed; the polynomial features were already fitted on the training set
ypolypredict = linregpoly.predict(poly.transform(x_test))

# R2 Value of the Polynomial Regression

In [None]:
print(r2_score(y_test,ypolypredict))
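The linear versus polynomial comparison can be written more compactly by chaining `PolynomialFeatures` and `LinearRegression` in a pipeline, so the same transformation is applied automatically at prediction time. This is a minimal sketch on synthetic data with a known quadratic structure; all names ending in `_demo` or starting with `xd_`/`yd_` are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption): the response has squared and interaction terms
rng = np.random.default_rng(6)
x_demo = rng.normal(size=(300, 2))
y_demo = (2.0 + x_demo[:, 0] + 3.0 * x_demo[:, 0] ** 2
          + 2.0 * x_demo[:, 0] * x_demo[:, 1] + rng.normal(0, 0.1, size=300))

xd_train, xd_test, yd_train, yd_test = train_test_split(x_demo, y_demo, random_state=0)

# Plain linear model
linear_model = LinearRegression().fit(xd_train, yd_train)

# Second-order polynomial model: feature expansion and regression in one object
poly_model = make_pipeline(PolynomialFeatures(2, include_bias=False), LinearRegression())
poly_model.fit(xd_train, yd_train)

r2_linear = r2_score(yd_test, linear_model.predict(xd_test))
r2_poly = r2_score(yd_test, poly_model.predict(xd_test))
print(r2_linear, r2_poly)
```

Because the pipeline owns the feature expansion, there is no separate transformed test set to keep in sync with the training set.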

# Residual Curves for both Linear Regression and Polynomial Regression

In [None]:
#residual curve for linear regression
ypredict_test = pd.DataFrame(ypredict_test)
ypredict_test.rename(columns = {0:'charges'}, inplace = True)
ypredict_test.head()

In [None]:
y_testresid = y_test.reset_index(drop = True)
y_testresid.head()

In [None]:
linreg_resid = y_testresid - ypredict_test
linreg_resid.reset_index(inplace = True)
linreg_resid.head()

In [None]:
sns.scatterplot(x = 'index', y = 'charges', data = linreg_resid).set_title('Residual Plot')

In [None]:
#residual curve for polynomial regression
ypolypredict_resid = pd.DataFrame(ypolypredict)
ypolypredict_resid.rename(columns = {0:'charges'},inplace = True)
ypolypredict_resid.head()

In [None]:


linregpoly_resid = y_testresid - ypolypredict_resid
linregpoly_resid.reset_index(inplace = True)
linregpoly_resid.head()


In [None]:
sns.scatterplot(x = 'index', y = 'charges', data = linregpoly_resid).set_title('Residual Plot_Polynomial')

# Testing higher-order polynomials along with Ridge regression and cross-validation using GridSearchCV


Higher-order polynomials tend to overfit the training data, which results in high variance. To avoid this overfitting, we want to introduce some bias into our model.
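The effect of the regularization parameter can be seen directly in the size of the fitted coefficients. The following sketch, on synthetic data with hypothetical names, fits a 5th-order Ridge model for several values of `alpha` and records the Euclidean norm of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): few samples relative to the number of polynomial terms
rng = np.random.default_rng(4)
x_demo = rng.normal(size=(60, 2))
y_demo = 3.0 * x_demo[:, 0] - 2.0 * x_demo[:, 1] ** 2 + rng.normal(0, 0.5, size=60)

# 5th-order expansion of two features gives 20 polynomial terms
x_poly = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x_demo)

alphas = [0.01, 1, 100, 10000]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(x_poly, y_demo)
    # A larger alpha penalizes large coefficients more strongly
    coef_norms.append(np.linalg.norm(model.coef_))
    print(alpha, coef_norms[-1])
```

The coefficient norm shrinks as `alpha` grows, which is the bias being traded for lower variance.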

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

In [None]:
xscaleddata.head()

In [None]:
y.shape

In [None]:
polyorders = [2,3,4,5,6,7]
parameters = [{'alpha':[0.001,0.005,0.01,0.05,0.1,0.5,1,5,10,100,1000]}]
results = {}
for order in polyorders:
    #getting the polynomial features object
    poly =  PolynomialFeatures(degree = order, include_bias = False)
    #transforming scaled x data to polynomial format
    xpolydata = poly.fit_transform(xscaleddata)
    #ridge model to introduce the regularization term
    ridgemodel = Ridge()
    gridsearchcv = GridSearchCV(ridgemodel,parameters,cv = 10)
    gridsearchcv.fit(xpolydata,y)
    #putting the results into the 'results' dictionary
    results[order] = [gridsearchcv.best_params_,gridsearchcv.best_score_]
    

In [None]:
results

From the above results, we see that polynomial orders of 3 and 5, paired with regularization parameters of 5 and 100 respectively, result in high R2 values. However, these R2 values are lower than the R2 value of 0.8808592958824164 that we obtained using a polynomial order of 2 ***without cross-validation***. Note that the two numbers are not directly comparable: the cross-validated score is an average over 10 folds, whereas the earlier value comes from a single train/test split. Hence I want to apply the same parameters (alpha = 5 and 100) to polynomial orders 3 and 5 on the same train/test split to observe the resulting R2 values.
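The search over polynomial order and regularization parameter can also be run as one grid search over a pipeline, so that both are tuned jointly and the polynomial expansion is refitted inside every fold. This is a sketch under assumptions: the data are synthetic, and the grid values and step names (`'poly'`, `'ridge'`) are illustrative choices rather than the ones used in this notebook.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): a second-order relationship plus noise
rng = np.random.default_rng(5)
x_demo = rng.normal(size=(200, 2))
y_demo = (1.0 + x_demo[:, 0] ** 2 + x_demo[:, 0] * x_demo[:, 1]
          + rng.normal(0, 0.2, size=200))

pipe = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge()),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': [0.01, 1, 100],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(x_demo, y_demo)
print(search.best_params_, search.best_score_)
```

`search.best_estimator_` then holds the refitted pipeline for the winning combination and can be used directly for prediction.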

# Investigating Polynomial order 3 with Regularization Parameter 5

In [None]:
#this is the same test/train split used with Polynomial order of 2 previously
x_train,x_test,y_train,y_test = train_test_split(xscaleddata,y,random_state = 0)
poly3 = PolynomialFeatures(degree = 3)
x_trainpoly3 = poly3.fit_transform(x_train)
#the test set is only transformed with the features fitted on the training set
x_testpoly3 = poly3.transform(x_test)
ridgepoly3 = Ridge(alpha = 5)
ridgepoly3.fit(x_trainpoly3,y_train)

In [None]:
#predicting the values from the Ridge Regression model for x_testpoly3
y_testpoly3predict = ridgepoly3.predict(x_testpoly3)


In [None]:
#calculating the R2 value on the test set using Ridge Regression model (Polynomial order = 3, Regularization Parameter = 5)
r2_score(y_test,y_testpoly3predict)

# Investigating Polynomial order 5 with Regularization Parameter 100







In [None]:
#5th order polynomial
poly5 = PolynomialFeatures(degree = 5)
x_trainpoly5 = poly5.fit_transform(x_train)
#the test set is only transformed with the features fitted on the training set
x_testpoly5 = poly5.transform(x_test)
ridgepoly5 = Ridge(alpha = 100)
ridgepoly5.fit(x_trainpoly5,y_train)

In [None]:
#predicting the values from the Ridge Regression model for x_testpoly5
y_testpoly5predict = ridgepoly5.predict(x_testpoly5)

In [None]:
#calculating the R2 value on the test set using Ridge Regression model (Polynomial order = 5, Regularization Parameter = 100)
r2_score(y_test,y_testpoly5predict)

# Final Conclusion

A polynomial order of 5 coupled with a regularization parameter of 100 gives an R2 value of 0.886 (approximately 0.89) on the test set. This is slightly better than the R2 value obtained with a polynomial order of 2 and no regularization. Thus our final model will use a 5th-order polynomial with a regularization parameter of 100.

In [None]:
#final model
ridgepoly5