# Medical Insurance Cost
## Using Linear Regression model

### I. Aim:

Our aim here is to predict the medical insurance cost of individuals from the following features:
    <ul>
        <li>age - of the primary beneficiary</li>
        <li>sex - male or female</li>
        <li>bmi - body mass index, a measure of body weight relative to height</li>
        <li>children - number of dependents covered by the insurance</li>
        <li>smoker - yes or no</li>
        <li>region - region in US namely northeast, northwest, southeast and southwest</li>
    </ul>

### II. Model:
To this end we will use a linear regression model and test its suitability in this context.

### III. Imports:

In [None]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as lin
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import metrics

In [None]:
warnings.filterwarnings('ignore')

### IV. Loading the data

In [None]:
MedicalInsurance = pd.read_csv('../input/insurance/insurance.csv')
MedicalInsurance.head()

### V. Splitting the dataset into training and test sets

We will train the model and make all tuning decisions on the training set, which will therefore be split further into training and validation sets down the line. The test set, on the other hand, is meant only for a final verdict on how well our model performs (in terms of confidence intervals, say) in the real world. In a way it is a measure of the user experience once an application has been delivered.

In [None]:
TrainSet, TestSet = train_test_split(MedicalInsurance, test_size=0.25)
print('Size of Training set:', len(TrainSet))
print('Size of Test set:', len(TestSet))

### VI. Are there any null values?

In [None]:
MedicalInsurance.isnull().any()

**So good news, no null entry is there to worry about.**

### VII. A look at the categorical features

The training set will be divided further into training and validation sets. Before that, we look at the categorical variables to identify how many categories/classes each feature has and what the relative frequency of each class is.

In [None]:
# A plot to show the category-wise data distribution for categorical features 
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(14,5))

TrainSet['sex'].value_counts(normalize=True).plot(kind='bar',ax=ax1,title='sex',color=['b','g'],alpha=0.7)
TrainSet['smoker'].value_counts(normalize=True).plot(kind='bar',ax=ax2,title='smoker',color=['brown','violet'],alpha=0.7)
TrainSet['region'].value_counts(normalize=True).plot(kind='bar',ax=ax3,title='region',color=['crimson','cyan','orange','indianred'],alpha=0.7);

**Question: For features whose values are very unevenly distributed, leaving the training-validation split (or the training-test split, for that matter) to random sampling might make the imbalance even more drastic. For example, in the plot above the count for *smoker=yes* is significantly lower than for *smoker=no*. If it is too low, one of the splits (training/validation/test) might end up with no or too few *smoker=yes* records compared to the others. In such scenarios we need to stratify while splitting. We will check later whether such issues have occurred and what effect they have had. If they do occur, this sort of distribution study needs to be done for all features and corrective measures (like stratifying) taken for them. For a large number of features, though, this might be tedious and impractical. This trade-off needs to be discussed in detail later.**
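A minimal sketch of what stratifying on `smoker` could look like, using the `stratify` argument of `train_test_split`. The toy frame, its size and the 20% smoker share are made up for illustration; they are not taken from the insurance data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): 200 records, of which 20% are smokers
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(18, 65, size=200),
    'smoker': ['yes'] * 40 + ['no'] * 160,
})

# stratify keeps the smoker proportion (almost) identical in both splits
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy['smoker'], random_state=0
)

train_share = (toy_train['smoker'] == 'yes').mean()
test_share = (toy_test['smoker'] == 'yes').mean()
print('Smoker share in training split:', train_share)
print('Smoker share in test split:', test_share)
```

Stratifying on one column is cheap; stratifying on several at once would need a combined key, which is where the tedium mentioned above comes in.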


So we will use -
<ul>
    <li>sex - 1-level encoder with male=0, female=1</li>
    <li>smoker - 1-level encoder with no=0, yes=1</li>
    <li>region - 4-level one-hot encoder (or 3-level, <i>because there are no regions outside the 4, so the fourth is implied when the other three are all 0</i>)</li>
    <ol>
        <li>northeast - 1000</li>
        <li>northwest - 0100</li>
        <li>southeast - 0010</li>
        <li>southwest - 0001</li>
    </ol>
    <b>Question: Which is better 4-level or 3-level encoding?</b>
</ul>
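To make the two options concrete, here is a small sketch comparing the 4-level encoding with a 3-level one obtained through the `drop='first'` option of `OneHotEncoder`. The four-row input is a toy example, and dropping the first category is just one possible choice of reference level.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy input (assumption): one row per region
regions = np.array([['northeast'], ['northwest'], ['southeast'], ['southwest']])
cats = [['northeast', 'northwest', 'southeast', 'southwest']]

# 4-level: one column per region
four_level = OneHotEncoder(categories=cats).fit_transform(regions).toarray()

# 3-level: the first category is dropped, so northeast becomes all zeros
three_level = OneHotEncoder(categories=cats, drop='first').fit_transform(regions).toarray()

print(four_level)
print(three_level)
```

Both encodings carry the same information; the 3-level one simply treats the dropped region as the baseline.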

### VIII. Breaking the training set into training and validation sets

In [None]:
FinalTrainSet,ValSet = train_test_split(TrainSet,test_size=0.25)
FinalTrainSet.index = np.arange(len(FinalTrainSet))
ValSet.index =np.arange(len(ValSet))
print('Final Training set size:', len(FinalTrainSet))
print('Validation set size:', len(ValSet))
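# .copy() so that the encoding steps below modify independent frames, not views of the splits
X_train,y_train = FinalTrainSet.iloc[:,:-1].copy(),FinalTrainSet.iloc[:,-1]
X_val,y_val = ValSet.iloc[:,:-1].copy(),ValSet.iloc[:,-1]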

In [None]:
# A plot to show the category-wise data distribution for categorical features 
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(8,5))

FinalTrainSet['smoker'].value_counts(normalize=True).plot(kind='bar',ax=ax1,title='Training set smoker distribution',color=['brown','violet'],alpha=0.7)

ValSet['smoker'].value_counts(normalize=True).plot(kind='bar',ax=ax2,title='Validation set smoker distribution',color=['brown','violet'],alpha=0.7);

So the feature we were worried about has kept a similar distribution in both splits without any intervention, and no further corrective measures are needed.

### IX. Encoding the categoricals

In [None]:
# Encoding the sex and smoker columns
sexSmokerEncoder = preprocessing.OrdinalEncoder(categories=[['male','female'],['no','yes']],dtype=int)
X_train[['sex','smoker']] = sexSmokerEncoder.fit_transform(X_train[['sex','smoker']])
X_train.head()

In [None]:
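# Encoding the region column
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; adjust the argument on newer versions
regionEncoder = preprocessing.OneHotEncoder(categories=[['northeast','northwest','southeast','southwest']],sparse=False,dtype=int)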
X_train[['NE','NW','SE','SW']] = regionEncoder.fit_transform(X_train[['region']])
X_train.drop(columns=['region'],inplace=True)
X_train.head()

In [None]:
# Similar transformations done for validation set as well
X_val[['sex','smoker']] = sexSmokerEncoder.transform(X_val[['sex','smoker']])
X_val[['NE','NW','SE','SW']] = regionEncoder.transform(X_val[['region']])
X_val.drop(columns=['region'],inplace=True)

### X. Fitting the Linear Regression model and measuring its fit on the validation set

In [None]:
linrgr = lin.LinearRegression()
linrgr.fit(X_train,y_train)
df = pd.DataFrame(columns=['Features','Coefficients'])
df['Features']=X_train.columns
df['Coefficients']=linrgr.coef_
df.loc[len(df)]=['intercept',linrgr.intercept_]
df

At this point the coefficients are shown only for reference; they do not tell us much on their own.

In [None]:
trainfitscore = np.around(linrgr.score(X_train,y_train),3)
trainMAE = np.around(metrics.mean_absolute_error(y_true=y_train,y_pred=linrgr.predict(X_train)),3)
valfitscore = np.around(linrgr.score(X_val,y_val),3)
valMAE = np.around(metrics.mean_absolute_error(y_true=y_val,y_pred=linrgr.predict(X_val)),3)

print('Training set R-squared:', trainfitscore)
print('Validation set R-squared:', valfitscore)
print('\n\nTraining set MAE:', trainMAE)
print('Validation set MAE:', valMAE)

NOTE: Every time we do some tuning we will need to retrain the Linear Regression model and evaluate its fit on the training and validation sets, and then see whether the tuning brought any improvement. To streamline this process we define a function that -
<ol>
    <li>will take in the training set and validation set</li>
    <li>run plain linear regression on it</li>
    <li>measure and record the fit to the training and validation set respectively along with the reason for the retraining</li>
</ol>

In [None]:
# The table to record the progression of the model fit for each reason or tuning
modelprogress = pd.DataFrame(columns=['Reason','Training R-squared','Validation R-squared','Training MAE', 'Validation MAE'])

def fitandrecord(X_train,y_train,X_val,y_val,reason,modelprogress,returncoefficients=False):
    linrgr = lin.LinearRegression()
    linrgr.fit(X_train,y_train)
    trainfitscore = np.around(linrgr.score(X_train,y_train),3)
    trainMAE = np.around(metrics.mean_absolute_error(y_true=y_train,y_pred=linrgr.predict(X_train)),3)
    valfitscore = np.around(linrgr.score(X_val,y_val),3)
    valMAE = np.around(metrics.mean_absolute_error(y_true=y_val,y_pred=linrgr.predict(X_val)),3)
    modelprogress.loc[len(modelprogress)] = [reason,trainfitscore,valfitscore,trainMAE,valMAE]
    
    if returncoefficients:
        df = pd.DataFrame(columns=['Features','Coefficients'])
        df['Features']=X_train.columns
        df['Coefficients']=linrgr.coef_
        df.loc[len(df)]=['intercept',linrgr.intercept_]
        return (linrgr,modelprogress,df)
    return linrgr,modelprogress

In [None]:
# Testing the streamlining once

model, modelprogress=fitandrecord(X_train,y_train,X_val,y_val,'Initial',modelprogress)
modelprogress

### XI. Looking for unwanted collinearity

In [None]:
# Displaying the correlation matrix
fig, ax = plt.subplots()
cax = ax.matshow(np.absolute(X_train.corr()), cmap='coolwarm')
fig.colorbar(cax)
ax.set_xticks(np.arange(len(X_train.columns)))
ax.set_xticklabels(X_train.columns, rotation=45, ha='left')
ax.set_yticks(np.arange(len(X_train.columns)))
ax.set_yticklabels(X_train.columns)
ax.set_title('Correlation matrix');

No two features are appreciably correlated with each other. However, collinearity may still exist among three or more features. To detect it we measure the ***VIF (variance inflation factor)*** of each feature, which essentially measures how well one feature can be predicted from the others. It is defined as $\frac{1}{1-R^2_{X_j|X_{-j}}}$, where $R^2_{X_j|X_{-j}}$ is the $R^2$ from the linear regression of $X_j$ onto all the other features. This is from *ISLR by James, Witten, Hastie, Tibshirani*.

In [None]:
def showmulticollinearity(X_train,y_train):
    multicollinearitytable = pd.DataFrame(columns=['Features','VIF'], dtype='float64')
    multicollinearitytable['Features'] = X_train.columns
    for i, feature in enumerate(X_train.columns):
        concernedFeature = X_train[feature]
        remainingFeatures = X_train[X_train.columns.difference([feature])]
        featureRegressor = lin.LinearRegression()
        featureRegressor.fit(remainingFeatures,concernedFeature)
        r_squared_score = featureRegressor.score(remainingFeatures,concernedFeature)
        VIFscore = np.around(1/(1-r_squared_score),3)
        multicollinearitytable.loc[i,'VIF'] = VIFscore
    return multicollinearitytable

showmulticollinearity(X_train,y_train)

**Note: The 4-level encoding of region produces exact collinearity among the four region columns, since they always sum to 1. This makes the individual model parameters ill-determined and we might end up with a bad fit on the test set. We hence remove one of the four columns, say `SW` (southwest), and see whether that improves the fit on the training and validation sets.**
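The exact collinearity can be seen with a few lines of numpy. This is a sketch on made-up region codes, not on the notebook's data: each row of a one-hot block contains exactly one 1, so the four columns add up to the intercept column and the design matrix loses one rank.

```python
import numpy as np

# Toy region codes (assumption): 0=NE, 1=NW, 2=SE, 3=SW
codes = np.array([0, 1, 2, 3, 0, 2, 3, 1])
onehot = np.eye(4, dtype=int)[codes]

# Each row has exactly one 1, so the four columns always sum to 1
row_sums = onehot.sum(axis=1)

# With the intercept column there are 5 columns, but they are linearly dependent
design = np.column_stack([np.ones(len(codes)), onehot])
rank = np.linalg.matrix_rank(design)

print('Row sums:', row_sums)
print('Columns:', design.shape[1], 'Rank:', rank)
```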

In [None]:
X_traincollinearity = X_train.copy()
X_traincollinearity.drop(columns=['SW'],inplace=True)
X_valcollinearity = X_val.copy()
X_valcollinearity.drop(columns=['SW'],inplace=True)

showmulticollinearity(X_traincollinearity,y_train)

So the issue of multicollinearity is solved; how about the fit?

In [None]:
model, modelprogress = fitandrecord(X_traincollinearity,y_train,X_valcollinearity,y_val,'Removing multicollinearity',modelprogress)
modelprogress

**Strangely, the fit does not improve at all. Does multicollinearity not affect the fit?** See my blog <a href="https://einchako.medium.com/does-feature-collinearity-affect-model-fit-4618c5dc79ea"> here </a> where I have investigated this issue further and reached a solution.
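One way to look at the unchanged fit: the redundant dummy column adds nothing to the space spanned by the features, so the least-squares fitted values stay the same and only the individual coefficients become ambiguous. The sketch below illustrates this on synthetic data, with `np.linalg.lstsq` standing in for `LinearRegression`; the data, coefficients and noise level are all assumptions.

```python
import numpy as np

# Synthetic data (assumption): one numeric feature plus a 4-level region
rng = np.random.default_rng(0)
n = 200
codes = rng.integers(0, 4, size=n)
onehot = np.eye(4)[codes]
age = rng.uniform(18, 65, size=n)
y = 250 * age + onehot @ np.array([0.0, 300.0, -200.0, 500.0]) + rng.normal(0, 100, size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, age, onehot])            # intercept + all four dummies
X_reduced = np.column_stack([ones, age, onehot[:, :3]])  # last dummy dropped

def fitted_values(X, y):
    # Least-squares solution; lstsq copes with the rank-deficient X_full
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

same_fit = np.allclose(fitted_values(X_full, y), fitted_values(X_reduced, y))
print('Identical fitted values:', same_fit)
```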

In [None]:
# Finalizing the decision to drop SW column
X_train = X_traincollinearity
X_val = X_valcollinearity

### XII.  Looking for any non-linearity

Our aim here is to look at the residuals (studentized, loosely speaking, by expressing them as a fraction of the **mean absolute error, MAE**) and check them for patterns. Any such pattern points to non-linearity not captured by our model.

In [None]:
y_trainpredict = model.predict(X_train)
trainMAE = modelprogress.loc[len(modelprogress.index)-1,'Training MAE']
trainresiduals = y_train-y_trainpredict
trainStudentizedResiduals = trainresiduals/trainMAE

fig, ax = plt.subplots(figsize = (9,6))
ax.plot(y_train,trainStudentizedResiduals,'.g',label='Original responses')
ax.axhline(y=trainStudentizedResiduals.mean(),ls='--',c='r',label='Mean of studentized residuals')
ax.set_xlabel('Charges')
ax.set_ylabel('Studentized residuals')
ax.set_title('y vs residuals')
ax.legend();

The plot above shows that for lower charges our linear fit systematically overestimates, while for higher charges it underestimates. So we are tempted to introduce some form of second-degree non-linearity here.
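Dividing by the MAE is a simplification. The more usual (internally) studentized residual divides each residual by its estimated standard deviation, $\hat{\sigma}\sqrt{1-h_{ii}}$, where $h_{ii}$ is the leverage of observation $i$. A minimal numpy sketch on synthetic data follows; the data and all variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Synthetic data (assumption): a single feature with a linear signal
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=n)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

# Leverage h_ii: diagonal of the hat matrix, obtained from a thin QR decomposition
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

p = X.shape[1]
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - p))  # residual standard error
studentized = residuals / (sigma_hat * np.sqrt(1 - leverage))

print('Largest absolute studentized residual:', np.abs(studentized).max())
```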

### XIII. Adding non-linearity via polynomial regression

In [None]:
poly = preprocessing.PolynomialFeatures(degree=2)

X_trainpolynomial = poly.fit_transform(X_train)[:,1:]
X_valpolynomial = poly.transform(X_val)[:,1:]

model, modelprogress = fitandrecord(X_trainpolynomial,y_train,X_valpolynomial,y_val,'2-degree polynomial fit',modelprogress)
modelprogress

Plotting the studentized residuals once more

In [None]:
y_trainpredict = model.predict(X_trainpolynomial)
trainMAE = modelprogress.loc[len(modelprogress.index)-1,'Training MAE']
trainresiduals = y_train-y_trainpredict
trainStudentizedResiduals = trainresiduals/trainMAE

y_valpredict = model.predict(X_valpolynomial)
valMAE = modelprogress.loc[len(modelprogress.index)-1,'Validation MAE']
valresiduals = y_val-y_valpredict
valStudentizedResiduals = valresiduals/valMAE

fig, (ax1,ax2) = plt.subplots(2,1,figsize = (9,14))

ax1.plot(y_train,trainStudentizedResiduals,'.g',label='Original responses')
ax1.axhline(y=trainStudentizedResiduals.mean(),ls='--',c='r',label='Mean of studentized residuals')
ax1.set_xlabel('Charges')
ax1.set_ylabel('Studentized residuals')
ax1.set_title('Training set')
ax1.legend()

ax2.plot(y_val,valStudentizedResiduals,'.g',label='Original responses')
ax2.axhline(y=valStudentizedResiduals.mean(),ls='--',c='r',label='Mean of studentized residuals')
ax2.set_xlabel('Charges')
ax2.set_ylabel('Studentized residuals')
ax2.set_title('Validation set')
ax2.legend();

Some non-linearity still seems to be present, but introducing a 3-degree polynomial may turn out to be worse, as the number of features would go up to 164 $\left( {8+3 \choose 3} - 1 \right)$, or 92 $\left( {8 \choose 1} + {8 \choose 2} + {8 \choose 3} \right)$ even if only products of distinct features are kept. Given the amount of training data we have, ~750 records, that is too many features. So we can stop here with Linear regression.
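The feature count for a 3-degree expansion can be checked directly with `PolynomialFeatures` on a placeholder array with 8 columns. This is only a counting sketch; the array of zeros is an assumption used to let the transformer see the input width.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 8
placeholder = np.zeros((1, n_features))  # only the number of columns matters

# All monomials up to degree 3, without the constant term
full = PolynomialFeatures(degree=3, include_bias=False).fit(placeholder)

# Only products of distinct features (no squares or cubes)
interactions = PolynomialFeatures(
    degree=3, include_bias=False, interaction_only=True
).fit(placeholder)

n_full = full.n_output_features_
n_inter = interactions.n_output_features_
print('All terms up to degree 3:', n_full)
print('Interaction terms only:', n_inter)
```

The two counts correspond to ${8+3 \choose 3} - 1$ and ${8 \choose 1} + {8 \choose 2} + {8 \choose 3}$ respectively.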

### XIV. How well does it then fit the final test set?

In [None]:
X_test,y_test = TestSet.iloc[:,:-1].copy(), TestSet.iloc[:,-1]
X_test[['sex','smoker']] = sexSmokerEncoder.transform(X_test[['sex','smoker']])
X_test[['NE','NW','SE','SW']] = regionEncoder.transform(X_test[['region']])
X_test.drop(columns=['region','SW'],inplace=True)

In [None]:
X_test = poly.transform(X_test)[:,1:]

testscore = np.around(model.score(X_test,y_test),3)
testMAE = np.around(metrics.mean_absolute_error(y_true=y_test,y_pred=model.predict(X_test)),3)

print('Test fit:', testscore)
print('Test MAE:', testMAE)

In [None]:
modelprogress

**The differences of 2-3% in the fit scores between the training, validation and test sets, rather than the more desirable ~1%, are possibly due to the small number of samples, although I cannot establish this empirically as yet.**
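One possible way to probe this would be to repeat the random split many times and look at how much the validation $R^2$ moves from split to split. The sketch below does that with `ShuffleSplit` and `cross_val_score` on synthetic data from `make_regression`; the sample size, noise level and number of repetitions are assumptions, so the numbers it prints say nothing about the insurance data itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in (assumption) for a data set of roughly comparable size
X, y = make_regression(n_samples=1000, n_features=8, noise=50.0, random_state=0)

# 30 independent random 75/25 splits
splitter = ShuffleSplit(n_splits=30, test_size=0.25, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=splitter, scoring='r2')

print('Mean R-squared:', np.around(scores.mean(), 3))
print('Std of R-squared across splits:', np.around(scores.std(), 3))
```

Running the same loop on the real data with different sample sizes would show whether the spread shrinks as the number of records grows.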