> *Please provide feedback and suggestions and help me to improve.*
If you can draw any other conclusions from this analysis, please share them in the comments section.
Leave an upvote to encourage me.

# Predicting Medical costs using Linear Regression

## Inspiration
Can we accurately predict insurance costs based on given features?

## About the dataset

Columns:

* age: age of primary beneficiary

* sex: insurance contractor gender, female, male

* bmi: Body mass index, an objective measure of body weight relative to height (kg / m ^ 2), computed as weight divided by height squared. It indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9.

* children: Number of children covered by health insurance / Number of dependents

* smoker: insurance contractor is a smoker or not

* region: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest.

* charges: Individual medical costs billed by health insurance (*Target variable*)

In [None]:
#   importing libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('../input/insurance/insurance.csv')

In [None]:
# Getting essence of our data!
data.info()

There are no missing values in this dataset, which is rare in real-world data.

In [None]:
data.head()

# EDA

In [None]:
fig,axes=plt.subplots(2,1,figsize=(5,7))
# Pass the column as x= explicitly; recent seaborn versions require keyword arguments here
sns.countplot(x=data.region,palette='spring',ax=axes[0])
axes[0].set_title("Region-wise distribution of dataset",fontsize=20)
sns.countplot(x=data.sex,palette='rainbow',ax=axes[1])
axes[1].set_title("Gender-wise distribution of dataset",fontsize=20)
plt.tight_layout();

Our data is quite balanced with respect to sex and region features.

In [None]:
sns.countplot(x=data.smoker,palette='prism')
plt.title("Smokers in our data");

There are far fewer smokers than non-smokers.

In [None]:
sns.jointplot(x=data.bmi,y=data.charges,color='orange');

* There is no clear linear relationship between the target variable 'charges' and the feature 'bmi'.
* The charges distribution is not normal. Linear regression strictly assumes normally distributed errors rather than a normal target, but a strongly skewed target often produces skewed residuals.
* In the BMI histogram, most observations have a BMI of around 25-35.

Let's explore our target variable which is _charges_.

In [None]:
fig,axes = plt.subplots(1,2,figsize=(14,5))
sns.kdeplot(x=data.charges,color='purple',ax=axes[0])
sns.boxenplot(x=data.charges,color='green',ax=axes[1]);

The target variable is not normally distributed. Let's apply a **logarithmic transformation** to reduce the skew.

In [None]:
backup_data = data.copy()
data.charges = np.log(data.charges)

In [None]:
fig,axes = plt.subplots(1,2,figsize=(14,5))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=data.charges,kde=True,color='orange',ax=axes[0])
sns.boxenplot(x=data.charges,color='orange',ax=axes[1]);

Seems better.
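To make the effect of the log transformation concrete, here is a minimal sketch on synthetic data. The sample, the `skewness` helper and the `_demo` names are assumptions made for illustration and are not part of the notebook. It also shows that `np.exp` reverses `np.log`, which is what lets us map predictions back to the original scale later on.

```python
import numpy as np

# Synthetic right-skewed sample (an assumption for illustration; not the insurance data)
rng = np.random.default_rng(0)
charges_demo = rng.lognormal(mean=9.0, sigma=0.9, size=5000)

def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

log_charges_demo = np.log(charges_demo)

print(f"Skewness before log transform: {skewness(charges_demo):.2f}")
print(f"Skewness after log transform:  {skewness(log_charges_demo):.2f}")

# np.exp undoes np.log, so values on the log scale can be mapped back
restored = np.exp(log_charges_demo)
print("Round trip recovers the original values:", np.allclose(restored, charges_demo))
```

A skewness close to 0 after the transformation indicates a roughly symmetric distribution.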

# Linear Regression

> In case you are new to regression, read [this](https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html) article.

Here's a small recap of assumptions of linear regression:
* **Linearity**- This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. A linear relationship means that the change in the response Y for a one-unit change in X1 is constant, regardless of the value of X1. An additive relationship means that the effect of X1 on Y does not depend on the other variables. Polynomial terms (X, X², X³) can be included in the model to capture non-linear effects.
* Error terms must be **normally distributed** with mean 0. If the errors are not normally distributed, a non-linear transformation of the variables (response or predictors) can improve the model.
* **Constant variance** (a.k.a. **homoscedasticity**)- This means that the errors have the same variance regardless of the values of the predictor variables. Look at the residuals vs. fitted values plot to check this assumption. If heteroskedasticity exists, the plot shows a funnel-shaped pattern.
* **No Autocorrelation**- There should be no correlation between the residual (error) terms. Autocorrelation is most likely to occur in time series models. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error. To check this, use the [Durbin-Watson test](https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic). It reports a test statistic between 0 and 4, where 2 means no autocorrelation, values from 0 to below 2 mean positive autocorrelation, and values above 2 up to 4 mean negative autocorrelation. As a rule of thumb, values between 1.5 and 2.5 are relatively normal, and values outside this range could be cause for concern.
* **No Multicollinearity**- The independent variables should not be highly correlated with each other. The variance inflation factor (VIF) detects multicollinearity in regression analysis. The VIF of a predictor is calculated by regressing it against every other predictor in the model. VIF ranges from 1 to infinity. For example, a VIF of 1.9 tells you that the variance of a particular coefficient is 90% bigger than what you would expect if there were no correlation with the other predictors. A rule of thumb for interpreting the VIF: 1 means not correlated, between 1 and 5 means moderately correlated, and above 5 means highly correlated.
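The two diagnostics mentioned in the list, VIF and the Durbin-Watson statistic, can be sketched with NumPy alone. This is a minimal illustration on synthetic data; the helper names `r_squared`, `vif` and `durbin_watson` and the variables `x1`, `x2`, `x3` are assumptions made for this example, not functions from the notebook.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X (intercept added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - residuals.var() / y.var()

def vif(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(factors)

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))

# Synthetic predictors (assumed for illustration only)
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)              # independent of x1
x3 = x1 + 0.3 * rng.normal(size=500)   # strongly correlated with x1
print("VIFs:", vif(np.column_stack([x1, x2, x3])).round(2))

# Uncorrelated errors should give a statistic near 2
independent_errors = rng.normal(size=500)
print("Durbin-Watson:", round(durbin_watson(independent_errors), 2))
```

In this sketch the correlated pair `x1` and `x3` should show VIFs well above 5, while `x2` stays close to 1.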

### Converting categorical columns into numerical ones using dummy variables

In [None]:
data.children.value_counts()

In [None]:
# Group 3, 4 and 5 children into one category and give the rest readable labels.
# Assigning the mapped column back avoids relying on inplace edits of a column accessor.
data['children'] = data['children'].map(
    {0:'Zero',1:'One',2:'Two',3:'More than 2',4:'More than 2',5:'More than 2'})

In [None]:
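# dtype=int yields 0/1 columns (newer pandas defaults to booleans), which statsmodels can use directly
dummies = pd.get_dummies(data[['sex','smoker','region','children']],drop_first=True,dtype=int)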
df_dummies = pd.concat([data,dummies],axis=1)
df_dummies.drop(['sex','smoker', 'region','charges','children'],axis=1,inplace=True)
df_dummies.head(3)

Now, our dataset is ready for regression.

### Splitting data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X=df_dummies
y=data.charges
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)

# Applying regression using the scikit-learn library

In [None]:
# importing 
from sklearn.linear_model import LinearRegression

lm=LinearRegression()
lm.fit(X_train,y_train)
pred_lm = lm.predict(X_test)

# Our predictions
plt.scatter(y_test,pred_lm)

# Perfect predictions
plt.plot(y_test,y_test,'r');

### Evaluating the model

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,explained_variance_score,r2_score
# Format with :.3f instead of .round(3) so this works whether the metric is a NumPy or a plain Python float
mse_lm = mean_squared_error(y_test,pred_lm)
print(f"Mean absolute error (MAE) is: {mean_absolute_error(y_test,pred_lm):.3f}\n\
Mean squared error (MSE) is: {mse_lm:.3f}\n\
Root Mean Squared error (RMSE) is: {np.sqrt(mse_lm):.3f}\n\
Explained Variance Score is: {explained_variance_score(y_test,pred_lm):.3f}\n\
R-squared for transformed target variable is: {r2_score(y_test,pred_lm):.3f}")

In [None]:
# np.exp reverses the log transformation, so this scores the model on the original scale
print(f"R-squared for actual target variable is: {r2_score(np.exp(y_test),np.exp(pred_lm)):.3f}")

> * RMSE is the standard deviation of the residuals, that is, of the variation the model leaves unexplained. It is calculated by taking the square root of the Mean Squared Error. The more closely the data cluster around the regression line, the smaller the residuals and the lower the RMSE, so lower values of RMSE indicate a better fit.

> * R2 Score, also called the Coefficient of Determination, is another metric for evaluating a regression model. It measures goodness of fit as the proportion of variance in the target that the model explains. In general, the higher the R2 Score, the better the model fits the data. Its value usually ranges from 0 to 1, but it becomes negative when the model fits worse than simply predicting the mean.

So, this model explains 79% of the variance of the target variable.
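The two metrics described above can also be computed by hand, which makes the definitions concrete. This is a small sketch with made-up numbers; `y_true`, `y_pred` and `bad_pred` are illustrative values, not results from the insurance model.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)          # mean squared error
rmse = np.sqrt(mse)                    # same units as the target

ss_res = np.sum(residuals ** 2)                    # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R2:   {r2:.3f}")

# Predicting the mean everywhere gives R2 = 0; anything worse goes negative
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = 1.0 - np.sum((y_true - bad_pred) ** 2) / ss_tot
print(f"R2 of a poor model: {r2_bad:.3f}")
```

The last two lines show how R2 turns negative when predictions are worse than the mean of the target.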

## Cross validation

Cross-validation is a vital step in evaluating a model. It makes fuller use of the available data, because every observation is used for training in some folds and for validation in exactly one.
In cross-validation, we split the data into several subgroups (folds). Then we use each fold in turn to evaluate a model fitted on the remaining folds. This gives a more reliable estimate of the model's generalization performance, that is, of how well it performs on unseen data. We can perform cross-validation as follows:

In [None]:
# import the library
from sklearn.model_selection import cross_val_score

# Compute 4-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(lm, X, y, cv=4)

print("Average 4-Fold CV Score: {}".format(np.mean(cv_scores).round(4)))
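To show what happens inside a k-fold procedure, here is a minimal NumPy sketch of the same idea on synthetic data. The function `k_fold_r2` and the `X_demo`/`y_demo` arrays are assumptions made for illustration. Unlike the call above, this sketch shuffles the rows before cutting them into folds, so it is not an exact reimplementation of `cross_val_score`.

```python
import numpy as np

def k_fold_r2(X, y, k=4, seed=0):
    """Return the R-squared on each held-out fold of a k-fold split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle once, then cut into k folds
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit ordinary least squares (with intercept) on the k-1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the fold that was held out
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        pred = A_test @ coef
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic data, an assumption for illustration only
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

scores = k_fold_r2(X_demo, y_demo, k=4)
print("Fold scores:", scores.round(3))
print("Average 4-fold score:", scores.mean().round(3))
```

Each observation lands in exactly one held-out fold, so the average of the fold scores uses all of the data for evaluation.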

# Applying regression using the statsmodels library

In [None]:
import statsmodels.api as sm
X1 = sm.add_constant(X)
results = sm.OLS(y,X1).fit()
results.summary()

**CONCLUSION :**

* R-squared (0.769) implies that our regression line explains about 77% of the variation in y.
* A predictor that has a low p-value is likely to be a meaningful addition to our model, because changes in the predictor's value are related to changes in the response variable. Looking at the p-values in the table, we can conclude that all our variables are significant except 'children_Two', which is in fact surprising.
* The Durbin-Watson statistic is close to 2, which suggests negligible autocorrelation, so the no-autocorrelation assumption is satisfied.
* Prob(Omnibus): One of the assumptions of OLS is that the errors are normally distributed, and the Omnibus test checks this. The null hypothesis is that the errors are normally distributed, so Prob(Omnibus) should be well above the chosen significance level (ideally close to 1) for the assumption to hold.
* In this case Prob(Omnibus) is 0, which implies that the normality assumption is not satisfied. The coefficient estimates themselves do not depend on normal errors (the Best Linear Unbiased Estimator (BLUE) property requires only the other assumptions), but the p-values and confidence intervals that rely on normality should be interpreted with caution.
* This looks like a case where we need methods that can model non-linear relationships. The variables may also need further transformation to satisfy the normality assumption.

# Polynomial Regression

The implementation of polynomial regression is a two-step process. First, we expand our features into polynomial terms using the PolynomialFeatures class from sklearn, and then we use linear regression to fit the parameters. As the degree of the polynomial increases, so does the complexity of the model, so the degree must be chosen carefully. If it is too low, the model won't be able to fit the data properly; if it is too high, the model will easily overfit.
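As a rough illustration of the first step, the sketch below expands features into polynomial terms by hand. The function `polynomial_features` is a hypothetical helper written for this example; it mirrors the idea behind sklearn's PolynomialFeatures but is not its implementation.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree=2):
    """Expand columns into all monomials up to `degree`, including the constant term."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    columns = [np.ones(n_samples)]                      # degree 0: the bias column
    for d in range(1, degree + 1):
        # Each combination of column indices (with repeats) is one monomial of degree d
        for combo in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(columns)

# One row with two features a=2 and b=3
X_small = np.array([[2.0, 3.0]])
print(polynomial_features(X_small, degree=2))        # columns: 1, a, b, a^2, a*b, b^2
print(polynomial_features(X_small, degree=3).shape)  # column count grows quickly with degree
```

Two features become six columns at degree 2 and ten at degree 3, which is why a high degree makes the model complex enough to overfit.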

In [None]:
# Fitting Polynomial Regression to the dataset 
from sklearn.preprocessing import PolynomialFeatures 
  
poly = PolynomialFeatures(degree = 2) 
# Fit the feature expansion on the training set only, then reuse it for the test set
X_poly = poly.fit_transform(X_train)
lin2 = LinearRegression() 
lin2.fit(X_poly, y_train)
poly_pred=lin2.predict(poly.transform(X_test))
# Visualising the Polynomial Regression results 
# Our predictions
plt.scatter(y_test,poly_pred)
# Perfect predictions
plt.plot(y_test,y_test,'r');

In [None]:
# Format with :.3f instead of .round(3) so this works whether the metric is a NumPy or a plain Python float
mse_poly = mean_squared_error(y_test,poly_pred)
print(f"Mean absolute error (MAE) is: {mean_absolute_error(y_test,poly_pred):.3f}\n\
Mean squared error (MSE) is: {mse_poly:.3f}\n\
Root Mean Squared error (RMSE) is: {np.sqrt(mse_poly):.3f}\n\
Explained Variance Score is: {explained_variance_score(y_test,poly_pred):.3f}\n\
R-squared for transformed target variable is: {r2_score(y_test,poly_pred):.3f}")

In [None]:
print(f"R-squared for actual target variable is: {r2_score(np.exp(y_test),np.exp(poly_pred)):.3f}")

So, RMSE decreased from 0.417 to 0.345 and R-squared increased to 85%.