# <center>Medical Cost Personal Datasets</center>
## <center>Insurance Forecast by using Linear Regression</center>
### Context
Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain but needed some cleaning up and recoding to match the format in the book.

### Content
#### Columns

1. age: age of primary beneficiary

2. sex: gender of the insurance contractor (female, male)

3. bmi: Body mass index, an objective index of body weight relative to height, computed as weight divided by height squared $(kg / m ^ 2)$. It indicates whether a weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9

4. children: Number of children covered by health insurance / Number of dependents

5. smoker: whether the beneficiary smokes (yes, no)

6. region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

7. charges: Individual medical costs billed by health insurance
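
The `bmi` column follows the usual definition: weight in kilograms divided by the square of height in metres. A minimal sketch of that calculation is shown here; the function name `body_mass_index` and the sample values are illustrative assumptions and are not part of the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2: weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = body_mass_index(70, 1.75)
print(round(example_bmi, 2))
# Check against the 18.5 to 24.9 range quoted in the column description.
print(18.5 <= example_bmi <= 24.9)
```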

### Acknowledgements
The dataset is available on GitHub <a href="https://github.com/stedy/Machine-Learning-with-R-datasets">here</a>.

### Inspiration
Can you accurately predict insurance costs?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import scipy as sp
import scipy.stats  # ensures sp.stats is loaded for the probability plot

from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

%matplotlib inline
plt.style.use('ggplot')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
df.head()

### Null values

In [None]:
df.isna().sum()

There are no null values.

__More about the dataframe__

In [None]:
df.info()

In [None]:
df.describe().transpose()

__The charges column appears to be right-skewed, with a few high outliers: the median (50%) is \$9382.033, but the mean is \$13270.42.__
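
A mean well above the median is the typical signature of a right-skewed column. The sketch below illustrates this on synthetic, log-normally distributed values (an assumption made so that the block runs without the Kaggle file; `charges_like` is a hypothetical stand-in for `df['charges']`) and counts the points flagged by the common 1.5 × IQR rule.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for df['charges'] (assumption).
rng = np.random.default_rng(0)
charges_like = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1000))

mean_value = charges_like.mean()
median_value = charges_like.median()
skewness = charges_like.skew()  # positive for a right-skewed distribution

# 1.5 * IQR rule: values above Q3 + 1.5 * IQR are flagged as high outliers.
q1, q3 = charges_like.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_high_outliers = int((charges_like > upper_fence).sum())

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}")
print(f"skewness = {skewness:.2f}, high outliers = {n_high_outliers}")
```

On the notebook's dataframe, the same calls could be applied to `df['charges']` directly.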

In [None]:
df['age'].describe()

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.countplot(
    data=df, 
    x = 'age',
    palette="Paired"
)
fig.set_title('Distribution of Age')
plt.show()

We see a spike in the counts for ages 18 and 19. This is probably because of some deals of the ACME foundation.

## Age vs Charges

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.scatterplot(
    data=df, 
    x = 'age',
    y = 'charges',
)
fig.set_title('Age vs Charges')
plt.show()

## Age vs Charges - w.r.t. region

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.scatterplot(
    data=df, 
    x = 'age',
    y = 'charges',
    hue = 'region',
)
fig.set_title('Age vs Charges - hue = region')
plt.legend(bbox_to_anchor=(1.15, 1), title='Region', fontsize=13)
plt.show()

## Age vs Charges - w.r.t. sex

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.scatterplot(
    data=df, 
    x = 'age',
    y = 'charges',
    hue = 'sex',
)
fig.set_title('Age vs Charges - hue = sex')
plt.legend(bbox_to_anchor=(1.15, 1), title='Sex', fontsize=13)
plt.show()

## Age vs Charges - w.r.t. smoker

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.scatterplot(
    data=df, 
    x = 'age',
    y = 'charges',
    hue = 'smoker',
)
fig.set_title('Age vs Charges - hue = smoker')
plt.legend(bbox_to_anchor=(1.15, 1), title='smoker', fontsize=13)
plt.show()

As expected, non-smokers are more likely to have lower annual medical charges. The non-smokers with higher charges may have had accidents or genetic disorders, although the data cannot confirm this.
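
One way to check the visual impression from the scatter plot is to compare summary statistics per group. The sketch below does this on a small hand-made table; the values are made up for illustration, and only the column names `smoker` and `charges` come from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [32000.0, 4200.0, 8100.0, 41000.0, 12500.0, 18000.0],
})

# The median is less sensitive than the mean to a few very large bills.
summary = toy.groupby("smoker")["charges"].agg(["count", "median", "mean"])
print(summary)
```

The same `groupby(...).agg(...)` call should work on the notebook's `df` while the `smoker` column still holds its original labels.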

# BMI

In [None]:
df['bmi'].describe()

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.histplot(
    data = df, 
    x = 'bmi'
)
fig.set_title('Distribution of BMI')
plt.show()

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.histplot(
    data = df, 
    x = 'bmi',
    hue = 'smoker'
)
fig.set_title('Distribution of BMI - hue = Smoker')
plt.show()

# Charges

In [None]:
df['charges'].describe()

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.histplot(
    data = df, 
    x = 'charges',
)
fig.set_title('Distribution of Charges')
plt.show()

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.histplot(
    data = df, 
    x = 'charges',
    hue = 'smoker'
)
fig.set_title('Distribution of Charges - hue = smoker')
plt.show()

### Box plot of Smoker vs Charges

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.boxplot(
    data = df, 
    x = 'smoker',
    y = 'charges'
)
fig.set_title('Smoker vs Charges')
plt.show()

### Box plot of Children vs Charges

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.boxplot(
    data = df, 
    x = 'children',
    y = 'charges'
)
fig.set_title('Children vs Charges')
plt.show()

In [None]:
df.groupby('children').describe().transpose().loc['charges']

# Smoker

In [None]:
plt.figure(figsize=(15, 6), dpi=80)
fig = sns.countplot(
    data = df, 
    x = 'smoker',
    hue = 'sex'
)
plt.yticks(ticks=range(0, 650, 50))
fig.set_title('Smoker vs Count vs sex')
plt.show()

# HeatMap

In [None]:
plt.figure(figsize=(6, 6), dpi=80)
fig = sns.heatmap(
    data = df.corr(numeric_only=True),  # skip the string columns (needs pandas >= 1.5)
    annot=True,
    cmap='viridis'
)
plt.show()

# Regression
## Preprocessing
### Converting categorical data to numerical data (one-hot encoding)

In [None]:
enc = OneHotEncoder().fit(df[['sex', 'smoker', 'region']])
enc.categories_

In [None]:
factors = enc.transform(df[['sex', 'smoker', 'region']]).toarray()
factors

In [None]:
df[['female', 'male', 'non-smoker', 'smoker', 'northeast', 'northwest', 'southeast', 'southwest']] = factors

In [None]:
df

In [None]:
df.drop(['region', 'sex'], inplace=True, axis=1)

In [None]:
df

### train test split

In [None]:
X = df.drop('charges', axis=1)
y = df['charges']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

In [None]:
model_name = []
model_rmse = []

## LINEAR REGRESSION

In [None]:
model = LinearRegression().fit(X_train, y_train)

In [None]:
predicted = model.predict(X_test)

__Plot of the residuals__

In [None]:
sns.scatterplot(
    y = predicted - y_test,
    x = range(len(y_test))
)

In [None]:
fig, ax = plt.subplots(figsize=(6,8),dpi=100)
# Q-Q plot of the residuals (not the raw predictions) against a normal distribution
_ = sp.stats.probplot(predicted - y_test, plot=ax)

The residuals show a large spread, so a simple linear regression is not good enough. We will therefore consider a regression of higher degree, which may also benefit from interaction terms.

In [None]:
pd.DataFrame({
    'terms': X.columns,
    'weights': model.coef_
})

In [None]:
print(f'Mean Absolute Error = {mean_absolute_error(y_test, predicted)}')
print(f'Mean Squared Error = {mean_squared_error(y_test, predicted)}')
print(f'Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test, predicted))}')
model_name.append('Linear Regression')
model_rmse.append(np.sqrt(mean_squared_error(y_test, predicted)))
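
For reference, the three reported metrics can be computed by hand. The sketch below uses three made-up target values and predictions (not results from this notebook) and checks the hand-written formulas against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_hat - y_true
mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # square root puts the error back in target units

# The hand-written values agree with scikit-learn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
print(mae, mse, rmse)
```

Because RMSE is expressed in the same units as `charges`, it is the value collected in `model_rmse` for the final comparison.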

## Choosing degree of Regression

In [None]:
train_RMSE = []
test_RMSE = []
degree = []

for d in range(1, 5):
    degree.append(d)
    
    poly_conv = PolynomialFeatures(degree=d, include_bias=False)
    features = poly_conv.fit_transform(X)
    
    X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.30, random_state=42)
    
    model = LinearRegression().fit(X_train, y_train)
    
    # test data
    pred = model.predict(X_test)
    test_RMSE.append(np.sqrt(mean_squared_error(y_test, pred)))
    
    # train data
    pred = model.predict(X_train)
    train_RMSE.append(np.sqrt(mean_squared_error(y_train, pred)))

for i in range(len(degree)):
    print(f"Degree: {degree[i]}", end=' => ')
    print(f"trainRMSE: {train_RMSE[i]}", end=', ')
    print(f"testRMSE: {test_RMSE[i]}")

plt.plot(degree,train_RMSE,label='TRAIN')
plt.plot(degree,test_RMSE,label='TEST')
plt.xlabel("Polynomial Complexity")
plt.ylabel("RMSE")
plt.legend()  # show the TRAIN/TEST labels set above
plt.show()

There is clear overfitting at higher degrees, so the best fit is most likely degree 2.
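
A single train/test split can be noisy, so a hedged alternative is to score each degree with k-fold cross-validation. The sketch below does this on synthetic quadratic data (an assumption, so that the block runs on its own); the helper `cv_rmse` and the pipeline layout are illustrative choices, not the method used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a true quadratic relationship (assumption).
rng = np.random.default_rng(42)
X_syn = rng.uniform(-3, 3, size=(200, 1))
y_syn = 1 + 2 * X_syn[:, 0] + 3 * X_syn[:, 0] ** 2 + rng.normal(0, 1, size=200)


def cv_rmse(degree: int) -> float:
    """Mean 5-fold cross-validated RMSE of a polynomial regression."""
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(
        pipe, X_syn, y_syn, cv=5, scoring="neg_root_mean_squared_error"
    )
    return float(-scores.mean())


rmse_by_degree = {d: cv_rmse(d) for d in range(1, 5)}
for d, score in rmse_by_degree.items():
    print(f"degree {d}: CV RMSE = {score:.3f}")
```

Putting `PolynomialFeatures` inside the pipeline means the feature expansion is refitted within each fold, which keeps the evaluation self-contained.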

## Degree 2 Polynomial Regression

In [None]:
features = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.30, random_state=42)

In [None]:
model = LinearRegression().fit(X_train, y_train)

In [None]:
predicted = model.predict(X_test)

In [None]:
print(f'Mean Absolute Error = {mean_absolute_error(y_test, predicted)}')
print(f'Mean Squared Error = {mean_squared_error(y_test, predicted)}')
print(f'Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test, predicted))}')
model_name.append('Polynomial Regression')
model_rmse.append(np.sqrt(mean_squared_error(y_test, predicted)))

## Regularization
### Scaling the data

In [None]:
X_train[0]

In [None]:
scaler = StandardScaler().fit(X_train)

In [None]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
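
The scaler is fitted on the training split only and then applied to both splits, which keeps test-set statistics out of the model. The sketch below shows on made-up numbers that `StandardScaler` is equivalent to subtracting the training mean and dividing by the training standard deviation; the arrays `train` and `test` and the name `toy_scaler` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices for illustration.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

toy_scaler = StandardScaler().fit(train)  # statistics come from train only
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)

# Manual equivalent: (x - train mean) / train std (population std, ddof=0).
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled)
print(np.allclose(test_scaled, manual_test))
```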

In [None]:
X_train[0]

## Ridge Regression - L2 Regularization

In [None]:
ridge_ = RidgeCV(
    alphas=(0.1, 1.0, 10.0),
    cv=None, # USES EFFICIENT LEAVE ONE OUT CV
    scoring='neg_mean_squared_error'    
)

In [None]:
ridge_.fit(X_train, y_train)

In [None]:
ridge_.alpha_

In [None]:
ridge_.coef_
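
As background on what the L2 penalty does to the coefficients: for centred data, ridge regression has the closed-form solution $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below checks this against scikit-learn's `Ridge` on synthetic data. The data, `alpha_demo`, and the variable names are assumptions for illustration, and this is not how `RidgeCV` selects `alpha_`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=50)

alpha_demo = 10.0

# Centre X and y so the unpenalised intercept drops out of the problem.
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

# Closed form: solve (Xc^T Xc + alpha * I) w = Xc^T yc.
w_closed = np.linalg.solve(Xc.T @ Xc + alpha_demo * np.eye(3), Xc.T @ yc)

w_sklearn = Ridge(alpha=alpha_demo).fit(X_demo, y_demo).coef_
print(w_closed)
print(w_sklearn)
```

The `alpha * I` term added to the diagonal is what shrinks the coefficients; a larger `alpha` pulls them closer to zero.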

In [None]:
predicted = ridge_.predict(X_test)

In [None]:
print(f'Mean Absolute Error = {mean_absolute_error(y_test, predicted)}')
print(f'Mean Squared Error = {mean_squared_error(y_test, predicted)}')
print(f'Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test, predicted))}')
model_name.append('Ridge Regression')
model_rmse.append(np.sqrt(mean_squared_error(y_test, predicted)))

## LASSO Regression - L1 Regularization

In [None]:
lasso_ = LassoCV(
    eps=0.1,
    n_alphas=100,
    cv=None
)

In [None]:
lasso_.fit(X_train, y_train)

In [None]:
predicted = lasso_.predict(X_test)

In [None]:
lasso_.coef_

In [None]:
print(f'Mean Absolute Error = {mean_absolute_error(y_test, predicted)}')
print(f'Mean Squared Error = {mean_squared_error(y_test, predicted)}')
print(f'Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test, predicted))}')
model_name.append('LASSO Regression')
model_rmse.append(np.sqrt(mean_squared_error(y_test, predicted)))

The RMSE seems to have increased here, but the result is impressive considering that only 3 terms are used.
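
The small number of remaining terms comes from the L1 penalty, which can set coefficients exactly to zero. The sketch below illustrates how the non-zero terms can be counted; it runs on synthetic data in which only the first three of twenty features matter (an assumption for illustration), and `lasso_demo` is a hypothetical name unrelated to `lasso_` above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: 20 features, only the first 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5.0, -4.0, 3.0]
y_demo = X_demo @ true_coef + rng.normal(0, 1.0, size=200)

lasso_demo = LassoCV(eps=0.1, cv=5).fit(X_demo, y_demo)

# Indices of the coefficients that the L1 penalty did not set to zero.
kept = np.flatnonzero(lasso_demo.coef_)
print(f"alpha = {lasso_demo.alpha_:.4f}")
print(f"non-zero coefficients: {kept.size} of {lasso_demo.coef_.size}")
print(f"indices kept: {kept.tolist()}")
```

Applied to the notebook's model, `np.flatnonzero(lasso_.coef_)` would give the positions of the surviving polynomial terms.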

## Elastic Net

In [None]:
elastic_ = ElasticNetCV(
    l1_ratio=[.1, .5, .7, .9, .95, .99, 1], 
    eps=0.1,
    n_alphas=100, 
    cv=None
)

In [None]:
elastic_.fit(X_train, y_train)

In [None]:
elastic_.l1_ratio_
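
For context on `l1_ratio_`: according to the scikit-learn documentation, `ElasticNet` minimises the objective below, where $n$ is the number of samples, $w$ the coefficient vector, $\alpha$ the overall penalty strength, and $\rho$ stands for `l1_ratio`.

$$
\frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 term, which is the LASSO penalty; values closer to 0 shift the weight towards the L2 (ridge-style) term.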

In [None]:
predicted = elastic_.predict(X_test)

In [None]:
elastic_.alpha_

In [None]:
lasso_.alpha_

In [None]:
ridge_.alpha_

In [None]:
print(f'Mean Absolute Error = {mean_absolute_error(y_test, predicted)}')
print(f'Mean Squared Error = {mean_squared_error(y_test, predicted)}')
print(f'Root Mean Squared Error = {np.sqrt(mean_squared_error(y_test, predicted))}')
model_name.append('Elastic Net')
model_rmse.append(np.sqrt(mean_squared_error(y_test, predicted)))

## Comparisons of All Models

In [None]:
model_name

In [None]:
model_rmse

In [None]:
# Collect the scores in a dataframe (named `results` so the fitted `model` is not overwritten)
results = pd.DataFrame({
    'model': model_name,
    'rmse': model_rmse
})
results

In [None]:
plt.figure(figsize=(10, 6))
fig = sns.barplot(
    x = model_name,
    y = model_rmse
)
fig.set_title('Performance of Different Regression Models')
fig.set_ylabel('RMSE')
fig.set_xlabel('Models')
plt.show()

The performance of Polynomial Regression and Ridge Regression is nearly identical: Ridge Regression gives an RMSE of 4519.750225 and Polynomial Regression gives 4520.906559.

__Do review the work :)__