# Health Insurance Cost (Insurance Forecast Using Linear Regression)

## Inspiration

**Can you accurately predict insurance costs?**

[Data Source from Kaggle](https://www.kaggle.com/mirichoi0218/insurance)

- The aim is to predict the health insurance cost incurred by individuals based on their age, gender, BMI, number of children, smoking habit, and geographic location.

- Features available are:

    - age: age of primary beneficiary

    - sex: gender of the insurance contractor (female or male)

    - bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m^2); ideally 18.5 to 24.9

    - children: Number of children covered by health insurance / Number of dependents

    - smoker: whether the beneficiary smokes (yes or no)

    - region: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest

    - charges: Individual medical costs billed by health insurance
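
As a worked illustration of the `bmi` feature, the sketch below computes the index from weight and height. `bmi` and `in_ideal_range` are hypothetical helper names that do not appear in the dataset or the notebook, and the 18.5 to 24.9 range is simply the one quoted in the feature list above.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI falls inside the range quoted in the feature list."""
    return low <= value <= high


# Hypothetical person: 70 kg, 1.75 m tall
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2), in_ideal_range(example_bmi))
```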


[Dataset download from Github](https://github.com/stedy/Machine-Learning-with-R-datasets) 


## 1) IMPORT LIBRARIES AND DATASETS

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# read the csv file 
insurance_df = pd.read_csv('insurance.csv')

In [None]:
insurance_df.head()

In [None]:
insurance_df.tail()

In [None]:
insurance_df.shape

## 2) EXPLORATORY DATA ANALYSIS (EDA)

In [None]:
# check if there are any Null values
insurance_df.isnull().sum()

In [None]:
# Check the dataframe info
insurance_df.info()

In [None]:
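# Group by region to check for any relationship between region and charges
# The southeast region has the highest mean charges and body mass index
# numeric_only=True skips the text columns ('sex', 'smoker'), which pandas >= 2.0 refuses to average
df_region = insurance_df.groupby(by='region').mean(numeric_only=True)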
df_region

In [None]:
# Check unique values in the 'sex' column
insurance_df['sex'].unique()

In [None]:
# convert categorical variable to numerical
insurance_df['sex'] = insurance_df['sex'].apply(lambda x: 0 if x == 'female' else 1)

In [None]:
insurance_df.head()

In [None]:
# Check the unique values in the 'smoker' column
insurance_df['smoker'].unique()

In [None]:
# Convert categorical variable to numerical 
insurance_df['smoker'] = insurance_df['smoker'].apply(lambda x: 0 if x == 'no' else 1)

In [None]:
insurance_df.head()

In [None]:
# Check unique values in 'region' column
insurance_df['region'].unique()

In [None]:
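# One-hot encode 'region'; drop_first = True omits the first category to avoid a redundant column
# dtype = int keeps the columns as 0/1 (pandas >= 2.0 returns booleans by default)
region_dummies = pd.get_dummies(insurance_df['region'], drop_first = True, dtype = int)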

In [None]:
region_dummies.head()

In [None]:
insurance_df = pd.concat([insurance_df, region_dummies], axis = 1)

In [None]:
insurance_df.head()

In [None]:
# Let's drop the original 'region' column 
insurance_df.drop(['region'], axis = 1, inplace = True)

In [None]:
insurance_df.head()
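
To make the effect of `drop_first = True` concrete, here is a plain-Python sketch of the same idea, not the pandas implementation. `one_hot_drop_first` is a hypothetical helper; it assumes the categories are ordered alphabetically, so `northeast` is the one left out and is represented by zeros in all three remaining columns.

```python
def one_hot_drop_first(values):
    """Encode categories as 0/1 columns, omitting the alphabetically first category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-zeros baseline
    return [{cat: int(v == cat) for cat in kept} for v in values]


regions = ["southwest", "southeast", "northwest", "northeast"]
encoded = one_hot_drop_first(regions)
for region, row in zip(regions, encoded):
    print(region, row)
```

Dropping one column loses no information: a row with zeros everywhere can only belong to the omitted region.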

In [None]:
insurance_df.describe()

## 3) VISUALIZATION

In [None]:
# Check Distributions
insurance_df[['age', 'sex', 'bmi', 'children', 'smoker', 'charges']].hist(bins = 30, figsize = (20,20), color = 'r');

In [None]:
# plot pairplot
sns.pairplot(insurance_df)

In [None]:
# Regression Plot (No Machine Learning)
sns.regplot(x = 'age', y = 'charges', data = insurance_df)
plt.show()

In [None]:
# Regression Plot (No Machine Learning)
sns.regplot(x = 'bmi', y = 'charges', data = insurance_df)
plt.show()

In [None]:
# Check Correlation
corr = insurance_df.corr()

In [None]:
# Heatmap for Correlation
plt.figure(figsize=(10,10))
sns.heatmap(corr,annot=True)
plt.show()

## 4) CREATE TRAINING AND TESTING DATASET

In [None]:
# Print Columns Names
insurance_df.columns

In [None]:
X = insurance_df.drop(columns =['charges'])
y = insurance_df['charges']

In [None]:
# Check X
X.head()

In [None]:
# Check y
y.head()

In [None]:
# Check Shape
X.shape

In [None]:
# Check Shape
y.shape

In [None]:
# Casting to NP Arrays
X = np.array(X)
y = np.array(y)

In [None]:
# Reshaping of y
y = y.reshape(-1,1)

In [None]:
y.shape

In [None]:
# Scale the numerical data before feeding it to the model
from sklearn.preprocessing import MinMaxScaler

scaler_x = MinMaxScaler()
X = scaler_x.fit_transform(X)

scaler_y = MinMaxScaler()
y = scaler_y.fit_transform(y)

In [None]:
X

In [None]:
y

In [None]:
# Split the data into 20% Testing and 80% Training
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=72)
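
The cells above fit the scalers on the full dataset and then split. A minimal sketch of min-max scaling itself, (x - min) / (max - min), is shown below in plain Python, together with a variant that learns the minimum and maximum from the training rows only, so that test rows do not influence the scaling. The helper names and the sample charges are hypothetical and chosen only for illustration.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values so that lo -> 0.0 and hi -> 1.0."""
    span = hi - lo
    if span == 0:
        raise ValueError("cannot scale a constant column")
    return [(v - lo) / span for v in values]


def unscale(scaled, lo, hi):
    """Inverse of scale(): recover values in the original units."""
    span = hi - lo
    return [s * span + lo for s in scaled]


# Hypothetical charges, already split into training and test rows
train_charges = [1000.0, 3000.0, 5000.0]
test_charges = [2000.0, 6000.0]

lo, hi = fit_min_max(train_charges)          # learned from training rows only
train_scaled = scale(train_charges, lo, hi)
test_scaled = scale(test_charges, lo, hi)    # test values may fall outside [0, 1]
restored = unscale(test_scaled, lo, hi)      # back to the original units
print(train_scaled, test_scaled, restored)
```

The `unscale` step plays the same role as `inverse_transform`: it maps scaled targets back to the original units of `charges` before metrics are reported.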

In [None]:
# Shape Checking
X_train.shape

In [None]:
# Shape Checking
X_test.shape

In [None]:
# Training rows + testing rows should add up to the full dataset
1070+268

## 5) TRAIN AND TEST A LINEAR REGRESSION MODEL IN SK-LEARN

In [None]:
# Using Linear Regression Model
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
# Get the predictions
y_predict = regressor.predict(X_test)

In [None]:
y_predict.shape

In [None]:
# Get the Values "before" scaling
y_predict_orig = scaler_y.inverse_transform(y_predict)
y_test_orig = scaler_y.inverse_transform(y_test)

In [None]:
# Number of Features and Cases
k = X_test.shape[1] # Number of Features
n = len(X_test) # Number of Cases
print("Features:",k)
print("Cases:",n)

In [None]:
# Metrics Calculation

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RMSE = float(format(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig)),'.3f'))
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE=mean_absolute_error(y_test_orig, y_predict_orig)
r2=r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1 - (1 - r2) * (n -1) / (n - k -1)

In [None]:
# Evaluation Results Printing
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 
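
For reference, the metrics above can be written out by hand. The sketch below is a plain-Python illustration of the same formulas, including the adjusted R2 expression `1 - (1 - r2) * (n - 1) / (n - k - 1)` used in the notebook. `regression_metrics` is a hypothetical helper and the sample values are made up.

```python
import math


def regression_metrics(y_true, y_pred, k):
    """Compute RMSE, MSE, MAE, R2 and adjusted R2 for k features."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises R2 for the number of features k
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return {
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": r2,
        "Adjusted R2": adj_r2,
    }


# Made-up targets and predictions for a single-feature model
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.5], k=1)
print(metrics)
```

Because the penalty factor `(n - 1) / (n - k - 1)` grows with `k`, adjusted R2 is the more suitable figure when comparing models with different numbers of features.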

In [None]:
# Columns Check
insurance_df.columns

In [None]:
# Check the Weights for the various Features
list(zip(['age', 'sex', 'bmi', 'children', 'smoker', 'northwest',
       'southeast', 'southwest'], regressor.coef_[0])) 

## 6) ONLY THE MOST SIGNIFICANT FEATURES

In [None]:
X_3f = insurance_df[['smoker','bmi','age']].values
y_3f = insurance_df['charges'].values

In [None]:
# Casting to NP Arrays
X_3f = np.array(X_3f)
y_3f = np.array(y_3f)

In [None]:
# Reshaping of y
y_3f = y_3f.reshape(-1,1)

In [None]:
# Scale the numerical data before feeding it to the model
#from sklearn.preprocessing import MinMaxScaler

scaler_x3f = MinMaxScaler()
X_3f = scaler_x3f.fit_transform(X_3f)

scaler_y3f = MinMaxScaler()
y_3f = scaler_y3f.fit_transform(y_3f)

In [None]:
# Split the data into 20% Testing and 80% Training
#from sklearn.model_selection import train_test_split

X3f_train,X3f_test,y3f_train,y3f_test = train_test_split(X_3f,y_3f,test_size=0.20,random_state=72)

In [None]:
# Using Linear Regression Model
#from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X3f_train, y3f_train)

In [None]:
# Get the predictions
y3f_predict = regressor.predict(X3f_test)

In [None]:
# Get the Values "before" scaling
y3f_predict_orig = scaler_y3f.inverse_transform(y3f_predict)
y3f_test_orig = scaler_y3f.inverse_transform(y3f_test)

In [None]:
# Number of Features and Cases
k = X3f_test.shape[1] # Number of Features
n = len(X3f_test) # Number of Cases
print("Features:",k)
print("Cases:",n)

In [None]:
# Metrics Calculation
#from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RMSE = float(format(np.sqrt(mean_squared_error(y3f_test_orig, y3f_predict_orig)),'.3f'))
MSE = mean_squared_error(y3f_test_orig, y3f_predict_orig)
MAE=mean_absolute_error(y3f_test_orig, y3f_predict_orig)
r2=r2_score(y3f_test_orig, y3f_predict_orig)
adj_r2 = 1 - (1 - r2) * (n -1) / (n - k -1)

In [None]:
# Evaluation Results Printing
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 

## 7) ARTIFICIAL NEURAL NETWORK FOR REGRESSION

In [None]:
#!pip install tensorflow

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam

In [None]:
# Deep Neural Network
ANN_model = keras.Sequential()
ANN_model.add(Dense(32, input_dim = 8, activation = 'relu'))
ANN_model.add(Dense(64, activation = 'relu'))
ANN_model.add(Dropout(0.25))
ANN_model.add(Dense(64, activation = 'relu'))
ANN_model.add(Dropout(0.25))
ANN_model.add(Dense(32, activation = 'linear')) # Hidden layer with linear (identity) activation
ANN_model.add(Dense(1)) # Output: a single unit with no activation, giving a continuous value for regression

ANN_model.summary() # Print the Model Summary

In [None]:
ANN_model.compile(optimizer='adam', loss='mean_squared_error')

epochs_hist = ANN_model.fit(X_train, y_train, epochs = 100, batch_size = 8, validation_split = 0.2)
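
A rough sketch of how this `fit` call partitions the rows, assuming 1070 training rows (from the 80/20 split above) and assuming Keras holds out the last `validation_split` fraction of the rows it receives, before any shuffling. `fit_partition` is a hypothetical helper, not a Keras function.

```python
import math


def fit_partition(n_rows, validation_split, batch_size):
    """Hypothetical helper: estimate row counts for a Keras-style validation split."""
    n_fit = math.floor(n_rows * (1.0 - validation_split))  # rows used for weight updates
    n_val = n_rows - n_fit                                  # rows held out for val_loss
    steps_per_epoch = math.ceil(n_fit / batch_size)         # gradient updates per epoch
    return n_fit, n_val, steps_per_epoch


n_fit, n_val, steps = fit_partition(n_rows=1070, validation_split=0.2, batch_size=8)
print(n_fit, n_val, steps)
```

Under these assumptions, `val_loss` is computed on rows that are separate from both the fitted rows and the 268 test rows.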

In [None]:
# All information about the training
epochs_hist.history.keys()

In [None]:
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])
plt.title('Model Loss Progress During Training')
plt.xlabel('Epoch')
plt.ylabel('Training and Validation Loss')
plt.legend(['Training Loss', 'Validation Loss'])

In [None]:
y_predict = ANN_model.predict(X_test)
plt.plot(y_test, y_predict, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')

In [None]:
y_predict_orig = scaler_y.inverse_transform(y_predict)
y_test_orig = scaler_y.inverse_transform(y_test)

In [None]:
plt.plot(y_test_orig, y_predict_orig, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')

In [None]:
# Number of features and cases
k = X_test.shape[1]
n = len(X_test)

# Metrics calculation
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RMSE = float(format(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig)),'.3f'))
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 