# Insurance Forecast by using Linear Regression - Ganesh Nagappa Shetty

## Introduction
To make a profit, insurance companies must collect more in premiums than they pay out to insured persons. For this reason, insurance companies invest a lot of time, effort, and money in building models that accurately predict health care costs. In this kernel, I will try to build the most accurate model possible while keeping everything simple.

## 1. Reading and Understanding the Data

Let us first import the necessary libraries, load the dataset, and try to understand the data.

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

#Import all important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
#Load the dataset and check initial entries of the dataset
df=pd.read_csv('/kaggle/input/insurance/insurance.csv')
df.head()

In [None]:
#Shape of the dataset
df.shape

In [None]:
#Information Summary of the dataset
df.info()

In [None]:
#Checking for Null Values
df.isnull().sum()

**Observation:** There are no missing values in the dataset. Let's check for outliers next.

## 2. Cleaning the data

### 2.1 Checking for Outliers

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(8, 8))

fig.suptitle('Outlier Analysis')
sns.boxplot(ax=axes[0, 0], data=df['age'])
axes[0, 0].set_title('Age')
sns.boxplot(ax=axes[0, 1], data=df['bmi'])
axes[0, 1].set_title('BMI')
sns.boxplot(ax=axes[1, 0], data=df['children'])
axes[1, 0].set_title('Children')
sns.boxplot(ax=axes[1, 1], data=df['charges'])
axes[1, 1].set_title('Charges')

plt.show()

**Observation:** There are no outliers to remove in the numerical variables of the dataset. The data points beyond the upper whisker in Charges and BMI form a continuous tail rather than isolated extreme values. This is quite a clean dataset.
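
The visual check can be backed by a count of the points that fall outside the box plot whiskers. The sketch below assumes the usual 1.5 × IQR whisker rule (seaborn's default); the helper name `count_beyond_whiskers` and the toy values are hypothetical and not part of the original kernel.

```python
import pandas as pd

def count_beyond_whiskers(values: pd.Series) -> int:
    """Count points outside the 1.5 * IQR whiskers that a box plot draws."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical toy data, not the insurance dataset
toy = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 100])
print(count_beyond_whiskers(toy))
```

On the real data, the same helper could be applied to each column, for example `df[['age', 'bmi', 'children', 'charges']].apply(count_beyond_whiskers)`. A large count of closely spaced points suggests a long tail rather than a handful of isolated outliers.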

## 3. Exploratory Data Analysis

### 3.1 Bivariate Analysis

In [None]:
plt.figure(figsize=(10,4))
plt.title('Effect of Sex on Charges')
sns.boxplot(x='sex',y='charges',data=df)
plt.show()

In [None]:
plt.figure(figsize=(10,4))
plt.title('Effect of Age on Charges')
sns.scatterplot(x='age',y='charges',data=df)
plt.show()

**Observation:** Males spend more on healthcare than females. Another obvious observation is that healthcare expenditure increases steadily with age.

### 3.2 Multivariate Analysis

In [None]:
plt.figure(figsize=(10,4))
plt.title('Effect of Smokers on Charges')
sns.boxplot(x='smoker',y='charges',data=df,hue='sex',palette='viridis')
plt.show()

**Observation:** 
- Smokers spend more in hospital.
- Among non-smokers, spending by males and females falls in similar ranges, with females spending fractionally more.
- Among smokers, however, males end up spending more in hospital.

In [None]:
plt.figure(figsize=(10,4))
plt.title('Effect of Regions and sex on Charges')
sns.boxplot(x='region',y='charges',data=df,hue='sex',palette='viridis')
plt.show()

In [None]:
plt.figure(figsize=(10,4))
plt.title('Effect of Regions and smokers on Charges')
sns.boxplot(x='region',y='charges',data=df,hue='smoker',palette='viridis')
plt.show()

**Observation:** 
- People in the Southeast spend more on healthcare than people in other regions.
- Irrespective of region, it is generally males who spend more in hospitals.
- Again irrespective of region, it is smokers who spend heavily in hospitals. Here as well, the Southeast has the upper hand over other regions.
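
The box plots can be complemented with a small summary table. The sketch below shows one way to tabulate median charges by region and smoking status; the toy rows are hypothetical stand-ins for the dataset, so the numbers carry no meaning.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns (not real data)
toy = pd.DataFrame({
    'region': ['southeast', 'southeast', 'northwest', 'northwest'],
    'smoker': ['yes', 'no', 'yes', 'no'],
    'charges': [40000.0, 8000.0, 30000.0, 7000.0],
})

# Median charges per region and smoking status, one column per smoker value
summary = toy.groupby(['region', 'smoker'])['charges'].median().unstack()
print(summary)
```

Running the same `groupby` on `df` would put numbers behind the visual comparison of regions and smokers.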

In [None]:
# Let's check the correlation between the variables
sns.pairplot(df)
plt.show()

In [None]:
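# Heatmap of the correlation between the numerical variables
# (numeric_only=True skips the text columns sex, smoker and region)
corrMatt = df.corr(numeric_only=True)
# Hide the upper triangle above the diagonal, which mirrors the lower one
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)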
fig,ax= plt.subplots()
fig.set_size_inches(20,10)
sns.heatmap(corrMatt, mask=mask,cmap='viridis', square=True,annot=True)
plt.show()

**Observation:** We can observe a positive linear relationship between `age` and `charges`.

## 4. Preparation of dataset for Model Building

In [None]:
#First few lines of data
df.head()

### 4.1 Handling Categorical Variables

In [None]:
# Let's encode sex and smoker as binary variables (male: 1, female: 0; smoker yes: 1, smoker no: 0)
df.sex=df.sex.apply(lambda x: 1 if x=='male' else 0)
df.smoker=df.smoker.apply(lambda x: 1 if x=='yes' else 0)
df.head()

In [None]:
# Let's convert region into dummy variables
# dtype=int keeps the dummies numeric (newer pandas versions default to bool)
region = pd.get_dummies(df['region'], drop_first = True, prefix='region', dtype=int)
df = pd.concat([df, region], axis = 1)

# Dropping the original region variable
df.drop('region',axis=1,inplace=True)
df.head()
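
To see what `drop_first=True` does, here is a minimal sketch on a toy column. It assumes the fourth region is `northeast`, which sorts first alphabetically and therefore becomes the baseline that the remaining dummies are measured against.

```python
import pandas as pd

# Toy column with one row per region (assumed category names)
toy = pd.DataFrame({'region': ['northeast', 'northwest', 'southeast', 'southwest']})
dummies = pd.get_dummies(toy['region'], drop_first=True, prefix='region', dtype=int)

print(list(dummies.columns))
# A row of all zeros encodes the dropped baseline category
print(dummies.iloc[0].tolist())
```

Dropping one level avoids perfectly collinear dummy columns once an intercept is included in the model.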

### 4.2 Splitting the Data into Training and Testing Sets

In [None]:
# Let's split the data into training and testing sets (70%-30% split)
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

### 4.3 Rescaling the Features

We see that the charges, age and BMI variables are on a much larger scale than the others. Let's scale the training data using a min-max scaler.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['charges', 'age', 'bmi']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

df_train.head()

### 4.4 Dividing into X and Y sets for the model building

In [None]:
y_train = df_train.pop('charges')
X_train = df_train

## 5. Building the Model - Considering all variables

Let's build the model with all variables first and then compare its performance with the model built after eliminating features.

### 5.1 Fitting Linear regression model onto Train data

In [None]:
# Importing LinearRegression
from sklearn.linear_model import LinearRegression

# Fitting LinearRegression onto the train data
lm = LinearRegression()
lm.fit(X_train, y_train)

### 5.2 Making Predictions Using the Model
#### 5.2.1 Applying the scaling to the test set (transform only, no fitting)

In [None]:
num_vars = ['charges', 'age', 'bmi']

df_test[num_vars] = scaler.transform(df_test[num_vars])
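
The reason for transforming without refitting is that the scaler should only know the minimum and maximum of the training data. A minimal sketch with hypothetical ages illustrates this; note that a test value outside the training range ends up outside the 0 to 1 interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[20.0], [30.0], [40.0]])  # hypothetical training ages
test = np.array([[25.0], [50.0]])           # hypothetical test ages

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=20 and max=40 from the training data only

# Test rows are scaled with the training min and max: (x - 20) / (40 - 20)
scaled_test = scaler.transform(test)
print(scaled_test.ravel())
```

Refitting on the test set would leak information about the test distribution into the evaluation and would put train and test features on different scales.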

#### 5.2.2 Dividing into X_test and y_test

In [None]:
y_test = df_test.pop('charges')
X_test = df_test

#### 5.2.3 Model Evaluation

In [None]:
# Making predictions using the model with all variables
y_pred = lm.predict(X_test)

In [None]:
# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_pred)
# Plot heading 
fig.suptitle('y_test vs y_pred', fontsize = 20) 
# X-label
plt.xlabel('y_test', fontsize = 18) 
# y-label
plt.ylabel('y_pred', fontsize = 16)
plt.show()

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

**Observation:** We now have a model with an r2_score of **77.7%**, which is not bad. Let's now check whether we can build a more efficient model (free of insignificant variables and multicollinearity).
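
For reference, `r2_score` measures the share of the variance in the observed values that the predictions explain, i.e. 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces it by hand on hypothetical values, not the kernel's results.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical observed values
y_hat = np.array([2.5, 5.5, 7.0, 9.5])   # hypothetical predictions

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```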

## 6. Building the Model - Considering significant variables and avoiding Multicollinearity if any

### 6.1 Splitting the Data into Training and Testing Sets

In [None]:
# Let's split the data into training and testing sets (70%-30% split)
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)
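
The split is repeated here because the earlier `df_train` and `df_test` were scaled and had `charges` popped from them. Reusing `random_state = 100` should give the same rows as before, so both models are compared on the same data. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 100 rows
toy = pd.DataFrame({'x': np.arange(100)})

first_train, first_test = train_test_split(
    toy, train_size=0.7, test_size=0.3, random_state=100)
second_train, second_test = train_test_split(
    toy, train_size=0.7, test_size=0.3, random_state=100)

# The same seed yields the same rows in the same order
print(first_train.index.equals(second_train.index))
print(len(first_train), len(first_test))
```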

### 6.2 Rescaling the Features

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['charges', 'age', 'bmi']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

df_train.head()

### 6.3 Dividing into X and Y sets for the model building

In [None]:
y_train = df_train.pop('charges')
X_train = df_train

### 6.4 Fitting Linear regression model onto Train data

Let's use the statsmodels library for its detailed statistical output.

In [None]:
# Let's keep a backup copy of the X_train data
X_train_bc=X_train.copy()

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_new = sm.add_constant(X_train_bc)

# Running the linear model
lm = sm.OLS(y_train,X_train_new).fit()

#Let's see the summary of our linear model
print(lm.summary())

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_bc
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

**Observation:** As we can see, `sex`, with a p-value of 1.00 (much higher than 0.03), is highly insignificant. Let's drop this variable and rebuild the model.

In [None]:
#dropping sex variable
X_train_bc = X_train_bc.drop(['sex'], axis=1)

In [None]:
#Rebuilding second model

# Adding a constant variable 
import statsmodels.api as sm  
X_train_new = sm.add_constant(X_train_bc)

# Running the linear model
lm2 = sm.OLS(y_train,X_train_new).fit()

#Let's see the summary of our linear model
print(lm2.summary())

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_bc
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

**Observation:** After dropping `sex`, there is no alarming multicollinearity in the model (all VIFs are less than 5). We can also see that the Adj. R-squared has not changed. `region_northwest` and `region_southeast` are still insignificant (p-values above 0.03). We shall drop `region_northwest` and rebuild the model.
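
As a reminder of what the VIF measures: for each feature, regress it on the other features and compute 1 / (1 - R²); values near 1 mean the feature is hardly explained by the others. The sketch below implements this textbook definition with an intercept, using only NumPy and synthetic data. The helper `vif_with_intercept` is hypothetical. As far as I understand, the statsmodels function regresses on the columns exactly as passed, so without a constant column its numbers can differ from this version.

```python
import numpy as np

def vif_with_intercept(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(X)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

# Synthetic features: b is independent, c is nearly a copy of a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X_demo = np.column_stack([a, b, c])

print([round(vif_with_intercept(X_demo, i), 2) for i in range(X_demo.shape[1])])
```

The nearly duplicated columns should show a very large VIF, while the independent column should stay close to 1.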

In [None]:
#dropping region_northwest variable
X_train_bc = X_train_bc.drop(['region_northwest'], axis=1)

In [None]:
#Rebuilding third model

# Adding a constant variable 
import statsmodels.api as sm  
X_train_new = sm.add_constant(X_train_bc)

# Running the linear model
lm3 = sm.OLS(y_train,X_train_new).fit()

#Let's see the summary of our linear model
print(lm3.summary())

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_bc
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

**Observation:** After dropping `region_northwest`, there is no alarming multicollinearity in the model (all VIFs are less than 5). We can also see that the Adj. R-squared has not changed. `region_southeast` is still insignificant (p-value above 0.03). We shall drop `region_southeast` and rebuild the model.

In [None]:
#dropping region_southeast variable
X_train_bc = X_train_bc.drop(['region_southeast'], axis=1)

In [None]:
#Rebuilding fourth model

# Adding a constant variable 
import statsmodels.api as sm  
X_train_new = sm.add_constant(X_train_bc)

# Running the linear model
lm4 = sm.OLS(y_train,X_train_new).fit()

#Let's see the summary of our linear model
print(lm4.summary())

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_bc
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

**Observation:** After dropping `region_southeast`, there is no alarming multicollinearity in the model (all VIFs are less than 5). We can also see that the Adj. R-squared has not changed. `region_southwest` is still insignificant (p-value above 0.03). We shall drop `region_southwest` and rebuild the model.

In [None]:
#dropping region_southwest variable
X_train_bc = X_train_bc.drop(['region_southwest'], axis=1)

In [None]:
#Rebuilding fifth model

# Adding a constant variable 
import statsmodels.api as sm  
X_train_new = sm.add_constant(X_train_bc)

# Running the linear model
lm5 = sm.OLS(y_train,X_train_new).fit()

#Let's see the summary of our linear model
print(lm5.summary())

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_bc
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

**Observation:** After leaving out the `region_southwest` variable, we have a model that is free of multicollinearity (all VIFs are less than 5) and in which all remaining variables are significant (p-values below 0.03). We can also observe that the Adj. R-squared has come down only fractionally (by 0.001). We now have an efficient model. Let's use it as the final model and predict the charges for the test data.

### 6.5 Making Predictions Using the Final Model

#### 6.5.1 Applying the scaling to the test set (transform only, no fitting)

In [None]:
num_vars = ['charges', 'age', 'bmi']

df_test[num_vars] = scaler.transform(df_test[num_vars])

df_test.describe()

#### 6.5.2 Dividing into X_test and y_test

In [None]:
y_test = df_test.pop('charges')
X_test = df_test

#### 6.5.3 Preparing the test dataset

In [None]:
# Adding constant variable to test dataframe
import statsmodels.api as sm 
X_test_m5 = sm.add_constant(X_test)

# Creating X_test_m5 dataframe by dropping variables from X_test_m5
X_test_m5 = X_test_m5.drop(['sex', 'region_northwest', 'region_southeast', 'region_southwest'], axis = 1)

# Making predictions using the fifth model
y_pred = lm5.predict(X_test_m5)
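
Instead of listing the dropped variables by hand, the test columns can be selected from the training columns, which keeps the two frames aligned if the feature set changes. A minimal sketch with hypothetical one-row frames standing in for `X_train_bc` and `X_test`:

```python
import pandas as pd

# Hypothetical frames standing in for X_train_bc and X_test
train_features = pd.DataFrame({'age': [0.1], 'bmi': [0.4], 'children': [2], 'smoker': [1]})
test_features = pd.DataFrame({
    'sex': [1], 'smoker': [0], 'age': [0.5], 'bmi': [0.3],
    'children': [0], 'region_southwest': [1],
})

# Keep only the training columns, in the training order
aligned = test_features[train_features.columns]
print(list(aligned.columns))
```

In the kernel this would correspond to something like `sm.add_constant(X_test[X_train_bc.columns])`.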

### 6.6 Model Evaluation

In [None]:
# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_pred)
# Plot heading 
fig.suptitle('y_test vs y_pred', fontsize = 20) 
# X-label
plt.xlabel('y_test', fontsize = 18) 
# y-label
plt.ylabel('y_pred', fontsize = 16)
plt.show()

In [None]:
#Model parameters
round(lm5.params,3)

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

**Conclusion:** The model now has fewer variables (4 insignificant variables are left out). We now have an R-squared value of **78%**.

This model has an R-squared value marginally better than the previous model (with all variables). The model now has no insignificant variables, which makes it more cost- and time-effective.

The final equation of the model, with `charges`, `age` and `bmi` expressed on their min-max scaled ranges, is

$ charges = -0.043 + 0.191 \times age + 0.165 \times bmi + 0.007 \times children + 0.383 \times smoker $
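
Because `charges` was min-max scaled before fitting, the model's predictions are on a 0 to 1 scale. To report them in the original currency units, the scaling has to be undone using the training minimum and range. The sketch below assumes a scaler fitted on the columns in the order `charges`, `age`, `bmi`, as in this kernel; the helper name and all numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training rows in the kernel's column order: charges, age, bmi
train = np.array([
    [2000.0, 18.0, 20.0],
    [12000.0, 40.0, 30.0],
    [52000.0, 64.0, 50.0],
])
demo_scaler = MinMaxScaler().fit(train)

def charges_to_original_units(scaled_charges, scaler, column=0):
    """Undo min-max scaling for one column: x = x_scaled * range + min."""
    return scaled_charges * scaler.data_range_[column] + scaler.data_min_[column]

scaled_prediction = np.array([0.0, 0.2, 1.0])  # hypothetical model outputs
restored = charges_to_original_units(scaled_prediction, demo_scaler)
print(restored)
```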

**`Smoker`** turned out to be the most influential variable in deciding hospital charges, with the largest coefficient in the final model.