### Personal Insurance Prediction
Insurance premiums and the assured amount depend largely on an individual's lifestyle and existing health conditions. The data contains the following information about people and their related insurance charges.

Columns

age: age of primary beneficiary

sex: insurance contractor gender, female, male

bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes / no)

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance
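
The definition of the bmi column above can be sketched as a small calculation. The function name `bmi` and the sample weight and height are illustrative assumptions, not values from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
value = bmi(70.0, 1.75)
in_ideal_range = 18.5 <= value <= 24.9
print(round(value, 2), in_ideal_range)
```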

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df=pd.read_csv('../input/insurance/insurance.csv')
df.head()

#### Understanding Data

In [None]:
#Data dimensions
r,c=df.shape
print(f"The data has {r} rows and {c} columns.")

In [None]:
#Checking data types
cat_cols=df.select_dtypes(exclude=np.number).columns
num_cols=df.select_dtypes(include=np.number).columns
print(f"There are {len(cat_cols)} categorical columns in data.\nThey are:\n {cat_cols}\n")
print(f"There are {len(num_cols)} numerical columns in data.\nThey are:\n {num_cols}")

In [None]:
df.info()

In [None]:
#Descriptive Statistics

In [None]:
df.describe()

Observations:

1. The average age is about 39 years with a standard deviation of about 14 years. Ages range from 18 to 64 years.
2. The average BMI is nearly 30.
3. The average number of children is about one.
4. The summary statistics suggest roughly symmetric distributions for age, BMI and children. Charges are right-skewed.
5. Half of the individuals are charged about 9382.9033 or less (the median).
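
The skewness observation above can also be checked numerically rather than read off `describe()`. The sketch below uses a synthetic stand-in frame (`demo`, with made-up columns) instead of the real insurance data, and compares mean, median and sample skewness.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns (NOT the real insurance data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=30, scale=6, size=1000),
    "right_skewed": rng.lognormal(mean=9, sigma=1, size=1000),
})

# A mean well above the median and a large positive skew indicate a right-skewed column
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),
})
print(summary)
```

On the notebook's data the same check would presumably be `df[num_cols].skew()`.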

In [None]:
df.describe(exclude=np.number)

Observations: 

1. The data is balanced with respect to gender, with slightly more males.
2. Most individuals are non-smokers.
3. Of the 4 unique regions covered in the data, the southeast region is the most common.

In [None]:
df2=pd.get_dummies(df,drop_first=True)

In [None]:
sns.pairplot(df2)

Observations: 

1. Charges vs Age: charges appear to rise roughly linearly with age, and individuals fall into visibly separate groups.
2. There seems to be some linear relationship between charges and BMI.

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(df2.corr(), annot=True)

Observations: 

1. There is some correlation between charges and age.
2. Smoker status (smoker_yes) is highly correlated with insurance charges.

### Data Preprocessing

In [None]:
#Scaling Data
X=df2.drop('charges',axis=1)
y=df2['charges']
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
inp_sc=sc.fit_transform(X)
inp_sc=pd.DataFrame(inp_sc,columns=X.columns)

In [None]:
inp_sc.head()

### Model Building

#### Linear Regression: OLS based model

In [None]:
import statsmodels.api as sm
# Add an intercept column and fit on the scaled features (not the unscaled X)
X_const=sm.add_constant(inp_sc)
ols=sm.OLS(y,X_const)
mod=ols.fit()
mod.summary()

##### Checking Assumptions

In [None]:
#Multicollinearity: 
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif=pd.DataFrame()
vif['VIF']=[variance_inflation_factor(inp_sc.values,i) for i in range(inp_sc.shape[1])]
vif['Feature']=inp_sc.columns
vif.sort_values('VIF',ascending=False)

Observation: 
    No problem of multicollinearity

In [None]:
#Linearity:
for i in inp_sc.columns:
    # Recent seaborn releases require x and y as keyword arguments
    sns.scatterplot(x=inp_sc[i],y=y)
    plt.xlabel(f"{i}")
    plt.ylabel("Charges")
    plt.title(f"{i} Vs Charges")
    plt.show()

Observations:

1. Linear relationship between age and charges
2. Some linearity for BMI vs Charges. 
3. Observable difference in charges for smokers and non smokers. 

In [None]:
#Normality: 
sns.histplot(mod.resid, kde=True)  # distplot is deprecated in recent seaborn releases

Observation: 
Near-normal distribution of residuals.

In [None]:
#Autocorrelation
plt.figure(figsize=(10,5))
sns.heatmap(df2.corr(), annot=True)

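Observation: 

No or low observable correlation among the input features. Note that the autocorrelation assumption itself concerns the residuals rather than the features.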
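
Autocorrelation of the residuals can be checked more directly with the Durbin-Watson statistic, where values near 2 suggest no first-order autocorrelation and values near 0 suggest strong positive autocorrelation. The sketch below is a minimal illustration on synthetic residuals; the arrays `independent` and `autocorrelated` are made up for the example and are not taken from the model above.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# Synthetic residuals: independent noise vs. an AR(1) series with strong positive autocorrelation
independent = rng.normal(size=500)
autocorrelated = np.zeros(500)
for t in range(1, 500):
    autocorrelated[t] = 0.9 * autocorrelated[t - 1] + rng.normal()

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)
print(f"independent: {dw_independent:.2f}, autocorrelated: {dw_autocorrelated:.2f}")
```

Applied to the notebook's model, the call would be `durbin_watson(mod.resid)`.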

In [None]:
#Homoscedasticity: ypred vs error
sns.residplot(x=mod.predict(),y=mod.resid)


In [None]:
from statsmodels.stats.api import het_goldfeldquandt
het_goldfeldquandt(mod.resid,mod.model.exog)

Observation: 

Since the p-value is > 0.05, we fail to reject the null hypothesis,
i.e. that the variance of the residuals is constant.
Therefore the model satisfies the condition of homoscedasticity.

#### Linear Regression: Sklearn based model

In [None]:
#Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inp_sc, y, test_size=0.3, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X_train,y_train)
y_pred_train=lr.predict(X_train)
y_pred_test=lr.predict(X_test)

In [None]:
from sklearn.metrics import r2_score,mean_absolute_error, mean_squared_error
print("Training stage characteristics\n")
print("R2 score of model: ", r2_score(y_train,y_pred_train))
print("Mean Absolute Error of model: ", mean_absolute_error(y_train,y_pred_train))
print("Mean Squared Error of model: ", mean_squared_error(y_train,y_pred_train))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_train,y_pred_train)))
print("\n\nTesting stage characteristics\n")
print("R2 score of model: ", r2_score(y_test,y_pred_test))
print("Mean Absolute Error of model: ", mean_absolute_error(y_test,y_pred_test))
print("Mean Squared Error of model: ", mean_squared_error(y_test,y_pred_test))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_test,y_pred_test)))
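
Since the same four metrics are printed for every model, they could be collected in a small helper. This is only a sketch: the name `regression_report` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def regression_report(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }

# Toy values for illustration only
report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(report)
```

In the notebook it would be called as `regression_report(y_train, y_pred_train)` and `regression_report(y_test, y_pred_test)`.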

##### Note: Ignore R2 Scores for non-linear models. 

In [None]:
# Based on relationship between age and charges, trying KNN Regressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn=KNeighborsRegressor(n_neighbors=100,weights='distance')

knn.fit(X_train,y_train)
y_pred_train=knn.predict(X_train)
y_pred_test=knn.predict(X_test)

In [None]:
from sklearn.metrics import r2_score,mean_absolute_error, mean_squared_error
print("Training stage characteristics\n")
print("R2 score of model: ", r2_score(y_train,y_pred_train))
print("Mean Absolute Error of model: ", mean_absolute_error(y_train,y_pred_train))
print("Mean Squared Error of model: ", mean_squared_error(y_train,y_pred_train))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_train,y_pred_train)))
print("\n\nTesting stage characteristics\n")
print("R2 score of model: ", r2_score(y_test,y_pred_test))
print("Mean Absolute Error of model: ", mean_absolute_error(y_test,y_pred_test))
print("Mean Squared Error of model: ", mean_squared_error(y_test,y_pred_test))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_test,y_pred_test)))

In [None]:
# Trying non linear ensemble model to see if there is any improvement in performance


In [None]:
from sklearn.ensemble import RandomForestRegressor
rfc=RandomForestRegressor()
rfc.fit(X_train,y_train)
y_pred_train=rfc.predict(X_train)
y_pred_test=rfc.predict(X_test)

In [None]:
from sklearn.metrics import r2_score,mean_absolute_error, mean_squared_error
print("Training stage characteristics\n")
print("R2 score of model: ", r2_score(y_train,y_pred_train))
print("Mean Absolute Error of model: ", mean_absolute_error(y_train,y_pred_train))
print("Mean Squared Error of model: ", mean_squared_error(y_train,y_pred_train))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_train,y_pred_train)))
print("\n\nTesting stage characteristics\n")
print("R2 score of model: ", r2_score(y_test,y_pred_test))
print("Mean Absolute Error of model: ", mean_absolute_error(y_test,y_pred_test))
print("Mean Squared Error of model: ", mean_squared_error(y_test,y_pred_test))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_test,y_pred_test)))

Observation: 

While the model's predictive performance has improved, the gap between training and testing scores indicates that it is overfitting. 
Next we will try to improve this model. 
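
Overfitting shows up as a gap between training and testing scores. The sketch below illustrates this on synthetic data from `make_regression` (not the insurance data), comparing an unrestricted forest with one limited by `max_depth=3`. The depth value is an arbitrary assumption chosen for the illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy regression problem used only for illustration
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=50.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

gaps = {}
for depth in (None, 3):
    forest = RandomForestRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, forest.predict(X_tr))
    test_r2 = r2_score(y_te, forest.predict(X_te))
    gaps[depth] = train_r2 - test_r2   # a large positive gap suggests overfitting
    print(f"max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```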

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
rfc=RandomForestRegressor()
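# 'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in newer scikit-learn releases
grid={'criterion':['squared_error', 'absolute_error']}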
gc=GridSearchCV(rfc,param_grid=grid, cv=10, scoring='neg_mean_squared_error')
gc.fit(X_train,y_train)  # fit on the training split only so the test metrics are not computed on seen data

In [None]:
gc.best_params_

In [None]:
y_pred_train=gc.predict(X_train)
y_pred_test=gc.predict(X_test)
from sklearn.metrics import r2_score,mean_absolute_error, mean_squared_error
print("Training stage characteristics\n")
print("R2 score of model: ", r2_score(y_train,y_pred_train))
print("Mean Absolute Error of model: ", mean_absolute_error(y_train,y_pred_train))
print("Mean Squared Error of model: ", mean_squared_error(y_train,y_pred_train))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_train,y_pred_train)))
print("\n\nTesting stage characteristics\n")
print("R2 score of model: ", r2_score(y_test,y_pred_test))
print("Mean Absolute Error of model: ", mean_absolute_error(y_test,y_pred_test))
print("Mean Squared Error of model: ", mean_squared_error(y_test,y_pred_test))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_test,y_pred_test)))

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
ab=AdaBoostRegressor()
grid={'learning_rate':[0.001,0.01,0.1,1, 2,5,10,30], 'random_state': [20]}
gc=GridSearchCV(ab,param_grid=grid,cv=10,scoring='neg_mean_squared_error')
gc.fit(X_train,y_train)  # fit on the training split only so the test metrics are not computed on seen data

In [None]:
y_pred_train=gc.predict(X_train)
y_pred_test=gc.predict(X_test)
from sklearn.metrics import r2_score,mean_absolute_error, mean_squared_error
print("Training stage characteristics\n")
print("R2 score of model: ", r2_score(y_train,y_pred_train))
print("Mean Absolute Error of model: ", mean_absolute_error(y_train,y_pred_train))
print("Mean Squared Error of model: ", mean_squared_error(y_train,y_pred_train))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_train,y_pred_train)))
print("\n\nTesting stage characteristics\n")
print("R2 score of model: ", r2_score(y_test,y_pred_test))
print("Mean Absolute Error of model: ", mean_absolute_error(y_test,y_pred_test))
print("Mean Squared Error of model: ", mean_squared_error(y_test,y_pred_test))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_test,y_pred_test)))

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
gb=GradientBoostingRegressor()
grid={'learning_rate':[0.001,0.01,0.1,1, 2,5,10,30], 'random_state': [20]}
gc=GridSearchCV(gb,param_grid=grid,cv=10,scoring='neg_mean_squared_error')
gc.fit(X_train,y_train)  # fit on the training split only so the test metrics are not computed on seen data

In [None]:
y_pred_train=gc.predict(X_train)
y_pred_test=gc.predict(X_test)
from sklearn.metrics import r2_score,mean_absolute_error, mean_squared_error
print("Training stage characteristics\n")
print("R2 score of model: ", r2_score(y_train,y_pred_train))
print("Mean Absolute Error of model: ", mean_absolute_error(y_train,y_pred_train))
print("Mean Squared Error of model: ", mean_squared_error(y_train,y_pred_train))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_train,y_pred_train)))
print("\n\nTesting stage characteristics\n")
print("R2 score of model: ", r2_score(y_test,y_pred_test))
print("Mean Absolute Error of model: ", mean_absolute_error(y_test,y_pred_test))
print("Mean Squared Error of model: ", mean_squared_error(y_test,y_pred_test))
print("Root Mean Squared Error of model: ", np.sqrt(mean_squared_error(y_test,y_pred_test)))

In [None]:
#Trying all models at once
models=[]
models.append(('Linear Regression',lr))
models.append(('KNN',knn))
models.append(('RandomForest',rfc))
models.append(('AdaBoost',ab))
models.append(('GradientBoost',gb))

In [None]:
model_name=[]
scores=[]
for name,model in models: 
    kfold=KFold(n_splits=10, shuffle=True, random_state=20)
    score=cross_val_score(model,inp_sc,y, cv=kfold, scoring='neg_mean_squared_error')
    model_name.append(name)
    scores.append(score)
    # score holds negative MSE values, so negate them to report the mean error
    print(f"{name}, bias error (mean MSE): {np.mean(-score)}, Variance error: {np.var(score,ddof=1)}")
         

In [None]:
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(scores)
ax.set_xticklabels(model_name)
plt.show()

Interpretation: 

1. Non-linear models perform better at predicting the insurance charges. 
2. The Gradient Boosting algorithm gives the lowest bias error at the expense of slightly more variance error compared with the other ensemble models. 
3. The Random Forest Regressor gives the best balance between bias and variance error. So we will proceed with these algorithms to check whether a combination of ensemble models gives a better result. 
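
Because `neg_mean_squared_error` returns negated errors, the fold scores can be easier to compare after converting them to RMSE and tabulating the mean and spread per model. The sketch below does this on synthetic data with two of the model types. The data, the 5-fold setting and the name `cv_summary` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=20)

rows = []
for name, model in [("Linear Regression", LinearRegression()),
                    ("RandomForest", RandomForestRegressor(n_estimators=50, random_state=0))]:
    neg_mse = cross_val_score(model, X_demo, y_demo, cv=kfold,
                              scoring="neg_mean_squared_error")
    rmse = np.sqrt(-neg_mse)   # flip the sign, then take the root
    rows.append({"model": name, "mean_rmse": rmse.mean(), "std_rmse": rmse.std(ddof=1)})

cv_summary = pd.DataFrame(rows).set_index("model")
print(cv_summary)
```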


In [None]:
from sklearn.ensemble import VotingRegressor
va=VotingRegressor(estimators=[('RandomForest',rfc),('AdaBoost',ab),('GradientBoost',gb)])
models=[]
models.append(('Linear Regression',lr))
models.append(('KNN',knn))
models.append(('RandomForest',rfc))
models.append(('AdaBoost',ab))
models.append(('GradientBoost',gb))
models.append(('VotingAlgo',va))


In [None]:
model_name=[]
scores=[]
for name,model in models: 
    kfold=KFold(n_splits=10, shuffle=True, random_state=20)
    score=cross_val_score(model,inp_sc,y, cv=kfold, scoring='neg_mean_squared_error')
    model_name.append(name)
    scores.append(score)
    print(f"{name}, bias: {np.mean(1-score)}, Variance error: {np.var(score,ddof=1)}")
         

In [None]:
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(scores)
ax.set_xticklabels(model_name)
plt.xticks(rotation=30)
plt.show()

Observation: 

As expected, the Voting Regressor combines the strengths of the individual ensemble models. It gives comparable performance with a better balance between bias and variance error. We can proceed with this model for deployment.
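
One possible way to prepare the chosen model for deployment is to bundle the scaling step and the voting ensemble in a single scikit-learn pipeline, so that the scaler is fitted only on training data and new inputs are scaled automatically. The sketch below uses synthetic data and arbitrary settings (`n_estimators=50`, `random_state=20`). It is an illustration under those assumptions, not the notebook's final model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

voter = VotingRegressor(estimators=[
    ("RandomForest", RandomForestRegressor(n_estimators=50, random_state=20)),
    ("AdaBoost", AdaBoostRegressor(random_state=20)),
    ("GradientBoost", GradientBoostingRegressor(random_state=20)),
])

# The scaler is fitted inside the pipeline, on the training split only
pipeline = make_pipeline(StandardScaler(), voter)
pipeline.fit(X_tr, y_tr)

preds = pipeline.predict(X_te)
test_r2 = pipeline.score(X_te, y_te)   # R2 on the held-out split
print(f"test R2: {test_r2:.3f}")
```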

### Future scope: 
An app could be built to serve such predictions.