**Forecasting insurance price for customers using Regression techniques**

Click here for data source - [Data source](https://www.kaggle.com/mirichoi0218/insurance)

The dataset contains medical insurance charges for 1338 beneficiaries. The objective is to predict the charges for a customer based on the information available about them. **The feature set is as follows:**

* age: age of primary beneficiary
* sex: gender of the insurance contractor (female or male)
* bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9
* children: Number of children covered by health insurance / Number of dependents
* smoker: Smoking status (whether smokes or not)
* region: the beneficiary's residential area in the US (northeast, southeast, southwest or northwest)
* charges: Individual medical costs billed by health insurance
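
As a quick illustration of the bmi definition above, the sketch below computes the index from a weight and a height. The function name `body_mass_index` and the sample person are made up for illustration; they are not part of the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return weight divided by height squared (kg / m^2)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg, 1.75 m tall
bmi = body_mass_index(70.0, 1.75)

# Check against the "ideal" range quoted in the feature description
in_ideal_range = 18.5 <= bmi <= 24.9
print(round(bmi, 2), in_ideal_range)
```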

Predicting the charges requires regression algorithms such as Random Forest Regressor and Linear Regression. Before building a model, a few steps are needed to bring the data into a format the model can understand and use. To understand the type of data we are dealing with, we also need to study its features and run a statistical analysis of the data.

**Some of the steps required are -**
* Data description - to view and understand what the data looks like and what features exist, along with their data types and the values they hold.
* Target variable - the most important aspect of the data. Charges is our target (to predict), and we look at how it is distributed in the data.
* Data cleaning and pre-processing - finding and handling missing values, checking for valid column names and valid entries in those columns, converting column data types into formats the model accepts, and dealing with categorical variables (by generating dummy variables or by replacing existing features with binary values).
* Data visualization - to uncover hidden insights in the data. For example, smokers are charged more than non-smokers.
**Visualization also helps identify which features are associated with changes in the target variable. This association is called feature correlation.**
* Prepare data, model generation and testing -
This is the part where machine learning comes into the picture. The data is divided into training and testing sets. Models are produced by learning from the training data and, finally, their performance is evaluated on the testing (unseen) data.
A good model is capable of accurately predicting the target for unseen instances.
A poor model may be the result of excessive parameter tuning (adjusting parameters to perform well specifically on the training data), over-fitting (the model learns the training data too closely and does not generalize to new/unseen feature values) or the structure of the data itself (extremely noisy, messy, highly uncorrelated, unevenly distributed, etc.).
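
The train/test evaluation and the over-fitting risk described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's model: the data is randomly generated, the `_toy` names are hypothetical, and a decision tree stands in for any flexible learner.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a linear signal plus substantial noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 5.0, size=300)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

models = {
    # No depth limit: the tree can memorize every training row, noise included
    "deep": DecisionTreeRegressor(random_state=0),
    # A depth limit forces the tree to learn only the coarse trend
    "shallow": DecisionTreeRegressor(max_depth=2, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_toy_train, y_toy_train)
    scores[name] = (
        r2_score(y_toy_train, model.predict(X_toy_train)),  # seen data
        r2_score(y_toy_test, model.predict(X_toy_test)),    # unseen data
    )
    print(f"{name}: train R2 = {scores[name][0]:.3f}, test R2 = {scores[name][1]:.3f}")
```

A large gap between the train and test scores, as the unconstrained tree shows here, is the usual symptom of over-fitting.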

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Read the data
data = pd.read_csv("../input/insurance.csv")

# See what the top 5 rows of the data look like.
data.head()

In [None]:
# See what the bottom 5 rows look like.
data.tail()

In [None]:
# Generate statistical summary of the data's numerical features
data.describe()

**Information from the above stats -**

The average customer age is about 39 years, with a maximum of 64 years. Customers have one child on average, with a minimum of none and a maximum of 5.
75% of observations have an age of 51 years or less and 2 children or fewer.
The average insurance charge is 13270.42 units, and 75% of observations are at or below 16639.91 units.
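
To make the meaning of the 75% row concrete, here is a small sketch on a made-up series of ages (the values are hypothetical, not taken from the dataset). It shows that the 75% figure reported by `describe()` is the value at or below which 75% of the observations fall.

```python
import pandas as pd

# Hypothetical ages, used only to illustrate how the 75% row is read
ages = pd.Series([18, 25, 33, 39, 47, 51, 58, 64])

# The 75th percentile (third quartile), linearly interpolated by default
q75 = ages.quantile(0.75)

# Share of observations at or below that value
share_at_or_below = (ages <= q75).mean()

# describe() reports the same number in its '75%' row
q75_from_describe = ages.describe()["75%"]

print(q75, share_at_or_below, q75_from_describe)
```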

In [None]:
# View all column names and their respective data types
data.info()

In [None]:
# Check for missing values
print(data.isnull().sum())

# All zeros show that there are no missing values

In [None]:
#-------------------- DATA VISUALIZATION -------------------------
# Visualize distribution of values for target variable - 'charges'
plt.figure(figsize=(6,6))
plt.hist(data.charges, bins = 'auto', color = 'purple')
plt.xlabel("charges ->")
plt.title("Distribution of charges values :")

**What we know about target variable?**
* It is unevenly distributed.
* Most beneficiaries are charged between 1,000 and 10,000 units.
* Very few are charged above 50,000.
* We already know from the statistical summary above that the mean is 13270.42, which is close to the lower end of the target range. Most of the data therefore sits on the left of the distribution, with a long tail to the right (a right-skewed distribution).
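
The skew described above can also be checked numerically. The sketch below uses a randomly generated log-normal sample as a stand-in for a right-skewed variable; the sample and its parameters are assumptions chosen for illustration, not the real charges.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), standing in for 'charges'
rng = np.random.default_rng(42)
sample = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=2000))

sample_mean = sample.mean()
sample_median = sample.median()
sample_skew = sample.skew()  # positive values indicate a longer right tail

print("mean:  ", round(sample_mean, 2))
print("median:", round(sample_median, 2))
print("skew:  ", round(sample_skew, 2))
```

For a right-skewed variable the mean is pulled above the median and the skewness is positive. On the notebook's data, the same check would be `data.charges.mean()`, `data.charges.median()` and `data.charges.skew()`.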

In [None]:
# Generate Box-plots to check for outliers and relation of each feature with 'charges'
cols = ['age', 'children', 'sex', 'smoker', 'region']
for col in cols:
    plt.figure(figsize=(8,8))
    sns.boxplot(x = data[col], y = data['charges'])

**Insights from boxplots generated above -**
* As **age** increases, insurance cost increases. The plots show an increasing trend in charges (with small ranges of charges for some ages), starting from around 1,000 for ages 18-19 and rising to about 10,000 for customers near age 60.
    - This may be due to the general medical assumption that younger people are fitter or have a more robust immune system.
    - Another reason could be the types of medical conditions covered by the insurance. If the insurance is designed to cover conditions likely to develop with growing age, charges will be higher for older age groups.
* **Customers with 2 children** are charged the most compared to the others. Those with 5 children are charged less. This may be due to the dominance of the groups with 2 or 3 children in the overall population.
* Being **male or female** has little impact on cost, although the range for males is larger than for females. That is, in several cases males are charged more than the maximum charged to females.
* The plot shows a clear pattern of high charges for beneficiaries who are **smokers** and considerably lower costs for **non-smokers**.
* **Region** does not show much correlation with charges, though the south-east region has a larger range, up to about 20,000, in its distribution of customer charges. This could be due to higher medical costs in the region, known environmental/physical hazards, or because it is a well-developed area with a higher cost of living.
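
The visual comparisons above can be backed with per-group numbers. The sketch below uses a tiny made-up frame (the `toy` name and all its values are hypothetical) to show the pattern; the same `groupby` call could be applied to the notebook's `data` for any of the categorical columns.

```python
import pandas as pd

# Made-up rows that only mimic the shape of the smoker/charges columns
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [32000.0, 4200.0, 8100.0, 41000.0, 6500.0, 27500.0],
})

# Per-group summary: the numbers a boxplot draws, in table form
summary = toy.groupby("smoker")["charges"].agg(["count", "median", "min", "max"])
print(summary)
```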



In [None]:
# Converting categorical features' string values to int
# Updating directly to binary because only two values exist
data.smoker = [1 if x == 'yes' else 0 for x in data.smoker]
data.sex = [1 if x == 'male' else 0 for x in data.sex]

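# Use pandas because 'region' has more than two values: expand it into
# dummy (indicator) columns. A frame of dummies cannot be assigned back to
# a single column, so the 'region' column is replaced by the new columns.
# drop_first=True drops one level so the dummies are not perfectly
# collinear once a constant is added for OLS further below.
data = pd.get_dummies(data, columns=['region'], drop_first=True, dtype=int)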
data.charges = pd.to_numeric(data.charges)
data.columns.values

In [None]:
# Create Correlation matrix for all features of data.
data.corr()

In [None]:
# Generate heatmap to visualize strong & weak correlations.
sns.heatmap(data.corr())

* The heatmap above shows that the **highest correlation is between charges and whether the customer is a smoker**, and the **lowest correlation is between region and charges**.

Since there are only a few features, it is feasible to generate pairplots for all of them. Otherwise, we would have generated pairplots only for features with a high positive or negative correlation with the target variable.
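
For a wider dataset, the selection described above could be sketched as follows. The frame, the column names and the 0.3 cut-off are all assumptions made for illustration; nothing here comes from the insurance data.

```python
import numpy as np
import pandas as pd

# Synthetic frame (assumption): one informative feature, two near-useless ones
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["target"] = 5.0 * toy["strong"] + 0.1 * toy["weak"] + rng.normal(size=n)

# Correlation of every feature with the target (the target itself is dropped)
corr_with_target = toy.corr()["target"].drop("target")

# Keep features whose absolute correlation clears an arbitrary threshold
threshold = 0.3
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(selected)
```

The resulting list could then be passed to a plot, for example `sns.pairplot(toy[selected + ["target"]])`.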

In [None]:
# Generate pairplots for all features because there are only a few.
sns.pairplot(data)

In [None]:
#------------------- Prepare data for predictive regression models ----------------------------
y = data.charges.values
X = data.drop(['charges'], axis = 1)   # Drop the target variable

In [None]:
# import scikit learn's built-in Machine learning libraries and functions
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import statsmodels.api as sm

# Split using 20% for testing and 80% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = None)

# ----------------- PREDICTIVE MODELLING (Call the models to be used) -----------------------
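# max_features = 1.0 considers all features at each split, which is what
# 'auto' meant for regressors before it was removed in recent scikit-learn.
rf_reg = RandomForestRegressor(max_features = 1.0, bootstrap = True, random_state = None)
# The 'normalize' argument was removed from LinearRegression in scikit-learn 1.2.
lin_reg = LinearRegression()
ada_reg = AdaBoostRegressor()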

# R2-score is used here as a metric. Any other metric could be used instead by just importing 
# it from sklearn

# Predict using Random Forest Regressor.
rf_reg.fit(X_train, y_train)
predtrainRF = rf_reg.predict(X_train)     # Prediction for train data
predtestRF = rf_reg.predict(X_test)       # Prediction for test data

# Compute R-squared score for both train and test data.
print("R2-score on train data:", r2_score(y_train,predtrainRF))
print("R2-score on test data:", r2_score(y_test, predtestRF))

# Predict using Linear Regression
lin_reg.fit(X_train, y_train)
predtrainL = lin_reg.predict(X_train)
predtestL = lin_reg.predict(X_test)
print("R2-score on train data:",r2_score(y_train, predtrainL))
print("R2-score on test data:",r2_score(y_test, predtestL))

# Predict using AdaBoost Regressor
ada_reg.fit(X_train, y_train)
predtrainAda = ada_reg.predict(X_train)
predtestAda = ada_reg.predict(X_test)
print("R2-score on train data:",r2_score(y_train, predtrainAda))
print("R2-score on test data:",r2_score(y_test, predtestAda))

# ----------------- Using Ordinary Least Squares from statsmodels ------------------------------
# -------- Allows viewing full summary statistics along with p-values and F-statistic ----------
# On Train data.
X_newtrain = sm.add_constant(X_train)
ols_train = sm.OLS(y_train, X_newtrain)
ols_train_new = ols_train.fit()
print(ols_train_new.summary())

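# On Test data. Note: this fits a separate OLS model on the test split
# rather than scoring the model fitted on the training split.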
X_newtest = sm.add_constant(X_test)
ols_test = sm.OLS(y_test, X_newtest)
ols_test_new = ols_test.fit()
print(ols_test_new.summary())   # Produce full statistical summary 

plt.show()
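
As the comment on metrics above notes, other scores can be imported from scikit-learn in the same way as `r2_score`. The sketch below compares three of them on made-up numbers; the `y_true` and `y_pred` arrays are hypothetical and are not outputs of the models in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true charges and predictions, for illustration only
y_true = np.array([3000.0, 12000.0, 8500.0, 40000.0])
y_pred = np.array([3500.0, 11000.0, 9000.0, 36000.0])

# Mean absolute error: average size of the errors, in the target's units
mae = mean_absolute_error(y_true, y_pred)

# Root mean squared error: penalizes large errors more heavily
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R-squared: share of the target's variance explained by the predictions
r2 = r2_score(y_true, y_pred)

print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, R2 = {r2:.3f}")
```

MAE and RMSE are expressed in the same units as the charges, which can make them easier to interpret than R-squared alone.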

Please feel free to provide your comments/suggestions below, and please do **upvote** if you liked this work.

Thank you and happy learning!