# Medical Cost Prediction with Multiple Regression Models

#### Content

- age: age of primary beneficiary

- sex: insurance contractor gender, female, male

- bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9

- children: Number of children covered by health insurance / Number of dependents

- smoker: whether the beneficiary smokes (yes / no)

- region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

- charges: Individual medical costs billed by health insurance
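
As a quick illustration of the `bmi` definition above, here is a minimal sketch. The `bmi` helper and the example weight and height are made up for illustration and are not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg, 1.75 m tall
example_bmi = bmi(70.0, 1.75)
in_ideal_range = 18.5 <= example_bmi <= 24.9

print(round(example_bmi, 2), in_ideal_range)
```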


## Outline

- Data Observation


- Exploratory Data Analysis
    - Univariate Analysis
    - Multivariate Analysis


- Data Preprocessing
    - Encoding
    - Preprocessing
    
    
- Modeling and Evaluation Metrics
    - Linear Regression
    - XGBoost Regression
    - LGBMRegressor
    - RandomForestRegressor
    - GradientBoostingRegressor


- Summary
    - Actual vs Predicted

# Data Observation

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('../input/insurance/insurance.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
numerical_df = df.select_dtypes(exclude = 'object')
categorical_df = df.select_dtypes(include = 'object')

# Exploratory Data Analysis

In [None]:
xdf = df.copy()

## Missing Values

In [None]:
xdf.isnull().sum()

No missing values.

## Univariate Analysis

In [None]:
# Let's start with the target attribute

xdf['charges'].describe()

In [None]:
# Let's check its distribution

sns.displot(x = 'charges', data = xdf, aspect = 2, height = 6, kde = True);

As we can see, it is not normally distributed: it is positively skewed and contains a number of outliers.
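
The visual impression can be backed with numbers. The sketch below uses a synthetic log-normal series as a stand-in for `charges` (an assumption, not the real data) and reports the sample skewness together with the number of points above the usual 1.5 × IQR fence.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for `charges`; NOT the insurance data
charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=1000))

# Skewness > 0 indicates a long right tail
skewness = charges.skew()

# Count points above the upper fence q3 + 1.5 * IQR
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
n_upper_outliers = int((charges > q3 + 1.5 * iqr).sum())

print("skewness:", round(skewness, 2))
print("points above the upper fence:", n_upper_outliers)
```

On the real data, the same two lines can be applied to `xdf['charges']` directly.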

In [None]:
# Boxplot of target attribute

plt.figure(figsize = (10,6))
sns.boxplot(x = 'charges', data = xdf);

In [None]:
# Function to remove outliers 

def remove_outliers(dataset, columns):
    # q1 and q3 are the first and third quartiles, so iqr is the interquartile range
    q1 = dataset[columns].quantile(0.25)
    q3 = dataset[columns].quantile(0.75)
    
    iqr = q3 - q1
    lower = q1 - (1.5 * iqr)
    upper = q3 + (1.5 * iqr)
    
    dataset = dataset[(dataset[columns] > lower) & (dataset[columns] < upper)]
    
    return dataset

In [None]:
# Function to remove 'positive skewness'

def log_transformation(dataset):
    return np.log1p(dataset)

The dataset is small and we don't want to lose any data, so instead of removing the outliers we will compress them with a log transformation.
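
To see why the log transformation helps, here is a small sketch on made-up values (not taken from the dataset). `np.log1p` computes `log(1 + x)`, which shrinks the gap between small and large values, and `np.expm1` undoes it exactly, so no information is lost.

```python
import numpy as np

# Made-up charges with one very large value
values = np.array([1_000.0, 5_000.0, 10_000.0, 60_000.0])

logged = np.log1p(values)    # log(1 + x)
restored = np.expm1(logged)  # exp(x) - 1, the inverse of log1p

# Ratio between the largest and smallest value, before and after
ratio_raw = values.max() / values.min()
ratio_log = logged.max() / logged.min()

print("raw ratio:", ratio_raw)
print("log ratio:", round(ratio_log, 2))
print("round trip ok:", np.allclose(restored, values))
```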

In [None]:
# Transforming target attribute with Log.

xdf['charges'] = log_transformation(xdf['charges'])

In [None]:
# After fixing the target attribute

sns.displot(x = 'charges', data = xdf, aspect = 2, height = 6, kde = True);

## Numerical Data Analysis (Uni & Multivariate)

In [None]:
numerical_df.columns

### Age

In [None]:
xdf['age'].describe()

In [None]:
# Let's see the distribution

sns.displot(x = 'age', data = xdf, aspect = 2, height = 6, kde = True);

In [None]:
# Visualization for all ages

plt.figure(figsize = (15,6))
sns.violinplot(x = 'age', y = 'charges', data = xdf);

Let's group the ages into ranges to get more insight.

In [None]:
bins = [17,25,35,45,55,65]
# pd.cut intervals are right-closed: (17, 25], (25, 35], ..., (55, 65]
labels = ['18-25','26-35','36-45','46-55','56-65']

xdf['age_range'] = pd.cut(xdf['age'], bins = bins, labels = labels)
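
For readers unfamiliar with `pd.cut`: by default each bin excludes its left edge and includes its right edge. The sketch below uses a few made-up ages with the same bin edges to show which interval each age falls into.

```python
import pandas as pd

# Made-up ages, chosen to sit on or near the bin edges
ages = pd.Series([18, 25, 26, 45, 64])
bins = [17, 25, 35, 45, 55, 65]

# Without labels, pd.cut returns the intervals themselves
intervals = pd.cut(ages, bins=bins)

# Position of the interval each age was assigned to (0 = first bin)
bin_index = intervals.cat.codes.tolist()

print(intervals.tolist())
print(bin_index)
```

Ages 18 and 25 share the first bin, while 26 falls into the second one.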

In [None]:
# Charges of age_range

plt.figure(figsize = (15,6))
sns.violinplot(x = 'age_range', y = 'charges', data = xdf);

In [None]:
# smoker vs non-smoker

plt.figure(figsize = (15,6))
sns.violinplot(x = 'age_range', y = 'charges', hue = 'smoker',data = xdf);

### bmi

In [None]:
xdf['bmi'].describe()

In [None]:
# Plotting the distribution

sns.displot(x = 'bmi', data = xdf, aspect = 2, height = 6, kde = True);

In [None]:
# Boxplot visualization

plt.figure(figsize = (10,6))
sns.boxplot(x = 'bmi', data = xdf);

In [None]:
# Scatterplot 

plt.figure(figsize = (10,6))
sns.scatterplot(x = 'bmi', y = 'charges', data = xdf);

As we can see, some outliers are present in the data. We follow the same approach as before: instead of removing the <b> outliers </b>, we will squeeze them with <b> log </b>.

In [None]:
# Let's fix the outliers with 'log_transformation'

xdf['bmi'] = log_transformation(xdf['bmi'])

In [None]:
# After the log transformation

plt.figure(figsize = (10,6))
sns.scatterplot(x = 'bmi', y = 'charges', data = xdf);

### children

In [None]:
xdf['children'].describe()

Although stored as numbers, this is effectively categorical data with a natural order.

In [None]:
# Countplot

plt.figure(figsize = (8,6))
sns.countplot(data = df, x = 'children');

In [None]:
# Boxplot Visualization

plt.figure(figsize = (8,6))
sns.boxplot(data = xdf, x = 'children', y = 'charges');

In [None]:
#  Children and the 'charges'

plt.figure(figsize = (8,6))
sns.violinplot(data = xdf, x = 'children', y = 'charges');

In [None]:
# The charge of Non-smoking vs Smoking with Children 

plt.figure(figsize = (8,6))
sns.violinplot(data = xdf, x = 'children', y = 'charges', hue = 'smoker');

<b> Observations </b> (charges are on the log scale):
- Person with 0 children: non-smoker charges range from about 6 to 10.8 (max); smoker charges range from 9.2 to 11.5 (max).
- Person with 1 child: non-smoker charges range from about 7 to 10.9 (max); smoker charges range from 9.3 to 11.4 (max), and so on.

<b> (In short) </b>:
- The more children, the higher the minimum charge tends to be.
- Don't be misled by the 0-children group: its minimum charge is lower, but depending on the health condition the cost can still be high (up to 11.5).
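
The ranges read off the violin plots can also be computed directly. The sketch below uses a tiny made-up table (the values are illustrative, not from the dataset) and reports the minimum and maximum log charge per `children` and `smoker` group.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the column layout of the dataset
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 1, 1],
    "smoker": ["no", "no", "yes", "yes", "no", "no", "yes", "yes"],
    "charges": [1200.0, 9000.0, 15000.0, 40000.0,
                1800.0, 11000.0, 17000.0, 39000.0],
})
toy["log_charges"] = np.log1p(toy["charges"])

# Minimum and maximum log charge for every (children, smoker) combination
summary = toy.groupby(["children", "smoker"])["log_charges"].agg(["min", "max"])

print(summary.round(2))
```

With the real data, replacing `toy` by `xdf` (where `charges` is already log-transformed) gives the exact figures behind the observations.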


## Categorical Data Analysis

In [None]:
categorical_df.columns

### Sex

In [None]:
# what's the count?

plt.figure(figsize = (8,6))
sns.countplot(x = 'sex', data = xdf);

`Female` and `Male` have a nearly equal number of records.

In [None]:
# in terms of charges

plt.figure(figsize = (8,6))
sns.violinplot(x = 'sex', y = 'charges', data = xdf);

In [None]:
# Female and Male 'Smoker' vs Female and Male 'Non-Smoker'

plt.figure(figsize = (8,6))
sns.violinplot(x = 'sex', y = 'charges', hue = 'smoker', data = xdf);

The minimum charge for `Female` is slightly higher than for `Male`. The number of <b> Female </b> and <b> Male </b> records is nearly equal, as the count plot above shows.

### Smoker

In [None]:
# countplot

plt.figure(figsize = (8,6))
sns.countplot(x = 'smoker', data = xdf);

In [None]:
# in terms of charges

plt.figure(figsize = (8,6))
sns.violinplot(x = 'smoker', y = 'charges', data = xdf);

Watch out, smokers! Charges are clearly higher for smokers.

### region

In [None]:
# countplot

plt.figure(figsize = (8,6))
sns.countplot(x = 'region', data = xdf);

Residential area in the US

In [None]:
# Does it have any impact on 'charges'?

plt.figure(figsize = (8,6))
sns.violinplot(y = 'charges', x = 'region', data = xdf);

In [None]:
# Charges by region for smokers vs non-smokers

plt.figure(figsize = (8,6))
sns.violinplot(y = 'charges', x = 'region', hue = 'smoker',data = xdf);

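The minimum charge in <b> northwest </b> and <b> northeast </b> is higher than in <b> southwest </b> and <b> southeast</b>. The <b> northwest </b> and <b> northeast </b> also appear to contain more smokers, although a violin plot shows distributions rather than counts, so this should be confirmed by counting smokers per region.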
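
Counting smokers per region is easier with a table than with a violin plot. Here is a minimal sketch using `pd.crosstab` on a few made-up rows (not the real dataset).

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    "region": ["northeast", "northeast", "southeast",
               "southeast", "southeast", "northwest"],
    "smoker": ["yes", "no", "yes", "yes", "no", "no"],
})

# Rows: region, columns: smoker, cells: number of records
counts = pd.crosstab(toy["region"], toy["smoker"])

print(counts)
```

On the real data, `pd.crosstab(xdf['region'], xdf['smoker'])` gives the actual counts.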

Let's take a quick look at each feature and its correlation with the <b> target attribute</b>. Since this requires encoding the categorical features, we work on a copy so that the original dataset is not modified.

In [None]:
cdf = xdf.copy()

In [None]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

In [None]:
cdf['sex'] = label_encoder.fit_transform(cdf['sex'])
cdf['region'] = label_encoder.fit_transform(cdf['region'])
cdf['smoker'] = label_encoder.fit_transform(cdf['smoker'])

In [None]:
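# 'age_range' is categorical, so exclude it before computing correlations
corr = cdf.drop('age_range', axis = 1).corr()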
f,ax = plt.subplots(figsize = (10,10))
sns.heatmap(corr, vmax = .8, annot = True);

# Data Preprocessing

Dropping <b> region </b> because its correlation with the target is weak (slightly negative), and <b> age_range </b> because it is derived from age.
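
One caveat when reading the heatmap: for a nominal feature such as region, the correlation of its integer codes depends on the arbitrary order in which the codes are assigned. The sketch below uses made-up data with two hypothetical orderings of the same three categories to show how much the coefficient can move.

```python
import pandas as pd

# Made-up data: category "b" has the highest charges
toy = pd.DataFrame({
    "region": ["a", "b", "c", "a", "b", "c"],
    "charges": [1.0, 5.0, 2.0, 1.5, 5.5, 2.5],
})

# Two equally valid integer encodings of the same categories
order_1 = {"a": 0, "b": 1, "c": 2}
order_2 = {"a": 0, "c": 1, "b": 2}

corr_1 = toy["region"].map(order_1).corr(toy["charges"])
corr_2 = toy["region"].map(order_2).corr(toy["charges"])

print(round(corr_1, 3), round(corr_2, 3))
```

A low or negative coefficient for a label-encoded nominal column is therefore only weak evidence; comparing group means or one-hot encoding the column is a more reliable check.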

In [None]:
X = xdf.drop(['charges','region','age_range'], axis = 1)
y = xdf['charges']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

In [None]:
X_train

### Preprocessing and Encoding

In [None]:
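ct = ColumnTransformer([
                ('scaler', StandardScaler(),['age','bmi','children']),
                # OneHotEncoder's `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2,
                # so a dense result is requested through sparse_threshold = 0 instead
                ('one-hot-encoder', OneHotEncoder(),['sex','smoker'])
], remainder = 'drop', sparse_threshold = 0)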
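
To make the effect of the transformer concrete, here is a sketch that applies an equivalent `ColumnTransformer` to four made-up rows. The name `toy_ct` and the rows are illustrative; `sparse_threshold=0` is used so the result is a dense array regardless of the scikit-learn version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Four made-up rows with the same columns as the dataset's features
toy = pd.DataFrame({
    "age": [19, 33, 52, 61],
    "bmi": [27.9, 22.7, 30.8, 36.4],
    "children": [0, 1, 2, 0],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northwest", "southeast", "northeast"],
})

toy_ct = ColumnTransformer(
    [
        ("scaler", StandardScaler(), ["age", "bmi", "children"]),
        ("one-hot-encoder", OneHotEncoder(), ["sex", "smoker"]),
    ],
    remainder="drop",     # 'region' is not listed, so it is dropped
    sparse_threshold=0,   # always return a dense NumPy array
)

transformed = toy_ct.fit_transform(toy)
print(transformed.shape)
```

The result has 7 columns: 3 scaled numeric columns followed by 2 one-hot columns each for `sex` and `smoker`.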

# Modelling and Evaluation Metrics

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import math

In [None]:
models = pd.DataFrame(columns = ['R2_score','MSE','RMSE','MAE'])
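
The same four metrics are computed for every model, so a small helper can reduce copy-paste mistakes. The `regression_metrics` function below is a hypothetical convenience wrapper, not part of scikit-learn; it returns the values in the column order of the `models` table.

```python
import math

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return [R2, MSE, RMSE, MAE], matching the columns of `models`."""
    mse = mean_squared_error(y_true, y_pred)
    return [
        r2_score(y_true, y_pred),
        mse,
        math.sqrt(mse),  # RMSE is the square root of MSE
        mean_absolute_error(y_true, y_pred),
    ]


# Tiny made-up example: one prediction is off by 1
example = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(example)
```

A row could then be filled with, for example, `models.loc["LinearRegression"] = regression_metrics(y_test, lr_pred)`.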

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr_pipe = Pipeline([
            ('column-transformer', ct),
            ('LinearRegression', lr)
])


In [None]:
lr_pipe.fit(X_train, y_train)

In [None]:
lr_pred = lr_pipe.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, lr_pred))
print("Mean Squared Error: %.3f" % mean_squared_error(y_test, lr_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, lr_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,lr_pred)))

In [None]:
r2 = r2_score(y_test, lr_pred)
mse = mean_squared_error(y_test, lr_pred)
rmse = math.sqrt(mean_squared_error(y_test, lr_pred))
mae = mean_absolute_error(y_test,lr_pred)

In [None]:
models.loc["LinearRegression"] = [r2, mse, rmse, mae]

In [None]:
models

## XGBoost Regressor

In [None]:
from xgboost import XGBRegressor

In [None]:
xgb = XGBRegressor(n_estimators = 100, learning_rate = 0.05)
xgb_pipe = Pipeline([
            ('column-transformer', ct),
            ('XGBRegression', xgb)
])

In [None]:
xgb_pipe.fit(X_train, y_train)

In [None]:
xgb_pred = xgb_pipe.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, xgb_pred))
print("Mean Squared Error: %.3f " % mean_squared_error(y_test, xgb_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, xgb_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,xgb_pred)))

In [None]:
r2 = r2_score(y_test, xgb_pred)
mse = mean_squared_error(y_test, xgb_pred)
rmse = math.sqrt(mean_squared_error(y_test, xgb_pred))
mae = mean_absolute_error(y_test,xgb_pred)

In [None]:
models.loc["XGBoostRegressor"] = [r2, mse, rmse, mae]

In [None]:
models

## LGBMRegressor

In [None]:
from lightgbm import LGBMRegressor

In [None]:
lgbm = LGBMRegressor()

In [None]:
lgbm_pipe = Pipeline([
            ('column-transformer', ct),
            ('LGBMRegressor', lgbm)
])

In [None]:
lgbm_pipe.fit(X_train, y_train)

In [None]:
lgbm_pred = lgbm_pipe.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, lgbm_pred))
print("Mean Squared Error:", mean_squared_error(y_test, lgbm_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, lgbm_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,lgbm_pred)))

In [None]:
r2 = r2_score(y_test, lgbm_pred)
mse = mean_squared_error(y_test, lgbm_pred)
rmse = math.sqrt(mean_squared_error(y_test, lgbm_pred))
mae = mean_absolute_error(y_test,lgbm_pred)

In [None]:
models.loc["LGBMRegressor"] = [r2, mse, rmse, mae]

## RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(n_estimators = 50, max_depth = 15, random_state = 42, min_samples_leaf = 10)

In [None]:
rf_pipe = Pipeline([
            ('column-transformer', ct),
            ('RandomForestRegressor', rf)
])

In [None]:
rf_pipe.fit(X_train, y_train)

In [None]:
rf_pred = rf_pipe.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, rf_pred))
print("Mean Squared Error:", mean_squared_error(y_test, rf_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, rf_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,rf_pred)))

In [None]:
r2 = r2_score(y_test, rf_pred)
mse = mean_squared_error(y_test, rf_pred)
rmse = math.sqrt(mean_squared_error(y_test, rf_pred))
mae = mean_absolute_error(y_test,rf_pred)

In [None]:
models.loc["RandomForestRegressor"] = [r2, mse, rmse, mae]

## GradientBoostingRegressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gbr = GradientBoostingRegressor()

In [None]:
gbr_pipe = Pipeline([
            ('column-transformer', ct),
            ('GradientBoostingRegressor', gbr)
])

In [None]:
gbr_pipe.fit(X_train, y_train)

In [None]:
gbr_pred = gbr_pipe.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, gbr_pred))
print("Mean Squared Error:", mean_squared_error(y_test, gbr_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, gbr_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,gbr_pred)))

In [None]:
r2 = r2_score(y_test, gbr_pred)
mse = mean_squared_error(y_test, gbr_pred)
rmse = math.sqrt(mean_squared_error(y_test, gbr_pred))
mae = mean_absolute_error(y_test,gbr_pred)

In [None]:
models.loc["GradientBoostingRegressor"] = [r2, mse, rmse, mae]

# Summary

In [None]:
models

In [None]:
train_pred = gbr_pipe.predict(X_train)

In [None]:
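fig, ax = plt.subplots()

ax.scatter(gbr_pred, y_test)

# Reference line y = x in data coordinates: points on it are perfect predictions
lims = [min(gbr_pred.min(), y_test.min()), max(gbr_pred.max(), y_test.max())]
ax.plot(lims, lims)

ax.set_xlabel("Predicted Values")
ax.set_ylabel("Actual values")
plt.show()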

### Actual vs Predicted 

In [None]:
actual = pd.DataFrame(data = y_test.values, columns = ['Actual'])
predicted = pd.DataFrame(data = gbr_pred, columns = ['Predicted'])

final = pd.concat([actual, predicted], axis = 1)
final
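
Because the target was transformed with `np.log1p`, the values in this table are on the log scale. A sketch of converting them back to the original currency units with `np.expm1` follows; the numbers are made up for illustration and `final_dollars` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up log-scale values standing in for y_test and gbr_pred
log_actual = np.log1p(np.array([2000.0, 12000.0, 40000.0]))
log_predicted = np.log1p(np.array([2500.0, 11000.0, 35000.0]))

# np.expm1 is the inverse of np.log1p
final_dollars = pd.DataFrame({
    "Actual": np.expm1(log_actual),
    "Predicted": np.expm1(log_predicted),
})
final_dollars["Error"] = final_dollars["Predicted"] - final_dollars["Actual"]

print(final_dollars.round(2))
```

With the real results, `np.expm1(y_test.values)` and `np.expm1(gbr_pred)` would take the place of the made-up arrays.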