# Introduction

The task is to accurately predict insurance costs based on the following variables:

* **age:** age of the primary beneficiary
* **sex:** gender of the insurance contractor (female or male)
* **bmi:** body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9
* **children:** number of children covered by health insurance (number of dependents)
* **smoker:** whether the beneficiary smokes (yes or no)
* **region:** the beneficiary's residential area in the US (northeast, southeast, southwest or northwest)
* **charges:** individual medical costs billed by health insurance
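
As a small illustration of the **bmi** definition, the sketch below computes BMI as weight in kilograms divided by height in metres squared and checks it against the 18.5 to 24.9 range. The function names and the sample person are hypothetical and are not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI falls inside the ideal range quoted above."""
    return low <= value <= high


# Hypothetical person: 70 kg, 1.75 m tall
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 2), in_ideal_range(example_bmi))
```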

Begin by reading the CSV file.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

insurance_df = pd.read_csv('/kaggle/input/insurance/insurance.csv')

In [None]:
insurance_df.info()

Show the first 5 rows of the dataset.

In [None]:
insurance_df.head()

Check for any missing values.

In [None]:
insurance_df.isnull().sum()

# Exploratory Data Analysis

# Description of variables

Sex

In [None]:
plt.figure(figsize=(5,5))
plt.grid()
sns.countplot(x='sex', data= insurance_df)
plt.title("Sex", fontsize=15)
plt.show()

Smoker

In [None]:
plt.figure(figsize=(5,5))
plt.grid()
sns.countplot(x='smoker', data= insurance_df)
plt.title("Smoker", fontsize=15)
plt.show()

Age

In [None]:
plt.figure(figsize=(5,5))
plt.grid()
sns.histplot(insurance_df['age'], kde=True, stat='density')  # histplot replaces the deprecated distplot
plt.title("Age", fontsize=15)
plt.show()

print('The maximum age is {}'.format(insurance_df['age'].max()))
print('The minimum age is {}'.format(insurance_df['age'].min()))
print('The average age is {}'.format(insurance_df['age'].mean()))
# mode() returns a Series, so take its first element before converting to int
print('With an exceptionally high population at age {}'.format(int(insurance_df['age'].mode()[0])))

Region

In [None]:
plt.figure(figsize=(5,5))
plt.grid()
sns.countplot(x='region', data= insurance_df)
plt.title("Region", fontsize=15)
plt.show()

Children

In [None]:
plt.figure(figsize=(5,5))
plt.grid()
sns.countplot(x='children', data= insurance_df)
plt.title("Children", fontsize=15)
plt.show()

BMI

In [None]:
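# histogram with a kernel density estimate
plt.figure(figsize=(5,5))
plt.grid()
sns.histplot(insurance_df['bmi'], kde=True, stat='density')  # histplot replaces the deprecated distplot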
plt.title("bmi", fontsize=15)
plt.show()

In [None]:
print("the max bmi is: ",insurance_df['bmi'].max())
print("the min bmi is: ",insurance_df['bmi'].min())
print("the average bmi is: ",insurance_df['bmi'].mean())

Charges

In [None]:
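# histogram with a kernel density estimate
plt.figure(figsize=(5,5))
plt.grid()
sns.histplot(insurance_df['charges'], kde=True, stat='density')  # histplot replaces the deprecated distplot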
plt.title("charges", fontsize=15)
plt.show()

In [None]:
print("the max charges is: ",insurance_df['charges'].max())
print("the min charges is: ",insurance_df['charges'].min())
print("the average charges is: ",insurance_df['charges'].mean())

# Summary of variables:
1. **sex**: there are approximately equal numbers of male and female policyholders.
2. **smoker**: most policyholders are non-smokers.
3. **age**: policyholders' ages range from 18 to 64, with an exceptionally large group at age 18.
4. **region**: policyholders are roughly evenly distributed among the 4 regions (southeast, southwest, northeast and northwest), with the southeast region having a slightly larger population.
5. **children**: policyholders fall into 6 categories by the number of dependents covered in their policies, from a minimum of 0 to a maximum of 5. The most common case is having no dependents, and the group size tends to decrease as the number of dependents increases.
6. **bmi**: BMI is approximately normally distributed (bell-shaped, with a mean of about 31), with values ranging from 15 to 53.
7. **charges**: charges range from 1122 to 63770 dollars, with a mean of 13270. The distribution is right-skewed with a long tail stretching toward larger values, which means that most policyholders are charged low amounts while a few outliers are charged extremely high amounts.
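
The right skew described for **charges** can be quantified with a skewness statistic. The sketch below is only an illustration: it uses synthetic lognormal values as a stand-in for charges (not the real dataset), and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `charges`: lognormal values, NOT the real dataset
rng = np.random.default_rng(0)
synthetic_charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=5000))

# Positive skewness indicates a long right tail
skew_raw = synthetic_charges.skew()
# Taking the log of a lognormal variable gives a roughly symmetric distribution
skew_log = np.log(synthetic_charges).skew()

print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
# In a right-skewed distribution the mean sits above the median
print(synthetic_charges.mean() > synthetic_charges.median())
```

On the real data, the same check would be `insurance_df['charges'].skew()`.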

---------------------------------------------------------------------------------------

# Charges against other variables

Charges against sex

In [None]:
tmp_df = insurance_df.copy()
tmp_df_female = insurance_df[insurance_df['sex']=='female']
tmp_df_male = insurance_df[insurance_df['sex']=='male']
# plot graph
plt.figure(figsize=(8,6))
ax = sns.boxplot(x="sex", y="charges", data=tmp_df)
plt.title("charges against sex", fontsize=15)
plt.grid()
plt.show()

In [None]:
print("The median medical costs for male is ", tmp_df_male['charges'].median())
print("The median medical costs for female is ", tmp_df_female['charges'].median())

The median medical costs for the two sex groups are roughly the same (9367 for male and 9412 for female). Outliers can be seen in both groups. The only difference is that the male group has a slightly higher third quartile than the female group.

This finding suggests that sex may not be a major factor in determining the medical costs of each policyholder. The boxplot does not hold other factors constant, but it gives no sign that male and female policyholders are treated as carrying different amounts of risk.
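
The median and third quartile read off the boxplot can also be computed per group. This is a minimal sketch on a tiny made-up sample (the rows are hypothetical, not taken from the dataset), so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Tiny made-up sample, NOT rows from insurance.csv
sample = pd.DataFrame({
    "sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "charges": [2000.0, 4000.0, 9000.0, 12000.0, 2500.0, 4500.0, 9500.0, 20000.0],
})

# Median and third quartile (75th percentile) of charges for each group
summary = sample.groupby("sex")["charges"].quantile([0.5, 0.75]).unstack()
summary.columns = ["median", "third_quartile"]
print(summary)
```

Replacing `sample` with `insurance_df` would give the corresponding figures for the full dataset.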

Charges against smoker

In [None]:
tmp_df = insurance_df.copy()
# plot graph
plt.figure(figsize=(8,6))
ax = sns.boxplot(x="smoker", y="charges", data=tmp_df)
plt.title("charges against smoker", fontsize=15)
plt.grid()
plt.show()

Smokers are clearly charged higher medical costs than non-smokers. Most insurance firms regard smoking as a high-risk activity that may cause policyholders' health to deteriorate, and thus higher medical costs are collected to offset the risk.

Charges against region

In [None]:
tmp_df = insurance_df.copy()
# plot graph
plt.figure(figsize=(8,6))
ax = sns.boxplot(x="region", y="charges", data=tmp_df)
plt.title("charges against region", fontsize=15)
plt.grid()
plt.show()

The 4 regions have similar spreads, with the southeast region having a higher third quartile.

This may suggest that where policyholders live does not contribute much to the ratemaking for insurance policies.

Charges against children

In [None]:
tmp_df = insurance_df.copy()
# plot graph
plt.figure(figsize=(8,6))
ax = sns.boxplot(x="children", y="charges", data=tmp_df)
plt.title("charges against children", fontsize=15)
plt.grid()
plt.show()

All categories have a similar spread. It seems that the number of dependents covered in a policy is not a contributing factor to higher medical costs. This may be because medical costs for dependents are charged separately, so the medical costs in the dataset do not rise as the number of dependents increases.

Charges against age

In [None]:
# scatterplot
plt.figure(figsize=(8,6))
sns.scatterplot(data=insurance_df, x="age", y="charges")
plt.title("charges against age", fontsize=15)

# regression line
x = insurance_df['age']
y = insurance_df['charges']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, color='yellow')
plt.grid()
plt.show()

The pattern shows a positive correlation between age and charges (i.e. the older you are, the higher the medical costs are).
We can also observe that the data points are clustered into roughly 3 bands, with the lowest band (charges ranging from 0 to 20,000) showing the tightest relationship between age and charges.
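
How strong a linear relationship is can be expressed with the Pearson correlation coefficient, which is directly related to the slope of the fitted line. The sketch below uses synthetic data (not the insurance dataset); the trend and noise levels are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np

# Synthetic data, NOT the insurance dataset: a linear trend plus noise
rng = np.random.default_rng(42)
age = rng.integers(18, 65, size=500)
charges = 250 * age + rng.normal(0, 6000, size=500)

# Pearson correlation coefficient, read from the 2x2 correlation matrix
r = np.corrcoef(age, charges)[0, 1]

# Slope and intercept of the least-squares line, fitted the same way as in the plot
slope, intercept = np.polyfit(age, charges, 1)

# For a simple linear fit, r equals slope * std(x) / std(y)
r_from_slope = slope * age.std() / charges.std()

print(f"Pearson r = {r:.2f}, slope = {slope:.1f}, r from slope = {r_from_slope:.2f}")
```

On the real data, `np.corrcoef(insurance_df['age'], insurance_df['charges'])[0, 1]` would give the corresponding coefficient.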

Charges against BMI

In [None]:
# scatterplot
plt.figure(figsize=(8,6))
sns.scatterplot(data=insurance_df, x="bmi", y="charges")
plt.title("charges against bmi", fontsize=15)

# regression line
x = insurance_df['bmi']
y = insurance_df['charges']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, color='yellow')
# ideal BMI
plt.vlines(x = 18.5, ymin = 0, ymax=60000,
           color = 'purple')
plt.vlines(x = 24.9, ymin = 0, ymax=60000,
           color = 'purple')
 
plt.grid()
plt.show()

The pattern shows a mild positive correlation between charges and BMI (i.e. the higher the BMI of the policyholder, the higher the charges).

BMI is weight divided by the square of height, and it is often used as an indicator of people's health. The ideal BMI range is 18.5 to 24.9, as indicated by the purple lines in the scatterplot above. In theory, those whose scores fall within the ideal range should be charged lower prices, as they are considered healthier than most.

We can observe that those within the range are charged lower prices (with the maximum at around 30,000). However, even among those whose BMI scores exceed the ideal range, the majority still have low medical costs of around 10,000.

This may be because reliable information about clients' BMI scores is not collected during the ratemaking process, whereas information about clients' lifestyle (e.g. smoking habits) is. Therefore, some policyholders with high BMI scores go unnoticed and are charged low prices.
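
One way to check the comparison between policyholders inside and outside the ideal BMI range is to bin BMI and summarise charges per bin. This is a minimal sketch on made-up rows (not from the dataset); the band labels are hypothetical names chosen for illustration.

```python
import pandas as pd

# Made-up rows for illustration, NOT taken from insurance.csv
sample = pd.DataFrame({
    "bmi": [17.0, 20.5, 24.0, 27.3, 31.8, 36.4, 41.0],
    "charges": [3000.0, 4200.0, 5100.0, 6000.0, 9800.0, 11000.0, 40000.0],
})

# Bin edges follow the 18.5 to 24.9 ideal range; the band labels are hypothetical.
# pd.cut closes intervals on the right by default, so a BMI of exactly 18.5
# lands in "below ideal" and exactly 24.9 lands in "ideal".
bands = pd.cut(
    sample["bmi"],
    bins=[0, 18.5, 24.9, float("inf")],
    labels=["below ideal", "ideal", "above ideal"],
)

# Median charges and row count per band
per_band = sample.groupby(bands, observed=False)["charges"].agg(["median", "count"])
print(per_band)
```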

--------------------------------------------------------------------------------

# Further Investigation

So far, we have observed that smoking is the major contributing factor in determining medical costs, and that there are observable patterns in charges against BMI and age. Let's investigate these factors further by splitting the dataset into smoker and non-smoker categories.

In [None]:
insurance_df_smoker=insurance_df[insurance_df['smoker']=='yes']
insurance_df_nonsmoker=insurance_df[insurance_df['smoker']=='no']
# scatterplot
plt.figure(figsize=(8,6))
sns.scatterplot(data=insurance_df_smoker, x="age", y="charges",color='red',label='smoker')
sns.scatterplot(data=insurance_df_nonsmoker, x="age", y="charges",color='blue',label='non-smoker')
plt.title("charges against age", fontsize=15)
plt.legend()
plt.show()

We can clearly see that all the policyholders in the lowest band are non-smokers, and that smokers occupy higher bands than non-smokers in every age group.

In [None]:
# scatterplot
plt.figure(figsize=(8,6))
sns.scatterplot(data=insurance_df_smoker, x="bmi", y="charges",color='red',label='smoker')
sns.scatterplot(data=insurance_df_nonsmoker, x="bmi", y="charges",color='blue',label='non-smoker')
plt.title("charges against BMI", fontsize=15)
plt.legend()
plt.show()

This observation aligns with my previous insights. Insurance firms are quick to identify smokers but unable to identify those with high BMI scores. Even if you have a high BMI score, medical costs remain low if you do not smoke. However, as insurance firms pay close attention to smokers, those who smoke and have high BMI scores are charged higher-than-average medical costs. All the outliers with extreme charges are smokers with high BMI scores.

The pattern may also suggest that the 3 bands we observed in the "charges against age" scatterplot are: 1. non-smokers, 2. smokers with low BMI scores and non-smoking outliers, 3. smokers with high BMI scores.
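
The three suggested bands could be turned into an explicit segment label. The sketch below is only one possible way to do it: the rows are made up, the BMI cutoff of 30 is my assumption (no specific boundary between "low" and "high" BMI is stated above), and non-smoking outliers are simply left in the non-smoker segment.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration, NOT taken from insurance.csv
sample = pd.DataFrame({
    "smoker": ["no", "no", "yes", "yes", "yes"],
    "bmi": [22.0, 35.0, 24.5, 29.9, 33.1],
})

# ASSUMPTION: 30 is used as the boundary between "low" and "high" BMI;
# the analysis above does not state a specific cutoff
BMI_CUTOFF = 30

is_smoker = sample["smoker"] == "yes"
conditions = [~is_smoker, is_smoker & (sample["bmi"] < BMI_CUTOFF)]
choices = ["non-smoker", "smoker, low BMI"]
# Rows matching neither condition are smokers at or above the cutoff
sample["segment"] = np.select(conditions, choices, default="smoker, high BMI")

print(sample)
```

Colouring the "charges against age" scatterplot by such a label (for example with `hue="segment"`) would show whether the segments line up with the bands.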

# Regression Model Run

In [None]:
X = insurance_df.drop('charges',axis=1)
y = insurance_df['charges']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)


# Select categorical columns
categorical_cols = X_train.select_dtypes(exclude='number').columns.tolist()

# Select numerical columns
numerical_cols = X_train.select_dtypes(include='number').columns.tolist()
print(categorical_cols)
print(numerical_cols)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
# normalize numerical data
numerical_transformer = MinMaxScaler()

# Preprocessing for categorical data
# using one-hot-encoding method to cope with categorical data
categorical_transformer = OneHotEncoder(handle_unknown='ignore')


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
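
To make the two preprocessing steps more concrete, the sketch below reproduces by hand what min-max scaling and one-hot encoding compute. It uses plain pandas on made-up rows purely for illustration; it is not how `MinMaxScaler` or `OneHotEncoder` are implemented, and the pipeline above does not depend on it.

```python
import pandas as pd

# Made-up rows for illustration, NOT taken from insurance.csv
sample = pd.DataFrame({
    "age": [18, 41, 64],
    "smoker": ["no", "yes", "no"],
})

# Min-max scaling: (x - min) / (max - min), mapping the column onto [0, 1]
age_min, age_max = sample["age"].min(), sample["age"].max()
age_scaled = (sample["age"] - age_min) / (age_max - age_min)

# One-hot encoding: one indicator column per category
smoker_onehot = pd.get_dummies(sample["smoker"], prefix="smoker", dtype=int)

print(age_scaled.tolist())
print(smoker_onehot)
```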

Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression

# Define model
lr=LinearRegression()

# Bundle preprocessing and modeling code in a pipeline
model_1 = Pipeline(steps=[('preprocessor', preprocessor),
                     ('model', lr)
                    ])

# Preprocessing of training data, fit model
# take log of y_train because charges is right-skewed (roughly lognormal)
model_1.fit(X_train, np.log(y_train))

# Preprocessing of validation data, get predictions
preds = model_1.predict(X_test)

# back-transform the predictions from the log scale before scoring
preds_charges = np.exp(preds)
mean_charges = insurance_df['charges'].mean()
mae = mean_absolute_error(y_test, preds_charges)

# measuring the size of error/noise in this model
print('R square: ', r2_score(y_test, preds_charges))
print('Average charges: ', mean_charges)
print('Mean absolute error:', mae)
print('size of error: ', mae / mean_charges)


The linear regression model achieves an R square of 0.602 on the test set after the predictions are transformed back from the log scale. This is a moderate score that is not close to 1, meaning that the model does not fit the data points especially well.
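
For reference, R square compares the model's squared error with that of always predicting the mean. The helper below is a minimal sketch of that definition on made-up values; the function name is hypothetical, and in the notebook itself `r2_score` from scikit-learn does this job.

```python
import numpy as np


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
perfect = r_squared(y_true, y_true)          # predictions identical to the truth
mean_only = r_squared(y_true, [2500.0] * 4)  # always predicting the mean
print(perfect, mean_only)
```

A score near 1 means little unexplained variation remains, while a score near 0 means the model does no better than predicting the mean.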


XGBoost

In [None]:
from xgboost import XGBRegressor

# Define model
xgb= XGBRegressor(n_estimators=1000, learning_rate=0.05)

# Bundle preprocessing and modeling code in a pipeline
model_2 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', xgb)
                     ])

# Preprocessing of training data, fit model 
model_2.fit(X_train, np.log(y_train))

# Preprocessing of validation data, get predictions
preds = model_2.predict(X_test)

# test for accuracy, after back-transforming the predictions from the log scale
preds_charges = np.exp(preds)
mean_charges = insurance_df['charges'].mean()
mae = mean_absolute_error(y_test, preds_charges)
print('Average charges: ', mean_charges)
print('Mean absolute error:', mae)
print('size of error: ', mae / mean_charges)
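
The "size of error" printed for both models is the mean absolute error divided by the average charge. The sketch below shows that calculation on made-up numbers, including the step of exponentiating predictions made on the log scale. The helper name is hypothetical, and unlike the cells above it divides by the mean of the supplied true values rather than the mean of the whole dataset.

```python
import numpy as np


def relative_mae(y_true, y_pred) -> float:
    """Mean absolute error divided by the mean of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)) / y_true.mean())


# Made-up values for illustration only
y_true = np.array([1000.0, 2000.0, 3000.0])
# Stands in for model output on the log scale
log_preds = np.log([1100.0, 1800.0, 3300.0])
# Back-transform to the original scale before measuring the error
preds_original_scale = np.exp(log_preds)

error_ratio = relative_mae(y_true, preds_original_scale)
print(f"relative MAE: {error_ratio:.3f}")
```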


Model 2 (XGBoost) performs better than Model 1 (linear regression), as it has a lower mean absolute error.
We therefore generate the CSV output file using the predictions from XGBoost.

In [None]:
output = pd.DataFrame({'test charges': y_test,
                       'predict charges': np.exp(preds)})
output.to_csv('submission.csv', index=False)