<img src="https://github.com/MCKasman/mckasman.github.io/blob/master/misc/medical.png?raw=true" width='300' height='300'>


# Introduction
I've always been curious why insurance rates differ between people, even when both individuals are equally healthy. How do health insurance agencies determine medical charges for individuals? *Should we eat more or eat less? To smoke or not to smoke?* Haha, it's always been clear in the real world that smoking will hike up the price, but let's take advantage of the variables in this amazing dataset and see whether they affect medical charges. Once we find a pattern or variables related to charges, we can develop a machine learning model to predict anyone else's medical charge within the constraints of the data we have. More specifically, we'll be using **multiple linear regression (MLR)** in this notebook!

# **1. Exploratory Data Analysis (EDA)**

## Import Libraries
* To get started on our exploratory data analysis, let's first import all the libraries we'll be using

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Import The Dataset
* Next, we will be importing the provided dataset, **"insurance.csv,"** into our notebook file
* Before creating our exploratory data analysis and machine learning model, we must check if there are any missing values in the dataset

In [None]:
# import dataset using Pandas
data = pd.read_csv('../input/insurance/insurance.csv')

# check if any columns have NaN values
data.isnull().sum()

* There are no missing values so we can now start working with the dataset!
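
To make the missing-value check concrete, here is a minimal sketch on a tiny made-up table (the values are hypothetical, not taken from `insurance.csv`). `isnull().sum()` counts the NaN entries in each column, so a result of all zeros means nothing is missing.

```python
import numpy as np
import pandas as pd

# hypothetical toy table with one missing BMI value (not from insurance.csv)
toy = pd.DataFrame({
    'age': [19, 33, 45],
    'bmi': [27.9, np.nan, 31.2],
    'smoker': ['yes', 'no', 'no'],
})

# count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if the whole table is free of missing values
has_no_missing = not toy.isnull().values.any()
print('No missing values:', has_no_missing)
```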

In [None]:
# output first five rows of the dataset using the ".head()" function
data.head()

## Visualization of The Correlation Matrix (Heatmap)
* A good way to check correlations among variables in a dataset is by visualizing the correlation matrix as a heatmap
* I will be using the heatmap method in this article with the Seaborn library: https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07

**However, we must first encode our categorical variables: *sex, smoker, and region***

In [None]:
data = pd.get_dummies(data)

In [None]:
data.head()
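
As a small illustration of what `pd.get_dummies()` does, the sketch below encodes a made-up two-row table (hypothetical values, not from `insurance.csv`). Each categorical column is replaced by one indicator column per category, while numeric columns are left untouched.

```python
import pandas as pd

# hypothetical toy rows (not from insurance.csv)
toy = pd.DataFrame({
    'age': [19, 33],
    'sex': ['female', 'male'],
    'smoker': ['yes', 'no'],
})

# each categorical column is replaced by one indicator column per category
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```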

Now that the data is encoded, we can start visualizing the correlation matrix.

In [None]:
# calculate variable correlations in regards to 'charges'
corr = data.corr()['charges'].sort_values()

In [None]:
# display correlation values
corr

In [None]:
corr = data.corr()

# Generate a mask for the upper triangle (np.bool was removed from NumPy, so use the built-in bool)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(250, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, square=True, annot=True, ax=ax)
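
The heatmap cell mentions a mask for the upper triangle. As a minimal sketch of what such a mask looks like, the example below builds one for a made-up three-column table (hypothetical values). Cells marked `True` are the ones a heatmap would hide, which removes the mirrored duplicates of a symmetric correlation matrix.

```python
import numpy as np
import pandas as pd

# hypothetical toy data (not from insurance.csv)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 1.0, 4.0, 3.0],
    'c': [4.0, 3.0, 2.0, 1.0],
})
toy_corr = toy.corr()

# True marks the cells to hide: the upper triangle, including the diagonal
upper_mask = np.triu(np.ones(toy_corr.shape, dtype=bool))
print(upper_mask)
```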

From the heatmap we can see that there is a **strong correlation** between being a **smoker** and the **medical charges.**

There is also a small correlation between **age** and **medical charges**.
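
For reference, the values in the heatmap are Pearson correlation coefficients (the default for `DataFrame.corr()`). The sketch below computes one by hand on made-up numbers (hypothetical, not from `insurance.csv`) and compares it with what pandas returns.

```python
import numpy as np
import pandas as pd

# hypothetical toy values (not from insurance.csv)
toy = pd.DataFrame({
    'smoker_yes': [1, 0, 0, 1, 0],
    'charges': [30000.0, 5000.0, 7000.0, 35000.0, 6000.0],
})

x = toy['smoker_yes'].to_numpy(dtype=float)
y = toy['charges'].to_numpy()

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum())

# pandas computes the same quantity
r_pandas = toy['smoker_yes'].corr(toy['charges'])
print(r_manual, r_pandas)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.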

## Descriptive Statistics of Smokers and Non-Smokers
* We can get more numerical insight on smokers & non-smokers by using the 'describe()' function on the dataset

In [None]:
tmp_data = pd.read_csv('../input/insurance/insurance.csv')

# Descriptive statistics smoker
statistics_smoker = tmp_data[tmp_data['smoker'] == 'yes'].describe()
statistics_smoker.rename(columns=lambda x: x + '_smoker', inplace=True)

# Descriptive statistics non-smoker
statistics_non_smoker = tmp_data[tmp_data['smoker'] == 'no'].describe()
statistics_non_smoker.rename(columns=lambda x: x + '_non_smoker', inplace=True)

# Dataframe that contains statistics for both smokers and non-smokers
statistics = pd.concat([statistics_smoker, statistics_non_smoker], axis=1)
statistics

* From the heatmap and the descriptive statistics table, we can infer that **being a smoker is associated with much higher medical charges**
* The **average charge of smokers (32,050.23)** is almost four times that of **non-smokers (8,434.27)**
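
A more compact way to get the same kind of per-group summary is `groupby` followed by `describe()`. This is only a sketch on made-up rows (hypothetical values, not from `insurance.csv`), shown as an alternative to the filter-and-concatenate approach above.

```python
import pandas as pd

# hypothetical toy rows (not from insurance.csv)
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'yes', 'no'],
    'charges': [30000.0, 5000.0, 7000.0, 34000.0, 6000.0],
})

# one row of summary statistics per smoker group
charges_by_smoker = toy.groupby('smoker')['charges'].describe()
print(charges_by_smoker[['count', 'mean']])

# ratio of the mean charge of smokers to that of non-smokers
mean_ratio = charges_by_smoker.loc['yes', 'mean'] / charges_by_smoker.loc['no', 'mean']
print('Smoker / non-smoker mean charge ratio:', mean_ratio)
```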

But what other variables impact medical charges?

## Distribution of Variables
* Let's visualize the variables in our dataset to view their distributions
* We will be using **histograms** and **bar charts** to visualize the variable distributions

In [None]:
from statistics import mode 

plt.style.use('ggplot')

# histogram of ages
data.age.plot(kind='hist', color='orange', edgecolor='black', figsize=(10,7))
plt.title('Distribution of Age', size=24)
plt.xlabel('Age', size=18)
plt.ylabel('Frequency', size=18)

# find most frequent age
mode_age = mode(data.age)
print('Mode of Age:', mode_age)

* There are more 18-year-olds paying for medical charges than people of any other age
* Most people paying for medical charges are young

In [None]:
# histogram of BMI
data.bmi.plot(kind='hist', color='orange', edgecolor='black', figsize=(10,7))
plt.title('Distribution of BMI', size=24)
plt.xlabel('Body Mass Index (BMI)', size=18)
plt.ylabel('Frequency', size=18)

# find average BMI
avg_BMI = data.bmi.mean()
print('Average BMI:', avg_BMI)

* An average BMI of 30.7 falls in the obese range (BMI of 30 or higher), which may affect medical charges

<img src = 'https://www.richardlipmanmd.com/wp-content/uploads/2019/01/BMI-Chart.jpg'>

In [None]:
# countplot to compare the number of children
plt.figure(figsize=(12,4))
sns.countplot(x='children', data=data, color='orange', edgecolor='black')
plt.title('Distribution of Children', size=24)
plt.ylabel('Frequency',size=18)
plt.xlabel('Number of Children',size=18)
plt.show()

* Having no children is the most common case among people paying for medical charges

In [None]:
# countplot to compare the number of people from different regions
plt.figure(figsize=(12,4))
sns.countplot(x='region', data=tmp_data, color='orange', edgecolor='black')
plt.title('Distribution of People Across Regions', size=24)
plt.ylabel('Frequency',size=18)
plt.xlabel('Region',size=18)
plt.show()

In [None]:
# histograms of medical charges for each region
data[data['region_northeast'] == 1].charges.plot(kind='hist', color='blue', edgecolor='black', alpha=0.5, figsize=(10, 7))
data[data['region_northwest'] == 1].charges.plot(kind='hist', color='magenta', edgecolor='black', alpha=0.5, figsize=(10, 7))
data[data['region_southeast'] == 1].charges.plot(kind='hist', color='green', edgecolor='black', alpha=0.5, figsize=(10, 7))
data[data['region_southwest'] == 1].charges.plot(kind='hist', color='red', edgecolor='black', alpha=0.5, figsize=(10, 7))
plt.legend(labels=['Northeast','Northwest','Southeast','Southwest'])
plt.title('Distribution of Charges Between Regions', size=24)
plt.xlabel('Medical Charges', size=18)
plt.ylabel('Frequency', size=18)

* The **number of people** across the regions is about the same
* The **amount of medical charges** does not vary drastically between regions

In [None]:
data[data['smoker_yes'] == 1].charges.plot(kind='hist', color='blue', edgecolor='black', alpha=0.5, figsize=(10, 7))
data[data['smoker_no'] == 1].charges.plot(kind='hist', color='magenta', edgecolor='black', alpha=0.5, figsize=(10, 7))
plt.legend(labels=['Smoker', 'Non-Smoker'])
plt.title('Distribution of Charges on Smokers & Non-Smokers', size=24)
plt.xlabel('Medical Charges', size=18)
plt.ylabel('Frequency', size=18)

* We can see that far more non-smokers than smokers fall in the lower range of medical charges

In [None]:
data[data['sex_male'] == 1].charges.plot(kind='hist', color='blue', edgecolor='black', alpha=0.5, figsize=(10, 7))
data[data['sex_female'] == 1].charges.plot(kind='hist', color='magenta', edgecolor='black', alpha=0.5, figsize=(10, 7))
plt.legend(labels=['Male', 'Female'])
plt.title('Distribution of Charges on Males & Females', size=24)
plt.xlabel('Medical Charges', size=18)
plt.ylabel('Frequency', size=18)

* The distributions of **medical charges for males and females** above look very similar

We now have more of an understanding of our dataset and what we should focus on: **age, BMI, number of children**

## Visualization of Variables Involving Medical Charges
* From our variable distributions, we know that **smoking** greatly affects the price of medical charges
* Medical charges do not appear to be affected by **sex** or **region**, as the distributions of charges are similar across those groups

We will now further analyze whether **age, BMI, and number of children** affect the price of medical charges.

### (A) Relationship Between Age & Medical Charges
How does **age** affect medical charges?

In [None]:
# scatter plot of Age, Smokers, and Medical Charges
ax1 = data[data['smoker_yes'] == 1].plot(kind='scatter', x='age', y='charges', color='blue', alpha=0.5, figsize=(10, 7))
data[data['smoker_no'] == 1].plot(kind='scatter', x='age', y='charges', color='magenta', alpha=0.5, figsize=(10 ,7), ax=ax1)

# legend, title, and labels
plt.legend(labels=['Smoker', 'Non-Smoker'])
plt.title('Relationship Between Age, Smoking, and Medical Charges', size=24)
plt.xlabel('Age', size=18)
plt.ylabel('Medical Charges', size=18);

* We could infer from the upward trend in the plot above that **increasing age increases medical charges**
* Other variables, such as **BMI and number of children**, could also explain why some points stray from the main trends for smokers and non-smokers above

### (B) Relationship Between BMI & Medical Charges
How does **BMI** affect medical charges?

In [None]:
# scatter plot of BMI, Smokers, and Medical Charges
ax1 = data[data['smoker_yes'] == 1].plot(kind='scatter', x='bmi', y='charges', color='blue', alpha=0.5, figsize=(10, 7))
data[data['smoker_no'] == 1].plot(kind='scatter', x='bmi', y='charges', color='magenta', alpha=0.5, figsize=(10 ,7), ax=ax1)
plt.legend(labels=['Smoker', 'Non-Smoker'])
plt.title('Relationship Between BMI, Smoking, and Medical Charges', size=24)
plt.xlabel('BMI', size=18)
plt.ylabel('Medical Charges', size=18)

We want to check if being unhealthy affects medical charges. Other than smoking, we can check the health of people with their BMI. Being **underweight, overweight, or obese** is considered unhealthy.

In [None]:
# plot for underweight (strictly below 18.5, matching the title)
plt.figure(figsize=(12,5))
plt.title("Medical Charges of BMI < 18.5 (Underweight)")
# histplot with a KDE curve replaces the deprecated distplot
ax = sns.histplot(data[data.bmi < 18.5]['charges'], kde=True, stat='density', color='m')

# calculate average medical charge for someone underweight
underweight_charge = data[data.bmi < 18.5]['charges'].mean()
print('Average Medical Charge (Underweight BMI):', underweight_charge)

In [None]:
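# plot for normal weight (18.5 <= BMI < 25, so no one is counted in two groups)
# note: inclusive='left' requires pandas 1.3 or newer
plt.figure(figsize=(12,5))
plt.title("Medical Charges of BMI Between 18.5 - 25 (Normal)")
# histplot with a KDE curve replaces the deprecated distplot
ax = sns.histplot(data[data.bmi.between(18.5, 25, inclusive='left')]['charges'], kde=True, stat='density', color='g')

# calculate average medical charge for someone normal
normal_charge = data[data.bmi.between(18.5, 25, inclusive='left')]['charges'].mean()
print('Average Medical Charge (Normal BMI):', normal_charge)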

In [None]:
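# plot for overweight (25 <= BMI < 30, so no one is counted in two groups)
# note: inclusive='left' requires pandas 1.3 or newer
plt.figure(figsize=(12,5))
plt.title("Medical Charges of BMI Between 25 - 30 (Overweight)")
# histplot with a KDE curve replaces the deprecated distplot
ax = sns.histplot(data[data.bmi.between(25, 30, inclusive='left')]['charges'], kde=True, stat='density', color='y')

# calculate average medical charge for someone overweight
overweight_charge = data[data.bmi.between(25, 30, inclusive='left')]['charges'].mean()
print('Average Medical Charge (Overweight BMI):', overweight_charge)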

In [None]:
# plot for obese
plt.figure(figsize=(12,5))
plt.title("Medical Charges of BMI >= 30 (Obese)")
# histplot with a KDE curve replaces the deprecated distplot
ax = sns.histplot(data[data.bmi >= 30]['charges'], kde=True, stat='density', color='r')

# calculate average medical charge for someone obese
obese_charge = data[data.bmi >= 30]['charges'].mean()
print('Average Medical Charge (Obese BMI):', obese_charge)

It's quite interesting that **the average medical charge for someone underweight is less than for someone with a normal BMI**, since being underweight is linked to more health issues.
* Based on the plots and the average medical charge across BMI categories, **medical charges tend to rise with BMI.**
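
The four BMI ranges can also be assigned in one step with `pd.cut`, followed by a `groupby` for the averages. The sketch below uses made-up rows (hypothetical values, not from `insurance.csv`) and assumes left-closed bins, so a BMI of exactly 25 counts as overweight and exactly 30 as obese.

```python
import numpy as np
import pandas as pd

# hypothetical toy rows (not from insurance.csv)
toy = pd.DataFrame({
    'bmi': [17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.0],
    'charges': [4000.0, 5000.0, 7000.0, 9000.0, 11000.0, 14000.0, 16000.0],
})

# left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [0, 18.5, 25, 30, np.inf]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
toy['bmi_category'] = pd.cut(toy['bmi'], bins=bins, labels=labels, right=False)

# average charge per BMI category
avg_charge_by_bmi = toy.groupby('bmi_category', observed=True)['charges'].mean()
print(avg_charge_by_bmi)
```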

### (C) Relationship Between Children & Medical Charges
Will the **number of children** affect medical charge?

In [None]:
# calculate average medical charge for each number of children (0 to 5)
child_labels = ['Zero Children', 'One Child', 'Two Children',
                'Three Children', 'Four Children', 'Five Children']

for n_children, label in enumerate(child_labels):
    avg_charge = data[data.children == n_children]['charges'].mean()
    print('Average Medical Charge (' + label + '):', avg_charge)

In [None]:
g= sns.catplot(x="children", y='charges', hue=None, data=tmp_data,
                height= 6, kind="point", aspect=1.0, legend_out=True, width=0.4, linewidth=3,  linestyles = '--', capsize=.1, dodge= 0.15,
                sharey=True, 
                palette = sns.color_palette("deep", n_colors = 1))

g.despine(left=True)
g.set_titles("Relationship Between Children and Medical Charges", weight='bold')
g.set_axis_labels("Number of Children", "Medical Charges")

* We can see there is **no clear pattern between the number of children and medical charges.**

Let's plot a comparison between smokers and non-smokers and the number of children they have.

In [None]:
g= sns.catplot(x="children", y='charges', hue='smoker', data=tmp_data,
                height= 6, kind="point", aspect=1.0, legend_out=True, width=0.4, linewidth=3,  linestyles = '--', capsize=.1, dodge= 0.15,
                sharey=True, 
                palette = sns.color_palette("deep", n_colors = 2))

g.despine(left=True)
g.set_titles("Relationship Between Children, Smoker, and Medical Charges", weight='bold')
g.set_axis_labels("Number of Children", "Medical Charges")
g._legend.set_title("Smoker")

* From the plot above we can see the **difference in medical charges between smokers and non-smokers who have the same number of children**
* The number of children doesn't seem to affect medical charges; rather, **smoking** accounts for the large difference in medical charges

## EDA Results
Through analyzing the variables in the dataset, we found that only **age, BMI,** and being a **smoker** noticeably impact medical charges. **Sex, region, and the number of children** do not appear to affect the price of medical charges.

**MEDICAL CHARGE WILL BE AFFECTED BECAUSE OF...** 
* **Age:** The price of medical charges increases the older you become.
* **BMI:** The higher your BMI level, the higher your medical charge will be.
* **Smoker:** *Smoke and become broke.* Medical charge price will skyrocket if you're a smoker!
    * From our descriptive statistics analysis earlier: ***the average charge of smokers (32,050.23) is higher compared to non-smokers (8,434.27)***
    
**MEDICAL CHARGE WILL NOT BE AFFECTED BECAUSE OF...** 
* **Sex:** Your sex will not impact medical charges. Male or female, the medical charge price is approximately the same.
* **Region:** Where you live will not affect medical charges; the distribution of charges is about the same across all regions.
* **Number of Children:** The number of children dependent on you will not affect the price of your medical charge.

# 2. Multiple Linear Regression (MLR) Model
With the multiple variables we have within the dataset, we will create an MLR model to predict the price of someone's medical charge.

## Import and Initialize Independent & Dependent Variables
Let's import the dataset again and create our 'x' (independent) & 'y' (dependent) variables.

We will be dropping the **'region'** column because of ***multicollinearity***, which may skew our model.
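
One way multicollinearity can arise with a categorical column such as `region` is through its one-hot encoding: the indicator columns always sum to 1, so any one of them is fully determined by the others. The sketch below illustrates this on made-up rows (hypothetical values). It is offered as an illustration of the concept, not necessarily the only reason to drop the column.

```python
import pandas as pd

# hypothetical toy rows (not from insurance.csv)
toy = pd.DataFrame({'region': ['northeast', 'northwest', 'southeast',
                               'southwest', 'southeast']})

# full one-hot encoding: one indicator column per region
full = pd.get_dummies(toy, dtype=int)

# every row sums to exactly 1, so the columns are linearly dependent
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# dropping the first category removes the redundancy
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(reduced.columns.tolist())
```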

In [None]:
# import dataset using Pandas
data = pd.read_csv('../input/insurance/insurance.csv')

# drop regions column
data = data.drop(['region'], axis=1)

In [None]:
data.head()

In [None]:
# set independent and dependent variables
x = data.iloc[:,:-1].values # age, sex, BMI, children, smoker
y = data.iloc[:,-1].values # charges

print('Independent Variables\n',x)
print('\nDependent Variables\n',y)

## Encode Categorical Data (Independent Variables)
Instead of using the `pd.get_dummies()` function to encode our categorical variables, we'll be using a OneHotEncoder.

In [None]:
# get column index of categorical variables (sex, smoker)
print('Sex Column Index:', data.columns.get_loc('sex'))
print('Smoker Column Index:', data.columns.get_loc('smoker'))

In [None]:
# import module for one-hot encoding scheme
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# sex and smoker column index is 1 and 4
dummy_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,4])], remainder='passthrough')
x = np.array(dummy_transformer.fit_transform(x))
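
It helps to know the column order that comes out of the `ColumnTransformer`: the encoded columns come first, followed by the passed-through columns in their original order. The sketch below shows this on made-up rows (hypothetical values). Note that it selects columns by name on a DataFrame and sets `sparse_threshold=0`, both of which are assumptions that differ slightly from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy rows (not from insurance.csv)
toy = pd.DataFrame({
    'age': [19, 33, 45],
    'sex': ['female', 'male', 'male'],
    'bmi': [27.9, 22.7, 31.2],
    'children': [0, 1, 2],
    'smoker': ['yes', 'no', 'no'],
})

# encode 'sex' and 'smoker'; pass the remaining columns through unchanged
toy_transformer = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['sex', 'smoker'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
toy_encoded = np.asarray(toy_transformer.fit_transform(toy), dtype=float)

# column order: sex_female, sex_male, smoker_no, smoker_yes, age, bmi, children
print(toy_encoded[0])
```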

## Split The Dataset Into The Training & Test Set

In [None]:
# import module to split data into training and test set
from sklearn.model_selection import train_test_split 

# 80% training & 20% testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
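
To see what the split produces, here is a minimal sketch on made-up arrays (hypothetical data). With `test_size=0.2`, 20% of the rows are held out for testing, and `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 10 samples with 2 features each
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0)

print(toy_x_train.shape, toy_x_test.shape)
```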

## Train The Multiple Linear Regression Model On The Training Set

In [None]:
# import the LinearRegression() class
from sklearn.linear_model import LinearRegression

# create a regressor model
regressor = LinearRegression()

# fit the training data, feature scaling is not needed for ordinary least-squares linear regression
regressor.fit(x_train, y_train)
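
A multiple linear regression model has the form y = b0 + b1*x1 + b2*x2 + ..., and fitting it means estimating the intercept b0 and the coefficients b1, b2, and so on. As a minimal sketch, the example below fits a model to made-up, noise-free data (hypothetical, generated from a known equation) and reads the estimates back from `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
toy_x = rng.uniform(0, 10, size=(50, 2))
toy_y = 1 + 2 * toy_x[:, 0] + 3 * toy_x[:, 1]

toy_regressor = LinearRegression()
toy_regressor.fit(toy_x, toy_y)

# the fitted model recovers the intercept and slopes used to build the data
print('Intercept:', toy_regressor.intercept_)
print('Coefficients:', toy_regressor.coef_)
```

The same two attributes are available on the `regressor` trained above, which is one way to inspect how much each feature contributes to the predicted charge.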

## Predict The Test Set Results

In [None]:
# the vector of the predicted medical charges in the training set
y_train_pred = regressor.predict(x_train)

# the vector of the predicted medical charges in the test set
y_test_pred = regressor.predict(x_test)

In [None]:
# compare y_test_pred (prediction) to the y_test (actual)
for predicted, actual in zip(y_test_pred, y_test):
    diff = abs(round(predicted, 2) - actual)
    print("Predicted: " + str(round(predicted, 2)) + " vs Actual: " + str(round(actual, 2)) +
          " ---> Difference: " + str(round(diff, 2)))

## Calculate Mean Squared Error (MSE) and R-Squared (R2) Values
To see how well the model predicts medical charges, we calculate the MSE and R2 values on both the training and test sets.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# calculate MSE values on the training and test set
MSE_train = mean_squared_error(y_train, y_train_pred)
MSE_test = mean_squared_error(y_test, y_test_pred)

# calculate R2 values on the training and test set
R2_train = r2_score(y_train, y_train_pred)
R2_test = r2_score(y_test, y_test_pred)

print('MSE (Training):', MSE_train)
print('MSE (Test):', MSE_test)

print('\nR2 (Training):', R2_train)
print('R2 (Test):', R2_test)
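
To show what these two metrics measure, the sketch below computes them by hand on made-up numbers (hypothetical values, not model output) and compares the results with scikit-learn's functions. MSE is the mean of the squared residuals, and R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# hypothetical actual and predicted values (not model output)
toy_actual = np.array([3.0, 5.0, 7.0, 9.0])
toy_predicted = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((toy_actual - toy_predicted) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((toy_actual - toy_predicted) ** 2)
ss_tot = np.sum((toy_actual - toy_actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(toy_actual, toy_predicted))
print(r2_manual, r2_score(toy_actual, toy_predicted))
```

Because MSE is in squared units, taking its square root gives an error in the same units as the charges, which is often easier to interpret.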

## Making a Single Prediction
For example, let's predict the medical charge of someone with these factors:
* **Sex** = Female, **Smoker** = Yes, **Age** = 19, **BMI** = 27.9, **Children** = 0

In [None]:
"""
regressor.predict([[sex_female, sex_male, smoker_no, smoker_yes, age, BMI, children]])

Only for categorical variables:
1 - Yes/True
0 - No/False
"""
# enter categorical and numerical inputs
print(regressor.predict([[1,0,0,1,19,27.90,0]]))
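
Typing the 0/1 flags by hand is easy to get wrong, so a small helper can build the input row from readable values. The function below is a hypothetical convenience (its name and checks are my own, not part of scikit-learn), and it assumes the column order listed in the docstring above.

```python
def build_feature_row(sex, smoker, age, bmi, children):
    """Return one input row in the order used above (hypothetical helper):
    [sex_female, sex_male, smoker_no, smoker_yes, age, bmi, children]."""
    if sex not in ('female', 'male'):
        raise ValueError("sex must be 'female' or 'male'")
    if smoker not in ('yes', 'no'):
        raise ValueError("smoker must be 'yes' or 'no'")
    return [
        int(sex == 'female'), int(sex == 'male'),
        int(smoker == 'no'), int(smoker == 'yes'),
        age, bmi, children,
    ]

# same person as in the example above: female, smoker, age 19, BMI 27.9, no children
example_row = build_feature_row('female', 'yes', 19, 27.9, 0)
print(example_row)
```

The row could then be passed to the trained model as `regressor.predict([example_row])`.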

# Conclusion
* **Age, BMI, and being a smoker** affect the price of medical charges for individuals
    * Medical charges **increase** as age and BMI **increase**
    * Medical charges tend to be much higher if you're a smoker
* Although not 100% accurate, our MLR model is **fairly accurate**, as it predicts medical charges close to the actual prices. We can try other types of regression, such as **random forest** and **support vector regression**, to get better results

If you liked my notebook please give it an **upvote!** Also please comment if you have any feedback as this is my first notebook on Kaggle!

# References
I definitely learned a lot on exploratory data analysis for this notebook from these sources:
* https://www.kaggle.com/hely333/eda-regressio
* https://towardsdatascience.com/simple-and-multiple-linear-regression-with-python-c9ab422ec29c
* https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07