# Capstone Project

Import the required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import patsy as pt
from sklearn import metrics
from sklearn import linear_model
from sklearn.model_selection import train_test_split

## Data Extraction

Data Source :  https://www.kaggle.com/mirichoi0218/insurance?select=insurance.csv

The purpose of this project is to understand which factors influence insurance charges, and then to use those variables to predict insurance costs in the US.

The CSV file provides the following fields:
- "age": age of participant
- "sex": gender of participant (female or male)
- "bmi": Body Mass Index
- "children": Number of children (dependents) covered by health insurance
- "smoker": Smoker or Non-smoker
- "region": the participant's residential area in the US - northeast, southeast, southwest, northwest.
- "charges": Individual medical costs billed by health insurance

In [None]:
# Import data from local drive
df0 = pd.read_csv('../input/insurance/insurance.csv')
# Show the header and the first 3 rows of the data
df0.head(3)

In [None]:
# Print a concise summary of the imported df0. 
df0.info()

The summary above shows that the imported data frame has 7 columns: age, sex, bmi, children, smoker, region and charges. Each column has 1338 non-null values, so there are no missing or undefined values in this data frame. The Dtype column shows the type of data stored in each variable.
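
The same conclusion can be checked directly by counting missing values per column. The sketch below uses a tiny made-up frame (`sample`, a hypothetical name) with the same column names as the CSV, so the values are illustrative only; in the notebook the same call would be applied to `df0`.

```python
import pandas as pd

# Tiny made-up frame with the same column names as insurance.csv (illustrative values only)
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.8, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.92, 1725.55, 4449.46],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing values:", total_missing)
```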

In [None]:
# Describe the data frame
df0.describe(include = 'all').round()

The mean age of participants is 39, with the youngest and oldest being 18 and 64. There are slightly more male participants than female participants, and southeast is the most common region. The average BMI is 31, with a minimum of 16 and a maximum of 53. The majority of participants are non-smokers. The mean insurance charge is 13,270, while the median is only 9,382: half of the participants are charged 9,382 or less, so the distribution of charges is right-skewed.
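
A mean that sits well above the median, as it does for charges here, is typical of right-skewed data. The minimal sketch below illustrates the effect with made-up numbers (`charges_demo` is a hypothetical list, not taken from the data set).

```python
from statistics import mean, median

# Made-up, right-skewed values (NOT taken from insurance.csv)
charges_demo = [1000, 2000, 3000, 4000, 40000]

demo_mean = mean(charges_demo)      # pulled upwards by the single large value
demo_median = median(charges_demo)  # middle value, unaffected by the large value
print("mean:", demo_mean, "median:", demo_median)
```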

## Data Cleaning

In [None]:
# Change female to Female and male to Male
df0.sex = df0.sex.replace(['female', 'male'], ['Female', 'Male'])
# Update smoker's yes to Yes and no to No
df0.smoker = df0.smoker.replace(['yes', 'no'], ['Yes', 'No'])
# Update charges to 2 decimal points
df0.charges = df0.charges.round(2)
# Show the first 3 rows of the updated columns
df0.filter(['sex', 'smoker','charges']).head(3)

In [None]:
# Add a column as ID, the unique number representing each participant
df1 = df0.reset_index()
# Edit the name of the columns
df1.columns = ['ID', 'Age', 'Gender', 'BMI', 'NumberOfChildren', 'Smoker', 'Region', 'Charges']
# IDs start at 1000001 (index 0 becomes 1000001, index 1 becomes 1000002, ...)
df1.ID = df1.ID + 1000001
# Reorder the columns in the data frame
df1 = df1[['ID', 'Age', 'Gender', 'Smoker', 'BMI', 'Region', 'NumberOfChildren', 'Charges']]
# Show the header and the first 3 rows of the data
df1.head(3)

In [None]:
# Set a new data frame for calculating the correlation.
# This data frame will be used for heatmap and Predictive Analytics
# All plots will use df1, except for heatmap
# Convert all string [dtype : object] columns shown in the first section to integers
df2 = pd.read_csv('../input/insurance/insurance.csv')
# Change female to 0 and male to 1
df2.sex = df2.sex.map({'female': 0, 'male': 1})
# Update smoker's no to 0 and yes to 1
df2.smoker = df2.smoker.map({'no': 0, 'yes': 1})
# Update region from string to integer (the codes 1-4 are arbitrary labels, not a ranking)
df2.region = df2.region.map({'southeast': 1, 'southwest': 2, 'northeast': 3, 'northwest': 4})
# Update charges to 2 decimal points
df2.charges = df2.charges.round(2)
# Show the header and the first 5 rows of the data
df2.head()

## Data Visualization

In [None]:
# Set the figure's size
plt.figure(figsize=(20,4))
# Plot a heatmap
sns.heatmap(df2.corr(), annot = True, cmap = 'Blues_r')

The heatmap above, drawn with seaborn, shows the correlations between age, gender, BMI, number of children, smoker, region and insurance charges. Number of children (dependents) and gender have the weakest correlations with charges, at 0.068 and 0.057. As the number of children is only weakly correlated with every other variable, with coefficients between 0.0077 and 0.068, it is excluded from most of the analysis below.

Smoking is highly correlated with the insurance charge, with a correlation coefficient of 0.79. Age and BMI are weakly to moderately correlated with the charge, with coefficients of 0.3 and 0.2. This indicates that charges tend to be higher for people who are older, who smoke, or who have a higher BMI.

Region has four categories that are coded here as arbitrary integers (1 to 4), so its correlation coefficient is hard to interpret. It looks like participants from the south are charged more. The plots below provide more insight.
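
Because region is a nominal variable, an alternative to integer codes is one indicator (dummy) column per region, which lets each region be correlated with charges separately. This is not what the notebook does; the sketch below only shows the idea on a made-up frame (`regions_demo` is a hypothetical name), using `pd.get_dummies`.

```python
import pandas as pd

# Made-up rows (NOT the insurance data), one per participant
regions_demo = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast", "southeast"],
})

# One 0/1 indicator column per region; no ordering between regions is implied
region_dummies = pd.get_dummies(regions_demo["region"], prefix="region", dtype=int)
print(region_dummies)
```

With the real data, joining these columns to `charges` and calling `.corr()` would give one coefficient per region instead of a single value that depends on the chosen integer codes.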

In [None]:
# Set a default colour for all plots below
sns.set_palette(['palevioletred','steelblue','#ffcc99' , 'mediumaquamarine'])

In [None]:
# Select variables for pairplot
sns.pairplot(df1, corner  = True, 
    x_vars=["Age", "Gender", "Smoker", "BMI", "Region", "Charges"],
    y_vars=["Age", "Gender", "Smoker", "BMI", "Region", "Charges"])

The pair plot above gives a quick view of the relationships between age, gender, smoker, BMI, region and insurance charges. Number of children was the least influential factor in the heatmap, so it is excluded from the pairplot.

Based on the diagonal plots, a large number of participants are in their 20s. There are slightly more male participants than female. Most participants are non-smokers. BMI seems to be normally distributed, with a mean of approximately 30. Participants come from all four regions. Most participants are charged below 20,000.

The Age-Charge graph indicates that the insurance charge increases with the participant's age, which is in line with the heatmap. The Gender-BMI graph shows that male participants have a wider range of BMI than female participants.

The Smoker-Charges plot shows that smokers have a significantly higher insurance charge than non-smokers. On the BMI-Region plot, participants from southeast have higher BMI values.

The plots below look at the data in more detail.

In [None]:
# Set the figure's size
plt.figure(figsize=(20,4))
# Plot the first graph
plt.subplot(1,3,1)
# Create a pie chart for gender; take the labels from value_counts() so they always match the slices
gender_counts = df1.Gender.value_counts()
plt.pie(x = gender_counts, explode = [0, 0.05], autopct='%0.01f%%', labels = list(gender_counts.index))
# Plot the second graph
plt.subplot(1,3,2)
# Create a second graph to view the share of smokers split by female and male
sns.histplot(data=df1, x='Gender', multiple='stack', hue = 'Smoker', stat='density')
# Plot the third graph
plt.subplot(1,3,3)
# Create a pie chart to view the overall smokers vs non-smokers (fixed order: No, then Yes)
smoker_counts = df1.Smoker.value_counts().reindex(['No', 'Yes'])
plt.pie(x = smoker_counts, explode = [0, 0.05], autopct='%0.01f%%', labels = [ 'Non-smoker', 'Smoker'])

The first pie chart shows that 50.5% of the participants are male and 49.5% are female, a difference of about one percentage point. In the pie chart on the right, 79.5% of the participants are non-smokers and 20.5% are smokers.

The histogram in the middle shows the share of smokers and non-smokers among female and male participants. There are more male smokers than female smokers.



In [None]:
# Set the figure's size
plt.figure(figsize=(20,4))
# Plot the first graph
plt.subplot(1,3,1)
# Create a pie chart for region; take the labels from value_counts() so they always match the slices
region_counts = df1.Region.value_counts()
plt.pie(x = region_counts, autopct = '%0.2f%%', labels = list(region_counts.index))
# Plot the second graph - Show the amount of smokers and non-smokers in each region
plt.subplot(1,3,2)
sns.countplot(x = df1.Smoker, hue = df1.Region)
# Plot the third graph - Show the amount of female and male in each region
plt.subplot(1,3,3)
sns.countplot(x = df1.Gender, hue = df1.Region)

The pie chart shows that southwest, northwest and northeast each account for approximately 24% of participants. As there are slightly more participants from southeast, it is not surprising that the blue bars representing southeast are higher than those of the other three regions. The ratio of female to male participants is similar across regions.

In [None]:
# Set the figure's size
plt.figure(figsize=(20,4))
# Plot the first graph - Age of participants group by gender
plt.subplot(1,2,1)
sns.histplot(data=df1, x='Age', hue = 'Gender', kde=True)
# Plot the second graph - Age of participants group by smoker
plt.subplot(1,2,2)
sns.histplot(data=df1, x='Age', hue = 'Smoker', kde=True)

There are more participants aged 18 to their early 20s than in any other age band, and slightly fewer in their late 30s to early 40s. The plot on the left shows slightly more male than female participants between age 18 and the mid-40s, and more female than male participants from the mid-40s onwards. The number of smokers decreases steadily between the mid-40s and late 50s, even though the number of participants in this age range fluctuates. This decrease may be due to the rising share of female participants in this range, as the data contains more male smokers than female smokers.

In [None]:
# Set the figure's size
plt.figure(figsize=(20,4))
# Plot the first graph - Kernel Density Estimation (KDE) plot for BMI
plt.subplot(1,2,1)
sns.histplot(data=df1, x='BMI', stat='density', kde=True,alpha = 0.2)
# Plot the second graph - KDE for BMI, group by smokers and non-smokers
plt.subplot(1,2,2)
sns.histplot(data=df1, x='BMI', hue = 'Smoker', stat='density', kde=True,alpha = 0.2)

The histogram on the left shows that BMI is approximately normally distributed with a slight right skew and a mean of 31. After splitting by smokers and non-smokers, a second peak appears around 35 on the smokers' KDE curve. A KDE plot of BMI grouped by gender is not provided: as males tend to be taller than females, the male BMI curve was expected to sit further to the right than the female one.

The plots below show how charges vary with different variables.

In [None]:
# Set the figure's size
plt.figure(figsize=(20,4))
# Plot the first violin plot : Gender vs Charges
plt.subplot(1,2,1)
sns.violinplot(data=df1, x='Gender', y='Charges', order=['Female','Male'])
plt.title('Distribution of Charges in Relation to Gender')
# Plot the second violin plot : Smoker vs Charges
plt.subplot(1,2,2)
sns.violinplot(data=df1, x='Smoker', y='Charges', order=['Yes','No'])
plt.title('Distribution of Charges in Relation to Smoking')

Based on the Gender-Charges violin plot, more males than females are charged around 40,000. From the heatmap earlier, smoking is the most influential factor for charges in this data set, so the average charges for smokers and non-smokers are expected to differ. Charges for most non-smokers are concentrated at the low end of the scale, whereas smokers are mostly charged around 20,000 or 45,000. Interestingly, fewer smokers are charged around 30,000, which gives the smokers' violin a shape that is totally different from the non-smokers'.

In [None]:
# Set the figure's size
plt.figure(figsize=(20,6))
# Plot the violin plot : Region vs Charges  
sns.violinplot(data=df1, y='Charges', x='Region')
plt.title('Distribution of Charges in Relation to Region')

BMI for southeast participants is higher on average than in the other three regions, and BMI is positively correlated with charges. Thus, it is not surprising that charges are higher for participants from southeast.

Since smoking is a driving factor for insurance charges, below is a pairplot where age, BMI and charges are categorized by whether the participants are smokers or non-smokers.

In [None]:
# Create a pairplot graph for selected variables and group by smokers and non-smokers
sns.pairplot(df1[["Age", "BMI", "Charges", 'Smoker']], hue = 'Smoker', corner  = True,)

After grouping the data by smokers and non-smokers, there are some interesting results. The diagonal plots indicate that most smokers are aged between 20 and 50, with fewer smokers above 50. This might be because older participants are more often female than male, or for some other reason. The BMI graph shows that most participants are overweight, with a BMI above the maximum healthy value of 24.9. The charges graph shows that smokers are clearly charged more than non-smokers.

On the Age-Charge plot, smokers have a higher insurance charge in general. The charge increases with age for both smokers and non-smokers. On the BMI-Charge plot, non-smokers' charges do not seem to be higher when the participants are overweight or obese. Smokers receive a higher charge than non-smokers with the same BMI. In particular, charges for obese smokers (BMI above 30) seem to be significantly higher than for everyone else. This might explain why the Age-Charges plot appears to contain four separate linear bands: they might correspond to a split by smoking status and BMI (underweight + healthy vs overweight + obese).

Next, the graphs below show how the Age-Charge and BMI-Charge plots are affected when the data is grouped by gender or region.
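
The suspected split into four groups can be made explicit by labelling each participant by smoking status and BMI band. The helper below is a hypothetical sketch, not part of the notebook: the name `charge_segment` and the default cutoff of 25.0 (just above the 24.9 healthy maximum mentioned earlier) are assumptions, and the participants are made up.

```python
def charge_segment(smoker: bool, bmi: float, bmi_cutoff: float = 25.0) -> str:
    """Label a participant by smoking status and BMI band (hypothetical helper)."""
    smoking = "smoker" if smoker else "non-smoker"
    bmi_band = "BMI >= cutoff" if bmi >= bmi_cutoff else "BMI < cutoff"
    return f"{smoking}, {bmi_band}"

# Made-up participants as (smoker, bmi) pairs, one per expected group
participants = [(False, 22.0), (False, 31.5), (True, 23.4), (True, 36.0)]
segments = [charge_segment(smoker, bmi) for smoker, bmi in participants]
for segment in segments:
    print(segment)
```

Using such a label as the `hue` of the Age-Charges scatter plot would be one way to check whether the four bands line up with these groups; passing `bmi_cutoff=30.0` would test the obese-only split instead.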

In [None]:
# lmplot : Age vs Charges group by Gender
sns.lmplot(data=df1, x='Age', y='Charges', hue='Gender', scatter_kws={"alpha": .3, "s": 25}, aspect=2.5)

Females tend to be charged less than males of the same age. On the BMI-Charge plot below, healthy and underweight females receive a higher insurance charge than males, while overweight and obese females are charged less than overweight and obese males. This trend was not expected from the earlier plots.

In [None]:
# lmplot : BMI  vs charges group by Gender
sns.lmplot(data=df1, x='BMI', y='Charges', hue='Gender', scatter_kws={"alpha": .3, "s": 25}, aspect=2.5)

In [None]:
# lmplot : Age vs Charges group by Region
sns.lmplot(data=df1, x='Age', y='Charges', hue='Region', scatter_kws={"alpha": .3, "s": 25}, aspect=2.5)

Based on the BMI-Region plot in the first pairplot, southeast participants have a higher BMI. Insurance charges are therefore expected to be relatively higher for southeast participants than for people of the same age from other regions.

## Predictive analytics

From the analysis above, Smoker, Age and BMI are the driving factors of the insurance charge. After checking the R-Squared on the test data, excluding the other variables lowers the value by about 0.5 (approximately 0.65% to 1.34%), which is negligible. Thus, only the three main factors are used for the prediction below.

Note that from here on df2 is used, because its smoker column has been converted from string (No and Yes) to integer (0 and 1).

In [None]:
# Set the values for X (independent variables)
X = df2[['smoker', 'age', 'bmi']]
# View the first three rows of X
X.head(3)

In [None]:
# Set the values for y (dependent variable)
y = df2.charges
# View the first three rows of y
y.head(3)

In [None]:
# Create train and test data with an 80%/20% split
train_x, test_x, train_y, test_y = train_test_split(X,y, test_size = 0.2, random_state = 1)
# Get the shape
[train_x.shape, test_x.shape, train_y.shape, test_y.shape]

After comparing the results of each model for 70/30, 80/20 and 90/10 splits, the 80/20 split works best. More detail on this can be found after the R-Squared results.
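
A comparison of split ratios like the one described above can be sketched as a loop over `test_size` values. The example below runs on synthetic stand-in data rather than the insurance data, and every name ending in `_demo` as well as the coefficients used to generate the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the insurance data): charges depend linearly on three features
rng = np.random.default_rng(0)
n = 200
smoker_demo = rng.integers(0, 2, size=n)
age_demo = rng.integers(18, 65, size=n)
bmi_demo = rng.uniform(16, 53, size=n)
X_demo = np.column_stack([smoker_demo, age_demo, bmi_demo])
y_demo = 24000 * smoker_demo + 250 * age_demo + 300 * bmi_demo + rng.normal(0, 1000, size=n)

# Fit a linear regression on each split and record the test R-squared
test_r2 = {}
for test_size in (0.3, 0.2, 0.1):
    tr_x, te_x, tr_y, te_y = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=1
    )
    model_demo = LinearRegression().fit(tr_x, tr_y)
    test_r2[test_size] = model_demo.score(te_x, te_y)

for test_size, score in test_r2.items():
    print(f"test_size={test_size}: test R-squared = {score:.4f}")
```

A single `random_state` gives only one draw per ratio, so differences between ratios can partly reflect which rows happened to land in the test set.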

In [None]:
# Define the 4 models used to see which will be best fitted to predict the charges
# Model 1 : Linear Regression
lm = linear_model.LinearRegression()
# Model 2 : Lasso Regression
lm_lasso = linear_model.Lasso()
# Model 3 : Ridge Regression
lm_ridge = linear_model.Ridge()
# Model 4 : Elastic Net Regression
lm_elastic = linear_model.ElasticNet()
# Fit the four models on the train data
lm.fit(train_x, train_y)
lm_lasso.fit(train_x, train_y)
lm_ridge.fit(train_x, train_y)
lm_elastic.fit(train_x, train_y)
# Print the intercept and coefficient of each model
print('lm Intercept        : ', lm.intercept_.round(2), '; lm Coefficient        : ', lm.coef_.round(2))
print('lm_lasso Intercept  : ', lm_lasso.intercept_.round(2), '; lm_lasso Coefficient  : ', lm_lasso.coef_.round(2))
print('lm_ridge Intercept  : ', lm_ridge.intercept_.round(2), '; lm_ridge Coefficient  : ', lm_ridge.coef_.round(2))
print('lm_elastic Intercept: ', lm_elastic.intercept_.round(2), ' ; lm_elastic Coefficient: ', lm_elastic.coef_.round(2))

Above are the intercepts and coefficients of all four models. After analyzing the R-Squared and MSE values below, Linear Regression is the best of these four models for predicting the insurance charge.

The following are the expected charge equations for smokers and non-smokers based on the best-fitted model (linear regression).
- Expected Charge for Non-Smokers = -11052.77 + (258.96 * Age) + (303.37 * BMI)
- Expected Charge for Smokers = (-11052.77 + 23723.48) + (258.96 * Age) + (303.37 * BMI)

Note that all four models share an unfavorable outcome: a young adult aged 18 to mid-20s who is underweight or healthy (BMI below 24.9) could receive a negative expected charge. This is not possible in the real world. Therefore, after computing the expected charge from either formula above, take the maximum of the expected charge and 0. This eliminates negative expected charges.
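
The two equations and the floor at zero can be combined into one small function. This is a minimal sketch using the linear regression coefficients quoted above; the function name `expected_charge` and the example inputs are assumptions made for illustration.

```python
# Coefficients copied from the linear regression equations above
INTERCEPT = -11052.77
SMOKER_COEF = 23723.48
AGE_COEF = 258.96
BMI_COEF = 303.37

def expected_charge(age: float, bmi: float, smoker: bool) -> float:
    """Expected charge from the fitted equation, floored at 0 (hypothetical helper)."""
    raw = INTERCEPT + AGE_COEF * age + BMI_COEF * bmi + (SMOKER_COEF if smoker else 0.0)
    # A negative charge is not possible, so take the maximum of the estimate and 0
    return max(raw, 0.0)

# Made-up example: an 18-year-old with a BMI of 20
print(expected_charge(18, 20.0, smoker=False))
print(round(expected_charge(18, 20.0, smoker=True), 2))
```

For this example the raw non-smoker estimate is negative (about -324), so the function returns 0, while the smoker estimate stays positive and is returned unchanged.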

In [None]:
# R-Squared on train data : Measures how well the regression line fits the train data
print('R-Squared for lm train data         : ', np.round(lm.score(train_x, train_y), 4),
      '\nR-Squared for lm_lasso train data   : ', np.round(lm_lasso.score(train_x, train_y), 4),
      '\nR-Squared for lm_ridge train data   : ', np.round(lm_ridge.score(train_x, train_y), 4),
      '\nR-Squared for lm_elastic train data : ', np.round(lm_elastic.score(train_x, train_y), 4), )
# R-Squared on test data : Measures how well the regression line fits the test data
print('\nR-Squared for lm test data         : ', np.round(lm.score(test_x, test_y), 4),
      '\nR-Squared for lm_lasso test data   : ', np.round(lm_lasso.score(test_x, test_y), 4),
      '\nR-Squared for lm_ridge test data   : ', np.round(lm_ridge.score(test_x, test_y), 4),
      '\nR-Squared for lm_elastic test data : ', np.round(lm_elastic.score(test_x, test_y), 4), )

R-Squared indicates how much variation of the dependent variable (charges) is explained by the independent variables (smoker, age and bmi) in a regression model. This value will indicate how well the data fit the regression model.

As the R-Squared values for the test and train data are very close, the model predicts new observations nearly as well as it fits the training data. In general, the higher the R-Squared, the better the model fits the data.

The first model (linear regression) and second model (lasso regression) have the highest R-Squared values for both train and test data. The test R-Squared of 75.68% for lm and lm_lasso means these models explain 75.68% of the variability of the response (charges) around its mean. The R-Squared for the first three models is approximately 75%, which is good but not great.

Note that R-Squared does not imply a causal relationship between the independent variables (X) and the dependent variable (y). Moreover, it does not tell us whether the regression model is correct. Thus, mean squared error (MSE) is used alongside it to decide which model fits best. Since the fourth model (elastic net) has the lowest score, it is the least suitable model and is left out of the MSE calculation.
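
For reference, R-Squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean, which is the quantity returned by `score` on the regression models above. The sketch below computes it by hand on made-up numbers; `r_squared` and the `_demo` lists are hypothetical names.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation around the mean
    return 1 - ss_res / ss_tot

# Made-up observed and predicted values (NOT from the insurance data)
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.5, 8.5]
r2_demo = r_squared(y_true_demo, y_pred_demo)
print(round(r2_demo, 4))
```

Here the residual sum of squares is 1 and the total sum of squares is 20, so 95% of the variation around the mean is explained.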

In [None]:
# Predict on test data
pred_test_lm = lm.predict(test_x)
pred_test_lm_lasso = lm_lasso.predict(test_x)
pred_test_lm_ridge = lm_ridge.predict(test_x)
# Mean Squared Error (MSE)
print('MSE for lm       : ', np.round(metrics.mean_squared_error(test_y, pred_test_lm),0),
      '\nMSE for lm_lasso : ', np.round(metrics.mean_squared_error(test_y, pred_test_lm_lasso),0),
      '\nMSE for lm_ridge : ', np.round(metrics.mean_squared_error(test_y, pred_test_lm_ridge),0))

A smaller MSE indicates that the model predicts better: the distances between the data points and the fitted values are smaller. MSE is never negative, and it equals zero only for a perfect model, which is practically impossible to build.
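
As a small illustration of these properties, the sketch below computes MSE by hand on made-up numbers. The function name `mse_by_hand` is hypothetical, and the root of the MSE (RMSE) is shown as an optional extra because it is expressed in the same units as the charges.

```python
import math

def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up observed and predicted values (NOT from the insurance data)
mse_demo = mse_by_hand([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.5, 8.5])
rmse_demo = math.sqrt(mse_demo)  # same units as the target variable
mse_perfect = mse_by_hand([3.0, 5.0], [3.0, 5.0])  # perfect predictions give exactly 0
print(mse_demo, rmse_demo, mse_perfect)
```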

Based on R-Squared and MSE, it is recommended to use the linear regression model to predict the insurance charges using the three main variables, which are smoking, age and BMI.

The models might predict better if other variables were included, such as the participant's personal health history (cholesterol levels, blood pressure, etc.), drinking habits and hobbies. It might seem odd to include hobbies, but if the participant spends their time on high-risk hobbies such as skydiving and car racing, it could lead to higher charges.