# This notebook supplements the main notebook 'data_vizualization' with additional insights from the data. As in the main notebook, the first few steps load the Python libraries and the insurance data.


# **Data Visualization**
### Objectives
- Read the cleaned data and visualize it to understand trends and correlations, and to identify which variables influence *insurance charges*.
### Inputs
- This notebook reads the `insurance_cleaned.csv` text file located in the `data/cleaned` folder.
### Outputs
- This notebook generates various plots and relies on `matplotlib`, `seaborn` and `plotly` to generate them.

## Load the libraries and the data
In this section the relevant data analysis libraries and the cleaned data will be loaded.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set_style('whitegrid')

The data will be loaded as the variable `insurance`:

In [2]:
insurance = pd.read_csv("../data/cleaned/insurance_cleaned.csv")
print(insurance.shape)
insurance.head()

(1338, 9)


| | age | sex | bmi | children | smoker | region | charges | age_bracket | bmi_category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | True | southwest | 16884.924 | 18-25 | overweight |
| 1 | 18 | male | 33.77 | 1 | False | southeast | 1725.5523 | 18-25 | obesity |
| 2 | 28 | male | 33.0 | 3 | False | southeast | 4449.462 | 26-35 | obesity |
| 3 | 33 | male | 22.705 | 0 | False | northwest | 21984.47061 | 26-35 | normal |
| 4 | 32 | male | 28.88 | 0 | False | northwest | 3866.8552 | 26-35 | overweight |


In the main notebook 'data_vizualization', the average charge was found to be substantially higher for smokers than non-smokers and slightly higher for males than females. However, there are more male smokers than female smokers, which may partly explain the slightly higher average charge for males. It therefore seems reasonable to investigate the correlations for the four subgroups that combine smoker status and sex, as below.
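
The imbalance between male and female smokers mentioned above can be checked with a contingency table, alongside the mean charge in each sex and smoker combination. This is a minimal sketch: `toy` is a small made-up DataFrame standing in for `insurance`, and the `smoker_label` helper column is an assumption added for readability, not a column in the original data.

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook, use `insurance` instead.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "smoker": [True, True, False, True, False, False],
    "charges": [30000.0, 34000.0, 5000.0, 28000.0, 6000.0, 4000.0],
})

# Hypothetical helper column: readable labels instead of True/False
toy["smoker_label"] = toy["smoker"].map({True: "smoker", False: "non-smoker"})

# Number of records in each sex / smoker combination
counts = pd.crosstab(toy["sex"], toy["smoker_label"])
print(counts)

# Mean charge in each sex / smoker combination
mean_charges = toy.pivot_table(
    index="sex", columns="smoker_label", values="charges", aggfunc="mean"
)
print(mean_charges)
```

If the counts are unbalanced, the difference in average charges between the sexes is better read from the rows of `mean_charges` within each smoker column than from the overall averages.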

In the main notebook 'data_vizualization', the correlation between BMI and charges was found to be 0.806 for smokers and 0.198 for non-smokers. Below, the correlation between BMI and charges is computed for male smokers, female smokers, male non-smokers and female non-smokers, to see whether the correlation in any of these subgroups is stronger than when grouping by smoker status alone, as in the main notebook.

In [3]:
# Check correlation between BMI and charges for male smokers
malesmoker_true = insurance[(insurance['sex'] == 'male') & (insurance['smoker'] == True)]
correlation = malesmoker_true['bmi'].corr(malesmoker_true['charges'])
print(f"Correlation between BMI and charges for malesmokers: {correlation:.3f}")

# Visualize the relationship for male smokers
fig = px.scatter(malesmoker_true, x='charges', y='bmi', trendline='ols',
                 title='Charges vs BMI for Malesmokers',
                 hover_data=['age', 'sex', 'region'])
fig.show()

# Check correlation between BMI and charges for female smokers
femalesmoker_true = insurance[(insurance['sex'] == 'female') & (insurance['smoker'] == True)]
correlation = femalesmoker_true['bmi'].corr(femalesmoker_true['charges'])
print(f"\n Correlation between BMI and charges for femalesmokers: {correlation:.3f}")

# Visualize the relationship for female smokers
fig = px.scatter(femalesmoker_true, x='charges', y='bmi', trendline='ols',
                 title='Charges vs BMI for femalesmokers',
                 hover_data=['age', 'sex', 'region'])
fig.show()

# Check correlation between BMI and charges for male non-smokers
malesmoker_false = insurance[(insurance['sex'] == 'male') & (insurance['smoker'] == False)]
correlation = malesmoker_false['bmi'].corr(malesmoker_false['charges'])
print(f"\n Correlation between BMI and charges for male non-smokers: {correlation:.3f}")

# Visualize the relationship for male non-smokers
fig = px.scatter(malesmoker_false, x='charges', y='bmi', trendline='ols',
                 title='Charges vs BMI for male Non-Smokers',
                 hover_data=['age', 'sex', 'region'])
fig.show()

# Check correlation between BMI and charges for female non-smokers
femalesmoker_false = insurance[(insurance['sex'] == 'female') & (insurance['smoker'] == False)]
correlation = femalesmoker_false['bmi'].corr(femalesmoker_false['charges'])
print(f"\n Correlation between BMI and charges for female non-smokers: {correlation:.3f}")

# Visualize the relationship for female non-smokers
fig = px.scatter(femalesmoker_false, x='charges', y='bmi', trendline='ols',
                 title='Charges vs BMI for female non-Smokers',
                 hover_data=['age', 'sex', 'region'])
fig.show()

Correlation between BMI and charges for malesmokers: 0.769



 Correlation between BMI and charges for femalesmokers: 0.846



 Correlation between BMI and charges for male non-smokers: 0.096



 Correlation between BMI and charges for female non-smokers: 0.075


The correlation of BMI and charges for female smokers is 0.846, which is slightly higher than the 0.806 for all smokers, while the correlation for male smokers is 0.769, which is slightly lower. So separating male from female does not materially improve the correlation overall.

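Unsurprisingly, the correlations for male and female non-smokers are very low (0.096 and 0.075), even lower than the 0.198 reported in the main notebook for all non-smokers. Note that the correlation computed from `smoker_false` later in this notebook is 0.084, so the 0.198 figure may be worth rechecking.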
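
The four subgroup correlations can also be computed in one pass, together with the subgroup sizes, which helps in judging how much weight each value deserves. The following is a sketch only: `toy` is synthetic data generated for illustration (with charges deliberately tied to BMI for smokers), not the `insurance` data, so the printed numbers will not match those above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `insurance`; every value here is made up.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "smoker": rng.choice([True, False], size=n),
    "bmi": rng.normal(30, 6, size=n),
})
# Assumed generating process: charges depend on BMI for smokers only.
toy["charges"] = (
    np.where(toy["smoker"], 1000 * toy["bmi"], 8000.0)
    + rng.normal(0, 3000, size=n)
)

# One row per sex / smoker subgroup: sample size and Pearson correlation
rows = []
for (sex, smoker), group in toy.groupby(["sex", "smoker"]):
    rows.append({
        "sex": sex,
        "smoker": smoker,
        "n": len(group),
        "corr": group["bmi"].corr(group["charges"]),
    })
summary = pd.DataFrame(rows)
print(summary.round(3))
```

Replacing `toy` with `insurance` should give the same four correlations as the cell above, without repeating the filtering code for each subgroup.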

In the main notebook `data_vizualization.ipynb`, age vs charges is correlated and plotted, coloured by smoker status. The overall correlation of age vs charges is 0.299, which is weak, but three clusters, each with an apparent visual correlation, can be seen. This is explored further below, with analysis seeking to identify why this clustering occurs.

Age vs charges is plotted again, but this time coloured by sex, region and BMI category in turn, to see whether any of these variables explains the clustering. As in `data_vizualization.ipynb`, the correlation printed above each plot is for the whole sample, not for the subdivisions shown by the colouring.

In [4]:
correlation = insurance['age'].corr(insurance['charges'])
print(f"Correlation between age and charges: {correlation:.3f}")
px.scatter(insurance, x='charges', y='age', color='sex', trendline='ols', hover_data=['bmi', 'sex', 'region'], title='Scatter plot of Charges vs Age colored by Sex')

Correlation between age and charges: 0.299


There is no visual correspondence between the 3 apparent correlation clusters and sex.

In [5]:
correlation = insurance['age'].corr(insurance['charges'])
print(f"Correlation between age and charges: {correlation:.3f}")
px.scatter(insurance, x='charges', y='age', color='region', trendline='ols', hover_data=['bmi', 'sex', 'region'], title='Scatter plot of Charges vs Age colored by Region')

Correlation between age and charges: 0.299


There is no visual correspondence between the 3 apparent correlation clusters and region.

In [6]:
correlation = insurance['age'].corr(insurance['charges'])
print(f"Correlation between age and charges: {correlation:.3f}")
px.scatter(insurance, x='charges', y='age', color='bmi_category', trendline='ols', hover_data=['bmi', 'sex', 'region'], title='Scatter plot of Charges vs Age colored by BMI Category')

Correlation between age and charges: 0.299


There does appear to be a slight pattern in the plot above: the majority of points in the high-charge cluster are in the obese BMI category, although a reasonable number of obese people also appear in the lower-charge clusters. It may be that the rightmost cluster is mainly obese smokers and the other two clusters contain a mix of smokers and non-smokers, so further analysis for smokers and non-smokers follows below.

In [7]:
smoker_true = insurance[insurance['smoker'] == True]
correlation = smoker_true['age'].corr(smoker_true['charges'])
print(f"Correlation between age and charges for smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['bmi_category'] == 'obesity']['age'].corr(smoker_true['charges'])
print(f"Correlation between age and charges for obese smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['bmi_category'] == 'overweight']['age'].corr(smoker_true['charges'])
print(f"Correlation between age and charges for overweight smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['bmi_category'] == 'normal']['age'].corr(smoker_true['charges'])
print(f"Correlation between age and charges for normal weight smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['bmi_category'] == 'underweight']['age'].corr(smoker_true['charges'])
print(f"Correlation between age and charges for underweight smokers: {correlation:.3f}")
px.scatter(smoker_true, x='charges', y='age', color='bmi_category', trendline='ols', hover_data=['bmi', 'sex', 'region'], title='Scatter plot of Charges vs Age colored by BMI Category for Smokers')

Correlation between age and charges for smokers: 0.368
Correlation between age and charges for obese smokers: 0.667
Correlation between age and charges for overweight smokers: 0.733
Correlation between age and charges for normal weight smokers: 0.730
Correlation between age and charges for underweight smokers: 0.517


So, for smokers, the correlation of age vs charges is fairly strong (0.667 to 0.733) within each of the obese, overweight and normal weight BMI categories. The correlation is weaker (0.517) for underweight people, of whom there are relatively few.
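
The cell above correlates a filtered `age` Series with the unfiltered `charges` Series. This still gives the subgroup correlation, because `Series.corr` aligns the two Series on their index before computing, so only rows present in both are used. A minimal sketch of that behaviour, using a small made-up DataFrame `toy` rather than the real data:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "bmi_category": ["obesity", "normal", "obesity", "normal", "obesity", "normal"],
    "age": [20, 25, 40, 45, 60, 30],
    "charges": [30000.0, 15000.0, 38000.0, 21000.0, 47000.0, 16000.0],
})
obese = toy[toy["bmi_category"] == "obesity"]

# Filtered age vs unfiltered charges: rows are matched on the index first
r_mixed = obese["age"].corr(toy["charges"])

# Filtered age vs filtered charges: the more explicit form
r_explicit = obese["age"].corr(obese["charges"])

print(abs(r_mixed - r_explicit) < 1e-12)
```

The explicit form is arguably easier to read, since it does not rely on the reader knowing about index alignment.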

The right cluster in the plot above can be seen to comprise the majority of the obese smokers in the sample. Its charges look consistent with the rightmost of the three clusters seen in the plot of age vs charges for the whole sample, which strongly suggests that the rightmost cluster in the whole-sample plot is mainly obese smokers.

The left cluster in the plot above contains a mix of the other BMI categories (overweight, normal and underweight). Its charges look consistent with the middle of the three clusters seen in the whole-sample plot, which strongly suggests that the middle cluster in the whole-sample plot is mainly non-obese smokers.

The above conclusions are consistent with the high correlation of 0.806 found for BMI vs charges for smokers in `data_vizualization.ipynb`.

Below the correlation of age vs charges is explored for non-smokers, coloured by bmi_category.

In [8]:
smoker_false = insurance[insurance['smoker'] == False]
correlation = smoker_false['age'].corr(smoker_false['charges'])
print(f"Correlation between age and charges for non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['bmi_category'] == 'obesity']['age'].corr(smoker_false['charges'])
print(f"Correlation between age and charges for obese non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['bmi_category'] == 'overweight']['age'].corr(smoker_false['charges'])
print(f"Correlation between age and charges for overweight non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['bmi_category'] == 'normal']['age'].corr(smoker_false['charges'])
print(f"Correlation between age and charges for normal weight non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['bmi_category'] == 'underweight']['age'].corr(smoker_false['charges'])
print(f"Correlation between age and charges for underweight non-smokers: {correlation:.3f}")
px.scatter(smoker_false, x='charges', y='age', color='bmi_category', trendline='ols', hover_data=['bmi', 'sex', 'region'], title='Scatter plot of Charges vs Age colored by BMI Category for Non-Smokers')

Correlation between age and charges for non-smokers: 0.628
Correlation between age and charges for obese non-smokers: 0.631
Correlation between age and charges for overweight non-smokers: 0.623
Correlation between age and charges for normal weight non-smokers: 0.587
Correlation between age and charges for underweight non-smokers: 0.991


The correlation of age vs charges for non-smokers is 0.628 overall, and between 0.587 and 0.631 within the obese, overweight and normal weight categories, which is somewhat weaker than the within-category correlations seen for smokers. The 0.991 for underweight non-smokers should be treated with caution if, as for smokers, relatively few people fall in that category. Visually, a cluster is seen at the lower end of the charges range which includes all BMI categories and shows a reasonable correlation of age vs charges. The majority of non-smokers are in this cluster, so it may reflect the majority who are perhaps relatively healthy compared with the relatively small number of non-smokers at higher charge levels. The charges for this cluster seem consistent with those in the left cluster seen in the plot of age vs charges for the whole sample, which strongly suggests that the left cluster in that plot is mainly non-smokers.
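
One way to check the cluster interpretation numerically is to label each record with its hypothesised segment and compare the charge ranges of the segments. This is a minimal sketch that assumes the `smoker` and `bmi_category` columns shown earlier; the `toy` rows and the `segment` column name are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like `insurance`; the values are illustrative only.
toy = pd.DataFrame({
    "smoker": [False, False, True, True, True, True],
    "bmi_category": ["normal", "obesity", "normal", "overweight", "obesity", "obesity"],
    "charges": [4000.0, 6000.0, 18000.0, 22000.0, 40000.0, 45000.0],
})

is_obese = toy["bmi_category"] == "obesity"

# Hypothetical `segment` column: one label per hypothesised cluster
toy["segment"] = np.select(
    [~toy["smoker"], toy["smoker"] & ~is_obese, toy["smoker"] & is_obese],
    ["non-smoker", "non-obese smoker", "obese smoker"],
    default="unknown",
)

# Charge range per segment; little overlap would support the interpretation
segment_summary = toy.groupby("segment")["charges"].agg(["count", "min", "median", "max"])
print(segment_summary)
```

On the real data, passing `color='segment'` to the whole-sample scatter plot would show directly whether the three visual clusters line up with these three segments.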

Some other data investigations follow below.

The correlations and plots below examine charges vs BMI by age bracket, first for smokers and then for non-smokers.

In [9]:
smoker_true = insurance[insurance['smoker'] == True]
correlation = smoker_true['bmi'].corr(smoker_true['charges'])
print(f"Correlation between bmi and charges for smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['age_bracket'] == '18-25']['bmi'].corr(smoker_true['charges'])
print(f"Correlation between bmi and charges for 18-25 year old smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['age_bracket'] == '26-35']['bmi'].corr(smoker_true['charges'])
print(f"Correlation between bmi and charges for 26-35 year old smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['age_bracket'] == '36-45']['bmi'].corr(smoker_true['charges'])
print(f"Correlation between bmi and charges for 36-45 year old smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['age_bracket'] == '46-55']['bmi'].corr(smoker_true['charges'])
print(f"Correlation between bmi and charges for 46-55 year old smokers: {correlation:.3f}")
correlation = smoker_true[smoker_true['age_bracket'] == '56+']['bmi'].corr(smoker_true['charges'])
print(f"Correlation between bmi and charges for 56+ year old smokers: {correlation:.3f}")
px.scatter(smoker_true, x='charges', y='bmi', color='age_bracket', trendline='ols', hover_data=['bmi', 'sex', 'region'], title='Scatter plot of Charges vs bmi colored by Age Bracket for Smokers')

Correlation between bmi and charges for smokers: 0.806
Correlation between bmi and charges for 18-25 year old smokers: 0.843
Correlation between bmi and charges for 26-35 year old smokers: 0.817
Correlation between bmi and charges for 36-45 year old smokers: 0.810
Correlation between bmi and charges for 46-55 year old smokers: 0.907
Correlation between bmi and charges for 56+ year old smokers: 0.854


This shows there is a strong correlation of BMI vs charges for smokers in all age brackets, and charges appear to be more strongly linked with BMI than with age bracket, which is in line with the findings in `data_vizualization.ipynb`.

In [10]:
smoker_false = insurance[insurance['smoker'] == False]
correlation = smoker_false['bmi'].corr(smoker_false['charges'])
print(f"Correlation between bmi and charges for non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['age_bracket'] == '18-25']['bmi'].corr(smoker_false['charges'])
print(f"Correlation between bmi and charges for 18-25 year old non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['age_bracket'] == '26-35']['bmi'].corr(smoker_false['charges'])
print(f"Correlation between bmi and charges for 26-35 year old non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['age_bracket'] == '36-45']['bmi'].corr(smoker_false['charges'])
print(f"Correlation between bmi and charges for 36-45 year old non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['age_bracket'] == '46-55']['bmi'].corr(smoker_false['charges'])
print(f"Correlation between bmi and charges for 46-55 year old non-smokers: {correlation:.3f}")
correlation = smoker_false[smoker_false['age_bracket'] == '56+']['bmi'].corr(smoker_false['charges'])
print(f"Correlation between bmi and charges for 56+ year old non-smokers: {correlation:.3f}")
px.scatter(smoker_false, x='charges', y='bmi', color='age_bracket', trendline='ols', hover_data=['bmi', 'sex', 'region'], title='Scatter plot of Charges vs bmi colored by Age Bracket for Non-Smokers')

Correlation between bmi and charges for non-smokers: 0.084
Correlation between bmi and charges for 18-25 year old non-smokers: -0.016
Correlation between bmi and charges for 26-35 year old non-smokers: 0.024
Correlation between bmi and charges for 36-45 year old non-smokers: 0.066
Correlation between bmi and charges for 46-55 year old non-smokers: -0.015
Correlation between bmi and charges for 56+ year old non-smokers: 0.054


This shows there is little correlation between BMI and charges for non-smokers in any age bracket, while the plot visually confirms that age bracket is much more significant for non-smokers.

# Overall conclusions
The above analysis provides the following insights:

1. When charges and age are scatter plotted on the x and y axes for the whole sample, 3 clusters can be seen.  The analysis strongly indicates that the rightmost cluster with the highest charges comprises mainly obese smokers.  The middle cluster comprises mainly non-obese smokers.  And the left cluster comprises mainly non-smokers.

2. Of the variables examined, BMI is most strongly correlated with charges for smokers (0.806), whereas age is most strongly correlated with charges for non-smokers (0.628).

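3. Sex appears to have little influence on charges: the clusters show no visual correspondence with sex, and separating smokers and non-smokers by sex does not materially change the BMI vs charges correlations.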
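
The second conclusion can be summarised in a small table of the correlations of charges with age and with BMI, split by smoker status. The sketch below uses synthetic data (`toy`) whose generating process is an assumption chosen to mimic the pattern described, so it illustrates the method rather than reproducing the notebook's figures.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `insurance`; every value here is made up.
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "smoker": rng.choice([True, False], size=n),
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
})
# Assumed generating process: BMI drives charges for smokers, age for non-smokers.
toy["charges"] = (
    np.where(toy["smoker"], 1200 * toy["bmi"] + 100 * toy["age"], 250 * toy["age"])
    + rng.normal(0, 2000, size=n)
)

# Correlation of charges with age and with BMI, computed per smoker group
rows = {}
for smoker, group in toy.groupby("smoker"):
    label = "smokers" if smoker else "non-smokers"
    rows[label] = group[["age", "bmi"]].corrwith(group["charges"])
corr_table = pd.DataFrame(rows).T
print(corr_table.round(3))
```

Run against `insurance`, the same loop would place the 0.806 and 0.628 figures side by side with their weaker counterparts, making the contrast between the two groups explicit.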