# An Analysis of Insurance Premiums

## This notebook will explore insurance premiums and attempt to predict premium charges from demographic information.

In [None]:
#import dependencies
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

sns.set_palette("pastel")

In [None]:
df = pd.read_csv('../input/ushealthinsurancedataset/insurance.csv')
df.head()

In [None]:
df.shape

In [None]:
df.describe()

Check for missing values in the dataset.

In [None]:
df.isnull().value_counts()

We are all good!
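
A per-column view can make the missing-value check easier to read. The sketch below uses a small made-up frame (`toy` is hypothetical, not the insurance dataset) to show how `isnull().sum()` reports the number of missing values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the insurance dataset
toy = pd.DataFrame({
    "age": [18, 19, np.nan, 45],
    "bmi": [27.9, np.nan, np.nan, 30.1],
    "smoker": ["yes", "no", "no", "no"],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if the whole frame has no missing values
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```

A column of zeros in the first output is the per-column equivalent of "all good".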

# Age Analysis

In [None]:
df['age'].describe()

In [None]:
np.median(df['age'])

Age ranges from 18 to 64. The mean and median age are both about 39, and the IQR spans ages 27 to 51.

In [None]:
# one bin per distinct age (18 through 64 gives 47 values)
ages_hist = sns.histplot(df['age'], bins=47)
plt.show()

In [None]:
ages_kde = sns.kdeplot(df['age'])
plt.show()

In [None]:
sns.boxplot(x=df['age'])

As we can see, ages are distributed fairly uniformly, though there are almost twice as many 18- and 19-year-olds as individuals of any other age.

In [None]:
# count the individuals aged 18 and 19
count_18 = (df['age'] == 18).sum()
count_19 = (df['age'] == 19).sum()
print('18 yrs: {} \n19 yrs: {}'.format(count_18, count_19))

# Sex Analysis

In [None]:
sns.countplot(x=df['sex'])

Even mix of female and male.  Nice.

# BMI Analysis

In [None]:
bmi = df['bmi']
bmi.head()

In [None]:
bmi.describe()

In [None]:
np.median(bmi)

BMI ranges from ~16 to ~53.  The mean and median BMI are ~30.  The IQR falls between ~26 and ~35.  

In [None]:
sns.kdeplot(bmi)

In [None]:
plt.hist(bmi, bins=20)

In [None]:
sns.boxplot(x=bmi)

The BMI feature follows a roughly normal (bell-shaped) distribution, as can be seen above.  BMI is ~30 on average, and we see several outliers on the higher end past ~46.
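
The points drawn as outliers in a box plot come from a simple rule: with seaborn's default `whis=1.5`, anything more than 1.5 x IQR beyond the quartiles is flagged. Here is a minimal sketch of that rule on made-up values (`toy_bmi` and `tukey_fences` are hypothetical names, not part of the notebook).

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical BMI-like values, not the real column
toy_bmi = np.array([20.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 52.0])

lower, upper = tukey_fences(toy_bmi)
# Values outside the fences are the ones a box plot would mark as outliers
outliers = toy_bmi[(toy_bmi < lower) | (toy_bmi > upper)]
print(lower, upper, outliers)
```

Running the same function on `df['bmi']` should give the cutoff behind the outliers seen in the plot.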

# Children Analysis

In [None]:
children = df[['children']]
children.head()

In [None]:
children.describe()

In [None]:
np.median(children)

The number of children ranges from 0 to 5, with the IQR falling between 0 and 2.  The mean and median number of children are both about 1.


In [None]:
sns.countplot(x=df['children'])

We see the distribution of the number of children is right-skewed: most individuals have 0 children, and the counts fall as the number of children rises.

In [None]:
children.value_counts()

# Smoker Analysis

In [None]:
smoker=df['smoker']
smoker.head()

In [None]:
smoker.value_counts()

In [None]:
sns.countplot(x=df['smoker'])

In [None]:
smoker.value_counts()['no']/smoker.value_counts()['yes']

The majority of individuals are non-smokers, with non-smokers outnumbering smokers by a factor of 3.88.  Woo hoo!

# Region Analysis

In [None]:
region = df['region']
region.head()

In [None]:
region.value_counts()

In [None]:
sns.countplot(x=region)

We have a nice even mix of the 4 different regions around the US.  Great representation!

# Charges Analysis

In [None]:
charges = df['charges']
charges.head()

In [None]:
charges.describe()

In [None]:
np.median(charges)

The hospital charges range from USD 1,122 to USD 63,770, with a mean of USD 13,270 but a median of USD 9,382.  Let's get a closer look.

In [None]:
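# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the replacement
sns.histplot(charges, kde=True)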

In [None]:
sns.boxplot(x=charges)

We can see heavy right-skew (a long tail of high values), with the majority of charges falling below ~USD 20,000 and a slight uptick around the USD 40,000 - 50,000 range.  There are a very large number of outliers on the high end, at ~USD 35,000 and above.
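
A mean (USD 13,270) well above the median (USD 9,382) is what a long tail of high values typically produces. The sketch below checks the direction of skew numerically on made-up values; `sample_skewness` and `toy_charges` are hypothetical names, and the formula is the population form of the Fisher-Pearson coefficient.

```python
import numpy as np

def sample_skewness(values):
    """Fisher-Pearson skewness: third central moment over variance**1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical charge-like values with a long tail of high values
toy_charges = np.array([1000, 2000, 3000, 4000, 5000, 9000, 40000], dtype=float)

skew = sample_skewness(toy_charges)
print(skew > 0)                                     # positive means right-skewed
print(toy_charges.mean() > np.median(toy_charges))  # mean pulled above the median
```

pandas offers a similar (bias-corrected) statistic through `Series.skew()`, so `df['charges'].skew()` would be a quick check on the real column.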


# Bivariate Analysis (charges)

### Age

Using charges as our dependent variable of interest, let's take a look at how charges vary with the features.

In [None]:
sns.scatterplot(x=df['age'], y=df['charges'])

A slight positive trend can be seen: as age increases, charges tend to increase as well.

### Sex

In [None]:
women = df[['sex', 'charges']][df['sex'] == 'female']
men = df[['sex', 'charges']][df['sex'] == 'male']

women_median = np.median(women['charges'])
men_median = np.median(men['charges'])
print('Median Charges for Men: ', men_median, '\nMedian Charges for Women: ', women_median, '\nDifference: ', women_median-men_median)

In [None]:
sns.barplot(data=df, x='sex', y='charges')

In [None]:
plt.hist(df['charges'][df['sex']=='female'], alpha=0.4, label='Female')
plt.hist(df['charges'][df['sex']=='male'], alpha=0.4, label='Male')
plt.legend()

In [None]:
sns.boxplot(data=df, x='sex', y='charges')

We see that although males and females have roughly the same average charges, a larger share of males have high charges, and the upper range of charges in the box plot extends ~USD 10,000 higher for males.
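
The same comparison can be put into numbers with a grouped summary. This is a minimal sketch on made-up rows (`toy` is hypothetical, not the insurance data) showing median, upper quartile, and maximum per group.

```python
import pandas as pd

# Hypothetical rows, not the insurance dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "charges": [3000.0, 2500.0, 9000.0, 9500.0, 15000.0, 30000.0],
})

# One row per group, one column per statistic
summary = toy.groupby("sex")["charges"].agg(
    median="median",
    q75=lambda s: s.quantile(0.75),
    max="max",
)
print(summary)
```

Applied to `df`, a table like this would show whether the gap sits in the upper quartile, the maximum, or both.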

### BMI

In [None]:
sns.scatterplot(x=df['bmi'], y=df['charges'])

We see more high-charge values at higher BMI, but the spread of charges is not constant across BMI values (heteroskedasticity).
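
One rough way to see non-constant spread without a plot is to compare the standard deviation of charges across BMI bands. The data below is synthetic and built so that noise grows with BMI (an assumption made purely for illustration); the band edges are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): noise scale grows with BMI
bmi = rng.uniform(16, 53, size=2000)
charges = 10000 + 200 * bmi + rng.normal(0, 1, size=2000) * (100 * bmi)
toy = pd.DataFrame({"bmi": bmi, "charges": charges})

# Standard deviation of charges within each BMI band
bands = pd.cut(toy["bmi"], bins=[15, 25, 35, 45, 55])
spread = toy.groupby(bands, observed=True)["charges"].std()
print(spread)
```

If the per-band standard deviations differ a lot on the real data, that supports the heteroskedasticity reading of the scatter plot.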

### Children

In [None]:
sns.barplot(data=df, x='children', y='charges')

In [None]:
sns.boxplot(data=df, x='children', y='charges')

In [None]:
# one filled density curve per number of children (fill replaces the deprecated shade)
for n in range(6):
    label = '1 Child' if n == 1 else '{} Children'.format(n)
    sns.kdeplot(df['charges'][df['children']==n], fill=True, label=label)

plt.legend()

People with 5 children are concentrated around the ~USD 10,000 range with few outliers.  People with 2-4 children have the highest average and IQR for charges.

### Smoker

In [None]:
sns.barplot(data=df, x='smoker', y='charges')

In [None]:
sns.boxplot(data=df, x='smoker', y='charges')

In [None]:
# fill replaces the deprecated shade argument
sns.kdeplot(df['charges'][df['smoker']=='yes'], fill=True, label='Smoker')
sns.kdeplot(df['charges'][df['smoker']=='no'], fill=True, label='Non-Smoker')

plt.legend()

We can clearly see the significantly higher charges for smokers vs non-smokers from the plots above.  Seems pretty intuitive, but something to keep in mind as we move forward.

### Region

In [None]:
sns.barplot(data=df, x='region', y='charges')

In [None]:
sns.boxplot(data=df, x='region', y='charges')

In [None]:
# fill replaces the deprecated shade argument
sns.kdeplot(df['charges'][df['region']=='southwest'], fill=True, label='Southwest')
sns.kdeplot(df['charges'][df['region']=='southeast'], fill=True, label='Southeast')
sns.kdeplot(df['charges'][df['region']=='northwest'], fill=True, label='Northwest')
sns.kdeplot(df['charges'][df['region']=='northeast'], fill=True, label='Northeast')

plt.legend()

The breakdown by region is fairly similar, though we see a bit higher concentration of high-charge values in the southeast region.

## Multivariate Analysis

In [None]:
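# correlate numeric columns only; recent pandas raises an error on text columns
sns.heatmap(df.select_dtypes('number').corr(), center=0, cmap='YlGnBu', robust=True, annot=True)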
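
The heatmap only covers numeric columns, so `smoker` does not appear in it. A minimal sketch of one way to include a two-level category is to encode it as 0/1 first; the rows and the `smoker_flag` column name below are hypothetical.

```python
import pandas as pd

# Hypothetical rows, not the insurance dataset
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [17000.0, 4500.0, 8000.0, 42000.0, 3900.0, 47000.0],
})

# Encode the binary category as 0/1 so it can enter the correlation matrix
toy["smoker_flag"] = (toy["smoker"] == "yes").astype(int)

corr = toy[["age", "smoker_flag", "charges"]].corr()
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap` in the same way as before.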

Let's take a look at age

In [None]:
sns.scatterplot(data=df, x='age', y='charges', hue='sex')

In [None]:
sns.scatterplot(data=df, x='age', y='charges', hue='bmi')

In [None]:
sns.scatterplot(data=df, x='age', y='charges', hue='children')

In [None]:
sns.scatterplot(data=df, x='age', y='charges', hue='smoker')

This chart clearly illustrates the higher charges that smokers typically pay regardless of age.

In [None]:
sns.scatterplot(data=df, x='age', y='charges', hue='region')

Since we have a large range for age values, let's bucket the ages into young, middle-age, and older age buckets.  This may help simplify a few things.

In [None]:
def age_bucket(row):
    # classify one row by age: 30 and under young, 31-45 middle age, over 45 older
    if row['age'] > 45:
        return 'older'
    elif 31 <= row['age'] <= 45:
        return 'middle age'
    else:
        return 'young'

df['age_bucket'] = df.apply(age_bucket, axis=1)

df['age_bucket'].value_counts()
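
Row-wise `apply` works, but `pd.cut` can do the same binning in one vectorized call. This is a sketch under the assumption that ages are whole numbers, as they are in this dataset; for fractional ages between 30 and 31 the two approaches would disagree.

```python
import pandas as pd

ages = pd.Series([18, 30, 31, 45, 46, 64], name="age")

# Right-closed bins: (-inf, 30] young, (30, 45] middle age, (45, inf) older.
# For integer ages this matches the thresholds used in age_bucket.
buckets = pd.cut(
    ages,
    bins=[-float("inf"), 30, 45, float("inf")],
    labels=["young", "middle age", "older"],
)
print(buckets.tolist())
```

The result is an ordered categorical, so plots keep the young, middle age, older order without extra sorting.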

In [None]:
sns.barplot(data=df, x='age_bucket',y='charges')

This looks nice and clean. Let's break this down further by segments within each age group.

In [None]:
sns.barplot(data=df, x='age_bucket', y='charges', hue='sex')

In each age group, we see that males typically experience higher charges on average than females do.

In [None]:
sns.barplot(data=df, x='age_bucket', y='charges', hue='smoker')

This one seems pretty intuitive as well.

In [None]:
sns.barplot(data=df, x='age_bucket', y='charges', hue='children')

In [None]:
sns.barplot(data=df, x='age_bucket', y='charges', hue='region')

Something interesting to note is the degree of differential between the regions as the age bucket increases, e.g. northwest surpasses northeast in the older age bucket.  Let's take a look at BMI now.

In [None]:
df.head()

In [None]:
sns.scatterplot(data=df, x='bmi', y='charges', hue='sex')

In [None]:
sns.scatterplot(data=df, x='bmi', y='charges', hue='age_bucket')

We see the charges are clearly higher for older people vs middle age, and higher for middle age vs young, across all BMI types.

In [None]:
sns.scatterplot(data=df, x='bmi', y='charges', hue='children')

In [None]:
df['bmi'].describe()

Similarly to the age groups, we will now bucket the BMI into low, medium and high, based on the IQR for BMI.

In [None]:
def bmi_bucket(row):
    # classify one row by BMI: over 34 high, 26-34 medium, under 26 low
    if row['bmi'] > 34:
        return 'high bmi'
    elif 26 <= row['bmi'] <= 34:
        return 'medium bmi'
    else:
        return 'low bmi'

df['bmi bucket'] = df.apply(bmi_bucket, axis=1)

df['bmi bucket'].value_counts()
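
Because BMI is fractional and both 26 and 34 belong to the medium bucket, ordered conditions with `np.select` reproduce the thresholds exactly. The values below are made up to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values chosen to hit the bucket boundaries
bmi = pd.Series([18.5, 26.0, 30.0, 34.0, 34.1], name="bmi")

# Conditions are checked in order; the first one that matches wins
conditions = [bmi > 34, bmi >= 26]
choices = ["high bmi", "medium bmi"]
bmi_bucket = pd.Series(
    np.select(conditions, choices, default="low bmi"),
    index=bmi.index,
)
print(bmi_bucket.tolist())
```

This avoids a Python-level loop over rows, which matters little for ~1,300 rows but scales better on larger tables.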

In [None]:
sns.countplot(x=df['bmi bucket'], order=['low bmi', 'medium bmi', 'high bmi'])

In [None]:
sns.barplot(data=df, x='sex', y='charges', hue='bmi bucket')

Note the greater differential in charges for males of high bmi vs medium bmi compared with the differential between females of high bmi vs females of medium bmi. Let's explore the other features.

In [None]:
sns.barplot(data=df, x='sex', y='charges', hue='region')

In [None]:
sns.barplot(data=df, x='sex', y='charges', hue='age_bucket')

In [None]:
df.head()

In [None]:
sns.countplot(x=df['children'])

In [None]:
sns.boxplot(data=df, x='children', y='charges')

Let's bucket the number of children as well.

In [None]:
def children_bucket(row):
    # classify one row by number of children: 0, 1-3, or 4 and more
    if row['children'] == 0:
        return 'no children'
    elif 1 <= row['children'] <= 3:
        return '1-3 children'
    else:
        return '4+ children'

df['children bucket'] = df.apply(children_bucket, axis=1)

In [None]:
df['children bucket'].value_counts()

In [None]:
df.head()

# Data Preprocessing & Modeling

In [None]:
df_cat = df[['age_bucket', 'sex', 'bmi bucket', 'children bucket', 'smoker', 'region']]
li = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
df_onehot_cat = pd.get_dummies(df_cat, prefix = li)
df_onehot_cat.head()
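
One-hot encoding every level means the dummy columns of each feature always sum to 1, which duplicates the information carried by the model's intercept. `LinearRegression` still fits in that case, but individual coefficients become harder to interpret. The sketch below, on a made-up frame, shows the column names `get_dummies` produces and the `drop_first=True` variant that removes the redundancy.

```python
import pandas as pd

# Hypothetical rows, not the insurance dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["no", "yes", "no"],
})

# One column per category level, named <prefix>_<level>
full = pd.get_dummies(toy, prefix=["sex", "smoker"])
print(list(full.columns))

# drop_first=True omits the first level of each feature, removing the
# exact linear dependence between the dummy columns and the intercept
reduced = pd.get_dummies(toy, prefix=["sex", "smoker"], drop_first=True)
print(list(reduced.columns))
```

Predictions and R² should be the same either way; only the meaning of the coefficients changes, since each one is then measured relative to the dropped level.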

In [None]:
X=df_onehot_cat
y=df[['charges']]

In [None]:
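# Standardize the target (charges) to zero mean and unit variance.
# R^2 is unchanged by this rescaling; it only changes the units of
# the fitted coefficients and intercept.
sc = StandardScaler()
y = sc.fit_transform(y)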

In [None]:
# split data between training and testing datasets using SKLearn's train test split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state=1)

In [None]:
X_train

In [None]:
y_train

In [None]:
y_test

Regression Model 1 - the first model will use only categorical features, including the buckets created from the numerical features.

In [None]:
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
# The coefficients & intercept
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)

Let's see the performance of our first model.

In [None]:
# training data
regr.score(X_train, y_train)

In [None]:
# testing data

regr.score(X_test, y_test)
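
For a regressor, `score` returns the coefficient of determination R²: 1 minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal sketch of that formula on made-up numbers (`r_squared` is a hypothetical helper, not part of scikit-learn).

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
print(r_squared(y_true, y_pred))
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible on test data.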

Regression Model 2 - the second model will retain the numerical age and bmi variables

In [None]:
df_cat2 = df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
df_onehot_cat2 = pd.get_dummies(df_cat2, columns=['sex', 'children', 'smoker', 'region'])
df_onehot_cat2.head()

In [None]:
df_onehot_cat2_temp = df_onehot_cat2[['age', 'bmi']]
df_onehot_cat2_temp2 = df_onehot_cat2.drop(columns=['age', 'bmi'])

# use a separate scaler for the features so sc stays fitted to charges
df_onehot_cat2_temp = StandardScaler().fit_transform(df_onehot_cat2_temp)
df_onehot_cat2_temp = pd.DataFrame(df_onehot_cat2_temp, columns=['age', 'bmi'])
df_onehot_cat2_temp.head()

In [None]:
df_onehot_cat2_temp = pd.concat([df_onehot_cat2_temp, df_onehot_cat2_temp2], axis=1)
df_onehot_cat2_temp

In [None]:
y

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(df_onehot_cat2_temp,y,test_size = 0.2, random_state=1)
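
The cells above standardize `age` and `bmi` using every row before the split. A common alternative, sketched here on synthetic data, fits the scaler on the training rows only so that test rows do not influence the scaling. `X_toy`, `y_toy`, and the other names are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-ins for a numeric feature matrix and a target (assumptions)
X_toy = rng.normal(50, 10, size=(100, 2))
y_toy = rng.normal(13000, 12000, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for each column
```

For plain linear regression the effect on the score is usually small, but the habit carries over to models where leakage matters more.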

In [None]:
regr2 = linear_model.LinearRegression()
regr2.fit(X_train2, y_train2)
# The coefficients & intercept
print('Coefficients: ', regr2.coef_)
print('Intercept: ', regr2.intercept_)

Let's check how the second model performed.

In [None]:
# performance of training set

regr2.score(X_train2, y_train2)

In [None]:
# testing set
regr2.score(X_test2, y_test2)

The highest R² score across my linear regression models came out to 0.76.

I hope you enjoyed my EDA and basic model development on the insurance charges dataset. Any and all feedback would be greatly appreciated!