# Predicting insurance costs from the attributes in this data set using an extra-trees regressor

#### Features (Data Dictionary):
- age: age of primary beneficiary
- sex: gender of the insurance contractor (female, male)
- bmi: body mass index, an objective measure of body weight relative to height: weight divided by height squared (kg / m^2); ideally 18.5 to 24.9
- children: number of children covered by health insurance (number of dependents)
- smoker: whether the beneficiary smokes (yes, no)
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
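
To make the bmi definition concrete, here is a minimal sketch of the formula (weight in kilograms divided by height in meters squared). The function name `body_mass_index` and the sample person are made up for illustration; they are not part of the data set.

```python
def body_mass_index(weight_kg, height_m):
    """Return BMI in kg / m^2: weight divided by height squared."""
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = body_mass_index(70, 1.75)
print(round(example_bmi, 1))
```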

#### Target:
- charges: Individual medical costs billed by health insurance

#### We will use an "extra-trees regressor" in this project.
- This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
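
To unpack what "uses averaging" means, here is a minimal sketch on synthetic data (the arrays and the names `X_demo`, `y_demo`, and `demo_model` are made up for this illustration). It checks that the ensemble prediction is the mean of the individual trees' predictions. As far as I know, `ExtraTreesRegressor` defaults to `bootstrap=False`, so with default settings each tree sees the full training set and the randomness comes mainly from how split thresholds are drawn.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data, made up for this illustration only.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_model = ExtraTreesRegressor(n_estimators=25, random_state=0)
demo_model.fit(X_demo, y_demo)

# Each fitted tree lives in estimators_; the ensemble prediction is their mean.
tree_predictions = np.array([tree.predict(X_demo) for tree in demo_model.estimators_])
manual_average = tree_predictions.mean(axis=0)

print(np.allclose(manual_average, demo_model.predict(X_demo)))
```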

#### Import necessary tools.

In [None]:
# data preparation
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# visualizations
import seaborn as sns
from itertools import cycle, islice

# modeling
from sklearn.ensemble import ExtraTreesRegressor
import sklearn.metrics

import warnings
warnings.filterwarnings('ignore')

#### Load and peek at the data.

In [None]:
df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
pd.concat([df.head(), df.tail()])

#### Check for null values.

In [None]:
df.isnull().sum()

#### No nulls!

#### Let's print some general descriptors of the data in each feature.

In [None]:
df.describe()

#### Let's look at the unique values in each column.

In [None]:
for col in df:
    print(col)
    print(df[col].unique())
    print()

#### And let's peek at the correlations between the numeric features.

In [None]:
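# Correlation is only defined for numeric columns, so select those first.
df.select_dtypes(include='number').corr()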

#### Now let's visualize those correlations in a heatmap.

In [None]:
sns.set(style="white")

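# Compute the correlation matrix (numeric columns only)
corr = df.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle
# (np.bool was removed from NumPy; the built-in bool works everywhere)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True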

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(5, 5))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(240, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},
            annot=True)
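
If the mask step feels opaque, here is a small sketch of what it produces. The 3x3 matrix `toy_corr` is made up for illustration; the point is that `np.triu_indices_from` selects the upper triangle including the diagonal, and the heatmap hides every cell where the mask is `True`.

```python
import numpy as np

# A made-up 3x3 correlation-like matrix, for illustration only.
toy_corr = np.array([[1.0, 0.3, 0.1],
                     [0.3, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])

toy_mask = np.zeros_like(toy_corr, dtype=bool)
# triu_indices_from returns the (row, col) indices of the upper triangle,
# including the diagonal, so those cells are flagged True (hidden in the plot).
toy_mask[np.triu_indices_from(toy_mask)] = True
print(toy_mask)
```

Only the cells below the diagonal stay `False`, so each pair of features is drawn once.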

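#### All of the numeric features have acceptably small correlations with each other.
#### Among them, age and BMI have the highest correlations with our target variable, charges. (The text columns sex, smoker, and region are not part of this matrix yet because they have not been encoded.)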
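
The correlation matrix only covers numeric columns. As a rough sketch of how a two-valued text column such as smoker could be brought in, the snippet below maps it to 1/0 first. The tiny frame `toy` and the column name `smoker_flag` are made up for illustration and do not come from the real data.

```python
import pandas as pd

# Tiny made-up frame that mimics the column names; not the real data.
toy = pd.DataFrame({
    'smoker':  ['yes', 'no', 'no', 'yes', 'no', 'no'],
    'charges': [30000.0, 4000.0, 6000.0, 35000.0, 5000.0, 7000.0],
})

# Map the two categories to 1/0 so that Pearson correlation is defined.
toy['smoker_flag'] = (toy['smoker'] == 'yes').astype(int)
smoker_corr = toy['smoker_flag'].corr(toy['charges'])
print(smoker_corr)
```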

#### First, let's truncate bmi to integer values so that it can be grouped into bars in the charts below.

In [None]:
# astype(int) drops the fractional part, e.g. 27.9 becomes 27.
df['bmi_int'] = df['bmi'].astype(int)
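
Note that this integer conversion truncates rather than rounds. A quick sketch with made-up values shows the difference:

```python
# int() drops the fractional part (truncation toward zero); it does not round.
sample_bmis = [27.9, 33.0, 18.6]  # made-up values
truncated = [int(b) for b in sample_bmis]
rounded = [round(b) for b in sample_bmis]
print(truncated)
print(rounded)
```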

#### Let's look at the distribution of the data now.

In [None]:
variables = ['sex','smoker','region','age','bmi_int','children']

print('Data distribution analysis')
for v in variables:
    # Create a list of colors to cycle through in the visualizations.
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(df[v])))
    # Set the figure size.
    plt.figure(figsize=(20,5))
    # Now sort the values in each column.
    df = df.sort_values(by=[v])
    # Finally, plot bar graphs of each column's values.
    df[v].value_counts()\
        .plot(kind = 'bar',
              color=my_colors)
    plt.title(v)
    plt.show()

#### We sorted the values, so let's peek at the head and tail of the dataframe.

In [None]:
pd.concat([df.head(), df.tail()])

#### And at a sample...

In [None]:
df.sample(10)

#### Now let's look at the average cost per feature value.

In [None]:
print('Mean cost analysis:')
for v in variables:
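    # Select 'charges' before averaging: calling mean() on the text columns
    # raises an error in recent pandas versions.
    group_df = df.groupby(v)[['charges']].mean()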
    group_df = group_df.sort_index()
    group_df.plot(y = ['charges'],kind = 'bar', figsize=(20,5))
    plt.show()

- Men average a little bit higher charges than women.
- Smokers average a lot more than non-smokers.
- Regions don't really vary.
- Charges trend up with age.
- BMI needs a closer look. Charges loosely trend up as bmi increases, but the last four bars are really odd.
- The number of children makes little difference, except that charges for five children are noticeably lower.
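
One possible way to follow up on the odd bars is to print the group size next to the mean, since an average over very few rows can look extreme. This is only a sketch on made-up rows (the frame `toy` is hypothetical); whether the real groups are actually small would need to be checked on `df`.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the notebook.
toy = pd.DataFrame({
    'bmi_int': [25, 25, 25, 30, 30, 52],
    'charges': [5000.0, 7000.0, 6000.0, 9000.0, 11000.0, 40000.0],
})

# Report the group size next to the mean so thin groups are easy to spot.
summary = toy.groupby('bmi_int')['charges'].agg(['mean', 'count'])
print(summary)
```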

#### Let's look at a pair plot. It shows both the distribution of each single variable and the relationships between pairs of variables, which makes it a great way to identify trends for follow-up analysis.

In [None]:
print('Variables pairplot:')
variables = ['sex','smoker','region','age','bmi_int','children','charges']
sns_plot = sns.pairplot(df[variables])
plt.show()

- Age shows three distinct tiers of linear relationships with charges.
- Charges are pretty similar for 0-3 children, but drop off for 4 and especially 5.
- Max bmi drops as the number of children increases until 5 children.

#### We need to transform the categorical data and then store it back in the df dataframe.

In [None]:
le_sex = LabelEncoder()
le_smoker = LabelEncoder()
le_region = LabelEncoder()

df['sex'] = le_sex.fit_transform(df['sex'])
df['smoker'] = le_smoker.fit_transform(df['smoker'])
df['region'] = le_region.fit_transform(df['region'])

pd.concat([df.head(), df.tail()])
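
To see which integer each category ends up with, a fitted `LabelEncoder` can be inspected through `classes_`. The sketch below uses a throwaway encoder (`demo_encoder`, a name made up here) on a few sample labels. Keep in mind that the integer codes follow alphabetical order and carry no real meaning, which matters most for a column like region.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
demo_encoder.fit(['yes', 'no', 'no', 'yes'])

# classes_ holds the sorted unique labels; each label's code is its position.
mapping = {str(label): int(code) for code, label in enumerate(demo_encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original strings.
decoded = [str(label) for label in demo_encoder.inverse_transform([1, 0])]
print(decoded)
```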

#### Now let's split the data into train and test sets and run our regressor model.

In [None]:
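variables = ['sex','smoker','region','age','bmi','children']

X = df[variables]
Y = df['charges']
# Split first so that the scaler only ever sees the training rows;
# random_state fixes the seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# scale the data: fit on the training set, then apply the same scaling to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)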
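
For reference, here is a minimal sketch of what standard scaling does to a single column: subtract the mean and divide by the standard deviation. The `ages` array is made up. Tree-based models generally do not need this step, so it is probably optional for an extra-trees regressor, but it does no harm.

```python
import numpy as np

# Made-up ages, for illustration only.
ages = np.array([20.0, 30.0, 40.0, 50.0])

# Standardize: subtract the mean, divide by the population standard deviation
# (ddof=0), which is the convention StandardScaler uses.
scaled = (ages - ages.mean()) / ages.std()
print(scaled)
```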

#### Let's train our model and evaluate it. We will use an "extra-trees regressor."

In [None]:
# create model
regressor = ExtraTreesRegressor(n_estimators = 200)
# fit model
regressor.fit(X_train,y_train)

# use model to obtain predictions and then evaluate the predictions
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)

print('ExtraTreesRegressor evaluating result:')
print()
print("The mean of the train:", y_train.mean())
print("The median of the train:", y_train.median())
print("Train MAE: ", sklearn.metrics.mean_absolute_error(y_train, y_train_pred))
print("Train RMSE: ", np.sqrt(sklearn.metrics.mean_squared_error(y_train, y_train_pred)))
print("mean/RMSE:", y_train.mean()/np.sqrt(sklearn.metrics.mean_squared_error(y_train, y_train_pred)))

print()
print("The mean of the test:", y_test.mean())
print("The median of the test:", y_test.median())
print("Test MAE: ", sklearn.metrics.mean_absolute_error(y_test, y_test_pred))
print("Test RMSE: ", np.sqrt(sklearn.metrics.mean_squared_error(y_test, y_test_pred)))
print("mean/RMSE:", y_test.mean()/np.sqrt(sklearn.metrics.mean_squared_error(y_test, y_test_pred)))
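
In case the two metrics are unfamiliar, this sketch computes them by hand on made-up numbers. MAE is the average absolute error, and RMSE is the square root of the average squared error, so RMSE penalizes large misses more heavily.

```python
import numpy as np

# Made-up true and predicted charges, for illustration only.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3300.0, 3600.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
print(mae, rmse)
```

A train error far below the test error, as fully grown trees tend to produce, is a hint of over-fitting rather than a sign of a great model.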

#### Now let's rank the importance of each feature using regressor.feature_importances_.

In [None]:
print('Feature importance ranking\n\n')
importances = regressor.feature_importances_
std = np.std([tree.feature_importances_ for tree in regressor.estimators_],axis=0)
indices = np.argsort(importances)[::-1]

importance_list = []
for f in range(X.shape[1]):
    variable = variables[indices[f]]
    importance_list.append(variable)
    print("%d. %s (%f)" % (f + 1, variable, importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(importance_list, importances[indices],
       color="r", yerr=std[indices], align="center")
plt.show()
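
As a sanity check on how to read these numbers, here is a small sketch on synthetic data where only the first feature drives the target (all names and data here are made up). The impurity-based importances are normalized, so they sum to 1, and the informative feature should come out on top.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: only the first of three features drives the target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(300, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

demo_model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

importances_demo = demo_model.feature_importances_
print(importances_demo)                  # one value per feature
print(importances_demo.sum())            # normalized importances
print(int(np.argmax(importances_demo)))  # index of the most important feature
```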

#### So whether a person smokes is by far the most important feature in predicting insurance charges.
#### BMI is second, though far less important.
#### And age is third.

#### Let's show some examples of predicting insurance rates on fictitious characters. Recall the ranges in values for each feature (note they have been encoded).

In [None]:
for col in df:
    print(col)
    print(df[col].min(), ' - ', df[col].max())
    print()

In [None]:
print('Predicting insurance rates on new fictitious characters:\n\n')

# Recall that the order of our variables is 
# ['sex','smoker','region','age','bmi','children']

# Create a character named Rock.
Rock = ['male','yes','southwest',25,30.5,2]
print('Rock - ')
print('\tsex:', Rock[0])
print('\tsmoker:', Rock[1])
print('\tregion:', Rock[2])
print('\tage:', Rock[3])
print('\tbmi:', Rock[4])
print('\tchildren:', Rock[5])
print()

# Transform the string data into numeric.
Rock[0] = le_sex.transform([Rock[0]])[0] 
Rock[1] = le_smoker.transform([Rock[1]])[0] 
Rock[2] = le_region.transform([Rock[2]])[0] 

# Scale the data using the StandardScaler() we previously created.
X = sc.transform([Rock])

# Predict the cost for Rock using the extra trees regressor we previously created.
cost_for_Rock = regressor.predict(X)[0]
print('Cost for Rock = $',cost_for_Rock,'\n\n')


Rockette = ['female','no','southeast',45,19,0]
print('Rockette - ')
print('\tsex:', Rockette[0])
print('\tsmoker:', Rockette[1])
print('\tregion:', Rockette[2])
print('\tage:', Rockette[3])
print('\tbmi:', Rockette[4])
print('\tchildren:', Rockette[5])
print()

# Transform the string data into numeric.
Rockette[0] = le_sex.transform([Rockette[0]])[0] 
Rockette[1] = le_smoker.transform([Rockette[1]])[0] 
Rockette[2] = le_region.transform([Rockette[2]])[0] 

# Scale the data using the StandardScaler() we previously created.
X = sc.transform([Rockette])

# Predict the cost for Rockette using the extra trees regressor we previously created.
cost_for_Rockette = regressor.predict(X)[0]
print('Cost for Rockette = $',cost_for_Rockette, '\n\n')


FertileRockette = ['female','no','southeast',45,19,5]
print('FertileRockette - ')
print('\tsex:', FertileRockette[0])
print('\tsmoker:', FertileRockette[1])
print('\tregion:', FertileRockette[2])
print('\tage:', FertileRockette[3])
print('\tbmi:', FertileRockette[4])
print('\tchildren:', FertileRockette[5])
print()

# Transform the string data into numeric.
FertileRockette[0] = le_sex.transform([FertileRockette[0]])[0] 
FertileRockette[1] = le_smoker.transform([FertileRockette[1]])[0] 
FertileRockette[2] = le_region.transform([FertileRockette[2]])[0] 

# Scale the data using the StandardScaler() we previously created.
X = sc.transform([FertileRockette])

# Predict the cost for FertileRockette using the extra trees regressor we previously created.
cost_for_FertileRockette = regressor.predict(X)[0]
print('Cost for FertileRockette = $',cost_for_FertileRockette)
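
The three blocks above repeat the same encode, scale, and predict steps. One way to avoid the repetition is a small helper function. The sketch below is self-contained and trains on a tiny made-up data set with only two features, so `predict_cost`, `toy_model`, `toy_scaler`, and `toy_le_smoker` are hypothetical stand-ins for the notebook's `regressor`, `sc`, and `le_smoker`.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny made-up training set (smoker, age, charges), for illustration only.
smoker_raw = ['yes', 'no', 'no', 'yes', 'no', 'yes']
ages = [25, 45, 30, 50, 60, 35]
charges = [20000.0, 8000.0, 5000.0, 30000.0, 12000.0, 24000.0]

toy_le_smoker = LabelEncoder().fit(smoker_raw)
features = np.column_stack([toy_le_smoker.transform(smoker_raw), ages]).astype(float)
toy_scaler = StandardScaler().fit(features)
toy_model = ExtraTreesRegressor(n_estimators=20, random_state=0)
toy_model.fit(toy_scaler.transform(features), charges)

def predict_cost(smoker, age):
    """Encode, scale, and predict for one hypothetical person."""
    row = [[toy_le_smoker.transform([smoker])[0], age]]
    return float(toy_model.predict(toy_scaler.transform(row))[0])

for name, smoker, age in [('Rock', 'yes', 25), ('Rockette', 'no', 45)]:
    print(name, predict_cost(smoker, age))
```

With the real model, the same idea would take all six features in the order of `variables` and reuse the fitted encoders and scaler.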

Thank you to the following tutorial:
https://www.kaggle.com/flagma/health-care-cost-analysys-prediction-python/data