We will attempt to perform the following tasks on this dataset:
1.  Check for any missing values, or anything that needs to be imputed.
2.  Some Exploratory Data Analysis, specifically to look for linear separability.
3.  Create a linear regression model and make predictions.
4.  Create a Random Forest regression model and make predictions.
5.  Document any insights obtained from the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('../input/insurance.csv')
df.head()

How many null values do we have in this dataset?

In [None]:
df.isnull().sum()
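
Had the count above been non-zero, a simple imputation could look like the sketch below. The frame `toy` and its values are made up for illustration and are not taken from insurance.csv. Median and most-frequent-value filling is one common choice, not necessarily the best one for this data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (values are illustrative, not from insurance.csv)
toy = pd.DataFrame({
    'age': [19, 33, np.nan, 45],
    'bmi': [27.9, np.nan, 33.0, 25.7],
    'smoker': ['yes', 'no', None, 'no'],
})

# Count missing values per column, as done for df above
missing_before = toy.isnull().sum()

# Numeric columns: fill with the column median
# Categorical column: fill with the most frequent value
imputed = toy.copy()
for col in ['age', 'bmi']:
    imputed[col] = imputed[col].fillna(imputed[col].median())
imputed['smoker'] = imputed['smoker'].fillna(imputed['smoker'].mode()[0])

missing_after = imputed.isnull().sum()
print(missing_before, missing_after, sep='\n\n')
```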

Do we have any outliers?  Let us check the numeric columns first, then the categorical columns.

In [None]:
df.describe()

In [None]:
df['sex'].unique()

In [None]:
df['smoker'].unique()

In [None]:
df['region'].unique()

Everything seems to be in order, so we have a clean dataset to work with.  The number of observations is small, though, and I am not sure we have enough data for any deep neural network training.
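
The `describe()` summary only shows the extremes and quartiles. A more explicit outlier check is the common 1.5 x IQR rule, sketched below. The helper `iqr_outliers` and the sample values are hypothetical and only illustrate the idea.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical BMI-like values; 60.0 is planted as an obvious outlier
sample = [22.0, 25.0, 27.0, 28.0, 30.0, 31.0, 33.0, 60.0]
mask = iqr_outliers(sample)
print(mask.sum(), "point(s) flagged")
```

On the dataframe used here, the same helper could be called as `iqr_outliers(df['bmi'])` or `iqr_outliers(df['charges'])`.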

### Let us first focus on the age feature.

In [None]:
plt.figure(figsize=(16,4))
ax1=plt.subplot(1,2,1)
ax2=plt.subplot(1,2,2)
ax1.set_title('histogram of age')
ax1.set_xlabel('age')
ax1.set_ylabel('count')
ax2.set_title('boxplot of age')
df['age'].hist(bins=15,ax=ax1)
df['age'].plot(kind='box', ax=ax2)

From the graphs above we see a high count for the ~20 year age group, but the mean age is just below 40.

### We will now investigate the sex feature.

In [None]:
plt.figure(figsize=(16,4))
ax1=plt.subplot(1,3,1)
ax2=plt.subplot(1,3,2)
ax3=plt.subplot(1,3,3)
ax1.set_title('charges per gender')
ax2.set_title('bmi per gender')
ax3.set_title('smokers per gender')
sns.boxplot(x='sex', y='charges', data=df, ax=ax1)
sns.boxplot(x='sex', y='bmi', data=df, ax=ax2)
sns.countplot(x='sex', hue='smoker', data=df, ax=ax3)

From the above plots we can see that the median values of both charges and bmi are very close for both genders (a box plot shows the median, not the mean).  However, the spread of charges, as shown by the height of the box, is a bit larger for the male gender.  This could just be an artifact of the dataset itself.

In [None]:
df.groupby('sex').count()
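
Box plots show medians and quartiles, so means and standard deviations are easier to compare from a direct summary. A minimal sketch, using a made-up miniature frame `toy` rather than the real data:

```python
import pandas as pd

# Hypothetical miniature of the dataset (values invented for illustration)
toy = pd.DataFrame({
    'sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'charges': [1000.0, 5000.0, 30000.0, 2000.0, 6000.0, 10000.0],
})

# Mean, median and standard deviation of charges for each gender
summary = toy.groupby('sex')['charges'].agg(['mean', 'median', 'std'])
print(summary)
```

The same call on the real frame would be `df.groupby('sex')['charges'].agg(['mean', 'median', 'std'])`.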

### Next we will investigate the BMI feature and its relation to some of the other features.

In [None]:
plt.figure(figsize=(16,4))
ax1=plt.subplot(1,2,1)
ax1.set_title('Distribution of bmi for both male and female genders')
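# histplot with kde=True replaces the deprecated distplot
sns.histplot(df[(df['sex']=='male')]['bmi'], kde=True, stat='density', ax=ax1, label='male bmi')
sns.histplot(df[(df['sex']=='female')]['bmi'], kde=True, stat='density', ax=ax1, label='female bmi')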
plt.legend()
ax2=plt.subplot(1,2,2)
ax2.set_title('Distribution of bmi for both smokers and non-smokers')
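# histplot with kde=True replaces the deprecated distplot
sns.histplot(df[(df['smoker']=='no')]['bmi'], kde=True, stat='density', ax=ax2, label='non-smoker bmi')
sns.histplot(df[(df['smoker']=='yes')]['bmi'], kde=True, stat='density', ax=ax2, label='smoker bmi')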
plt.legend()


In [None]:
plt.figure(figsize=(16,4))
ax1=plt.subplot(1,2,1)
plt.title("Charges per BMI value")
sns.regplot(x='bmi',y='charges',data=df, ax=ax1)
ax2=plt.subplot(1,2,2)
plt.xlabel("bmi")
plt.ylabel("charges")
plt.title("Charges per BMI value by smoker")
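# select the rows for each group instead of passing a boolean mask as the marker size
smokers = df[df['smoker']=='yes']
non_smokers = df[df['smoker']=='no']
plt.scatter(smokers['bmi'], smokers['charges'], c='red', label='smoker')
plt.scatter(non_smokers['bmi'], non_smokers['charges'], c='blue', label='non-smoker')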
plt.legend()

So we are seeing a slight increase in charges with an increase in BMI.  There also seems to be a rough linear separability between the charges for smokers and the charges for non-smokers: across all bmi values, charges are generally higher for a smoker than for a non-smoker.

In [None]:
df_bmi_age = df.groupby(['age'])['bmi'].mean()

In [None]:
plt.figure(figsize=(16,4))
ax = plt.subplot(1,2,1)
ax.set_title('Mean BMI value per year of age')
plt.ylabel('bmi')
df_bmi_age.plot(linewidth=3.3, color='black',ax=ax)
sns.regplot(x='age',y='bmi',data=df, x_estimator=np.mean, color='g', ax=ax)
ax1 = plt.subplot(1,2,2)
ax1.set_title('Mean BMI value per year of age for each gender')
plt.ylabel('bmi')
sns.regplot(x='age',y='bmi',data=df[df['sex']=='female'], x_estimator=np.mean, color='orange', label='female',ax=ax1)
sns.regplot(x='age',y='bmi',data=df[df['sex']=='male'], x_estimator=np.mean, color='b', label='male',ax=ax1)
plt.legend()

So we are seeing a slight increase in BMI with an increase in age.  And the male BMI is increasing at a slightly faster rate than the female BMI.
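
The slope of each regression line can also be read off as a number instead of judged by eye. The sketch below uses `np.polyfit` on synthetic stand-in data; the ages, the 0.05 slope and the noise level are invented for illustration and are not estimates from the insurance data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: BMI rises by 0.05 per year of age plus noise (invented numbers)
age = rng.integers(18, 65, size=500)
bmi = 28.0 + 0.05 * age + rng.normal(0.0, 2.0, size=500)

# A degree-1 polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(age, bmi, deg=1)
print("BMI change per year of age: {:0.3f}".format(slope))
```

Running the same fit separately on `df[df['sex']=='male']` and `df[df['sex']=='female']` would give the two slopes to compare.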

### We will now investigate the children feature and its relation to some of the other features.

In [None]:
plt.figure(figsize=(16,4))
ax1 = plt.subplot(1,2,1)
ax1.set_title('Count of individuals by the number of children they have')
sns.countplot(x='children', data=df, hue='smoker', ax=ax1)
ax2 = plt.subplot(1,2,2)
ax2.set_title('Age of individuals by the number of children they have')
sns.boxplot(x='children',y='age', hue='smoker', palette='Set3', data=df, ax=ax2)

### We will now investigate the region feature and its relation to some of the other features.

In [None]:
plt.figure(figsize=(16,4))
ax1 = plt.subplot(1,2,1)
plt.title("Distribution of Charges by Region")
for i in df['region'].unique():
    # kdeplot is the current equivalent of distplot(hist=False, kde=True)
    sns.kdeplot(df[(df['region']==i)]['charges'], label=i, ax=ax1)
ax1.legend()  # show which curve belongs to which region
ax2 = plt.subplot(1,2,2)
ax2.set_title('Distribution of Charges by Region comparing smokers and non-smokers')
sns.boxplot(x='charges',y='region',data=df, hue='smoker', ax=ax2)

One question remains open:  most of the regions show a small bump in the density of charges around 40000.  What is the reason for this bump?

The difference between the median charges of smokers and non-smokers is large in all regions (the box plot shows medians, not means).  The difference appears to be largest for the southeast region and smallest for the northeast region.
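
The gap between smokers and non-smokers per region can be tabulated rather than read from the box plot. A minimal sketch with a made-up frame `toy` (the charges below are invented and only mimic the shape of the real columns):

```python
import pandas as pd

# Hypothetical rows (invented values) with the same region/smoker/charges columns
toy = pd.DataFrame({
    'region': ['southeast', 'southeast', 'southeast', 'southeast',
               'northeast', 'northeast', 'northeast', 'northeast'],
    'smoker': ['yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no'],
    'charges': [36000.0, 34000.0, 8000.0, 6000.0, 30000.0, 28000.0, 9000.0, 7000.0],
})

# Median charges per region and smoker status, then the smoker minus non-smoker gap
medians = toy.pivot_table(index='region', columns='smoker', values='charges', aggfunc='median')
medians['gap'] = medians['yes'] - medians['no']
print(medians.sort_values('gap', ascending=False))
```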

### We will now investigate the charges feature.

In [None]:
plt.figure(figsize=(16,4))
plt.hist(df['charges'], bins='auto')
plt.xlabel("charges")
plt.title("Distribution of charges")

In [None]:
plt.figure(figsize=(16,4))
sns.boxplot(x='charges',y='smoker',data=df)
plt.title("Overall distribution of charges comparing smokers and non-smokers")

### Let us create a linear regression model to attempt to predict the charges based on all the other features.

In [None]:
sns.pairplot(df)

In [None]:
# sex, smoker and region are still strings here, so restrict to numeric columns
df.corr(numeric_only=True)

First we will change all categorical features to numeric features

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.sex.drop_duplicates()) 
df.sex = le.transform(df.sex)
le.fit(df.smoker.drop_duplicates()) 
df.smoker = le.transform(df.smoker)
le.fit(df.region.drop_duplicates()) 
df.region = le.transform(df.region)
# non-smoker (encoded 0) gets -1, smoker (encoded 1) gets +1
def smoker(x):
    if x==0:
        x = -1 # not a smoker
    else:
        x = 1 # a smoker
    return x
df['smoker'] = df['smoker'].apply(smoker)

In [None]:
df.head()
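
`LabelEncoder` assigns integer codes in sorted label order, which is why 'no' becomes 0 and 'yes' becomes 1 before the remapping to -1/+1. The sketch below checks this on a hypothetical four-row series; the names `smoker_raw` and `le_demo` are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical smoker column with the same two labels as the dataset
smoker_raw = pd.Series(['yes', 'no', 'no', 'yes'])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(smoker_raw)

# classes_ is sorted, so position 0 is 'no' and position 1 is 'yes'
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}

# Same remapping as in the notebook: 0 -> -1 (non-smoker), 1 -> +1 (smoker)
signed = [1 if c == 1 else -1 for c in codes]
print(mapping, signed)
```

Note that the region column receives arbitrary integer codes in the same way, which a linear model will treat as an ordered quantity.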

In [None]:
X = df.drop(['charges'], axis=1)
y = df['charges']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

In [None]:
# fixed seed so the split, and therefore the scores, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=1)

In [None]:
regr = LinearRegression()
regr.fit(X_train, y_train)

What is the R^2 score?

In [None]:
y_train_pred = regr.predict(X_train).ravel()
y_test_pred = regr.predict(X_test).ravel()

In [None]:
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))
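
For reference, the R^2 score reported by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2_manual` below is a hypothetical re-implementation for illustration, applied to invented numbers.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Invented numbers for illustration
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]
print("{:0.3f}".format(r2_manual(y_true_demo, y_pred_demo)))  # prints 0.900
```

A model that always predicts the mean of the targets scores 0, and a perfect model scores 1.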

In [None]:
predicted = cross_val_predict(regr, X, y, cv=5)

In [None]:
plt.figure(figsize=(16,8))
plt.scatter(y, predicted, edgecolors=(0,0,0))
plt.plot([y.min(),y.max()],[y.min(),y.max()],'k--',lw=4)
plt.xlabel("Measured Charges")
plt.ylabel("Predicted Charges")
plt.title("Cross-validated Prediction of Charges")

The fit is reasonable for a simple linear regression model, but the R^2 score is not great.  We will see if we can find a new feature that will improve the prediction accuracy.  First we will try the product of age, bmi and smoker.

In [None]:
df['age_bmi_smoker'] = df['age']*df['bmi']*df['smoker']

In [None]:
df.head()
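
To illustrate why a product feature can help, the sketch below fits a linear model with and without such a term on synthetic data. Everything in it is an assumption made for the demonstration: the data are random, and the target is constructed to contain a term proportional to age*bmi*smoker, which is not known to hold for the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 400

# Synthetic stand-in features (invented, not the insurance data)
age = rng.integers(18, 65, size=n).astype(float)
bmi = rng.normal(30.0, 5.0, size=n)
smoker = rng.choice([-1.0, 1.0], size=n)   # same -1/+1 coding as in the notebook

# Assumed ground truth: charges contain a term proportional to age*bmi*smoker
charges = 20000.0 + 200.0 * age + 5.0 * age * bmi * smoker + rng.normal(0.0, 500.0, size=n)

X_base = np.column_stack([age, bmi, smoker])
X_inter = np.column_stack([age, bmi, smoker, age * bmi * smoker])

# In-sample R^2 of each model
r2_base = r2_score(charges, LinearRegression().fit(X_base, charges).predict(X_base))
r2_inter = r2_score(charges, LinearRegression().fit(X_inter, charges).predict(X_inter))
print("R2 without product: {:0.3f}, with product: {:0.3f}".format(r2_base, r2_inter))
```

A linear model can only add up its inputs, so an effect that depends on several features jointly has to be supplied as its own column.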

In [None]:
X = df.drop(['charges'], axis=1)
y = df['charges']

In [None]:
# fixed seed so the split, and therefore the scores, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=1)

In [None]:
regr = LinearRegression()
regr.fit(X_train, y_train)

In [None]:
y_train_pred = regr.predict(X_train).ravel()
y_test_pred = regr.predict(X_test).ravel()

In [None]:
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))

Now we will try a Random Forest regression on this dataframe to see if that improves the accuracy.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
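# 'squared_error' is the current name of the criterion formerly called 'mse' (scikit-learn >= 1.0)
rf = RandomForestRegressor(n_estimators = 10, criterion = 'squared_error', max_depth = 5, random_state = 1, n_jobs = -1)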

In [None]:
rf.fit(X_train, y_train);

In [None]:
y_train_pred = rf.predict(X_train).ravel()
y_test_pred = rf.predict(X_test).ravel()

In [None]:
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))

R^2 seems to be a bit better using this type of regression, but we are probably overfitting with such a small dataset.  Let us run a 10-fold cross-validation to get a more honest picture of the prediction accuracy.

In [None]:
predicted = cross_val_predict(rf, X, y, cv=10)
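
`cross_val_predict` returns the out-of-fold predictions. If a single number per fold is wanted, `cross_val_score` gives one R^2 value per fold. The sketch below runs it on synthetic stand-in data; `X_demo`, `y_demo` and `rf_demo` are hypothetical names, and the scores say nothing about the insurance data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic stand-in data (invented): 300 rows, 3 features, noisy linear target
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0.0, 0.5, size=300)

rf_demo = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=1)

# One R^2 value per fold; the spread hints at how stable the score is
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=10, scoring='r2')
print("R2 per fold: mean {:0.3f}, std {:0.3f}".format(scores.mean(), scores.std()))
```

In the notebook the equivalent call would be `cross_val_score(rf, X, y, cv=10, scoring='r2')`.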

In [None]:
plt.figure(figsize=(16,8))
plt.scatter(y, predicted, edgecolors=(0,0,0))
plt.plot([y.min(),y.max()],[y.min(),y.max()],'k--',lw=4)
plt.xlabel("Measured Charges")
plt.ylabel("Predicted Charges")
plt.title("Cross-validated Prediction accuracy of Charges")

From the above graph it seems we can make predictions pretty close to the actual values when the charges are at 10000 or below.  The earlier histogram of charges also shows that most charges fall below 10000.
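
The visual impression can be backed by a number, for example the mean absolute error below and above the 10000 mark. The helper `mae_by_bucket` and the values fed to it are hypothetical and only show the calculation.

```python
import numpy as np

def mae_by_bucket(y_true, y_pred, threshold=10000.0):
    """Mean absolute error for targets at or below the threshold and above it."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    low = y_true <= threshold
    return abs_err[low].mean(), abs_err[~low].mean()

# Invented values for illustration
y_true_demo = [2000.0, 8000.0, 9500.0, 20000.0, 40000.0]
y_pred_demo = [2500.0, 7000.0, 9000.0, 26000.0, 33000.0]
mae_low, mae_high = mae_by_bucket(y_true_demo, y_pred_demo)
print("MAE <= 10000: {:0.1f}, MAE > 10000: {:0.1f}".format(mae_low, mae_high))
```

With the notebook's variables this would be called as `mae_by_bucket(y, predicted)`.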