## Objective:
Hi! Welcome to this kernel. The data comes from 'Medical Cost Personal Datasets'. Here, I am interested in finding any correlation of medical charges with features listed in the dataset. Read along if you are curious about this exploration.  

### Let's begin!
We start our analysis by importing the necessary modules and loading the data.

In [None]:
%matplotlib inline
import os
import warnings

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 500)

# List the available input files, then load the dataset
print(os.listdir("../input"))
data = pd.read_csv('../input/insurance.csv')

Let's do some exploratory data analysis. Here I look at a subset of the data with the `head` method and check for missing data with the `info` method. The columns are age, sex, bmi, children, smoker, region and charges. 'Age' is the age of the insured (primary beneficiary). 'Sex' is the gender of the beneficiary. 'BMI' is the body mass index, obtained by dividing weight in kilograms by the square of height in meters. 'Children' is the number of kids of the primary beneficiary. 'Smoker' indicates whether the person smokes. 'Region' is the beneficiary's residence in one of the 4 US regions: northeast, southeast, southwest, northwest. 'Charges' are the medical charges.
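
As a small illustration of the BMI definition above, here is a minimal sketch. The function name `bmi` and the example weight and height are hypothetical and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))  # 22.86
```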

In [None]:
print(data.head())

In [None]:
print(data.info())

Cool! It seems we have no missing values in this dataset. 
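
Besides reading the output of `info`, missing values can also be counted directly. The sketch below uses a tiny made-up frame (`toy`, with one deliberately missing BMI) rather than the insurance data, so the counts it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; one BMI value is deliberately missing
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, np.nan, 30.1],
    "smoker": ["yes", "no", "no"],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("total missing:", total_missing)
```

On the real data, the same call would be `data.isna().sum()`.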

Next, I check the summary statistics of the data. The primary beneficiaries' ages run from 18 to 64 years, with BMI ranging from 15.9 to 53.13 and with 0 to 5 kids. The charges incurred range from $1121.87 to $63770.

In [None]:
print(data.describe())

### Time to check some plots
The obvious first question is how age and medical charges are related, which I check in the plot below, together with whether sex has any influence on that relationship. The plot shows three prominent, roughly parallel bands, and in each of them charges increase with age.

In [None]:
sns.relplot(x='age', y='charges', hue= 'sex', data=data, palette='husl')
plt.title('Effect of Age on Charges')

Since no gender-specific pattern is clearly distinguishable in the plot above, I plot the two sexes separately. A similar pattern is observed in both.

In [None]:
sns.relplot(x='age', y='charges', col='sex',data=data, palette='husl')

Next, I include one of the most obvious candidate drivers of medical charges: smoking. As the plot below shows, charges rise with age, and smokers appear to show a bigger increase than non-smokers. Age and smoking therefore seem to have a **synergistic relationship**: with increasing age, smokers show higher charges than non-smokers.

In [None]:
sns.relplot(x='age', y='charges', hue='smoker', style= 'sex', data=data, palette='husl')
plt.title('Combined effect of Age and Smoking on Charges')

Next, I check the relationship between charges and age for smokers and non-smokers, separately for the two genders. As the lmplot below shows, non-smokers have lower charges than smokers across the age range for both genders.

In [None]:
sns.lmplot(x='age', y='charges', hue='smoker', col='sex',data=data, palette='husl')
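
`lmplot` draws one regression line per `hue` group. One way to put numbers on those lines is to fit a straight line per group and compare the slopes: similar slopes point to an additive effect, while clearly different slopes hint at an interaction. The sketch below does this on synthetic data that I constructed to have different slopes (100 and 300 per year of age), so its output illustrates the technique only and is not a result from the insurance data.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free data: charges grow linearly with age in each group
ages = np.arange(20, 65, 5)
toy = pd.concat(
    [
        pd.DataFrame({"age": ages, "smoker": "no", "charges": 1000.0 + 100.0 * ages}),
        pd.DataFrame({"age": ages, "smoker": "yes", "charges": 20000.0 + 300.0 * ages}),
    ],
    ignore_index=True,
)

# Fit charges = slope * age + intercept separately for each group
slopes = {}
for label, group in toy.groupby("smoker"):
    slope, intercept = np.polyfit(group["age"], group["charges"], deg=1)
    slopes[label] = slope

print(slopes)
```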

The relationship between smoking status and charges for both genders becomes clearer in the violin plot shown below. For both genders, most non-smokers have charges below $20,000, while most smokers have higher charges.

In [None]:
sns.violinplot(x="sex", y='charges', hue="smoker", data=data, palette='Dark2')
plt.title('Effect of Smoking on Charges of males and females')

I am also interested in the number of male and female smokers and non-smokers, as well as their mean charges. The data contains 547 female and 517 male non-smokers, and 115 female and 159 male smokers. The mean charges are 8762, 8087, 30678 and 33042 for non-smoking females, non-smoking males, smoking females and smoking males, respectively. These numbers are shown in the two bar plots that follow the table.

In [None]:
data_grouped=data.groupby(['smoker', 'sex']).agg({'charges':'sum','sex':'count'})
data_grouped['mean_charges']= data_grouped['charges']/data_grouped['sex']
data_grouped=data_grouped.rename(columns={'sex':'number_in_gender'})
data_grouped.index=[0,1,2,3]
data_grouped['smoker']=['no','no','yes','yes']
data_grouped['sex']=['female','male','female','male']
data_grouped=data_grouped[['smoker', 'sex','number_in_gender','charges','mean_charges']]
data_grouped
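
A similar summary table can be built in one step with pandas named aggregation, which avoids rebuilding the index and columns by hand. This is a sketch on a small made-up frame (`toy`); the column names `number_in_gender` and `mean_charges` mirror the ones used above, while `total_charges` is a name I chose for illustration.

```python
import pandas as pd

# Made-up rows, not taken from the insurance data
toy = pd.DataFrame({
    "smoker": ["no", "no", "yes", "yes", "no", "yes"],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "charges": [1000.0, 2000.0, 30000.0, 40000.0, 3000.0, 20000.0],
})

# One row per (smoker, sex) pair with count, sum and mean of charges
summary = toy.groupby(["smoker", "sex"], as_index=False).agg(
    number_in_gender=("charges", "count"),
    total_charges=("charges", "sum"),
    mean_charges=("charges", "mean"),
)
print(summary)
```

Because `as_index=False` keeps `smoker` and `sex` as ordinary columns, the result can be passed to `sns.catplot` directly.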

In [None]:
sns.catplot(x='sex',y='mean_charges',hue='smoker',kind='bar',data=data_grouped, palette='Dark2')

In [None]:
sns.catplot(x='sex',y='number_in_gender',hue='smoker',kind='bar',data=data_grouped, palette='Dark2')

Next, let's check whether BMI has any influence on medical charges. As explained above, BMI is the body mass index. A higher BMI is correlated with higher body fat and thus also with metabolic diseases, so BMI serves as a measure of body weight status. For adults, a BMI below 18.5 is considered underweight, 18.5 to 24.9 normal, 25.0 to 29.9 overweight, and 30 or above obese.
Therefore, I am interested in checking whether a higher BMI is correlated with higher medical charges.
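
The weight-status bands above can be written as a small helper. This is a sketch: the function name `bmi_category` is hypothetical, and I assume half-open intervals (for example, 24.95 counts as normal), since the text only lists values to one decimal place.

```python
def bmi_category(bmi: float) -> str:
    """Map an adult BMI value to the weight-status band described above."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


for value in (17.0, 22.0, 27.5, 30.0):
    print(value, bmi_category(value))
```

On the real data, such a helper could be applied with `data['bmi'].apply(bmi_category)` to compare charges across bands.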

In the first scatterplot (shown below), the influence of BMI on charges is not clearly visible. However, it becomes clear once the 'smoker' feature is included (second scatterplot below): an increasing BMI combined with smoking leads to higher charges than a high BMI in a non-smoker. This indicates a probable synergistic relationship between smoking and BMI.

In [None]:
sns.relplot(x='bmi',y='charges',style='sex',data=data)
plt.title('Effect of BMI on Charges')

As indicated in the scatter plot below, obesity combined with smoking can lead to higher medical expenses.

In [None]:
sns.relplot(x='bmi',y='charges',hue='smoker',style='sex',data=data)

Here, I check the combined influence of BMI and smoking separately for each gender. Males and females show a similar pattern: smoking combined with an increasing BMI leads to an increase in medical charges.

In [None]:
sns.lmplot(x='bmi',y='charges',hue='smoker', col='sex',data=data)

To look for correlations between the features, I check the pairplot (shown below). One peculiar observation is that parents of 5 kids incur lower medical expenses.

In [None]:
sns.pairplot(data, vars= ['age','bmi','children','charges'], hue='sex')

To dig deeper into this observation, I check the charges of fathers and mothers. As in the previous plot, the boxplot below shows that moms and dads of 5 kids are indeed billed less: their charges are lower than those of the other groups.

In [None]:
sns.catplot(x="children", y='charges', hue='sex', kind='box',data=data, palette= 'Accent')
plt.title('Charges vs number of children')

Also, as expected, parents who smoke have higher medical expenses, as seen in the boxplot below.

In [None]:
sns.catplot(x="children", y='charges', hue='smoker', kind='box',data=data , palette= 'Paired')

Here, I compute the mean and median charges for parents by number of kids. The mean charge is lower for parents of 5 kids, while the median charge is low for parents of both 1 kid and 5 kids.
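
The mean and the median can tell different stories when a few very large bills are present, because the mean is pulled up by them while the median is not. The sketch below shows this on a made-up list of five charges (the values are hypothetical, not from the dataset).

```python
from statistics import mean, median

# Made-up charges: four modest bills and one very large one
charges = [1000.0, 1200.0, 1500.0, 2000.0, 40000.0]

mean_charge = mean(charges)      # pulled up by the single large bill
median_charge = median(charges)  # the middle value, unaffected by it

print(mean_charge, median_charge)  # 9140.0 1500.0
```

This is also why the boxplots above, which mark the median, can rank groups differently from a table of means.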

In [None]:
data_grouped2=data.groupby('children').agg({'charges':'sum','sex':'count'})
data_grouped2['mean_charges2']=data_grouped2['charges']/data_grouped2['sex']
data_grouped2['median_charges']=data.groupby('children')['charges'].median()
data_grouped2



I am curious about the single line shown for parents of 5 kids in the previous boxplot of charges for smoking and non-smoking parents. To confirm this observation and to see what the underlying data looks like, I made the following table.

In [None]:
data_grouped3=data.groupby(['children','sex','smoker']).agg({'sex':'count', 'charges':'sum'})
data_grouped3['mean_charges2']=data_grouped3['charges']/data_grouped3['sex']
data_grouped3

One feature that has not yet been looked at is 'region'. I check whether it has any influence on the charges. Charges are slightly higher for people living in the southeast than in the other regions (as shown in the violin plot below).

In [None]:
sns.violinplot(x="region", y='charges', data=data)

When each region is split into smoker and non-smoker groups, the southeast again shows higher charges.

In [None]:
sns.violinplot(x="region", y='charges', hue="smoker", data=data)

To see how the data looks when grouped by region, I prepared the following tables. They confirm that mean charges for the southeast are slightly higher than for the other regions.

In [None]:
data_grouped4=data.groupby('region').agg({'charges':'sum','sex':'count'})
data_grouped4['mean_charges3']=data_grouped4['charges']/data_grouped4['sex']
data_grouped4

In [None]:
data_grouped5=data.groupby(['region','smoker']).agg({'sex':'count', 'charges':'sum'})
data_grouped5['mean_charges']=data_grouped5['charges']/data_grouped5['sex']
data_grouped5
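
For a two-way comparison like region by smoker, a pivot table puts the group means side by side, which can be easier to scan than the stacked index above. This is a sketch on made-up rows (`toy`); the numbers are hypothetical and only demonstrate the layout.

```python
import pandas as pd

# Made-up rows, not taken from the insurance data
toy = pd.DataFrame({
    "region": ["southeast", "southeast", "northwest", "northwest", "southeast"],
    "smoker": ["yes", "no", "yes", "no", "yes"],
    "charges": [30000.0, 5000.0, 28000.0, 4000.0, 34000.0],
})

# Rows are regions, columns are smoker status, cells are mean charges
mean_table = toy.pivot_table(
    values="charges", index="region", columns="smoker", aggfunc="mean"
)
print(mean_table)
```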

### Predictive modelling
Before working on our model, we need to do some data preprocessing. First, let's check which features are categorical. Looking at the dtypes (as done below), we find that 3 of the 7 columns have the 'object' dtype: sex, smoker and region. To use them for predictive modelling, they need to be encoded. For features with just two values (smoker and sex) I use Label Encoder, and for region I use one-hot encoding. One-hot encoding is preferable to label encoding when a categorical feature has more than two unordered values, because integer labels would imply an ordering that does not exist. Below, I encode these 3 variables and check the results.
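
To make the difference between the two encodings concrete, here is a pandas-only sketch on a three-row made-up frame. It swaps `LabelEncoder` for an explicit `map` (with the same alphabetical coding, no=0 and yes=1) so that the mapping is visible; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up rows, not taken from the insurance data
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

encoded = toy.copy()

# Binary feature: a single 0/1 column is enough
encoded["smoker"] = encoded["smoker"].map({"no": 0, "yes": 1})

# Multi-valued feature: one indicator column per category
encoded = pd.get_dummies(encoded, columns=["region"], prefix="region", dtype=int)

print(encoded)
```

Each row has exactly one `region_*` column set to 1, so no region is treated as larger or smaller than another.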

In [None]:
print(data.dtypes)

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
encoder.fit(data['sex'].drop_duplicates())
data['sex']=encoder.transform(data['sex'])
encoder.fit(data['smoker'].drop_duplicates())
data['smoker']=encoder.transform(data['smoker'])
data1=pd.get_dummies(data['region'], prefix='region')
data= pd.concat([data,data1], axis=1).drop(['region'],axis=1)
print(data.head(2))
print(data.dtypes)

Let's begin by building a simple Linear Regression model. Model quality is measured by the Root Mean Square Error (RMSE).
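
RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the charges. A minimal pure-Python sketch follows; the helper name `rmse` and the example numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one value is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 2 and 0, so the mean squared error is 2
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

The cell below computes the same quantity with `np.sqrt(mean_squared_error(...))`.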

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
y= data['charges']
X = data.drop(['charges'], axis=1)
lin_reg=LinearRegression()
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25, random_state=21)
lin_reg.fit(train_X,train_y)
pred_y=lin_reg.predict(test_X)
rmse = np.sqrt(mean_squared_error(test_y, pred_y))
print("RMSE: %f" % (rmse))

Next, we move to XGBoost, which shows a lower RMSE than the previous model. We also identify the important features using `plot_importance`.

In [None]:
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from xgboost import plot_importance
import numpy as np

y= data['charges']
X = data.drop(['charges'], axis=1)
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25, random_state=21)
train_X = pd.DataFrame(data=train_X, columns=X.columns)
test_X = pd.DataFrame(data=test_X, columns=X.columns)

model_x = XGBRegressor(n_estimators=1000, learning_rate=0.05)

model_x.fit(train_X, train_y, early_stopping_rounds=5,eval_set=[(test_X, test_y)], verbose=False)
predictions = model_x.predict(test_X)

rmse = np.sqrt(mean_squared_error(test_y, predictions))
print("RMSE: %f" % (rmse))
plot_importance(model_x)
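
In the cell above, the same test rows are used both for early stopping (`eval_set`) and for the reported RMSE, which can make the score slightly optimistic. One common alternative is to carve out a separate validation set for early stopping. The sketch below shows only the splitting step on placeholder arrays; the 60/15/25 proportions and the names `X_val` and `y_val` are my assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the features and the target
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float)

# First hold out the test set, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=21
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=21
)

print(len(X_train), len(X_val), len(X_test))  # 60 15 25
```

With this split, `eval_set=[(X_val, y_val)]` would drive early stopping and the test set would be used only for the final RMSE.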

Further, I remove the following less important features: 'sex', 'region_northeast', 'region_northwest', 'region_southeast' and 'region_southwest', and check the RMSE again. The RMSE drops slightly, from 4793 to 4781, suggesting that age, bmi, smoker and children are the important features.

In [None]:
y= data['charges']
X = data.drop(['charges','sex','region_northeast','region_northwest','region_southeast','region_southwest'], axis=1)
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25, random_state=21)
train_X = pd.DataFrame(data=train_X, columns=X.columns)
test_X = pd.DataFrame(data=test_X, columns=X.columns)

model_x = XGBRegressor(n_estimators=1000, learning_rate=0.05)

model_x.fit(train_X, train_y, early_stopping_rounds=5,eval_set=[(test_X, test_y)], verbose=False)
predictions = model_x.predict(test_X)

rmse = np.sqrt(mean_squared_error(test_y, predictions))
print("RMSE: %f" % (rmse))


Additionally, if I drop the 'children' feature, the RMSE reduces further. It's worth mentioning that if I drop any of the other three features, the RMSE increases substantially, suggesting that they are the key predictors of charges.

In [None]:
y= data['charges']
X = data.drop(['charges','sex','region_northeast','region_northwest','region_southeast','region_southwest','children'], axis=1)
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25, random_state=21)
model_x = XGBRegressor(n_estimators=1000, learning_rate=0.05)
model_x.fit(train_X, train_y, early_stopping_rounds=5,eval_set=[(test_X, test_y)], verbose=False)
predictions = model_x.predict(test_X)
rmse = np.sqrt(mean_squared_error(test_y, predictions))
print("RMSE: %f" % (rmse))
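
The drop-one-feature checks above can be organised as a small loop. The sketch below illustrates the idea on synthetic data, and it swaps XGBoost for an ordinary least-squares fit with NumPy to stay short. The feature names `signal_a`, `signal_b` and `noise` are hypothetical: the target is built from the two signal features only, so removing `noise` should barely change the held-out RMSE while removing a signal feature should raise it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features; the target depends on the two signal features only
features = {
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
}
y = 3.0 * features["signal_a"] + 2.0 * features["signal_b"]

train, test = slice(0, 150), slice(150, n)


def holdout_rmse(names):
    """Fit least squares on the training rows and score on the held-out rows."""
    X = np.column_stack([features[name] for name in names] + [np.ones(n)])
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    residuals = y[test] - X[test] @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


all_names = list(features)
baseline = holdout_rmse(all_names)

# RMSE after removing each feature in turn
ablation = {
    name: holdout_rmse([f for f in all_names if f != name]) for name in all_names
}
print(baseline, ablation)
```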

### Conclusion
The key predictors of medical charges in this dataset are age, BMI and smoking. Although we have no control over aging, we can try to take care of the other two.
I hope you enjoyed this dataset exploration!

Thanks for reading. 
Any suggestions are highly appreciated.
