<img src="https://media3.giphy.com/media/L3nWlmgyqCeU8/giphy.gif" width="250px"/>

Hi ~ This is where we will use machine learning techniques (it's simple, I promise! ><) to predict and learn about the medical costs that customers incur on their health insurance policies.
We will look into three methods to play around with the data! We will explore only the less sensitive information, such as BMI, region, age, number of children and smoking habit.


We will explore three techniques:
* Linear Regression
* K-means Clustering
* Decision Tree

Will they work? Let's see!

In [None]:

import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as pl
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('../input/insurance/insurance.csv')


In [None]:
data.describe()

In [None]:
data.head()

-**age**: age of primary beneficiary

-**sex**: insurance contractor gender, female, male

-**bmi**: Body mass index, an objective index of whether body weight is relatively high or low for a given height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9

-**children**: Number of children covered by health insurance / Number of dependents

-**smoker**: whether the beneficiary smokes (yes / no)

-**region**: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

-**charges**: Individual medical costs billed by health insurance
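
The bmi column above follows the usual kg / m ^ 2 definition. As a small sketch (the function names and the example person are made up for illustration, and the 18.5 / 25 / 30 cut-offs are the commonly quoted bands), it can be computed like this:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Rough bands using the commonly quoted 18.5, 25 and 30 cut-offs."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"


example = bmi(70, 1.75)   # hypothetical person: 70 kg, 1.75 m
print(round(example, 1), bmi_category(example))
```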




Since we are primarily interested in the costs, let's see which features are most correlated with charges. To start, we will encode the categorical features as numbers, e.g. smoker can be encoded as 0: non-smoker and 1: smoker. This is called **label encoding** (one hot encoding would instead create a separate 0/1 column for each category).
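
To make the difference between integer (label) encoding and one hot encoding concrete, here is a minimal sketch on a tiny made-up frame (the `toy` data is an assumption, not part of the insurance file):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"smoker": ["yes", "no", "no"],
                    "region": ["southwest", "northeast", "southeast"]})

# Integer (label) encoding: one column, classes numbered in alphabetical order.
smoker_codes = LabelEncoder().fit_transform(toy["smoker"])   # no -> 0, yes -> 1

# One hot encoding: one 0/1 indicator column per category.
region_onehot = pd.get_dummies(toy["region"], prefix="region", dtype=int)

print(smoker_codes)
print(region_onehot)
```

Label encoding gives an unordered column like region an artificial order (0 < 1 < 2 < 3), which is worth keeping in mind for linear models; one hot encoding avoids that at the price of more columns.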


In [None]:
from sklearn.preprocessing import LabelEncoder
#sex
le = LabelEncoder()
le.fit(data.sex.drop_duplicates()) 
data.sex = le.transform(data.sex)
# smoker or not
le.fit(data.smoker.drop_duplicates()) 
data.smoker = le.transform(data.smoker)
#region
le.fit(data.region.drop_duplicates()) 
data.region = le.transform(data.region)

Let's look at the correlation between charges and other beneficiary information
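
As a rough sketch of what that correlation number measures, the Pearson coefficient can be computed by hand and compared with pandas. The data below is synthetic (made-up numbers, not the insurance file) and the `pearson` helper is only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"smoker": rng.integers(0, 2, size=200)})
# Synthetic charges: a large jump for smokers plus noise (numbers are made up).
toy["charges"] = 8000 + 20000 * toy["smoker"] + rng.normal(0, 3000, size=200)


def pearson(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())


manual = pearson(toy["smoker"], toy["charges"])
builtin = toy.corr()["charges"]["smoker"]
print(round(manual, 4), round(builtin, 4))
```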

In [None]:
data.corr()['charges'].sort_values(ascending=False)

In [None]:
f, ax = pl.subplots(figsize=(10, 8))
corr = data.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(240,10,as_cmap=True),
            square=True, ax=ax)

A strong correlation between smoking and our patients' charges. Well, let's investigate smoking in more detail.
![![image.png](attachment:image.png)](https://media.giphy.com/media/l0ExdMHUDKteztyfe/giphy.gif)

In [None]:
f= pl.figure(figsize=(12,5))

ax=f.add_subplot(121)
sns.distplot(data["charges"],color='r',ax=ax)
ax.set_title('Distribution of charges for smokers & non-smokers')

In [None]:
f= pl.figure(figsize=(12,5))

ax=f.add_subplot(121)
sns.distplot(data[(data.smoker == 1)]["charges"],color='c',ax=ax)
ax.set_title('Distribution of charges for smokers')

ax=f.add_subplot(122)
sns.distplot(data[(data.smoker == 0)]['charges'],color='b',ax=ax)
ax.set_title('Distribution of charges for non-smokers')

Smoking patients spend more on treatment. But it looks like there are more non-smoking patients overall. Let's check.

In [None]:

sns.catplot(x="smoker", kind="count",hue = 'sex', palette="pink", data=data)

Please note that women are coded as "0" and men as "1". We can see that there are more male smokers than female smokers. Given the impact of smoking, we can assume that the total cost of treatment for men will be higher than for women. Maybe we'll check that later.
And now some more useful visualizations.

In [None]:
sns.catplot(x="sex", y="charges", hue="smoker",
            kind="violin", data=data, palette = 'magma')

It seems that, regardless of gender, medical cost is strongly related to smoking habit, with similar patterns in the male and female groups.
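
The same comparison can be put into numbers with a group-by. This is only a sketch on a tiny made-up frame (the values are invented); on the real data the same `groupby` call would run on `data`:

```python
import pandas as pd

toy = pd.DataFrame({
    "sex":     [0, 0, 0, 1, 1, 1],          # 0 = female, 1 = male
    "smoker":  [0, 0, 1, 0, 1, 1],          # 0 = non-smoker, 1 = smoker
    "charges": [4000, 6000, 30000, 5000, 28000, 34000],   # made-up values
})

# Number of people and median charges for every sex / smoker combination.
summary = toy.groupby(["sex", "smoker"])["charges"].agg(["count", "median"])
print(summary)
```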

In [None]:
pl.figure(figsize=(12,5))
pl.title("Box plot for charges of women")
sns.boxplot(y="smoker", x="charges", data =  data[(data.sex == 0)] , orient="h", palette = 'magma')  # sex == 0 is female

In [None]:
pl.figure(figsize=(12,5))
pl.title("Box plot for charges of men")
sns.boxplot(y="smoker", x="charges", data =  data[(data.sex == 1)] , orient="h", palette = 'rainbow')  # sex == 1 is male

Now let's pay attention to the age of the patients. Let's look at which ages are most common in our data set, and then at how age affects the cost of treatment.

In [None]:
pl.figure(figsize=(12,5))
pl.title("Distribution of age")
ax = sns.distplot(data["age"], color = 'g')

We have patients under 20 in our data set. I'm 18 years old, which is also the minimum age of patients in our set. The maximum age is 64.
My personal interest is whether there are smokers among the 18-year-old patients.

18 is a very young age. Does smoking affect the cost of treatment at this age?


In [None]:
pl.figure(figsize=(12,5))
pl.title("Box plot for charges of 18-year-old smokers and non-smokers")
sns.boxplot(y="smoker", x="charges", data = data[(data.age == 18)] , orient="h", palette = 'pink')

![![image.png](attachment:image.png)](https://media.giphy.com/media/Nm8ZPAGOwZUQM/giphy.gif)

Oh. As we can see, even at the age of 18 smokers spend much more on treatment than non-smokers. Among non-smokers we see some "tails". I assume these are due to serious diseases or accidents.
Now let's see how the cost of treatment depends on age for smokers and non-smokers.

In [None]:
g = sns.jointplot(x="age", y="charges", data = data[(data.smoker == 0)],kind="kde", color="m")
g.plot_joint(pl.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("age", "charges")
# Title the joint plot itself; `ax` would point at an axes from an earlier cell
g.fig.suptitle('Distribution of charges and age for non-smokers', y=1.02)

In [None]:
g = sns.jointplot(x="age", y="charges", data = data[(data.smoker == 1)],kind="kde", color="c")
g.plot_joint(pl.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("age", "charges")
# Title the joint plot itself; `ax` would point at an axes from an earlier cell
g.fig.suptitle('Distribution of charges and age for smokers', y=1.02)

Hmm. What are these two clusters in the smokers' group, whose shape differs from the non-smokers'? (We will explore this later.)

In [None]:
#non - smokers
from bokeh.plotting import figure, show, output_file
p = figure(width=500, height=450)  # plot_width / plot_height were removed in Bokeh 3
p.circle(x=data[(data.smoker == 0)].age,y=data[(data.smoker == 0)].charges, size=7, line_color="navy", fill_color="pink", fill_alpha=0.9)
show(p)

In [None]:
#smokers
p = figure(width=500, height=450)  # plot_width / plot_height were removed in Bokeh 3
p.circle(x=data[(data.smoker == 1)].age,y=data[(data.smoker == 1)].charges, size=7, line_color="navy", fill_color="red", fill_alpha=0.9)
show(p)

In [None]:
# `size` was renamed to `height` in newer seaborn; lmplot returns a FacetGrid
g = sns.lmplot(x="age", y="charges", hue="smoker", data=data, palette = 'inferno_r', height = 7)
g.ax.set_title('Smokers and non-smokers')

In non-smokers, the cost of treatment increases with age. That makes sense. So take care of your health, friends! In smokers, we do not see such a clear dependence.
I think this is due not only to smoking but also to the peculiarities of the dataset. Such a strong effect of smoking on the cost of treatment would be easier to judge with a data set that has more records and features.
But we work with what we have!
Let's pay attention to BMI. Are we on a diet for nothing?
![![image.png](attachment:image.png)](https://media.giphy.com/media/co9IXVaipZ0Yg/giphy.gif)

In [None]:
pl.figure(figsize=(12,5))
pl.title("Distribution of bmi")
ax = sns.distplot(data["bmi"], color = 'green')

There's something insanely beautiful about this distribution, isn't there?  
The average BMI in patients is 30. How about yours?
![![image.png](attachment:image.png)](http://1j4g1pasf991x0osxuqz6d10.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/BMI-CHART-1024x791.png)
Obesity starts at a value of 30. I also calculated my BMI and now I can safely drink bubble tea today. Let's start exploring!
First, let's look at the distribution of costs in patients with a BMI of 30 or more and in patients with a BMI below 30.

![![image.png](attachment:image.png)](https://media.giphy.com/media/7vUwiZ4MJRNDy/giphy.gif)

In [None]:
pl.figure(figsize=(12,5))
pl.title("Distribution of charges for patients with BMI of 30 or more")
ax = sns.distplot(data[(data.bmi >= 30)]['charges'], color = 'm')

In [None]:
pl.figure(figsize=(12,5))
pl.title("Distribution of charges for patients with BMI less than 30")
ax = sns.distplot(data[(data.bmi < 30)]['charges'], color = 'b')

Patients with BMI above 30 spend more on treatment!

In [None]:
g = sns.jointplot(x="bmi", y="charges", data = data,kind="kde", color="r")
g.plot_joint(pl.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("bmi", "charges")
# Title the joint plot itself; `ax` would point at an axes from an earlier cell
g.fig.suptitle('Distribution of bmi and charges', y=1.02)


In [None]:
pl.figure(figsize=(10,6))
ax = sns.scatterplot(x='bmi',y='charges',data=data,palette='magma',hue='smoker')
ax.set_title('Scatter plot of charges and bmi')

sns.lmplot(x="bmi", y="charges", hue="smoker", data=data, palette = 'magma', height = 8)  # `size` was renamed to `height`

Still, smoking shows a strong correlation with charges across insurance beneficiaries.

Let's pay attention to children. First, let's see how many children our customers have.


In [None]:
sns.catplot(x="children", kind="count", palette="ch:.25", data=data, height = 6)  # `size` was renamed to `height`

Most patients do not have children. It's wonderful that some have 5 children! Children are happiness :)
I wonder whether people who have children smoke.

In [None]:
# `size` was renamed to `height`; catplot returns a FacetGrid, so set the title on its axes
g = sns.catplot(x="smoker", kind="count", palette="rainbow",hue = "sex",
            data=data[(data.children > 0)], height = 6)
g.ax.set_title('Smokers and non-smokers who have children')

Oh oh oh.....
![![image.png](attachment:image.png)](https://www.az-jenata.bg/media/az-jenata/files/galleries/640x480/4c0373972cdd156a2e2c008dc5c0a93a.jpg)
But I am glad that there are many more non-smoking parents!

Now we are going to predict the cost of treatment.
Let's start with the usual linear regression.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.ensemble import RandomForestRegressor

In [None]:
x = data.drop(['charges'], axis = 1)
y = data.charges

x_train,x_test,y_train,y_test = train_test_split(x,y, random_state = 0)
lr = LinearRegression().fit(x_train,y_train)

y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

print(lr.score(x_test,y_test))

Not bad for such a lazy implementation, even without data normalization :D
After all, the data will not always be so "good", so don't forget to pre-process it.
I'll show you all this later when I try to implement my own linear regression. So please don't be mad at me :)
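
A quick aside on the number printed by `score`: for regressors in scikit-learn it is the coefficient of determination, R². The sketch below recomputes it by hand on synthetic data (the "age" and "charges" values are made up) to show what goes into it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(18, 64, size=(100, 1))               # synthetic "age" column
y_demo = 250 * X_demo[:, 0] + rng.normal(0, 2000, 100)    # synthetic charges

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)             # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))     # the two values agree
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.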
Now let's add polynomial features and look at the result.

In [None]:
X = data.drop(['charges','region'], axis = 1)
Y = data.charges



quad = PolynomialFeatures (degree = 2)
x_quad = quad.fit_transform(X)

X_train,X_test,Y_train,Y_test = train_test_split(x_quad,Y, random_state = 0)

plr = LinearRegression().fit(X_train,Y_train)

Y_train_pred = plr.predict(X_train)
Y_test_pred = plr.predict(X_test)

print(plr.score(X_test,Y_test))

In [None]:
poly_names=quad.get_feature_names_out(X.columns)  # get_feature_names was removed in newer scikit-learn
poly_coef=plr.coef_
poly_function=pd.DataFrame({"name":poly_names,"coef":poly_coef})

Already good. Our model predicts the cost of treatment well. I think we could have limited ourselves to two or three polynomial features, but the data set is so small that we took the easy way.
And finally, let's try RandomForestRegressor. I've never used this algorithm in regression analysis.

In [None]:
forest = RandomForestRegressor(n_estimators = 100,
                              criterion = 'squared_error',  # called 'mse' in older scikit-learn
                              random_state = 1,
                              n_jobs = -1)
forest.fit(x_train,y_train)
forest_train_pred = forest.predict(x_train)
forest_test_pred = forest.predict(x_test)

print('MSE train data: %.3f, MSE test data: %.3f' % (
mean_squared_error(y_train,forest_train_pred),
mean_squared_error(y_test,forest_test_pred)))
print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,forest_train_pred),
r2_score(y_test,forest_test_pred)))

Good result. But we see noticeable overfitting: there is a clear gap between the training and test scores.
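
One common way to shrink such a gap is to limit how deep the trees can grow. The sketch below uses synthetic data (not the insurance file), and `max_depth=3` is just an assumed value for illustration, not a tuned setting:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(400, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 2, size=400)   # signal plus noise

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

gaps = {}
for depth in (None, 3):   # None = fully grown trees, 3 = shallow trees
    rf = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=1)
    rf.fit(Xtr, ytr)
    train_r2, test_r2 = rf.score(Xtr, ytr), rf.score(Xte, yte)
    gaps[depth] = train_r2 - test_r2   # train/test gap as a rough overfitting signal
    print(f"max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```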

Now let's look at KMeans. This algorithm clusters similar data points into K groups. Let's use it to get a macro view of who our customers are.

Since K needs to be defined in advance, we first explore how well the algorithm splits our current customers and what the optimal K is.

In [None]:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

fig = pl.figure(figsize=(12,8))

# KMeans clustering on all columns (including charges)
X = data.copy()  # copy so that adding cluster labels later does not modify data

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.show()    # poof() was renamed to show() in newer yellowbrick

Did you find the "elbow"? This is where we have our optimal K -> 4. It can be subjective sometimes, but we will give k=4 a try!
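
The elbow curve is built from the KMeans inertia (the sum of squared distances from each point to its nearest center) for a range of K. Here is a minimal sketch without yellowbrick, on three made-up blobs rather than the insurance data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs in 2-D.
centers = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([c + rng.normal(0, 1, size=(50, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = km.inertia_   # sum of squared distances to the nearest center

for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the K after which adding another cluster barely lowers the inertia; for these blobs that happens at K = 3.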

In [None]:
kmeans = KMeans(n_clusters=4)  
kmeans.fit(X)

In [None]:
pd.DataFrame(kmeans.cluster_centers_,columns=data.columns)

We obtain the average customer profile of each of the 4 clusters (for now we will ignore region as it doesn't vary much). Note that KMeans starts from random centers, so the cluster numbering can change between runs.

Some takeaways:


1.   Group 0: Middle-aged customers, tend to smoke -> mid to high medical cost
2.   Group 1: Young customers, have 1 child on average, very unlikely to smoke -> very low medical cost
3.   Group 2: Middle-aged/elderly customers, seldom smoke -> middle medical cost
4.   Group 3: Middle-aged customers, likely to smoke, tend to be overweight -> high medical cost
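
One caveat when reading profiles like these: KMeans works on Euclidean distance, so a column with large values (such as charges) can dominate the grouping. A possible variation, sketched here on made-up data and not what this notebook does, is to standardize the columns first and convert the centers back afterwards:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
    "charges": rng.gamma(2.0, 6000.0, size=200),   # made-up, large-valued column
})

scaler = StandardScaler().fit(toy)
scaled = scaler.transform(toy)            # each column now has mean 0 and std 1
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)

# Express the centers back in the original units so they stay readable.
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                       columns=toy.columns)
print(centers.round(1))
```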





In [None]:
fig = pl.figure(figsize=(12,8))

pl.scatter(X.values[:,0], X.values[:,6], c=kmeans.labels_, cmap="Set1_r", s=25)
pl.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,6], color='black', marker="x", s=250)
pl.title("Kmeans Clustering \n Finding Unknown Groups in the Population (Charges vs. Age)", fontsize=16)
pl.show()

In [None]:
#colorlist=['m', 'b', 'g', 'r']
X['label']=kmeans.labels_
fig = pl.figure(figsize=(12,8))

pl.scatter(X.values[:,2], X.values[:,6], c=X['label'], cmap="Set1_r", s=25)
pl.scatter(kmeans.cluster_centers_[:,2] ,kmeans.cluster_centers_[:,6], color='black', marker="x", s=250,label='Center')
pl.title("Kmeans Clustering \n Finding Unknown Groups in the Population (BMI vs. Charges)", fontsize=16)
pl.legend()
pl.show()

In [None]:
from mpl_toolkits.mplot3d import Axes3D

In [None]:

threedee = pl.figure().add_subplot(projection='3d')  # gca(projection=...) no longer works in newer matplotlib
threedee.scatter(X[(X.smoker == 1)]['bmi'], X[(X.smoker == 1)]['age'], X[(X.smoker == 1)]['charges'],c=X[(X.smoker == 1)]['label'], cmap="Set1_r")
threedee.set_xlabel('BMI')
threedee.set_ylabel('Age')
threedee.set_zlabel('Charges')
pl.show()


Remember the two clusters we saw among smokers? The graph above seems to help explain them: the (red) high-BMI smoker group is more likely to have high medical costs than the (green/grey) low-BMI smoker group. So, let's meet at the gym :)

Finally, the laziest model I would like to use! Decision Tree Regressor: at every node it splits on a Yes or No condition to predict the final medical cost for each customer.
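
To get a feel for those Yes/No conditions, scikit-learn can print a fitted tree as text rules. This is a sketch on synthetic data (the numbers are made up so that smoking matters most), not the tree trained in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=300),
    "smoker": rng.integers(0, 2, size=300),
})
# Made-up charges: age adds a little, smoking adds a lot.
charges = 200 * toy["age"] + 20000 * toy["smoker"] + rng.normal(0, 1000, size=300)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(toy, charges)
rules = export_text(tree, feature_names=list(toy.columns))
print(rules)   # each line is a condition; each leaf holds the predicted value
```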

In [None]:
from sklearn.tree import DecisionTreeRegressor # Import Decision Tree 
# Create Decision Tree regressor object
DTR = DecisionTreeRegressor(max_depth=4)

# Train Decision Tree 
DTR = DTR.fit(x_train,y_train)

#Predict the response for test dataset
y_pred = DTR.predict(x_test)

In [None]:
pl.scatter(x_test['age'],y_test, color = "red")
pl.scatter(x_test['age'], y_pred, color = "blue")
pl.title("Truth(Red) or Predict (Blue) by age vs charges")
pl.xlabel("age")
pl.ylabel("charges")
pl.show()

Wow, the predictions (blue) and the real charges (red) are quite close for each data point.
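
"Quite close" can also be put into numbers with an error metric. A minimal sketch with made-up values (on the real data, `y_test` and `y_pred` from the cells above would take their place):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1000.0, 5000.0, 12000.0, 40000.0])   # made-up charges
y_hat = np.array([1500.0, 4000.0, 15000.0, 36000.0])    # made-up predictions

mae = mean_absolute_error(y_true, y_hat)             # average absolute miss
rmse = np.sqrt(mean_squared_error(y_true, y_hat))    # penalises large misses more
print(mae, rmse)
```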

It seems that Kaggle doesn't support a library I used. Never mind, let's import the picture from my local program!

In [None]:
from IPython.display import Image

You may scroll down to the bottom of the page to view the actual decision tree model :) as the picture is too large to display here~
->Data Sources
->DTR_VIZ
->DTR.png

In [None]:
print(DTR.score(x_test,y_test))

We now have the highest accuracy with the simpler model. Less is more. I hope you didn't get bored reading this notebook and that you find it acceptable as an exploratory read on data analytics.