**Introduction:**
 In this Jupyter notebook, we look into how medical charges are determined by factors such as sex, smoker, bmi and age. The main tools we use to investigate these relations are heatmaps, K-means and linear regression. Using those tools, we found that the patients fall into three groups depending on their smoking status and bmi, and that medical charges are determined accordingly.
  
  A project like this can help us understand how a hospital or insurance company determines a patient's medical charges, and how that process can be improved once we know the logic behind it.

In [None]:
import numpy as np
import pandas as pd 
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

In [None]:
df = pd.read_csv("/kaggle/input/insurance/insurance.csv")

Load data

In [None]:
df.head(5)

In [None]:
df.dtypes

In [None]:
from sklearn.preprocessing import LabelEncoder
#sex
le = LabelEncoder()
le.fit(df.sex.drop_duplicates()) 
df.sex = le.transform(df.sex)
# smoker or not
le.fit(df.smoker.drop_duplicates()) 
df.smoker = le.transform(df.smoker)
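
To make the encoding step easier to read, here is a minimal sketch of how LabelEncoder assigns integers: it sorts the distinct labels and numbers them from 0. The list `toy_smoker` is a made-up stand-in, and it assumes the column holds the strings "yes" and "no".

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for a yes/no column (hypothetical values, not the real data)
toy_smoker = ["yes", "no", "no", "yes"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(toy_smoker)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(list(encoded))
```

Under that assumption, smoker == 1 means the patient smokes and smoker == 0 means the patient does not.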

In [None]:
df.head(5)

In [None]:
df1 = df[["age","sex","bmi","children","smoker","charges"]]

Consider all factors except region.

In [None]:
df1.head(5)

In [None]:
cor = df1.corr() #Calculate the correlation of the above variables
sns.heatmap(cor, square = True) #Plot the correlation as heat map

Here we check the correlation between different factors.
It looks like smoker, bmi and age are three important factors.
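
A heatmap is read by eye, so it can help to also rank the features numerically. The sketch below does this on a small made-up frame (the columns `a`, `b`, `c` and all numbers are hypothetical); the same expression could presumably be applied to `cor` from the cell above.

```python
import pandas as pd

# Toy frame (made-up numbers) with three hypothetical features and a target
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 1, 4, 3, 5],
    "c": [5, 3, 1, 4, 2],
    "charges": [1.1, 2.0, 2.9, 4.2, 5.0],
})

cor_toy = toy.corr()
# Absolute correlation of every feature with the target, strongest first
ranking = cor_toy["charges"].drop("charges").abs().sort_values(ascending=False)
print(ranking)
```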

Therefore, I want to apply the K-means method to bmi and age. K-means is not a good fit for smoker, since smoker is a binary value rather than a continuous one. K-means can help us see whether the patients can be categorized, and therefore learn the relations among the variables.

First we look at clusters of bmi and charges.

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
# fit_transform returns a new array; df1 itself stays unscaled
df1_scaled = ss.fit_transform(df1)
df1_scaled
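
Whether the data is scaled matters a lot for K-means, because the distance it uses is dominated by the column with the largest numeric range. The sketch below illustrates this on synthetic data only (the two features and their scales are assumptions chosen to loosely resemble bmi and charges): the true groups differ in the small-scale feature, and K-means recovers them only after standardization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two true groups differ only in a small-scale feature,
# while a large-scale feature carries no group information.
true_group = np.repeat([0, 1], 100)
small = np.concatenate([rng.normal(22, 1, 100), rng.normal(38, 1, 100)])
large = rng.normal(20000, 5000, 200)
X_raw = np.column_stack([small, large])

def cluster(X):
    # Two clusters, several restarts, fixed seed for reproducibility
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: close to 1 means the true groups were recovered,
# close to 0 means the clustering is unrelated to them
ari_raw = adjusted_rand_score(true_group, cluster(X_raw))
ari_scaled = adjusted_rand_score(
    true_group, cluster(StandardScaler().fit_transform(X_raw))
)
print(ari_raw, ari_scaled)
```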

In [None]:
from sklearn.cluster import KMeans
def doKmeans(X, nclust=2):
    model = KMeans(nclust)
    model.fit(X)
    clust_labels = model.predict(X)
    cent = model.cluster_centers_
    return (clust_labels, cent)

In [None]:
clust_labels, cent = doKmeans(df1, 2)
kmeans = pd.DataFrame(clust_labels)

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(df1['bmi'],df1['charges'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('bmi')
ax.set_ylabel('charges')
plt.colorbar(scatter)

When using two clusters, we get the graph above.

In [None]:
clust_labels, cent = doKmeans(df1, 3)
kmeans = pd.DataFrame(clust_labels)

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(df1['bmi'],df1['charges'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('bmi')
ax.set_ylabel('charges')
plt.colorbar(scatter)

Case of three clusters. Clearly there are other factors affecting charges besides bmi. We need to find those factors.

Below we look at clusters of age and charges.

In [None]:
clust_labels, cent = doKmeans(df1, 3)
kmeans = pd.DataFrame(clust_labels)

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(df1['age'],df1['charges'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('age')
ax.set_ylabel('charges')
plt.colorbar(scatter)

Case of three clusters for age. Combined with the three-cluster graph above, we can conclude that categorizing the whole data set into three categories is reasonable.
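
The choice of three clusters here is made by looking at the plots. One common way to back up such a choice numerically is the silhouette score, which is higher when clusters are compact and well separated. This is a sketch on synthetic blobs, not on the insurance data, and it is a swapped-in technique rather than something the analysis above used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs (not the insurance data)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Mean silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```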

To discover how the data is categorized into these three groups, consider smoker first, since it is relatively easy to filter on its values of 0 and 1.

In [None]:
g = sns.lmplot(x="age", y="charges", hue="smoker", data=df1, palette = 'inferno_r', height = 7)
g.ax.set_title('Smokers and non-smokers')  # title the lmplot's own axes, not the previous figure's

Above is the graph when we split by smoker. When smoker == 0, we can easily see the pattern of the points along the bottom, so one pattern is found.
We still need to find the pattern on the top (when smoker == 1).

Since the pattern for smoker == 0 is found, below I use df2 for the patients who do smoke, i.e. the rows where smoker == 1.

In [None]:
df2 = df1[(df1.smoker == 1)]

In [None]:
cor = df2.corr() #Calculate the correlation of the above variables
sns.heatmap(cor, square = True) #Plot the correlation as heat map

Concentrating on people who smoke, the heatmap here shows that bmi and age are relevant.
Since we have discovered the pattern for smoker, we need to look at how bmi and age affect charges.

We will look at bmi first.

In [None]:
sns.lmplot(x="bmi", y="charges", hue="sex", data=df2, palette = 'inferno_r', height = 7)

Above we can see that the graph is divided into two groups. Performing K-means can help us verify it.

In [None]:
clust_labels, cent = doKmeans(df2, 2)
kmeans = pd.DataFrame(clust_labels)

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(df2['bmi'],df2['charges'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('bmi')
ax.set_ylabel('charges')
plt.colorbar(scatter)

Clearly, the two clusters behave very differently, depending on whether bmi is below or above 30.

Therefore, below we can divide the patients who smoke into two groups: bmi > 30 or bmi <= 30.

In [None]:
df3 = df2[(df2.bmi > 30)]

In [None]:
sns.lmplot(x="age", y="charges", hue="sex", data=df3, palette = 'inferno_r', height = 7)

In [None]:
df4 = df2[(df2.bmi <= 30)]
sns.lmplot(x="age", y="charges", hue="sex", data=df4, palette = 'inferno_r', height = 7)

In [None]:
cat1 = df1[df1.smoker == 0]
cat2 = df3
cat3 = df4

This way, all three categories, cat1, cat2 and cat3, are obtained.
Now let's perform machine learning on each of the three individually. As they all look very linear, linear regression should suffice.
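
The same three-way split can also be written as a single labelling step, which makes the rules explicit in one place. This is a minimal sketch; the helper name `assign_category` and the toy frame are hypothetical, and it assumes smoker is encoded as 0/1.

```python
import numpy as np
import pandas as pd

def assign_category(frame):
    """Label each row cat1/cat2/cat3 using the smoker and bmi rules."""
    conditions = [
        frame["smoker"] == 0,   # non-smokers
        frame["bmi"] > 30,      # rows reaching here are smokers with bmi above 30
    ]
    choices = ["cat1", "cat2"]
    # np.select takes the first matching condition; everything else is cat3
    return pd.Series(
        np.select(conditions, choices, default="cat3"), index=frame.index
    )

toy = pd.DataFrame({"smoker": [0, 1, 1, 1], "bmi": [35.0, 35.0, 30.0, 25.0]})
toy["category"] = assign_category(toy)
print(toy)
```

Note that a smoker with bmi exactly 30 lands in cat3, matching the bmi <= 30 filter.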

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

As a baseline, first train a single linear regression on the whole data set.

In [None]:
df_x = df1[["age","sex","bmi","children","smoker"]]
df_y = df1[["charges"]]
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)
reg = LinearRegression().fit(X_train, y_train)

Linear regression for category 1

In [None]:
cat1_x = cat1[["age","sex","bmi","children","smoker"]]
cat1_y = cat1[["charges"]]
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(cat1_x, cat1_y, test_size=0.2)
reg_1 = LinearRegression().fit(X_train_1, y_train_1)

y_pred_1 = reg_1.predict(X_test_1)

Linear regression for category 2

In [None]:
cat2_x = cat2[["age","sex","bmi","children","smoker"]]
cat2_y = cat2[["charges"]]
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(cat2_x, cat2_y, test_size=0.2)
reg_2 = LinearRegression().fit(X_train_2, y_train_2)

Linear regression for category 3

In [None]:
cat3_x = cat3[["age","sex","bmi","children","smoker"]]
cat3_y = cat3[["charges"]]
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(cat3_x, cat3_y, test_size=0.2)
reg_3 = LinearRegression().fit(X_train_3, y_train_3)

It would be great if we could have a mixed linear regression model.

In [None]:
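def mix_model(df):
    # Route each row to the regression fitted on its category:
    # non-smokers -> reg_1, smokers with bmi > 30 -> reg_2 (cat2),
    # smokers with bmi <= 30 -> reg_3 (cat3).
    result = []
    
    for i in range(df.shape[0]):  # include the last row
        x = df.iloc[i]
        xx = df.iloc[i:i+1]
        if x.smoker == 0:
            result.append(reg_1.predict(xx))
        elif x.bmi > 30:
            result.append(reg_2.predict(xx))
        else:
            result.append(reg_3.predict(xx))
    
    return result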
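
The mixed-model idea can also be sketched without a row-by-row loop, by predicting each group in one call through a boolean mask. The example below runs on synthetic data only: every coefficient, the noise level and the names `demo`, `group_of` and `models` are assumptions made for illustration, not values from the insurance data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-in for the insurance data (all coefficients are made up)
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.uniform(18, 45, n),
    "smoker": rng.integers(0, 2, n),
})
jump = np.where(demo.smoker == 0, 0, np.where(demo.bmi > 30, 30000, 12000))
demo["charges"] = 250 * demo.age + jump + rng.normal(0, 1000, n)

features = ["age", "bmi", "smoker"]

def group_of(frame):
    # 0: non-smoker, 1: smoker with bmi > 30, 2: smoker with bmi <= 30
    return np.where(frame.smoker == 0, 0, np.where(frame.bmi > 30, 1, 2))

train, test = train_test_split(demo, test_size=0.2, random_state=0)

# Baseline: one regression for everyone
single = LinearRegression().fit(train[features], train.charges)

# One regression per group, fitted on that group's training rows only
g_train = group_of(train)
models = {
    g: LinearRegression().fit(
        train.loc[g_train == g, features], train.loc[g_train == g, "charges"]
    )
    for g in (0, 1, 2)
}

# Predict each group in one call instead of row by row
g_test = group_of(test)
mixed_pred = np.empty(len(test))
for g, model in models.items():
    mask = g_test == g
    mixed_pred[mask] = model.predict(test.loc[mask, features])

r2_single = r2_score(test.charges, single.predict(test[features]))
r2_mixed = r2_score(test.charges, mixed_pred)
print(r2_single, r2_mixed)
```

Fixing `random_state` in the split keeps the comparison reproducible, and scoring both models on the same held-out rows keeps it fair.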

Now let us compare the original linear regression with the new model categorized into three groups.

In [None]:
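from sklearn.metrics import mean_squared_error

# Each line: R^2 of the single model vs. the per-category model on that category's test set
print(reg.score(X_test_1,y_test_1),reg_1.score(X_test_1,y_test_1))
print(reg.score(X_test_2,y_test_2),reg_2.score(X_test_2,y_test_2))
print(reg.score(X_test_3,y_test_3),reg_3.score(X_test_3,y_test_3))
mean_squared_error(y_test_1,y_pred_1)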

We can see that the scores of the per-category models are greatly improved over the single regression.
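
For reference, the value that `score` returns for a regressor is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers and checks it against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, only to illustrate the formula
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

Because SS_tot is computed on the test set being scored, R^2 values from different test sets are not directly comparable; each printed line above compares two models on the same test set, which is the meaningful comparison.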

In [None]:
categories = ['sex','smoker','region']
for col in categories:
    df[col] = df[col].astype('category') 
df.info()

In [None]:
X = df.drop(columns=['charges'])
y = df['charges']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=10)

In [None]:
import lightgbm as lgb
lgb_train = lgb.Dataset(X_train, label=y_train, categorical_feature = categories)
params = {}
params['learning_rate'] = 0.03
params['boosting_type'] = 'gbdt'
params['objective'] = 'regression'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10
params['n_estimators'] = 50

lgb_model = lgb.train(params, lgb_train,categorical_feature = categories)
#Prediction
y_pred=lgb_model.predict(X_test)

In [None]:
y_pred

In [None]:
import shap
import matplotlib.pylab as pl
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X)

In [None]:
shap.summary_plot(shap_values, X) 

In [None]:
shap.dependence_plot('age', shap_values, X, dot_size=32, show=False)

In [None]:
shap.dependence_plot('bmi', shap_values, X, dot_size=32, show=False)

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[200], X.iloc[200,:])

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)
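
The mean squared error is expressed in squared units of charges, which makes it hard to read. Taking the square root gives the RMSE, which is in the same units as the target. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, only to illustrate the unit conversion
y_true_demo = np.array([1000.0, 2000.0, 3000.0])
y_pred_demo = np.array([1100.0, 1900.0, 3200.0])

mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)  # back in the units of the target
print(mse, rmse)
```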

In [None]:
lgb.plot_tree(lgb_model,figsize = (15,15))