## Introduction

In this notebook we will go through the data of the "Medical Cost Personal" dataset. Starting with some exploratory data analysis, we will find some patterns that'll help us divide the patients into 3 solid clusters. These clusters will help us later on predict the charges of the patients more accurately using simple polynomial regression.

Notebook walkthrough:
1. <a href="#eda">Exploratory Data Analysis</a>
2. <a href="#cluster">Clustering the patients</a>
3. <a href="#pred">Classification and Regression</a>
    * <a href="#reg">Regression</a> 99% accuracy
    * <a href="#class">Classification</a> 95% accuracy
    * <a href="#merge">Merging the models</a> 90% accuracy

## 1. Exploratory Data Analysis:<a id="eda"></a>

In [None]:
# Imports and data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random as rd

data = pd.read_csv('/kaggle/input/insurance/insurance.csv')
data['smoker'] = [bool(sm == 'yes') for sm in data['smoker']]
data['age_cat'] = pd.cut(data['age'], 8, labels=False)
data['age_cat'] = [int(18+i*(64-18)/8) for i in data['age_cat']]

In [None]:
# Setting up graphics and color palette
from pylab import rcParams
rcParams['figure.figsize'] = 9, 7

sns.set_context('notebook')
sns.set_style('whitegrid')
pal = sns.color_palette('Set3')
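# random.shuffle lost its `random` argument in Python 3.11, so the same
# fixed permutation (a Fisher-Yates pass with a constant "random" value of .4) is done by hand
for i in reversed(range(1, len(pal))):
    j = int(.4 * (i + 1))
    pal[i], pal[j] = pal[j], pal[i]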
pal[0], pal[1], pal[2], pal[4] = pal[1], pal[0], pal[4], pal[2]
sns.set_palette(pal)

### *  Charges in relation to smoking

One of the first obvious correlations to explore is the influence of smoking on the overall charges of the individuals:

In [None]:
sns.histplot(data, x='charges', hue='smoker', kde=True, hue_order=[True, False], alpha=.7)
plt.title('Distribution of charges in relation to smoking')
plt.show()

In [None]:
sns.violinplot(data=data, y='charges', x='smoker', order=[True, False])
plt.title('Distribution of charges in relation to smoking')
plt.show()

Even though the data is quite unbalanced with respect to the 'smoker' feature, it's clear that smokers in general have higher charges.
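
The imbalance and the gap in charges can also be quantified directly. The following is a minimal sketch on a toy frame; the `toy` values are made up for illustration and are not taken from the insurance dataset. On the real data the same two lines would be applied to `data`.

```python
import pandas as pd

# Toy stand-in for the insurance data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'smoker': [True, False, False, False, True, False, False, False],
    'charges': [32000.0, 4000.0, 7500.0, 9000.0, 41000.0, 3000.0, 12000.0, 6500.0],
})

# Fraction of smokers: the mean of a boolean column is the share of True values
smoker_share = toy['smoker'].mean()

# Mean charges for smokers and non-smokers
mean_charges = toy.groupby('smoker')['charges'].mean().to_dict()

print('Share of smokers:', smoker_share)
print('Mean charges by smoker flag:', mean_charges)
```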

### * BMI, charges (and smoking)

Distribution of the BMI along the dataset:

In [None]:
sns.histplot(data=data, x='bmi', stat='density', kde=True)
plt.show()

In [None]:
sns.lmplot(data=data, x='bmi', y='charges', hue='smoker', scatter_kws={"alpha": .3, "s": 20}, height=7, aspect=1.15)
plt.show()

Smokers' charges roughly increase with BMI.

### * Number of children, age, and charges:

In [None]:
sns.barplot(data=data, x='age_cat', y='children', estimator=np.mean)
plt.show()

In [None]:
sns.barplot(data=data, x='children', y='charges', estimator=np.mean)
plt.show()

### * Age, charges (and smoking)

In [None]:
sns.barplot(data=data, x='age_cat', y='charges', estimator=np.mean)
plt.show()

In [None]:
sns.scatterplot(data=data, x='age', y='charges', hue='smoker')
plt.show()

Although *the higher the age, the higher the charges*, this scatter plot reveals 3 clusters of individuals, which can be interpreted as the **upper, middle, and lower social classes**.<br>
Moreover, smoking is noticeably more frequent in the upper class than in the lower class.

In the next section we will dive more into these clusters.

## 2. Clustering:<a id="cluster"></a>

In this section we'll be using the **K-means algorithm** to classify the individuals into 3 different clusters: upper, middle, and lower class.<br>
We'll then explore the consequences of clustering.
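
As a rough illustration of how K-means behaves on a single feature when the starting centroids are given explicitly, here is a minimal sketch on synthetic data. The three groups, their values, and the starting centroids are assumptions chosen for the example, not the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D feature with three well-separated groups (hypothetical values)
rng = np.random.default_rng(0)
x = np.concatenate([
    rng.normal(2000, 500, 50),
    rng.normal(15000, 500, 50),
    rng.normal(30000, 500, 50),
]).reshape(-1, 1)

# Explicit starting centroids, one per expected group;
# n_init=1 because there is only one (fixed) initialization to try
init = np.array([0.0, 10000.0, 40000.0]).reshape(-1, 1)
km = KMeans(n_clusters=3, init=init, n_init=1, max_iter=500).fit(x)

# Each label keeps the position of its starting centroid,
# so label 0 is the lowest group and label 2 the highest
print(km.cluster_centers_.ravel())
print(np.bincount(km.labels_))
```

Passing sorted starting centroids is what makes the cluster labels interpretable as an ordering (lowest to highest) rather than arbitrary integers.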

In [None]:
from sklearn.cluster import KMeans

data['smoker'] = [int(sm) for sm in data['smoker']]
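# Despite its name, this feature is a difference: charges with a linear
# age trend (200 per year of age) removed, so the clusters separate on one axis
data['charges/age'] = data['charges'] - data['age']*200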
data['sex_male'] = [int(s == 'male') for s in data['sex']]

X = np.array(data[['charges/age']])
kmeans = KMeans(n_clusters=3, max_iter=500, init=np.array([16000, 28000, 48000]).reshape(-1, 1), n_init=1)
kmeans = kmeans.fit(X.reshape(-1, 1))
data['cluster'] = kmeans.labels_

sns.scatterplot(data=data, x='age', y='charges', hue='cluster', palette=pal[0:3])
plt.title('Clusters of Patients')
plt.show()

After running the clustering algorithm, we will take a look at the different manifestations of the clusters along our dataset.

In [None]:
sns.lmplot(data=data, x='age', y='charges', hue='cluster', scatter_kws={"alpha": .3, 's': 20}, height=7, aspect=1.15)
plt.title('Regression within clusters')
plt.show()

The per-cluster regression lines (and, later on, the predictions) clearly fit the data better.

In [None]:
sns.scatterplot(data=data, x='bmi', y='charges', hue='cluster', palette=pal[0:3])
plt.title('BMI and charges distribution within clusters')
plt.show()

## 3. Prediction:<a id="pred"></a>

First we have to divide the data into **training** (70%), **cross-validation** (15%), and **testing** (15%) parts:

In [None]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data, test_size=.3, random_state=6)
data_cv, data_test = train_test_split(data_test, test_size=.5, random_state=6)

In [None]:
pd.options.mode.chained_assignment = None
data_train['cat'] = 'train'
data_cv['cat'] = 'cv'
data_test['cat'] = 'test'
data_cat = pd.concat([data_train, data_cv, data_test])
fig, axes = plt.subplots(2, 2)
sns.barplot(data=data_cat, x='cat', y='smoker', estimator=np.mean, ax=axes[0, 0])
sns.barplot(data=data_cat, x='cat', y='charges', estimator=np.mean, ax=axes[0, 1])
sns.barplot(data=data_cat, x='cat', y='age', estimator=np.mean, ax=axes[1, 0])
sns.barplot(data=data_cat, x='cat', y='bmi', estimator=np.mean, ax=axes[1, 1])
for ax in axes[:, 1]:
    ax.yaxis.tick_right()
for ax in axes.flatten():
    ax.set(xlabel='')
plt.gcf().suptitle('Mean of the features across the data categories')
plt.show()

The data is pretty much balanced between all the categories.
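
The sizes of the three parts can be checked the same way. This is a minimal sketch on a placeholder frame of 1000 rows (an assumption made for round numbers), using the same two-step split as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame; only the row count matters here
toy = pd.DataFrame({'x': np.arange(1000)})

# Step 1: hold out 30% of the rows; step 2: halve the held-out part
train, rest = train_test_split(toy, test_size=.3, random_state=6)
cv, test = train_test_split(rest, test_size=.5, random_state=6)

print(len(train), len(cv), len(test))

# The three parts should not share any row
overlap = (set(train.index) & set(cv.index)) | (set(train.index) & set(test.index)) | (set(cv.index) & set(test.index))
print('Overlapping rows:', len(overlap))
```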

### *  Regression:<a id="reg"></a>

In this section we will predict the charges given the rest of the features, including the cluster.
We will use polynomial regression of degree 3.
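
To give a sense of what a degree-3 expansion does, the sketch below expands a two-feature toy row and counts the columns produced for five input features (the number of regression features used in this section). The toy values are arbitrary.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

# Two toy features a=1, b=2: the expansion contains every monomial up to degree 3,
# in the order 1, a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3
expanded = PolynomialFeatures(degree=3).fit_transform(np.array([[1.0, 2.0]]))
print(expanded[0])

# With n input features and degree d, the expansion has C(n + d, d) columns
# (bias column included)
n_cols = PolynomialFeatures(degree=3).fit(np.zeros((1, 5))).n_output_features_
print(n_cols, comb(5 + 3, 3))
```

The column count grows quickly with the degree, which is one reason a regularized model such as ridge regression is a reasonable companion to the expansion.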

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Features Preprocessing
reg_features = ['cluster', 'age', 'smoker', 'bmi', 'children']

X_train = np.array(data_train[reg_features])
poly = PolynomialFeatures(3)
X_train = poly.fit_transform(X_train)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
y_train = np.array(data_train[['charges']])

X_cv = np.array(data_cv[reg_features])
X_cv = scaler.transform(poly.transform(X_cv))
y_cv = np.array(data_cv[['charges']])

X_test = np.array(data_test[reg_features])
X_test = scaler.transform(poly.transform(X_test))
y_test = np.array(data_test[['charges']])

# Tuning regularization parameter
scores = np.array([])
scores = scores.reshape(-1, 2)
for alpha in np.arange(0, 1, .1):
    reg = Ridge(alpha=alpha)
    reg.fit(X_train, y_train)
    scores = np.append(scores, np.array([[reg.score(X_cv, y_cv), alpha]]), axis=0)
    print('Alpha: ' + str(alpha) + '\tacc: ' + str(round(scores[-1, 0], 5)))

ind = np.argmax(scores, axis=0)[0]
alpha = scores[ind, 1]
reg = Ridge(alpha=alpha)
reg.fit(X_train, y_train)
print('\nChosen alpha: ' + str(alpha) + '\tfor accuracy: ' + str(reg.score(X_cv, y_cv)))

# Testing the model
print('\nAccuracy of the model:')
print(reg.score(X_test, y_test))


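Knowing the cluster of the patients, we can predict their charges with an R² score of up to ***98%*** (the `score` method of `Ridge` returns the coefficient of determination R², not a classification accuracy).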
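
The value returned by `reg.score` is the coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal sketch with made-up numbers shows the computation by hand and checks it against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```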

### *  Classification:<a id="class"></a>

In this section we will use an SVM with a Gaussian (RBF) kernel to classify the individuals into their correct clusters (i.e. social classes) given the rest of the features:
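
For reference, the Gaussian kernel measures the similarity of two points as exp(−γ‖x − x'‖²). The sketch below evaluates it by hand for two made-up (age, BMI) points and compares the result with scikit-learn's `rbf_kernel`; the points and the `gamma` value are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical individuals described by (age, bmi)
a = np.array([[30.0, 25.0]])
b = np.array([[40.0, 30.0]])
gamma = 0.002

# Squared Euclidean distance is 10^2 + 5^2 = 125, so the kernel is exp(-0.002 * 125)
k_manual = np.exp(-gamma * np.sum((a - b) ** 2))
k_sklearn = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```

Because the features are left unscaled, squared distances are large, which is consistent with searching over small `gamma` values.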

In [None]:
from sklearn import svm
from sklearn.metrics import classification_report


# Features Preprocessing
svc_features = ['age', 'smoker', 'bmi', 'children', 'sex_male']

X_train = np.array(data_train[svc_features])
y_train = data_train['cluster']

X_cv = np.array(data_cv[svc_features])
y_cv = data_cv['cluster']

X_test = np.array(data_test[svc_features])
y_test = data_test['cluster']

# Tuning the regularization (C) and gamma parameters on the cross-validation set,
# so that the test set stays untouched until the final evaluation
scores = np.array([])
scores = scores.reshape(-1, 3)
for c in range(8, 13):
    for gamma in np.arange(.0015, .003, .0005):
        svclassifier = svm.SVC(C=c, kernel='rbf', gamma=gamma, decision_function_shape='ovo')
        svclassifier.fit(X_train, y_train.values.ravel())
        scores = np.append(scores, np.array([[svclassifier.score(X_cv, y_cv), c, gamma]]), axis=0)
        print('C: ' + str(c) + '\tgamma: ' + str(round(gamma,4)) + '\tacc: ' + str(round(scores[-1, 0], 5)))
ind = np.argmax(scores, axis=0)[0]
c = scores[ind, 1]
gamma = scores[ind, 2]
svclassifier = svm.SVC(C=c, kernel='rbf', gamma=gamma)
svclassifier.fit(X_train, y_train.values.ravel())
print(
    '\nChosen C: ' + str(c) +
    '\twith gamma: ' + str(gamma) +
    '\tfor accuracy: ' + str(svclassifier.score(X_cv, y_cv))
)

# Testing the model on the held-out test set
y_pred = svclassifier.predict(X_test)
print()
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))


### *  Merging the two models:<a id="merge"></a>

Here, we will assess both our models in a sequence on our test set:
* First we'll use the classifier to predict the social class of the individual;
* Then we will use the regression model to predict the charges of the individual.

In [None]:
from sklearn.metrics import r2_score

y_test = data_test['charges']

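# Classifier to predict the cluster
# Copy first, so that adding a column does not write into a slice of data_test
X_test = data_test[svc_features].copy()
# Pass a NumPy array, matching how the classifier was fitted
X_test['cluster'] = svclassifier.predict(np.array(X_test[svc_features]))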

# Polynomial regression to predict the charges
X_test = np.array(X_test[reg_features])
X_test = scaler.transform(poly.transform(X_test))
y_pred = reg.predict(X_test)

print('Accuracy of the merged models: ')
print(r2_score(y_test, y_pred))