# Medical Treatment Cost Prediction

## Step 1: Frame the Problem

- Objective: Predict the cost of treatment.

- Description:
    - The dataset contains treatment cost records for different patients.
    - The cost of treatment depends on many factors: disease, severity of disease, type of treatment, age, diagnosis, type of clinic, city of residence, and so on.
    - Only the following factors are available in this dataset: age, sex, bmi, children, smoker, and region.
    - Common sense suggests that the cost of treatment will be higher in the following scenarios:
        1. The person is a smoker.
        2. The person has a BMI > 30.

- Problem Type: Supervised learning, regression.
- Batch learning, since the data is not changing rapidly.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv("../input/insurance/insurance.csv")
print(data.shape)
data.head()

## Step 2: Data Exploration

Our main focus will be on the following factors: sex, age, smoker, bmi, and region. We expect the number of children to have little effect on the cost of treatment, although it is kept as a feature for modeling.

In [None]:
data_explore = data.copy()

### Statistical Overview

In [None]:
data_explore.info()

None of the columns contain null values.

In [None]:
data_explore.describe()

### Outliers

In [None]:
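# Restrict to numeric columns: quantiles are undefined for the text columns (sex, smoker, region)
numeric_cols = data_explore.select_dtypes(include="number")
Q1 = numeric_cols.quantile(0.25)
Q3 = numeric_cols.quantile(0.75)
IQR = Q3 - Q1
# Count the values that fall outside the 1.5 * IQR fences in each column
outliers = ((numeric_cols < (Q1 - 1.5 * IQR)) | (numeric_cols > (Q3 + 1.5 * IQR))).sum()
outliers[outliers>0]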
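The cell above applies the 1.5 × IQR rule column by column. As a minimal sketch of the same rule on a single array, the example below uses made-up values (they are illustrative only and are not taken from the dataset) with one deliberately extreme entry.

```python
import numpy as np

# Hypothetical charges; the last value is deliberately extreme.
values = np.array([1200.0, 2500.0, 3100.0, 4800.0, 5200.0, 6900.0, 8100.0, 52000.0])

# First and third quartiles, then the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything outside these fences is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (values < lower_fence) | (values > upper_fence)

print("fences:", lower_fence, upper_fence)
print("outliers:", values[is_outlier])
```

Only the extreme entry falls outside the fences here; the rule flags values, it does not decide whether they should be removed.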

In [None]:
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.boxplot(x='charges', data=data_explore)
plt.subplot(1, 2, 2)
sns.boxplot(x='bmi', data=data_explore)
plt.show()

There are some categorical features. Let's encode them before using them in the analysis.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data_explore["sex_enc"] = label_encoder.fit_transform(data_explore["sex"])
print(label_encoder.classes_)
data_explore["smoker_enc"] = label_encoder.fit_transform(data_explore["smoker"])
print(label_encoder.classes_)
data_explore["region_enc"] = label_encoder.fit_transform(data_explore["region"])
print(label_encoder.classes_)
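Because the same `label_encoder` object is refit for each column, `classes_` only reflects the most recently encoded column. The sketch below, on a made-up list of labels, shows how the integer codes relate to `classes_`: the classes are sorted, and each label's code is its position in that sorted array.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'smoker' column.
encoder = LabelEncoder()
codes = encoder.fit_transform(["yes", "no", "no", "yes"])

# classes_ is sorted, so the code of a label is its index in classes_.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```

This is why `smoker_enc` is 0 for non-smokers and 1 for smokers in the plots that follow.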

### Analysis 1: Observe Distribution of all features.

In [None]:
sns.pairplot(data_explore)

- Graphs on the diagonal show the histogram of each feature.
- The dataset contains more treatment records for people around age 20 than for other ages.
- Our target variable 'charges' is strongly right-skewed. Most treatment records have charges below 12000.
- There are many more records of non-smokers than of smokers.
- The numbers of records for male and female patients are similar. The same holds for the region-wise records.
- Looking at the distribution of charges against sex, region, and smoker, the charges are fairly similar across all categories of sex and region. For smokers, the treatment charges are higher than for non-smokers.
- We see increasing patterns in the graphs of charges vs. BMI and charges vs. age.
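Right skew can also be checked numerically rather than only by eye. The following is a minimal sketch, assuming made-up charge values (not taken from the dataset): a right-skewed sample has a positive skewness coefficient, and its mean sits above its median.

```python
import numpy as np

def skewness(x):
    """Population skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

# Hypothetical charges with a long right tail.
charges = np.array([1500.0, 2200.0, 3000.0, 4100.0, 5600.0,
                    8000.0, 12000.0, 21000.0, 39000.0, 60000.0])

raw_skew = skewness(charges)
mean_charge = float(charges.mean())
median_charge = float(np.median(charges))

print("skewness:", round(raw_skew, 2))
print("mean:", mean_charge, "median:", median_charge)
```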

We will now explore each feature in more depth.

In [None]:
data_explore['charges'].hist()
plt.xlabel('Charges')

### Analysis 2: Distribution of Smokers on Basis of Gender

In [None]:
sns.catplot(x="smoker", kind="count",hue = 'sex', data=data_explore, legend_out=False )
plt.title("Distribution of Smokers on Basis of Gender")
plt.show()

In [None]:
data_explore[(data_explore['smoker']=='yes') & (data_explore['sex']=='female')]['charges'].count(), data_explore[(data_explore['smoker']=='yes') & (data_explore['sex']=='male')]['charges'].count()

There are almost 30% more male smokers than female smokers.
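A statement of the form "X% more" depends on which group is taken as the baseline. The sketch below shows the arithmetic with the female count as the baseline; the two counts are hypothetical placeholders and not the output of the cell above.

```python
def percent_more(larger, smaller):
    """Return how much bigger `larger` is than `smaller`, as a percentage of `smaller`."""
    return (larger - smaller) / smaller * 100.0

# Hypothetical counts, for illustration only.
female_smokers = 100
male_smokers = 130

print(round(percent_more(male_smokers, female_smokers), 1))
```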

### Analysis 3: Distribution of treatment charges over age.

In [None]:
data_explore_male = data_explore[data_explore["sex"]=="male"]
data_explore_female = data_explore[data_explore["sex"]=="female"]
data_explore_non_smoker = data_explore[data_explore["smoker"]=="no"]
data_explore_smoker = data_explore[data_explore["smoker"]=="yes"]

In [None]:
data_explore_smoker.age.hist()
plt.xlabel('Age')

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
data_explore_smoker['charges'].hist()
plt.title('Distribution of Charges for Smokers')
plt.subplot(1, 2, 2)
data_explore_non_smoker['charges'].hist()
plt.title('Distribution of Charges for Non-Smokers')
plt.show()

In [None]:
data_explore_smoker['charges'].min()

In [None]:
fig, ax = plt.subplots()
data_explore.plot(kind="scatter", x="age", y="charges", alpha=0.5, c="smoker_enc", cmap=plt.get_cmap("brg"), colorbar=False, ax=ax, figsize=(8, 4))
plt.title("Distribution of treatment charges over ages\nSmokers - Green   Non-smokers: Blue")
plt.show()

- Treatment charges appear to increase with age, but many records show high treatment charges for younger people.
- Many treatment records have charges below 15000, and most of those belong to non-smokers. Above 15000, there are more smokers than non-smokers.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_figheight(6)
fig.set_figwidth(12)
data_explore_smoker.plot(kind="scatter", x="age", y="charges", ax=ax[0])
ax[0].set_title("Smokers treatment charges distribution over age")
data_explore_non_smoker.plot(kind="scatter", x="age", y="charges", ax=ax[1])
ax[1].set_title("Non Smokers treatment charges distribution over age")
plt.show()

- Few treatment records of non-smokers have charges above 30000, whereas many treatment records of smokers do.

In [None]:
fig, ax = plt.subplots()
data_explore.plot(kind="scatter", x="age", y="charges", alpha=0.7, c="sex_enc", cmap=plt.get_cmap("brg"), colorbar=False, ax=ax, figsize=(8, 4))
plt.title("Distribution of charges over ages\nMale - Green   Female: Blue")
plt.show()

- More treatment records with charges above 15000 belong to males than to females.

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_figheight(6)
fig.set_figwidth(15)
data_explore_male.plot(kind="scatter", x="age", y="charges",alpha=0.7,c="smoker_enc", cmap=plt.get_cmap("brg"), colorbar=False, ax=ax[0])
ax[0].set_title("Males treatment charges distribution over age\nSmokers - Green   Non-smokers: Blue")
data_explore_female.plot(kind="scatter", x="age", y="charges",alpha=0.7,c="smoker_enc", cmap=plt.get_cmap("brg"), colorbar=False, ax=ax[1])
ax[1].set_title("Females treatment charges distribution over age\nSmokers - Green   Non-smokers: Blue")
plt.show()

- This distribution shows that, irrespective of gender, charges for smokers are higher than for non-smokers.

### Analysis 4: Distribution of Charges over BMI

In [None]:
fig, ax = plt.subplots()
data_explore.plot(kind="scatter", x="bmi", y="charges", alpha=0.7, c="smoker_enc", cmap=plt.get_cmap("PiYG"), colorbar=False, ax=ax)
plt.title("Distribution of charges over bmi\nSmokers - Green   Non-smokers: Pink")
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_figheight(6)
fig.set_figwidth(15)
data_explore_smoker.plot(kind="scatter", x="bmi", y="charges", c="age", cmap=plt.get_cmap("jet"), colorbar=False, ax=ax[0])
ax[0].set_title("Smokers treatment charges distribution over BMI")
data_explore_non_smoker.plot(kind="scatter", x="bmi", y="charges", c="age", cmap=plt.get_cmap("jet"), colorbar=True, ax=ax[1])
ax[1].set_title("Non-Smokers treatment charges distribution over BMI")
plt.show()

- A BMI above 30 is a sign of being unhealthy, and a smoking habit on top of it makes health much worse.
- Smokers with a BMI over 30 have treatment charges above 30000.
- Non-smokers with a BMI above 30 have much lower (almost half) treatment charges than smokers with a BMI in a similar range. Also, for non-smokers with a BMI between 25 and 40, treatment charges increase with age. This does not hold for every record, as some young non-smokers have high treatment charges.
- In short, the combination of a BMI greater than 30 and a smoking habit is associated with very high treatment charges.
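The interaction between smoking and BMI described above can be summarized in a small two-way table of mean charges. This is a minimal sketch on a made-up table (the values are illustrative, not taken from the dataset); `bmi_group` is a helper column introduced only for this example.

```python
import pandas as pd

# Hypothetical records, for illustration only.
toy = pd.DataFrame({
    "smoker":  ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "bmi":     [33.0, 36.5, 24.0, 28.2, 34.1, 31.0, 22.3, 27.5],
    "charges": [42000.0, 46000.0, 19000.0, 21000.0, 9000.0, 7000.0, 5000.0, 6000.0],
})

# Split records at the BMI threshold used in the analysis.
toy["bmi_group"] = toy["bmi"].gt(30).map({True: "bmi > 30", False: "bmi <= 30"})

# Mean charges for every smoker / BMI-group combination.
summary = toy.pivot_table(index="smoker", columns="bmi_group",
                          values="charges", aggfunc="mean")
print(summary)
```

Running the same pivot on `data_explore` would give the actual group means for this dataset.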

### Analysis 5: Correlation Plot

In [None]:
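# numeric_only=True skips the original text columns (their encoded versions are still included)
corr_matrix = data_explore.corr(numeric_only=True)

plt.figure(figsize=(12, 6))
# np.bool was removed from NumPy; the built-in bool is its replacement
sns.heatmap(corr_matrix, mask=np.zeros_like(corr_matrix, dtype=bool), annot=True, square=True)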
plt.show()

- Treatment charges are strongly correlated with the smoker feature. Charges are also moderately correlated with age and BMI.
- There is no strong correlation among the independent features.
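Each cell of the heatmap is a Pearson correlation coefficient. As a minimal sketch of what that number measures, the example below computes it by hand for a made-up binary smoker flag and made-up charges, and compares the result with `np.corrcoef`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical values, for illustration only.
smoker_flag = [0, 0, 0, 1, 1, 1]
charges = [3000.0, 5000.0, 4000.0, 30000.0, 42000.0, 36000.0]

r = pearson(smoker_flag, charges)
r_numpy = float(np.corrcoef(smoker_flag, charges)[0, 1])
print(round(r, 3), round(r_numpy, 3))
```

Note that Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship.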

## Step 3: Data Preprocessing

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

stratified_data = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

In [None]:
for train_idx, test_idx in stratified_data.split(data, data["smoker"]):
    stratified_train_set = data.iloc[train_idx]
    stratified_test_set = data.iloc[test_idx]
    
stratified_train_set.shape, stratified_test_set.shape
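Stratifying on `smoker` keeps the share of smokers roughly the same in the train and test sets, which matters because smokers are the minority class and drive the high charges. The following sketch checks this on synthetic labels (20 smokers out of 100 records, chosen only for illustration).

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels: 20% smokers, 80% non-smokers.
labels = np.array(["yes"] * 20 + ["no"] * 80)
features = np.arange(len(labels)).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(features, labels))

# Share of smokers in each part of the split.
train_share = float(np.mean(labels[train_idx] == "yes"))
test_share = float(np.mean(labels[test_idx] == "yes"))
print(train_share, test_share)
```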

In [None]:
y_train = stratified_train_set['charges'].copy()
X_train = stratified_train_set.drop(columns='charges')

y_test = stratified_test_set['charges'].copy()
X_test = stratified_test_set.drop(columns='charges')

Now we will create a preprocessing pipeline that does the following:
- Encoding of categorical features
- Standardization of numerical attributes

In [None]:
cat_attrs = ['sex', 'smoker', 'region']
num_attrs = ['age', 'bmi', 'children']

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
pre_process = ColumnTransformer([('scaler', StandardScaler(), num_attrs),
                                ('encode', OneHotEncoder(), cat_attrs)], remainder='passthrough')

In [None]:
X_train_transformed = pre_process.fit_transform(X_train)
X_test_transformed = pre_process.transform(X_test)

In [None]:
X_train_transformed.shape, X_train.iloc[0, :], X_train_transformed[0]

In [None]:
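# Column order after the ColumnTransformer: scaled numeric attributes first, then the one-hot columns
feature_columns = list(X_train.columns)
new_col_name = ['female', 'male', 'smoker_no', 'smoker_yes', 'northeast', 'northwest', 'southeast', 'southwest']
feature_columns.extend(new_col_name)
# Drop the original categorical names, which were replaced by the one-hot columns
feature_columns = [col for col in feature_columns if col not in cat_attrs]
feature_columns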

## Step 4: Select and Train a Model

To evaluate each model, we will use RMSE as the evaluation metric.
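RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as the charges. scikit-learn's `neg_mean_squared_error` scorer returns the negated MSE, which is why the scores are negated before the square root is taken. A minimal sketch with made-up numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical observed and predicted charges.
observed = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3200.0]

error = rmse(observed, predicted)
print(round(error, 2))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.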

In [None]:
from sklearn.model_selection import cross_val_score

results = []

def cv_results(model, X, y):
    scores = cross_val_score(model, X, y, cv = 7, scoring="neg_mean_squared_error", n_jobs=-1)
    rmse_scores = np.sqrt(-scores)
    rmse_scores = np.round(rmse_scores, 2)
    print('CV Scores: ', rmse_scores)
    print('rmse: {},  S.D.:{} '.format(np.mean(rmse_scores), np.std(rmse_scores)))
    results.append([model.__class__.__name__, np.mean(rmse_scores), np.std(rmse_scores)])

### Linear Regression - Analytical approach

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_reg = LinearRegression()
linear_reg.fit(X_train_transformed, y_train)

In [None]:
feature_imp = [ col for col in zip(feature_columns,linear_reg.coef_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp

In [None]:
cv_results(linear_reg, X_train_transformed, y_train)

The RMSE obtained is large. The model is clearly underfitting.

Let's try to increase the model's complexity by adding polynomial features.

### Polynomial Regression 

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly_features = PolynomialFeatures(degree=2, include_bias=False)

In [None]:
from sklearn.pipeline import Pipeline

poly_reg = Pipeline([('poly_features', poly_features),
                    ('linear_reg', LinearRegression(n_jobs=-1))])

In [None]:
poly_reg.fit(X_train_transformed, y_train)

In [None]:
cv_results(poly_reg, X_train_transformed, y_train)

This is a very good improvement in the model's performance. Increasing the model's complexity has reduced the underfitting.
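A likely reason for the improvement is that a degree-2 expansion adds squared terms and pairwise products, so a linear model can represent interactions such as smoker × BMI. The sketch below shows the expansion for one made-up row with two features, `a = 2` and `b = 3`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with two features: a = 2, b = 3.
row = np.array([[2.0, 3.0]])

# With include_bias=False the output columns are: a, b, a^2, a*b, b^2.
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(row)
print(expanded)
```

With the 11 transformed columns used in this notebook, the same expansion produces many more columns, which is worth keeping in mind for a small dataset.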

### SVR with RBF Kernel

In [None]:
from sklearn.svm import SVR

In [None]:
svr_reg = SVR(C=1, kernel='rbf')
svr_reg.fit(X_train_transformed, y_train)

In [None]:
cv_results(svr_reg, X_train_transformed, y_train)

Let's see whether tree-based models can reduce the RMSE.

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree_reg = DecisionTreeRegressor(criterion='squared_error', random_state=42)  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
tree_reg.fit(X_train_transformed, y_train)

In [None]:
feature_imp = [ col for col in zip(feature_columns,tree_reg.feature_importances_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp

In [None]:
cv_results(tree_reg, X_train_transformed, y_train)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
forest_reg = RandomForestRegressor(n_estimators=100, criterion='squared_error', n_jobs=-1, random_state=42)  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
forest_reg.fit(X_train_transformed, y_train)

In [None]:
feature_imp = [ col for col in zip(feature_columns,forest_reg.feature_importances_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp

In [None]:
cv_results(forest_reg, X_train_transformed, y_train)

### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostRegressor

In [None]:
ada_reg = AdaBoostRegressor(loss='linear', n_estimators=100, learning_rate=0.01, random_state=42)
ada_reg.fit(X_train_transformed, y_train)

In [None]:
cv_results(ada_reg, X_train_transformed, y_train)

### XGBoost

In [None]:
from xgboost import XGBRegressor

In [None]:
xgb_reg = XGBRegressor(max_depth=3, n_estimators=100, learning_rate=0.1, objective='reg:squarederror', random_state=42)
xgb_reg.fit(X_train_transformed, y_train)

In [None]:
cv_results(xgb_reg, X_train_transformed, y_train)

In [None]:
result_df = pd.DataFrame(data=results, columns=['Model', 'RMSE', 'S.D'])
result_df

Among all the ML algorithms tried, XGBoost regression gave the best result. Now, let's tune the hyperparameters of XGBoost.

## Step 5: Fine Tune a Model

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
xgb_grid_parm=[{'n_estimators':[25, 50, 75, 100], 'learning_rate':[0.001, 0.01, 0.1, 0.5, 1], 'max_depth':[3, 6, 8, 12] }]
xgb_grid_search = GridSearchCV(XGBRegressor(objective='reg:squarederror', n_jobs=-1, random_state=42), xgb_grid_parm, cv=5, scoring="neg_mean_squared_error", return_train_score=True, n_jobs=-1)
xgb_grid_search.fit(X_train_transformed, y_train)

In [None]:
xgb_grid_search.best_params_

In [None]:
cvres = xgb_grid_search.cv_results_
print("Results for each run of XGBoost Regression...")
for train_mean_score, test_mean_score, params in zip(cvres["mean_train_score"], cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-train_mean_score), np.sqrt(-test_mean_score), params)

We can observe that as the depth of the decision trees increases, the model overfits the training data.
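Overfitting shows up as a widening gap between training and validation error. The sketch below illustrates the effect with a single scikit-learn `DecisionTreeRegressor` on synthetic data; this is a stand-in chosen for simplicity, not the XGBoost grid search above, and the data are generated only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data: a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def rmse(model, X, y):
    """RMSE of a fitted model on the given data."""
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))

# Gap between validation and training RMSE for a shallow and a deep tree.
gaps = {}
for depth in (2, 12):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    gaps[depth] = rmse(tree, X_val, y_val) - rmse(tree, X_tr, y_tr)
    print(depth, round(gaps[depth], 3))
```

The deep tree fits the training noise closely, so its validation error is much larger than its training error, which is the same pattern to look for in the train and test columns printed above.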

In [None]:
best_xgb_reg = xgb_grid_search.best_estimator_
best_xgb_reg

## Step 6: Model Evaluation

In [None]:
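from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, best_xgb_reg.predict(X_test_transformed)))  # RMSE on the held-out test set, without refitting on test data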

In [None]:
# R2-Score
best_xgb_reg.score(X_train_transformed, y_train), best_xgb_reg.score(X_test_transformed, y_test)
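For regressors, `score` returns the coefficient of determination R², which compares the model's squared error with that of always predicting the mean. A minimal sketch of the formula, using made-up numbers:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical observed and predicted charges.
observed = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3200.0]

r2 = r2_manual(observed, predicted)
r2_mean_baseline = r2_manual(observed, [2000.0, 2000.0, 2000.0])
print(round(r2, 3), r2_mean_baseline)
```

A large difference between the train and test R² values is another indication of overfitting.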

Before saving the model, let's look at the predictions made by the model on the overall dataset. This analysis will help us see where the model has underperformed.

In [None]:
combine_data = pd.concat([stratified_train_set, stratified_test_set], axis=0)

In [None]:
combine_data.shape

In [None]:
combine_data['smoker_enc'] = label_encoder.fit_transform(combine_data['smoker'])

In [None]:
y_train_pred = best_xgb_reg.predict(X_train_transformed)
y_test_pred = best_xgb_reg.predict(X_test_transformed)

In [None]:
y_pred = np.concatenate([y_train_pred, y_test_pred], axis=0)

In [None]:
combine_data['predicted_charges'] = y_pred

In [None]:
combine_data.head()

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.scatter(combine_data['age'], combine_data['charges'], c=combine_data["smoker_enc"], cmap=plt.get_cmap("brg"), alpha=0.7)
plt.title("Distribution of Observed Charges\nNon-Smoker: blue, Smoker: green")
plt.subplot(1, 2, 2)
plt.scatter(combine_data['age'], combine_data['predicted_charges'], c=combine_data["smoker_enc"], cmap=plt.get_cmap("brg"), alpha=0.7)
plt.title("Distribution of Predicted Charges\nNon-Smoker: blue, Smoker: green")
plt.show()

In [None]:
combine_data_smoker = combine_data[combine_data['smoker']=='yes']
combine_data_non_smoker = combine_data[combine_data['smoker']=='no']

In [None]:
combine_data_smoker.describe()

In [None]:
combine_data_non_smoker.describe()

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
combine_data_smoker['charges'].hist()
plt.title('Observed Charges for Smokers')
plt.subplot(1, 2, 2)
combine_data_smoker['predicted_charges'].hist()
plt.title('Predicted Charges for Smokers')
plt.show()

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
combine_data_non_smoker['charges'].hist()
plt.title('Observed Charges for Non-Smokers')
plt.subplot(1, 2, 2)
combine_data_non_smoker['predicted_charges'].hist()
plt.title('Predicted Charges for Non-Smokers')
plt.show()

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.scatter(combine_data_smoker['age'], combine_data_smoker['charges'], c='green')
plt.scatter(combine_data_smoker['age'], combine_data_smoker['predicted_charges'], c='red')
plt.title("Analysis of Predicted Charges for Smokers\nObserved Charges: green, Predicted Charges: red")
plt.subplot(1, 2, 2)
plt.scatter(combine_data_non_smoker['age'], combine_data_non_smoker['charges'], c='green')
plt.scatter(combine_data_non_smoker['age'], combine_data_non_smoker['predicted_charges'], c='red')
plt.title("Analysis of Predicted Charges for Non-Smokers\nObserved Charges: green, Predicted Charges: red")
plt.show()

- Observation
    - Our best model predicts the charges almost accurately for the following:
        - Non-smokers with treatment charges below 15000.
        - Most of the smoking patients.
    - The model fails badly to give accurate predictions for the following:
        - Non-smokers with treatment charges above 15000.
        - Some smokers with treatment charges above 50000.
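The visual observations above can be backed by an error figure per segment. The following is a minimal sketch on a made-up table (the values are illustrative and are not model output); `segment` and `squared_error` are helper columns introduced only for this example. The same grouping could be applied to `combine_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical observed and predicted charges, for illustration only.
toy = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "yes"],
    "charges": [4000.0, 9000.0, 22000.0, 35000.0, 58000.0],
    "predicted_charges": [4500.0, 8000.0, 10000.0, 36000.0, 47000.0],
})

# Split records at the 15000 threshold used in the observations.
toy["segment"] = np.where(toy["charges"] > 15000, "above 15000", "up to 15000")
toy["squared_error"] = (toy["charges"] - toy["predicted_charges"]) ** 2

# RMSE per smoker / charge-segment combination.
segment_rmse = toy.groupby(["smoker", "segment"])["squared_error"].mean().pow(0.5)
print(segment_rmse)
```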