# Medical Costs: How Does Your Profile Affect Your Medical Charges?


Today we will explore a data set on the cost of medical treatment for different patients. The cost of treatment depends on many factors: diagnosis, type of clinic, city of residence, age, and so on. We have no data on the patients' diagnoses, but we have other information that can help us draw conclusions about their health and practice regression analysis. In any case, I wish you good health! Let's look at our data.

### Kaggle Notebook: Click [here](https://www.kaggle.com/hirenkelaiya/medical-insurance-prediction)

### Columns

   - age: age of primary beneficiary

   - sex: insurance contractor gender, female, male

   - bmi: Body mass index, an objective index of body weight relative to height (kg / m ^ 2), computed as weight divided by height squared; it indicates whether a weight is relatively high or low for a given height, ideally 18.5 to 24.9

   - children: Number of children covered by health insurance / Number of dependents

   - smoker: whether the beneficiary smokes (yes, no)

   - region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

   - charges: Individual medical costs billed by health insurance
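
As a quick illustration of the bmi column, the sketch below computes BMI as weight in kilograms divided by height in meters squared and checks it against the 18.5 to 24.9 range mentioned above. The function names and the sample weight and height are hypothetical and are not taken from the data set.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg / m^2."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI lies in the 18.5 to 24.9 range quoted above."""
    return low <= value <= high


# Hypothetical person: 70 kg, 1.75 m tall
example = bmi(70, 1.75)
print(round(example, 1), in_ideal_range(example))
```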

In [None]:
!pip install comet_ml

In [None]:
# import comet_ml at the top of your file
from comet_ml import Experiment

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
comet_api_key = user_secrets.get_secret("cometmlkey")

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/insurance/insurance.csv')
data.head()

In [None]:
data.shape

In [None]:
! pip freeze | grep seaborn

In [None]:
!pip install -U seaborn

In [None]:
import seaborn as sns
sns.__version__

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## EDA

In [None]:
data.isnull().sum()

#### Insights:
 - No missing values

In [None]:
data.describe()

#### Insights:
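- charges appear to be right-skewed (median < mean)
- for age, BMI, and children the mean is close to the median, so their distributions look roughly symmetric; the summary table alone cannot confirm that they are normally distributed
- The average charge is about $13,270.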
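
The "median < mean" reasoning can be sketched with the standard library alone. The helper name `skew_direction` and the toy charges are hypothetical; this is only a rough heuristic, not a formal skewness test.

```python
from statistics import mean, median


def skew_direction(values):
    """Compare mean and median as a rough indicator of skew."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "symmetric"


# Hypothetical toy charges: one large bill pulls the mean above the median
toy_charges = [1200, 2500, 4000, 6500, 9000, 38000]
print(mean(toy_charges), median(toy_charges), skew_direction(toy_charges))
```

On the notebook's DataFrame, `data['charges'].skew()` would give a numeric skewness estimate instead of this coarse label.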

In [None]:
data.sex.value_counts()

In [None]:
data.smoker.value_counts()

In [None]:
data.region.value_counts()

#### Insights:
 - smoker is unbalanced: there are more non-smokers than smokers
 - sex and region seem to be balanced

### Relationship with target

In [None]:
sns.displot(data, x='charges', kind='kde')
plt.show()

The plot shows that charges are skewed to the right. Visually, there may be outliers in the long right tail (the maximum charge is $63,770).
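
One way to move from a visual impression to a concrete list of candidate outliers is Tukey's interquartile-range rule. This technique is not used in the notebook itself; the sketch below is an assumption about how one might check, with a hypothetical helper name and toy values.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical toy charges with one very large bill
toy_charges = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 60000]
print(iqr_outliers(toy_charges))
```

A flagged value is only a candidate: in cost data a long right tail is often genuine, so such rows are usually inspected rather than dropped.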

In [None]:
sns.displot(data, x='charges',row='sex', col='region', hue='smoker', fill=True, multiple='stack', kind='kde')
plt.show()

In [None]:
var = 'sex'
mean_data = data.groupby(var).charges.mean()
print(mean_data)
print(mean_data.diff())
sns.violinplot(data=data, x=var, y='charges')
plt.title('Distribution of target against ' + var)
plt.show()

Based on the violin plot, there is not much difference between the sexes. The average charge for males is slightly higher than for females, by around $1,387.

In [None]:
var = 'smoker'
mean_data = data.groupby(var).charges.mean()
print(mean_data)
print(mean_data.diff())
sns.violinplot(data=data, x=var, y='charges')
plt.title('Distribution of target against ' + var)
plt.show()

So there is a difference of around $23,615 in average charges between smokers and non-smokers. Smoking is very expensive indeed.

In [None]:
var = 'region'
mean_data = data.groupby(var).charges.mean()
print(mean_data)
print(mean_data.diff())
sns.violinplot(data=data, x=var, y='charges')
plt.title('Distribution of target against ' + var)
plt.show()

#### Insights:
 - As with gender, the regions do not differ much from each other based on the plot. Even so, individuals from the Southeast have higher charges on their bills. The highest-charged individual also lives in that region, as shown in the chart.

In [None]:
sns.pairplot(data, hue='smoker')
plt.show()

#### Insights:

 - Focusing on the first three charts in the bottom row (charges against age, BMI, and children), the higher charges are dominated by blue points, which represent smokers.

## Preprocessing

In [None]:
# Categorical to numerical
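# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool),
# which avoids dtype errors when the data is passed to statsmodels' OLS later on
data = pd.get_dummies(data, prefix=['sex', 'smoker', 'region'], drop_first=True, dtype=int)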
data.head()
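
To see what this encoding step produces, here is a small sketch on a hypothetical four-row frame that mimics the categorical columns. The toy rows are made up, and `dtype=int` is an assumption added here so the output prints as 0/1. With `drop_first=True`, the alphabetically first level of each column (female, no, northeast) becomes the baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical toy rows that mimic the categorical columns of the data set
toy = pd.DataFrame({
    "age": [19, 28, 33, 45],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# One dummy column per non-baseline level; numeric columns pass through unchanged
encoded = pd.get_dummies(
    toy, prefix=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per variable avoids perfectly collinear columns once an intercept is added, which matters for the regression coefficients interpreted later.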

In [None]:
# Train-test split
X = data.drop('charges', axis=1)
y = data['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) 

## Quantify effects

#### Hypothesis:

We have already visualized the relationship of the variables to the charges. Now we will investigate further by modeling these relationships with multiple linear regression. Remember that the aim of this section is to quantify the relationships, not to build a prediction model. The regression is fitted on the training set created above.

Based on the visualizations, we can make several hypotheses about these relationships.


   - There is no real difference in charges between genders or regions.
    
   - The charges for smokers are much higher than for non-smokers.
    
   - The charge gets higher as the individual gets older.
    
   - The charge gets higher once the individual's BMI exceeds 30.
    
   - Lastly, the charge is higher for those who have fewer children.

In [None]:
import statsmodels.api as sm
from scipy import stats

X_train_const = sm.add_constant(X_train)
linearModel = sm.OLS(y_train, X_train_const)
linear = linearModel.fit()
print(linear.summary())

#### Hypothesis validation:

1. There is no real difference in charges between genders (p-value 0.907) or regions (p-values 0.342, 0.093, 0.173).
   - since all the p-values > 0.05, we find no statistically significant effect of these variables on the target variable
2. The charges for smokers are much higher than for non-smokers (p-value 0.000).
   - since p-value < 0.05, this variable is statistically significant
3. The charge gets higher as the individual gets older (p-value 0.000).
   - since p-value < 0.05, this variable is statistically significant
4. The charge gets higher once the individual's BMI exceeds 30 (p-value 0.000).
   - since p-value < 0.05, this variable is statistically significant; note that the model tests a linear effect of BMI, not a threshold at 30
5. Lastly, the charge is higher for those who have fewer children (p-value 0.005).
   - since p-value < 0.05, this variable is statistically significant, meaning there is evidence that charges differ with the number of children; the direction of the effect is given by the sign of the coefficient in the summary, not by the p-value
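
The decision rule used in the list above can be written down in a few lines. The p-values below are the ones reported in the list; the helper name and the "any dummy column below the threshold" rule for region are simplifying assumptions (a joint test over all region dummies would be the more rigorous option).

```python
ALPHA = 0.05

# p-values as reported in the hypothesis validation above;
# region has one p-value per dummy column
reported_p_values = {
    "sex": [0.907],
    "region": [0.342, 0.093, 0.173],
    "smoker": [0.000],
    "age": [0.000],
    "bmi": [0.000],
    "children": [0.005],
}


def is_significant(p_values, alpha=ALPHA):
    """True if any term of the variable has a p-value below alpha."""
    return any(p < alpha for p in p_values)


decisions = {name: is_significant(ps) for name, ps in reported_p_values.items()}
for name, significant in decisions.items():
    print(f"{name:<9} significant at {ALPHA}: {significant}")
```

With the fitted statsmodels result, the same values are available programmatically as `linear.pvalues`, and the signs of the effects as `linear.params`.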

## Build models

In this section, we will create regression models and compare how well they fit the data. The models considered are Linear Regression, Ridge, LASSO, and ElasticNet.
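
For reference, these four models differ only in the penalty added to the least-squares objective. The forms below follow the objectives documented for the scikit-learn estimators, with the intercept omitted for brevity. The symbols are introduced here for illustration: $w$ is the coefficient vector, $n$ the number of training samples, $\alpha$ the regularization strength (`alpha`, default 1.0), and $\rho$ the `l1_ratio` of ElasticNet (default 0.5).

$$
\begin{aligned}
\text{Linear Regression:}\quad & \min_w \lVert y - Xw \rVert_2^2 \\
\text{Ridge:}\quad & \min_w \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{LASSO:}\quad & \min_w \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:}\quad & \min_w \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Because the penalties act on the size of the coefficients, the scale of each feature matters for the regularized models, which is why a scaling step is part of the pipeline.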

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
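
As a rough sketch of what the two scalers do to a single column, the standard-library functions below mirror the default behaviour of `StandardScaler` (zero mean, unit population standard deviation) and `MinMaxScaler` (rescale to [0, 1]). The function names and the toy ages are hypothetical, and the sketch assumes the column is not constant.

```python
from statistics import mean, pstdev


def standardize(values):
    """Zero mean and unit variance, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy column
toy_ages = [20, 30, 40, 50, 60]
print(standardize(toy_ages))
print(min_max(toy_ages))
```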

In [None]:
pipeline = Pipeline([
    ('scaling', 'passthrough'),
    ('model', 'passthrough')    
])

param_grid = {
    'scaling': [StandardScaler(), MinMaxScaler()],
    'model': [LinearRegression(), Ridge(), Lasso(), ElasticNet()]
}

grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='r2')
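
The grid above is small enough to enumerate by hand. The sketch below uses plain strings in place of the estimator objects (an illustration only, not how scikit-learn stores them) to count the candidate pipelines and the cross-validation fits they require with `cv=5`.

```python
from itertools import product

# Names stand in for the estimator instances listed in param_grid
scalers = ["StandardScaler", "MinMaxScaler"]
models = ["LinearRegression", "Ridge", "Lasso", "ElasticNet"]
cv_folds = 5

# Every scaler is paired with every model
candidates = list(product(scalers, models))
total_fits = len(candidates) * cv_folds

for scaler, model in candidates:
    print(f"{scaler} -> {model}")
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

Since the regularized models are used with their default settings, the search compares model families and scalers rather than tuning the regularization strength.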

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid.best_estimator_

In [None]:
grid.score(X_test, y_test)
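
Because the search was configured with `scoring='r2'`, the score reported on the test set is the coefficient of determination. A minimal standard-library sketch of that metric follows; the function name `r_squared` and the toy values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy charges
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
print(r_squared(y_true, y_true))                            # perfect predictions
print(r_squared(y_true, [2500.0] * 4))                      # always predicting the mean
print(r_squared(y_true, [1100.0, 1900.0, 3200.0, 3900.0]))  # small errors
```

A value of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean charge, and negative values are possible for models that do worse than that.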

## Track Experiments on [comet.ml](https://www.comet.ml/docs/python-sdk/scikit/)

In [None]:
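# Log each grid-search candidate as a separate Comet experiment
for i, params in enumerate(grid.cv_results_['params']):
    exp = Experiment(workspace='hirenhk15',
        project_name='medical-insurance-charges-prediction',
        api_key=comet_api_key)
    exp.log_parameters(params)
    for k, v in grid.cv_results_.items():
        # 'params' and the 'param_*' entries hold settings, not metrics
        if k == 'params' or k.startswith('param_'):
            continue
        exp.log_metric(k, v[i])
    exp.end()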

Click [here](https://www.comet.ml/hirenhk15/medical-insurance-charges-prediction/view/new) to view experiment results.

### Conclusion

We found that region and gender do not bring a significant difference in charges among their groups. Age, BMI, number of children, and smoking are the ones that drive the charges. The statistical relationship between number of children and charges is surprisingly different from what our visualization suggested. Meanwhile, Ridge with MinMaxScaler gave the highest R2 score.