# About the Dataset

- **age:** age of primary beneficiary

- **sex:** gender of the insurance contractor (female or male)

- **bmi:** body mass index, an objective measure of whether body weight is high or low relative to height,
computed as weight divided by height squared (kg / m^2); ideally 18.5 to 24.9

- **children:** Number of children covered by health insurance / Number of dependents

- **smoker:** whether the beneficiary smokes (yes or no)

- **region:** the beneficiary's residential area in the US: northeast, southeast, southwest or northwest

- **charges:** Individual medical costs billed by health insurance
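
The BMI definition above can be written out as a small helper. This is only an illustrative sketch: the function names `bmi` and `in_ideal_range` and the example person are assumptions, not part of the dataset or the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when a BMI falls inside the 18.5 to 24.9 range quoted above."""
    return low <= value <= high


# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2), in_ideal_range(example))
```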

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%pip install comet_ml

In [None]:
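# Import comet_ml before other ML libraries (e.g. sklearn) so Comet can hook into them
from comet_ml import Experiment
from kaggle_secrets import UserSecretsClient

# Read the Comet API key stored as a Kaggle secret named "my_api"
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("my_api")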

In [None]:
data = pd.read_csv("/kaggle/input/insurance/insurance.csv")

In [None]:
data

In [None]:
data.shape

In [None]:
%pip install -U seaborn

In [None]:
import seaborn as sns
sns.__version__

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## EDA

In [None]:
data.isnull().sum()

#### Insight: No missing values

In [None]:
data.describe()

### Insights:
- charges seems to be right-skewed (median < mean)
- age, bmi and children seem roughly symmetric (mean close to median), which is consistent with, but does not prove, a normal distribution
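
The mean-versus-median reading can be backed up with the sample skewness. The sketch below uses a synthetic right-skewed series (`synthetic_charges` is made up, not the insurance data) to show the pattern to look for.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a right-skewed cost column (not the real data)
synthetic_charges = pd.Series(
    rng.lognormal(mean=9.0, sigma=0.9, size=1000), name="charges"
)

summary = {
    "mean": synthetic_charges.mean(),
    "median": synthetic_charges.median(),
    "skew": synthetic_charges.skew(),  # > 0 indicates a longer right tail
}
print(summary)
```

On the notebook's own frame, the equivalent call would be `data[['age', 'bmi', 'children', 'charges']].skew()`.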

In [None]:
data.sex.value_counts()

In [None]:
data.smoker.value_counts()

In [None]:
data.region.value_counts()

### Insights:
- smoker is unbalanced: there are more non-smokers than smokers
- sex and region seem to be balanced
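
Balance is easier to judge from shares than from raw counts. A minimal sketch on a tiny hypothetical frame (`toy` is invented for illustration):

```python
import pandas as pd

# Tiny hypothetical sample, only to show the call pattern
toy = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "no", "no", "yes", "no"],
    "sex": ["female", "male", "female", "male", "male", "female", "female", "male"],
})

# normalize=True returns the fraction of rows per category instead of counts
shares_by_col = {
    col: toy[col].value_counts(normalize=True).to_dict()
    for col in ["smoker", "sex"]
}
print(shares_by_col)
```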

## Relationship with target variable

In [None]:
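# distplot is deprecated in recent seaborn; histplot with kde=True is its replacement
sns.histplot(data=data, x='charges', kde=True, stat='density')
plt.show()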

In [None]:
sns.displot(data=data,
            x='charges',
            row='sex',
            col='region',
            hue='smoker',
            fill=True,
            multiple='stack',
            kind='kde')
plt.show()

Learn more about distribution plots <a href="https://seaborn.pydata.org/tutorial/distributions.html">here</a>.

In [None]:
var = 'sex'
mean_data = data.groupby(var).charges.mean()
print(mean_data)
print(mean_data.diff())
sns.violinplot(data=data, x=var, y='charges')
plt.title('Distribution of target against '+var)
plt.show()

In [None]:
var = 'smoker'
mean_data = data.groupby(var).charges.mean()
print(mean_data)
print(mean_data.diff())
sns.violinplot(data=data, x=var, y='charges')
plt.title('Distribution of target against '+var)
plt.show()

In [None]:
var = 'region'
mean_data = data.groupby(var).charges.mean()
print(mean_data)
#print(mean_data.diff())
sns.violinplot(data=data, x=var, y='charges')
plt.title('Distribution of target against '+var)
plt.show()

### Insights:
- sex and region do not seem to have much impact on the target.
- smoker seems to have a huge impact on the target.
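
One way to put a number on "much impact" versus "huge impact" is a standardized mean difference (Cohen's d). The notebook does not use this measure; it is brought in here only as an illustration. The two pairs of groups below are synthetic, chosen to show a large gap and a negligible one.

```python
import numpy as np


def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled standard deviation)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


rng = np.random.default_rng(1)

# Synthetic groups, not the insurance data
group_far_a = rng.normal(30000, 8000, 200)   # e.g. a clearly more expensive group
group_far_b = rng.normal(9000, 6000, 800)
group_near_a = rng.normal(13000, 9000, 500)  # two groups with almost equal means
group_near_b = rng.normal(13500, 9000, 500)

print(round(cohens_d(group_far_a, group_far_b), 2))
print(round(cohens_d(group_near_a, group_near_b), 2))
```

In the notebook, the two arrays could be `data.loc[data.smoker == 'yes', 'charges']` and `data.loc[data.smoker == 'no', 'charges']`.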

In [None]:
sns.pairplot(data,
            hue='smoker',
            palette='plasma')
plt.show()

## Hypothesis

We have already visualized how each variable relates to the charges. Next we quantify those relationships with multiple linear regression. Remember that the aim of this section is to measure the effects, not to build the prediction model. Let us first create training and testing data sets to proceed.

Based on the visualizations, we can state a few hypotheses about these relationships.


   - There is no real difference in charges between genders or regions.

   - Charges for smokers are much higher than for non-smokers.

   - Charges increase as the individual gets older.

   - Charges increase once the individual's BMI goes over 30.

   - Lastly, charges are higher for those who have fewer children.
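
The BMI hypothesis refers to a threshold rather than a steady trend, so it can help to make the threshold explicit as an indicator column. A minimal sketch on made-up rows; the column name `bmi_over_30` is an assumption introduced here.

```python
import pandas as pd

# Hypothetical rows; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "bmi": [22.0, 27.5, 29.9, 30.4, 33.1, 41.2],
    "charges": [4000.0, 5200.0, 6100.0, 9800.0, 12500.0, 15000.0],
})

# Indicator for the cut-off named in the hypothesis (BMI over 30)
toy["bmi_over_30"] = toy["bmi"] > 30

# Mean charges on each side of the threshold
means = toy.groupby("bmi_over_30")["charges"].mean().to_dict()
print(means)
```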

## Pre-Processing

In [None]:
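# Categoricals to numerical: one-hot encode and drop the first level of each as the reference
# dtype=int keeps the dummies as 0/1; newer pandas defaults to booleans, which statsmodels OLS can reject
data = pd.get_dummies(data, prefix=['sex', 'smoker', 'region'], drop_first=True, dtype=int)
data.head()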
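
To see what `drop_first=True` does, here is a small sketch on a hypothetical frame. The level that sorts first is dropped and becomes the reference category, so four regions yield three dummy columns. The sketch also passes `dtype=int` so the output is 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical rows, one per region
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "region": ["northeast", "northwest", "southeast", "southwest"],
})

# northeast sorts first, so it is dropped and acts as the reference category
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each remaining region column is then interpreted relative to the dropped level.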

In [None]:
# Split train-test
X = data.drop(columns='charges')
y = data.loc[:, 'charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Quantify Effects

In [None]:
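import statsmodels.api as sm
from scipy import stats

# statsmodels does not add an intercept automatically, so add a constant column
X_train_const = sm.add_constant(X_train)

# Fit ordinary least squares on the training split and print the coefficient table
linearModel = sm.OLS(y_train, X_train_const)
linear = linearModel.fit()
print(linear.summary())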

There is no real difference in charges between genders (p-value 0.907) or regions (p-values 0.342, 0.093, 0.173).

Charges for smokers are much higher than for non-smokers.

Charges increase as the individual gets older.

Charges increase with BMI; note that a single linear `bmi` term measures the overall trend rather than a jump at a BMI of 30.

Lastly, charges are higher for those who have fewer children.

## Build model

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

In [None]:
pipeline = Pipeline([
    ('scaling', 'passthrough'),
    ('model', 'passthrough')
])

param_grid = {
    'scaling': [StandardScaler(), MinMaxScaler()],
    'model': [LinearRegression(), Ridge(), Lasso(), ElasticNet()]
}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid.best_estimator_

In [None]:
grid.score(X_test, y_test)
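
Beyond the single best estimator, it is often useful to compare every scaler and model combination side by side. The sketch below does that on a synthetic regression problem with a reduced grid; `X_demo`, `y_demo` and `search` are stand-ins for the notebook's `X_train`, `y_train` and `grid`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic regression problem standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scaling", "passthrough"), ("model", "passthrough")])
search = GridSearchCV(
    pipe,
    param_grid={
        "scaling": [StandardScaler(), MinMaxScaler()],
        "model": [LinearRegression(), Ridge()],
    },
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

# One row per scaler/model combination, best mean CV score first
columns = ["param_scaling", "param_model", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_).loc[:, columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```

Applied to the notebook's search, `pd.DataFrame(grid.cv_results_)` gives the equivalent table.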

In [None]:
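# Log every grid-search candidate to Comet as its own experiment
for i, params in enumerate(grid.cv_results_['params']):
    exp = Experiment(workspace="maksteel",
        project_name="saturday-codealong-medical-insurance-costs-predict",
        api_key=secret_value_0)
    exp.log_parameters(params)
    for k, v in grid.cv_results_.items():
        # 'params' and 'param_*' entries hold estimator objects, not numeric metrics
        if k != "params" and not k.startswith("param_"):
            exp.log_metric(k, v[i])
    exp.end()  # close this experiment before starting the next one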