In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# What determines your insurance fee?


### The provided dataset contains several variables that could explain how much patients pay for their medical care. The goal is to find **which variables**, under **which models**, help predict the explained variable, and to measure the goodness of fit.

# EDA:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style('darkgrid')

In [None]:
dataset = pd.read_csv(r'/kaggle/input/insurance/insurance.csv')
dataset

### As we can see in the dataset, we have the following variables:

#### Predictors:
* age: age of the beneficiary.
* sex: gender of the insurance contractor (male, female).
* bmi: body mass index, a numeric variable that indicates whether the contractor is overweight or underweight.
* children: number of children covered by the insurance policy.
* smoker: whether the beneficiary smokes (yes, no).
* region: beneficiary's residential area.

#### Predicted variable:

charges: medical cost billed to beneficiary.


In [None]:
dataset.describe()

### From the previous table, we can see that: 
1. The variance of the variable "children" is far smaller than that of the other variables. This may be worth taking into account when tuning the baseline model.
2. The variables "age" and "bmi" have a similar shape, except for the mean and the maximum value.

### Now, the distributions of age, bmi and charges will be plotted:

In [None]:
fig, axs = plt.subplots(ncols = 3, figsize = (20,5))
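# distplot is deprecated in recent seaborn releases; histplot with kde = True draws a comparable plot
sbn.histplot(dataset['age'], kde = True, ax = axs[0])
sbn.histplot(dataset['bmi'], kde = True, ax = axs[1])
sbn.histplot(dataset['charges'], kde = True, ax = axs[2])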

### From previous plots, we can see:
1. Only the "bmi" variable has an approximately normal shape.
2. "charges" appears to be highly skewed.
3. Regarding "age", there are more patients in their early twenties than in any other age range.
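The visual impression of skewness can also be checked numerically. The sketch below is only illustrative: it uses a synthetic log-normal sample (`charges_demo`, an assumed stand-in, not the real "charges" column) and compares the sample skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a right-skewed cost variable (NOT the real data).
rng = np.random.default_rng(0)
charges_demo = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=2000))

raw_skew = charges_demo.skew()          # sample skewness of the raw values
log_skew = np.log(charges_demo).skew()  # skewness after a log transform

print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

On the notebook's data, the same two calls could be applied to `dataset['charges']`; a value well above zero would confirm the long right tail seen in the plot.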

### For the categorical variables "sex" and "region", we can also check whether the data is concentrated in certain values or is well distributed across the different categories.

In [None]:
fig2, axs2 = plt.subplots(ncols = 2, figsize = (15,6))
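# Pass the column as x explicitly; recent seaborn versions treat the first positional argument as data
sbn.countplot(x = dataset['sex'], ax = axs2[0])
sbn.countplot(x = dataset['region'], ax = axs2[1])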

### As we can see, the dataset is quite well distributed across different categories, so no modifications will be needed regarding this fact.

### We can also check how the previous variables affect the explained variable: 


In [None]:
fig3, axs3 = plt.subplots(ncols = 1, figsize = (15,10))
sbn.boxplot(x = dataset['region'], y = dataset['charges'], hue = dataset['sex'], ax = axs3)

### From the previous boxplot we can see that:
* The dataset may contain outliers, which deserve further analysis.
* Neither "sex" nor "region" has a very significant impact on "charges". That said, we should further explore the remaining variables in the dataset.

### In order to see the influence of age and bmi on charges, we will make a pairplot:

In [None]:
sbn.pairplot(dataset.select_dtypes(exclude = ['object']).drop(["children"], axis = 1))

### From the previous pairplot, we can see that: 

* There is a slight slope in the "age" vs "bmi" plot.
* There are three "lines" with increasing dispersion in the "age" vs "charges" plot, meaning that there might be other variables affecting the relationship between age and charges.

### Let's see the impact that number of children under the policy may have:

In [None]:
fig4, axs4 = plt.subplots(ncols = 1, figsize = (15,10))
sbn.boxplot(x = dataset['children'], y = dataset['charges'], ax = axs4)

### It is not clear that the number of children significantly affects the variable "charges", as the overlapping boxplots show.

### Now, we will analyse the relationship between "smoker" and the explained variable:

In [None]:
fig5, axs5 = plt.subplots(ncols = 1, figsize = (7,5))
sbn.boxplot(x = dataset['smoker'], y = dataset['charges'], ax = axs5)

### Bingo! Here we have a factor that significantly affects "charges"! We will further explore the relationship between smoker, bmi and age and their effect on charges.
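A boxplot gap like this can be backed by a quick group summary. The following is a minimal sketch on a tiny made-up frame (`toy`, with invented values that only reuse the notebook's column names); it is not the notebook's data or result.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the notebook's dataset.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
    "charges": [30000.0, 4000.0, 42000.0, 6000.0, 8000.0, 36000.0],
})

# Median and number of observations of charges within each smoker group.
summary = toy.groupby("smoker")["charges"].agg(["median", "count"])
print(summary)
```

Replacing `toy` with `dataset` would give the corresponding medians for the real data.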


In [None]:
fig6, axs6 = plt.subplots(ncols = 2, figsize = (15,10))
sbn.scatterplot(x = dataset[dataset['smoker'] == 'yes']['age'], y = dataset[dataset['smoker'] == 'yes']['charges'], ax = axs6[0]).set_title("Smoker = yes")
sbn.scatterplot(x = dataset[dataset['smoker'] == 'no']['age'], y = dataset[dataset['smoker'] == 'no']['charges'], ax = axs6[1]).set_title("Smoker = no")

### We can say that there is a positive relationship between age and charges. Nevertheless, there may be another factor that splits the response: 
* For non-smokers, there are two groups: one with very low dispersion and charges up to 15,000, and another with much higher fees and higher dispersion.
* For smokers, there are also two separate groups.

### Now, we will try to include BMI in the equation to see if it can explain that separation.

In [None]:
# histplot with kde = True replaces the deprecated distplot
sbn.histplot(dataset['bmi'], kde = True).set_title("Distribution of bmi variable")

As we previously saw, the bmi variable has a bell shape. We will create a variable called "overweight" that flags whether the beneficiary is overweight (bmi above 30) and see if it helps to explain the two groups formed in the previous age / charges scatterplot.

In [None]:
dataset['overweight'] = np.where(dataset['bmi']>30, 'yes', 'no')

In [None]:
fig7, axs7 = plt.subplots(ncols = 2, figsize = (15,10))
sbn.scatterplot(x = dataset[(dataset['smoker'] == 'yes')]['age'], y = dataset[(dataset['smoker'] == 'yes')]['charges'], ax = axs7[0], hue = dataset['overweight']).set_title("Smoker = yes")
sbn.scatterplot(x = dataset[(dataset['smoker'] == 'no')]['age'], y = dataset[(dataset['smoker'] == 'no')]['charges'], ax = axs7[1], hue = dataset['overweight']).set_title("Smoker = No")

### In the case of smokers, being overweight clearly separates the two groups: overweight smokers pay much more for their medical insurance.
### Since bmi does not explain the same split for non-smokers, we will make similar plots with other variables. We will also include a binarized version of the number of children (have / don't have children):

In [None]:
dataset['num_family'] = np.where(dataset['children']>0, 'yes', 'no')

In [None]:
fig8, axs8 = plt.subplots(ncols = 4, figsize = (20,10))
sbn.scatterplot(x = dataset[(dataset['smoker'] == 'no')]['age'], y = dataset[(dataset['smoker'] == 'no')]['charges'], hue = dataset['region'], ax = axs8[0]).set_title("Smoker = No")
sbn.scatterplot(x = dataset[(dataset['smoker'] == 'no')]['age'], y = dataset[(dataset['smoker'] == 'no')]['charges'], hue = dataset['sex'], ax = axs8[1]).set_title("Smoker = No")
sbn.scatterplot(x = dataset[(dataset['smoker'] == 'no')]['age'], y = dataset[(dataset['smoker'] == 'no')]['charges'], hue = dataset['children'], ax = axs8[2]).set_title("Smoker = No")
sbn.scatterplot(x = dataset[(dataset['smoker'] == 'no')]['age'], y = dataset[(dataset['smoker'] == 'no')]['charges'], hue = dataset['num_family'], ax = axs8[3]).set_title("Smoker = No")

### With the available data, categorized as it is, we can't explain why some non-smokers pay more than others in the way we did for smokers (where being overweight caused the difference).

### With the last graph we conclude the EDA section, having analysed the dataset characteristics as well as the relationships between variables. In the following part, we will build a baseline model against which we will compare some tuned versions.

In [None]:
dataset = dataset.drop(['overweight', 'num_family'], axis = 1)

# Baseline model

### In this model, the only preprocessing step will be transforming categorical variables into dummy variables. That is, the categorical values are replaced by new binary variables.
### To avoid collinearity, I will tell OneHotEncoder to drop the first category of each variable (e.g. "sex" has "male" and "female", so one of the two is dropped).

### Before anything else, I load the libraries that will be used later on.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer,TransformedTargetRegressor
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

In [None]:
X_num = dataset.select_dtypes(exclude = ['object']).drop(['charges'], axis = 1)
y = dataset['charges']
X_cat = dataset.select_dtypes(include = ['object'])
X = pd.concat([X_num, X_cat], axis = 1)
cat_index = dataset.select_dtypes(include = ['object']).columns

In [None]:
trans_steps= [('cat', OneHotEncoder(drop = 'first'), cat_index)]
cols_transform = ColumnTransformer(transformers = trans_steps)
pipeline = Pipeline(steps = [('trans', cols_transform), ('model', LinearRegression())])
cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
baseline = cross_val_score(estimator = pipeline, X =X, y = y, scoring = 'r2', cv = cv, n_jobs = -1)

In [None]:
sbn.boxplot(baseline, orient = 'v').set_title("R2 of baseline model")

### I have used a simple pipeline that applies OneHotEncoder to the categorical variables. Note that ColumnTransformer drops every column it is not told to transform (its default is remainder = 'drop'), so the numerical variables do not enter this baseline. After the transformation, a simple linear regression is fitted and cross-validated.
### As we can see, the baseline model predicts the medical charges with an R2 of 61%. 
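It is worth knowing which columns actually reach the model. `ColumnTransformer` discards any column not named in its `transformers` unless `remainder = 'passthrough'` is set. The sketch below illustrates this on a made-up frame (`toy` and the helper `n_output_columns` are hypothetical names used only for illustration).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "bmi": [27.9, 22.7, 30.1, 35.2],
    "smoker": ["yes", "no", "no", "yes"],
})

def n_output_columns(remainder):
    """Return how many columns the transformer emits for the toy frame."""
    ct = ColumnTransformer(
        transformers=[("cat", OneHotEncoder(drop="first"), ["smoker"])],
        remainder=remainder,
    )
    return ct.fit_transform(toy).shape[1]

dropped = n_output_columns("drop")       # default: only the smoker dummy survives
kept = n_output_columns("passthrough")   # smoker dummy plus age and bmi
print(dropped, kept)
```

With the default, only the encoded columns are passed on; with `'passthrough'`, the untouched numeric columns are appended after them.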

### Now, I will try to fine-tune the model and also try different regressors to see if we can improve the previous result. 

# Tuning the baseline model

In [None]:
y = dataset['charges']
X = dataset.drop(['charges'], axis = 1)

In [None]:
# Restrict the variance to numeric columns; X also holds string columns
X.var(numeric_only = True)

### As stated previously, "children" has very little variance compared to the rest of the numerical variables in the dataset. For that reason, the next step will be to convert "children" to categorical data and check whether this modification improves the model performance.

In [None]:
X['children'] = X['children'].astype(str)

In [None]:
# Take the categorical columns from X, where "children" is now a string column
cat_index = X.select_dtypes(include = ['object']).columns
trans_steps= [('cat', OneHotEncoder(drop = 'first'), cat_index)]
cols_transform = ColumnTransformer(transformers = trans_steps)
pipeline = Pipeline(steps = [('trans', cols_transform), ('model', LinearRegression())])
cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
baseline_child_str = cross_val_score(estimator = pipeline, X =X, y = y, scoring = 'r2', cv = cv, n_jobs = -1)
sbn.boxplot(data = [baseline, baseline_child_str], orient = 'v').set_title("Baseline model (left), baseline with cat. children (right)")

### Well, not much of an effect! Statistically speaking, it does not make any difference, although it improves the model by 0.23%. Even though the impact is quite limited, I decided to keep the categorization of the variable.

### The next step will be standardizing the numerical variables and checking the impact on model performance.

In [None]:
cat_index = X.select_dtypes(include = ['object']).columns
num_index = X.select_dtypes(exclude = ['object']).columns
trans_steps= [('cat', OneHotEncoder(drop = 'first'), cat_index), ('standarz', StandardScaler(),num_index)]
cols_transform = ColumnTransformer(transformers = trans_steps)
pipeline = Pipeline(steps = [('trans', cols_transform), ('model', LinearRegression())])
ttregressor = TransformedTargetRegressor(regressor = pipeline, transformer = StandardScaler())
cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
baseline_trans = cross_val_score(estimator = ttregressor, X =X, y = y, scoring = 'r2', cv = cv, n_jobs = -1)
sbn.boxplot(data = [baseline, baseline_child_str, baseline_trans], orient = 'v').set_title("Baseline model (left), baseline with cat. children (center) and cat.+stand (right)")

### Beautiful! Once the standardized numerical variables enter the model, the linear regression clearly improves. That said, we will keep the categorization of the "children" variable plus the standardization of the X variables and of the y variable for the following tests.
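`TransformedTargetRegressor` scales the target before fitting and maps the predictions back to the original units. For ordinary least squares with an intercept, rescaling the target should not change the predictions at all, which the following sketch checks on synthetic data (`X_demo` and `y_demo` are assumed stand-ins, not the notebook's variables).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a target far from zero mean / unit variance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1000.0 + X_demo @ np.array([300.0, -150.0, 50.0]) + rng.normal(scale=20.0, size=200)

plain = LinearRegression().fit(X_demo, y_demo)
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), transformer=StandardScaler()
).fit(X_demo, y_demo)

# Predictions from the wrapped model come back in the original units of y.
same_predictions = np.allclose(plain.predict(X_demo), wrapped.predict(X_demo))
print(same_predictions)
```

Target scaling matters more for regularized models such as Ridge, where the penalty interacts with the scale of the coefficients.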

### Now, I will add polynomial features to the regression model. The first step in this stage will be building a pipeline to test different degrees and cross-validate them to select the best one. 

In [None]:
from sklearn.preprocessing import PolynomialFeatures
r2_degrees = []
for i in range(0,6):
    cat_index = X.select_dtypes(include = ['object']).columns
    num_index = X.select_dtypes(exclude = ['object']).columns
    trans_steps= [('cat', OneHotEncoder(drop = 'first'), cat_index), ('standarz', StandardScaler(),num_index), ('poly', PolynomialFeatures(degree = i), num_index)]
    cols_transform = ColumnTransformer(transformers = trans_steps)
    pipeline = Pipeline(steps = [('trans', cols_transform), ('model', LinearRegression())])
    ttregressor = TransformedTargetRegressor(regressor = pipeline, transformer = StandardScaler())
    cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
    cvs_i = cross_val_score(estimator = ttregressor, X =X, y = y, scoring = 'r2', cv = cv, n_jobs = -1)
    r2_degrees.append(cvs_i)
sbn.boxplot(data = r2_degrees).set_title("R2 with different Polynomial degrees")

### As we can see in the previous boxplot, the model performs practically the same for degree 1 and degree 2. For that reason, and for computational efficiency, I will not use polynomial features in the following phase, which is Recursive Feature Elimination (RFE). 

### For that purpose, I reuse the previous pipeline, with the difference that I will loop over the number of features to include in the model.

In [None]:
r2_deg1_rfe = []
for i in range(1,15):
    cat_index = X.select_dtypes(include = ['object']).columns
    num_index = X.select_dtypes(exclude = ['object']).columns
    trans_steps= [('cat', OneHotEncoder(drop = 'first'), cat_index),('standarz', StandardScaler(),num_index)]
    cols_transform = ColumnTransformer(transformers = trans_steps)
    pipeline = Pipeline(steps = [('trans', cols_transform),('rfe', RFE(estimator = LinearRegression(),n_features_to_select = i)),('model', LinearRegression())])
    ttregressor = TransformedTargetRegressor(regressor = pipeline, transformer = StandardScaler())
    cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
    cvs_i = cross_val_score(estimator = ttregressor, X =X, y = y, scoring = 'r2', cv = cv, n_jobs = -1)
    r2_deg1_rfe.append(np.mean(cvs_i))
# Use the number of selected features (1 to 14) as the x axis
plt.plot(range(1, 15), r2_deg1_rfe)
plt.title("R2 with degree = 1. Different variables")

### The previous line plot shows how the R2 of the model increases with the number of features used. It is noticeable that, after the 4th included feature, the performance stays at a similar level. Thus, we will use the 4 most important variables for our model and drop the rest.
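As a reminder of what RFE does: it fits the estimator, removes the feature with the smallest absolute coefficient, and repeats until the requested number of features remains. The sketch below shows this on synthetic data (`X_demo`, `y_demo` are assumptions for illustration) where only columns 0 and 3 carry signal.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: six features, only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(size=500)

selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X_demo, y_demo)

# Indices of the features that survived the elimination.
selected = np.flatnonzero(selector.support_).tolist()
print(selected)
```

Because the ranking relies on coefficient magnitudes, it is only meaningful when the features share a comparable scale, which is why standardization comes first in the pipeline.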

### We have conducted: 

### * Variable standardization
### * No polynomial transformation
### * 4 features maximum.

### The last thing I will try is fine-tuning the regression model through the alpha parameter of Ridge Regression. Since GridSearchCV will be used, I will transform the dataset manually, so that we only have to loop over the alpha parameter without needing a pipeline.
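The general pattern of searching over alpha looks as follows. This is a sketch on synthetic data with a short, assumed grid (`alphas`) and hypothetical variable names; it is not the notebook's grid or result.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Synthetic regression data (assumed for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

alphas = [0.01, 0.1, 1.0, 10.0]
cv_demo = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)

# Every alpha is scored with the same 12 train/test splits.
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, scoring="r2", cv=cv_demo)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_` holds the mean and standard deviation of the score for each alpha, which helps judge whether the best value is meaningfully better than its neighbours.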

### Below are my custom functions for this purpose. 

In [None]:
def columns_transform(X, degree = 1):
    ss = StandardScaler()
    pf = PolynomialFeatures(degree = degree)
    X_num = X.select_dtypes(exclude = ['object'])
    X_cat = X.select_dtypes(include = ['object'])
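    # get_feature_names was removed from recent scikit-learn releases; get_feature_names_out replaces it
    X_num = pd.DataFrame(data = pf.fit_transform(X_num), columns = pf.get_feature_names_out(X_num.columns))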
    X_num = pd.DataFrame(data = ss.fit_transform(X_num), columns = X_num.columns)
    X_cat = pd.get_dummies(X_cat).drop(['region_northeast', 'children_0','sex_female', 'smoker_no'], axis = 1)
    X_join = pd.concat([X_num, X_cat], axis = 1)
    return X_join

def scaler(X):
    ss = StandardScaler()
    X_scaled = ss.fit_transform(X.values.reshape(-1,1))
    X = pd.DataFrame(data = X_scaled, columns = ['charges'])
    return X

In [None]:
X_trans = columns_transform(X)
rfe = RFE(estimator = Ridge(), n_features_to_select = 4)
rfe.fit(X_trans, scaler(y))
print('Most important features: ',X_trans.columns[rfe.support_])
X_trans_4 = X_trans[X_trans.columns[rfe.support_]]

In [None]:
ridge_params = {'alpha':[0.01,0.05,0.1,0.30,0.50,0.60,0.70,0.80,0.90,1,2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,3,3.5,4,5,6,7,8,9,10,15]}
ridge_grid = GridSearchCV(estimator = Ridge(), param_grid = ridge_params, scoring = 'r2', refit = True, cv = cv,verbose = 3, n_jobs = -1)
ridge_grid.fit(X_trans_4,scaler(y))
print("Best alpha: ", ridge_grid.best_params_)

In [None]:
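# Evaluate the Ridge model on the four RFE-selected features.
# alpha = 2.4 is hard-coded; presumably it is the value reported by the grid search above.
pipeline = Pipeline(steps = [('model', Ridge(alpha = 2.4))])
cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
# Score the Ridge pipeline itself (not the ttregressor left over from the RFE loop)
best_linearRegressor = cross_val_score(estimator = pipeline, X = X_trans_4, y = scaler(y)['charges'], scoring = 'r2', cv = cv, n_jobs = -1)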
sbn.boxplot(data = [baseline, best_linearRegressor]).set_title("R2 of baseline model VS Best Regression Model")

### With this last step, I am done with the Linear Regression models. In the previous boxplot we can see the increase in performance that we have achieved compared with the baseline model (only dummy variables).
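When two models are scored with a `RepeatedKFold` that has a fixed `random_state`, both see exactly the same splits, so their scores can be compared fold by fold. A minimal sketch on synthetic data (all names here are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data: the target depends on the first two of three features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(size=300)

cv_demo = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)

# "Small" model sees only the first feature; "full" model sees all of them.
scores_small = cross_val_score(LinearRegression(), X_demo[:, :1], y_demo, scoring="r2", cv=cv_demo)
scores_full = cross_val_score(LinearRegression(), X_demo, y_demo, scoring="r2", cv=cv_demo)

# Paired differences: one value per split (4 splits x 3 repeats = 12).
diff = scores_full - scores_small
print(len(diff), diff.mean())
```

Looking at the paired differences rather than two separate boxplots makes small but consistent improvements easier to spot.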

# Random Forest

### The following part of the notebook works with Random Forest Regressors. Following the steps taken with the Linear Regression, we will reuse the pipeline and the transformations applied to the original dataset. This means that we will apply standardization, explore polynomial features and finally fine-tune the model parameters. 

### Thus, the first step will be testing different polynomial features. 


In [None]:
r2_degrees = []
for i in range(0,6):
    cat_index = X.select_dtypes(include = ['object']).columns
    num_index = X.select_dtypes(exclude = ['object']).columns
    trans_steps= [('cat', OneHotEncoder(drop = 'first'), cat_index), ('standarz', StandardScaler(),num_index), ('poly', PolynomialFeatures(degree = i), num_index)]
    cols_transform = ColumnTransformer(transformers = trans_steps)
    pipeline = Pipeline(steps = [('trans', cols_transform), ('model', RandomForestRegressor())])
    ttregressor = TransformedTargetRegressor(regressor = pipeline, transformer = StandardScaler())
    cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
    cvs_i = cross_val_score(estimator = ttregressor, X =X, y = y, scoring = 'r2', cv = cv, n_jobs = -1)
    r2_degrees.append(cvs_i)
sbn.boxplot(data = r2_degrees).set_title("R2 with different Polynomial degrees")

### As we can see in the previous boxplot, scores across different degrees of polynomial features are practically the same. Nevertheless, degree 2 seems to provide a slight advantage to the model. For that reason, I will use degree = 2 for the following step: RFE.

In [None]:
r2_deg2_rfe_rf = []
for i in range(1,25):
    cat_index = X.select_dtypes(include = ['object']).columns
    num_index = X.select_dtypes(exclude = ['object']).columns
    trans_steps= [('cat', OneHotEncoder(drop = 'first'), cat_index),('poly', PolynomialFeatures(degree = 2), num_index),('standarz', StandardScaler(),num_index)]
    cols_transform = ColumnTransformer(transformers = trans_steps)
    pipeline = Pipeline(steps = [('trans', cols_transform),('rfe', RFE(estimator = LinearRegression(),n_features_to_select = i)),('model', RandomForestRegressor())])
    ttregressor = TransformedTargetRegressor(regressor = pipeline, transformer = StandardScaler())
    cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 0)
    cvs_i = cross_val_score(estimator = ttregressor, X =X, y = y, scoring = 'r2', cv = cv, n_jobs = -1)
    r2_deg2_rfe_rf.append(np.mean(cvs_i))
# Use the number of selected features (1 to 24) as the x axis
plt.plot(range(1, 25), r2_deg2_rfe_rf)
plt.title("R2 with degree = 2. Different variables")

### Good! Here we see that, beyond the 11th feature, the R2 remains stable at ~83%. This means that we don't need to complicate the model more than necessary. Thus, we will use the first 11 variables for our model.

### Once we have completed both polynomial features and RFE, I will try to fine-tune the model with the random forest parameters. 

In [None]:
x_rfr_final = columns_transform(X, 2)
y_rfr_final = scaler(y)
rfe_final_rfr = RFE(estimator = RandomForestRegressor(), n_features_to_select = 11)
rfe_final_rfr.fit(x_rfr_final,y_rfr_final)

In [None]:
# These are the features kept by RFE, not model parameters
print("Selected features: ", x_rfr_final.columns[rfe_final_rfr.support_])

In [None]:
params_rfr = {'n_estimators':[50,100,150],
              'min_samples_split':[2,3,4,5,6,7,8,9,10,15,20],
              'min_samples_leaf':[2,3,4,5,6,7,8,9,10,20],
              'n_jobs':[-1]
             }
grid = GridSearchCV(estimator = RandomForestRegressor(), param_grid = params_rfr, scoring = 'r2', refit = True, verbose = 3)
grid.fit(x_rfr_final,y_rfr_final)
print('Best params: ', grid.best_params_) 

In [None]:
final_rfr_model = RandomForestRegressor(min_samples_leaf = 9, min_samples_split = 20, n_estimators = 50)
cv = RepeatedKFold(n_splits = 5, n_repeats = 5, random_state = 0)
final_rfr_score = cross_val_score(estimator = final_rfr_model, X =x_rfr_final, y = y_rfr_final, scoring = 'r2', cv = cv, n_jobs = -1)
sbn.boxplot(data = [baseline, best_linearRegressor, final_rfr_score]).set_title("Baseline Model vs Best Regression vs Best RandomForest")

# Conclusions

### After fine-tuning the last model, the Random Forest Regressor, I can say with confidence that it provides the best fit for the given dataset. That said, I would also like to point out several lessons that I have learned by conducting this analysis: 

### * Firstly, spending time on preprocessing is really worth it. We are always looking for and testing different algorithms, and that is fine. Nevertheless, we should not forget that simple transformations and calculations during preprocessing can add extra performance to the model. 
### * Secondly, comparing and testing different algorithms is part of our job, so there is no discussion here regarding whether to stick to one technique or try different ones.


### As previously stated, if we had to choose one of them, it is clear that we would choose the last one, the Random Forest Regressor, since it reaches the highest R2. 
### A preprocessing technique that I have not used is outlier elimination. It would be a good idea to add it to the pipeline and check the impact on performance. 
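One common way to flag outliers is Tukey's interquartile-range rule. The sketch below is only a possible starting point: `iqr_outlier_mask` is a hypothetical helper, the values in `toy_charges` are made up, and nothing like it was applied in this notebook.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up values: five moderate charges and one extreme one.
toy_charges = pd.Series([1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 60000.0])
mask = iqr_outlier_mask(toy_charges)

# Keep only the rows that are not flagged.
print(toy_charges[~mask].tolist())
```

Since "charges" is heavily skewed, a rule like this would likely flag many legitimate high-cost patients, so the threshold `k` (or applying the rule on a log scale) would need care.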

### A model that I have not tested is Artificial Neural Network, which is a quite powerful one. I will definitely use it in future versions and exercises. 