<h1 align="center"> Medical Cost: EDA and Regression </h1>

<img src="https://qtxasset.com/styles/breakpoint_xl_880px_w/s3/2017-04/healthcare_costs.jpg?kc3g1eGqOgDg6.ABpPzctQ1kmDcH_A0L&itok=EyhX40L5" width="30%" />

Created: 2020-09-01

Last updated: 2020-09-01

Kaggle Kernel made by 🚀 <a href="https://www.kaggle.com/rafanthx13"> Rafael Morais de Assis</a>

In Progress

**Some References**
+ https://www.kaggle.com/hely333/eda-regression
+ https://www.kaggle.com/janiobachmann/patient-charges-clustering-and-regression
+ https://www.kaggle.com/mariapushkareva/medical-insurance-cost-with-linear-regression

## Problem Description

[Kaggle Link DataSet](https://www.kaggle.com/mirichoi0218/insurance)

**Context**

This dataset contains the treatment costs of different patients in the US. Many factors influence the price of treatment; this dataset records age, sex, bmi, number of children / dependents, whether the patient smokes, region of the US and, finally, the cost of treatment.

**File Description**

`insurance.csv`: DataSet with 1,338 rows and 7 columns

## DataSet Description

| Column   | Description                                                                                                                                                                                                                          | Values                                                     |
|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|
| age      | age of primary beneficiary                                                                                                                                                                                                           | int :: [18, 64]                                            |
| sex      | insurance contractor gender, female, male                                                                                                                                                                                            | string :: ['female','male']                                |
| bmi      | Body mass index: an objective index of body weight<br>relative to height, computed as weight / height ^ 2<br>(kg / m ^ 2); ideally 18.5 to 24.9 | number :: [15.960, 53.130]                                 |
| children | Number of children covered by health insurance.<br>Number of dependents                                                                                                                                                              | number :: [0,5]                                            |
| smoker   | Smoking or Not                                                                                                                                                                                                                       | string :: ['yes','no']                                     |
| region   | the beneficiary's residential area in the US                                                                                                                                                                                         | string :: [northeast, southeast, <br>southwest, northwest] |
| charges  | Individual medical costs billed by health insurance                                                                                                                                                                                  | number :: [1,121.878 , 63,770.428]                         |
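
As a quick illustration of the `bmi` column, here is a minimal sketch of the standard formula (weight in kg divided by height in metres squared). The function name `bmi` and the example person are assumptions made for illustration only; they are not part of the dataset or of this notebook's code.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2))  # 22.86, inside the "ideal" 18.5 to 24.9 range
```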



## The Goal

Based on the features, predict the treatment cost (`charges`) for a patient.


## Table of Contents (TOC) <a id="top"></a>

+ [Import Libs and DataSet](#index01) 
+ [Snippets](#index02)
+ [EDA](#index03)
  - [Each feature individually](#index03)
  - [Each Feature with 'charges'](#index04)
  - [Analyze feature crossover](#index05)
  - [Conclusions of EDA](#index06)
+ [Pre-Processing](#index07)
+ [Correlation](#index08)
+ [Split in Train and Test](#index09)
+ [Develop Models](#index10)
  - [Cross Validation](#index11)
  - [Fit Models](#index12)
  - [Test Models](#index13)
  - [Best Models](#index14)
+ [Feature Importance](#index15)
+ [Hyperparameter Tuning Best Model](#index16)
+ [Evaluate Best Model to Regression](#index20)
+ [Conclusion](#index25)


## Import Libs and DataSet <a id='index01'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
import time
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Configs
pd.options.display.float_format = '{:,.3f}'.format
sns.set(style="whitegrid")
plt.style.use('seaborn')
seed = 42
np.random.seed(seed)

In [None]:
file_path = '/kaggle/input/insurance/insurance.csv'
df = pd.read_csv(file_path)

print("DataSet = {} rows and {} columns\n".format(
    df.shape[0], df.shape[1]))

quantitative = [f for f in df.columns if df.dtypes[f] != 'object']
qualitative  = [f for f in df.columns if df.dtypes[f] == 'object']

print("Qualitative Variables: (Strings)", "\n=>", qualitative,
      "\n\nQuantitative Variables: (Numerics)\n=>", quantitative)

df.head()

## Snippets <a id='index02'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
def eda_categ_feat_desc_plot(series_categorical, title = ""):
    """Generate 2 plots: barplot with quantity and pieplot with percentage. 
       @series_categorical: categorical series
       @title: optional
    """
    series_name = series_categorical.name
    val_counts = series_categorical.value_counts()
    val_counts.name = 'quantity'
    val_percentage = series_categorical.value_counts(normalize=True)
    val_percentage.name = "percentage"
    val_concat = pd.concat([val_counts, val_percentage], axis = 1)
    val_concat.reset_index(level=0, inplace=True)
    val_concat = val_concat.rename( columns = {'index': series_name} )
    
    fig, ax = plt.subplots(figsize = (12,4), ncols=2, nrows=1) # figsize = (width, height)
    if(title != ""):
        fig.suptitle(title, fontsize=18)
        fig.subplots_adjust(top=0.8)

    s = sns.barplot(x=series_name, y='quantity', data=val_concat, ax=ax[0])
    for index, row in val_concat.iterrows():
        s.text(row.name, row['quantity'], row['quantity'], color='black', ha="center")

    s2 = val_concat.plot.pie(y='percentage', autopct=lambda value: '{:.2f}%'.format(value),
                             labels=val_concat[series_name].tolist(), legend=None, ax=ax[1],
                             title="Percentage Plot")

    ax[1].set_ylabel('')
    ax[0].set_title('Quantity Plot')

    plt.show()

In [None]:
def eda_numerical_feat(series, title="", number_format="", with_label=True):
    f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 4), sharex=False)
    if(title != ""):
        f.suptitle(title, fontsize=18)
    sns.distplot(series, ax=ax1, rug=True)
    sns.boxplot(series, ax=ax2)
    ax1.set_title("distplot")
    ax2.set_title("boxplot")
    if(with_label):
        describe = series.describe()
        labels = { 'min': describe.loc['min'], 'max': describe.loc['max'], 
              'Q1': describe.loc['25%'], 'Q2': describe.loc['50%'],
              'Q3': describe.loc['75%']}
        if(number_format != ""):
            for k, v in labels.items():
                ax2.text(v, 0.3, k + "\n" + number_format.format(v), ha='center', va='center', fontweight='bold',
                         size=10, color='white', bbox=dict(facecolor='#445A64'))
        else:
            for k, v in labels.items():
                ax2.text(v, 0.3, k + "\n" + str(v), ha='center', va='center', fontweight='bold',
                     size=10, color='white', bbox=dict(facecolor='#445A64'))
    plt.show()

In [None]:
def plot_model_score_regression(models_name_list, model_score_list, title=''):
    fig = plt.figure(figsize=(15, 6))
    ax = sns.pointplot( x = models_name_list, y = model_score_list, 
        markers=['o'], linestyles=['-'])
    for i, score in enumerate(model_score_list):
        ax.text(i, score + 0.002, '{:.4f}'.format(score),
                horizontalalignment='left', size='large', 
                color='black', weight='semibold')
    plt.ylabel('Score', size=20, labelpad=12)
    plt.xlabel('Model', size=20, labelpad=12)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.xticks(rotation=70)
    plt.title(title, size=20)
    plt.show()

## Missing data <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

The dataset has no missing data.

In [None]:
df.isnull().sum()

## EDA

### Each feature individually <a id='index03'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
df.columns

In [None]:
eda_numerical_feat(df['charges'], "'charges' Distribution", "{:,.0f}")

In [None]:
eda_numerical_feat(df['bmi'], "'bmi' Distribution", "{:.2f}")

In [None]:
eda_numerical_feat(df['age'], "'Age' Distribution")

In [None]:
eda_categ_feat_desc_plot(df['children'], '"children" Distribution')

In [None]:
eda_categ_feat_desc_plot(df['sex'], '"Sex" Distribution')

In [None]:
eda_categ_feat_desc_plot(df['smoker'], '"smoker" Distribution')

In [None]:
eda_categ_feat_desc_plot(df['region'], '"region" Distribution')

### Each feature with 'charges' <a id='index04'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
fig, ((ax1, ax2), (ax3,ax4), (ax5, ax6)) = plt.subplots(figsize = (16,14), ncols=2, nrows=3, sharex=False, sharey=False)

# sex
sns.boxplot(x="sex", y="charges", data=df, ax=ax1)
sns.distplot(df[ df['sex'] == 'male']['charges'], ax=ax2, hist=False, label="male")
sns.distplot(df[ df['sex'] == 'female']['charges'], ax=ax2, hist=False, label="female")
# region
sns.boxplot(x="region", y="charges", data=df, ax=ax3)
sns.kdeplot(df[ df['region'] == 'southwest']['charges'], ax=ax4, label="southwest")
sns.kdeplot(df[ df['region'] == 'southeast']['charges'], ax=ax4, label="southeast")
sns.kdeplot(df[ df['region'] == 'northwest']['charges'], ax=ax4, label="northwest")
sns.kdeplot(df[ df['region'] == 'northeast']['charges'], ax=ax4, label="northeast")
# children
sns.boxplot(x="children", y="charges", data=df, ax=ax5)
sns.distplot(df[ df['children'] == 0]['charges'], ax=ax6, hist=False, label="0")
sns.distplot(df[ df['children'] == 1]['charges'], ax=ax6, hist=False, label="1")
sns.distplot(df[ df['children'] == 2]['charges'], ax=ax6, hist=False, label="2")
sns.distplot(df[ df['children'] == 3]['charges'], ax=ax6, hist=False, label="3")
sns.distplot(df[ df['children'] == 4]['charges'], ax=ax6, hist=False, label="4")
sns.distplot(df[ df['children'] == 5]['charges'], ax=ax6, hist=False, label="5")

# Config Titles
fig.suptitle('Categorical Features with "charges"', fontsize=20)
ax1.set_title('charges by sex')
ax2.set_title('charges by sex')
ax3.set_title('charges by region')
ax4.set_title('charges by region')
ax5.set_title('charges by children')
ax6.set_title('charges by children')

# Show a legend on every density plot (plt.legend() alone only affects the last axes)
for ax in (ax2, ax4, ax6):
    ax.legend()
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(figsize = (16,5), ncols=2, sharex=False, sharey=False)

font_size = 14
fig.suptitle('charge x smoke', fontsize=18)

sns.boxplot(x="smoker", y="charges", data=df, ax=ax1)
sns.distplot(df[(df.smoker == 'yes')]["charges"],color='c',ax=ax2, label='smoke')
sns.distplot(df[(df.smoker == 'no')]['charges'],color='b',ax=ax2, label='not smoke')

ax1.set_title('charges by smoke or not', fontsize=font_size)
ax2.set_title('Distribution of charges for smokers or  not', fontsize=font_size)
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(figsize = (16,5), ncols=2, sharex=False, sharey=False)

font_size = 14
fig.suptitle('charge x sex', fontsize=18)

sns.boxplot(x="sex", y="charges", data=df, ax=ax1)
sns.distplot(df[(df.sex == 'male')]["charges"],color='c',ax=ax2, hist=False, label='male')
sns.distplot(df[(df.sex == 'female')]['charges'],color='b',ax=ax2, hist=False, label='female')

ax1.set_title('charges by sex', fontsize=font_size)
ax2.set_title('Distribution of charges for male/female', fontsize=font_size)

plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(figsize = (16,11), nrows=2)

sns.scatterplot(x="age", y="charges", data=df, ax=ax1)
sns.boxplot(x="age", y="charges", data=df, palette="Set3", ax=ax2)

# config scatterplot x_axis
ax1.set_xticks(range(18,65)) # show age axis
ax1.set_xlim(17.5,64.5) # remove right/left margin

# Config Titles
fig.suptitle('charge by age', fontsize=20)
plt.show()

### charges by bmi

<img src="https://www.researchgate.net/profile/Selcuk_Nas/publication/320067348/figure/tbl2/AS:614180059090945@1523443345088/Classification-of-body-mass-according-to-body-mass-index-BMI.png" width="40%"/>

In [None]:
# Feature Engineering: create 'weight_condition' to better see the importance of bmi

# Contiguous bins: the first matching condition wins, so no bmi value is left unlabeled
bmi_values = df["bmi"]
df["weight_condition"] = np.select(
    [bmi_values < 18.5, bmi_values < 25, bmi_values < 30],
    ["Underweight", "Normal Weight", "Overweight"],
    default="Obese",
)

In [None]:
ax = sns.scatterplot(x="bmi", y="charges", hue="weight_condition", data=df)
ax.set_title("charges by bmi")
plt.show()

### Analyze feature crossover <a id='index05'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(figsize = (17,11), ncols=2, nrows=3, sharex=False, sharey=False)

sns.scatterplot(x="age", y="charges", hue="smoker", data=df, ax=ax1)
sns.scatterplot(x="age", y="charges", hue="sex", data=df, ax=ax2)
sns.scatterplot(x="age", y="charges", hue="weight_condition", data=df, ax=ax3)
sns.scatterplot(x="age", y="charges", hue="bmi", data=df, size="bmi", ax=ax4)

sns.scatterplot(x="age", y="charges", hue="children", data=df, ax=ax5)
sns.scatterplot(x="age", y="charges", hue="region", data=df, ax=ax6)

# Config Titles
fig.suptitle('charge x age with others features', fontsize=18)
ax1.set_title("charges by age and smoke")
ax2.set_title("charges by age and sex")
ax3.set_title("charges by age and weight_condition")
ax4.set_title("charges by age and bmi")
ax5.set_title("charges by age and children")
ax6.set_title("charges by age and region")
plt.show()

In [None]:
# Feature Engineering: create 'age_cat' to better see the importance of 'age'

# Ages in this dataset start at 18, so the first bin covers 18 to 30
age_values = df['age']
df['age_cat'] = np.select(
    [age_values <= 30, age_values <= 50, age_values <= 60],
    ['Young Adult', 'Adult', 'Senior'],
    default='Elder',
)
    
df.head()

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(figsize = (17,11), ncols=2, nrows=3, sharex=False, sharey=False)

sns.scatterplot(x="bmi", y="charges", hue="smoker", data=df, ax=ax1)
sns.scatterplot(x="bmi", y="charges", hue="sex", data=df, ax=ax2)

sns.scatterplot(x="bmi", y="charges", hue="weight_condition", data=df, ax=ax3)
sns.scatterplot(x="bmi", y="charges", hue="age_cat", data=df, ax=ax4)

sns.scatterplot(x="bmi", y="charges", hue="children", data=df, ax=ax5)
sns.scatterplot(x="bmi", y="charges", hue="region", data=df, ax=ax6)

# Config Titles
fig.suptitle('charge by bmi with others features', fontsize=18)
ax1.set_title("charges by bmi and smoke")
ax2.set_title("charges by bmi and sex")
ax3.set_title("charges by bmi and weight_condition")
ax4.set_title("charges by bmi and age_cat")
ax5.set_title("charges by bmi and children")
ax6.set_title("charges by bmi and region")

plt.show()

### Conclusions Of EDA <a id='index06'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

==> chart: 'charge' x 'smoke'

Smoking is a very important factor: the distribution of charges is clearly different between smokers and non-smokers. Most smokers have much higher charges than non-smokers.

==> chart: charge by age

The higher the age, the higher the charges.

==> chart: charges by bmi

The larger the bmi, the greater the tendency toward high charges, although bmi alone does not explain high costs.

==> chart: charge x age with others features

In 'charges by age and smoke' you can clearly see 3 classes:
+ class 1, lowest costs: non-smokers
+ class 2, medium costs: smokers and non-smokers
+ class 3, highest costs: smokers

Looking at 'charges by age and weight_condition', the vast majority of this third class are obese people (high bmi).

==> chart: 'charge by bmi with others features'

'charges by bmi and age_cat', together with 'charges by bmi and smoke', shows the missing piece.

Just by looking at the data, one can mentally build a decision tree on smoking, age and bmi.

1. If you smoke, you will have higher expenses than non-smokers (a good part of the population): expenses over 15,000
  - Then the BMI is analyzed: if not obese, charges fall in a group between 18,000 and 30,000; if obese, over 35,000
  - Within each of these two groups, the older you are, the more expensive it becomes

2. If you don't smoke, expenses are below 15,000
  - For non-smokers the second criterion is age: the older, the higher the expenses
  - Here bmi does not have much influence. Even so, some non-smokers of normal weight or above (Normal Weight, Overweight or Obese) can be as expensive as a smoker, mainly the obese

==>  Other features

sex, children and region have very little influence; this will also be seen in the correlations
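
The mental decision tree above can be written down as a rough sketch. This is only the visual rule of thumb from the scatter plots, not a fitted model: the function name `expected_charge_band` is hypothetical, the bmi cutoff of 30 for "obese" is the one used for `weight_condition`, and the age effect (older means more expensive within each band) is left out for brevity.

```python
def expected_charge_band(smoker: bool, bmi: float) -> str:
    """Rough charge band read off the EDA plots (a rule of thumb, not a trained model)."""
    if not smoker:
        # Most non-smokers stay below 15,000, with some exceptions seen in the plots
        return "below 15,000"
    if bmi >= 30:
        # Obese smokers form the most expensive group
        return "above 35,000"
    # Smokers who are not obese
    return "18,000 to 30,000"

print(expected_charge_band(smoker=True, bmi=32.0))   # above 35,000
print(expected_charge_band(smoker=True, bmi=24.0))   # 18,000 to 30,000
print(expected_charge_band(smoker=False, bmi=32.0))  # below 15,000
```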


## Pre-Processing <a id='index07'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
# Before
df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

df = df.drop(['weight_condition','age_cat'], axis=1)

# sex
le = LabelEncoder()
le.fit(df.sex.drop_duplicates()) 
df.sex = le.transform(df.sex)

# smoker or not
le.fit(df.smoker.drop_duplicates()) 
df.smoker = le.transform(df.smoker)

# region
le.fit(df.region.drop_duplicates()) 
df.region = le.transform(df.region)

df.head() # after pre-processing
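
`LabelEncoder` assigns integer codes to the distinct labels in sorted order, so here `female` becomes 0 and `male` 1, `no` becomes 0 and `yes` 1. The sketch below reproduces that mapping in plain Python to make the codes for `region` explicit; `label_encode` is a hypothetical stand-in written for illustration, not the scikit-learn implementation.

```python
def label_encode(values):
    """Map each distinct label to an integer, following the sorted order of the labels."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

regions = ["southwest", "southeast", "northwest", "northeast"]
codes, mapping = label_encode(regions)
print(mapping)  # {'northeast': 0, 'northwest': 1, 'southeast': 2, 'southwest': 3}
print(codes)    # [3, 2, 1, 0]
```

Note that these integers imply an order between regions that has no real meaning, which is one reason to be careful when feeding the encoded `region` to linear models.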

## Correlation <a id='index08'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>


<span style='font-size: 15pt'>Numerical correlations with heatmap</span>

In [None]:
corr_matrix = df.corr()
f, ax1 = plt.subplots(figsize=(18, 6), sharex=False)

ax1.set_title('Top Corr to {}'.format('"charges"'))
cols_top = corr_matrix.sort_values(by="charges", ascending=False)['charges'].index

cm = np.corrcoef(df[cols_top].values.T)
mask = np.zeros_like(cm)
mask[np.triu_indices_from(mask)] = True
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 14}, yticklabels=cols_top.values,
                 xticklabels=cols_top.values, mask=mask, ax=ax1)

The features most correlated with higher medical costs are smoking, age and bmi.

<span style='font-size: 15pt'>ANOVA: to categorical features</span>

In [None]:
# https://www.kaggle.com/hamelg/python-for-data-26-anova

import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

def anova_analysis(y_target, x_cat_feats, datf):
    """Run a one-way ANOVA of y_target against each categorical feature."""
    for x_feat in x_cat_feats:
        # C(...) makes statsmodels treat the label-encoded column as categorical, not numeric
        model = ols('{} ~ C({})'.format(y_target, x_feat),
                    data = datf).fit()
        anova_result = sm.stats.anova_lm(model, typ=2)
        print(anova_result,'\n')


# If PR(>F) is less than 0.05 (alpha = significance level), the categorical feature influences 'charges'
anova_analysis('charges', ['smoker', 'region'], df)

So smoking is really an important factor
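
To make the ANOVA table less of a black box, here is a minimal sketch of the one-way F statistic computed by hand: variance between group means divided by variance within groups. The two groups are made-up numbers (in thousands) chosen only to illustrate a clear separation, and `one_way_f` is a hypothetical helper. The `PR(>F)` value additionally requires the F distribution, which is not computed here.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    k = len(groups)                         # number of groups
    n = sum(len(g) for g in groups)         # total number of observations
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up charges in thousands, for illustration only
non_smokers = [8, 9, 10, 11, 12]
smokers = [30, 32, 34, 36, 38]
f_stat = one_way_f([non_smokers, smokers])
print(round(f_stat, 1))  # 230.4
```

A large F, as in this toy example, means the group means differ far more than the spread inside each group would explain.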

## Split in Train and Test <a id='index09'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# # Normal Split
# x = df.drop(['charges'], axis = 1)
# y = df['charges']
# x_train,x_test,y_train,y_test = train_test_split(x,y, random_state = 42)

####### NOTE: the polynomial transform works better than the plain split above ############

# Polynomial regression via feature transform:
#   create x^0, x^1, x^2 ... so that linear models can fit polynomial relationships

X = df.drop(['charges','region'], axis = 1)
Y = df.charges

quad = PolynomialFeatures(degree = 2)
x_quad = quad.fit_transform(X)

x_train,x_test,y_train,y_test = train_test_split(x_quad, Y, random_state = 0)
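
To see what the degree-2 transform produces, the sketch below expands one sample by hand: a bias term, the original features, then every pairwise product (including squares). `poly2_row` is a hypothetical helper written for illustration; it is meant to mirror what `PolynomialFeatures(degree = 2)` computes per row, not to replace it.

```python
from itertools import combinations_with_replacement

def poly2_row(row):
    """Degree-2 expansion of one sample: bias, linear terms, then all pairwise products."""
    features = [1.0]                                   # x^0 (bias column)
    features += [float(v) for v in row]                # x^1 terms
    features += [float(a * b)                          # x^2 terms and interactions
                 for a, b in combinations_with_replacement(row, 2)]
    return features

print(poly2_row([2, 3]))                # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
print(len(poly2_row([0, 0, 0, 0, 0])))  # 21
```

With the 5 features kept in `X` (age, sex, bmi, children, smoker), the expansion therefore has 1 + 5 + 15 = 21 columns.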

## Develop Models <a id='index10'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

<span style='font-size: 15pt'>Prepare ML Models and training</span>

In [None]:
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, RobustScaler, scale
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet, LassoCV, BayesianRidge, LassoLarsIC
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, ElasticNetCV, LinearRegression
from sklearn.kernel_ridge import KernelRidge
from mlxtend.regressor import StackingCVRegressor
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.svm import SVR

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [None]:
# Setup cross validation folds

kf = KFold(n_splits=4, random_state=42, shuffle=True)

# Define error metrics
def rmse(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

def cv_rmse(model, X=x_train):
    # the scorer returns negated MSE values, so flip the sign before taking the square root
    scores = np.sqrt(-cross_val_score(model, X, y_train, scoring="neg_mean_squared_error", cv=kf))
    return scores
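
A small worked example of the metric, in plain Python. With `scoring="neg_mean_squared_error"`, scikit-learn reports the MSE with its sign flipped so that higher is always better, which is why the value is negated again before the square root. The numbers and the name `rmse_plain` below are made up for illustration.

```python
import math

def rmse_plain(y_true, y_pred):
    """Root mean squared error without NumPy, for illustration."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up charges and predictions
y_true_demo = [10_000.0, 20_000.0, 30_000.0]
y_pred_demo = [11_000.0, 18_000.0, 30_000.0]

mse_demo = sum((t - p) ** 2 for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
neg_mse_demo = -mse_demo                      # what a "neg_mean_squared_error" scorer would report
rmse_from_neg = math.sqrt(-neg_mse_demo)      # flip the sign back, then take the root

print(round(rmse_plain(y_true_demo, y_pred_demo), 2))  # 1290.99
```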

In [None]:
# Create ML Models

# Light Gradient Boosting Regressor
lightgb_model = LGBMRegressor(objective='regression',  num_leaves=6, learning_rate=0.01,  n_estimators=7000,
                       max_bin=200,  bagging_fraction=0.8, bagging_freq=4,  bagging_seed=8,
                       feature_fraction=0.2, feature_fraction_seed=8, min_sum_hessian_in_leaf = 11,
                       verbose=-1, random_state=42)

# XGBoost Regressor
xgboost_model = XGBRegressor(learning_rate=0.01, n_estimators=6000, max_depth=4, min_child_weight=0,
                       gamma=0.6, subsample=0.7, colsample_bytree=0.7, objective='reg:squarederror',
                       nthread=-1, scale_pos_weight=1, seed=42, reg_alpha=0.00006, random_state=42)

# Linear Regressor
linear_model = LinearRegression()

# Ridge Regressor
ridge_alphas = [1e-15, 1e-10, 1e-8, 9e-4, 7e-4, 5e-4, 3e-4, 1e-4, 
                1e-3, 5e-2, 1e-2, 0.1, 0.3, 1, 3, 5, 10, 15, 18, 20, 30, 50, 75, 100]
ridge_model = make_pipeline(RobustScaler(), RidgeCV(alphas=ridge_alphas, cv=kf))

# Lasso Regressor
lasso_alphas2 = [5e-05, 0.0001, 0.0008, 0.01, 0.1, 1]
lasso_model = make_pipeline(RobustScaler(),
                      LassoCV(max_iter=int(1e7), alphas=lasso_alphas2,
                              random_state=42, cv=kf))

# Elastic Net Regressor
elastic_alphas = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007]
elastic_l1ratio = [0.8, 0.85, 0.9, 0.95, 0.99, 1]
elasticnet_model = make_pipeline(RobustScaler(),  
                           ElasticNetCV(max_iter=int(1e7), alphas=elastic_alphas,
                                        cv=kf, l1_ratio=elastic_l1ratio))

# Kernel Ridge
keridge_model = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

# Support Vector Regressor
svm_model = make_pipeline(RobustScaler(), SVR(C= 20, epsilon= 0.008, gamma=0.0003))

# Gradient Boosting Regressor
gboost_model = GradientBoostingRegressor(n_estimators=6000, learning_rate=0.01, max_depth=4, max_features='sqrt', 
                                min_samples_leaf=15, min_samples_split=10, loss='huber', random_state=42)  

# Random Forest Regressor
randomforest_model = RandomForestRegressor(n_estimators=1200, max_depth=15, min_samples_split=5, min_samples_leaf=5,
                          max_features=None, oob_score=True, random_state=42)

# Neural Net
neuralnet_model = MLPRegressor()

# Extra Tree Regressor
extratree_model = ExtraTreesRegressor()

In [None]:
# SVM and NeuralNet performed very poorly, so they are commented out

regressor_models = {
    'Linear': linear_model,
    'Ridge': ridge_model,
    'Lasso': lasso_model,
    'KernelRidge': keridge_model,
    'ElasticNet': elasticnet_model,
#     'SVM': svm_model,
    'RandomForest': randomforest_model,
    'ExtraTree': extratree_model,
#     'NeuralNet': neuralnet_model,
    'GBoost': gboost_model,
    'LightGB': lightgb_model,
    'XGBoost': xgboost_model,
}

### Cross Validation <a id='index11'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
## Cross Validation

cv_scores = {}
t_start = time.time()

for model_name, model in regressor_models.items():
    print('{:17}'.format(model_name), end='')
    t0 = time.time()
    score = cv_rmse(model)
    m, s = score.mean(), score.std()
    cv_scores[model_name] = [m,s]
    print('| RMSE in CV | mean: {:11,.3f}, | std: {:9,.3f}  | took: {:9,.3f} s |'.format(m,s, time.time() - t0))
    
print('\nTime total to CrossValidation: took {:9,.3f} s'.format(time.time() - t_start)) # 200s

# Show Sorted DataFrame
df_cv = pd.DataFrame(data = cv_scores.values(), columns=['rmse_cv', 'std_cv'], index=cv_scores.keys())
df_cv = df_cv.sort_values(by='rmse_cv').reset_index().rename({'index': 'model'}, axis=1)
df_cv

### Fit Models <a id='index12'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>


In [None]:
# Stack Model: stack some of the models above, combined by a single meta-regressor (ElasticNet)
stack_regressors = (regressor_models['Lasso'],
                    regressor_models['LightGB'],
                    regressor_models['Ridge'],
                    regressor_models['RandomForest'])

stack_model = StackingCVRegressor(regressors = stack_regressors,
                                meta_regressor = regressor_models['ElasticNet'],
                                use_features_in_secondary=True)

regressor_models['Stack'] = stack_model

In [None]:
train_scores = {}
t_start = time.time()

for model_name, model in regressor_models.items():
    print('{:17}'.format(model_name), end='')
    t0 = time.time()
    if(model_name == 'Stack'):
        model  = model.fit(np.array(x_train), np.array(y_train))
        y_pred = model.predict(np.array(x_train))
    else:
        model  = model.fit( x_train, y_train )
        y_pred = model.predict(x_train)
    r2, mse = r2_score(y_train, y_pred), mean_squared_error(y_train, y_pred)
    train_scores[model_name] = [r2, mse, np.sqrt(mse)]
    text_print = '| Train | r2: {:6,.3f}, | mse: {:15,.3f}  | took: {:9,.3f} s |'
    print(text_print.format(r2, mse, time.time() - t0))
    regressor_models[model_name] = model
    
print('\nTime total to Fit Models: took {:9,.3f} s'.format(time.time() - t_start)) # 200s

In [None]:
regressor_models.keys()

In [None]:
# Blend Model: a weighted average of the predictions of several fitted models (weights sum to 1)
class BlendModel:

    @staticmethod
    def predict(X):
        return ((0.10 * regressor_models['Lasso'].predict(X)) +
            (0.10 * regressor_models['GBoost'].predict(X)) +
            (0.15 * regressor_models['XGBoost'].predict(X)) +
            (0.10 * regressor_models['LightGB'].predict(X)) +
            (0.20 * regressor_models['RandomForest'].predict(X)) +
            (0.35 * regressor_models['Stack'].predict(np.array(X))))

regressor_models['BlendModel'] = BlendModel()
y_pred = BlendModel.predict(x_train)
r2, mse = r2_score(y_train, y_pred), mean_squared_error(y_train, y_pred)
train_scores['BlendModel'] = [r2, mse, np.sqrt(mse)]
print('RMSE score on train data to Blend Model:\n\t=>', np.sqrt(mse))
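
The blending idea can be sketched independently of the fitted models: each model's prediction is multiplied by its weight and the results are summed. The weights below are the ones used in `BlendModel`; the helper `blend` and the toy predictions are assumptions made for illustration only.

```python
import math

# Same weights as in BlendModel; they should sum to 1 so the blend stays on the scale of 'charges'
blend_weights = {
    "Lasso": 0.10, "GBoost": 0.10, "XGBoost": 0.15,
    "LightGB": 0.10, "RandomForest": 0.20, "Stack": 0.35,
}

def blend(predictions, weights):
    """Weighted average of per-model prediction lists (all lists must have the same length)."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    n_samples = len(next(iter(predictions.values())))
    return [sum(weights[name] * predictions[name][i] for name in weights)
            for i in range(n_samples)]

# Toy case: every model predicts the same values, so the blend returns them unchanged
toy_predictions = {name: [100.0, 200.0] for name in blend_weights}
blended = blend(toy_predictions, blend_weights)
```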

In [None]:
from mlens.ensemble import SuperLearner

# create a list of base-models
def get_models_to_super_leaner():
    models = list()
    models.append(regressor_models['Linear'])
    models.append(regressor_models['Ridge'])
    models.append(regressor_models['Lasso'])
    models.append(regressor_models['KernelRidge'])
    models.append(regressor_models['ElasticNet'])
    models.append(regressor_models['RandomForest'])
    models.append(regressor_models['GBoost'])
    return models

# create the super learner
def get_super_learner(X):
    ensemble = SuperLearner(scorer=rmse, folds=5, shuffle=True, sample_size=len(X))
    # add base models
    models = get_models_to_super_leaner()
    ensemble.add(models)
    # add the meta model
    ensemble.add_meta(LinearRegression())
    return ensemble

# key used in the regressor_models dict
model_name = 'SuperLeaner'

# create the super learner
ensemble = get_super_learner(x_train)
# fit the super learner
t0 = time.time()
ensemble.fit(x_train, np.array(y_train)) # took 350s = 6min
# pred and evaluate in train dataset
y_pred = ensemble.predict(x_train)
r2, mse = r2_score(y_train, y_pred), mean_squared_error(y_train, y_pred)
train_scores[model_name] = [r2, mse, np.sqrt(mse)]
# show results
text_print = '| Super Leaner in Train | r2: {:6,.3f}, | mse: {:9,.3f}  | took: {:9,.3f} s |\n'
print(text_print.format(r2, mse, time.time() - t0))
# set in dict regressors
regressor_models[model_name] = ensemble
# summarize base learners
print(ensemble.data)
# evaluate meta model

In [None]:
# Show train_scores dataframe
df_train_scores = pd.DataFrame(data = train_scores.values(),index=train_scores.keys(), columns=['r2_train', 'mse_train', 'rmse_train'])
df_train_scores = df_train_scores.sort_values(by='r2_train', ascending=False).reset_index().rename({'index': 'model'}, axis=1)
df_train_scores

## Test Models <a id='index13'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
test_scores = {}

# predict on x_test and compare against y_test
for model_name, model in regressor_models.items():
    if(model_name == 'Stack'):
        y_pred = model.predict(np.array(x_test))
    else:
        y_pred = model.predict(x_test)
    r2, mse = r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred)
    test_scores[model_name] = [r2, mse, np.sqrt(mse)]
    
# Sort DF test scores
df_test_scores = pd.DataFrame(data = test_scores.values(), columns=['r2_test', 'mse_test', 'rmse_test'], index=test_scores.keys())
df_test_scores = df_test_scores.sort_values(by='r2_test', ascending=False).reset_index().rename({'index': 'model'}, axis=1)
df_test_scores

## Best Models <a id='index14'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
# Include Blend in Train Scores
df_train_scores = pd.DataFrame(data = train_scores.values(),index=train_scores.keys(), columns=['r2_train', 'mse_train', 'rmse_train'])
df_train_scores = df_train_scores.sort_values(by='r2_train', ascending=False).reset_index().rename({'index': 'model'}, axis=1)

# df_test_scores
df_cv2 = df_cv.merge(df_train_scores, how='right'  ,left_on='model', right_on='model')
df_final_scores = df_cv2.merge(df_test_scores, how='right' ,left_on='model', right_on='model')

print(list(df_final_scores.columns))
df_final_scores.sort_values(by='mse_test')

In [None]:
plot_model_score_regression(list(test_scores.keys()), [r2 for r2, mse, rmse in test_scores.values()], 'Evaluate Models in Test: R2')

The best model on the test data was the SuperLearner (listed as `SuperLeaner`), followed by GBoost. Both scored well on the test data even though they were not among the top scorers on the training data, which suggests they did not overfit.

Unfortunately, there is no way (or at least I do not know of one) to run cross validation for the SuperLearner. Since its scores on the training and test data are similar to GBoost's, we can assume its cross-validation score would also be similar to GBoost's.
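
One way to approximate a cross-validation score for an ensemble that does not plug into scikit-learn's helpers is a manual K-fold loop that builds a fresh model for each fold. The sketch below assumes only that the model exposes `fit` and `predict`. `manual_kfold_rmse`, `LeastSquares`, and the synthetic data are hypothetical and used only for illustration. For the SuperLearner, `get_super_learner` might play the role of `make_model`, at the cost of refitting the whole ensemble once per fold; this has not been tested with mlens here.

```python
import numpy as np


class LeastSquares:
    """Hypothetical stand-in model: ordinary least squares with an intercept."""

    def fit(self, X, y):
        A = np.column_stack([np.ones(len(X)), X])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([np.ones(len(X)), X])
        return A @ self.coef_


def manual_kfold_rmse(make_model, X, y, n_folds=5, seed=0):
    """Return one RMSE per fold, fitting a fresh model on each training split."""
    X, y = np.asarray(X), np.asarray(y)
    indices = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(indices, n_folds)
    scores = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model(X[train_idx])  # fresh, unfitted model for this fold
        model.fit(X[train_idx], y[train_idx])
        errors = y[valid_idx] - model.predict(X[valid_idx])
        scores.append(float(np.sqrt(np.mean(errors ** 2))))
    return scores


# Synthetic data: y is linear in X plus a little noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_rmse = manual_kfold_rmse(lambda X_tr: LeastSquares(), X_demo, y_demo)
print("RMSE per fold:", fold_rmse)
print("Mean RMSE    :", np.mean(fold_rmse))
```

The factory receives the training split so that a builder which needs the data size, as `get_super_learner(X)` does, can be passed in directly.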

Some models overfitted the training data, such as ExtraTree, XGBoost and LightGB: they were almost perfect on the training data but did no better than the other models on the test data.
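
The overfitting check described above can be made explicit by comparing each model's train and test RMSE. The sketch below assumes score dictionaries shaped like `train_scores` and `test_scores` (`[r2, mse, rmse]` per model). The `overfit_ratio` helper, the model names, and the numbers are made up for illustration and are not results from this notebook.

```python
def overfit_ratio(train_scores, test_scores):
    """Map each model to test RMSE / train RMSE; values far above 1 suggest overfitting."""
    ratios = {}
    for name, (_, _, rmse_train) in train_scores.items():
        if name in test_scores and rmse_train > 0:
            ratios[name] = test_scores[name][2] / rmse_train
    # Largest gap first
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))


# Made-up scores: [r2, mse, rmse] per model (not results from this notebook)
toy_train = {"ModelA": [0.99, 100.0, 10.0], "ModelB": [0.85, 1600.0, 40.0]}
toy_test = {"ModelA": [0.80, 2500.0, 50.0], "ModelB": [0.84, 1764.0, 42.0]}

ratios = overfit_ratio(toy_train, toy_test)
for name, ratio in ratios.items():
    print(f"{name}: test/train RMSE = {ratio:.2f}")
```

A ratio close to 1 means the model behaves similarly on seen and unseen data; a large ratio is the pattern attributed to the overfitted models in the paragraph above.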

Some models cannot be cross-validated here (Blend, Stack, SuperLearner). However, when ordering by `mse_test`, the models that do have a cross-validation score keep the same relative position whether they are ranked by `mse_test` or by `rmse_cv`. This suggests that the test ranking is reliable, so Blend and Stack can also be considered good models.
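
A quick way to check that models keep their position is to compare the ordering induced by `mse_test` with the ordering induced by `rmse_cv`, restricted to the models that have both scores. The sketch below uses plain dictionaries; `rank_order`, the model names, and the values are hypothetical and are not results from this notebook.

```python
def rank_order(scores):
    """Return model names sorted from lowest (best) to highest error."""
    return sorted(scores, key=scores.get)


# Made-up errors, lower is better (not results from this notebook)
toy_mse_test = {"ModelA": 21.0, "ModelB": 19.5, "ModelC": 30.2}
toy_rmse_cv = {"ModelA": 4.7, "ModelB": 4.5, "ModelC": 5.6}

order_test = rank_order(toy_mse_test)
order_cv = rank_order(toy_rmse_cv)
print("By mse_test:", order_test)
print("By rmse_cv :", order_cv)
print("Same ordering:", order_test == order_cv)
```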


## Feature Importance <a id='index15'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
# Feature importances for one of the best models: GBoost

plt.figure(figsize = (12,4))
feat_importances = pd.Series(regressor_models['GBoost'].feature_importances_)#, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()

In [None]:
from yellowbrick.model_selection import FeatureImportances, RFECV

# FeatureImportances and RFECV on a reasonably good model; using GBoost (the best) takes a long time

fig, (ax3,ax4) = plt.subplots(figsize = (15,5), ncols=2, sharex=False, sharey=False)

the_model = 'Linear'
t_start = time.time()

viz3 = FeatureImportances(regressor_models[the_model], ax=ax3, relative=False)
viz3.fit(x_train, y_train)
viz3.finalize()

viz4 = RFECV(regressor_models[the_model], ax=ax4)
viz4.fit(x_train, y_train)
viz4.finalize()

print('Time total to RFECV to {} : took {:9,.3f} s'.format(the_model, time.time() - t_start))

plt.show()

### Evaluate Best Model to Regression <a id='index20'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

In [None]:
from yellowbrick.regressor import ResidualsPlot, PredictionError
from yellowbrick.model_selection import FeatureImportances, RFECV

# Can't use 'SuperLeaner' here, so use the second-best model: GBoost

fig, (ax1, ax2) = plt.subplots(figsize = (15,5), ncols=2)

viz1 = ResidualsPlot(regressor_models['GBoost'], ax=ax1)
viz1.score(x_test, y_test)
viz1.finalize()

viz2 = PredictionError(regressor_models['GBoost'], ax=ax2)
viz2.score(x_test, y_test)  
viz2.finalize()

plt.show()

## Conclusion <a id='index25'></a> <a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white; margin-left: 20px;" data-toggle="popover">Go to TOC</a>

The best model on the test data was the SuperLearner (`SuperLeaner`); its test metrics are computed below.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_log_error

y_pred = regressor_models['SuperLeaner'].predict(x_test)
print('The best Regressor Model to Test DataSet:')
# scikit-learn metrics expect (y_true, y_pred); the order matters for r2_score
print('MAE : {:14,.3f}'.format(mean_absolute_error(y_test, y_pred)))
print('MSE : {:14,.3f}'.format(mean_squared_error(y_test, y_pred)))
print('RMSE: {:14,.3f}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))
print('MSLE: {:14,.3f}'.format(mean_squared_log_error(y_test, y_pred)))
print('R2  : {:14,.3f}'.format(r2_score(y_test, y_pred)))

The best model was the SuperLearner, with the results above.
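
For reference, the metrics printed above can be written out with NumPy. The sketch below uses a hypothetical helper name and toy values. It also shows that, unlike MAE and MSE, R2 is not symmetric in its arguments, so the true values should be passed first, as in scikit-learn's `r2_score(y_true, y_pred)`.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                 # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "mae": float(np.mean(np.abs(residuals))),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values (not results from this notebook)
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.0, 2.0, 3.0, 6.0]

correct = regression_metrics(y_true_toy, y_pred_toy)
swapped = regression_metrics(y_pred_toy, y_true_toy)
print("R2 (true, pred):", correct["r2"])
print("R2 (pred, true):", swapped["r2"])
```

Swapping the arguments leaves MAE, MSE and RMSE unchanged but changes R2, because the total sum of squares is computed around the mean of whichever array is passed first.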

---

This Kernel is still under development. I would highly appreciate your feedback for improvement and, of course, if you like it, please upvote it!

