# Stacking ensembling

We are going to build a stacking ensemble. The following sources were used:
 - https://mlwave.com/kaggle-ensembling-guide/
 - http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
 - https://github.com/emanuele/kaggle_pbr/blob/master/blend.py

The notebook is organized into the following sections:
 - [Base predictors](#Base-predictors)
     - [Xgboost](#Xgboost)
     - [Random Forest](#Random-Forest)
     - [AdaBoost](#AdaBoost)
     - [DecisionTree](#DecisionTree)
 - [Metafeatures](#Metafeatures)
 - [Predictor Stacking](#Predictor-Stacking)
 - [Bayesian Optimization](#Bayesian-Optimization)

In [6]:
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from bayes_opt import BayesianOptimization
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from time import time
import xgboost as xgb
import numpy as np
import pandas as pd
import os

In [2]:
if '__file__' in locals():
    current_folder = os.path.dirname(os.path.abspath(__file__))
else:
    current_folder = os.getcwd()

set_de_entrenamiento_testing_y_prediccion = '"{}"'.format(os.path.join(
    current_folder,
    '..',
    'Set de entrenamiento, testing y predicción.ipynb'
))
merge_features = '"{}"'.format(os.path.join(current_folder, '..', 'Features', 'Merge features.ipynb'))
predicciones_csv = os.path.join(current_folder, '..', 'predictions.csv')

Load the DataFrame with the features.

In [3]:
pd.options.mode.chained_assignment = None
%run $merge_features

KeyboardInterrupt: La limpieza ya corrió en este Kernel

In [4]:
assert(df_features.shape[0] == get_clean_df()['person'].unique().shape[0])

Load the training set.

In [5]:
%run $set_de_entrenamiento_testing_y_prediccion

labels_with_features = labels.merge(df_features, how='inner', on='person')
train = labels_with_features.drop('label', axis=1)
train_target = labels_with_features['label']

### Base predictors

This section sets up the base predictors. They are the same models found in the *Algoritmos de ML* folder.

In [26]:
base_predictors = []

#### Xgboost

Note: we use XGBRegressor so that it shares the same interface as the rest of the predictors. Its hyperparameters are named differently from those of the native xgboost API, but they produce the same results. The parameter names are listed in the xgboost documentation: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.
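
As an illustration of the naming difference, the sketch below renames a few native xgboost parameters to the names the scikit-learn wrapper expects. The helper `to_sklearn_names` and the mapping table are hypothetical and cover only the aliases that appear in this notebook (`eta`, `lambda`, `alpha`); the documentation linked above has the full list.

```python
# Native xgboost names and their scikit-learn wrapper equivalents.
# Only the aliases used in this notebook are listed (assumption: any
# name missing from the table is spelled the same in both interfaces).
NATIVE_TO_SKLEARN = {
    'eta': 'learning_rate',
    'lambda': 'reg_lambda',
    'alpha': 'reg_alpha',
}


def to_sklearn_names(native_params):
    """Return a copy of native_params with keys renamed for XGBRegressor."""
    return {NATIVE_TO_SKLEARN.get(name, name): value
            for name, value in native_params.items()}


native = {'eta': 0.158, 'lambda': 1.118, 'alpha': 3.845, 'max_depth': 9}
sklearn_style = to_sklearn_names(native)
print(sklearn_style)
```

The number of boosting rounds is a separate case: the native API takes it as the `num_boost_round` argument of `xgb.train` and `xgb.cv`, while the wrapper exposes it as `n_estimators`.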

In [27]:
param = {
    'objective': 'reg:logistic',
    'colsample_bylevel': 0.6605668347627213,
    'colsample_bytree': 0.5279014819087092,
    'min_child_weight': 4.302125582335056,
    'learning_rate': 0.15803667962605694,
    'max_delta_step': 7.592652591386328,
    'n_estimators': 65,
    'reg_lambda': 1.1181195507921775,
    'max_depth': 9,
    'silent': True,
    'subsample': 0.43744176565530823,
    'reg_alpha': 3.845311207046479,
    'gamma': 6.219264874528072
}

base_predictors.append(XGBRegressor(**param))

#### Random Forest

In [28]:
param = {
    'bootstrap': True,
    'max_depth': 10,
    'max_features': 81,
    'min_samples_leaf': 49,
    'min_samples_split': 8,
    'n_estimators': 56
}

base_predictors.append(RandomForestRegressor(**param))

#### AdaBoost

In [29]:
param = {
    'n_estimators': 128,
    'loss': 'linear',
    'learning_rate': 0.07,
    'base_estimator': DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
}

base_predictors.append(AdaBoostRegressor(**param))

#### DecisionTree

In [30]:
base_predictors.append(DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))

### Metafeatures

Now we generate a prediction from each base predictor and add the predictions to a copy of the training set (*train_meta*).

In [31]:
train_meta = train.copy()
for predictor in base_predictors:
    train_meta[predictor.__class__.__name__] = np.nan

The predictions are generated with cross-validation, so that each row is predicted by a model that did not see it during training.

In [32]:
%%time
kf = KFold(n_splits=10, shuffle=False)
for train_i, validation_i in kf.split(train):
    for predictor in base_predictors:
        name = predictor.__class__.__name__
        # since warm_start=False, every call to fit resets the model
        predictor.fit(train.iloc[train_i], train_target.iloc[train_i])  # train on the other folds
        # one positional .iloc write instead of chained assignment, which can silently hit a copy
        train_meta.iloc[validation_i, train_meta.columns.get_loc(name)] = predictor.predict(train.iloc[validation_i])  # predict

CPU times: user 5min 9s, sys: 156 ms, total: 5min 9s
Wall time: 5min 9s
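
The cross-validation loop above produces out-of-fold predictions: every row's metafeature comes from a model that never saw that row. The following dependency-free sketch shows the same idea on toy data. `kfold_indices` and `MeanPredictor` are hypothetical stand-ins for `KFold(shuffle=False)` and a real base model; they are not part of the notebook.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, validation_idx) for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


class MeanPredictor:
    """Toy stand-in for a base model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


X = [[i] for i in range(6)]
y = [0, 0, 1, 1, 1, 0]

oof = [None] * len(y)  # one out-of-fold prediction per training row
for train_idx, val_idx in kfold_indices(len(y), n_splits=3):
    model = MeanPredictor().fit([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
    preds = model.predict([X[i] for i in val_idx])
    for i, p in zip(val_idx, preds):
        oof[i] = p  # filled only by a model that did not train on row i

print(oof)
```

Because every fold is used for validation exactly once, the metafeature column ends up with no missing values.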


In [33]:
train_meta.head(3)

| person | screen_resolution_height mean | screen_resolution_width mean | screen_resolution_height std | screen_resolution_width std | ad campaign hit | brand listing | checkout | conversion | generic listing | lead | search engine hit | searched products | ... | Sunday | Thursday | Tuesday | Wednesday | madrugada | maniana | noche | tarde | XGBRegressor | RandomForestRegressor | AdaBoostRegressor | DecisionTreeRegressor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0566e9c1 | 568.0 | 320.0 | 0.0 | 0.0 | 6.0 | 3.0 | 1.0 | 1.0 | 15.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0 | 1 | 0.014029 | 0.020092 | 0.109565 | 0.007599 |
| 6ec7ee77 | 640.0 | 360.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0.066373 | 0.074123 | 0.319654 | 0.117647 |
| abe7a2fb | 640.0 | 360.0 | 0.0 | 0.0 | 9.0 | 14.0 | 1.0 | 0.0 | 9.0 | 0.0 | 4.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0.016325 | 0.02639 | 0.171324 | 0.007599 |


### Predictor Stacking

Now we train a new model that uses the previous predictions (metafeatures) as features. Some of the original features can also be added.

For this we run a random search (TODO).
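
As a rough illustration of what such a random search involves, the sketch below samples hyperparameters uniformly and keeps the best-scoring configuration. Everything in it is an assumption for illustration: `random_search` and `toy_score` are hypothetical, and `toy_score` merely stands in for a cross-validated AUC. In the notebook itself, scikit-learn's `RandomizedSearchCV` (already imported) would be the natural tool.

```python
import random


def random_search(score_fn, param_space, n_iter, seed=0):
    """Sample n_iter configurations uniformly and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(n_iter):
        params = {}
        for name, (low, high) in param_space.items():
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)  # inclusive integer range
            else:
                params[name] = rng.uniform(low, high)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical objective that peaks at learning_rate=0.3, max_depth=2."""
    return (1.0
            - abs(params['learning_rate'] - 0.3)
            - 0.05 * abs(params['max_depth'] - 2))


space = {'learning_rate': (0.01, 1.0), 'max_depth': (2, 10)}
best_params, best_score = random_search(toy_score, space, n_iter=200)
print(best_params, best_score)
```

Integer bounds are sampled as integers, so discrete parameters such as `max_depth` need no cast afterwards.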

In [34]:
param = {
    'learning_rate': 0.3,
    'gamma': 1.6317896840572566,
    'max_depth': 2,
    'objective': 'reg:logistic',
    'n_estimators': 25
}

stacking = XGBRegressor(**param)
stack_train = train_meta[[predictor.__class__.__name__ for predictor in base_predictors]]

In [35]:
%%time
scores = cross_val_score(stacking, stack_train, train_target, cv=10, scoring='roc_auc')
print("ROC AUC: %0.6f (+/- %0.6f)" % (scores.mean(), scores.std() * 2))

ROC AUC: 0.878476 (+/- 0.022346)
CPU times: user 1.09 s, sys: 0 ns, total: 1.09 s
Wall time: 1.09 s


In [39]:
stacking.fit(stack_train, train_target)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=1.6317896840572566, learning_rate=0.3,
       max_delta_step=0, max_depth=2, min_child_weight=1, missing=None,
       n_estimators=25, n_jobs=1, nthread=None, objective='reg:logistic',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

### Bayesian Optimization

Hyperparameter search for the stacking model.

In [36]:
pbounds = {
    'max_depth': (2, 30),
    'eta': (0, 1),
    'gamma': (0, 20),
    'min_child_weight': (1, 8),
    'max_delta_step': (1, 8),
    'subsample': (0, 1),
    'colsample_bytree': (0, 1),
    'colsample_bylevel': (0, 1),
    'lambda': (1, 10),
    'alpha': (0, 8)
}

discrete = ['max_depth'] # discrete parameters
cv_splits = 10 # number of CV splits
num_round = 100 # maximum number of boosting rounds

In [37]:
dtrain = xgb.DMatrix(stack_train, label=train_target)
def cv_score_xgb(**param):
    param['silent'] = 1
    param['objective'] = 'reg:logistic'
    
    # cast the values that must be discrete
    for d in discrete:
        param[d] = int(param[d])
    
    # run the cross-validation
    scores = xgb.cv(
        param, dtrain, nfold=cv_splits, metrics='auc', verbose_eval=False,
        shuffle=False, stratified=False, num_boost_round=num_round,
        early_stopping_rounds=20,
    )
    return scores['test-auc-mean'].max()
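
A small sketch of the discrete cast used in `cv_score_xgb`, assuming nothing beyond plain Python. `cast_discrete` is a hypothetical helper, not part of the notebook; the sampled values are the `max_depth` and `eta` from the first row of the optimizer log in this notebook.

```python
def cast_discrete(params, discrete_names):
    """Return a copy of params with the named entries truncated to int."""
    casted = dict(params)
    for name in discrete_names:
        casted[name] = int(casted[name])  # int() truncates toward zero
    return casted


# The optimizer proposes floats for every parameter, including max_depth.
sampled = {'max_depth': 29.36, 'eta': 0.8487}
casted = cast_discrete(sampled, ['max_depth'])
print(casted)
```

Because `int()` truncates, the upper bound of 30 is reached only when the optimizer proposes exactly 30.0. Using `int(round(value))` instead would be one way to treat both ends of the range evenly.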

In [38]:
%%time
optimizer = BayesianOptimization(f=cv_score_xgb, pbounds=pbounds)
# optimizer.probe(
#     params = {'eta': 0.06027, 'gamma': 8.548, 'max_depth': 16}
# )
optimizer.maximize(
    init_points=20,
    n_iter=20,
)

|   iter    |  target   |   alpha   | colsam... | colsam... |    eta    |   gamma   |  lambda   | max_de... | max_depth | min_ch... | subsample |
-------------------------------------------------------------------------------------------------------------------------------------
|  1        |  0.8697   |  7.816    |  0.2265   |  0.8253   |  0.8487   |  0.6957   |  2.062    |  5.605    |  29.36    |  6.049    |  0.09945  |
|  2        |  0.8688   |  1.417    |  0.8445   |  0.3495   |  0.7859   |  14.48    |  8.7      |  2.882    |  14.25    |  1.283    |  0.467    |
|  3        |  0.8691   |  7.701    |  0.209    |  0.2717   |  0.9198   |  2.017    |  5.322    |  3.005    |  25.46    |  6.309    |
...
|  34       |  0.8633   |  2.977    |  0.6204   |  0.9958   |  0.006989 |  0.1107   |  9.847    |  7.891    |  2.656    |  3.393    |  0.7261   |
|  35       |  0.8776   |  7.015    |  0.171    |  0.2265   |  0.1914   |  2.437    |  1.124    |  6.635    |  18.02    |  1.729    |  0.9446   |
|  36       |  0.8766   |  7.374    |  0.1642   |  0.09267  |  0.3917   |  1.443    |  1.746    |  1.682    |  3.248    |  7.497    |  0.8913   |
|  37       |  0.8719   |  0.1461   |  0.3697   |  0.1508   |  0.8425   |  19.41    |  1.27     |  7.949    |  3.95     |  7.742    |  0.83     |
|  38       |  0.875

In [19]:
optimizer.max

{'params': {'eta': 0.05115951793330183,
  'gamma': 0.19356475793118721,
  'max_depth': 2.8721074126236488},
 'target': 0.8767669}

Write the best result to a file.

In [20]:
hyperparameter_data = {
    'algorithm': 'stacking',
    'hyperparameters': optimizer.max['params'],
    'cv_splits': cv_splits,
    'auc': optimizer.max['target'],
    'features': train.columns
}

In [21]:
%run -i write_hyperparameters.py
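
`write_hyperparameters.py` is not shown in this notebook, so its behaviour is unknown. Purely as an assumption, a script of this kind might append each result as one JSON line; the sketch below does that with a hypothetical `write_hyperparameters` function and a temporary file. A pandas `Index` such as `train.columns` is not JSON-serializable, so the sketch uses a plain list in its place.

```python
import json
import os
import tempfile


def write_hyperparameters(data, path):
    """Append one JSON record per line so earlier results are preserved."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(data, sort_keys=True) + '\n')


record = {
    'algorithm': 'stacking',
    'hyperparameters': {
        'eta': 0.05115951793330183,
        'gamma': 0.19356475793118721,
        'max_depth': 2.8721074126236488,
    },
    'cv_splits': 10,
    'auc': 0.8767669,
    # placeholder list; the notebook would pass list(train.columns)
    'features': ['XGBRegressor', 'RandomForestRegressor'],
}

path = os.path.join(tempfile.mkdtemp(), 'hyperparameters.jsonl')
write_hyperparameters(record, path)

# Read the record back to confirm it round-trips unchanged.
with open(path, encoding='utf-8') as f:
    restored = json.loads(f.readline())
print(restored == record)
```

Note that `optimizer.max['params']` holds the raw floats proposed by the optimizer, so a discrete parameter such as `max_depth` still needs the `int` cast when the stored values are reused.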

### Predicting the testing set with the base models

We refit the base models on 100% of the training set and use them to predict the testing set.

In [40]:
testing = labels_to_predict.merge(df_features, how='inner', on='person')
assert(testing.shape[0] == labels_to_predict.shape[0])

In [41]:
testing_with_base_predictions = testing.copy()
for predictor in base_predictors:
    testing_with_base_predictions[predictor.__class__.__name__] = np.nan

In [42]:
%%time
for predictor in base_predictors:
    # since warm_start=False, every call to fit resets the model
    predictor.fit(train, train_target)  # train on the full training set
    testing_with_base_predictions[predictor.__class__.__name__] = predictor.predict(testing)  # predict

CPU times: user 35.5 s, sys: 72 ms, total: 35.5 s
Wall time: 35.5 s


### Predicting the testing set with the stacking model

We use the base predictions as features so that the stacking model combines them into a final prediction.

In [43]:
testing_with_base_predictions_for_stacking = testing_with_base_predictions[[predictor.__class__.__name__ for predictor in base_predictors]]
testing_with_base_predictions_for_stacking.head(3)

| person | XGBRegressor | RandomForestRegressor | AdaBoostRegressor | DecisionTreeRegressor |
| --- | --- | --- | --- | --- |
| 4886f805 | 0.004135 | 0.00178 | 0.036643 | 0.007434 |
| 0297fc1e | 0.025531 | 0.058865 | 0.13909 | 0.03904 |
| 2d681dd8 | 0.012194 | 0.002035 | 0.037569 | 0.007434 |


In [44]:
predictions = stacking.predict(testing_with_base_predictions_for_stacking)

In [45]:
# reuse the predictions computed in the previous cell instead of predicting again
testing_target = pd.DataFrame(
    {'label': predictions},
    index=testing_with_base_predictions_for_stacking.index,
)
testing_target.index.name = 'person'

In [46]:
testing_target.to_csv(predicciones_csv)