## 3. Cluster Model Tuning

### Objective

Select the best hyperparameters for each algorithm using scikit-learn's GridSearchCV.

### Notebook overview

    1. Load the base data
    2. Define the parameters to include in each algorithm's grid
    3. Determine the best parameters for each algorithm in each cluster

In [1]:
from pandas import MultiIndex, Int16Dtype
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
sns.set_style('darkgrid')

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import joblib

from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer, RobustScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, PowerTransformer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.compose import ColumnTransformer


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, SGDRegressor, LinearRegression, Lasso
from sklearn.svm import SVR
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn.kernel_ridge import KernelRidge
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.model_selection import KFold, ShuffleSplit, LeaveOneOut, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict

# Warnings configuration
# ==============================================================================
import warnings
# warnings.filterwarnings('ignore')

In [2]:
%run "../7. Prediccion/Funciones_Prepara_Prediccion.ipynb"

### Data preparation

In [6]:

# bicimad_def = _dataBaseOriginal("../../Data/DataFrame_Final_Cierre_Cluster.csv")
bicimad = _dataBaseOriginal("../../Data/DataFrame_Final_Cierre_Cluster_2017_2019.csv")
bicimad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176477 entries, 0 to 176476
Data columns (total 29 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ESTACION                  176477 non-null  int64  
 1   DEMANDA                   176477 non-null  float64
 2   MES_sen                   176477 non-null  float64
 3   MES_cos                   176477 non-null  float64
 4   TEMP_MAX                  176477 non-null  float64
 5   TEMP_MIN                  176477 non-null  float64
 6   HUMEDAD                   176477 non-null  float64
 7   VIENTO                    176477 non-null  float64
 8   PRESION                   176477 non-null  float64
 9   PRECIPITACION_1h          176477 non-null  float64
 10  PRECIPITACION_3h          176477 non-null  float64
 11  Es_Festivo_1              176477 non-null  uint8  
 12  Es_FinSemana_1            176477 non-null  uint8  
 13  TEMPORADA_OTONO           176477 non-null  u

In [4]:
datos = pd.read_csv("../../Data/DataFrame_Final_Cierre_Cluster_2017_2019.csv", parse_dates=['FECHA'])

cluster0_estaciones = datos[datos['CLUSTER_soloDemanda']==0].ESTACION.unique()
cluster1_estaciones = datos[datos['CLUSTER_soloDemanda']==1].ESTACION.unique()
cluster2_estaciones = datos[datos['CLUSTER_soloDemanda']==2].ESTACION.unique()
cluster3_estaciones = datos[datos['CLUSTER_soloDemanda']==3].ESTACION.unique()
cluster4_estaciones = datos[datos['CLUSTER_soloDemanda']==4].ESTACION.unique()
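
The five near-identical lines above can also be written as a single `groupby`. This is a minimal sketch on a toy frame, assuming the same `ESTACION` and `CLUSTER_soloDemanda` column names; the toy values are illustrative and not taken from the real data.

```python
import pandas as pd

# Toy stand-in for `datos`; only the two columns used here are included.
datos = pd.DataFrame({
    "ESTACION": [9, 3, 9, 15, 3, 43],
    "CLUSTER_soloDemanda": [0, 1, 0, 2, 1, 0],
})

# One array of unique station ids per cluster label, in order of appearance.
stations_by_cluster = datos.groupby("CLUSTER_soloDemanda")["ESTACION"].unique()

# Same shape as the notebook's list of per-cluster arrays, ordered by label.
clusterTotal = [stations_by_cluster.loc[k] for k in sorted(stations_by_cluster.index)]
print(clusterTotal)
```

This form keeps working if the number of clusters changes, since no cluster label is hard-coded.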

The algorithms used for each model are defined below.

In [7]:
seed = 99

models = list()
models.append(('RFR', RandomForestRegressor(random_state=seed)))
models.append(('GBR', GradientBoostingRegressor(random_state=seed)))
models.append(('LGBMR', LGBMRegressor(random_state=seed)))
models.append(('XGBR', XGBRegressor(random_state=seed)))

The parameter grids passed to GridSearchCV for each algorithm are defined below.

In [9]:
from sklearn.model_selection import GridSearchCV

param_grid_RF = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3,5,7],
    'min_samples_leaf': [1,2,3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100,500,1000]
}

param_grid_GB = {
    'learning_rate': [0.01,0.02,0.04, 0.1],
    'subsample'    : [0.9, 0.5, 0.2, 0.1],
    'n_estimators' : [100,500,1000, 1500],
    'max_depth'    : [80, 90, 100,110]
}

param_grid_LGBM = {
    'learning_rate': [0.005, 0.01, 0.1],
    'n_estimators': [10,100,500, 1000],
    'num_leaves': [6,8,12,16],
    'max_depth':[25, 50, 100, 500]
}

param_grid_XGB = {
    'learning_rate': [0.001, 0.01, 0.1],
    'n_estimators':[100,700,1000],
    'max_depth': [3,5,8],
    'min_child_weight': [1,2,3,4],
    'gamma':[0, 0.1, 0.2],
    'reg_alpha': [0.1,1,200],#L1
    'reg_lambda':[1,200,500],#L2
}
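
As a sanity check on the cost of each search, the number of candidates an exhaustive grid produces is the product of the lengths of its value lists, and the number of fits is that count times the number of folds. The sketch below recomputes both with the standard library only. The grids are copied from the cell above, the fold counts are the `cv` values used later in this notebook, and `n_candidates` is a hypothetical helper name.

```python
from math import prod

param_grid_RF = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3, 5, 7],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 500, 1000],
}
param_grid_GB = {
    'learning_rate': [0.01, 0.02, 0.04, 0.1],
    'subsample': [0.9, 0.5, 0.2, 0.1],
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [80, 90, 100, 110],
}
param_grid_LGBM = {
    'learning_rate': [0.005, 0.01, 0.1],
    'n_estimators': [10, 100, 500, 1000],
    'num_leaves': [6, 8, 12, 16],
    'max_depth': [25, 50, 100, 500],
}
param_grid_XGB = {
    'learning_rate': [0.001, 0.01, 0.1],
    'n_estimators': [100, 700, 1000],
    'max_depth': [3, 5, 8],
    'min_child_weight': [1, 2, 3, 4],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0.1, 1, 200],
    'reg_lambda': [1, 200, 500],
}


def n_candidates(grid):
    """Number of parameter combinations in an exhaustive grid."""
    return prod(len(values) for values in grid.values())


# (grid, number of CV folds used for that algorithm in this notebook)
searches = {
    'RF': (param_grid_RF, 2),
    'GB': (param_grid_GB, 3),
    'LGBM': (param_grid_LGBM, 3),
    'XGB': (param_grid_XGB, 3),
}
fits = {name: n_candidates(grid) * cv for name, (grid, cv) in searches.items()}
for name, (grid, cv) in searches.items():
    print(f'{name}: {n_candidates(grid)} candidates, {fits[name]} fits per cluster')
```

These counts apply per cluster, so the full run repeats each search five times.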

In [10]:
clusterTotal = [cluster0_estaciones,
                cluster1_estaciones,
                cluster2_estaciones,
                cluster3_estaciones,
                cluster4_estaciones]
clusterTotal

[array([  9,  43,  57,  58,  64,  90, 129, 135, 160, 163, 168, 149, 175],
       dtype=int64),
 array([  3,   8,  17,  30,  42,  48,  53,  55,  71,  75,  76,  86,  95,
        103, 113, 115, 118, 131, 136, 145, 157,  65,  78, 155,  91, 156,
         99, 139], dtype=int64),
 array([ 15,  23,  24,  29,  35,  37,  40,  97,  98, 100, 107, 109, 111,
        120, 122, 147, 158, 159, 174,  11,  28,  32,  36,  88, 117, 119,
        127, 137, 140, 143, 144, 152, 165,  34,  39,  60, 104, 112, 138,
        141, 151,  85, 173,  61, 150,  72, 101,  87, 105, 146], dtype=int64),
 array([  1,   6,  13,  19,  26,  38,  41,  45,  46,  49,  52,  59,  62,
         74,  84, 108, 114, 128, 132, 161, 164, 166, 169, 170,  31,  56,
         83, 133, 162,  79], dtype=int64),
 array([  2,   4,   5,   7,  10,  12,  16,  18,  21,  25,  27,  33,  44,
         47,  50,  51,  54,  63,  66,  67,  77,  81,  93, 102, 110, 116,
        121, 123, 124, 125, 126, 134, 148, 153, 171,  20,  73,  80,  92,
        130, 142, 154

## Random Forest model tuning

Each model is tuned as follows:

 - Select the data for each cluster
 - Create a base model for each algorithm
 - Tune each model with GridSearchCV (`grid_search`)
 - Take the best configuration from `grid_search.best_params_`
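
The steps above are repeated almost verbatim for each algorithm, so they could be wrapped in one function. The following is a minimal sketch, not the notebook's actual code: `tune_per_cluster` and `make_estimator` are hypothetical names, and the toy frame only mimics the `ESTACION` and `DEMANDA` columns plus two features. With no `scoring` argument, GridSearchCV ranks candidates by the estimator's own `score` method, which is R² for these regressors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_per_cluster(df, clusters, make_estimator, param_grid, cv=3, seed=99):
    """Return one best-parameter dict per cluster (hypothetical helper)."""
    best = []
    for stations in clusters:
        # 1. Keep only the rows of the stations in this cluster.
        subset = df[df['ESTACION'].isin(stations)]
        train, _ = train_test_split(subset, test_size=0.30, random_state=seed)
        X_train = train.drop(['DEMANDA', 'ESTACION'], axis=1)
        y_train = train['DEMANDA']
        # 2-3. Build a fresh base model and run the exhaustive search.
        search = GridSearchCV(make_estimator(), param_grid, cv=cv, n_jobs=1)
        search.fit(X_train, y_train)
        # 4. Keep the best combination found for this cluster.
        best.append(search.best_params_)
    return best


# Toy data: 4 stations, 30 rows each, demand driven by temperature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'ESTACION': np.repeat([1, 2, 3, 4], 30),
    'TEMP_MAX': rng.normal(20, 5, 120),
    'HUMEDAD': rng.uniform(20, 90, 120),
})
toy['DEMANDA'] = 2.0 * toy['TEMP_MAX'] + rng.normal(0, 1, 120)

small_grid = {'n_estimators': [10, 20], 'max_depth': [2, 4]}
result = tune_per_cluster(
    toy,
    clusters=[[1, 2], [3, 4]],
    make_estimator=lambda: RandomForestRegressor(random_state=99),
    param_grid=small_grid,
    cv=2,
)
print(result)
```

Passing a factory rather than an estimator instance keeps each cluster's search independent of the previous one.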

In [11]:
paramRF = []
iPos = 0
for cluster in clusterTotal:
    
    bicimad_est = bicimad[bicimad['ESTACION'].isin(cluster)]
    
    train, test = train_test_split(bicimad_est, test_size = 0.30, random_state = seed)    
    X_train = train.drop(['DEMANDA','ESTACION'], axis=1)
    X_test = test.drop(['DEMANDA','ESTACION'], axis=1)
    y_train = train['DEMANDA']
    y_test = test['DEMANDA']
    
    print(f'Cluster: {iPos}')
    print('Estaciones: '+str(len(cluster)))
    print('Columnas: '+str(len(bicimad_est.columns)))
    iPos=iPos+1
    
    # Create a base model
    rf = RandomForestRegressor(random_state=seed)
    # Instantiate the grid search model
    grid_search = GridSearchCV(estimator = rf, param_grid = param_grid_RF, cv = 2, n_jobs = -1, verbose = 2)
    grid_search.fit(X_train, y_train)
    bestParams = grid_search.best_params_
    paramRF.append(bestParams)
    print(bestParams)
    print()

Cluster: 0
Estaciones: 13
Columnas: 29
Fitting 2 folds for each of 720 candidates, totalling 1440 fits
{'bootstrap': True, 'max_depth': 80, 'max_features': 7, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 1000}

Cluster: 1
Estaciones: 28
Columnas: 29
Fitting 2 folds for each of 720 candidates, totalling 1440 fits
{'bootstrap': True, 'max_depth': 80, 'max_features': 7, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 1000}

Cluster: 2
Estaciones: 50
Columnas: 29
Fitting 2 folds for each of 720 candidates, totalling 1440 fits
{'bootstrap': True, 'max_depth': 80, 'max_features': 7, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 1000}

Cluster: 3
Estaciones: 30
Columnas: 29
Fitting 2 folds for each of 720 candidates, totalling 1440 fits
{'bootstrap': True, 'max_depth': 80, 'max_features': 7, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 1000}

Cluster: 4
Estaciones: 51
Columnas: 29
Fitting 2 folds for each of 720 candidates, t

In [12]:
paramRF

[{'bootstrap': True,
  'max_depth': 80,
  'max_features': 7,
  'min_samples_leaf': 1,
  'min_samples_split': 8,
  'n_estimators': 1000},
 {'bootstrap': True,
  'max_depth': 80,
  'max_features': 7,
  'min_samples_leaf': 1,
  'min_samples_split': 8,
  'n_estimators': 1000},
 {'bootstrap': True,
  'max_depth': 80,
  'max_features': 7,
  'min_samples_leaf': 1,
  'min_samples_split': 8,
  'n_estimators': 1000},
 {'bootstrap': True,
  'max_depth': 80,
  'max_features': 7,
  'min_samples_leaf': 1,
  'min_samples_split': 8,
  'n_estimators': 1000},
 {'bootstrap': True,
  'max_depth': 80,
  'max_features': 7,
  'min_samples_leaf': 1,
  'min_samples_split': 8,
  'n_estimators': 1000}]
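
All five clusters select the same combination, and every selected value is the smallest or largest entry of its list. A best value on the edge of a grid can mean that the optimum lies outside the range searched, so it may be worth extending the grid in that direction. The sketch below flags such parameters using only the standard library; `params_on_grid_edge` is a hypothetical helper, and the grid and result are copied from the cells above.

```python
param_grid_RF = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3, 5, 7],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 500, 1000],
}

# Best parameters reported for every cluster in the output above.
best_rf = {'bootstrap': True, 'max_depth': 80, 'max_features': 7,
           'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 1000}


def params_on_grid_edge(best, grid):
    """Map each multi-valued parameter whose best value is the grid minimum
    or maximum to 'min' or 'max'. Single-valued parameters are skipped."""
    edges = {}
    for name, values in grid.items():
        if len(values) < 2:
            continue
        if best[name] == min(values):
            edges[name] = 'min'
        elif best[name] == max(values):
            edges[name] = 'max'
    return edges


edges = params_on_grid_edge(best_rf, param_grid_RF)
print(edges)
```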

## Gradient Boosting model tuning

In [13]:
paramGB=[]
iPos = 0
for cluster in clusterTotal:
    
    bicimad_est = bicimad[bicimad['ESTACION'].isin(cluster)]
    
    train, test = train_test_split(bicimad_est, test_size = 0.30, random_state = seed)    
    X_train = train.drop(['DEMANDA','ESTACION'], axis=1)
    X_test = test.drop(['DEMANDA','ESTACION'], axis=1)
    y_train = train['DEMANDA']
    y_test = test['DEMANDA']
    
    print(f'Cluster: {iPos}')
    print('Estaciones: '+str(len(cluster)))
    print('Columnas: '+str(len(bicimad_est.columns)))
    iPos=iPos+1
    
    # Create a base model
    GB = GradientBoostingRegressor(random_state=seed)
    # Instantiate the grid search model
    grid_search = GridSearchCV(estimator = GB, param_grid = param_grid_GB, cv = 3, n_jobs = -1, verbose = 2)
    grid_search.fit(X_train, y_train)
    bestParams = grid_search.best_params_
    paramGB.append(bestParams)
    print(bestParams)
    print()    

Cluster: 0
Estaciones: 13
Columnas: 29
Fitting 3 folds for each of 256 candidates, totalling 768 fits
{'learning_rate': 0.01, 'max_depth': 80, 'n_estimators': 500, 'subsample': 0.2}

Cluster: 1
Estaciones: 28
Columnas: 29
Fitting 3 folds for each of 256 candidates, totalling 768 fits
{'learning_rate': 0.01, 'max_depth': 80, 'n_estimators': 500, 'subsample': 0.1}

Cluster: 2
Estaciones: 50
Columnas: 29
Fitting 3 folds for each of 256 candidates, totalling 768 fits


9 fits failed out of a total of 768.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 586, in fit
    n_stages = self._fit_stages(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 663, in _fit_stages
    raw_predictions = self._fit_stage(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 246, in _fit_stage
    tree.fit(X, residual, sample_weight=sample_w

{'learning_rate': 0.01, 'max_depth': 80, 'n_estimators': 500, 'subsample': 0.1}

Cluster: 3
Estaciones: 30
Columnas: 29
Fitting 3 folds for each of 256 candidates, totalling 768 fits
{'learning_rate': 0.01, 'max_depth': 80, 'n_estimators': 500, 'subsample': 0.1}

Cluster: 4
Estaciones: 51
Columnas: 29
Fitting 3 folds for each of 256 candidates, totalling 768 fits


14 fits failed out of a total of 768.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
8 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 586, in fit
    n_stages = self._fit_stages(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 663, in _fit_stages
    raw_predictions = self._fit_stage(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 246, in _fit_stage
    tree.fit(X, residual, sample_weight=sample_

{'learning_rate': 0.01, 'max_depth': 80, 'n_estimators': 500, 'subsample': 0.1}



In [14]:
paramGB

[{'learning_rate': 0.01,
  'max_depth': 80,
  'n_estimators': 500,
  'subsample': 0.2},
 {'learning_rate': 0.01,
  'max_depth': 80,
  'n_estimators': 500,
  'subsample': 0.1},
 {'learning_rate': 0.01,
  'max_depth': 80,
  'n_estimators': 500,
  'subsample': 0.1},
 {'learning_rate': 0.01,
  'max_depth': 80,
  'n_estimators': 500,
  'subsample': 0.1},
 {'learning_rate': 0.01,
  'max_depth': 80,
  'n_estimators': 500,
  'subsample': 0.1}]
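
The warnings for clusters 2 and 4 report that some fits failed and were scored as `nan`. The tracebacks above are truncated, so the actual cause is not visible here. The sketch below only illustrates the two ways of inspecting such failures that the warning itself points to: listing the `nan` candidates in `cv_results_`, and re-running with `error_score='raise'`. It uses synthetic data and a deliberately invalid `subsample=0.0` as a stand-in for whatever failed in the real run.

```python
import warnings

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=60, n_features=4, noise=0.1, random_state=99)

# subsample=0.0 is outside the valid range (0, 1], so those fits always fail.
grid = {'subsample': [0.5, 0.0], 'n_estimators': [5]}

# Default behaviour: failed fits are scored as nan and the search continues.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    search = GridSearchCV(GradientBoostingRegressor(random_state=99), grid, cv=2)
    search.fit(X, y)

failed = [
    params
    for params, score in zip(search.cv_results_['params'],
                             search.cv_results_['mean_test_score'])
    if np.isnan(score)
]
print('Failed candidates:', failed)

# error_score='raise' stops at the first failure and surfaces the exception.
raised = False
try:
    GridSearchCV(GradientBoostingRegressor(random_state=99), grid, cv=2,
                 error_score='raise').fit(X, y)
except ValueError as exc:
    raised = True
    print(type(exc).__name__)
```

Listing the failed candidates first shows whether the failures cluster around one parameter value before committing to a full re-run.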

## LightGBM model tuning

In [15]:
paramLBGM = []
iPos = 0
for cluster in clusterTotal:
    
    bicimad_est = bicimad[bicimad['ESTACION'].isin(cluster)]
    
    train, test = train_test_split(bicimad_est, test_size = 0.30, random_state = seed)    
    X_train = train.drop(['DEMANDA','ESTACION'], axis=1)
    X_test = test.drop(['DEMANDA','ESTACION'], axis=1)
    y_train = train['DEMANDA']
    y_test = test['DEMANDA']
    
    print(f'Cluster: {iPos}')
    print('Estaciones: '+str(len(cluster)))
    print('Columnas: '+str(len(bicimad_est.columns)))
    iPos=iPos+1
    
    # Create a base model
    LGBM = LGBMRegressor(random_state=seed)
    # Instantiate the grid search model
    grid_search = GridSearchCV(estimator = LGBM, param_grid = param_grid_LGBM, cv = 3, n_jobs = -1, verbose = 2)
    grid_search.fit(X_train, y_train)
    bestParams = grid_search.best_params_
    paramLBGM.append(bestParams)
    print(bestParams)
    print()

Cluster: 0
Estaciones: 13
Columnas: 29
Fitting 3 folds for each of 192 candidates, totalling 576 fits
{'learning_rate': 0.1, 'max_depth': 25, 'n_estimators': 500, 'num_leaves': 16}

Cluster: 1
Estaciones: 28
Columnas: 29
Fitting 3 folds for each of 192 candidates, totalling 576 fits
{'learning_rate': 0.1, 'max_depth': 25, 'n_estimators': 1000, 'num_leaves': 16}

Cluster: 2
Estaciones: 50
Columnas: 29
Fitting 3 folds for each of 192 candidates, totalling 576 fits
{'learning_rate': 0.1, 'max_depth': 25, 'n_estimators': 1000, 'num_leaves': 16}

Cluster: 3
Estaciones: 30
Columnas: 29
Fitting 3 folds for each of 192 candidates, totalling 576 fits
{'learning_rate': 0.1, 'max_depth': 25, 'n_estimators': 1000, 'num_leaves': 16}

Cluster: 4
Estaciones: 51
Columnas: 29
Fitting 3 folds for each of 192 candidates, totalling 576 fits
{'learning_rate': 0.1, 'max_depth': 25, 'n_estimators': 1000, 'num_leaves': 16}



In [16]:
paramLBGM

[{'learning_rate': 0.1,
  'max_depth': 25,
  'n_estimators': 500,
  'num_leaves': 16},
 {'learning_rate': 0.1,
  'max_depth': 25,
  'n_estimators': 1000,
  'num_leaves': 16},
 {'learning_rate': 0.1,
  'max_depth': 25,
  'n_estimators': 1000,
  'num_leaves': 16},
 {'learning_rate': 0.1,
  'max_depth': 25,
  'n_estimators': 1000,
  'num_leaves': 16},
 {'learning_rate': 0.1,
  'max_depth': 25,
  'n_estimators': 1000,
  'num_leaves': 16}]
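
One observation about this grid: a binary tree with `n` leaves can be at most `n - 1` levels deep, so with `num_leaves` capped at 16 no tree can exceed depth 15. Every `max_depth` value in the grid is larger than that, so the depth limit should never be reached and the four `max_depth` settings are expected to behave the same. That would be consistent with every cluster reporting the first value, 25, although this is an inference and was not checked against the scores. A small standard-library sketch of the arithmetic, with the grid copied from above and `max_possible_depth` as a hypothetical helper:

```python
param_grid_LGBM = {
    'learning_rate': [0.005, 0.01, 0.1],
    'n_estimators': [10, 100, 500, 1000],
    'num_leaves': [6, 8, 12, 16],
    'max_depth': [25, 50, 100, 500],
}


def max_possible_depth(num_leaves):
    """Deepest a binary tree with `num_leaves` leaves can be (a single chain)."""
    return num_leaves - 1


deepest = max(max_possible_depth(n) for n in param_grid_LGBM['num_leaves'])

# max_depth values small enough to actually constrain some tree in the grid.
binding = [d for d in param_grid_LGBM['max_depth'] if d < deepest]

print('Deepest possible tree:', deepest)
print('max_depth values that can bind:', binding)
```

If the depth limit is meant to matter, values at or below the deepest possible tree would be needed in the grid.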

## XGBoost model tuning

In [17]:
paramXGB=[]
iPos = 0
for cluster in clusterTotal:
    
    bicimad_est = bicimad[bicimad['ESTACION'].isin(cluster)]
    
    train, test = train_test_split(bicimad_est, test_size = 0.30, random_state = seed)    
    X_train = train.drop(['DEMANDA','ESTACION'], axis=1)
    X_test = test.drop(['DEMANDA','ESTACION'], axis=1)
    y_train = train['DEMANDA']
    y_test = test['DEMANDA']
    
    print(f'Cluster: {iPos}')
    print('Estaciones: '+str(len(cluster)))
    print('Columnas: '+str(len(bicimad_est.columns)))
    iPos=iPos+1
    
    # Create a base model
    XGB = XGBRegressor(random_state=seed)
    # Instantiate the grid search model
    grid_search = GridSearchCV(estimator = XGB, param_grid = param_grid_XGB, cv = 3, n_jobs = -1, verbose = 2)
    grid_search.fit(X_train, y_train)
    bestParams = grid_search.best_params_
    paramXGB.append(bestParams)
    print(bestParams)
    print()

Cluster: 0
Estaciones: 13
Columnas: 29
Fitting 3 folds for each of 2916 candidates, totalling 8748 fits
{'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 8, 'min_child_weight': 4, 'n_estimators': 700, 'reg_alpha': 0.1, 'reg_lambda': 1}

Cluster: 1
Estaciones: 28
Columnas: 29
Fitting 3 folds for each of 2916 candidates, totalling 8748 fits
{'gamma': 0, 'learning_rate': 0.1, 'max_depth': 8, 'min_child_weight': 4, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 200}

Cluster: 2
Estaciones: 50
Columnas: 29
Fitting 3 folds for each of 2916 candidates, totalling 8748 fits
{'gamma': 0, 'learning_rate': 0.1, 'max_depth': 8, 'min_child_weight': 3, 'n_estimators': 1000, 'reg_alpha': 0.1, 'reg_lambda': 200}

Cluster: 3
Estaciones: 30
Columnas: 29
Fitting 3 folds for each of 2916 candidates, totalling 8748 fits
{'gamma': 0, 'learning_rate': 0.1, 'max_depth': 8, 'min_child_weight': 3, 'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 200}

Cluster: 4
Estaciones: 51
Columnas: 29
Fitting 3 fold

In [18]:
paramXGB

[{'gamma': 0.1,
  'learning_rate': 0.1,
  'max_depth': 8,
  'min_child_weight': 4,
  'n_estimators': 700,
  'reg_alpha': 0.1,
  'reg_lambda': 1},
 {'gamma': 0,
  'learning_rate': 0.1,
  'max_depth': 8,
  'min_child_weight': 4,
  'n_estimators': 1000,
  'reg_alpha': 1,
  'reg_lambda': 200},
 {'gamma': 0,
  'learning_rate': 0.1,
  'max_depth': 8,
  'min_child_weight': 3,
  'n_estimators': 1000,
  'reg_alpha': 0.1,
  'reg_lambda': 200},
 {'gamma': 0,
  'learning_rate': 0.1,
  'max_depth': 8,
  'min_child_weight': 3,
  'n_estimators': 1000,
  'reg_alpha': 1,
  'reg_lambda': 200},
 {'gamma': 0,
  'learning_rate': 0.1,
  'max_depth': 8,
  'min_child_weight': 1,
  'n_estimators': 1000,
  'reg_alpha': 1,
  'reg_lambda': 200}]
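
Since each search is expensive, it can be useful to store the selected parameters so that later notebooks do not have to repeat it. The notebook does not show how the results are persisted. The sketch below is one possible approach using JSON from the standard library; the file name, the `XGBR` key, and the one-list-per-algorithm layout are assumptions. Only the first two entries of `paramXGB` are copied to keep the example short.

```python
import json
import tempfile
from pathlib import Path

# First two entries of paramXGB, copied from the output above.
param_xgb = [
    {'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 8, 'min_child_weight': 4,
     'n_estimators': 700, 'reg_alpha': 0.1, 'reg_lambda': 1},
    {'gamma': 0, 'learning_rate': 0.1, 'max_depth': 8, 'min_child_weight': 4,
     'n_estimators': 1000, 'reg_alpha': 1, 'reg_lambda': 200},
]

# Hypothetical layout: one key per algorithm, one list entry per cluster.
best_params = {'XGBR': param_xgb}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'best_params.json'  # hypothetical file name
    path.write_text(json.dumps(best_params, indent=2))
    reloaded = json.loads(path.read_text())

# Each reloaded dict can be unpacked into an estimator,
# for example XGBRegressor(**reloaded['XGBR'][0]).
print(reloaded == best_params)
```

JSON keeps the file readable and diffable, which suits plain dictionaries of numbers and booleans like these.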