# GridSearch & Pipelines
Grid search is a hyperparameter-tuning tool. We define a grid of candidate values for each hyperparameter, every combination is evaluated with cross-validation, and the combination with the best mean score is selected for our data.

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

## Method 1
Iterate a single algorithm over a set of hyperparameters.

In [1]:
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

In [5]:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()

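# Note: 'degree' only affects the 'poly' kernel (not searched here) and 'coef0' only
# affects 'poly' and 'sigmoid', so part of this grid is redundant.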

parameters = {
    'kernel':('linear', 'rbf', 'sigmoid'), 
    'C':[0.0001,0.1, 0.5, 1, 5, 10, 100], 
    'degree': [1,2,3,4,5,6,7,8,9],
    'coef0': [-10.,-1., 0., 0.1, 0.5, 1, 10, 100],
    'gamma': ('scale', 'auto')
    }

svc = svm.SVC()

clf = GridSearchCV(estimator=svc,
                   param_grid=parameters,
                   n_jobs=-1, # Use all available CPU cores
                   cv=10) # Number of cross-validation folds

clf.fit(iris.data, iris.target)

print("clf.best_estimator_", clf.best_estimator_)
print("clf.best_params_", clf.best_params_)

# Mean cross-validated score of the best_estimator
print("clf.best_score", clf.best_score_)

clf.best_estimator_ SVC(C=0.5, coef0=-10.0, degree=1, kernel='linear')
clf.best_params_ {'C': 0.5, 'coef0': -10.0, 'degree': 1, 'gamma': 'scale', 'kernel': 'linear'}
clf.best_score 0.9866666666666667


In [7]:
import pandas as pd
pd.DataFrame(clf.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_coef0,param_degree,param_gamma,param_kernel,params,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.005414,0.007242,0.003430,0.006070,0.0001,-10.0,1,scale,linear,"{'C': 0.0001, 'coef0': -10.0, 'degree': 1, 'ga...",...,0.933333,0.866667,0.800000,1.000000,0.933333,0.933333,0.933333,0.906667,6.110101e-02,1873
1,0.005598,0.005461,0.000802,0.001251,0.0001,-10.0,1,scale,rbf,"{'C': 0.0001, 'coef0': -10.0, 'degree': 1, 'ga...",...,0.933333,0.933333,0.800000,1.000000,0.933333,1.000000,0.933333,0.926667,5.537749e-02,1801
2,0.003300,0.001476,0.001001,0.001553,0.0001,-10.0,1,scale,sigmoid,"{'C': 0.0001, 'coef0': -10.0, 'degree': 1, 'ga...",...,0.533333,0.400000,0.533333,0.400000,0.400000,0.333333,0.333333,0.440000,9.521905e-02,2026
3,0.002097,0.001373,0.000900,0.001375,0.0001,-10.0,1,auto,linear,"{'C': 0.0001, 'coef0': -10.0, 'degree': 1, 'ga...",...,0.933333,0.866667,0.800000,1.000000,0.933333,0.933333,0.933333,0.906667,6.110101e-02,1873
4,0.002001,0.001340,0.001195,0.001464,0.0001,-10.0,1,auto,rbf,"{'C': 0.0001, 'coef0': -10.0, 'degree': 1, 'ga...",...,0.933333,0.933333,0.866667,0.933333,0.933333,1.000000,1.000000,0.933333,4.216370e-02,1657
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3019,0.001646,0.000632,0.000951,0.000468,100,100,9,scale,rbf,"{'C': 100, 'coef0': 100, 'degree': 9, 'gamma':...",...,1.000000,0.866667,0.933333,0.933333,1.000000,1.000000,1.000000,0.973333,4.422166e-02,505
3020,0.002505,0.000938,0.000798,0.000399,100,100,9,scale,sigmoid,"{'C': 100, 'coef0': 100, 'degree': 9, 'gamma':...",...,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333,5.551115e-17,2080
3021,0.001507,0.001312,0.000698,0.000457,100,100,9,auto,linear,"{'C': 100, 'coef0': 100, 'degree': 9, 'gamma':...",...,1.000000,0.933333,1.000000,0.866667,0.933333,1.000000,1.000000,0.973333,4.422166e-02,505
3022,0.001708,0.000921,0.000697,0.000457,100,100,9,auto,rbf,"{'C': 100, 'coef0': 100, 'degree': 9, 'gamma':...",...,0.933333,0.933333,0.933333,0.866667,1.000000,1.000000,1.000000,0.966667,4.472136e-02,1441
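
The size of the search follows directly from the grid: every value of every key is crossed with every other. The sketch below counts the combinations with `ParameterGrid` and compares them with a hypothetical per-kernel grid (`pruned_grid` is an illustration, not part of the original notebook) that lists only the hyperparameters each kernel actually uses.

```python
from sklearn.model_selection import ParameterGrid

# The grid from this section: every key is crossed with every other key.
full_grid = {
    "kernel": ("linear", "rbf", "sigmoid"),
    "C": [0.0001, 0.1, 0.5, 1, 5, 10, 100],
    "degree": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "coef0": [-10.0, -1.0, 0.0, 0.1, 0.5, 1, 10, 100],
    "gamma": ("scale", "auto"),
}

# Hypothetical alternative: one sub-grid per kernel, listing only the
# hyperparameters that kernel uses ('degree' is dropped because 'poly'
# is not searched).
c_values = full_grid["C"]
pruned_grid = [
    {"kernel": ["linear"], "C": c_values},
    {"kernel": ["rbf"], "C": c_values, "gamma": full_grid["gamma"]},
    {
        "kernel": ["sigmoid"],
        "C": c_values,
        "gamma": full_grid["gamma"],
        "coef0": full_grid["coef0"],
    },
]

n_full = len(ParameterGrid(full_grid))
n_pruned = len(ParameterGrid(pruned_grid))
print(n_full, n_pruned)
```

With `cv=10`, each candidate is fitted ten times, so trimming redundant combinations shortens the search considerably.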


In [4]:
import multiprocessing
multiprocessing.cpu_count()

4

## Method 2

The more complete approach does the same thing while also recording the training and validation errors. It can stop the process when required, save the model locally once finished if it is better than the previous one, and load the previous model to continue retraining.

To set up a single grid search:

In [8]:
import pickle

In [9]:
# Load libraries
import numpy as np
from sklearn import datasets, svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split 
# Set random seed
np.random.seed(0)

In [10]:
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [12]:
to_test = np.arange(1, 10)

In [13]:
# Create a pipeline

# Any classifier can go here. The search swaps it out as it tries candidates, but the pipeline needs one to start with.
pipe = Pipeline(steps=[('classifier', RandomForestClassifier())])


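# Note: LogisticRegression's default 'lbfgs' solver does not support the 'l1' penalty,
# so the 'l1' candidates fail to fit and score NaN; 'l1' needs a solver such as
# 'liblinear' or 'saga'.
logistic_params = {
    'classifier': [LogisticRegression()],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': np.logspace(0, 4, 10)
    }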

random_forest_params = {
    'classifier': [RandomForestClassifier()],
    'classifier__n_estimators': [10, 100, 1000],
    'classifier__max_features': [1, 2, 3]
    }

svm_params = {
    'classifier': [svm.SVC()],
    'classifier__kernel':('linear', 'rbf', 'sigmoid'), 
    'classifier__C':[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 
    'classifier__degree': to_test,
    'classifier__coef0': [-10.,-1., 0., 0.1, 0.5, 1, 10, 100],
    'classifier__gamma': ('scale', 'auto')
    }


# Create space of candidate learning algorithms and their hyperparameters
search_space = [
    logistic_params,
    random_forest_params,
    svm_params
    ]
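
The `'classifier'` and `'classifier__<param>'` keys above rely on the Pipeline naming convention: a bare step name replaces the whole step, and `<step>__<param>` sets a hyperparameter on whatever estimator currently occupies that step. A minimal sketch of that mechanism, outside of any search:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("classifier", RandomForestClassifier())])

# Roughly what the search does for one candidate: replace the whole step,
# then set a hyperparameter on the new estimator via "<step>__<param>".
pipe.set_params(classifier=LogisticRegression(), classifier__C=10.0)

current = pipe.named_steps["classifier"]
print(type(current).__name__, current.C)
```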

In [14]:
%%time

# Repeat the 10-fold CV twice with different random splits
cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=1)


# Create grid search 
clf = GridSearchCV(estimator=pipe,
                   param_grid=search_space,
                   cv=cv,
                   verbose=0,
                   n_jobs=-1)

# Fit grid search
best_model = clf.fit(X_train, y_train)
# View best model
separator = "\n############################\n"
print(separator)
print("best estimator:", best_model.best_estimator_.get_params()['classifier'])
print(separator)
print("clf.best_params_", clf.best_params_)
print(separator)
# Mean cross-validated score of the best_estimator
print("clf.best_score", clf.best_score_)
# Save the best pipeline to disk
filename = 'finished_model.model'
with open(filename, "wb") as archivo_salida:
    pickle.dump(best_model.best_estimator_, archivo_salida)


############################

best estimator: SVC(C=0.7, coef0=-10.0, degree=1, kernel='linear')

############################

clf.best_params_ {'classifier': SVC(C=0.7, coef0=-10.0, degree=1, kernel='linear'), 'classifier__C': 0.7, 'classifier__coef0': -10.0, 'classifier__degree': 1, 'classifier__gamma': 'scale', 'classifier__kernel': 'linear'}

############################

clf.best_score 0.9833333333333334
Wall time: 3min 27s
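
The long wall time is easier to understand once the fits are counted. The sketch below rebuilds the same search space and uses `ParameterGrid` and `RepeatedKFold.get_n_splits` to tally candidates and splits; it only counts and fits no models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, RepeatedKFold
from sklearn.svm import SVC

search_space = [
    {"classifier": [LogisticRegression()],
     "classifier__penalty": ["l1", "l2"],
     "classifier__C": np.logspace(0, 4, 10)},
    {"classifier": [RandomForestClassifier()],
     "classifier__n_estimators": [10, 100, 1000],
     "classifier__max_features": [1, 2, 3]},
    {"classifier": [SVC()],
     "classifier__kernel": ("linear", "rbf", "sigmoid"),
     "classifier__C": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
     "classifier__degree": np.arange(1, 10),
     "classifier__coef0": [-10.0, -1.0, 0.0, 0.1, 0.5, 1, 10, 100],
     "classifier__gamma": ("scale", "auto")},
]

cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=1)

n_candidates = len(ParameterGrid(search_space))  # 20 + 9 + 3456 candidates
n_splits = cv.get_n_splits()                     # 10 folds x 2 repeats
total_fits = n_candidates * n_splits
print(n_candidates, n_splits, total_fits)
```

The product matches the total of 69700 fits reported in the warning for this cell, and almost all of them come from the SVM sub-grid.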


200 fits failed out of a total of 69700.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
200 fits failed with the following error:
Traceback (most recent call last):
  File "c:\users\rocio\appdata\local\programs\python\python37\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\rocio\appdata\local\programs\python\python37\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "c:\users\rocio\appdata\local\programs\python\python37\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  F
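
The traceback above ends in `_check_solver`, which is where scikit-learn rejects a penalty the chosen solver cannot handle. One possible way to avoid the failed fits, sketched under the assumption that swapping the solver is acceptable: use `'saga'`, which accepts both `'l1'` and `'l2'`. The scaler, `max_iter=5000` and the reduced `C` list are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

# Assumption: 'saga' is swapped in because it accepts both 'l1' and 'l2';
# the scaler and max_iter are there only to help it converge.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="saga", max_iter=5000)),
])

param_grid = {
    "classifier__penalty": ["l1", "l2"],
    "classifier__C": [0.1, 1.0, 10.0],
}

# error_score='raise' surfaces an invalid combination immediately instead
# of recording NaN for it.
search = GridSearchCV(pipe, param_grid, cv=5, error_score="raise")
search.fit(X, y)

scores = search.cv_results_["mean_test_score"]
print(np.isnan(scores).any(), search.best_params_)
```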

In [15]:
with open('finished_model.model', "rb") as archivo_entrada:
    pipeline_importada = pickle.load(archivo_entrada)

In [16]:
# Accuracy on the held-out test set, as a percentage
pipeline_importada.score(X_test, y_test) * 100

100.0
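
The description of this method mentions saving the model only when it improves on the previous one, which the cells above do not do (they always overwrite the file). A minimal sketch of that idea follows; `save_if_better` is a hypothetical helper, and storing the score next to the model in a dict is an assumption made for illustration.

```python
import os
import pickle
import tempfile


def save_if_better(model, score, path):
    """Pickle {'model', 'score'} to `path` unless the stored score is at least as good.

    Returns True when the file was written.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            previous = pickle.load(f)
        if previous["score"] >= score:
            return False
    with open(path, "wb") as f:
        pickle.dump({"model": model, "score": score}, f)
    return True


# Usage with stand-in "models" (plain strings) so the example runs on its own.
path = os.path.join(tempfile.mkdtemp(), "finished_model.model")
first = save_if_better("model_a", 0.95, path)   # nothing stored yet
second = save_if_better("model_b", 0.90, path)  # worse than the stored one
third = save_if_better("model_c", 0.98, path)   # better than the stored one

with open(path, "rb") as f:
    stored = pickle.load(f)
print(first, second, third, stored["model"])
```

In the notebook's setting, `model` would be `best_model.best_estimator_` and `score` would be `clf.best_score_`.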

## Method 3

In [17]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [13]:

# Define each classifier inside its own Pipeline
reg_log = Pipeline(steps=[
                          ("imputer",SimpleImputer()),
                          ("scaler",StandardScaler()),
                          ("reglog",LogisticRegression())
                         ])

rand_forest = RandomForestClassifier()

# Named svm_pipe so it does not shadow the sklearn `svm` module imported earlier
svm_pipe = Pipeline(steps=[("scaler",StandardScaler()),
                           ("selectkbest",SelectKBest()),
                           ("svm",SVC())])

# Define their hyperparameter grids
# Note: the default 'lbfgs' solver does not support the 'l1' penalty, so the 'l1'
# candidates fail to fit; 'l1' needs a solver such as 'liblinear' or 'saga'.
reg_log_param = {    
                 "imputer__strategy": ['mean', 'median', 'most_frequent'],
                 "reglog__penalty": ["l1","l2"], 
                 "reglog__C": np.logspace(0, 4, 10)
                }

rand_forest_param = {
    'n_estimators': [10, 100, 1000],
    'max_features': [1, 2, 3]
    }


svm_param = {                    
            'selectkbest__k': [1,2,3],
            'svm__C': [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 
            'svm__kernel': ["linear","poly","rbf"],
            'svm__coef0': [-10.,-1., 0., 0.1, 0.5, 1, 10, 100],
            'svm__gamma': ('scale', 'auto')
            }


# One grid search per model, all with the same CV and scoring setup
gs_reg_log = GridSearchCV(reg_log,
                            reg_log_param,
                            cv=10,
                            scoring="accuracy",
                            verbose=1,
                            n_jobs=-1)

gs_rand_forest = GridSearchCV(rand_forest,
                            rand_forest_param,
                            cv=10,
                            scoring="accuracy",
                            verbose=1,
                            n_jobs=-1)

gs_svm = GridSearchCV(svm_pipe,
                        svm_param,
                        cv=10,
                        scoring="accuracy",
                        verbose=1,
                        n_jobs=-1)

grids = {"gs_reg_log":gs_reg_log,
         "gs_rand_forest":gs_rand_forest,
         "gs_svm":gs_svm}

In [14]:
%%time

for nombre, grid_search in grids.items():
    grid_search.fit(X_train, y_train)
    

Fitting 10 folds for each of 60 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:    2.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 10 folds for each of 9 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   16.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 10 folds for each of 1152 candidates, totalling 11520 fits


[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 2160 tasks      | elapsed:    3.1s
[Parallel(n_jobs=-1)]: Done 6160 tasks      | elapsed:    8.8s


Wall time: 34.4 s


[Parallel(n_jobs=-1)]: Done 11520 out of 11520 | elapsed:   15.3s finished


In [15]:
best_grids = [(i, j.best_score_) for i, j in grids.items()]

best_grids = pd.DataFrame(best_grids, columns=["Grid", "Best score"]).sort_values(by="Best score", ascending=False)
best_grids

Unnamed: 0,Grid,Best score
0,gs_reg_log,0.966667
1,gs_rand_forest,0.966667
2,gs_svm,0.966667


In [16]:
# The chosen model (all three grids tie on best score here, so the random forest is picked)
best_model = grids["gs_rand_forest"]
best_model

GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_features': [1, 2, 3],
                         'n_estimators': [10, 100, 1000]},
             scoring='accuracy', verbose=1)
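
Instead of hard-coding the dictionary key, the winning grid can be picked from the scores. This is a small sketch on two stand-in grids (smaller than the ones above, chosen only so the example runs quickly); note that `max` keeps the first key when scores tie, so dictionary order acts as the tie-breaker.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Stand-in grids, kept tiny for illustration.
grids = {
    "gs_reg_log": GridSearchCV(LogisticRegression(max_iter=1000),
                               {"C": [0.1, 1.0, 10.0]}, cv=5),
    "gs_svm": GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5),
}
for grid_search in grids.values():
    grid_search.fit(X, y)

# max() returns the first key among ties, so dict order breaks ties.
best_name = max(grids, key=lambda name: grids[name].best_score_)
best_model = grids[best_name]
print(best_name, best_model.best_score_)
```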

In [17]:
mejor_modelo = best_model.best_estimator_
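# best_estimator_ was already refit on X_train by GridSearchCV (refit=True by default),
# so this extra fit is redundant
mejor_modelo.fit(X_train, y_train)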
mejor_modelo.score(X_test, y_test)

0.9666666666666667

We have now chosen a model using the validation data. The next step is to train it on ALL of the training data. With the default refit=True, GridSearchCV already refits best_estimator_ on the full training set passed to fit.
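
A short sketch of that final step, assuming the default refit behaviour of `GridSearchCV`: after `fit`, `best_estimator_` has been trained on all of `X_train` and can score the test set directly. The small grid and `random_state=0` are illustrative choices to keep the example fast and repeatable.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50], "max_features": [1, 2]},
    cv=5,
    refit=True,  # the default, spelled out for clarity
)
search.fit(X_train, y_train)

# best_estimator_ was fitted on all of X_train during search.fit, so it can
# score the test set directly; search.score delegates to it.
direct = search.best_estimator_.score(X_test, y_test)
delegated = search.score(X_test, y_test)
print(direct, delegated)
```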

## RandomSearch
The problem with GridSearchCV is that it is computationally very expensive when the hyperparameter space is large.

With random search, not every combination is tried; only a random sample of them is evaluated. It works well on datasets with few features. There are even [papers](https://www.jmlr.org/papers/v13/bergstra12a.html) arguing that random search is more efficient than grid search.

![image](https://miro.medium.com/proxy/1*ZTlQm_WRcrNqL-nLnx6GJA.png)

In [22]:
from sklearn.model_selection import RandomizedSearchCV

reg_log = Pipeline(steps=[
                          ("imputer",SimpleImputer()),
                          ("scaler",StandardScaler()),
                          ("reglog",LogisticRegression())
                         ])

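# Note: the default 'lbfgs' solver does not support the 'l1' penalty, so sampled
# 'l1' candidates fail to fit; 'l1' needs a solver such as 'liblinear' or 'saga'.
reg_log_param = {    
                 "imputer__strategy": ['mean', 'median', 'most_frequent'],
                 "reglog__penalty": ["l1","l2"], 
                 "reglog__C": np.logspace(0, 4, 10)
                }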


search = RandomizedSearchCV(reg_log,
                            reg_log_param,
                            n_iter=50, # Number of hyperparameter combinations to sample
                            scoring='accuracy',
                            n_jobs=-1,
                            cv=10,
                            random_state=42)

# execute search
result = search.fit(X_train, y_train)

# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
print('Best Estimator: %s' % result.best_estimator_)



Best Score: 0.9666666666666666
Best Hyperparameters: {'reglog__penalty': 'l2', 'reglog__C': 59.94842503189409, 'imputer__strategy': 'mean'}
Best Estimator: Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler()),
                ('reglog', LogisticRegression(C=59.94842503189409))])
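
The example above samples from fixed lists, so it can only ever try values that a grid would also contain. `RandomizedSearchCV` also accepts distributions. The sketch below is an illustrative variation, not the notebook's setup: it draws `C` from `scipy.stats.loguniform` over the same range that `np.logspace(0, 4, 10)` covers, with `n_iter=15` and `max_iter=1000` as assumptions made to keep it quick.

```python
from scipy.stats import loguniform
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("reglog", LogisticRegression(max_iter=1000)),
])

# Assumption: C is drawn log-uniformly from [1, 1e4], the same range that
# np.logspace(0, 4, 10) covers with 10 fixed points.
param_distributions = {"reglog__C": loguniform(1e0, 1e4)}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=15, cv=5, random_state=42
)
search.fit(X, y)

# One sampled value of C per iteration
sampled_c = [params["reglog__C"] for params in search.cv_results_["params"]]
print(len(sampled_c), search.best_params_)
```

The cost is set by `n_iter` rather than by the size of the grid, which is what keeps random search affordable as the number of hyperparameters grows.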
