Short Intro:
I assume you know the basic idea [https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation] behind cross-validation (CV). Here are just a few points that I consider important.

1. CV is widely used for model selection because it lets you estimate how a fitted model will perform on unseen data.
    

2. Typically you want to use:
    - KFold CV [https://scikit-learn.org/stable/modules/cross_validation.html#k-fold] for regression problems 
    - StratifiedKFold CV [https://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold] for classification problems (especially if the distribution of target labels is not uniform) 
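
To make the difference between the two splitters concrete, here is a minimal sketch on made-up, imbalanced labels. The toy arrays `X_toy` and `y_toy` are assumptions for illustration only and are not part of the dataset used in this article. The snippet counts how many positive samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (assumed for illustration): 90 negatives followed by 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

# Number of positive samples in each test fold, without shuffling.
kfold_counts = [int(y_toy[test_idx].sum())
                for _, test_idx in KFold(n_splits=5).split(X_toy)]
strat_counts = [int(y_toy[test_idx].sum())
                for _, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy)]

print('KFold:          ', kfold_counts)
print('StratifiedKFold:', strat_counts)
```

Because the toy labels are sorted and the splitters are not shuffled, plain KFold puts all 10 positives into the last fold, while StratifiedKFold places 2 positives in every fold and so keeps the 10% class ratio.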

In this article we will stick to one model (LGBMRegressor [https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html]) and use cross-validation to select its hyperparameters.

In [1]:
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this notebook needs an older version
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error as mse

import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

In [2]:
np.random.seed(1) # for reproducibility

In [3]:
X, y = load_boston(return_X_y=True)

In [4]:
print('X Shape: {}\ny Shape: {}'.format(X.shape, y.shape))

X Shape: (506, 13)
y Shape: (506,)


In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [6]:
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

### GridSearchCV

Grid search [https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search] performs an exhaustive search over a specified grid of hyperparameter values. For this method you need to list every single value of each parameter that you want the model to try, which can be tricky, especially for continuous-valued parameters.
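
As a rough illustration of what "exhaustive" means in terms of cost, the sketch below counts the combinations in the grid used in this article with scikit-learn's `ParameterGrid`. The variable `n_folds` is an assumption that mirrors the 5-fold `KFold` defined earlier.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as the one used in this article.
param_grid = {
    'max_depth': np.arange(2, 7),
    'learning_rate': np.arange(0.05, 0.51, 0.05),
}

grid = ParameterGrid(param_grid)
n_folds = 5  # assumed to match KFold(n_splits=5)

print('Hyperparameter combinations:', len(grid))
print('Model fits during CV:       ', len(grid) * n_folds)
```

Every added parameter multiplies the number of combinations, so the cost of grid search grows quickly. With the default `refit=True`, GridSearchCV also fits the best configuration once more on the full training set.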

In [7]:
from sklearn.model_selection import GridSearchCV

In [8]:
param_grid = {
              'max_depth': np.arange(2, 7),
              'learning_rate': np.arange(0.05, 0.51, 0.05),
             }

In [9]:
cv = GridSearchCV(LGBMRegressor(random_state=1), 
                  param_grid, 
                  cv=kfold, 
                  scoring='neg_mean_squared_error')

In [10]:
cv.fit(X_train, y_train)

GridSearchCV(cv=KFold(n_splits=5, random_state=1, shuffle=True),
             error_score='raise-deprecating',
             estimator=LGBMRegressor(boosting_type='gbdt', class_weight=None,
                                     colsample_bytree=1.0,
                                     importance_type='split', learning_rate=0.1,
                                     max_depth=-1, min_child_samples=20,
                                     min_child_weight=0.001, min_split_gain=0.0,
                                     n_estimators=100, n_jobs=-1, num_leaves=31,
                                     objective=Non...dom_state=1,
                                     reg_alpha=0.0, reg_lambda=0.0, silent=True,
                                     subsample=1.0, subsample_for_bin=200000,
                                     subsample_freq=0),
             iid='warn', n_jobs=None,
             param_grid={'learning_rate': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ]),
      

In [11]:
cv.best_params_

{'learning_rate': 0.25, 'max_depth': 3}

In [12]:
cv.best_estimator_

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.25, max_depth=3,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=1, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [13]:
cv.best_score_

-11.931569841987132

In [14]:
mse(y_train, cv.best_estimator_.predict(X_train))

2.037414969190734

In [15]:
mse(y_test, cv.best_estimator_.predict(X_test))

11.347274902683115

One of the major downsides of grid search CV is that a single value, for example learning_rate=0.45, may lead to terrible performance no matter what values the other parameters take. In the example above, grid search still evaluates learning_rate=0.45 five times (once for each max_depth value), which basically wastes those 5 trials.
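
The point about repeated values can be checked with a small sketch that compares how many distinct learning rates are explored for the same budget of 50 candidates. It uses `ParameterSampler`, the scikit-learn helper that draws candidates from distributions in the same way as a randomized search. The budget of 50 and the seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    'max_depth': np.arange(2, 7),
    'learning_rate': np.arange(0.05, 0.51, 0.05),
}
param_distributions = {
    'max_depth': randint(low=2, high=7),                   # integers 2..6
    'learning_rate': uniform(loc=0.05, scale=0.5 - 0.05),  # floats in [0.05, 0.5]
}

grid_points = list(ParameterGrid(param_grid))
random_points = list(ParameterSampler(param_distributions,
                                      n_iter=50, random_state=1))

# Distinct learning rates tried by each strategy (grid values rounded to
# hide floating-point noise from np.arange).
grid_lr = {round(float(p['learning_rate']), 2) for p in grid_points}
random_lr = {float(p['learning_rate']) for p in random_points}

print('Grid:   {} candidates, {} distinct learning rates'.format(
    len(grid_points), len(grid_lr)))
print('Random: {} candidates, {} distinct learning rates'.format(
    len(random_points), len(random_lr)))
```

The grid spends its 50 candidates on only 10 learning rates, whereas random sampling from a continuous distribution tries a new learning rate with every candidate.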

### RandomizedSearchCV

In [16]:
from sklearn.model_selection import RandomizedSearchCV

from scipy.stats import randint, uniform

In [17]:
param_distributions = {'max_depth': randint(low=2, high=7),
                       'learning_rate' : uniform(loc=0.05, scale=0.5-0.05)}

In [18]:
cv = RandomizedSearchCV(LGBMRegressor(random_state=1), 
                        param_distributions, 
                        n_iter=45,
                        cv=kfold, 
                        scoring='neg_mean_squared_error',
                        random_state=1)

In [19]:
cv.fit(X_train, y_train)

RandomizedSearchCV(cv=KFold(n_splits=5, random_state=1, shuffle=True),
                   error_score='raise-deprecating',
                   estimator=LGBMRegressor(boosting_type='gbdt',
                                           class_weight=None,
                                           colsample_bytree=1.0,
                                           importance_type='split',
                                           learning_rate=0.1, max_depth=-1,
                                           min_child_samples=20,
                                           min_child_weight=0.001,
                                           min_split_gain=0.0, n_estimators=100,
                                           n_jobs=-1, num_leaves=31,
                                           objecti...
                                           subsample_freq=0),
                   iid='warn', n_iter=45, n_jobs=None,
                   param_distributions={'learning_rate': <scipy.stats._distn_infrastruct

In [20]:
cv.best_params_

{'learning_rate': 0.2861597195917005, 'max_depth': 3}

In [21]:
cv.best_estimator_

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.2861597195917005,
              max_depth=3, min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
              objective=None, random_state=1, reg_alpha=0.0, reg_lambda=0.0,
              silent=True, subsample=1.0, subsample_for_bin=200000,
              subsample_freq=0)

In [22]:
cv.best_score_

-11.869192784679505

In [23]:
mse(y_train, cv.best_estimator_.predict(X_train))

1.836994185753246

In [24]:
mse(y_test, cv.best_estimator_.predict(X_test))

11.382017864091782

### Hyperopt

In [25]:
from hyperopt import tpe, hp, fmin, space_eval, Trials

In [26]:
class ModelTraining:

    def __init__(self, X, y, params_space, n_trials, cv_scoring_metric, cv):

        self.X = X
        self.y = y 
        self.params_space = params_space
        self.n_trials = n_trials
        self.cv_scoring_metric = cv_scoring_metric
        self.cv = cv
        self.trials = Trials()

    def _objective(self, params):
        estimator = LGBMRegressor(**params, random_state=1)
        score = cross_val_score(estimator, self.X, self.y, 
                                scoring=self.cv_scoring_metric, cv=self.cv).mean()
        return -score
        
    def optimize(self):
        return fmin(self._objective,
                    self.params_space,
                    algo=tpe.suggest,
                    max_evals=self.n_trials,
                    trials=self.trials,
                    rstate=np.random.RandomState(1))

In [27]:
params_space = {'learning_rate' : hp.uniform('learning_rate', 0.05, 0.5),
                'max_depth' : hp.choice('max_depth', np.arange(2,7))}

In [28]:
model_training = ModelTraining(X=X_train, 
                               y=y_train, 
                               params_space=params_space,
                               n_trials=50, 
                               cv_scoring_metric='neg_mean_squared_error', 
                               cv=kfold)

In [29]:
best_params = model_training.optimize()

100%|███████████████████████████████████████████████████| 50/50 [00:08<00:00,  5.97it/s, best loss: 11.870490843729723]


In [30]:
best_params
# Keep in mind that when 'hp.choice' is used, fmin returns the index of the selected
# option within the sequence passed to hp.choice, not the value of the option itself.
# Thus 'max_depth': 1 means that the element at index 1 of np.arange(2,7) was picked as optimal.
# As np.arange(2,7) -> [2,3,4,5,6], that element has the value 3 (indexing starts at 0).

{'learning_rate': 0.2556072866275003, 'max_depth': 1}
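
The index-to-value mapping can also be reproduced by hand. The snippet below is only a sketch of that lookup for this particular search space, written in plain NumPy; the names `max_depth_choices`, `fmin_result` and `decoded` are made up for illustration, and the dictionary is copied from the output above.

```python
import numpy as np

# The options that were passed to hp.choice('max_depth', ...).
max_depth_choices = np.arange(2, 7)  # -> [2, 3, 4, 5, 6]

# Result returned by fmin, copied from the output above.
fmin_result = {'learning_rate': 0.2556072866275003, 'max_depth': 1}

# Replace the hp.choice index with the value it refers to.
decoded = dict(fmin_result)
decoded['max_depth'] = int(max_depth_choices[fmin_result['max_depth']])

print(decoded)
```

For anything beyond a toy space, hyperopt's own `space_eval` is the safer way to do this decoding.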

In [31]:
# You can retrieve the values of the selected parameters by using space_eval:
space_eval(params_space, best_params)

{'learning_rate': 0.2556072866275003, 'max_depth': 3}

In [32]:
best_model = LGBMRegressor(**space_eval(params_space, best_params), random_state=1)
best_model.fit(X_train, y_train)

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.2556072866275003,
              max_depth=3, min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
              objective=None, random_state=1, reg_alpha=0.0, reg_lambda=0.0,
              silent=True, subsample=1.0, subsample_for_bin=200000,
              subsample_freq=0)

In [33]:
mse(y_train, best_model.predict(X_train))

2.0059894557008233

In [34]:
mse(y_test, best_model.predict(X_test))

11.54208304337233

### Optuna

In [35]:
import optuna
optuna.logging.set_verbosity('WARNING')

In [36]:
from optuna import create_study, samplers

In [37]:
class ModelTraining:

    def __init__(self, X, y, n_trials, cv_scoring_metric, cv,
                 sampler_seed=1):

        self.X = X
        self.y = y
        self.n_trials = n_trials
        self._cv_scoring_metric = cv_scoring_metric
        self._cv = cv
        self.study = None
        self._sampler_seed = sampler_seed

    def _objective(self, trial):
        params = {
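            # suggest_uniform is deprecated since Optuna 3.0 in favor of suggest_float
            'learning_rate': trial.suggest_uniform('learning_rate', 0.05, 0.5),
            # the upper bound of suggest_int is inclusive, hence 7-1 to match np.arange(2, 7)
            'max_depth': trial.suggest_int('max_depth', 2, 7-1),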
        }
            
        model = LGBMRegressor(**params, random_state=1)
        score = cross_val_score(model, 
                                self.X, 
                                self.y, 
                                scoring=self._cv_scoring_metric,
                                cv=self._cv).mean()
        return score       

    def optimize(self):
        self.study = create_study(sampler=samplers.TPESampler(seed=self._sampler_seed),
                                  direction='maximize')
        self.study.optimize(self._objective, n_trials=self.n_trials)

In [38]:
model_training = ModelTraining(X=X_train, 
                               y=y_train, 
                               n_trials=50, 
                               cv_scoring_metric='neg_mean_squared_error', 
                               cv=kfold)

In [39]:
model_training.optimize()

In [40]:
model_training.study.best_params

{'learning_rate': 0.24545070025296128, 'max_depth': 3}

In [41]:
model_training.study.best_value

-11.623597255551626

In [42]:
best_model = LGBMRegressor(**model_training.study.best_params, random_state=1)
best_model.fit(X_train, y_train)

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.24545070025296128,
              max_depth=3, min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
              objective=None, random_state=1, reg_alpha=0.0, reg_lambda=0.0,
              silent=True, subsample=1.0, subsample_for_bin=200000,
              subsample_freq=0)

In [43]:
mse(y_train, best_model.predict(X_train))

2.22107239312464

In [44]:
mse(y_test, best_model.predict(X_test))

11.41184036504707

### Useful links

- Predefined sklearn evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules


- Scipy distributions: https://docs.scipy.org/doc/scipy/reference/stats.html


- Hyperopt parameter expressions: https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions


- Optuna suggest options: https://optuna.readthedocs.io/en/latest/reference/trial.html#optuna.trial.Trial.suggest_categorical