# Ray Tune - Hyperparameter Tuning with Sklearn

© 2019-2022, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademyLogo.png)



<img src="https://docs.ray.io/en/latest/_images/tune_overview.png" align="center" width="50%">

Scikit-learn is one of the most widely used tools in the ML community for working with data, offering dozens of easy-to-use machine learning algorithms. However, to get high performance from these algorithms, you often need to perform **model selection**: choosing the best-performing model after tuning over a set of hyperparameters.

`tune-sklearn` is a module that integrates Ray Tune's hyperparameter tuning with scikit-learn's Classifier API. `tune-sklearn` has two APIs: [TuneSearchCV](https://docs.ray.io/en/latest/tune/api_docs/sklearn.html#tunesearchcv-docs) and [TuneGridSearchCV](https://docs.ray.io/en/latest/tune/api_docs/sklearn.html#tunesearchcv-docs). They are drop-in replacements for scikit-learn's [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) and [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV), so you need to change fewer than five lines in a standard scikit-learn script to use Tune's replacement API.

Let's compare Tune's scikit-learn APIs to the standard scikit-learn `GridSearchCV`. For this example, we'll be using `TuneGridSearchCV` with a stochastic gradient descent (SGD) [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).

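To start, install the libraries we need, then import tune-sklearn's grid search cross-validation interface. As the output of the import cell shows, the import fails with `ModuleNotFoundError` when the installed Ray release does not provide the `ray.tune.search.skopt` module.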

In [None]:
# !pip install tune-sklearn
# !pip install scikit-optimize

In [2]:
from tune_sklearn import TuneGridSearchCV
from sklearn.linear_model import SGDClassifier

ModuleNotFoundError: No module named 'ray.tune.search.skopt'

In [3]:
from sklearn.model_selection import GridSearchCV

# Other relevant imports
from sklearn.model_selection import train_test_split
# Use the stochastic gradient descent (SGD) classifier
from sklearn.linear_model import SGDClassifier

# import the classification dataset
from sklearn import datasets
from sklearn.datasets import make_classification
import numpy as np

Create classification data using `sklearn.datasets`. To start, we use a small dataset of 11K rows and 1K columns. As an exercise, you can increase these numbers and compare regular scikit-learn with tune-sklearn.

In [4]:
def create_classification_data() -> tuple[np.ndarray, np.ndarray]:
    X, y = make_classification(
        n_samples=11000,
        n_features=1000,
        n_informative=50,
        n_redundant=0,
        n_classes=10,
        class_sep=2.5)
    return X, y

Create the classification data, split it into training and test sets, and define our hyperparameter grid.

In [5]:
X, y = create_classification_data()
# Split the dataset into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameters grid to tune from SGDClassifier
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}
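A grid search tries every combination of the listed values. The sketch below enumerates the grid by hand with `itertools.product` to show how the number of fits adds up. It only illustrates the bookkeeping and is not how `GridSearchCV` is implemented; the helper name `expand_grid` is hypothetical, and `n_folds = 5` assumes scikit-learn's default number of cross-validation folds.

```python
from itertools import product

# Same grid as in the cell above.
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}


def expand_grid(grid):
    """Return one dict per combination of the listed values (hypothetical helper)."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(parameter_grid)
n_folds = 5  # assumed: scikit-learn's default when cv is not set

print(f"{len(candidates)} candidates x {n_folds} folds = {len(candidates) * n_folds} fits")
for params in candidates:
    print(params)
```

With three values for `alpha` and two for `epsilon`, this gives 6 candidates and 30 fits.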

### Use Sklearn to train the model

Run this on a single node

In [6]:
%%time
# n_jobs=-1 enables use of all cores
sklearn_search = GridSearchCV(SGDClassifier(), parameter_grid, n_jobs=-1, verbose=True)
sklearn_search.fit(x_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
CPU times: total: 3.2 s
Wall time: 2min 3s


GridSearchCV(estimator=SGDClassifier(), n_jobs=-1,
             param_grid={'alpha': [0.0001, 0.1, ...], 'epsilon': [0.01, 0.1]},
             verbose=True)
best_estimator_: SGDClassifier(alpha=0.1)


In [7]:
print("Best hyperparameters found were: ", sklearn_search.best_params_)

Best hyperparameters found were:  {'alpha': 0.1, 'epsilon': 0.1}


### Use Ray Tune's drop-in replacement

From here, we proceed just as we would with scikit-learn’s interface.

`SGDClassifier` has a `partial_fit` API that trains the model incrementally, which lets Tune stop fitting a given hyperparameter configuration early. If the estimator does not support early stopping, Tune falls back to a parallel grid search.
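To make the role of `partial_fit` concrete, here is a minimal sketch of incremental training that can be cut short. It uses a simple patience rule on a held-out split, which is an assumption made for illustration and not the scheduler that Tune applies; the dataset sizes, `max_epochs`, and `patience` are also illustrative values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Small synthetic problem so the sketch runs quickly.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SGDClassifier(alpha=1e-4, random_state=0)
classes = np.unique(y)  # partial_fit needs the full label set on the first call

max_epochs = 10  # assumed cap on the number of passes over the data
patience = 3     # assumed: stop after this many epochs without improvement
best_score, stale_epochs, history = -np.inf, 0, []

for epoch in range(max_epochs):
    clf.partial_fit(X_fit, y_fit, classes=classes)  # one more pass over the data
    score = clf.score(X_val, y_val)
    history.append(score)
    if score > best_score:
        best_score, stale_epochs = score, 0
    else:
        stale_epochs += 1
    if stale_epochs >= patience:
        break  # an external controller could make this decision instead

print(f"ran {len(history)} of {max_epochs} epochs, best validation accuracy {best_score:.3f}")
```

Because each `partial_fit` call performs a single pass, the loop can inspect the score between passes and decide whether continuing is worthwhile.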

The setup for Tune is exactly what you would write for scikit-learn, except that we replace `GridSearchCV` with `TuneGridSearchCV`. Now, let's try fitting a model.



#### Start Ray on the local host

This will start Ray on the localhost. If you have a cluster, then you can supply the arguments to `ray.init(...)`.
Check the [documentation](https://docs.ray.io/en/latest/package-ref.html?highlight=ray.init#ray-init) for the specific arguments. Some examples:
 * `ray.init()`: Start Ray locally and all the relevant processes
 * `ray.init(address="localhost:6379")`: connect to the localhost cluster at a specified port (for the head node)
 * `ray.init(address="ray://123.45.67.89:10001")`: connect to an existing remote cluster, using the URI

In [8]:
import ray
ray.init(ignore_reinit_error=True)

2026-01-18 17:07:42,988	INFO worker.py:2007 -- Started a local Ray instance.


Python version: 3.11.9
Ray version: 2.53.0


Note the two small differences we introduce below:

 * an `early_stopping` argument, and
 * a `max_iters` argument

The `early_stopping` parameter allows us to terminate unpromising configurations. If `early_stopping=True`, `TuneGridSearchCV` will default to using Tune's [ASHAScheduler](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-scheduler-hyperband). You can pass in a custom algorithm - see Tune's documentation on [schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-schedulers) for a full list to choose from.

`max_iters` is the maximum number of iterations a given hyperparameter set can run for; it may run for fewer iterations if it is stopped early.
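The interplay between early stopping and an iteration cap can be sketched with synchronous successive halving, a simplified relative of the idea behind ASHA. This is not Tune's implementation: ASHA is asynchronous, and the `simulated_score` learning curve below is a made-up stand-in for real training. The names `successive_halving` and `reduction_factor` are hypothetical.

```python
import math


def simulated_score(alpha, iterations):
    """Toy learning curve (an assumption, not a real model): a smaller alpha
    approaches a higher score, and more iterations get closer to it."""
    ceiling = 1.0 / (1.0 + alpha)
    return ceiling * (1.0 - math.exp(-0.5 * iterations))


def successive_halving(alphas, max_iters=8, reduction_factor=2):
    """Evaluate survivors in rounds, keeping the best 1/reduction_factor each round."""
    survivors = list(alphas)
    budget = 1            # total iterations granted to each survivor this round
    iterations_used = {}  # how far each configuration was trained
    while True:
        scores = {}
        for alpha in survivors:
            iterations_used[alpha] = budget
            scores[alpha] = simulated_score(alpha, budget)
        if budget >= max_iters:
            break
        keep = max(1, len(survivors) // reduction_factor)
        survivors = sorted(survivors, key=scores.get, reverse=True)[:keep]
        budget = min(budget * reduction_factor, max_iters)
    return max(survivors, key=scores.get), iterations_used


best_alpha, iterations_used = successive_halving([1e-4, 1e-2, 1e-1, 1.0])
print("best alpha:", best_alpha)
print("iterations per configuration:", iterations_used)
```

In this run only the best configuration reaches the full budget of 8 iterations; the others stop after 1 or 2, which is the saving that early stopping aims for.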

In [9]:
%%time
# %%time is a cell magic, so it must be the first line of the cell
from tune_sklearn import TuneGridSearchCV
from sklearn.linear_model import SGDClassifier

tune_search = TuneGridSearchCV(
    SGDClassifier(), parameter_grid, early_stopping=True,
    max_iters=10, name="AcademyTraining", verbose=1)
tune_search.fit(x_train, y_train)

ModuleNotFoundError: No module named 'ray.tune.search.skopt'

In [None]:
print("Best hyperparameters found were: ", tune_search.best_params_)

## Using Bayesian Optimization

In addition to the grid search interface, tune-sklearn also provides `TuneSearchCV`, an interface for sampling from **distributions of hyperparameters**.
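Sampling from distributions means drawing each hyperparameter from a range instead of picking it from a fixed list. The sketch below shows plain random sampling with the standard library. It is not tune-sklearn's sampler and not Bayesian optimization; the helper `sample_configuration`, the `(low, high)` tuple convention, and the choice to sample `alpha` on a log scale are assumptions made for the example.

```python
import math
import random

# Assumed convention for this sketch: each tuple is a (low, high) range.
parameter_ranges = {"alpha": (1e-4, 1), "epsilon": (0.01, 0.1)}


def sample_configuration(ranges, rng, log_scale=("alpha",)):
    """Draw one value per hyperparameter (hypothetical helper).

    Names listed in log_scale are sampled log-uniformly, the rest uniformly.
    """
    config = {}
    for name, (low, high) in ranges.items():
        if name in log_scale:
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # fixed seed so the draws are repeatable
trials = [sample_configuration(parameter_ranges, rng) for _ in range(3)]
for trial in trials:
    print(trial)
```

A log scale is a common choice for regularization strengths such as `alpha`, because useful values span several orders of magnitude.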

You can also enable Bayesian optimization over these distributions with only two lines of code:



In [None]:
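%%time
from tune_sklearn import TuneSearchCV  # not imported earlier in the notebook

digits = datasets.load_digits()
x = digits.data
y = digits.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)

clf = SGDClassifier()
# Each tuple is a (lower, upper) range to sample from, not a list of values
parameter_grid = {"alpha": (1e-4, 1), "epsilon": (0.01, 0.1)}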

bayopt_tune_search = TuneSearchCV(
    clf,
    parameter_grid,
    search_optimization="bayesian",
    n_trials=3,
    early_stopping=True,
    max_iters=10,
    verbose=1,
)
bayopt_tune_search.fit(x_train, y_train)

In [None]:
print("Best hyperparameters found were: ", bayopt_tune_search.best_params_)

In [None]:
ray.shutdown()

### Exercise

 * Try increasing `n_samples` to 110K and setting `test_size=10000`.

 Run the notebook end-to-end. If regular scikit-learn takes too long, stop it and continue with Ray's version.
 Do you see a difference in execution time?
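If you run the comparison outside a notebook, where the `%%time` magic is not available, a small timer built on `time.perf_counter` can record wall-clock time instead. This is a minimal sketch; the helper `wall_timer` is hypothetical, and the summation is only a stand-in for a call such as `sklearn_search.fit(x_train, y_train)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label, results):
    """Record elapsed wall-clock seconds for the enclosed block (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with wall_timer("placeholder workload", timings):
    sum(i * i for i in range(100_000))  # stand-in for a search's fit call

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Timing both searches with the same helper keeps the comparison on equal footing.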