
[MRG] extending BaseSearchCV with a custom search strategy #9599

Merged: 32 commits, Aug 5, 2018
Changes shown are from 30 commits.

Commits
d0d0436
ENH Implementers of BaseSearchCV can now provide candidates through a…
jnothman Aug 21, 2017
d4dbb5d
Pull out _format_results helper
jnothman Aug 22, 2017
8e36e37
Cannot del all_results in Py2
jnothman Aug 22, 2017
0d20d0b
Test
jnothman Aug 22, 2017
4264fff
More test, and a fixme
jnothman Aug 22, 2017
6905a47
Fix rank and masking cases
jnothman Aug 22, 2017
9c46788
Improve words
jnothman Aug 22, 2017
ab56a97
Spelling
jnothman Aug 22, 2017
dcaeb14
Provide cumulative results to coroutine
jnothman Sep 11, 2017
4f9f84e
See if we can make private methods appear
jnothman Jun 14, 2018
249bcbb
Merge branch 'master' into search-coroutine
jnothman Jun 14, 2018
05c3c3b
Improve documentation
jnothman Jun 14, 2018
1dad444
FIX parameter sampler generation
jnothman Jun 15, 2018
579a43e
Public AdaptiveSearchCV instead of abstract interface
jnothman Jun 16, 2018
90aaa06
BaseSearchCV is not in sklearn.model_selection.__init__
jnothman Jun 16, 2018
d6ef8da
AdaptiveSearchCV is also OTHER
jnothman Jun 16, 2018
1c19d9f
Fix super call
jnothman Jun 16, 2018
709b89a
Change coroutine structure to callback
jnothman Jun 18, 2018
b77eec4
Allow the user to pass search directly
jnothman Jun 18, 2018
4211b2b
Adapt tests to new interface
jnothman Jun 18, 2018
c23e378
Clarify
jnothman Jun 18, 2018
6c7c231
Merge branch 'master' into search-coroutine
jnothman Jun 26, 2018
4ef1f1b
Fix merge error
jnothman Jul 16, 2018
e430dc4
Fixes given merge
jnothman Jul 16, 2018
ecffb3e
PEP8 fix
jnothman Jul 17, 2018
718ded1
Get rid of AdaptiveSearchCV
jnothman Jul 21, 2018
df3e4c7
Specify cv to a divisor of n_samples
jnothman Jul 23, 2018
f14c8ce
Avoid return_train_score FutureWarning
jnothman Jul 24, 2018
7d91d18
Nitpicks
jnothman Jul 26, 2018
2a3eecf
What's new
jnothman Jul 26, 2018
5c135d0
Olivier's comments
jnothman Aug 5, 2018
bd88b64
Merge branch 'master' into search-coroutine
jnothman Aug 5, 2018
7 changes: 7 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -305,6 +305,13 @@ Model evaluation and meta-estimators
hyperparameter optimization and refitting the best model on the whole
dataset. :issue:`11310` by :user:`Matthias Feurer <mfeurer>`.

- `BaseSearchCV` now has an experimental, private interface to support
customized parameter search strategies, through its ``_run_search``
method. See the implementations in :class:`model_selection.GridSearchCV`
and :class:`model_selection.RandomizedSearchCV` and please provide feedback
if you use this. Note that we do not assure the stability of this API
beyond version 0.20. :issue:`9599` by `Joel Nothman`_
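
As a rough illustration of the interface described in this note (the class name below is hypothetical and not part of scikit-learn), a subclass only has to implement ``_run_search`` and hand lists of parameter settings to the ``evaluate_candidates`` callback:

    from sklearn.model_selection._search import BaseSearchCV

    class FixedCandidatesSearchCV(BaseSearchCV):
        """Minimal sketch: evaluate a user-supplied list of parameter dicts."""

        def __init__(self, estimator, candidates, **kwargs):
            super(FixedCandidatesSearchCV, self).__init__(estimator, **kwargs)
            self.candidates = candidates

        def _run_search(self, evaluate_candidates):
            # GridSearchCV and RandomizedSearchCV likewise make a single call.
            evaluate_candidates(self.candidates)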

Decomposition and manifold learning

- Speed improvements for both 'exact' and 'barnes_hut' methods in
169 changes: 112 additions & 57 deletions sklearn/model_selection/_search.py
@@ -406,7 +406,8 @@ def __repr__(self):

class BaseSearchCV(six.with_metaclass(ABCMeta, BaseEstimator,
MetaEstimatorMixin)):
"""Base class for hyper parameter search with cross-validation."""
"""Abstract base class for hyper parameter search with cross-validation.
"""

@abstractmethod
def __init__(self, estimator, scoring=None,
@@ -577,6 +578,30 @@ def classes_(self):
self._check_is_fitted("classes_")
return self.best_estimator_.classes_

@abstractmethod
def _run_search(self, evaluate_candidates):
"""Repeatedly calls `evaluate_candidates` to conduct a search.

ogrisel (Member) commented on Aug 3, 2018:

Please add some motivation for the intent of this abstract method. For instance:

This method, implemented in sub-classes, makes it possible to customize the
scheduling of evaluations: GridSearchCV and RandomizedSearchCV schedule
evaluations for their whole parameter search space at once, but other, more
sequential approaches are also possible: for instance, it is possible to
iteratively schedule evaluations for new regions of the parameter search
space based on previously collected evaluation results. This makes it
possible to implement Bayesian optimization, or more generally sequential
model-based optimization, by deriving from the BaseSearchCV abstract base
class.

jnothman (Member, Author) replied:
Very nice text!
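
To make that motivation concrete, here is a rough, purely illustrative sketch (not part of this PR; the class name and the zooming rule are invented) of a sequential strategy that schedules new evaluations based on results collected so far:

    import numpy as np
    from sklearn.model_selection._search import BaseSearchCV

    class ZoomSearchCV(BaseSearchCV):
        """Sketch: repeatedly narrow a log-spaced grid around the best C."""

        def __init__(self, estimator, n_rounds=3, **kwargs):
            super(ZoomSearchCV, self).__init__(estimator, **kwargs)
            self.n_rounds = n_rounds

        def _run_search(self, evaluate_candidates):
            low, high = 1e-3, 1e3
            for _ in range(self.n_rounds):
                grid = np.logspace(np.log10(low), np.log10(high), num=5)
                results = evaluate_candidates([{'C': c} for c in grid])
                # evaluate_candidates returns *cumulative* results, so pick
                # the best parameters seen so far and zoom in around them.
                best_idx = np.argmax(results['mean_test_score'])
                best_c = results['params'][best_idx]['C']
                low, high = best_c / 10., best_c * 10.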

Parameters
----------
evaluate_candidates : callable
This callback accepts a list of candidates, where each candidate is
a dict of parameter settings. It returns a dict of all results so
far, formatted like ``cv_results_``.

A contributor commented:

Could we refactor this function? I think a better implementation would be

def _generate_candidates(self):
    params = results = None
    while True:
        params = self._candidates(params, results)
        if params is None:
            break
        results = yield params

where self._candidates should be overridden (and raise a NotImplementedError in this class).

A contributor added:

This would allow the documentation of _candidates to be simple: takes in params and results, and returns more params. Plus, the iteration logic is separated from the choosing logic.
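
For reference, a self-contained sketch of how the proposed generator protocol would behave (this design was not adopted in the PR; ``_candidates`` and the driver loop below are stand-ins written as plain functions rather than BaseSearchCV methods):

    def _candidates(params, results):
        # One-shot strategy: schedule a fixed batch on the first call, then stop.
        if params is None:
            return [{'C': 1}, {'C': 10}]
        return None

    def _generate_candidates():
        params = results = None
        while True:
            params = _candidates(params, results)
            if params is None:
                break
            results = yield params

    # How fit() would drive the generator, feeding results back in with send():
    gen = _generate_candidates()
    try:
        batch = next(gen)
        while True:
            results = {'mean_test_score': [0.8] * len(batch)}  # fake evaluation
            batch = gen.send(results)
    except StopIteration:
        pass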

Examples
--------

::

def _run_search(self):

A member commented inline:
typo:

def _run_search(self, evaluate_candidates):
   ...

    'Try C=0.1 only if C=1 is better than C=10'
    all_results = evaluate_candidates([{'C': 1}, {'C': 10}])
    score = all_results['mean_test_score']
    if score[0] < score[1]:
        evaluate_candidates([{'C': 0.1}])
"""

def fit(self, X, y=None, groups=None, **fit_params):
"""Run fit with all sets of parameters.

@@ -636,29 +661,86 @@ def fit(self, X, y=None, groups=None, **fit_params):

X, y, groups = indexable(X, y, groups)
n_splits = cv.get_n_splits(X, y, groups)
# Regenerate parameter iterable for each fit
candidate_params = list(self._get_param_iterator())
n_candidates = len(candidate_params)
if self.verbose > 0:
print("Fitting {0} folds for each of {1} candidates, totalling"
" {2} fits".format(n_splits, n_candidates,
n_candidates * n_splits))

base_estimator = clone(self.estimator)
pre_dispatch = self.pre_dispatch

out = Parallel(
n_jobs=self.n_jobs, verbose=self.verbose,
pre_dispatch=pre_dispatch
)(delayed(_fit_and_score)(clone(base_estimator), X, y, scorers, train,
test, self.verbose, parameters,
fit_params=fit_params,
return_train_score=self.return_train_score,
return_n_test_samples=True,
return_times=True, return_parameters=False,
error_score=self.error_score)
for parameters, (train, test) in product(candidate_params,
cv.split(X, y, groups)))

parallel = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
pre_dispatch=self.pre_dispatch)

fit_and_score_kwargs = dict(scorer=scorers,
fit_params=fit_params,
return_train_score=self.return_train_score,
return_n_test_samples=True,
return_times=True,
return_parameters=False,
error_score=self.error_score,
verbose=self.verbose)
results_container = [{}]
with parallel:
all_candidate_params = []
all_out = []

def evaluate_candidates(candidate_params):
candidate_params = list(candidate_params)
n_candidates = len(candidate_params)

if self.verbose > 0:
print("Fitting {0} folds for each of {1} candidates,"
" totalling {2} fits".format(
n_splits, n_candidates, n_candidates * n_splits))

out = parallel(delayed(_fit_and_score)(clone(base_estimator),
X, y,
train=train, test=test,
parameters=parameters,
**fit_and_score_kwargs)
for parameters, (train, test)
in product(candidate_params,
cv.split(X, y, groups)))

all_candidate_params.extend(candidate_params)
all_out.extend(out)

# XXX: When we drop Python 2 support, we can use nonlocal
# instead of results_container
results_container[0] = self._format_results(
all_candidate_params, scorers, n_splits, all_out)
return results_container[0]

self._run_search(evaluate_candidates)

results = results_container[0]

# For multi-metric evaluation, store the best_index_, best_params_ and
# best_score_ iff refit is one of the scorer names
# In single metric evaluation, refit_metric is "score"
if self.refit or not self.multimetric_:
self.best_index_ = results["rank_test_%s" % refit_metric].argmin()
self.best_params_ = results["params"][self.best_index_]
self.best_score_ = results["mean_test_%s" % refit_metric][
self.best_index_]

if self.refit:
self.best_estimator_ = clone(base_estimator).set_params(
**self.best_params_)
refit_start_time = time.time()
if y is not None:
self.best_estimator_.fit(X, y, **fit_params)
else:
self.best_estimator_.fit(X, **fit_params)
refit_end_time = time.time()
self.refit_time_ = refit_end_time - refit_start_time

# Store the only scorer not as a dict for single metric evaluation
self.scorer_ = scorers if self.multimetric_ else scorers['score']

self.cv_results_ = results
self.n_splits_ = n_splits

return self
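
The ``results_container`` single-element list above stands in for ``nonlocal`` on Python 2, and ``evaluate_candidates`` deliberately returns cumulative results across calls. A stand-alone toy illustration of both points (not PR code; the fake ``format_results`` just counts candidates):

    def make_search_loop(format_results):
        results_container = [{}]     # a list so the closure can rebind item 0
        all_candidate_params = []

        def evaluate_candidates(candidate_params):
            all_candidate_params.extend(candidate_params)
            # Rebinding results_container[0] works on Python 2 and 3 alike;
            # with Python 3 only, `nonlocal` would replace the container.
            results_container[0] = format_results(all_candidate_params)
            return results_container[0]

        return evaluate_candidates, results_container

    evaluate, container = make_search_loop(lambda seen: {'n_candidates': len(seen)})
    evaluate([{'C': 1}, {'C': 10}])
    evaluate([{'C': 0.1}])
    print(container[0])    # {'n_candidates': 3} -- cumulative across both calls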

def _format_results(self, candidate_params, scorers, n_splits, out):
n_candidates = len(candidate_params)

# if one chooses to see train scores, "out" will contain train score info
if self.return_train_score:
@@ -744,7 +826,6 @@ def _store(key_name, array, weights=None, splits=False, rank=False):
prev_keys = set(results.keys())
_store('train_%s' % scorer_name, train_scores[scorer_name],
splits=True)

if self.return_train_score == 'warn':
for key in set(results.keys()) - prev_keys:
message = (
@@ -755,33 +836,7 @@ def _store(key_name, array, weights=None, splits=False, rank=False):
# warn on key access
results.add_warning(key, message, FutureWarning)

# For multi-metric evaluation, store the best_index_, best_params_ and
# best_score_ iff refit is one of the scorer names
# In single metric evaluation, refit_metric is "score"
if self.refit or not self.multimetric_:
self.best_index_ = results["rank_test_%s" % refit_metric].argmin()
self.best_params_ = candidate_params[self.best_index_]
self.best_score_ = results["mean_test_%s" % refit_metric][
self.best_index_]

if self.refit:
self.best_estimator_ = clone(base_estimator).set_params(
**self.best_params_)
refit_start_time = time.time()
if y is not None:
self.best_estimator_.fit(X, y, **fit_params)
else:
self.best_estimator_.fit(X, **fit_params)
refit_end_time = time.time()
self.refit_time_ = refit_end_time - refit_start_time

# Store the only scorer not as a dict for single metric evaluation
self.scorer_ = scorers if self.multimetric_ else scorers['score']

self.cv_results_ = results
self.n_splits_ = n_splits

return self
return results


class GridSearchCV(BaseSearchCV):
@@ -1100,9 +1155,9 @@ def __init__(self, estimator, param_grid, scoring=None, fit_params=None,
self.param_grid = param_grid
_check_param_grid(param_grid)

def _get_param_iterator(self):
"""Return ParameterGrid instance for the given param_grid"""
return ParameterGrid(self.param_grid)
def _run_search(self, evaluate_candidates):
"""Search all candidates in param_grid"""
evaluate_candidates(ParameterGrid(self.param_grid))


class RandomizedSearchCV(BaseSearchCV):
@@ -1414,8 +1469,8 @@ def __init__(self, estimator, param_distributions, n_iter=10, scoring=None,
pre_dispatch=pre_dispatch, error_score=error_score,
return_train_score=return_train_score)

def _get_param_iterator(self):
"""Return ParameterSampler instance for the given distributions"""
return ParameterSampler(
def _run_search(self, evaluate_candidates):
"""Search n_iter candidates from param_distributions"""
evaluate_candidates(ParameterSampler(
self.param_distributions, self.n_iter,
random_state=self.random_state)
random_state=self.random_state))
55 changes: 53 additions & 2 deletions sklearn/model_selection/tests/test_search.py
@@ -25,6 +25,7 @@
from sklearn.utils.testing import assert_false, assert_true
from sklearn.utils.testing import assert_array_equal
from sklearn.utils.testing import assert_array_almost_equal
from sklearn.utils.testing import assert_allclose
from sklearn.utils.testing import assert_almost_equal
from sklearn.utils.testing import assert_greater_equal
from sklearn.utils.testing import ignore_warnings
@@ -53,6 +54,7 @@
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import ParameterSampler
from sklearn.model_selection._search import BaseSearchCV

from sklearn.model_selection._validation import FitFailedWarning

@@ -881,8 +883,8 @@ def test_random_search_cv_results():
check_cv_results_array_types(search, param_keys, score_keys)
check_cv_results_keys(cv_results, param_keys, score_keys, n_cand)
# For random_search, all the param array vals should be unmasked
assert_false(any(cv_results['param_C'].mask) or
any(cv_results['param_gamma'].mask))
assert_false(any(np.ma.getmaskarray(cv_results['param_C'])) or
any(np.ma.getmaskarray(cv_results['param_gamma'])))


@ignore_warnings(category=DeprecationWarning)
@@ -1539,6 +1541,55 @@ def test_transform_inverse_transform_round_trip():
assert_array_equal(X, X_round_trip)


def test_custom_run_search():
def check_results(results, gscv):
exp_results = gscv.cv_results_
assert sorted(results.keys()) == sorted(exp_results)
for k in results:
if not k.endswith('_time'):
# XXX: results['params'] is a list :|
results[k] = np.asanyarray(results[k])
if results[k].dtype.kind == 'O':
assert_array_equal(exp_results[k], results[k],
err_msg='Checking ' + k)
else:
assert_allclose(exp_results[k], results[k],
err_msg='Checking ' + k)

def fit_grid(param_grid):
return GridSearchCV(clf, param_grid, cv=5,
return_train_score=True).fit(X, y)

class CustomSearchCV(BaseSearchCV):
def __init__(self, estimator, **kwargs):
super(CustomSearchCV, self).__init__(estimator, **kwargs)

def _run_search(self, evaluate):
results = evaluate([{'max_depth': 1}, {'max_depth': 2}])
check_results(results, fit_grid({'max_depth': [1, 2]}))
results = evaluate([{'min_samples_split': 5},
{'min_samples_split': 10}])
check_results(results, fit_grid([{'max_depth': [1, 2]},
{'min_samples_split': [5, 10]}]))

# Using regressor to make sure each score differs
clf = DecisionTreeRegressor(random_state=0)
X, y = make_classification(n_samples=100, n_informative=4,
random_state=0)
mycv = CustomSearchCV(clf, cv=5, return_train_score=True).fit(X, y)
gscv = fit_grid([{'max_depth': [1, 2]},
{'min_samples_split': [5, 10]}])

results = mycv.cv_results_
check_results(results, gscv)
for attr in dir(gscv):
if attr[0].islower() and attr[-1:] == '_' and \
attr not in {'cv_results_', 'best_estimator_',
'refit_time_'}:
assert getattr(gscv, attr) == getattr(mycv, attr), \
"Attribute %s not equal" % attr


def test_deprecated_grid_search_iid():
depr_message = ("The default of the `iid` parameter will change from True "
"to False in version 0.22")
3 changes: 2 additions & 1 deletion sklearn/utils/testing.py
@@ -541,7 +541,8 @@ def uninstall_mldata_mock():
"RFE", "RFECV", "BaseEnsemble", "ClassifierChain",
"RegressorChain"]
# estimators that there is no way to default-construct sensibly
OTHER = ["Pipeline", "FeatureUnion", "GridSearchCV", "RandomizedSearchCV",
OTHER = ["Pipeline", "FeatureUnion",
"GridSearchCV", "RandomizedSearchCV",
"SelectFromModel", "ColumnTransformer"]

# some strange ones