
[WIP] Multiple-metric grid search #2759

Open
wants to merge 57 commits into scikit-learn:master from mblondel:multiple_grid_search

10 participants

@mblondel
scikit-learn member

This PR brings multiple-metric grid search. This is important for finding the best-tuned estimator on a per metric basis without redoing the grid / randomized search from scratch for each metric.

Highlights:

  • No public API has been broken.
  • Some parts are actually simplified.
  • Updates cross_val_score to support a list as the scoring parameter. In that case, a 2d array of shape (n_scoring, n_folds) is returned instead of a 1d array of shape (n_folds,) (see the sketch after this list).
  • Adds multiple metric support to GridSearchCV and RandomizedSearchCV.
  • Introduces _evaluate_scorers for computing several scorers without recomputing the predictions every time.
  • Fixes #2588 (regressors can now be used with metrics that require needs_threshold=True).
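
A minimal usage sketch of the proposed cross_val_score behaviour (the dataset and scorer names here are illustrative, and the (n_scoring, n_folds) return shape is what this PR proposes, not a released API):

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in later releases

iris = load_iris()
# keep two classes so that the 'f1' scorer applies without an averaging argument
mask = iris.target != 2
X, y = iris.data[mask], iris.target[mask]

# Proposed behaviour: passing a list of scorers returns an array of shape
# (n_scoring, n_folds) instead of (n_folds,).
scores = cross_val_score(SVC(), X, y, scoring=["accuracy", "f1"], cv=5)
print(scores.shape)  # (2, 5) under this PR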

Tagging this PR with the v0.15 milestone.

@mblondel
scikit-learn member

@jnothman @larsmans I've made a few preliminary changes towards supporting multiple metrics in grid search (this is still at an early stage, so I'm not submitting a PR yet). This is a bit different from what I suggested on the ML.

First, I got rid of the _ProbaScorer, _ThresholdScorer and _PredictScorer classes. Instead, I introduced a new _Scorer class whose sole purpose is to annotate metrics with greater_is_better, needs_proba, etc. Second, I introduced an evaluate_scorers function whose purpose is to evaluate several scorers without recomputing the predictions from scratch every time. The goal is to use it within cross_val_score.
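
To make the intent concrete, here is a rough conceptual sketch of such a helper (this is not the code in the branch; the score_func and needs_proba attributes are assumptions for illustration): compute each kind of prediction at most once and apply every scorer to it.

import numpy as np

def evaluate_scorers_sketch(estimator, X, y, scorers):
    # Hypothetical helper: each scorer is assumed to expose a raw
    # `score_func` and a `needs_proba` flag, as annotated by make_scorer.
    y_pred = None
    y_proba = None
    scores = []
    for scorer in scorers:
        if getattr(scorer, "needs_proba", False):
            if y_proba is None:
                y_proba = estimator.predict_proba(X)  # computed only once
            scores.append(scorer.score_func(y, y_proba))
        else:
            if y_pred is None:
                y_pred = estimator.predict(X)  # computed only once
            scores.append(scorer.score_func(y, y_pred))
    return np.array(scores)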

My branch is based on @AlexanderFabisch's recent modifications.

scikit-learn member
scikit-learn member

I know it was long ago proposed and rejected, but I still think it's worth considering using decorations on the metric functions for such annotations.

I like the idea too as we would no longer need to differentiate between metrics and scorers. One question is whether we would need to rename scoring in GridSearchCV or not.

@jnothman
scikit-learn member

We'll need to handle multilabel differently again. Firstly, you may want to use type_of_target(y) which includes distinguishing multiclass and binary. Secondly, I think we need to factor out this decision_function -> predict transformation. It's currently implemented in LabelBinarizer.inverse_transform, but I think it should be a function.
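
For reference, type_of_target lives in sklearn.utils.multiclass and distinguishes the common target formats, e.g.:

from sklearn.utils.multiclass import type_of_target

type_of_target([0, 1, 1, 0])              # 'binary'
type_of_target([0, 1, 2, 1])              # 'multiclass'
type_of_target([[1, 0, 1], [0, 1, 0]])    # 'multilabel-indicator'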

scikit-learn member

@jnothman I've made some progress on the evaluate_scorers function.
https://github.com/mblondel/scikit-learn/tree/multiple_grid_search

One concern I have is that type_of_target(y) seems fragile, especially since y here is test data. What if the test set contains only a subset of the labels in the training set?

I would also like to consider removing support for the multioutput multiclass case. The code in master just stacks each matrix in the list output by predict_proba/decision_function. First, no decision_function in scikit-learn outputs such a list. Second, if one cares about multiple outputs, stacking does not seem like a sensible thing to do. I'd rather implement proper multi-output metrics and add a needs_multiple_output option to make_scorer.

One thing we must be careful about is not overloading the metrics too much. We should not hesitate to create separate functions in some cases.

@arjoly, your comments are welcome too.

scikit-learn member
scikit-learn member

One concern I have is that type_of_target(y) seems fragile, especially since y here is test data. What if the test set contains only a subset of the labels in the training set?

This is fragile only with multiclass / multioutput-multiclass data. One way to be more robust would be to use the classes_ attribute of the estimator.

In my opinion, the main issue is that we need to infer the label set from (possibly) a subset of the data. Should we add a way to set the label set prior to cross-validation or grid searching?

I would also like to consider removing support for the multioutput multiclass case. The code in master just stacks each matrix in the list output by predict_proba/decision_function. First, no decision_function in scikit-learn outputs such a list. Second, if one cares about multiple outputs, stacking does not seem like a sensible thing to do. I'd rather implement proper multi-output metrics and add a needs_multiple_output option to make_scorer.

The issue is that tree / forest based methods consider multilabel data as multioutput-multiclass data. This has been discussed in #2451.

Nevertheless, support for the multi-output multi-class decision function could be removed. I added this code for completeness and to make sure that roc_auc_score and average_precision_score would work in all situations.

One thing we must be careful about is not overloading the metrics too much. We should not hesitate to create separate functions in some cases.

I agree that we should consider splitting the function for all metrics that consider binary, multiclass, multilabel or multioutput-multiclass differently.

scikit-learn member

I agree that we should consider splitting the function for all metrics that consider binary, multiclass, multilabel or multioutput-multiclass differently.

Maybe not for binary/multiclass (I don't want to have to change my code when I switch from a binary dataset to a multiclass one), but probably for single-label/multi-label.

scikit-learn member

Can you give examples of metrics that you would want to split?

mblondel added some commits Jan 15, 2014
@arjoly

This one should work with one-versus-all, no?

@arjoly

This one corresponds to a tree learned on multilabel data, no?

scikit-learn member

The two tests I removed don't pass without this line:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/scorer.py#L142

I'd prefer to avoid such a hack for now until we agree on how to properly support the multi-output multiclass case.

scikit-learn member

All right!

@coveralls

Coverage Status

Coverage remained the same when pulling 47dd41c on mblondel:multiple_grid_search into 1a887dd on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 75a8762 on mblondel:multiple_grid_search into 1a887dd on scikit-learn:master.

@jnothman
scikit-learn member

Introduces grid_search_cv and randomized_search_cv functions which also accept lists as scoring parameter. GridSearchCV and RandomizedSearchCV are reserved for single-metric hyper-parameter tuning since they need to behave like a predictor.

Interesting approach. This explicitly excludes the P/R/F use case in which you always want to use F as your objective, but evaluate the others for diagnostic purposes. Would you suggest:

  • leaving that use-case to an arbitrary callback approach?
  • having a way to specify which metric should be used in predicting?
  • something else?
@jnothman jnothman commented on the diff Jan 16, 2014
sklearn/metrics/scorer.py
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ has_pb = hasattr(estimator, "predict_proba")
+ has_df = hasattr(estimator, "decision_function")
+ _is_classifier = is_classifier(estimator)
+ _type_of_y = type_of_target(y)
+
+ # Make a first pass through scorers to determine if we need
+ # predict_proba or decision_function.
+ needs_proba = False
+ needs_df = False
+ for scorer in scorers:
@jnothman
scikit-learn member
jnothman added a note Jan 16, 2014

You need to handle custom scorers here, which may not have the attributes you are expecting.

@mblondel
scikit-learn member
mblondel added a note Jan 17, 2014

I was assuming that all scorers have been built with make_scorer. What use case do you have in mind?

@jnothman
scikit-learn member
jnothman added a note Jan 17, 2014

http://scikit-learn.org/dev/modules/model_evaluation.html#implementing-your-own-scoring-object

Though I also noticed this falsehood in the previous section: "The scoring parameter can be a callable that takes model predictions and ground truth."

@mblondel
scikit-learn member
mblondel added a note Jan 17, 2014

Hmm, good catch. I can add support for that, but it will of course trigger recomputation of the predictions in the callable.

@mblondel
scikit-learn member
mblondel added a note Feb 7, 2014

You need to handle custom scorers here, which may not have the attributes you are expecting.

Done. This simplified grid_search.py a bit.

@jnothman jnothman closed this Jan 16, 2014
@jnothman jnothman reopened this Jan 16, 2014
@jnothman
scikit-learn member

@AlexanderFabisch may also be interested in where this is going.

@jnothman jnothman referenced this pull request Jan 16, 2014
Open

[WIP] sample_weight support #1574

2 of 6 tasks complete

@mblondel
scikit-learn member

@jnothman It seems to me that you could use the grid_scores list returned by grid_search_cv for that. It gives the mean and standard deviation of each parameter combination against all metrics. Are the mean and std enough for you or do you want the scores on all the cross-validation folds?

@jnothman
scikit-learn member

I haven't had time to look at your implementation yet; my point is that I might still want to use it as a predictor object.

@mblondel
scikit-learn member

You can use it for model selection by setting refit=False. grid_scores will only give you the means and stds, though.

Edit: sorry, probably misunderstood your previous comment.

@mblondel
scikit-learn member

For the output of grid_search_cv and randomized_search_cv, what do people think of using a named tuple?

@jnothman
scikit-learn member

You can use it for model selection by setting refit=False. grid_scores will only give you the means and stds, though.

What do you think of GridSearchCV(..., scoring=['micro_f1', 'micro_precision', 'micro_recall'], refit='micro_f1'), while GridSearchCV(..., scoring=['micro_f1', 'micro_precision', 'micro_recall'], refit=True) will raise ValueError('refit does not specify the score to optimise')?

For the output of grid_search_cv and randomized_search_cv, what do people think of using a named tuple?

To what perceived benefit?

@mblondel
scikit-learn member

Note to myself: flip the signs back when the metric has greater_is_better=False (see #2439)

@mblondel
scikit-learn member

What do you think of [...]

I'd rather not change the meaning of refit for backward compatibility. How about we use the convention that the first scorer in the list is the one used in the predict method? We could also add a method set_best_estimator(scorer) where scorer is a string or an integer index.

To what perceived benefit?

The nice thing about named tuples is that they combine the advantages of tuples and dicts.

You can do unpacking:

best_params, best_scores, grid_scores, best_estimators = grid_search_cv(estimator, param_grid, X, y)

You can access by index:

grid_search_cv(estimator, param_grid, X, y)[3]

You can access by name:

grid_search_cv(estimator, param_grid, X, y).best_estimators
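
For illustration, a sketch of such a named tuple (the type and field names are hypothetical, mirroring the values discussed above):

from collections import namedtuple

# Hypothetical return type for grid_search_cv / randomized_search_cv.
SearchResult = namedtuple(
    "SearchResult",
    ["best_params", "best_scores", "grid_scores", "best_estimators"])

result = SearchResult(best_params=[{"C": 1.0}], best_scores=[0.97],
                      grid_scores=[], best_estimators=[None])

best_params, best_scores, grid_scores, best_estimators = result  # unpacking
result[3]                  # access by index
result.best_estimators     # access by name
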
@jnothman
scikit-learn member

I'd rather not change the meaning of refit for backward compatibility.

Its meaning is unchanged when there are not multiple scorers.

How about we use the convention that the first scorer in the list is the one used in the predict method?

This might suffice, but is not as explicit.

The nice thing about named tuples is that they combine the advantages of tuples and dicts.

I'd say the nice thing in this context would be to have names identical to _CVScoreTuple. But named tuples are not really extensible: here you already have the problem of returning either a 3-tuple or a 4-tuple. Similarly, for a forward-extensible API, cv_scores_ should return dicts or recarrays that are not unpackable.

@mblondel
scikit-learn member

I will add multiple-metric support directly to GridSearchCV and RandomizedSearchCV. This way, I think we can even get rid of grid_search_cv and randomized_search_cv.

In your use case, can you live with two extra refits for the precision and recall metrics?

We can also detect whether two metrics happened to choose the same parameter combination, in order to avoid refitting twice.

@jnothman
scikit-learn member
@arjoly arjoly and 3 others commented on an outdated diff Jan 17, 2014
sklearn/metrics/scorer.py
- """
- y_type = type_of_target(y)
- if y_type not in ("binary", "multilabel-indicator"):
- raise ValueError("{0} format is not supported".format(y_type))
-
+ for k, v in self.kwargs.items()])
+ return ("make_scorer(%s%s%s)"
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ has_pb = hasattr(estimator, "predict_proba")
+ has_df = hasattr(estimator, "decision_function")
+ _is_classifier = is_classifier(estimator)
@arjoly
scikit-learn member
arjoly added a note Jan 17, 2014

Can we assume that the estimator is derived from ClassifierMixin?

@mblondel
scikit-learn member
mblondel added a note Jan 17, 2014

is_classifier is used in several places in the scikit already, including check_cv.

@mblondel
scikit-learn member
mblondel added a note Jan 17, 2014

I could use hasattr(estimator, "classes_") instead (any classifier without this attribute should be updated).

@arjoly
scikit-learn member
arjoly added a note Jan 17, 2014

+1 for the idea
+1 for modifying is_classifier in base.py

@mblondel
scikit-learn member
mblondel added a note Jan 17, 2014

If other people agree, I will create a separate issue for that.

@jnothman
scikit-learn member
jnothman added a note Jan 18, 2014

-1 for testing for classes_. is_classifier has to be applicable before fitting, and adding more stop-gap conditions doesn't make it any more definitive. is_classifier is problematic, but particularly for meta-estimators (e.g. Pipeline) whose only polymorphic solution now requires the estimator to implement __issubclass__ to conditionally return ClassifierMixin! I think this is bad design, but checking for classes_ is far from helpful.

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Feb 4, 2014

I agree with the fact that the code path underlying is_classifier is messy and badly designed. However, right now I believe that it is better to use is_classifier rather than calling isinstance. That way the code path for testing whether an estimator is a classifier is the same everywhere.

In the long run, I have the impression that registering a 'virtual subclass' of an ABCMeta class might be a way to explore:
http://docs.python.org/2/library/abc.html

@mblondel
scikit-learn member
mblondel added a note Feb 4, 2014

And is_classifier is already used before this code gets executed anyway:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1101

@arjoly arjoly commented on an outdated diff Jan 17, 2014
sklearn/metrics/scorer.py
- y_type = type_of_target(y)
- if y_type not in ("binary", "multilabel-indicator"):
- raise ValueError("{0} format is not supported".format(y_type))
-
+ for k, v in self.kwargs.items()])
+ return ("make_scorer(%s%s%s)"
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ has_pb = hasattr(estimator, "predict_proba")
+ has_df = hasattr(estimator, "decision_function")
+ _is_classifier = is_classifier(estimator)
+ _type_of_y = type_of_target(y)
@arjoly
scikit-learn member
arjoly added a note Jan 17, 2014

To be more robust here in classification, you can infer the label set from y and estimator.classes_. Then you can correct the type of y from binary to multiclass classification if necessary.
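
A small sketch of that suggestion (the helper name is made up; it only corrects the binary/multiclass ambiguity, which is where inferring the type from a test fold is most fragile):

from sklearn.utils.multiclass import type_of_target

def corrected_type_of_target(estimator, y):
    # Hypothetical helper: use the label set seen at fit time
    # (estimator.classes_) to correct the type inferred from a possibly
    # incomplete test fold.
    y_type = type_of_target(y)
    classes = getattr(estimator, "classes_", None)
    if y_type == "binary" and classes is not None and len(classes) > 2:
        y_type = "multiclass"
    return y_type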

@coveralls

Coverage Status

Coverage remained the same when pulling c4905c3 on mblondel:multiple_grid_search into 1a887dd on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 40f6ef7 on mblondel:multiple_grid_search into 23fb798 on scikit-learn:master.

@jnothman jnothman commented on an outdated diff Jan 20, 2014
sklearn/grid_search.py
+ n_candidates = len(parameter_iterable)
+ print("Fitting {0} folds for each of {1} candidates, totalling"
+ " {2} fits".format(len(cv), n_candidates,
+ n_candidates * len(cv)))
+
+ base_estimator = clone(estimator)
+
+ out = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)(
+ delayed(_fit_and_score)(
+ clone(base_estimator), X, y, scorers, train, test,
+ verbose, parameters, fit_params,
+ return_parameters=True)
+ for parameters in parameter_iterable
+ for train, test in cv)
+
+ # Out is a list of triplet: score, estimator, n_test_samples
@jnothman
scikit-learn member
jnothman added a note Jan 20, 2014

I don't think this comment is correct.

@jnothman jnothman and 1 other commented on an outdated diff Jan 20, 2014
sklearn/grid_search.py
+ verbose, parameters, fit_params,
+ return_parameters=True)
+ for parameters in parameter_iterable
+ for train, test in cv)
+
+ # Out is a list of triplet: score, estimator, n_test_samples
+ n_fits = len(out)
+ n_folds = len(cv)
+ n_scorers = len(scorers)
+
+ grid_scores = []
+ for i in xrange(n_scorers):
+ grid_scores.append([])
+
+ for grid_start in range(0, n_fits, n_folds):
+ n_test_samples = 0
@jnothman
scikit-learn member
jnothman added a note Jan 20, 2014

can you rewrite the next 22 lines as something like:

fold_scores, n_test, _, parameters = zip(*out[grid_start:grid_start + n_folds])
parameters = parameters[0]
mean_scores = np.average(fold_scores, axis=0, weights=n_test if iid else None)
# all_scores = np.array(fold_scores).T
@mblondel
scikit-learn member
mblondel added a note Jan 20, 2014

I was planning to refactor the inner loop too but I wanted to get the test to pass first. Thanks for the code snippet!

@jnothman jnothman commented on the diff Jan 20, 2014
sklearn/grid_search.py
+
+ scores += curr_scores
+
+ if iid:
+ scores /= float(n_test_samples)
+ else:
+ scores /= float(n_folds)
+
+ for i in xrange(n_scorers):
+ # TODO: shall we also store the test_fold_sizes?
+ tup = _CVScoreTuple(parameters, scores[i], all_scores[i])
+ grid_scores[i].append(tup)
+
+ # Find the best parameters by comparing on the mean validation score:
+ # note that `sorted` is deterministic in the way it breaks ties
+ bests = [sorted(grid_scores[i], key=lambda x: x.mean_validation_score,
@jnothman
scikit-learn member
jnothman added a note Jan 20, 2014

Why not keep a single _CVScoreTuple for each parameter setting, with an array of scores for each scorer. Then return the best index according to each scorer, rather than pulling out the best scores, best params, etc. individually.

@mblondel
scikit-learn member
mblondel added a note Jan 20, 2014

I thought about this too. I think Python stores pointers to the _CVScoreTuple objects, so memory-wise there shouldn't be too much overhead. And I was concerned about backward compatibility for attributes stored in GridSearchCV, but I haven't given it much thought yet.

@jnothman
scikit-learn member
jnothman added a note Jan 20, 2014
@mblondel
scikit-learn member
mblondel commented Feb 3, 2014

@AlexanderFabisch @jnothman Is there any reason why _fit_and_score returns the times it takes for fitting and scoring? I personally don't care at all about the time it takes to score, just about training times. Is it fine with you if I change that? The reason is that I want to add a new public function cross_val_report which returns a tuple (train_scores, test_scores, train_times), as opposed to cross_val_score which just returns test scores.

Also, I can see the benefit of the return_train_scores option (scoring the training set may take time) but not of return_parameters. Is it ok if I remove it? (_fit_and_score is a private function, so it shouldn't break any external code)

@AlexanderFabisch
scikit-learn member

@AlexanderFabisch @jnothman Is there any reason why _fit_and_score returns the times it takes for fitting and scoring? I personally don't care at all about the time it takes to score, just about training times. Is it fine with you if I change that?

Yes, you can do that.

Also, I can see the benefit of the return_train_scores option (scoring the training set may take time) but not of return_parameters. Is it ok if I remove it? (_fit_and_score is a private function, so it shouldn't break any external code)

It is required in grid_search: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/grid_search.py#L377
I agree, it would be better to remove that parameter if it is possible.

@jnothman
scikit-learn member
@mblondel
scikit-learn member
mblondel commented Feb 3, 2014

I agree, it would be better to remove that parameter if it is possible.

It's fine to always return the parameters because other places can just ignore them. It would be better not to return the parameters at all, but that would require turning the parameter grid iterable into a list, which I guess is not desirable for very big search spaces (or is it?).

@mblondel
scikit-learn member
mblondel commented Feb 3, 2014

I just added cross_val_report.

@jnothman
scikit-learn member
jnothman commented Feb 3, 2014

Yes, I forgot grid search has moved to using an iterable, not a sequence, so yes, returning parameters is helpful.

I'm not sure about cross_val_report. The name implies it'll print something, like classification_report, but that's no big deal. More importantly, we seem to be proliferating functions with slightly different signatures that do essentially the same thing. cross_val_report could almost be written in terms of learning_curve, if only the latter returned times (and I don't see why it shouldn't).

@mblondel
scikit-learn member
mblondel commented Feb 3, 2014

@jnothman I agree that due to classification_report, cross_val_report might be interpreted as printing something but I couldn't think of a better name... Suggestions welcome.

Regarding learning_curve, doesn't it have a train_sizes argument? And it returns the means, not the scores of each fold.

I personally prefer introducing a new function to adding return_* options to cross_val_score, but I'm happy to change if more people prefer that.

@AlexanderFabisch
scikit-learn member

We could return scores for each fold in learning_curve as well. That might be even better because we could then plot confidence intervals (with std, var, scipy.stats.sem, ...) with fill_between.

@mblondel
scikit-learn member
mblondel commented Feb 3, 2014

That might be even better because we could then plot confidence intervals

Agreed that being able to plot error-bars is important.

But I think I'd rather make train_sizes an optional argument of cross_val_report than ask people to use learning_curve(train_sizes=[1.0]). In the multiple-scorer case, the latter would return an n_scorers x n_folds x n_sizes 3d array, which is a bit inconvenient to use if all I need is the train and test scores for the full training set.

@mblondel mblondel Remove score_func from cross_val_report.
This function is deprecated, no need to add it to a new function.
e0dfe23
@GaelVaroquaux GaelVaroquaux commented on an outdated diff Feb 4, 2014
sklearn/cross_validation.py
"""
X, y = check_arrays(X, y, sparse_format='csr', allow_lists=True)
cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
- scorer = check_scoring(estimator, score_func=score_func, scoring=scoring)
- # We clone the estimator to make sure that all the folds are
- # independent, and that it is pickle-able.
+
+ if isinstance(scoring, list):
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Feb 4, 2014

I would check at least for:

isinstance(scoring, (list, tuple))

as it is always surprising when code behaves differently for tuples and lists. The question is whether we want to also include ndarrays in there (it's a bit harder).

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Feb 4, 2014
sklearn/cross_validation.py
- return_train_score : boolean, optional, default: False
- Compute and return score on training set.
+ train_times : array of float, shape=(n_folds,)
+ Array of training times of the estimator for each run of the cross
+ validation.
+ """
+ X, y = check_arrays(X, y, sparse_format='csr', allow_lists=True)
+ cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
+
+ if isinstance(scoring, list):
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Feb 4, 2014

Same remark here.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Feb 4, 2014
sklearn/cross_validation.py
-def _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,
- fit_params, return_train_score=False,
- return_parameters=False):
- """Fit estimator and compute scores for a given dataset split.
+
+def cross_val_report(estimator, X, y=None, scoring=None, cv=None, n_jobs=1,
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Feb 4, 2014

Is there a reason that cross_val_score cannot refactored to use the new function cross_val_report internally?

@jnothman
scikit-learn member
jnothman added a note Feb 4, 2014
@GaelVaroquaux
scikit-learn member
@GaelVaroquaux GaelVaroquaux commented on an outdated diff Feb 4, 2014
sklearn/metrics/scorer.py
@@ -20,6 +20,7 @@
from abc import ABCMeta, abstractmethod
from warnings import warn
+import numbers
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Feb 4, 2014

The "abc" imports are no longer used.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Feb 4, 2014
sklearn/grid_search.py
+ if y is not None:
+ if len(y) != n_samples:
+ raise ValueError('Target variable (y) has a different number '
+ 'of samples (%i) than data (X: %i samples)'
+ % (len(y), n_samples))
+ y = np.asarray(y)
+ cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
+
+ if verbose > 0:
+ if isinstance(parameter_iterable, Sized):
+ n_candidates = len(parameter_iterable)
+ print("Fitting {0} folds for each of {1} candidates, totalling"
+ " {2} fits".format(len(cv), n_candidates,
+ n_candidates * len(cv)))
+
+ base_estimator = clone(estimator)
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Feb 4, 2014

I wonder if the cloning should not happen earlier in the function.

@GaelVaroquaux GaelVaroquaux commented on the diff Feb 4, 2014
sklearn/metrics/scorer.py
-class _BaseScorer(six.with_metaclass(ABCMeta, object)):
- def __init__(self, score_func, sign, kwargs):
- self._kwargs = kwargs
- self._score_func = score_func
- self._sign = sign
+class _Scorer(object):
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Feb 4, 2014

I think that I prefer this design to the one that was there before (multiple classes) :)

@jnothman
scikit-learn member
jnothman added a note Feb 4, 2014
@mblondel
scikit-learn member
mblondel added a note Feb 5, 2014

The previous design was different in that it was trying to be a blackbox: attributes were private and the user was only supposed to use __call__. Here, I'm just trying to add annotations to metrics. This is indeed close to decorators but one advantage of make_scorer is that you can easily pass extra parameters, e.g., SCORERS["ndcg@10"] = make_scorer(ndcg_score, k=10).

@amueller
scikit-learn member
amueller added a note Jul 17, 2014

Is there a reason this change is tied in with this PR? I would very much prefer to limit the scope of the PR.

@mblondel
scikit-learn member
mblondel added a note Jul 18, 2014

This is needed to evaluate several scorers without recomputing the predictions every time. With your design, the predictions get recomputed for every scorer. This is fine for linear models, but not so much for kernel SVMs or random forests.

@mblondel mblondel referenced this pull request Feb 6, 2014
Merged

[MRG] Validation curves #2765

@mblondel mblondel referenced this pull request Feb 7, 2014
Closed

[MRG] Learning curves #2701

@mblondel mblondel Merge branch 'master' into multiple_grid_search
Conflicts:
	sklearn/grid_search.py
	sklearn/learning_curve.py
5d8570b
@coveralls

Coverage Status

Coverage remained the same when pulling 5d8570b on mblondel:multiple_grid_search into 5319994 on scikit-learn:master.

@mblondel
scikit-learn member
mblondel commented Feb 9, 2014

Pretty happy with my evening hacking session. I made the following modifications to validation_curve:
1) multiple scorer support
2) return training times
3) support for arbitrary parameter grid

1) and 2) were so that I can remove cross_val_report. If only one scorer is used, the scorer axis is flattened.

3) is important because hyper-parameters sometimes interact. It adds much more flexibility without making the interface more complicated.

I also added an example which demonstrates the interactions between C and gamma in SVC:
(contour plot of validation scores over the C / gamma grid)

@AlexanderFabisch Does it sound good to you? If it's OK with you, I will add multiple scorer support to learning_curve and return training times, to be consistent with validation_curve.

@AlexanderFabisch AlexanderFabisch commented on the diff Feb 9, 2014
sklearn/learning_curve.py
return np.array((train_scores, test_scores)).T
-def validation_curve(estimator, X, y, param_name, param_range, cv=None,
+def validation_curve(estimator, X, y, param_grid, cv=None,
@AlexanderFabisch
scikit-learn member

Is there a reason why you changed this? Is it now possible to pass multiple parameter ranges?

@mblondel
scikit-learn member
mblondel added a note Feb 10, 2014

Yes, see my previous comment and the new example plot_validation_contour.py.

@AlexanderFabisch
scikit-learn member

I think multiple scorer support in learning_curve and validation_curve is a good idea.

I am not completely convinced that we should always return the training times. Maybe that makes the interface too complicated but I am not sure yet.

By the way, there is narrative documentation with doctests, and there are two examples (plot_learning_curve.py and plot_validation_curve.py) that have to be fixed.

@mblondel
scikit-learn member

I am not completely convinced that we should always return the training times. Maybe that makes the interface too complicated but I am not sure yet.

I think we should always return them. For learning_curve, it is useful to know how the algorithm scales with respect to n_samples (linearly, quadratically, ...). For validation_curve, it is useful to know the effect of a parameter on training time. For example, in SVC, training will typically be faster with a smaller C.

I'd rather not add a return_training_times=False argument just so that you can do train_scores, test_scores = validation_curve(...) instead of train_scores, test_scores, _ = validation_curve(...).

@jnothman
scikit-learn member

I don't altogether object to handling n-dimensional parameter grids for validation curve, but:

  • Handling sequences of grids -- which ParameterGrid provides for -- is irrelevant here (it's relevant for GridSearchCV because it selects the optimum among them)
  • I think the output should be shaped so that there is an axis for each parameter (and similar utilities have been proposed for the output of GridSearchCV)
  • The implicit ordering of the axes (parameter names sorted alphabetically, case-sensitively) isn't good. I would even consider allowing the user to specify an ordering (in contravention of ParameterGrid's current API) and/or including the ordering as a return value. This and other interface differences might suggest providing validation_curve as a wrapper around an n-dimensional validation_contour.
@jnothman
scikit-learn member

And regarding training times, I think it's useful and harmless; training scores I'm not as certain about.

@jnothman
scikit-learn member

But learning_curve and validation_curve should have similar outputs if possible.

@mblondel
scikit-learn member

Handling sequences of grids -- which ParameterGrid provides for -- is irrelevant here (it's relevant for GridSearchCV because it selects the optimum among them)

Agreed. I already deleted the mention of lists of dicts in the docstring.

I think the output should be shaped so that there is an axis for each parameter

I thought about that but I think it is nice that the returned shape is always consistent regardless of the number of parameters.

The implicit ordering of the axes (parameter names sorted alphabetically, case-sensitively) isn't good

I like this convention because it makes the interface slicker IMO. It should of course be properly documented. We could also add support for ordered dicts in ParameterGrid. Or, as you suggest, we could also add an extra (optional) parameter for specifying the order (and we would document that the order is alphabetical if not provided).

@mblondel
scikit-learn member

It is true that we don't often follow the "convention over configuration" paradigm in scikit-learn. A recent counter-example is make_pipeline though.

@jnothman
scikit-learn member
@amueller
scikit-learn member

This looks great, thanks for working on this.

@jnothman
scikit-learn member

I thought about that but I think it is nice that the returned shape is always consistent regardless of the number of parameters.

I'm not convinced by this. I assume (but haven't reviewed the code) that multiple scorers change the ndim; why shouldn't multiple parameters? I think it's a natural way to view the data. But I'll settle for a doctest example that shows the reshape (and/or meshgrid) in action when finding the mean CV score over a 2d grid.
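
As a sketch of what that doctest could show (parameter names, values and the alphabetical flat ordering are assumptions for illustration):

import numpy as np

C_range = [0.1, 1.0, 10.0]
gamma_range = [1e-3, 1e-2, 1e-1]

# Mean CV test scores returned as a flat array over the 2d grid, with C
# varying slowest and gamma fastest (parameter names sorted alphabetically).
mean_scores = np.linspace(0.5, 0.9, num=9)  # placeholder values

scores_2d = mean_scores.reshape(len(C_range), len(gamma_range))
CC, GG = np.meshgrid(C_range, gamma_range, indexing="ij")
# plt.contourf(CC, GG, scores_2d) would then draw the validation contour.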

@mblondel
scikit-learn member

With the new learning curve features in master, I think it would be time to have a model_selection module with grid search, cross-validation and learning curve features. @GaelVaroquaux, any objection? IIRC, @jnothman liked the idea too.

@jnothman
scikit-learn member

I think it would be time to have a model_selection module with grid search, cross-validation and learning curve features.

In #1848 there was consensus to rename grid_search. I've suggested including CV as well, if not scorers.

@jnothman
scikit-learn member

And grid_search still needs a rename. Perhaps model_selection.search.

@mblondel
scikit-learn member
@mblondel
scikit-learn member

I've suggested including CV as well, if not scorers.

+1 to move scorers to model_selection on my side

@jnothman jnothman commented on the diff Feb 10, 2014
sklearn/grid_search.py
+
+ for i in xrange(n_scorers):
+ # TODO: shall we also store the test_fold_sizes?
+ tup = _CVScoreTuple(parameters, mean_scores[i], fold_scores[:, i])
+ grid_scores[i].append(tup)
+
+ # Find the best parameters by comparing on the mean validation score:
+ # note that `sorted` is deterministic in the way it breaks ties
+ bests = [sorted(grid_scores[i], key=lambda x: x.mean_validation_score,
+ reverse=True)[0] for i in xrange(n_scorers)]
+ best_params = [best.parameters for best in bests]
+ best_scores = [best.mean_validation_score for best in bests]
+
+ best_estimators = []
+ if refit:
+ for i in xrange(len(scorers)):
@jnothman
scikit-learn member
jnothman added a note Feb 10, 2014

I still don't see the point in refitting all best estimators -- presumably an expensive task -- if only one will be used for prediction in a grid search. This forces the user to fit additional models when they only want one, while refitting only one does not stop the user from fitting additional models for each other objective.

If all must be fit, at least use Parallel?

@mblondel
scikit-learn member
mblondel added a note Feb 10, 2014
@mblondel
scikit-learn member
mblondel added a note Feb 10, 2014

How about: refit="all" or refit=idx or refit="string-identifier" (only if not a callable scorer)?

if only one will be used for prediction in a grid search

I was thinking people could also use best_estimators_[idx].predict directly.

@jnothman
scikit-learn member
jnothman added a note Feb 10, 2014

How about: refit="all" or refit=idx or refit="string-identifier" (only if not a callable scorer)?

I earlier suggested refit="string-identifier" and had thought idx equally. I don't mind this at all.

I was thinking people could also use best_estimators_[idx].predict directly.

Yes, but they could equally fit it themselves if this is how they're to use it. But only one best estimator can be used with GridSearchCV's fit-and-predict API...

That is, unless best_estimator_ is affected by the user changing refit at runtime:

>>> clf = GridSearchCV(SVC(), ..., scoring=['f1', 'roc_auc'], refit=True)
>>> clf.fit(X, y)
>>> clf.refit = 'f1'   # equivalently, clf.set_params(refit='f1') or clf.set_params(refit=0)
>>> clf.predict(X_test)  # predict with the best scorer by F1
>>> clf.refit = 'roc_auc'
>>> clf.predict(X_test)  # predict with the best scorer by ROC

This is a bit like changing the K parameter in SelectKBest after fitting the scores once...

@amueller
scikit-learn member
amueller added a note Feb 29, 2016

I would use the first entry in the scoring list to find best_params and refit only for that. The other scores I'd just record.
That would not allow the change at runtime, as you suggested. However, is that a common use-case, or one that we should encourage? [I feel it encourages trying out multiple estimators on the test set]

@jnothman
scikit-learn member
jnothman added a note Mar 1, 2016

Happy to go with that.

@mblondel mblondel commented on the diff Feb 11, 2014
sklearn/cross_validation.py
X_train, y_train = _safe_split(estimator, X, y, train)
X_test, y_test = _safe_split(estimator, X, y, test, train)
+
+ start_time = time.time()
@mblondel
scikit-learn member
mblondel added a note Feb 11, 2014
@ogrisel
scikit-learn member
ogrisel commented Jun 9, 2014

@mblondel this PR is still WIP (and I have not found the time to review it myself yet). As it's a non-trivial new feature with API impact, I think it's risky to include it in 0.15.

I removed the 0.15 tag. Please let us know if you have an objection.

@ogrisel ogrisel removed this from the 0.15 milestone Jun 9, 2014
@mblondel
scikit-learn member
mblondel commented Jun 9, 2014

+1

Sorry I forgot to remove the tag myself.

@amueller amueller commented on the diff Jul 17, 2014
sklearn/cross_validation.py
- scoring_time = time.time() - start_time
+ train_time = time.time() - start_time
+
+ test_scores = _evaluate_scorers(estimator, X_test, y_test, scorers)
+
+ if return_train_scores:
+ if len(scorers) == 1:
@amueller
scikit-learn member
amueller added a note Jul 17, 2014

What is the benefit of this if statement?

@mblondel
scikit-learn member
mblondel added a note Jul 18, 2014

Indeed, I don't think the len(scorers) == 1 case needs to be treated separately.

@amueller amueller commented on the diff Jul 17, 2014
sklearn/metrics/scorer.py
- y_type = type_of_target(y)
- if y_type not in ("binary", "multilabel-indicator"):
- raise ValueError("{0} format is not supported".format(y_type))
-
+ for k, v in self.kwargs.items()])
+ return ("make_scorer(%s%s%s)"
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ """Evaluate a list of scorers. `scorers` may contain _Scorer objects or
+ callables of the form callable(estimator, X, y)."""
+
+ if len(scorers) == 1 and not isinstance(scorers[0], _Scorer):
@amueller
scikit-learn member
amueller added a note Jul 17, 2014

Why is the second condition here?

@mblondel
scikit-learn member
mblondel added a note Jul 18, 2014

This condition is for handling a user-defined callable. It's a complete black box from our point of view.

@amueller
scikit-learn member

What is the rationale behind returning a 2d array? I think I would have preferred a dict. Also, I don't understand how (or whether) you return the training score in GridSearchCV.

@amueller
scikit-learn member

@mblondel do you plan to work on this? I have half a mind to take it over ;)

@mblondel
scikit-learn member

@amueller That would be nice! (No plan to work on this in the near future...)

@jnothman
scikit-learn member

@amueller wrote:

What is the rationale behind returning a 2d array? I think I would have preferred a dict.

What are the dict keys when the scorers aren't entered as strings?

I'd really like to see this happen. I'd happily attempt to complete the PR. What do you consider to still be lacking, @mblondel?

I am however a little concerned that any code that attempts efficient calculation of multiple scorers (with the current definition of scorer) is going to be frameworkish to a great extent, and hence will be difficult to get merged. Is there some way to limit this?

@mblondel
scikit-learn member
@jnothman
scikit-learn member

Note: this PR fixes #1850.

@jnothman
scikit-learn member

I still consider this feature sorely missing and of high priority. And seeing as it was possible to get multiple metrics back prior to the advent of scorers (because there was no output checking on score_func), it is really fixing a functional regression.

Obviously a lot of the codebase has changed since this PR was launched and much of the work yet to be done is transferring changes onto moved code. I do wonder whether there's a way to get it merged piece by piece in an agile way, or whether we just need a monolithic PR at the risk of growing stale again. Certainly, some of the auxiliary changes to validation_curve should be left for another time; and if there are any concerns, we can make do with a simplified _evaluate_scorers in which prediction is repeated for each scorer.

I think @mblondel has made some reasonable API decisions here, but we should decide on the following:

I think the only substantial question is whether scores should be a dict {scorer_name: score_data_structure} (for all of cross_val_score, validation_curve, *SearchCV.grid_scores_, *SearchCV.best_score_) or a list/array. I think a list, as presently implemented, is straightforward enough to understand and code with, but the resulting code may not be as legible as with a dict. If we choose dict output, then when scorers are custom, will the dict key just be a scorer object / function, or will it be possible to specify scoring as a dict {name: scorer}? I also anticipate a grid_scores_to_dataframe function in sklearn-pandas in any case!
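
A hedged sketch of the dict variant being weighed here (purely illustrative; not what this PR currently implements):

from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score

# Scoring specified as {name: scorer}, so that custom scorers get explicit
# names to key the results by.
scoring = {
    "f1": make_scorer(f1_score),
    "precision": make_scorer(precision_score),
    "recall": make_scorer(recall_score),
}
# Results such as grid_scores_ or best_score_ could then be dicts keyed by
# "f1", "precision" and "recall".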

The other issue is refit, and whether multiple best_estimator_ values can be refit. I think not. IMO, if multiple scoring values are specified, refit=True should be illegal, requiring refit=0 (indexing into a scoring list) or refit='mse'. I think the latter is my initial preference.

@amueller and others, do you have an opinion on these API issues?

Apart from those things, what remains to be done appears to be: moving the changes to the current codebase; ensuring test coverage; documentation; and an example or two.

@maniteja123

Hi everyone, it seems that the scorer API has changed a lot since this PR was started. I have tried to follow the discussions in issue #1850 and PR #2123 and would like to ask your opinion on working on this enhancement. I understand that this touches a lot of API and needs a strong decision from the core devs. I would like to know whether it is possible for a newbie to work on this? Thanks.

@jnothman
scikit-learn member

I don't think the scorer API has changed much, no.

@maniteja123

Oh, really sorry. I have been going through all the related PRs for some time and confused it with the changes in grid_search, cross_val_score and others that take scorers as arguments and will probably be affected here. I just wanted to ask whether it would be advisable to work on this feature. Thanks again.

@jnothman
scikit-learn member

I think it would be a decent thing to work on, if you're comfortable with that part of the code base.

@maniteja123

I have looked into the main aim of this PR, coming from here and the discussion at #1850, but if it is okay, could I ask about the decisions regarding:

  • the type of scoring as list/tuple or dict
  • how the refit option should work
  • _evaluate_scorers seems to have been agreed upon.
  • this PR seems to refactor to a single common _Scorer class rather than the _PredictScorer, _ProbaScorer and _ThresholdScorer classes of the current implementation; which is preferred?
  • What else, other than cross_val_score, validation_curve, *SearchCV.grid_scores_ and *SearchCV.best_score_, could support this feature?
  • Also, as a side note (though #3524 is not related to this), could you also tell me the status of adding sample_weight and scorer_params to cross_val_score, GridSearchCV and others, as mentioned in #1574?

Sorry for asking so many questions. I am not that aware of the practical use cases and am trying to get an idea based on the discussions here. Thanks again for patiently answering.

@maniteja123

Sorry @rvraghav93, I didn't know that you intended to work on this issue. I just thought it was challenging and also important; that's why I looked at it.

@rvraghav93

Yes. @amueller and I had a discussion on this. I've also let @GaelVaroquaux know that I will be working on this after my tree PRs.

@jnothman
scikit-learn member