[WIP] Multiple-metric grid search #2759

Open
wants to merge 57 commits into scikit-learn:master from mblondel:multiple_grid_search

8 participants

@mblondel
Owner

This PR brings multiple-metric grid search. This is important for finding the best-tuned estimator on a per-metric basis without redoing the grid / randomized search from scratch for each metric.

Highlights:

  • No public API has been broken.
  • Some parts are actually simplified.
  • Updates cross_val_score to support lists as the scoring parameter. In this case, a 2d array of shape (n_scoring, n_folds) is returned instead of a 1d array of shape (n_folds,) (see the usage sketch after this list).
  • Adds multiple metric support to GridSearchCV and RandomizedSearchCV.
  • Introduces _evaluate_scorers for computing several scorers without recomputing the predictions every time.
  • Fixes #2588 (regressors can now be used with metrics that require needs_threshold=True).
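A minimal usage sketch of the proposed list-scoring behavior (the dataset and estimator are only illustrative, and this reflects the PR branch, not released scikit-learn):

```python
from sklearn.cross_validation import cross_val_score
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(random_state=0)

# With this PR, a list of scorings would return a 2d array of shape
# (n_scoring, n_folds); a single string still returns shape (n_folds,).
scores = cross_val_score(SVC(), X, y, scoring=["accuracy", "f1"], cv=3)
print(scores.shape)  # (2, 3) on the PR branch
```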

Tagging this PR with the v0.15 milestone.

@mblondel
Owner

@jnothman @larsmans I've made a few preliminary changes towards supporting multiple metrics in grid search (this is still at an early stage, so I'm not submitting a PR yet). This is a bit different from what I suggested on the ML.

First, I got rid of the _ProbaScorer, _ThresholdScorer and _PredictScorer classes. Instead, I introduced a new _Scorer class whose sole purpose is to annotate metrics with greater_is_better, needs_proba, etc. Second, I introduced an evaluate_scorers function whose purpose is to evaluate several scorers without recomputing the predictions from scratch every time. The goal is to use it within cross_val_score.
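For illustration, the annotation-only design could look roughly like the sketch below (attribute names are inferred from this discussion and the diff hunks further down, not copied from the branch):

```python
class _Scorer(object):
    """Annotate a metric with the information needed to evaluate it."""

    def __init__(self, score_func, greater_is_better=True,
                 needs_proba=False, needs_threshold=False, **kwargs):
        self.score_func = score_func
        self.greater_is_better = greater_is_better
        self.needs_proba = needs_proba
        self.needs_threshold = needs_threshold
        self.kwargs = kwargs
```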

My branch is based on @AlexanderFabisch's recent modifications.

Owner
Owner

I know it was proposed and rejected long ago, but I still think it's worth considering using decorations on the metric functions for such annotations.

I like the idea too as we would no longer need to differentiate between metrics and scorers. One question is whether we would need to rename scoring in GridSearchCV or not.

@jnothman

We'll need to handle multilabel differently again. Firstly, you may want to use type_of_target(y) which includes distinguishing multiclass and binary. Secondly, I think we need to factor out this decision_function -> predict transformation. It's currently implemented in LabelBinarizer.inverse_transform, but I think it should be a function.

Owner

@jnothman I've made some progress on the evaluate_scorers function.
https://github.com/mblondel/scikit-learn/tree/multiple_grid_search

One concern I have is that type_of_target(y) sounds fragile, especially since y here is test data. What if the test set contains only a subset of the labels in the training set?

I would also like to consider removing support for the multioutput multiclass case. The code in master just stacks each matrix in the list output by predict_proba/decision_function. First, there's no decision_function that does that in scikit-learn. Second, if one cares about multiple outputs, stacking does not seem like a sensible thing to do. I'd rather implement proper multiple-output metrics and add an option needs_multiple_output to make_scorer.

One thing we must be careful about is not to overload the metrics too much. We should not hesitate to create separate functions in some cases.

@arjoly, your comments are welcome too.

Owner
Owner

One concern I have is that type_of_target(y) sounds fragile, especially since y here is test data. What if the test set contains only a subset of the labels in the training set?

This is fragile only with multiclass / multioutput-multiclass data. One way to be more robust would be to use the classes_ attribute of the estimator.

In my opinion, the main issue is that we need to infer the label set from (possibly) a subset of the data. Should we add a way to set the labels prior to cross-validation or grid searching?

I would also like to consider removing support for the multioutput multiclass case. The code in master just stacks each matrix in the list output by predict_proba/decision_function. First, there's no decision_function that does that in scikit-learn. Second, if one cares about multiple outputs, stacking does not seem like a sensible thing to do. I'd rather implement proper multiple-output metrics and add an option needs_multiple_output to make_scorer.

The issue is that tree / forest based methods consider multilabel data as multioutput-multiclass data. This has been discussed in #2451.

Nevertheless, the support for multi-output multi-class decision functions could be removed. I added this code for completeness, to make sure that roc_auc_score and average_precision_score would work in all situations.

One thing we must be careful about is not to overload the metrics too much. We should not hesitate to create separate functions in some cases.

I agree that we should consider splitting the function for all metrics that consider binary, multiclass, multilabel or multioutput-multiclass differently.

Owner

I agree that we should consider splitting the function for all metrics that consider binary, multiclass, multilabel or multioutput-multiclass differently.

Maybe not for binary/multiclass (I don't want to have to change my code if I use a binary dataset instead of a multiclass dataset) but probably for single-label/multi-label.

Owner

Can you give example(s) of metrics that you would want to split?

mblondel added some commits
@arjoly

This one should work with one-versus-all, no?

@arjoly

This one corresponds to a tree learned on multilabel data, no?

Owner

The two tests I removed don't pass without this line:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/scorer.py#L142

I'd prefer to avoid such a hack for now until we agree on how to properly support the multi-output multiclass case.

Owner

All right!

@coveralls

Coverage Status

Coverage remained the same when pulling 47dd41c on mblondel:multiple_grid_search into 1a887dd on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 75a8762 on mblondel:multiple_grid_search into 1a887dd on scikit-learn:master.

@jnothman
Owner

Introduces grid_search_cv and randomized_search_cv functions which also accept lists as scoring parameter. GridSearchCV and RandomizedSearchCV are reserved for single-metric hyper-parameter tuning since they need to behave like a predictor.

Interesting approach. This explicitly excludes the P/R/F use case in which you always want to use F as your objective, but evaluate the others for diagnostic purposes. Would you suggest:

  • leaving that use-case to an arbitrary callback approach?
  • having a way to specify which metric should be used in predicting?
  • something else?
@jnothman jnothman commented on the diff
sklearn/metrics/scorer.py
((124 lines not shown))
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ has_pb = hasattr(estimator, "predict_proba")
+ has_df = hasattr(estimator, "decision_function")
+ _is_classifier = is_classifier(estimator)
+ _type_of_y = type_of_target(y)
+
+ # Make a first pass through scorers to determine if we need
+ # predict_proba or decision_function.
+ needs_proba = False
+ needs_df = False
+ for scorer in scorers:
@jnothman Owner

You need to handle custom scorers here, which may not have the attributes you are expecting.

@mblondel Owner

I was assuming that all scorers have been built with make_scorer. What use case do you have in mind?

@jnothman Owner

http://scikit-learn.org/dev/modules/model_evaluation.html#implementing-your-own-scoring-object

Though I also noticed this falsehood in the previous section: "The scoring parameter can be a callable that takes model predictions and ground truth."

@mblondel Owner

Hmm, good catch. I can add support for that, but it will of course trigger recomputation of the predictions in the callable.
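For reference, a user-defined scorer in that sense is just a callable with the signature callable(estimator, X, y); a hypothetical example:

```python
def neg_mse_scorer(estimator, X, y):
    # A black-box scorer: _evaluate_scorers cannot share predictions with
    # it, so estimator.predict gets called again inside the callable.
    y_pred = estimator.predict(X)
    return -((y - y_pred) ** 2).mean()
```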

@mblondel Owner
mblondel added a note

You need to handle custom scorers here, which may not have the attributes you are expecting.

Done. This simplified grid_search.py a bit.

@jnothman jnothman closed this
@jnothman jnothman reopened this
@jnothman
Owner

@AlexanderFabisch may also be interested in where this is going.

@jnothman jnothman referenced this pull request
Open

[WIP] sample_weight support #1574

2 of 6 tasks complete

@mblondel
Owner

@jnothman It seems to me that you could use the grid_scores list returned by grid_search_cv for that. It gives the mean and standard deviation of each parameter combination against all metrics. Are the mean and std enough for you or do you want the scores on all the cross-validation folds?

@jnothman
Owner

I haven't had time to look at your implementation yet; my point is that I might still want to use it as a predictor object.

@mblondel
Owner

You can use it for model selection by setting refit=False. grid_scores will only give you the means and stds, though.

Edit: sorry, I probably misunderstood your previous comment.

@mblondel
Owner

For the output of grid_search_cv and randomized_search_cv, what do people think of using a named tuple?

@jnothman
Owner

You can use it for model selection by setting refit=False. grid_scores will only give you the means and stds, though.

What do you think of GridSearchCV(..., scoring=['micro_f1', 'micro_precision', 'micro_recall'], refit='micro_f1'), while GridSearchCV(..., scoring=['micro_f1', 'micro_precision', 'micro_recall'], refit=True) will raise ValueError('refit does not specify the score to optimise')?

For the output of grid_search_cv and randomized_search_cv, what do people think of using a named tuple?

To what perceived benefit?

@mblondel
Owner

Note to myself: flip the signs back when the metric has greater_is_better=False (see #2439)

@mblondel
Owner

What do you think of [...]

I'd rather not change the meaning of refit for backward compatibility. How about we use the convention that the first scorer in the list is the one used in the predict method? We could also add a method set_best_estimator(scorer) where scorer is a string or an integer index.

To what perceived benefit?

The nice thing about named tuples is that they combine the advantages of tuples and dicts.

You can do unpacking:

best_params, best_scores, grid_scores, best_estimators = grid_search_cv(estimator, param_grid, X, y)

You can access by index:

grid_search_cv(estimator, param_grid, X, y)[3]

You can access by name:

grid_search_cv(estimator, param_grid, X, y).best_estimators
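For instance, a hypothetical definition matching the field names in the examples above (this is a sketch, not code from the branch):

```python
from collections import namedtuple

GridSearchResult = namedtuple(
    "GridSearchResult",
    ["best_params", "best_scores", "grid_scores", "best_estimators"])
```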
@jnothman
Owner

I'd rather not change the meaning of refit for backward compatibility.
Its meaning is unchanged when there are not multiple scorers.

How about we use the convention that the first scorer in the list is the one used in the predict method?

This might suffice, but is not as explicit.

The nice thing about named tuples is that they combine the advantages of tuples and dicts.

I'd say the nice thing in this context would be to have names identical to _CVScoreTuple. But they're not really extensible. Here you have a problem using namedtuples where you are either returning a 3-tuple or a 4-tuple. Similarly, for a forward-extensible API, cv_scores_ should return dicts or recarrays that are not unpackable.

@mblondel
Owner

I will add multiple-metric support directly to GridSearchCV and RandomizedSearchCV. This way, I think we can even get rid of grid_search_cv and randomized_search_cv.

In your use case, can you live with two extra refits for the precision and recall metrics?

We can also detect whether two metrics happened to choose the same parameter combination, in order to avoid refitting twice.

@jnothman
Owner
sklearn/metrics/scorer.py
((117 lines not shown))
- """
- y_type = type_of_target(y)
- if y_type not in ("binary", "multilabel-indicator"):
- raise ValueError("{0} format is not supported".format(y_type))
-
+ for k, v in self.kwargs.items()])
+ return ("make_scorer(%s%s%s)"
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ has_pb = hasattr(estimator, "predict_proba")
+ has_df = hasattr(estimator, "decision_function")
+ _is_classifier = is_classifier(estimator)
@arjoly Owner
arjoly added a note

Can we assume that the estimator is derived from ClassifierMixin?

@mblondel Owner

is_classifier is used in several places in the scikit already, including check_cv.

@mblondel Owner

I could use hasattr(estimator, "classes_") instead (any classifier without this attribute should be updated).

@arjoly Owner
arjoly added a note

+1 for the idea
+1 for modifying is_classifier in base.py

@mblondel Owner

If other people agree, I will create a separate issue for that.

@jnothman Owner

-1 for testing for classes_. is_classifier has to be applicable before fitting, and adding more stop-gap conditions doesn't make it any more definitive. is_classifier is problematic, but particularly for meta-estimators (e.g. Pipeline) whose only polymorphic solution now requires the estimator to implement __issubclass__ to conditionally return ClassifierMixin! I think this is bad design, but checking for classes_ is far from helpful.

@GaelVaroquaux Owner

I agree that the code path underlying is_classifier is messy and badly designed. However, right now I believe it is better to use is_classifier rather than calling isinstance. That way the code path for testing whether an estimator is a classifier is the same everywhere.

In the long run, I have the impression that registering a 'virtual subclass' of an ABCMeta class might be a way to explore:
http://docs.python.org/2/library/abc.html

@mblondel Owner
mblondel added a note

And is_classifier is already used before this code gets executed anyway:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1101

sklearn/metrics/scorer.py
((118 lines not shown))
- y_type = type_of_target(y)
- if y_type not in ("binary", "multilabel-indicator"):
- raise ValueError("{0} format is not supported".format(y_type))
-
+ for k, v in self.kwargs.items()])
+ return ("make_scorer(%s%s%s)"
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ has_pb = hasattr(estimator, "predict_proba")
+ has_df = hasattr(estimator, "decision_function")
+ _is_classifier = is_classifier(estimator)
+ _type_of_y = type_of_target(y)
@arjoly Owner
arjoly added a note

To be more robust here in classification, you can infer the label set from y and estimator.classes_. Then you can correct the type of y from binary to multiclass classification if necessary.
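A small sketch of that suggestion, using a hypothetical helper (not part of the branch):

```python
from sklearn.utils.multiclass import type_of_target

def _infer_y_type(estimator, y):
    # Widen "binary" to "multiclass" when the fitted estimator has seen
    # more than two classes, even if this test fold only contains two.
    y_type = type_of_target(y)
    if y_type == "binary" and len(getattr(estimator, "classes_", [])) > 2:
        y_type = "multiclass"
    return y_type
```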

@coveralls

Coverage Status

Coverage remained the same when pulling c4905c3 on mblondel:multiple_grid_search into 1a887dd on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 40f6ef7 on mblondel:multiple_grid_search into 23fb798 on scikit-learn:master.

sklearn/grid_search.py
((27 lines not shown))
+ n_candidates = len(parameter_iterable)
+ print("Fitting {0} folds for each of {1} candidates, totalling"
+ " {2} fits".format(len(cv), n_candidates,
+ n_candidates * len(cv)))
+
+ base_estimator = clone(estimator)
+
+ out = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)(
+ delayed(_fit_and_score)(
+ clone(base_estimator), X, y, scorers, train, test,
+ verbose, parameters, fit_params,
+ return_parameters=True)
+ for parameters in parameter_iterable
+ for train, test in cv)
+
+ # Out is a list of triplet: score, estimator, n_test_samples
@jnothman Owner

I don't think this comment is correct.

sklearn/grid_search.py
((37 lines not shown))
+ verbose, parameters, fit_params,
+ return_parameters=True)
+ for parameters in parameter_iterable
+ for train, test in cv)
+
+ # Out is a list of triplet: score, estimator, n_test_samples
+ n_fits = len(out)
+ n_folds = len(cv)
+ n_scorers = len(scorers)
+
+ grid_scores = []
+ for i in xrange(n_scorers):
+ grid_scores.append([])
+
+ for grid_start in range(0, n_fits, n_folds):
+ n_test_samples = 0
@jnothman Owner

can you rewrite the next 22 lines as something like:

fold_scores, n_test, _, parameters = zip(*out[grid_start:grid_start + n_folds])
parameters = parameters[0]
mean_scores = np.average(fold_scores, axis=0, weights=n_test if iid else None)
# all_scores = np.array(fold_scores).T
@mblondel Owner

I was planning to refactor the inner loop too but I wanted to get the test to pass first. Thanks for the code snippet!

@jnothman jnothman commented on the diff
sklearn/grid_search.py
((67 lines not shown))
+
+ scores += curr_scores
+
+ if iid:
+ scores /= float(n_test_samples)
+ else:
+ scores /= float(n_folds)
+
+ for i in xrange(n_scorers):
+ # TODO: shall we also store the test_fold_sizes?
+ tup = _CVScoreTuple(parameters, scores[i], all_scores[i])
+ grid_scores[i].append(tup)
+
+ # Find the best parameters by comparing on the mean validation score:
+ # note that `sorted` is deterministic in the way it breaks ties
+ bests = [sorted(grid_scores[i], key=lambda x: x.mean_validation_score,
@jnothman Owner

Why not keep a single _CVScoreTuple for each parameter setting, with an array of scores for each scorer. Then return the best index according to each scorer, rather than pulling out the best scores, best params, etc. individually.

@mblondel Owner

I thought about this too. I think Python stores pointers to the _CVScoreTuple objects, so memory-wise there shouldn't be too much overhead. I was also concerned with backward compatibility for attributes stored in GridSearchCV, but I haven't given it much thought yet.

@jnothman Owner
@mblondel
Owner

@AlexanderFabisch @jnothman Is there any reason why _fit_and_score returns the times it takes for fitting and scoring? I personally don't care at all about the time it takes to score, just about training times. Is it fine with you if I change that? The reason is that I want to add a new public function cross_val_report which returns a tuple (train_scores, test_scores, train_times), as opposed to cross_val_score which just returns test scores.

Also, I can see the benefit of the return_train_scores option (scoring the training set may take time) but not of return_parameters. Is it ok if I remove it? (_fit_and_score is a private function, so it shouldn't break any external code)
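A hypothetical call to the proposed function, with the return values in the order described above (the signature is assumed; cross_val_report is not part of released scikit-learn):

```python
train_scores, test_scores, train_times = cross_val_report(
    estimator, X, y, scoring=["f1", "roc_auc"], cv=5)
# Shapes would presumably follow cross_val_score's multi-scoring output:
# (n_scoring, n_folds) for the score arrays, (n_folds,) for train_times.
```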

@AlexanderFabisch

@AlexanderFabisch @jnothman Is there any reason why _fit_and_score returns the times it takes for fitting and scoring? I personally don't care at all about the time it takes to score, just about training times. Is it fine with you if I change that?

Yes, you can do that.

Also, I can see the benefit of the return_train_scores option (scoring the training set may take time) but not of return_parameters. Is it ok if I remove it? (_fit_and_score is a private function, so it shouldn't break any external code)

It is required in grid_search: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/grid_search.py#L377
I agree, it would be better to remove that parameter if it is possible.

@jnothman
Owner
@mblondel
Owner

I agree, it would be better to remove that parameter if it is possible.

It's fine to always return the parameters because other places can just ignore them. It would be better to not return the parameters at all, but that would require turning the parameter grid iterable into a list, which I guess is not desirable for very big search spaces (or is it?).

@mblondel
Owner

I just added cross_val_report.

@jnothman
Owner

Yes, I forgot grid search has moved to using an iterable, not a sequence, so yes, returning parameters is helpful.

I'm not sure about cross_val_report. The name implies it'll print something, like classification_report, but that's no big deal. But we seem to be proliferating various functions with slightly different signatures, to do essentially the same thing. cross_val_report could almost be written as a function of learning_curve, if only the latter returned times (and I don't see why it shouldn't).

@mblondel
Owner

@jnothman I agree that due to classification_report, cross_val_report might be interpreted as printing something but I couldn't think of a better name... Suggestions welcome.

Regarding learning_curve, doesn't it have a train_sizes argument? And it returns the means, not the scores of each fold.

I personally prefer introducing a new function to adding return_* options to cross_val_score, but I'm happy to change that if more people prefer the latter.

@AlexanderFabisch

We could return scores for each fold in learning_curve as well. That might be even better because we could then plot confidence intervals (with std, var, scipy.stats.sem, ...) with fill_between.
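A plotting sketch of that idea, assuming learning_curve returned per-fold test_scores of shape (n_train_sizes, n_folds) together with train_sizes (illustrative only):

```python
import matplotlib.pyplot as plt

mean = test_scores.mean(axis=1)
std = test_scores.std(axis=1)
plt.plot(train_sizes, mean, label="test score")
plt.fill_between(train_sizes, mean - std, mean + std, alpha=0.2)
plt.xlabel("number of training samples")
plt.ylabel("score")
plt.legend()
plt.show()
```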

@mblondel
Owner

That might be even better because we could then plot confidence intervals

Agreed that being able to plot error-bars is important.

But I think I'd rather make train_sizes an optional argument of cross_val_report than ask people to use learning_curve(train_sizes=[1.0]). In the multiple-scorer case, the latter would return an n_scorers x n_folds x n_sizes 3d array. This is a bit inconvenient to use if all I need is the train and test scores for the full training set.

@mblondel mblondel Remove score_func from cross_val_report.
This function is deprecated, no need to add it to a new function.
e0dfe23
sklearn/cross_validation.py
((8 lines not shown))
"""
X, y = check_arrays(X, y, sparse_format='csr', allow_lists=True)
cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
- scorer = check_scoring(estimator, score_func=score_func, scoring=scoring)
- # We clone the estimator to make sure that all the folds are
- # independent, and that it is pickle-able.
+
+ if isinstance(scoring, list):
@GaelVaroquaux Owner

I would check at least for:

isinstance(scoring, (list, tuple))

as it is always surprising when code behaves differently for tuples and lists. The question is whether we want to also include ndarrays in there (it's a bit harder).

sklearn/cross_validation.py
((63 lines not shown))
- return_train_score : boolean, optional, default: False
- Compute and return score on training set.
+ train_times : array of float, shape=(n_folds,)
+ Array of training times of the estimator for each run of the cross
+ validation.
+ """
+ X, y = check_arrays(X, y, sparse_format='csr', allow_lists=True)
+ cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
+
+ if isinstance(scoring, list):
@GaelVaroquaux Owner

Same remark here.

sklearn/cross_validation.py
((44 lines not shown))
-def _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,
- fit_params, return_train_score=False,
- return_parameters=False):
- """Fit estimator and compute scores for a given dataset split.
+
+def cross_val_report(estimator, X, y=None, scoring=None, cv=None, n_jobs=1,
@GaelVaroquaux Owner

Is there a reason that cross_val_score cannot be refactored to use the new function cross_val_report internally?

@jnothman Owner
jnothman added a note
@GaelVaroquaux Owner
sklearn/metrics/scorer.py
@@ -20,6 +20,7 @@
from abc import ABCMeta, abstractmethod
from warnings import warn
+import numbers
@GaelVaroquaux Owner

The "abc" imports are no longer used.

sklearn/grid_search.py
((17 lines not shown))
+ if y is not None:
+ if len(y) != n_samples:
+ raise ValueError('Target variable (y) has a different number '
+ 'of samples (%i) than data (X: %i samples)'
+ % (len(y), n_samples))
+ y = np.asarray(y)
+ cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
+
+ if verbose > 0:
+ if isinstance(parameter_iterable, Sized):
+ n_candidates = len(parameter_iterable)
+ print("Fitting {0} folds for each of {1} candidates, totalling"
+ " {2} fits".format(len(cv), n_candidates,
+ n_candidates * len(cv)))
+
+ base_estimator = clone(estimator)
@GaelVaroquaux Owner

I wonder if the cloning should not happen earlier in the function.

@GaelVaroquaux GaelVaroquaux commented on the diff
sklearn/metrics/scorer.py
((5 lines not shown))
-class _BaseScorer(six.with_metaclass(ABCMeta, object)):
- def __init__(self, score_func, sign, kwargs):
- self._kwargs = kwargs
- self._score_func = score_func
- self._sign = sign
+class _Scorer(object):
@GaelVaroquaux Owner

I think that I prefer this design to the one that was there before (multiple classes) :)

@jnothman Owner
jnothman added a note
@mblondel Owner
mblondel added a note

The previous design was different in that it was trying to be a black box: attributes were private and the user was only supposed to use __call__. Here, I'm just trying to add annotations to metrics. This is indeed close to decorators, but one advantage of make_scorer is that you can easily pass extra parameters, e.g., SCORERS["ndcg@10"] = make_scorer(ndcg_score, k=10).

@amueller Owner

Is there a reason this change is tied in with this PR? I would very much prefer to limit the scope of the PR.

@mblondel Owner

This is needed to evaluate several scorers without recomputing the predictions every time. With your design, the predictions get recomputed for every scorer. This is fine for linear models, but not so much for kernel SVMs or random forests.
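To make the motivation concrete, here is a simplified sketch of the shared-prediction idea (not the branch code; it assumes the _Scorer attributes discussed above, and sign handling for greater_is_better=False is left out):

```python
def evaluate_scorers_sketch(estimator, X, y, scorers):
    # Compute each kind of prediction at most once and reuse it for every
    # scorer that needs it.
    cache = {}
    results = []
    for scorer in scorers:
        if getattr(scorer, "needs_proba", False):
            key, method = "proba", estimator.predict_proba
        elif getattr(scorer, "needs_threshold", False):
            key, method = "decision", estimator.decision_function
        else:
            key, method = "predict", estimator.predict
        if key not in cache:
            cache[key] = method(X)
        results.append(scorer.score_func(y, cache[key], **scorer.kwargs))
    return results
```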

@mblondel mblondel referenced this pull request
Merged

[MRG] Validation curves #2765

@mblondel mblondel referenced this pull request
Closed

[MRG] Learning curves #2701

@mblondel mblondel Merge branch 'master' into multiple_grid_search
Conflicts:
	sklearn/grid_search.py
	sklearn/learning_curve.py
5d8570b
@coveralls

Coverage Status

Coverage remained the same when pulling 5d8570b on mblondel:multiple_grid_search into 5319994 on scikit-learn:master.

@mblondel
Owner

Pretty happy with my evening hacking session. I made the following modifications to validation_curve:
1) multiple scorer support
2) return training times
3) support for arbitrary parameter grid

1) and 2) were so that I can remove cross_val_report. If only one scorer is used, the scorer axis is flattened.

3) is important because some hyper-parameters may sometimes interact. 3) adds much more flexibility without making the interface more complicated.

I also added an example which demonstrates the interactions between C and gamma in SVC:
[contour plot of the validation score over the (C, gamma) grid]

@AlexanderFabisch Does it sound good to you? If ok with you, I will add multiple scorer support to learning_curve and return training times, to be consistent with validation_curve.

@AlexanderFabisch AlexanderFabisch commented on the diff
sklearn/learning_curve.py
((23 lines not shown))
return np.array((train_scores, test_scores)).T
-def validation_curve(estimator, X, y, param_name, param_range, cv=None,
+def validation_curve(estimator, X, y, param_grid, cv=None,
@AlexanderFabisch Owner

Is there a reason why you changed this? Is it now possible to pass multiple parameter ranges?

@mblondel Owner

Yes, see my previous comment and the new example plot_validation_contour.py.

@AlexanderFabisch

I think multiple scorer support in learning_curve and validation_curve is a good idea.

I am not completely convinced that we should always return the training times. Maybe that makes the interface too complicated but I am not sure yet.

By the way, there is narrative documentation with doctests, and there are two examples (plot_learning_curve.py and plot_validation_curve.py) that have to be fixed.

@mblondel
Owner

I am not completely convinced that we should always return the training times. Maybe that makes the interface too complicated but I am not sure yet.

I think we should always return them. For learning_curve, it is useful to know how the algorithm scales with respect to n_samples (linearly, quadratically, ...). For validation_curve, it is useful to know the effect of a parameter on training time. For example, in SVC, training will typically be faster with smaller C.

I'd rather not add a return_training_times=False argument just so that you can do train_scores, test_scores = validation_curve(...) instead of train_scores, test_scores, _ = validation_curve(...).

@jnothman
Owner

I don't altogether object to handling n-dimensional parameter grids for validation curve, but:

  • Handling sequences of grids -- which ParameterGrid provides for -- is irrelevant here (it's relevant for GridSearchCV because it selects the optimum among them)
  • I think the output should be shaped so that there is an axis for each parameter (and similar utilities have been proposed for the output of GridSearchCV)
  • The implicit ordering of the axes (parameter names sorted alphabetically, case-sensitively) isn't good. I would even consider allowing the user to specify an ordering (in contravention of ParameterGrid's current API) and/or including the ordering as a return value. This and other interface differences might suggest providing validation_curve as a wrapper around validation_contour (n-dimensional).
@jnothman
Owner

And regarding training times, I think it's useful and harmless; training scores I'm not as certain about.

@jnothman
Owner

But learning_curve and validation_curve should have similar outputs if possible.

@mblondel
Owner

Handling sequences of grids -- which ParameterGrid provides for -- is irrelevant here (it's relevant for GridSearchCV because it selects the optimum among them)

Agreed. I already deleted the mention of lists of dicts in the docstring.

I think the output should be shaped so that there is an axis for each parameter

I thought about that but I think it is nice that the returned shape is always consistent regardless of the number of parameters.

The implicit ordering of the axes (parameter names sorted alphabetically, case-sensitively) isn't good

I like this convention because it makes the interface slicker IMO. It should of course be properly documented. We could also add support for ordered dicts in ParameterGrid. Or, as you suggest, we could add an extra (optional) parameter for specifying the order (and document that the order is alphabetical if it is not provided).

@mblondel
Owner

It is true that we don't often follow the "convention over configuration" paradigm in scikit-learn. A recent counter-example is make_pipeline though.

@jnothman
Owner
@amueller
Owner

This looks great, thanks for working on this.

@jnothman
Owner

I thought about that but I think it is nice that the returned shape is always consistent regardless of the number of parameters.

I'm not convinced by this. I assume (but haven't reviewed the code) that multiple scorers change the ndim; why shouldn't multiple parameters? I think it's a natural way to view the data. But I'll settle for a doctest example that shows reshape (and/or meshgrid) in action when finding the mean CV score over a 2d grid.
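Something like this sketch could serve as that doctest (mean_scores is assumed to be the flat array of mean CV scores over a C x gamma grid, with parameters ordered alphabetically as discussed; the exact axis order would follow ParameterGrid's iteration order):

```python
import numpy as np

C_range = np.logspace(-2, 2, 5)
gamma_range = np.logspace(-3, 1, 5)

# mean_scores: flat array of length len(C_range) * len(gamma_range)
scores_2d = np.asarray(mean_scores).reshape(len(C_range), len(gamma_range))
# scores_2d[i, j] is then the mean CV score for (C_range[i], gamma_range[j])
```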

@mblondel
Owner

With the new learning curve features in master, I think it would be time to have a model_selection module with grid search, cross-validation and learning curve features. @GaelVaroquaux, any objection? IIRC, @jnothman liked the idea too.

@jnothman
Owner

I think it would be time to have a model_selection module with grid search, cross-validation and learning curve features.

In #1848 there was consensus to rename grid_search. I've suggested including CV as well, if not scorers.

@jnothman
Owner

And grid_search still needs a rename. Perhaps model_selection.search.

@mblondel
Owner
@mblondel
Owner

I've suggested including CV as well, if not scorers.

+1 to move scorers to model_selection on my side

@jnothman jnothman commented on the diff
sklearn/grid_search.py
((61 lines not shown))
+
+ for i in xrange(n_scorers):
+ # TODO: shall we also store the test_fold_sizes?
+ tup = _CVScoreTuple(parameters, mean_scores[i], fold_scores[:, i])
+ grid_scores[i].append(tup)
+
+ # Find the best parameters by comparing on the mean validation score:
+ # note that `sorted` is deterministic in the way it breaks ties
+ bests = [sorted(grid_scores[i], key=lambda x: x.mean_validation_score,
+ reverse=True)[0] for i in xrange(n_scorers)]
+ best_params = [best.parameters for best in bests]
+ best_scores = [best.mean_validation_score for best in bests]
+
+ best_estimators = []
+ if refit:
+ for i in xrange(len(scorers)):
@jnothman Owner

I still don't see the point in refitting all best estimators -- presumably an expensive task -- if only one will be used for prediction in a grid search. This forces the user to fit additional models when they only want one, while refitting only one does not stop the user from fitting additional models for each other objective.

If all must be fit, at least use Parallel?

@mblondel Owner
@mblondel Owner

How about: refit="all" or refit=idx or refit="string-identifier" (only if not a callable scorer)?

if only one will be used for prediction in a grid search

I was thinking people could also use best_estimators_[idx].predict directly.

@jnothman Owner

How about: refit="all" or refit=idx or refit="string-identifier" (only if not a callable scorer)?

I earlier suggested refit="string-identifier" and had considered idx as well. I don't mind this at all.

I was thinking people could also use best_estimators_[idx].predict directly.

Yes, but they could equally fit it themselves if this is how they're to use it. But only one best estimator can be used with GridSearchCV's fit-and-predict API...

That is, unless best_estimator_ is affected by the user changing refit at runtime:

>>> clf = GridSearchCV(SVC(), ..., scoring=['f1', 'roc_auc'], refit=True)
>>> clf.fit(X, y)
>>> clf.refit = 'f1'   # equivalently, clf.set_params(refit='f1') or clf.set_params(refit=0)
>>> clf.predict(X_test)  # predict with the best scorer by F1
>>> clf.refit = 'roc_auc'
>>> clf.predict(X_test)  # predict with the best scorer by ROC

This is a bit like changing the K parameter in SelectKBest after fitting the scores once...

@mblondel mblondel commented on the diff
sklearn/cross_validation.py
((6 lines not shown))
X_train, y_train = _safe_split(estimator, X, y, train)
X_test, y_test = _safe_split(estimator, X, y, test, train)
+
+ start_time = time.time()
@ogrisel
Owner

@mblondel this PR is still WIP (and I have not found the time to review it myself yet). As it's a non-trivial new feature with API impacts, I think it's risky to include it in 0.15.

I removed the 0.15 tag. Please let us know if you have an objection.

@ogrisel ogrisel removed this from the 0.15 milestone
@mblondel
Owner

+1

Sorry I forgot to remove the tag myself.

@amueller amueller commented on the diff
sklearn/cross_validation.py
((18 lines not shown))
- scoring_time = time.time() - start_time
+ train_time = time.time() - start_time
+
+ test_scores = _evaluate_scorers(estimator, X_test, y_test, scorers)
+
+ if return_train_scores:
+ if len(scorers) == 1:
@amueller Owner

what is the benefit of this if?

@mblondel Owner

I think the len(scorers) == 1 case doesn't need to be treated separately indeed.

@amueller amueller commented on the diff
sklearn/metrics/scorer.py
((118 lines not shown))
- y_type = type_of_target(y)
- if y_type not in ("binary", "multilabel-indicator"):
- raise ValueError("{0} format is not supported".format(y_type))
-
+ for k, v in self.kwargs.items()])
+ return ("make_scorer(%s%s%s)"
+ % (self.score_func.__name__,
+ "" if self.greater_is_better else ", greater_is_better=False",
+ kwargs_string))
+
+
+def _evaluate_scorers(estimator, X, y, scorers):
+ """Evaluate a list of scorers. `scorers` may contain _Scorer objects or
+ callables of the form callable(estimator, X, y)."""
+
+ if len(scorers) == 1 and not isinstance(scorers[0], _Scorer):
@amueller Owner

Why is the second condition here?

@mblondel Owner

This condition is for handling a user-defined callable. It's a complete black box from our point of view.

@amueller
Owner

What is the rationale behind returning a 2d array? I think I would have preferred a dict. Also, I don't understand how (or if?) you return the training score in GridSearchCV.

@amueller
Owner

@mblondel do you plan to work on this? I have half a mind to take this over ;)

@mblondel
Owner

@amueller That would be nice! (No plan to work on this in the near future...)

@jnothman
Owner

@amueller wrote:

What is the rationale behind returning a 2d array? I think I would have preferred a dict.

What are the dict keys when the scorers aren't entered as strings?

I'd really like to see this happen. I'd happily attempt to complete the PR. What do you consider to still be lacking, @mblondel?

I am however a little concerned that any code that attempts efficient calculation of multiple scorers (with the current definition of scorer) is going to be frameworkish to a great extent, and hence will be difficult to get merged. Is there some way to limit this?

@mblondel
Owner
@jnothman
Owner

Note: this PR fixes #1850.

Commits on Jan 9, 2014
  1. @AlexanderFabisch

    Refactor cv code

    AlexanderFabisch authored
  2. @AlexanderFabisch

    Clean up

    AlexanderFabisch authored
Commits on Jan 10, 2014
  1. @AlexanderFabisch
  2. @AlexanderFabisch
  3. @AlexanderFabisch
Commits on Jan 11, 2014
  1. @AlexanderFabisch

    Return time

    AlexanderFabisch authored
  2. @AlexanderFabisch
  3. @AlexanderFabisch
Commits on Jan 12, 2014
  1. @AlexanderFabisch
  2. @AlexanderFabisch

    Clean up

    AlexanderFabisch authored
  3. @AlexanderFabisch
  4. @AlexanderFabisch
  5. @AlexanderFabisch
Commits on Jan 14, 2014
  1. @AlexanderFabisch
  2. @mblondel
  3. @mblondel
Commits on Jan 15, 2014
  1. @mblondel
  2. @mblondel
  3. @mblondel

    Support SVC.

    mblondel authored
  4. @mblondel

    Handle multi-label case.

    mblondel authored
  5. @mblondel
Commits on Jan 16, 2014
  1. @mblondel
  2. @mblondel

    Merge branch 'master' into multiple_grid_search

    mblondel authored
    Conflicts:
    	sklearn/cross_validation.py
    	sklearn/feature_selection/rfe.py
    	sklearn/grid_search.py
    	sklearn/learning_curve.py
    	sklearn/metrics/scorer.py
    	sklearn/metrics/tests/test_score_objects.py
    	sklearn/tests/test_grid_search.py
  3. @mblondel
  4. @mblondel
  5. @mblondel
  6. @mblondel
  7. @mblondel

    Define `parameters` upfront.

    mblondel authored
    We are not supposed to use `parameters` outside of the loop. And this
    makes the code very difficult to read.
  8. @mblondel

    Use more informative name.

    mblondel authored
  9. @mblondel

    Put __repr__ back.

    mblondel authored
  10. @mblondel

    Deprecate fit_grid_point.

    mblondel authored
  11. @mblondel

    Add grid_search_cv.

    mblondel authored
  12. @mblondel

    Refactor code.

    mblondel authored
  13. @mblondel

    Add randomized_search_cv.

    mblondel authored
  14. @mblondel
  15. @mblondel

    Update docstrings.

    mblondel authored
  16. @mblondel
Commits on Jan 17, 2014
  1. @mblondel

    Support multiple metrics directly in GridSearchCV and

    mblondel authored
    RandomizedSearchCV.
  2. @mblondel
Commits on Feb 3, 2014
  1. @mblondel

    Simplify inner loop.

    mblondel authored
    Code snippet by @jnothman.
  2. @mblondel

    Fix incorrect comment.

    mblondel authored
  3. @mblondel

    Fix comments.

    mblondel authored
  4. @mblondel

    Return training time only.

    mblondel authored
  5. @mblondel

    Remove return_parameters.

    mblondel authored
  6. @mblondel
  7. @mblondel

    Add cross_val_report.

    mblondel authored
Commits on Feb 4, 2014
  1. @mblondel

    Remove score_func from cross_val_report.

    mblondel authored
    This function is deprecated, no need to add it to a new function.
Commits on Feb 7, 2014
  1. @mblondel

    Accept tuples too.

    mblondel authored
  2. @mblondel
  3. @mblondel

    Unused imports.

    mblondel authored
  4. @mblondel

    Clone early.

    mblondel authored
  5. @mblondel

    Merge branch 'master' into multiple_grid_search

    mblondel authored
    Conflicts:
    	sklearn/grid_search.py
    	sklearn/learning_curve.py
Commits on Feb 9, 2014
  1. @mblondel
  2. @mblondel
  3. @mblondel
  4. @mblondel

    Return training times.

    mblondel authored
  5. @mblondel

    Remove cross_val_report.

    mblondel authored