
[WIP] scorer_params and sample_weight support #3524

Closed
wants to merge 5 commits into scikit-learn:master from vene:weighted_score_params

Conversation

@vene (Member) commented Aug 1, 2014

This is me picking up the remaining commits from #1574, rebasing on master, and adding a different API:

Since I want to write a sample_group-aware scorer, I needed an API for passing arbitrary params to scorers in cross-validation, in the same way as fit_params. This removes the need for an explicit sample_weight parameter, at the cost of some duplication in the function call (fit_params=dict(sample_weight=sw), scorer_params=dict(sample_weight=sw)). Internally, though, this saves us the need for special treatment and allows scorers to be more powerful.

  • Consider API change: fit_params and scorer_params entries whose names start with sample_ will be indexed by the train/test splits.
  • Support fit_params and scorer_params in learning_curve and RFECV.
  • Test that scorer_params are getting appropriately indexed.
  • Proof of concept: learning-to-rank grid search using this API.
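The indexing rule in the first task item can be sketched as follows. This is an illustrative helper, not the PR's actual code; the name `_index_params` and the toy data are assumptions.

```python
import numpy as np

def _index_params(params, indices):
    """Slice every parameter whose name starts with 'sample_' by the given
    CV indices; pass everything else through unchanged."""
    return {key: (np.asarray(value)[indices] if key.startswith("sample_")
                  else value)
            for key, value in params.items()}

# One CV split: fit_params are sliced with the train indices,
# scorer_params with the test indices.
sw = np.array([1.0, 2.0, 1.0, 0.5, 3.0, 1.5])
train, test = np.array([0, 1, 2, 3]), np.array([4, 5])

fit_params = _index_params({"sample_weight": sw, "verbose": True}, train)
scorer_params = _index_params({"sample_weight": sw}, test)
```

Under this rule a non-sample-aligned entry like `verbose` is forwarded untouched, while `sample_weight` is automatically restricted to each fold.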

@coveralls commented

Coverage Status: coverage decreased (-0.03%) when pulling f1f6a3c on vene:weighted_score_params into 07560e4 on scikit-learn:master.

@@ -17,7 +17,8 @@
 from .utils.fixes import astype


-def learning_curve(estimator, X, y, train_sizes=np.linspace(0.1, 1.0, 5),
+def learning_curve(estimator, X, y, sample_weight=None,
Review comment (Member):

this changes the function signature :( can we live with it?

@vene (Member, Author) replied:

This is one of the changes from @ndawe's original changeset. I would suggest making these functions (learning_curve, validation_curve, RFECV) support fit_params and scorer_params, and adding them at the end so that the API doesn't change, in the same way I did in grid search.

I didn't do this yet because I wanted to get the API proposal out there so we can discuss, and also because I only really need GridSearchCV for what I'm doing now.
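The backward-compatible variant vene suggests would keep the existing positional parameters and append the new ones at the end. A sketch, assuming the 2014-era learning_curve keywords (which may differ from the real signature) and with a stub body of my own:

```python
import numpy as np

def learning_curve(estimator, X, y, train_sizes=np.linspace(0.1, 1.0, 5),
                   cv=None, scoring=None, exploit_incremental_learning=False,
                   n_jobs=1, pre_dispatch="all", verbose=0,
                   fit_params=None, scorer_params=None):
    # Stub body: a real implementation would slice these dicts per CV
    # split and forward them to fit() and the scorer, respectively.
    fit_params = {} if fit_params is None else fit_params
    scorer_params = {} if scorer_params is None else scorer_params
    return fit_params, scorer_params
```

Because the new keywords come last with None defaults, existing positional calls keep working unchanged.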

@jnothman (Member) commented Aug 2, 2014

I don't dispute the utility of scorer_params in other cases, but I think we are best off supporting a sample_weight parameter directly in GridSearchCV et al. This is consistent with the standard fit interface across scikit-learn, and makes things like nested CV straightforward.

@vene (Member, Author) commented Aug 2, 2014

@jnothman I am not convinced either. But there are two parameters that need this behaviour, sample_weight and sample_group, which made me think it would be cleaner to implement them together.

This is clearly a generalizable approach, which is tempting, but I can't come up with other examples of arguments that should be passed like this. Nor can I think of other scorer params that are data-dependent and so couldn't simply be set when instantiating a Scorer object.

So I'm torn, but I do need sample_group.

@jnothman (Member) commented Aug 2, 2014

As far as I'm concerned, scikit-learn is almost entirely concerned with data where samples may be assumed iid. sample_group falls outside (or at least on the periphery) of that space, while sample_weight is something that most estimators, metrics, etc. could/should support. However, it is useful that the more generic components in scikit-learn also be compatible with learning tasks where additional information like sample_group is needed. I think this is sufficient for arguing that sample_weight and sample_group be considered different classes of citizen.

@jnothman jnothman closed this Aug 2, 2014
@jnothman jnothman reopened this Aug 2, 2014
@jnothman (Member) commented Aug 2, 2014

(Sorry, slipped.)

On a related note, split_params may also be needed in the context of #3340 and nested cross-validation, in order to handle LeaveOneLabelOut. Or is that getting too crazy?

On a completely different note, I doubt that scorer_params should be an attribute of GridSearchCV et al., mostly because it will add confusion to the metrics-scorer interfaces. Basically, scorers should be designed such that all parameters other than estimator are data-aligned, and hence scorer_params should only be provided as a kwarg to fit.

Another way of looking at it is that arbitrary kwargs could be provided to fit, and the GridSearchCV constructor could have a parameter that defines their routing: fit_param_routing={'sample_weight': ['fit', 'score'], 'sample_group': ['score'], 'labels': ['split']} (or the inverse mapping). (NB: In nested CV, all parameters should be routed to 'fit'.)
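The routing idea could be sketched roughly as below; `route_params` is a hypothetical helper illustrating jnothman's mapping, not anything that exists in scikit-learn.

```python
def route_params(routing, **kwargs):
    """Split fit kwargs into per-destination dicts according to a routing
    map such as {'sample_weight': ['fit', 'score'], 'labels': ['split']}."""
    routed = {"fit": {}, "score": {}, "split": {}}
    for name, value in kwargs.items():
        # Unlisted parameters default to 'fit' only, matching the legacy
        # fit_params behaviour (an assumption, not a stated rule).
        for dest in routing.get(name, ["fit"]):
            routed[dest][name] = value
    return routed

routing = {"sample_weight": ["fit", "score"],
           "sample_group": ["score"],
           "labels": ["split"]}
routed = route_params(routing, sample_weight=[1, 2], labels=["a", "b"])
```

A nested cross-validator would then, per the NB above, declare a routing that sends everything to 'fit' so the inner search can re-route it.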

@vene (Member, Author) commented Aug 2, 2014

I see your point and I agree. I was thinking the same: something like sample_group should definitely not be defined in scikit-learn itself, but the cross-val and search API should allow for something like this to be implemented easily on top, without having to copy-paste and hack the code.

One thing that collides with your suggestion: fit_params is already a constructor argument, even though by design whatever goes into fit_params is data-dependent, which is a bit odd. I think it would be cleaner to route arbitrary **kwargs passed to the grid search's fit method.

@jnothman (Member) commented Aug 2, 2014

fit_params has a long legacy, I think, but I don't believe it would be harmful to deprecate it. My only worry is that the routing approach is a bit too abstract. Then again, perhaps the argument that the base estimator's fit should generally not receive sample_group, while a nested cross-validator's fit should, suffices to show that a mechanism of this power is necessary.

for search_cls in (GridSearchCV, RandomizedSearchCV):
    params = dict(sample_weight=sample_weight)
    grid_search = search_cls(MockClassifier(), est_parameters, cv=cv,
                             fit_params=params, scorer_params=params)
Review comment (Member):

It could be useful to support scorer_params="fit_params" to avoid the double declaration. Now, should the default be scorer_params=None or scorer_params="fit_params"?
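The suggested convenience could be resolved with a small sentinel check. A sketch: the string sentinel is the reviewer's suggestion, while the helper name and its None handling are my assumptions.

```python
def resolve_scorer_params(fit_params, scorer_params="fit_params"):
    # The string "fit_params" acts as a sentinel meaning "reuse fit_params
    # for scoring"; None means "pass nothing to the scorer".
    if scorer_params == "fit_params":
        return dict(fit_params)
    return scorer_params if scorer_params is not None else {}

shared = resolve_scorer_params({"sample_weight": [1, 2]})
```

With "fit_params" as the default, the common weighted-scoring case needs only one declaration, while scorer_params=None opts out explicitly.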

@adrinjalali (Member) commented:
Also closing this one since SLEP006 provides the API and it's already implemented for scorers.
