[MRG+1] Fix SGDClassifier never has the attribute "predict_proba" (even with log or modified_huber loss) #12222
Conversation
Looks good, thanks!
Thanks @rebekahkim, but your test is not failing on master.
Hm, I'm not sure that means we can close the issue, as I think there's a bug in the multioutput classifier as well. But this bug here could theoretically be triggered if you grid-search between log loss and hinge loss and then put the estimator into MultiOutputClassifier, because then it's impossible to duck-type before fit has been called.
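For context, here is a minimal sketch (assuming a scikit-learn of the 0.20/0.21 era discussed here, where loss='log' is still the spelling) of how predict_proba availability on SGDClassifier depends on the chosen loss rather than on fitting; it is an illustration, not code from this PR:

```python
# Hedged illustration: SGDClassifier exposes predict_proba only for
# probabilistic losses, and it does so even before fit.
from sklearn.linear_model import SGDClassifier

print(hasattr(SGDClassifier(loss="hinge"), "predict_proba"))           # False: hinge is not probabilistic
print(hasattr(SGDClassifier(loss="log"), "predict_proba"))             # True, even on an unfitted estimator
print(hasattr(SGDClassifier(loss="modified_huber"), "predict_proba"))  # True as well
```

The trouble described above is that a meta-estimator such as MultiOutputClassifier checks the unfitted template estimator, which cannot tell you what a wrapped grid search will eventually select.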
I wonder if this can also fail in a less obscure situation. @rebekahkim you could also just create a new estimator for the test that only has predict_proba after fitting.
I don't understand "create a new estimator for the test that only has predict_proba after fitting": this is already the case for SGDClassifier(loss='log'), no?
This LGTM but this would require an entry in doc/whats_new/v0.21.rst.
sklearn/tests/test_multioutput.py (Outdated)

```python
multi_target_linear.fit(X, y)
multi_target_linear.predict_proba(X)

sgd_linear_clf = SGDClassifier(random_state=1, max_iter=5)
```
Could you please add an inline comment before this line to make it explicit that SGDClassifier uses loss='hinge' by default, which is not a probabilistic loss function and therefore does not expose a predict_proba method.
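One way the requested inline comment could look (a sketch; the variable name is taken from the excerpt above):

```python
from sklearn.linear_model import SGDClassifier

# SGDClassifier uses loss='hinge' by default, which is not a probabilistic
# loss function and therefore does not expose a predict_proba method.
sgd_linear_clf = SGDClassifier(random_state=1, max_iter=5)
```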
> I don't understand "create a new estimator for the test that only has predict_proba after fitting": this is already the case for SGDClassifier(loss='log'), no?
No, as @TomDLT said above, this is not actually the case for SGDClassifier in master, and the current test is not failing in master.
Hello @rebekahkim, thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation. Any questions: cc @reshamas
@rebekahkim
@amueller I'm having trouble finding estimators without the predict_proba attribute before fitting.
GridSearchCV and SVC also have predict_proba only after fitting, as, like SGDClassifier, they may or may not have probabilistic output, depending on parameters and even on the results of fitting.
@psorianom @GaelVaroquaux
I think only grid searches currently make their predict_proba appear after fitting, as we cannot be sure whether the best model is probabilistic. It's certainly a weird edge case.
Although maybe then I've not interpreted the raised issue correctly.
@jnothman I think you are right; it seems like SGDClassifier with an appropriate loss (log or modified_huber) and SVC already have predict_proba before fit (at least in current master as well as in version 0.20.0). As @amueller said, the bug would be triggered by a really obscure case. For example, if each estimator in MultiOutputClassifier(GridSearchCV(SGDClassifier(), param_grid={'loss': ('hinge', 'log', 'modified_huber')})) has loss = 'log' or 'modified_huber', only then can you have a valid

```python
def predict_proba(self, X):
    check_is_fitted(self, 'estimators_')
    if not hasattr(self.estimator, "predict_proba"):  # would fail here
        raise ValueError(...)
```

in MultiOutputClassifier. This actually doesn't happen in the iris, wine, breast_cancer, or digits datasets; the estimators don't line up (I tested them all). How should we go about doing this? Try to find a dataset where this happens? Or is there a way to "force" GridSearchCV to choose a certain parameter without setting it on the base estimator in the first place?
Well you can certainly force GridSearchCV to choose things if they're fake estimators... E.g. define
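One possible way to force the selection, sketched here on the assumption that a custom scorer is acceptable in the test (the name custom_scorer mirrors the excerpt further below, but this body is only illustrative): reward any candidate that exposes predict_proba, so the hinge loss can never win the grid search.

```python
def custom_scorer(estimator, X, y):
    # GridSearchCV accepts a callable with this (estimator, X, y) signature;
    # candidates with a probabilistic loss ('log', 'modified_huber') expose
    # predict_proba and therefore always score higher than 'hinge'.
    if hasattr(estimator, "predict_proba"):
        return 1.0
    return 0.0
```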
@jnothman that's really awesome; I didn't know you could do that! I made the appropriate changes. Do you mind taking a look and seeing if it's ready to merge?
You can do all sorts of contrived and unrealistic things in tests! :) Sometimes even in real code 😮
```python
    return 0.0

grid_clf = GridSearchCV(sgd_linear_clf, param_grid=param,
                        scoring=custom_scorer, cv=3, error_score=np.nan)
multi_target_linear = MultiOutputClassifier(grid_clf)
```
I would like to assert not hasattr(..., 'predict_proba') before doing this fit, so that the intention of the test is a bit clearer.
You mean for multi_target_linear.estimator, right? Technically, the estimator still wouldn't have predict_proba after fit, because the underlying estimator (SGDClassifier with default loss='hinge') doesn't have predict_proba. But all estimators in estimators_ here would (after fit, of course).
If you mean the multi_target_linear itself, it would have predict_proba before and after fit; it just won't be valid (raises ValueError).
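To make the distinction concrete, here is a self-contained sketch of the scenario being discussed (assuming a scikit-learn of the era of this PR, where loss='log' exists; the dataset, variable names and scorer are illustrative, not the exact test code):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

X, y_single = load_iris(return_X_y=True)
y = np.column_stack([y_single, y_single])  # two identical outputs, purely for illustration

def custom_scorer(estimator, X, y):
    # Force GridSearchCV to pick a probabilistic loss.
    return 1.0 if hasattr(estimator, "predict_proba") else 0.0

param = {"loss": ["hinge", "log", "modified_huber"]}
grid_clf = GridSearchCV(SGDClassifier(random_state=1, max_iter=5),
                        param_grid=param, scoring=custom_scorer, cv=3)
multi_target_linear = MultiOutputClassifier(grid_clf)

# The unfitted template (a GridSearchCV around the default hinge-loss
# SGDClassifier) does not expose predict_proba:
assert not hasattr(multi_target_linear.estimator, "predict_proba")

multi_target_linear.fit(X, y)

# ...but every fitted per-output search does, because the scorer makes a
# probabilistic loss win:
assert all(hasattr(est, "predict_proba") for est in multi_target_linear.estimators_)
```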
@jnothman thoughts?
I've realised what my issue is here. I think this PR is an improvement, and we can merge it as a quick fix, but what we should really be doing here is defining predict_proba such that hasattr(multioutputclf, 'predict_proba') is False if the underlying estimator does not have a predict_proba attribute. See BaseSearchCV.predict_proba for instance. Would you like to help implementing that, @rebekahkim?
@jnothman I'd like to help with the implementation! I just want to make sure I'm understanding what you're saying and proposing a correct solution. We don't want a multioutput class instance whose base estimator(s) don't have a predict_proba method to expose a predict_proba attribute itself. As a side note, while searching for some examples, I saw that there is an existing helper for this.
Yes, briefly, that is the right approach. I am not certain if that helper will work immediately here. I've not had time to look.
We've been looking at this with @thomasjpfan. You won't be able to use that helper directly here. We think the right solution is to mimic what BaseSearchCV.predict_proba does, but checking self.estimators_[0], not self.estimator. This way, hasattr on the MultiOutputClassifier would reflect whether the fitted estimators actually support predict_proba. All that being said, as Joel said we'd be fine merging the PR as-is as a good-enough fix for now, and opening another issue to address the hasattr behaviour.
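A rough sketch of that idea (an assumption about the shape of the fix, not the implementation that was merged; the class and helper names here are invented): expose predict_proba as a property that raises AttributeError, so that hasattr on the meta-estimator reflects the fitted estimators_ rather than the unfitted template.

```python
import numpy as np
from sklearn.utils.validation import check_is_fitted

class MultiOutputClassifierSketch:
    """Only the predict_proba plumbing is sketched; fit() would set estimators_."""

    @property
    def predict_proba(self):
        # check_is_fitted raises NotFittedError, which subclasses AttributeError,
        # so hasattr(clf, 'predict_proba') is False on an unfitted instance.
        check_is_fitted(self, "estimators_")
        if not hasattr(self.estimators_[0], "predict_proba"):
            # Raising AttributeError (rather than ValueError) keeps
            # ducktyping-based checks on the meta-estimator honest.
            raise AttributeError("predict_proba is not available when the "
                                 "underlying estimators do not implement it")
        return self._predict_proba

    def _predict_proba(self, X):
        # One (n_samples, n_classes) array per output.
        return [np.asarray(est.predict_proba(X)) for est in self.estimators_]
```

This mirrors the effect of BaseSearchCV.predict_proba mentioned earlier, where the attribute only appears when the delegate actually supports it.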
@rebekahkim We have scheduled the 2019 WiMLDS sprint for Sunday, Aug 25, if you would like to schedule the work up to and around that date.
@NicolasHug @thomasjpfan Good point! I'll leave the decision up to the sklearn dev team whether to merge this PR (@jnothman). I'll open a new issue to correct the hasattr behaviour.
That's fine. An alternative (not sure which is more elegant): you could use
Please add a what's new entry.
LGTM otherwise
Co-Authored-By: rebekahkim <rebekah.kim@columbia.edu>
@NicolasHug thanks for the style suggestion. It seems like the codecov/patch check fails because I've made changes to documentation. Can we ignore this?
As far as I can tell the proposed changes are tested, so it should be fine. Merging, thanks @rebekahkim!
…f base estimator (scikit-learn#12222)" This reverts commit 7248191.
Fixes #10113
self.estimators_[0], not self.estimator