
[MRG+1] Fix SGDClassifier never has the attribute "predict_proba" (even with log or modified_huber loss) #12222

Merged · 14 commits · Apr 19, 2019

Conversation

rebekahkim (Contributor)

Fixes #10113

  • predict_proba checks self.estimators_[0], not self.estimator
  • test for predict_proba

@amueller (Member) left a comment:

Looks good, thanks!

@TomDLT (Member)

TomDLT commented Oct 1, 2018

Thanks @rebekahkim, but your test is not failing on master.
Actually, the bug seems to have been fixed in #10961.
I propose to close this PR and the corresponding issue.

@amueller (Member)

amueller commented Oct 1, 2018

Hm, I'm not sure that means we can close the issue, as I think there's a bug in the multioutput classifier as well.
I guess it depends on whether we require duck-typing to be available before fit.

But this bug here could theoretically be triggered if you grid-search between log loss and hinge loss and then put the estimator into MultiOutputClassifier, because then it's impossible to duck-type before fit has been called.
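The pre-fit duck-typing problem can be sketched without scikit-learn. In the hypothetical mock below (all names are illustrative, not real API), whether predict_proba exists depends on which loss the search settles on, which is only known after fit:

```python
# Hedged, scikit-learn-free mock of the scenario: the loss (and hence
# whether predict_proba makes sense) is only decided when fit() runs,
# so no pre-fit hasattr check can give the right answer.
class MockLossSearch:
    """Hypothetical stand-in for a grid search over hinge vs. log loss."""

    def __init__(self):
        self.best_loss_ = None  # unknown until fit

    def fit(self):
        self.best_loss_ = "log"  # imagine log loss wins the search
        return self

    @property
    def predict_proba(self):
        # Raising AttributeError from a property makes hasattr(...) False.
        if self.best_loss_ not in ("log", "modified_huber"):
            raise AttributeError("loss does not support probabilities")
        return lambda X: [[0.5, 0.5] for _ in X]


est = MockLossSearch()
assert not hasattr(est, "predict_proba")  # undecidable before fit
est.fit()
assert hasattr(est, "predict_proba")  # available once log loss is chosen
```

This is a sketch of the timing issue only; the real SGDClassifier and GridSearchCV behave analogously but with actual training.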

@amueller (Member)

amueller commented Oct 1, 2018

I wonder if this can also fail in a less obscure situation. @rebekahkim you could also just create a new estimator for the test that only has predict_proba after fitting. I don't think there's harm in checking this as late as possible.

@ogrisel (Member)

ogrisel commented Oct 1, 2018

I don't understand "create a new estimator for the test that only has predict_proba after fitting": this is already the case for SGDClassifier(loss='log'), no?

@ogrisel (Member) left a comment:

This LGTM, but it would require an entry in doc/whats_new/v0.21.rst.

    multi_target_linear.fit(X, y)
    multi_target_linear.predict_proba(X)

    sgd_linear_clf = SGDClassifier(random_state=1, max_iter=5)

@ogrisel (Member) commented on this line:
Could you please add an inline comment before this line to make it explicit that SGDClassifier uses loss='hinge' by default, which is not a probabilistic loss function and therefore does not expose a predict_proba method?

Reply (Member):

I don't understand "create a new estimator for the test that only has predict_proba after fitting": this is already the case for SGDClassifier(loss='log'), no?

No, as @TomDLT said above, this is not actually the case for SGDClassifier in master, and the current test is not failing in master.

@sergulaydore (Contributor)

Hello @rebekahkim ,

Thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation.

If you have any questions:

  • see workflow for reference
  • ask on this PR conversation or the issue tracker
  • ask on wimlds gitter with a reference to this PR

cc: @reshamas

@reshamas (Member)

@rebekahkim
Will you be completing this PR?

@rebekahkim (Contributor, Author)

I wonder if this can also fail in a less obscure situation. @rebekahkim you could also just create a new estimator for the test that only has predict_proba after fitting. I don't think there's harm in checking this as late as possible.

@amueller I'm having trouble finding estimators without the predict_proba attribute before fitting. Do you mind pointing me in the right direction?

@jnothman (Member)

jnothman commented Dec 25, 2018 via email

@reshamas (Member)

@psorianom @GaelVaroquaux
This PR has been languishing for quite some time, and I have been unable to find someone to complete it in our wimlds community. Can it be tagged "help wanted"?
Thank you.

@jnothman (Member)

I think only grid searches currently make their predict_proba appear after fitting, as we cannot be sure whether the best model is probabilistic:

    GridSearchCV(Pipeline([('clf', None)]),
                 {'clf': [LogisticRegression(), RandomForestClassifier()]})

It's certainly a weird edge case.

@jnothman (Member)

Although maybe then I've not interpreted the raised issue correctly.

@rebekahkim (Contributor, Author)

@jnothman I think you are right; it seems like SGDClassifier with an appropriate loss (log or modified_huber) and SVC already have predict_proba before fit (at least in current master as well as in version 0.20.0).

As @amueller said, the bug would be triggered by a really obscure case. For example, if each estimator in estimators_ from

    MultiOutputClassifier(GridSearchCV(
        SGDClassifier(),
        param_grid={'loss': ('hinge', 'log', 'modified_huber')}))

has loss='log' or 'modified_huber' (only then can you have a valid predict_proba), it would still fail MultiOutputClassifier's predict_proba check

    def predict_proba(self, X):
        check_is_fitted(self, 'estimators_')
        if not hasattr(self.estimator, "predict_proba"):  # would fail here
            raise ValueError(...)

in master.

This actually doesn't happen in iris, wine, breast_cancer, or digits datasets; the estimators don't line up (I tested them all). How should we go about doing this? Try to find a dataset where this happens? Or is there a way to "force" GridSearchCV to choose a certain parameter without setting it on the base estimator in the first place?

@jnothman (Member)

Well you can certainly force GridSearchCV to choose things if they're fake estimators... E.g. define score so that the one with proba returns 1 and the one without returns 0.
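A minimal sketch of that trick (pure Python, all names hypothetical): a scorer that returns 1.0 only for estimators exposing predict_proba makes the probabilistic candidate win whatever model selection it is plugged into. In a real test this function would be passed to GridSearchCV via scoring=...; here a plain max() stands in for the search:

```python
# Hedged sketch: a custom scorer that prefers probabilistic estimators,
# so a search over fake estimators is forced to select the one with
# predict_proba.
def prefers_proba(estimator, X=None, y=None):
    """Score 1.0 for estimators exposing predict_proba, else 0.0."""
    return 1.0 if hasattr(estimator, "predict_proba") else 0.0


class WithProba:  # hypothetical probabilistic candidate
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class WithoutProba:  # hypothetical non-probabilistic candidate
    pass


candidates = [WithoutProba(), WithProba()]
best = max(candidates, key=prefers_proba)
assert isinstance(best, WithProba)  # the search is forced to pick it
```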

@rebekahkim (Contributor, Author)

@jnothman that's really awesome; I didn't know you could do that!

I made the appropriate changes. Do you mind taking a look and seeing if it's ready to merge?

@agramfort agramfort changed the title Fix SGDClassifier never has the attribute "predict_proba" (even with log or modified_huber loss) [MRG+1] Fix SGDClassifier never has the attribute "predict_proba" (even with log or modified_huber loss) Feb 28, 2019
@jnothman (Member) left a comment:

You can do all sorts of contrived and unrealistic things in tests! :) Sometimes even in real code 😮

        return 0.0
    grid_clf = GridSearchCV(sgd_linear_clf, param_grid=param,
                            scoring=custom_scorer, cv=3, error_score=np.nan)
    multi_target_linear = MultiOutputClassifier(grid_clf)
@jnothman (Member) commented on these lines:

I would like to assert not hasattr(..., 'predict_proba') before doing this fit, so that the intention of the test is a bit clearer.

@rebekahkim (Contributor, Author) replied:

You mean for multi_target_linear.estimator, right? Technically, the estimator still wouldn't have predict_proba after fit because the underlying estimator (SGDClassifier with default loss='hinge') doesn't have predict_proba. But all estimators in estimators_ here would (after fit, of course).

If you mean the multi_target_linear itself, it would have predict_proba before and after fit; it just won't be valid (it raises ValueError).

@rebekahkim (Contributor, Author):

@jnothman thoughts?

@jnothman (Member) left a comment:

I've realised what my issue is here. I think this PR is an improvement, and we can merge it as a quick fix, but what we should really be doing here is defining predict_proba such that hasattr(multioutputclf, 'predict_proba') is False if the underlying estimator does not have a predict_proba attribute. See BaseSearchCV.predict_proba for instance. Would you like to help implementing that, @rebekahkim?

@rebekahkim (Contributor, Author)

@jnothman I'd like to help with the implementation! I just want to make sure I'm understanding what you're saying and proposing a correct solution.

We don't want a multioutput class instance whose base estimator(s) don't have predict_proba to have the predict_proba attribute: i.e. hasattr(clf, 'predict_proba') should return False.
I need to do some more digging, but it seems like BaseSearchCV takes advantage of the if_delegate_has_method decorator to do this and handle the hasattr for sub-estimator(s). I assume we can do something similar with predict_proba in the multioutput classifier. Is this the right approach?

As a side note, while searching for some examples, I saw that SVC might also want a fix for its predict_proba check. See link.

@jnothman (Member)

jnothman commented Apr 11, 2019 via email

@NicolasHug (Member)

NicolasHug commented Apr 12, 2019

We've been looking at this with @thomasjpfan.

You won't be able to use @if_delegate_has_method since you'd need to pass it self.estimators_[0], and self.estimators_[0] isn't an attribute (it's just the first element of an attribute which is a list).

We think the right solution here is to mimic what SVC is doing: define predict_proba as a property and raise an AttributeError if predict_proba isn't defined in any of the estimators in self.estimators_, or in self.estimator.

This way, hasattr(multioutputclf, 'predict_proba') behaves properly.

All that being said, as Joel said, we'd be fine merging the PR as-is as a good-enough fix for now, and opening another issue to address the hasattr matter.
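The property-based fix described here can be sketched in plain Python (no scikit-learn required; the class names below are illustrative). Raising AttributeError from the property getter is what makes hasattr report False:

```python
# Hedged sketch of the proposed fix: predict_proba as a property that
# raises AttributeError when the sub-estimators cannot predict
# probabilities, so hasattr(...) reports the truth.
class LogEstimator:  # hypothetical sub-estimator with predict_proba
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class HingeEstimator:  # hypothetical sub-estimator without predict_proba
    pass


class MultiOutputSketch:
    def __init__(self, estimators):
        self.estimators_ = estimators  # pretend these are already fitted

    @property
    def predict_proba(self):
        if not all(hasattr(e, "predict_proba") for e in self.estimators_):
            raise AttributeError(
                "underlying estimators do not implement predict_proba")

        def _predict_proba(X):
            return [e.predict_proba(X) for e in self.estimators_]

        return _predict_proba


assert hasattr(MultiOutputSketch([LogEstimator()]), "predict_proba")
assert not hasattr(MultiOutputSketch([HingeEstimator()]), "predict_proba")
```

The key design point is that hasattr treats any AttributeError raised during attribute lookup, including one from a property getter, as "attribute absent".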

@reshamas (Member)

@rebekahkim We have scheduled the 2019 WiMLDS sprint for Sunday, Aug 25, if you would like to schedule the work up to and around that date.
cc: @NicolasHug @thomasjpfan

@rebekahkim (Contributor, Author)

@NicolasHug @thomasjpfan Good point!

I'll leave the decision up to the sklearn dev team whether to merge this PR (@jnothman). I'll open a new issue to correct the hasattr behavior; let me do some more digging and testing and look into SVC. Thanks for the pointer!

@jnothman (Member)

We think the right solution here is to mimic what SVC is doing: define predict_proba as a property and raise an AttributeError if predict_proba isn't defined in any of the estimators in self.estimators_, or in self.estimator. This way, hasattr(multioutputclf, 'predict_proba') behaves properly

That's fine. An alternative (not sure which is more elegant): you could use if_delegate_has_method as long as you define

    @property
    def _example_fitted_estimator(self):
        return self.estimators_[0]

    @if_delegate_has_method(['_example_fitted_estimator', 'estimator'])
    def predict_proba(self, X):
        ...
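This delegation idea can likewise be sketched in plain Python (no scikit-learn; class names are hypothetical): looking up predict_proba on the first fitted estimator when available, and on the unfitted estimator otherwise, reproduces the hasattr behaviour the decorator would give:

```python
# Hedged sketch of the delegation alternative: resolve predict_proba on
# the first fitted sub-estimator if fit has run, otherwise on the unfitted
# estimator. An AttributeError on either target propagates out of the
# property, so hasattr(...) stays honest.
class HasProba:  # hypothetical estimator exposing predict_proba
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class NoProba:  # hypothetical estimator without predict_proba
    pass


class DelegatingSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self):
        self.estimators_ = [self.estimator]  # pretend fitting happened
        return self

    @property
    def _example_fitted_estimator(self):
        return self.estimators_[0]

    @property
    def predict_proba(self):
        target = (self._example_fitted_estimator
                  if hasattr(self, "estimators_") else self.estimator)
        return target.predict_proba  # AttributeError -> hasattr is False


assert hasattr(DelegatingSketch(HasProba()), "predict_proba")
assert not hasattr(DelegatingSketch(NoProba()), "predict_proba")
assert hasattr(DelegatingSketch(HasProba()).fit(), "predict_proba")
```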

@jnothman jnothman added this to the 0.21 milestone Apr 17, 2019
@jnothman (Member)

Please add a |Fix| entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:.

@NicolasHug (Member) left a comment:

LGTM otherwise

Review comment on sklearn/tests/test_multioutput.py (outdated, resolved).
Co-Authored-By: rebekahkim <rebekah.kim@columbia.edu>
@rebekahkim (Contributor, Author)

@NicolasHug thanks for the style suggestion. It seems like the codecov/patch check fails because I've made changes to documentation. Can we ignore this?

@NicolasHug (Member)

As far as I can tell the proposed changes are tested so it should be fine.

Merging, thanks @rebekahkim !

@NicolasHug NicolasHug merged commit d903436 into scikit-learn:master Apr 19, 2019
@rebekahkim rebekahkim deleted the moc-predict-proba branch April 20, 2019 02:52
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Successfully merging this pull request may close these issues.

SGDClassifier never has the attribute "predict_proba" (even with log or modified_huber loss)
10 participants