New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] Fix for OvR partial_fit bug #7786

Merged
merged 5 commits into from Nov 5, 2016

Conversation

Projects
None yet
6 participants
@srivatsan-ramesh
Contributor

srivatsan-ramesh commented Oct 29, 2016

Reference Issue

PR #6239 and issue #6203

What does this implement/fix? Explain your changes.

partial_fit() of OvR was not working properly when the mini-batches did not contain all the classes.
The LabelBinarizer.fit() should be called with classes_ parameter and not the y parameter of partial_fit()
And added tests to check the partial_fit() function.

Any other comments?

The PR #6239 did not seem to be the correct fix.

@srivatsan-ramesh srivatsan-ramesh referenced this pull request Oct 29, 2016

Closed

Fix for OvR partial_fit in various edge cases #6239

2 of 3 tasks complete

@srivatsan-ramesh srivatsan-ramesh changed the title from Fix for OvR partial_fit bug to [MRG] Fix for OvR partial_fit bug Oct 29, 2016

@amueller amueller added this to the 0.19 milestone Nov 1, 2016

@amueller amueller added the Bug label Nov 1, 2016

@amueller amueller modified the milestones: 0.18.1, 0.19 Nov 2, 2016

@amueller

This comment has been minimized.

Member

amueller commented Nov 2, 2016

LGTM, thanks :)

@amueller amueller changed the title from [MRG] Fix for OvR partial_fit bug to [MRG + 1] Fix for OvR partial_fit bug Nov 2, 2016

# A new class value which was not in the first call of partial_fit
# It should raise ValueError
y1 = [5] + y[7:-1]
assert_raises(ValueError, ovr.partial_fit, X=X[7:], y=y1)

This comment has been minimized.

@lesteve

lesteve Nov 4, 2016

Member

Can you use assert_raises_regex to check the error message as well?

This comment has been minimized.

@srivatsan-ramesh

This comment has been minimized.

@jnothman

jnothman Nov 5, 2016

Member

but then you can drop this line.

@raghavrv

This comment has been minimized.

Member

raghavrv commented Nov 5, 2016

Travis seems to fail... Could you fix that?

@srivatsan-ramesh

This comment has been minimized.

Contributor

srivatsan-ramesh commented Nov 5, 2016

@raghavrv It says HTTPError : 409 Client Error What do I do?

@raghavrv

This comment has been minimized.

Member

raghavrv commented Nov 5, 2016

I've not looked into it, but I've restarted the build for you... Let's see if it is some spurious error.

@srivatsan-ramesh

This comment has been minimized.

Contributor

srivatsan-ramesh commented Nov 5, 2016

Ok 👍

@srivatsan-ramesh

This comment has been minimized.

Contributor

srivatsan-ramesh commented Nov 5, 2016

@raghavrv Thank you, Travis ran successfully.

@jnothman

Otherwise LGTM

self.label_binarizer_ = LabelBinarizer(sparse_output=True)
self.label_binarizer_.fit(self.classes_)
if not set(self.classes_).issuperset(y):

This comment has been minimized.

@jnothman

jnothman Nov 5, 2016

Member

for large y, iteration may be much slower than using np.setdiff1d..?

This comment has been minimized.

@srivatsan-ramesh

srivatsan-ramesh Nov 5, 2016

Contributor

Yes np.setdiff1d seems to be faster

This comment has been minimized.

@elcombato

elcombato Nov 14, 2016

due to the use of np.setdiff1d it is not possible to partial_fit with a sparse y

assert_equal(np.mean(pred == y), np.mean(pred1 == y))
# Test when mini batches doesn't have all classes
# with SGDClassifier

This comment has been minimized.

@jnothman

jnothman Nov 5, 2016

Member

As a reader of the test, it's not clear why you would want to test with both these base estimators.

Also: can we do it in a loop for clarity that they're the same test?

This comment has been minimized.

@srivatsan-ramesh

srivatsan-ramesh Nov 5, 2016

Contributor

In the previous PR #6239 , the author of the PR was able to create a fix which worked with MultinomialNB but not with SGDClassifier. That's why I added SGDClassifier also. But i think only testing with SGDClassifier is enough, @jnothman ?

This comment has been minimized.

@jnothman

jnothman Nov 5, 2016

Member

Yes, happy for it to just be SGDClassifier

# A new class value which was not in the first call of partial_fit
# It should raise ValueError
y1 = [5] + y[7:-1]
assert_raises(ValueError, ovr.partial_fit, X=X[7:], y=y1)

This comment has been minimized.

@jnothman

jnothman Nov 5, 2016

Member

but then you can drop this line.

@jnothman jnothman merged commit 9fd70a8 into scikit-learn:master Nov 5, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jnothman

This comment has been minimized.

Member

jnothman commented Nov 5, 2016

Thanks, @srivatsan-ramesh

@jnothman

This comment has been minimized.

Member

jnothman commented Nov 5, 2016

Sorry, I forgot to have you write an entry in doc/whats_new.rst Could you please submit a quick PR describing the fix?

@srivatsan-ramesh srivatsan-ramesh deleted the srivatsan-ramesh:dev branch Nov 6, 2016

srivatsan-ramesh added a commit to srivatsan-ramesh/scikit-learn that referenced this pull request Nov 6, 2016

@srivatsan-ramesh

This comment has been minimized.

Contributor

srivatsan-ramesh commented Nov 6, 2016

@jnothman Created a PR!

jnothman added a commit that referenced this pull request Nov 6, 2016

amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016

[MRG+2] Fix for OvR partial_fit bug (scikit-learn#7786)
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes

amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016

@elcombato

This comment has been minimized.

elcombato commented Nov 14, 2016

Due to np.setdiff1d in this line of partial_fit, it is not possible to pass a sparse matrix for y.

Both the commit ed08d38 by @srivatsan-ramesh and the reviewed version by @jnothman 5fd925a throw an error with a sparse y.

Or is there an alternative, if I use partial_fit to perform multilabel classification and pass a sparse matrix of y which is binarized with MultiLabelBinarizer?

if y is sparse check with:

if np.setdiff1d(y.indices, self.classes_)
@lesteve

This comment has been minimized.

Member

lesteve commented Nov 15, 2016

Hmmm the docstring does say that sparse matrices are accepted for y. I am not sure what the scikit-learn convention is in general about whether a sparse y should be accepted in fit and similar functions.

I guess it is not really worth using a sparse matrix for y in general. A work-around is to do y = y.todense() before calling partial_fit.

@jnothman

This comment has been minimized.

Member

jnothman commented Nov 15, 2016

sparse y may be used to represent multilabel problems. It could be accepted
wherever multilabel inputs are, in theory.

On 15 November 2016 at 21:42, Loïc Estève notifications@github.com wrote:

Hmmm the docstring does say that sparse matrices are accepted for y. I am
not sure what the scikit-learn convention is in general about whether a
sparse y should be accepted in fit and similar functions.

I guess it is not really worth using a sparse matrix for y in general. A
work-around is to do y = y.todense() before calling partial_fit.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7786 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz66v7s3jkkIhDLEKLSTkjoaDXpj6_ks5q-YyAgaJpZM4KkLcl
.

@lesteve

This comment has been minimized.

Member

lesteve commented Nov 15, 2016

sparse y may be used to represent multilabel problems. It could be accepted wherever multilabel inputs are, in theory.

I see, it does look like we are missing tests for this then ... it would be good to do something similar to #7590 for multilabel problems, i.e. test that sparse and dense y are giving results that are close to each other.

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG+2] Fix for OvR partial_fit bug (scikit-learn#7786)
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

[MRG+2] Fix for OvR partial_fit bug (scikit-learn#7786)
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+2] Fix for OvR partial_fit bug (scikit-learn#7786)
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+2] Fix for OvR partial_fit bug (scikit-learn#7786)
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+2] Fix for OvR partial_fit bug (scikit-learn#7786)
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+2] Fix for OvR partial_fit bug (scikit-learn#7786)
* mini-batch can now contain less number of classes than actual data

* added tests where mini batches doesn't contain all classes

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment