
[MRG+1] MultiLabelBinarizer ignore unkown class in transform #10913

Merged: 18 commits into scikit-learn:master on May 6, 2018

Conversation

@rragundez (Contributor) commented Apr 3, 2018

Reference Issues/PRs

Fixes #10410

What does this implement/fix? Explain your changes.

As with other transformers, I would expect transform to use only what is picked up/learned during fit (e.g. sklearn.preprocessing.StandardScaler uses the mean and variance learned during fit); to my surprise, this was not the case with MultiLabelBinarizer.

Example:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)
y_test = [['a'], ['b', 'c'], ['d']]
mlb.transform(y_test)
# KeyError: 'd'

My use case: one container trains the model, with a pre-processing step that uses MultiLabelBinarizer, builds the pipeline, and outputs it as a pkl file. Another container picks it up and predicts on new data. To my surprise, the pre-processing pipeline breaks if an unseen label is received. To work around this I had to write an ugly hack using MultiLabelBinarizer's classes_ attribute.
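For illustration, a hypothetical reconstruction of that kind of workaround (not the author's actual code), filtering the incoming labels against the fitted classes_ attribute before calling transform:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([['a'], ['a', 'b'], ['a', 'b', 'c']])
y_test = [['a'], ['b', 'c'], ['d']]

# Drop labels never seen during fit, using the learned classes_ attribute.
known = set(mlb.classes_)
y_test_filtered = [[label for label in labels if label in known]
                   for labels in y_test]
mlb.transform(y_test_filtered)  # no KeyError; 'd' was filtered out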

This change addresses that problem. Now transform uses only the classes seen during fit. If the classes parameter was given at initialization, it uses those. If fit_transform is used, it respects the optimization already in place.

Example with change:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)
y_test = [['a'], ['b', 'c'], ['d']]
mlb.transform(y_test)
# array([[1, 0, 0],
#        [0, 1, 1],
#        [0, 0, 0]])

Any other comments?

Issue #10410, which this change fixes, mentions the idea of adding an ignore_unseen argument. In my opinion this is not a good solution:

  • It does not respect the pattern of other transformers, scalers, etc.
  • It is not intuitive
  • It adds unnecessary complexity

This only modifies the behaviour when using transform after fit,
or if the classes argument was used during initialization.

When using transform, it can come after fit or directly from
fit_transform. In the former case it makes sense to use only the
classes learned during fit. In the latter case all classes
are used.
Previously these tests would check for a KeyError. MultiLabelBinarizer
now ignores unknown classes.
@jnothman (Member) left a comment


I agree that this will be useful for some tasks. I would suggest, though, that giving users an error is useful in many cases. We should ensure the user is informed when their data is problematic.

I would prefer an option handle_unknown='ignore'/'error' (with error being the default).
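For context, the suggested option would look something like this (hypothetical sketch; this parameter was not ultimately adopted on MultiLabelBinarizer):

# Hypothetical API sketched from the suggestion above; MultiLabelBinarizer
# does not actually accept this parameter.
mlb = MultiLabelBinarizer(handle_unknown='ignore')  # or 'error', the default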

for labels in y:
    indices.extend(set(class_mapping[label] for label in labels))
@jnothman (Member) commented:

The entire change here could be:

if class_mapping:
    labels = (label for label in labels if label in class_mapping)

or

if class_mapping:
    labels = filter(class_mapping.__contains__, labels)

@rragundez (Contributor Author) commented Apr 4, 2018

Hi @jnothman, thanks for your comments.
I agree that it is sometimes useful to give the user an error, but I completely disagree that this is one of those cases, as the change I'm proposing only kicks in when the user has EXPLICITLY used fit or given an explicit collection of classes at object initialization.
If you want to go into the realm of "when the data is problematic", it is a lost battle, as there are so many exceptions that it is impossible to catch them all in a simple, pragmatic, and intuitive API.

It is the responsibility of the user to check his/her data, and, for example, use the classes parameter to include all possible classes. Or, if it is just expanding every class on every dataset, then fit_transform will do that. Could you give me a practical example you have used or seen where the solutions I'm proposing, together with fit_transform, don't work? I haven't seen one.

Your code doesn't work. I won't go into the specifics, but class_mapping can be a defaultdict with a default_factory, which means it grows on demand. I looked for other more compact ways to write the code (I especially dislike line 798), but unfortunately I couldn't find one that was simple, easy to understand, worked for both transform and fit_transform, and required changing only one function. But if you can think of one, please be my guest :)
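A minimal sketch of the pitfall described here, assuming the grow-on-demand idiom of assigning default_factory to the mapping's own __len__ (as fit_transform effectively does): a truthiness check on the mapping misbehaves once it has started filling up.

from collections import defaultdict

# A mapping that grows on demand: each new key gets the next index.
class_mapping = defaultdict(int)
class_mapping.default_factory = class_mapping.__len__

print(class_mapping['a'])    # 0 -- the lookup itself creates the entry
print(class_mapping['b'])    # 1
print('c' in class_mapping)  # False -- membership tests do not create keys
print(bool(class_mapping))   # True -- so "if class_mapping:" would now
                             # filter out the still-unseen label 'c'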

@rragundez (Contributor Author) commented Apr 4, 2018

By the way, I think it is a good idea to raise a warning on the missing class; perhaps that addresses your feedback idea.

@jnothman (Member) commented Apr 4, 2018

> It is the responsibility of the user to check his/her data, and, for example, use the classes parameter to include all possible classes.

Or it is the responsibility of the user to clean his/her data to avoid unknowns. But I do see what you mean, in a random-sample context.

I suppose in this context I'm okay with changing the error to a warning. Sometimes you'd be surprised what behaviour users expect to remain backwards compatible, though! Please make that change, and we'll see what other core devs think.

> Your code doesn't work. I won't go into the specifics, but class_mapping can be a defaultdict with a default_factory, which means it grows on demand.

You're right. I meant not empty_mapping, not class_mapping: https://github.com/rragundez/scikit-learn/compare/mlb-unkown-class...jnothman:ignore-unknown

@rragundez (Contributor Author) commented Apr 4, 2018

I'll add the warning as suggested :)

I'll also add your piece of code, thanks!

@rragundez (Contributor Author) commented Apr 4, 2018

@jnothman I've added the warning using the logging module. I think this is quite a nice solution.

labels = set(labels)  # in case labels is a generator
missing = labels - set(class_mapping)
if missing:
    raise Warning("Unknown class(es) found '{0}' "
@jnothman (Member) commented:

Use warnings.warn
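The distinction matters: raise Warning(...) raises an exception and aborts the transform much like the original KeyError, while warnings.warn emits a non-fatal message and lets execution continue. A minimal illustration:

import warnings

try:
    raise Warning("unknown class(es) found")  # stops execution like an error
except Warning as exc:
    print("raised:", exc)

warnings.warn("unknown class(es) found")      # reports, then continues
print("still running")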

@jnothman (Member) left a comment

I wrote and lost a longer response. Perhaps don't report the ignored subset. Better to report each label, or no label at all, because only distinct warning texts will be shown by default.

Now you should probably use a loop over labels with a try ... except KeyError rather than a generator expression. At the moment you're duplicating lookups in the mapping.
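The remark about distinct texts refers to Python's default warning filter, which shows a given message from a given location only once. A small sketch:

import warnings

# Under the default filter, identical warnings from the same location are
# shown once; each distinct message is shown separately.
for label in ['d', 'd', 'e']:
    warnings.warn("unknown class %r will be ignored" % label)
# The warning for 'd' prints once and for 'e' once; the repeat is suppressed.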

@rragundez (Contributor Author) commented:
@jnothman I've added the changes

try:
    index.add(class_mapping[label])
except KeyError:
    warnings.warn("Unknown class {0} found and will be ignored"
@jnothman (Member) commented:

Make that {0!r} to quote the string correctly
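The !r conversion applies repr() to the argument, so strings come out quoted while other values are unchanged:

print("Unknown class {0} will be ignored".format('d'))    # Unknown class d ...
print("Unknown class {0!r} will be ignored".format('d'))  # Unknown class 'd' ...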


mlb = MultiLabelBinarizer(classes=[1, 2])
assert_raises(KeyError, mlb.fit_transform, [[0]])
@jnothman (Member) commented:

Please use assert_warns or a variant

@rragundez (Contributor Author) commented:
@jnothman great feedback. Changes done.

@jnothman (Member) left a comment

Almost there


mlb = MultiLabelBinarizer(classes=[1, 2])
assert_raises(KeyError, mlb.fit_transform, [[0]])
assert_array_equal(mlb.fit(y).transform([[0]]), Y)
@jnothman (Member) commented:

We don't want this raising the warning during testing. You can instead get the return value from assert_warns
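For reference, a sketch of the pattern being asked for (assumed values; sklearn.utils.testing is the import path used at the time, later renamed sklearn.utils._testing): assert_warns calls the function, checks that the warning is raised, and returns the function's return value, so the warning never escapes the test.

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.testing import assert_warns

mlb = MultiLabelBinarizer(classes=[1, 2])
y = [[1, 2]]
# The transform result is returned, so it can still be asserted on:
matrix = assert_warns(UserWarning, mlb.fit(y).transform, [[0]])
np.testing.assert_array_equal(matrix, np.array([[0, 0]]))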

@rragundez (Contributor Author) replied:

done

try:
    index.add(class_mapping[label])
except KeyError:
    warnings.warn("Unknown class {0!r} will be ignored"
@jnothman (Member) commented:

Sorry to fuss over this again, but perhaps it's better still to accumulate all the unknown labels and warn at the end.

@rragundez (Contributor Author) replied Apr 4, 2018

Why? I disagree; there is no internal purpose for accumulating these unknown classes. The only one would be warning in one go, which is no longer necessary with this. The warning is sufficient for the user to go and double-check his/her data.

@jnothman (Member) commented Apr 4, 2018 via email

@rragundez (Contributor Author) commented Apr 4, 2018

@jnothman what actual benefit does that have in comparison to what it does now?

@rragundez (Contributor Author) commented:
@jnothman I made the necessary changes. Do you think it is OK now?

@jnothman (Member) commented Apr 4, 2018 via email

@@ -795,8 +797,17 @@ def _transform(self, y, class_mapping):
     indices = array.array('i')
     indptr = array.array('i', [0])
     for labels in y:
-        indices.extend(set(class_mapping[label] for label in labels))
+        index, unknown = set(), set()
@jnothman (Member) commented:

unknown needs to be initialised out of the loop

    indptr.append(len(indices))
if unknown:
    warnings.warn("Unknown class(es) {0} will be ignored"
                  .format(unknown))
@jnothman (Member) commented:

use sorted for determinism
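Putting the review points together, the resulting loop looks roughly like this (a sketch of the shape of the final code, not a verbatim copy of the merged diff):

import array
import warnings

def _transform_sketch(y, class_mapping):
    # Collect indices for known labels, accumulate unknown labels across
    # all samples (initialised outside the loop), and warn once at the end
    # with a deterministic ordering.
    indices = array.array('i')
    indptr = array.array('i', [0])
    unknown = set()
    for labels in y:
        index = set()
        for label in labels:
            try:
                index.add(class_mapping[label])
            except KeyError:
                unknown.add(label)
        indices.extend(index)
        indptr.append(len(indices))
    if unknown:
        warnings.warn("Unknown class(es) {0} will be ignored"
                      .format(sorted(unknown, key=str)))
    return indices, indptr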

@jnothman (Member) commented Apr 4, 2018

Btw, those last issues could have been caught by tests if you used assert_raises_message

In the case that classes are mixed (e.g. 1, 'some_class'), sorted
will break when comparing different types, hence the addition of
the key=str argument.
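A quick illustration of that commit note: in Python 3, comparing int and str raises a TypeError, while key=str compares their string forms instead:

mixed = [1, 'some_class']
# sorted(mixed)          # TypeError: '<' not supported between str and int
sorted(mixed, key=str)   # [1, 'some_class'] -- compared as '1' < 'some_class'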
@rragundez (Contributor Author) commented:
@jnothman changes made, and good point about the determinism. BTW I tried the assert_raises_message but it seems it works only for Exceptions, is that correct?

@rragundez (Contributor Author) commented:
@jnothman yeap, I just saw the docstring for assert_raises_message: "Helper function to test the message raised in an exception."

@jnothman (Member) commented Apr 7, 2018 via email

@jnothman (Member) left a comment

Otherwise LGTM

@rragundez (Contributor Author) commented:
@jnothman The Travis test failed, no idea why. Any update on the merging of this PR?

@jnothman (Member) commented:
sklearn/preprocessing/tests/test_label.py:17:1: F401 'sklearn.utils.testing.assert_warns' imported but unused

@jnothman (Member) left a comment

I think you should also test the case where classes is provided. And perhaps your test should ensure that non-ignored labels are still encoded, and that one warning message combines the labels ignored from multiple samples.

@jnothman (Member) commented:
Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@rragundez (Contributor Author) commented Apr 11, 2018

@jnothman I did not really understand the request for the test, as I think it was already done here:

mlb = MultiLabelBinarizer(classes=[1, 2])
matrix = assert_warns_message(UserWarning, w,
                              mlb.fit(y).transform, [[0]])
assert_array_equal(matrix, Y)

but I updated the test for multiple unknown classes and the case where classes is given.
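For reference, a self-contained sketch of the behaviour those updated tests cover (assumed values, not the PR's exact test code):

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Unknown labels (0 and 4) appear across multiple samples; known labels
# are still encoded, and one warning lists all ignored classes.
mlb = MultiLabelBinarizer(classes=[1, 2, 3])
mlb.fit([[1, 2]])
matrix = mlb.transform([[4, 1], [2, 0]])  # warns: unknown class(es) ignored
np.testing.assert_array_equal(matrix, np.array([[1, 0, 0], [0, 1, 0]]))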

@jnothman (Member) left a comment

LGTM!

@rragundez (Contributor Author) commented:
@jnothman Should we merge it, or are you waiting for another core developer to do that?

@jnothman (Member) commented:
Yes, we wait for a second review

@jnothman changed the title from "MultiLabelBinarizer ignore unkown class in transform" to "[MRG+1] MultiLabelBinarizer ignore unkown class in transform" on Apr 15, 2018
@qinhanmin2014 (Member) left a comment

LGTM. Thanks @rragundez :)

@qinhanmin2014 merged commit 13c2353 into scikit-learn:master on May 6, 2018
Merging this pull request closed issue #10410: "MultiLabelBinarizer breaks when seeing unseen labels...should there be an option to handle this instead?"