
[MRG+1] MultiLabelBinarizer ignore unkown class in transform #10913

Merged: 18 commits into scikit-learn:master on May 6, 2018

Conversation

@rragundez (Contributor) commented Apr 3, 2018

Reference Issues/PRs

Fixes #10410

What does this implement/fix? Explain your changes.

As with other transformers, I would expect transform to use only what is picked up/learned during fit (e.g. sklearn.preprocessing.StandardScaler uses the mean and variance learned during fit); to my surprise, this was not the case with MultiLabelBinarizer.

Example:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)
y_test = [['a'], ['b', 'c'], ['d']]
mlb.transform(y_test)
# KeyError: 'd'

My use case: one container trains the model, with a pre-processing step that uses MultiLabelBinarizer, builds the pipeline, and outputs it as a pkl file. Another container picks it up and predicts on new data. To my surprise, the pre-processing pipeline breaks if an unseen label is received. To work around this I had to write an ugly hack using MultiLabelBinarizer's classes_ attribute.
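For illustration, a hypothetical reconstruction of that kind of workaround (not the author's actual code), filtering the incoming labels against the fitted classes_ attribute before calling transform:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([['a'], ['a', 'b'], ['a', 'b', 'c']])
y_test = [['a'], ['b', 'c'], ['d']]

# Drop labels never seen during fit, using the learned classes_ attribute.
known = set(mlb.classes_)
y_test_filtered = [[label for label in labels if label in known]
                   for labels in y_test]
mlb.transform(y_test_filtered)  # no KeyError; 'd' was filtered out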

This change addresses that problem. Now transform uses only the classes seen during fit. If the classes parameter was given at initialization, it uses those. If fit_transform is used, it respects the optimization already in place.

Example with change:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)
y_test = [['a'], ['b', 'c'], ['d']]
mlb.transform(y_test)
# array([[1, 0, 0],
#        [0, 1, 1],
#        [0, 0, 0]])

Any other comments?

Issue #10410, which this change fixes, mentions the idea of adding an ignore_unseen argument. In my opinion this is not a good solution:

  • It does not respect the pattern of other transformers, scalers, etc.
  • It is not intuitive
  • It adds unnecessary complexity

This only modifies the behaviour when using transform after fit,
or if the classes argument was used during initialization.

When using transform, it can come after fit or directly from
fit_transform. In the former case it makes sense to use only the
classes learned during fit. In the latter case all classes
are used.
Previously these tests would check for a KeyError. MultiLabelBinarizer
now ignores unknown classes.
@jnothman (Member) left a comment


I agree that this will be useful for some tasks. I would suggest, though, that giving users an error is useful in many cases. We should ensure the user is informed when their data is problematic.

I would prefer an option handle_unknown='ignore'/'error' (with error being the default).
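For context, the suggested option would look something like this (hypothetical sketch; this parameter was not ultimately adopted on MultiLabelBinarizer):

# Hypothetical API sketched from the suggestion above; MultiLabelBinarizer
# does not actually accept this parameter.
mlb = MultiLabelBinarizer(handle_unknown='ignore')  # or 'error', the default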

for labels in y:
    indices.extend(set(class_mapping[label] for label in labels))
@jnothman (Member) commented:

The entire change here could be:

if class_mapping:
    labels = (label for label in labels if label in class_mapping)

or

if class_mapping:
    labels = filter(class_mapping.__contains__, labels)

@rragundez (Contributor Author) commented Apr 4, 2018

Hi @jnothman, thanks for your comments.
I agree that it is sometimes useful to give the user an error, but I completely disagree that this is one of those cases, as the change I'm proposing only kicks in when the user has EXPLICITLY used fit or given an explicit collection of classes at object initialization.
If you want to go into the realm of "when the data is problematic", it is a lost battle, as there are so many exceptions that it is impossible to catch them all in a simple, pragmatic, and intuitive API.

It is the responsibility of the user to check his/her data, and, for example, use the classes parameter to include all possible classes. Or, if it is just expanding every class on every dataset, then fit_transform will do that. Could you give me a practical example you have used or seen where the solutions I'm proposing, together with fit_transform, don't work? I haven't seen one.

Your code doesn't work. I won't go into the specifics, but class_mapping can be a defaultdict with a default_factory, which means it grows on demand. I looked for other more compact ways to write the code (I especially dislike line 798), but unfortunately I couldn't find one that was simple, easy to understand, worked for both transform and fit_transform, and required changing only one function. But if you can think of one, please be my guest :)
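A minimal sketch of the pitfall described here, assuming the grow-on-demand idiom of assigning default_factory to the mapping's own __len__ (as fit_transform effectively does): a truthiness check on the mapping misbehaves once it has started filling up.

from collections import defaultdict

# A mapping that grows on demand: each new key gets the next index.
class_mapping = defaultdict(int)
class_mapping.default_factory = class_mapping.__len__

print(class_mapping['a'])    # 0 -- the lookup itself creates the entry
print(class_mapping['b'])    # 1
print('c' in class_mapping)  # False -- membership tests do not create keys
print(bool(class_mapping))   # True -- so "if class_mapping:" would now
                             # filter out the still-unseen label 'c'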

@rragundez (Contributor Author) commented Apr 4, 2018

By the way, I think it is a good idea to raise a warning on the missing class; perhaps that addresses your feedback idea.

@jnothman (Member) commented Apr 4, 2018

> It is the responsibility of the user to check his/her data, and, for example, use the classes parameter to include all possible classes.

Or it is the responsibility of the user to clean his/her data to avoid unknowns. But I do see what you mean, in a random-sample context.

I suppose in this context I'm okay with changing the error to a warning. Sometimes you'd be surprised what behaviour users expect to remain backwards compatible, though! Please make that change, and we'll see what other core devs think.

> Your code doesn't work. I won't go into the specifics, but class_mapping can be a defaultdict with a default_factory, which means it grows on demand.

You're right. I meant not empty_mapping, not class_mapping: https://github.com/rragundez/scikit-learn/compare/mlb-unkown-class...jnothman:ignore-unknown

@rragundez (Contributor Author) commented Apr 4, 2018

I'll add the warning as suggested :)

I'll also add your piece of code, thanks!

@rragundez (Contributor Author) commented Apr 4, 2018

@jnothman I've added the warning using the logging module. I think this is quite a nice solution.

labels = set(labels)  # in case labels is a generator
missing = labels - set(class_mapping)
if missing:
    raise Warning("Unknown class(es) found '{0}' "
@jnothman (Member) commented:

Use warnings.warn
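The distinction matters: raise Warning(...) raises an exception and aborts the transform much like the original KeyError, while warnings.warn emits a non-fatal message and lets execution continue. A minimal illustration:

import warnings

try:
    raise Warning("unknown class(es) found")  # stops execution like an error
except Warning as exc:
    print("raised:", exc)

warnings.warn("unknown class(es) found")      # reports, then continues
print("still running")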

@jnothman (Member) left a comment

I wrote and lost a longer response. Perhaps don't report the ignored subset. Better to report each label, or no label at all, because only distinct warning texts will be shown by default.

Now you should probably use a loop over labels with a try ... except KeyError rather than a generator expression. At the moment you're duplicating lookups in the mapping.
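The remark about distinct texts refers to Python's default warning filter, which shows a given message from a given location only once. A small sketch:

import warnings

# Under the default filter, identical warnings from the same location are
# shown once; each distinct message is shown separately.
for label in ['d', 'd', 'e']:
    warnings.warn("unknown class %r will be ignored" % label)
# The warning for 'd' prints once and for 'e' once; the repeat is suppressed.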

@rragundez (Contributor Author) commented:
@jnothman I've added the changes

try:
    index.add(class_mapping[label])
except KeyError:
    warnings.warn("Unknown class {0} found and will be ignored"
@jnothman (Member) commented:

Make that {0!r} to quote the string correctly
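The !r conversion applies repr() to the argument, so strings come out quoted while other values are unchanged:

print("Unknown class {0} will be ignored".format('d'))    # Unknown class d ...
print("Unknown class {0!r} will be ignored".format('d'))  # Unknown class 'd' ...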


mlb = MultiLabelBinarizer(classes=[1, 2])
assert_raises(KeyError, mlb.fit_transform, [[0]])
@jnothman (Member) commented:

Please use assert_warns or a variant

@rragundez (Contributor Author) commented:
@jnothman great feedback. Changes done.

@jnothman (Member) left a comment

Almost there


mlb = MultiLabelBinarizer(classes=[1, 2])
assert_raises(KeyError, mlb.fit_transform, [[0]])
assert_array_equal(mlb.fit(y).transform([[0]]), Y)
@jnothman (Member) commented:

We don't want this raising the warning during testing. You can instead get the return value from assert_warns
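For reference, a sketch of the pattern being asked for (assumed values; sklearn.utils.testing is the import path used at the time, later renamed sklearn.utils._testing): assert_warns calls the function, checks that the warning is raised, and returns the function's return value, so the warning never escapes the test.

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.testing import assert_warns

mlb = MultiLabelBinarizer(classes=[1, 2])
y = [[1, 2]]
# The transform result is returned, so it can still be asserted on:
matrix = assert_warns(UserWarning, mlb.fit(y).transform, [[0]])
np.testing.assert_array_equal(matrix, np.array([[0, 0]]))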

@rragundez (Contributor Author) replied:

done

try:
    index.add(class_mapping[label])
except KeyError:
    warnings.warn("Unknown class {0!r} will be ignored"
@jnothman (Member) commented:

Sorry to fuss over this again, but perhaps it's better still to accumulate all the unknown labels and warn at the end.

@rragundez (Contributor Author) replied Apr 4, 2018

Why? I disagree; there is no internal purpose for accumulating these unknown classes. The only one would be warning in one go, which is no longer necessary with this. The warning is sufficient for the user to go and double-check his/her data.

@jnothman (Member) commented Apr 4, 2018 via email

@rragundez (Contributor Author) commented Apr 4, 2018

@jnothman what actual benefit does that have in comparison to what it does now?

@rragundez (Contributor Author) commented:
@jnothman I made the necessary changes. Do you think it is OK now?

@jnothman (Member) commented Apr 4, 2018 via email

@@ -795,8 +797,17 @@ def _transform(self, y, class_mapping):
     indices = array.array('i')
     indptr = array.array('i', [0])
     for labels in y:
-        indices.extend(set(class_mapping[label] for label in labels))
+        index, unknown = set(), set()
@jnothman (Member) commented:

unknown needs to be initialised out of the loop

    indptr.append(len(indices))
if unknown:
    warnings.warn("Unknown class(es) {0} will be ignored"
                  .format(unknown))
@jnothman (Member) commented:

use sorted for determinism
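Putting the review points together, the resulting loop looks roughly like this (a sketch of the shape of the final code, not a verbatim copy of the merged diff):

import array
import warnings

def _transform_sketch(y, class_mapping):
    # Collect indices for known labels, accumulate unknown labels across
    # all samples (initialised outside the loop), and warn once at the end
    # with a deterministic ordering.
    indices = array.array('i')
    indptr = array.array('i', [0])
    unknown = set()
    for labels in y:
        index = set()
        for label in labels:
            try:
                index.add(class_mapping[label])
            except KeyError:
                unknown.add(label)
        indices.extend(index)
        indptr.append(len(indices))
    if unknown:
        warnings.warn("Unknown class(es) {0} will be ignored"
                      .format(sorted(unknown, key=str)))
    return indices, indptr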

@jnothman (Member) commented Apr 4, 2018

Btw, those last issues could have been caught by tests if you used assert_raises_message

In the case that classes are mixed (e.g. 1, 'some_class'), sorted
will break when comparing different types, hence the addition of
the key=str argument.
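A quick illustration of that commit note: in Python 3, comparing int and str raises a TypeError, while key=str compares their string forms instead:

mixed = [1, 'some_class']
# sorted(mixed)          # TypeError: '<' not supported between str and int
sorted(mixed, key=str)   # [1, 'some_class'] -- compared as '1' < 'some_class'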
@rragundez (Contributor Author) commented:
@jnothman changes made, and good point about the determinism. BTW I tried the assert_raises_message but it seems it works only for Exceptions, is that correct?

@rragundez (Contributor Author) commented:
@jnothman yeap, I just saw the docstring for assert_raises_message: "Helper function to test the message raised in an exception."

@jnothman (Member) commented Apr 7, 2018 via email

@jnothman (Member) left a comment

Otherwise LGTM

@rragundez (Contributor Author) commented:
@jnothman The Travis test failed, no idea why. Any update on the merging of this PR?

@jnothman (Member) commented:
sklearn/preprocessing/tests/test_label.py:17:1: F401 'sklearn.utils.testing.assert_warns' imported but unused

@jnothman (Member) left a comment

I think you should also test the case where classes is provided. And perhaps your test should ensure that non-ignored labels are still encoded, and that one warning message combines the labels ignored from multiple samples.

@jnothman (Member) commented:
Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@rragundez (Contributor Author) commented Apr 11, 2018

@jnothman I did not really understand the request for the test, as I think it was already done here:

mlb = MultiLabelBinarizer(classes=[1, 2])
matrix = assert_warns_message(UserWarning, w,
                              mlb.fit(y).transform, [[0]])
assert_array_equal(matrix, Y)

but I updated the test for multiple unknown classes and the case where classes is given.
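For reference, a self-contained sketch of the behaviour those updated tests cover (assumed values, not the PR's exact test code):

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Unknown labels (0 and 4) appear across multiple samples; known labels
# are still encoded, and one warning lists all ignored classes.
mlb = MultiLabelBinarizer(classes=[1, 2, 3])
mlb.fit([[1, 2]])
matrix = mlb.transform([[4, 1], [2, 0]])  # warns: unknown class(es) ignored
np.testing.assert_array_equal(matrix, np.array([[1, 0, 0], [0, 1, 0]]))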

@jnothman (Member) left a comment

LGTM!

@rragundez (Contributor Author) commented:
@jnothman Should we merge it, or are you waiting for another core developer to do that?

@jnothman (Member) commented:
Yes, we wait for a second review

@jnothman changed the title from "MultiLabelBinarizer ignore unkown class in transform" to "[MRG+1] MultiLabelBinarizer ignore unkown class in transform" on Apr 15, 2018
@qinhanmin2014 (Member) left a comment

LGTM. Thanks @rragundez :)

@qinhanmin2014 merged commit 13c2353 into scikit-learn:master on May 6, 2018
Merging this pull request closed issue #10410: "MultiLabelBinarizer breaks when seeing unseen labels...should there be an option to handle this instead?"