
MultiLabelBinarizer breaks when seeing unseen labels...should there be an option to handle this instead? #10410

Closed
pntemi opened this issue Jan 5, 2018 · 16 comments · Fixed by #10913
Labels
Easy Well-defined and straightforward way to resolve Enhancement

Comments

@pntemi

pntemi commented Jan 5, 2018

Description

I am not sure whether MultiLabelBinarizer is intended to fit and transform only seen data.

However, it is often impossible, or not in our interest, to know at training time every class we will encounter. For convenience, I am wondering if there should be another parameter that lets us ignore unseen classes by simply encoding them as 0?

Proposed Modification

Example:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(ignore_unseen=True)

y_train = [['a'],['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)

y_test = [['a'],['b'],['d']]
mlb.transform(y_test)

Result:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0]])

(the current version, 0.19.0, raises KeyError: 'd')

I can open a PR for this if this is desired behavior.

Others have also run into this issue:
https://stackoverflow.com/questions/31503874/using-multilabelbinarizer-on-test-data-with-labels-not-in-the-training-set
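A workaround that already works on 0.19.x is to drop unseen labels before calling transform. A minimal sketch of the idea, using the same toy data as above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)

# Keep only labels the binarizer saw during fit; 'd' is dropped,
# so its row comes out as all zeros instead of raising KeyError.
known = set(mlb.classes_)
y_test = [['a'], ['b'], ['d']]
y_test_filtered = [[label for label in labels if label in known]
                   for labels in y_test]
result = mlb.transform(y_test_filtered)
print(result)
```

The filtering step is exactly the extra code and compute that an ignore_unseen option would make unnecessary.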

@jnothman
Member

jnothman commented Jan 6, 2018 via email

@jnothman jnothman added Easy Well-defined and straightforward way to resolve Enhancement labels Jan 6, 2018
@jnothman
Member

jnothman commented Jan 7, 2018 via email

@pntemi
Author

pntemi commented Jan 8, 2018

OK. I'm taking this then.

@mohdsanadzakirizvi

If no one else is working on this, I'd like to take up this issue.

@samupra

samupra commented Jan 24, 2018

@mohdsanadzakirizvi looks like the OP said they will deliver

@pntemi
Author

pntemi commented Jan 24, 2018

@mohdsanadzakirizvi Hey, sorry for not having much of an update recently. I've started working on it, though, so I guess I will continue.

@austinlostinboston

@pntemi @rragundez hey all, any update on adding this flag to sklearn.preprocessing.MultiLabelBinarizer()? I just checked the source code and there's still no mention of it there.

@amueller
Member

@austinlostinboston if you want to apply this to data, you should use OneHotEncoder.

@mkulariya

@austinlostinboston it is done; upgrade to scikit-learn 0.20:
http://scikit-learn.org/stable/whats_new.html#version-0-20
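Since 0.20, transform no longer raises on unseen labels: the unknown class is ignored (encoded as zeros) and a UserWarning is emitted instead. A quick check of that behavior:

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([['a'], ['a', 'b'], ['a', 'b', 'c']])

# Record warnings instead of printing them, so we can inspect what fired.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = mlb.transform([['a'], ['b'], ['d']])

print(result)        # the row for ['d'] is all zeros
print(len(caught))   # the "unknown class(es)" warning was recorded
```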

@Luttik

Luttik commented Apr 9, 2020

I would still really like an option to mute these warnings.

Ignoring unknown classes is expected behavior, especially when you provide the classes manually, so it would be better to emit no warning at all.

Additionally, it makes no sense to spend code and compute on cleaning the data before calling the MultiLabelBinarizer.

And warnings that we cannot mute are exceptionally annoying in a production environment.

@jnothman
Member

jnothman commented Apr 11, 2020 via email

@Luttik

Luttik commented Apr 11, 2020

I'd argue that muting warnings is very bad practice: you don't want true warnings to overlap with warnings for expected behavior.

This is especially true when you are developing a package that depends on sklearn. You don't want side effects like muted warnings leaking out of your package.

@jnothman
Member

jnothman commented Apr 11, 2020 via email

@Luttik

Luttik commented Apr 11, 2020

I am fully aware of that, but it is beside the point. If you want to use the MultiLabelBinarizer in e.g. a library, you are still in the position of having to either:

  • filter the data before feeding it to the binarizer just to prevent those warnings (a waste of compute and added code complexity), or
  • introduce unexpected side effects (which is always bad), where the side effect is either:
    • a needless warning, prompting users to think something is wrong when nothing worth their attention is happening, or
    • muting warnings the library's users might not be aware of. Note that even though this warning makes no sense for the MultiLabelBinarizer (when classes are passed in the constructor), it might make sense in other scenarios, so you do not want to mute it globally.
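Until a constructor option exists, the least-invasive way to silence only this warning, without leaking a global filter out of a library, is a scoped catch_warnings block. A sketch, matching on the message text used by current scikit-learn releases:

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=['a', 'b', 'c'])
mlb.fit([['a'], ['b'], ['c']])

# The filter lives only inside this block, so callers of a library
# using this pattern keep their own warning configuration untouched.
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', message='unknown class')
    result = mlb.transform([['a'], ['d']])
print(result)
```

This is still the per-call boilerplate the comment above objects to, but it at least avoids the global side effect.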

@dhorkel

dhorkel commented Jun 2, 2020

It seems inconsistent for MultiLabelBinarizer to lack an ignore option when OneHotEncoder has the handle_unknown option. Conditionally muting warnings is bad practice.

It seems like it would just be a matter of adding the option and changing this line:

to if unknown and handle_unknown == 'error':
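The suggestion above could also be prototyped outside scikit-learn as a thin subclass. A hypothetical sketch (SafeMultiLabelBinarizer and its handle_unknown parameter are illustrative names, not part of the library):

```python
from sklearn.preprocessing import MultiLabelBinarizer

class SafeMultiLabelBinarizer(MultiLabelBinarizer):
    """Hypothetical MultiLabelBinarizer with a handle_unknown option.

    handle_unknown='warn' keeps the stock behavior (ignore + UserWarning);
    handle_unknown='error' restores the pre-0.20 hard failure.
    """

    def __init__(self, classes=None, sparse_output=False,
                 handle_unknown='warn'):
        super().__init__(classes=classes, sparse_output=sparse_output)
        self.handle_unknown = handle_unknown

    def transform(self, y):
        if self.handle_unknown == 'error':
            # Raise on any label not learned during fit.
            known = set(self.classes_)
            unknown = {label for labels in y for label in labels} - known
            if unknown:
                raise KeyError(f'unknown class(es) {unknown}')
        return super().transform(y)
```

With handle_unknown='error', transforming [['d']] after fitting on a/b/c raises KeyError; with the default 'warn', the call falls through to the parent class's ignore-and-warn behavior.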

@jnothman
Member

jnothman commented Jun 3, 2020

Shrug. If you submit a PR, it might result in a merge.

9 participants