
MultiLabelBinarizer breaks when seeing unseen labels...should there be an option to handle this instead? #10410

Closed
pntemi opened this issue Jan 5, 2018 · 16 comments · Fixed by #10913
Labels
Easy Well-defined and straightforward way to resolve Enhancement

Comments

@pntemi

pntemi commented Jan 5, 2018

Description

I am not sure whether MultiLabelBinarizer is intended to fit and transform only seen data.

However, it is often impossible, or not in our interest, to know at training time every class we will encounter. For convenience, I am wondering if there should be another parameter that lets us ignore unseen classes by simply encoding them as 0?

Proposed Modification

Example:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(ignore_unseen=True)

y_train = [['a'],['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)

y_test = [['a'],['b'],['d']]
mlb.transform(y_test)

Result:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0]])

(the current version, 0.19.0, raises KeyError: 'd')

I can open a PR for this if this is desired behavior.

Others have also run into this issue:
https://stackoverflow.com/questions/31503874/using-multilabelbinarizer-on-test-data-with-labels-not-in-the-training-set
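A workaround that already works on 0.19.x is to drop unseen labels before calling transform. A minimal sketch of the idea, using the same toy data as above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)

# Keep only labels the binarizer saw during fit; 'd' is dropped,
# so its row comes out as all zeros instead of raising KeyError.
known = set(mlb.classes_)
y_test = [['a'], ['b'], ['d']]
y_test_filtered = [[label for label in labels if label in known]
                   for labels in y_test]
result = mlb.transform(y_test_filtered)
print(result)
```

The filtering step is exactly the extra code and compute that an ignore_unseen option would make unnecessary.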

@jnothman
Member

jnothman commented Jan 6, 2018 via email

@jnothman jnothman added Easy Well-defined and straightforward way to resolve Enhancement labels Jan 6, 2018
@jnothman
Member

jnothman commented Jan 7, 2018 via email

@pntemi
Author

pntemi commented Jan 8, 2018

OK. I'm taking this then.

@mohdsanadzakirizvi

If no one else is working on this, I'd like to take up this issue.

@samupra

samupra commented Jan 24, 2018

@mohdsanadzakirizvi looks like the OP said they will deliver

@pntemi
Author

pntemi commented Jan 24, 2018

@mohdsanadzakirizvi Hey, sorry for not having much of an update recently. I've started working on it, though, so I guess I will continue.

@austinlostinboston

@pntemi @rragundez hey all, any update on adding this flag to sklearn.preprocessing.MultiLabelBinarizer()? I just checked the source code and there's still no mention of it there.

@amueller
Member

@austinlostinboston if you want to apply this to data, you should use OneHotEncoder.

@mkulariya

@austinlostinboston it is done; upgrade to scikit-learn 0.20:
http://scikit-learn.org/stable/whats_new.html#version-0-20
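Since 0.20, transform no longer raises on unseen labels: the unknown class is ignored (encoded as zeros) and a UserWarning is emitted instead. A quick check of that behavior:

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([['a'], ['a', 'b'], ['a', 'b', 'c']])

# Record warnings instead of printing them, so we can inspect what fired.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = mlb.transform([['a'], ['b'], ['d']])

print(result)        # the row for ['d'] is all zeros
print(len(caught))   # the "unknown class(es)" warning was recorded
```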

@Luttik

Luttik commented Apr 9, 2020

I would still really like an option to mute these warnings.

Ignoring unknown classes is expected behavior, especially when you provide the classes manually, so it would be better to emit no warning at all.

Additionally, it makes no sense to spend code and compute on cleaning the data before calling the MultiLabelBinarizer.

And warnings that we cannot mute are exceptionally annoying in a production environment.

@jnothman
Member

jnothman commented Apr 11, 2020 via email

@Luttik

Luttik commented Apr 11, 2020

I'd argue that muting warnings is very bad practice: you don't want true warnings to overlap with warnings for expected behavior.

This is especially true when you are developing a package that depends on sklearn. You don't want side effects like muted warnings leaking out of your package.

@jnothman
Member

jnothman commented Apr 11, 2020 via email

@Luttik

Luttik commented Apr 11, 2020

I am fully aware of that, but it is beside the point. If you want to use the MultiLabelBinarizer in e.g. a library, you are still in the position of having to either:

  • filter the data before feeding it to the binarizer just to prevent those warnings (a waste of compute and added code complexity), or
  • introduce unexpected side effects (which is always bad), where the side effect is either:
    • a needless warning, prompting users to think something is wrong when nothing worth their attention is happening, or
    • muting warnings the library's users might not be aware of. Note that even though this warning makes no sense for the MultiLabelBinarizer (when classes are passed in the constructor), it might make sense in other scenarios, so you do not want to mute it globally.
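Until a constructor option exists, the least-invasive way to silence only this warning, without leaking a global filter out of a library, is a scoped catch_warnings block. A sketch, matching on the message text used by current scikit-learn releases:

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=['a', 'b', 'c'])
mlb.fit([['a'], ['b'], ['c']])

# The filter lives only inside this block, so callers of a library
# using this pattern keep their own warning configuration untouched.
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', message='unknown class')
    result = mlb.transform([['a'], ['d']])
print(result)
```

This is still the per-call boilerplate the comment above objects to, but it at least avoids the global side effect.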

@dhorkel

dhorkel commented Jun 2, 2020

It seems inconsistent for MultiLabelBinarizer to lack an ignore option when OneHotEncoder has the handle_unknown option. Conditionally muting warnings is bad practice.

It seems like it would just be a matter of adding the option and changing this line:

to if unknown and handle_unknown == 'error':
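The suggestion above could also be prototyped outside scikit-learn as a thin subclass. A hypothetical sketch (SafeMultiLabelBinarizer and its handle_unknown parameter are illustrative names, not part of the library):

```python
from sklearn.preprocessing import MultiLabelBinarizer

class SafeMultiLabelBinarizer(MultiLabelBinarizer):
    """Hypothetical MultiLabelBinarizer with a handle_unknown option.

    handle_unknown='warn' keeps the stock behavior (ignore + UserWarning);
    handle_unknown='error' restores the pre-0.20 hard failure.
    """

    def __init__(self, classes=None, sparse_output=False,
                 handle_unknown='warn'):
        super().__init__(classes=classes, sparse_output=sparse_output)
        self.handle_unknown = handle_unknown

    def transform(self, y):
        if self.handle_unknown == 'error':
            # Raise on any label not learned during fit.
            known = set(self.classes_)
            unknown = {label for labels in y for label in labels} - known
            if unknown:
                raise KeyError(f'unknown class(es) {unknown}')
        return super().transform(y)
```

With handle_unknown='error', transforming [['d']] after fitting on a/b/c raises KeyError; with the default 'warn', the call falls through to the parent class's ignore-and-warn behavior.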

@jnothman
Member

jnothman commented Jun 3, 2020

Shrug. If you submit a PR, it might result in a merge.

9 participants