MultiLabelBinarizer breaks when seeing unseen labels...should there be an option to handle this instead? #10410
Yes, I suppose such a setting would be useful.
On 6 January 2018 at 06:13, Ploy Temiyasathit wrote:
Description
I am not sure whether MultiLabelBinarizer is intended to fit and transform only seen data.
However, there are many cases where it is not possible, or not in our interest, to know all of the classes we are fitting on at training time.
For convenience, I am wondering whether there should be another parameter that allows us to ignore unseen classes by simply setting them to 0.
Proposed Modification
Example:
from sklearn.preprocessing import MultiLabelBinarizer

# ignore_unseen is the proposed (not yet existing) parameter
mlb = MultiLabelBinarizer(ignore_unseen=True)
y_train = [['a'], ['a', 'b'], ['a', 'b', 'c']]
mlb.fit(y_train)
y_test = [['a'], ['b'], ['d']]
mlb.transform(y_test)
Result:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 0]])
(The current version, 0.19.0, raises KeyError: 'd'.)
I can open a PR for this if it is desired behavior.
Others have also run into this issue:
https://stackoverflow.com/questions/31503874/using-multilabelbinarizer-on-test-data-with-labels-not-in-the-training-set
The original poster stated they would like to submit a PR, so let's wait.

OK. I'm taking this then.

If no one is working on this, I'd like to take up the issue.

@mohdsanadzakirizvi looks like the OP said they will deliver.

@mohdsanadzakirizvi Hey, sorry for not having much of an update recently. I've started working on it, though, so I will continue.

@pntemi @rragundez hey all, any update on adding this flag to MultiLabelBinarizer?

@austinlostinboston if you want to apply this to data, you should use OneHotEncoder.

@austinlostinboston it is done; upgrade to sklearn 0.20.

I would still really like an option to mute these warnings. Ignoring unknown classes is expected behavior, especially when you manually provide the classes, so no warning at all would be better than any in the first place. Additionally, it makes no sense to spend code and compute cleaning the data before calling the MultiLabelBinarizer, and warnings that we cannot mute are exceptionally annoying in a production environment.
All warnings can be muted with the Python warnings module.
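A minimal, stdlib-only sketch of that suggestion. The warning below is a stand-in for the one MultiLabelBinarizer emits on unseen classes; the muting is scoped with warnings.catch_warnings so filters are restored afterwards:

```python
import warnings

def binarize_like():
    # Stand-in for MultiLabelBinarizer.transform warning on unseen classes.
    warnings.warn("unknown class(es) ['d'] will be ignored", UserWarning)
    return [[1, 0, 0], [0, 1, 0], [0, 0, 0]]

# Suppress UserWarning only inside this block; outside it, filters are untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = binarize_like()

print(result[2])  # the row for the unseen label 'd' is all zeros
```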
I'd argue that muting warnings is very bad practice: you don't want true warnings to overlap with warnings for expected behavior. This is especially true when you are developing a package that depends on sklearn; you don't want side effects like muted warnings leaking from your package.
You can mute warnings filtered on message text.
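A short sketch of that approach: warnings.filterwarnings matches a regex against the start of the message, so only the expected warning is muted and unrelated ones still surface (the warning texts here are illustrative):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # show everything...
    # ...except warnings whose message starts with this pattern.
    warnings.filterwarnings("ignore", message=r"unknown class")
    warnings.warn("unknown class(es) ['d'] will be ignored", UserWarning)
    warnings.warn("something genuinely wrong", UserWarning)

print(len(caught))  # prints 1: only the unrelated warning was recorded
```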
I am fully aware of that, but it is beside the point. If you want to use the MultiLabelBinarizer in e.g. a library, you're still in the position that you have to either:
It seems inconsistent not to have an ignore option in MultiLabelBinarizer when OneHotEncoder has the handle_unknown option. Seems like it would just be a matter of adding the option and changing this line:
scikit-learn/sklearn/preprocessing/_label.py Line 993 in fd23727
to if unknown and handle_unknown == 'error':
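A dependency-free sketch of the semantics that change would enable. This is not scikit-learn's actual implementation; binarize and its parameters are illustrative names standing in for the internals of MultiLabelBinarizer.transform:

```python
def binarize(y, classes, handle_unknown="error"):
    """Binarize label sets; unknown labels either raise or are dropped."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in y:
        row = [0] * len(classes)
        for label in labels:
            if label in index:
                row[index[label]] = 1
            elif handle_unknown == "error":
                raise KeyError(label)  # analogous to the 0.19.0 behavior
        rows.append(row)
    return rows

print(binarize([['a'], ['b'], ['d']], ['a', 'b', 'c'], handle_unknown="ignore"))
# [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
```

With handle_unknown="error" the unseen label 'd' would raise KeyError instead, matching the behavior described in the issue.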
Shrug. If you submit a PR, it might result in a merge.