[MRG+1] MultiLabelBinarizer ignore unkown class in transform #10913
Conversation
This only modifies the behaviour of transform after fit has been used, or when the classes argument was given at initialization. transform can be called after fit, or implicitly via fit_transform. In the former case it makes sense to use only the classes learned during fit; in the latter case all classes are used.
Previously these tests checked for a KeyError. MultiLabelBinarizer now ignores unknown classes.
I agree that this will be useful for some tasks. I would suggest, though, that giving users an error is useful in many cases. We should ensure the user is informed when their data is problematic.
I would prefer an option handle_unknown='ignore'/'error' (with 'error' being the default).
sklearn/preprocessing/label.py
Outdated
for labels in y:
    indices.extend(set(class_mapping[label] for label in labels))
The entire change here could be:
if class_mapping:
    labels = (label for label in labels if label in class_mapping)
or
if class_mapping:
    labels = filter(class_mapping.__contains__, labels)
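To see the suggested filtering in isolation, here is a standalone sketch (not the scikit-learn source; class_mapping here is a plain dict stand-in for the fitted mapping):

```python
# Standalone sketch of the suggested change: labels absent from the
# mapping are silently dropped before index lookup.
class_mapping = {"a": 0, "b": 1}
labels = ["a", "x", "b"]

if class_mapping:
    labels = [label for label in labels if label in class_mapping]

print(labels)  # ['a', 'b'] -- 'x' is ignored
```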
Hi @jnothman thanks for your comments.
I agree that it is sometimes useful to give the user an error, but I completely disagree that this is one of those cases, as the change I'm proposing only kicks in when the user has EXPLICITLY used fit or given an explicit collection of classes at object initialization.
If you want to go into the realm of "when the data is problematic", it is a lost battle: there are so many exceptions that it is impossible to catch them all in a simple, pragmatic, and intuitive API. It is the responsibility of the user to check his/her data, and for example use the classes parameter to include all possible classes. Or, if the intent is just to expand to every class in every dataset, then fit_transform will do that. Could you give me a practical example, one you have used or seen, where the solutions I'm proposing together with fit_transform don't work? I haven't seen one.
Your code doesn't work. I won't go into the specifics, but class_mapping can be a defaultdict with a default_factory, which means it grows on demand. I looked for other, more compact ways to write the code (I especially dislike line 798), but unfortunately I couldn't find one that was simple, easy to understand, worked for both transform and fit_transform, and required changing only one function. But if you can think of one, please be my guest :)
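The defaultdict objection can be reproduced in a small sketch (an illustration of the grow-on-demand behaviour described above, not the actual scikit-learn internals):

```python
from collections import defaultdict
from itertools import count

# A fit_transform-style mapping: an unseen label gets a fresh index on
# first lookup, so a membership test like `label in class_mapping`
# would change behaviour depending on lookup order.
class_mapping = defaultdict(count().__next__)
print(class_mapping["a"])        # 0 -- created on first access
print(class_mapping["b"])        # 1
print("a" in class_mapping)      # True, but only because the lookup inserted it
```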
I think it is a good idea to raise a warning on the missing class, by the way; perhaps with that we address your feedback.
It is the responsibility of the user to check his/her data, and for example use the classes parameter to include all possible classes.
Or it is the responsibility of the user to clean his/her data to avoid unknowns. But I do see what you mean, in a random sample context.
I suppose in this context I'm okay with changing the error to a warning. Sometimes you'd be surprised what behaviour users expect to remain backwards compatible, though! Please make that change, and we'll see what other core devs think.
Your code doesn't work. I won't go to the specifics but class_mapping can be a defaultdict with a default_factory, which means it increases on demand.
You're right. I meant empty_mapping, not class_mapping: https://github.com/rragundez/scikit-learn/compare/mlb-unkown-class...jnothman:ignore-unknown
I'll add the warning as suggested :)
I'll also add your piece of code, thanks!
@jnothman I've added the warning using the logging module. I think this is quite a nice solution.
sklearn/preprocessing/label.py
Outdated
labels = set(labels)  # in case labels is a generator
missing = labels - set(class_mapping)
if missing:
    raise Warning("Unknown class(es) found '{0}' "
Use warnings.warn
I wrote and lost a longer response. Perhaps don't report the subset ignored; better to warn per label, or with no label at all, because only distinct warning texts are shown by default.
Now you should probably use a loop over labels with a try ... except KeyError rather than a generator expression. At the moment you're duplicating lookups in the mapping.
@jnothman I've added the changes
sklearn/preprocessing/label.py
Outdated
try:
    index.add(class_mapping[label])
except KeyError:
    warnings.warn("Unknown class {0} found and will be ignored"
Make that {0!r} to quote the string correctly
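The difference between {0} and {0!r} is plain str.format behaviour:

```python
label = "cat"
# {0} interpolates str(label); {0!r} interpolates repr(label),
# so string labels show up quoted in the warning message.
print("Unknown class {0} will be ignored".format(label))    # Unknown class cat will be ignored
print("Unknown class {0!r} will be ignored".format(label))  # Unknown class 'cat' will be ignored
```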
mlb = MultiLabelBinarizer(classes=[1, 2])
assert_raises(KeyError, mlb.fit_transform, [[0]])
Please use assert_warns or a variant
@jnothman great feedback. Changes done.
Almost there
mlb = MultiLabelBinarizer(classes=[1, 2])
assert_raises(KeyError, mlb.fit_transform, [[0]])
assert_array_equal(mlb.fit(y).transform([[0]]), Y)
We don't want this raising the warning during testing. You can instead get the return value from assert_warns
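The pattern being asked for — swallow the warning during testing but keep the return value for further assertions — can be sketched with the standard library alone (a hypothetical stand-in for scikit-learn's assert_warns test helper, not its actual implementation):

```python
import warnings

def assert_warns_and_return(func, *args, **kwargs):
    """Minimal stand-in for an assert_warns-style helper: check that
    calling func emits at least one warning, and hand back func's
    return value so the test can assert on the transformed output."""
    with warnings.catch_warnings(record=True) as records:
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    assert len(records) >= 1, "expected a warning"
    return result

def noisy_double(x):
    warnings.warn("demo warning")
    return x * 2

print(assert_warns_and_return(noisy_double, 21))  # 42, warning captured
```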
done
sklearn/preprocessing/label.py
Outdated
try:
    index.add(class_mapping[label])
except KeyError:
    warnings.warn("Unknown class {0!r} will be ignored"
Sorry to fuss over this again, but perhaps it's better still to accumulate all the unknown labels and warn at the end.
Why? I disagree. There is no internal purpose for accumulating these unknown classes; the only one would be warning in one go, which is no longer necessary with this. The warning is sufficient for the user to go and double-check his/her data.
shrug. why? so as to issue only one warning altogether
In sklearn/preprocessing/label.py:
@@ -795,7 +797,14 @@ def _transform(self, y, class_mapping):
     indices = array.array('i')
     indptr = array.array('i', [0])
     for labels in y:
-        indices.extend(set(class_mapping[label] for label in labels))
+        index = set()
+        for label in labels:
+            try:
+                index.add(class_mapping[label])
+            except KeyError:
+                warnings.warn("Unknown class {0!r} will be ignored"
@jnothman what actual benefit does that have compared to what it does now?
@jnothman I've made the necessary changes. Do you think it is ok now?
To not flood the user's screen/log if ignoring is the behaviour they want,
and perhaps to reduce some computational overhead.
sklearn/preprocessing/label.py
Outdated
@@ -795,8 +797,17 @@ def _transform(self, y, class_mapping):
     indices = array.array('i')
     indptr = array.array('i', [0])
     for labels in y:
-        indices.extend(set(class_mapping[label] for label in labels))
+        index, unknown = set(), set()
unknown needs to be initialised out of the loop
sklearn/preprocessing/label.py
Outdated
        indptr.append(len(indices))
    if unknown:
        warnings.warn("Unknown class(es) {0} will be ignored"
                      .format(unknown))
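The accumulate-then-warn-once pattern under discussion, as a standalone sketch (a plain dict stands in for class_mapping; this is an illustration, not the scikit-learn source):

```python
import warnings

class_mapping = {1: 0, 2: 1}
rows = [[1, 3], [2, 4], [3]]   # 3 and 4 were never fitted

unknown = set()                # accumulated across ALL samples
for labels in rows:
    for label in labels:
        if label not in class_mapping:
            unknown.add(label)

if unknown:
    # one warning for the whole transform, instead of one per occurrence
    warnings.warn("Unknown class(es) {0} will be ignored"
                  .format(sorted(unknown, key=str)))
```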
use sorted for determinism
By the way, those last issues could have been caught by tests if you had used assert_raises_message.
In the case that classes are mixed (e.g. 1, 'some_class'), sorted will break when comparing different types, hence the addition of the key=str argument.
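The mixed-type issue is plain Python 3 behaviour and easy to reproduce:

```python
unknown = {1, 'some_class'}

try:
    sorted(unknown)              # int-vs-str comparison is not defined
except TypeError as exc:
    print("plain sorted fails:", exc)

# comparing on str(...) gives a deterministic order for mixed types
print(sorted(unknown, key=str))  # [1, 'some_class']
```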
@jnothman changes made, and good point about the determinism. BTW I tried the
@jnothman yeap, I just saw the docstring for
I meant assert_warns_message, sorry!
Otherwise LGTM
@jnothman The travis test failed, no idea why. Any update on the merging of this PR?
I think you should also test the case that classes is provided. And perhaps your test should ensure that non-ignored labels are still encoded, and that one error message combines the labels ignored from multiple samples.
Please add an entry to the change log at
@jnothman I did not really understand the request for the test, as I think it was already done here: scikit-learn/sklearn/preprocessing/tests/test_label.py Lines 317 to 320 in b0afb72
but I updated the test for multiple unknown classes and the case where
LGTM!
@jnothman should we merge it? Or are you waiting for another core developer to do that?
Yes, we wait for a second review.
LGTM. Thanks @rragundez :)
Reference Issues/PRs
Fixes #10410
What does this implement/fix? Explain your changes.
Like with other transformers, I would expect that transform uses what is picked up/learned during fit (e.g. sklearn.preprocessing.StandardScaler uses the mean and variance learned during fit); to my surprise this was not the case with MultiLabelBinarizer.
Example:
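The original example code did not survive in this copy of the thread; the failure mode described can be sketched with a plain dict, the way the old _transform lookup effectively worked (a hypothetical reconstruction, not the original snippet):

```python
# Reconstruction of the pre-change failure: _transform looked labels up
# directly in class_mapping, so an unseen label raised KeyError.
class_mapping = {1: 0, 2: 1}   # as if fitted on [[1], [2]]
try:
    indices = set(class_mapping[label] for label in [1, 3])
except KeyError as exc:
    print("KeyError on unseen label:", exc)
```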
My case was that I used a container to train the model with a pre-processing step that uses MultiLabelBinarizer, build the pipeline, and output the pipeline as a pkl file. Another container picks it up and predicts on new data. To my surprise, the pre-processing pipeline breaks if an unseen label is received. To fix this I had to write an ugly hack using MultiLabelBinarizer's classes_ attribute.
This change addresses that problem. Now transform only uses the classes seen during fit. If the parameter classes was given on initialization, it will use that. If fit_transform is used, it respects the optimization already in place.
Example with change:
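The "after" example is also missing from this copy; with the change as merged, behaviour along these lines would be expected (a sketch assuming a scikit-learn version that includes this PR, where unseen labels trigger a warning instead of a KeyError):

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([[1], [2]])                  # classes_ is now [1, 2]

with warnings.catch_warnings(record=True) as records:
    warnings.simplefilter("always")
    out = mlb.transform([[1, 3]])    # 3 was never seen during fit

print(out.tolist())                  # [[1, 0]] -- unseen class 3 ignored
print(len(records) >= 1)             # True: a warning was emitted
```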
Any other comments?
Issue #10410, which this change fixes, mentions the idea of adding an argument ignore_unseen. In my opinion this is not a good solution: