FIX Fixes bug OneHotEncoder's drop_idx_ when there are infrequent categories #25589
I'm not very familiar with all the one-hot encoder internals, so I'm having a hard time reviewing the implementation :) I think it would be worth documenting more precisely the interaction between Other than that I can't see a way around having a separate attribute either. So I'd say LGTM. ping @glemaitre or @ogrisel who might be more familiar with the internals of ohe
LGTM. I don't see a more straightforward way to handle the remapping. I propose only minor changes.
Since we are discussing a remapping, I am just wondering whether a private dict together with a property would allow storing a single attribute. I have not thought through whether it is actually feasible, though.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
LGTM, thank you @thomasjpfan.
Here are just two nitpicks.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
…egories (scikit-learn#25589) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
…egories (#25589) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Fixes #25550
What does this implement/fix? Explain your changes.
This PR adds a _drop_idx_internal to OneHotEncoder that is used to drop the categories. _drop_idx_internal was already precomputed to take into account the grouped infrequent categories. The public drop_idx_ attribute needs to be remapped to reference back to the category that was actually dropped. There are tests in this PR to assert this behavior.
Any other comments?
I was not able to think of a simpler way to do this without adding another attribute.
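
For readers unfamiliar with the encoder internals, the remapping described above can be illustrated with a small standalone sketch. This is not the scikit-learn implementation: the helper names split_by_frequency and remap_drop_idx are hypothetical, the grouped layout (frequent categories in their original order, followed by one combined infrequent column) is an assumption, and reporting the combined column by its first infrequent member is likewise an assumption made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def split_by_frequency(
    counts: Sequence[int], min_frequency: int
) -> Tuple[List[int], List[int]]:
    """Split original category indices into frequent and infrequent ones."""
    frequent = [i for i, n in enumerate(counts) if n >= min_frequency]
    infrequent = [i for i, n in enumerate(counts) if n < min_frequency]
    return frequent, infrequent


def remap_drop_idx(
    internal_idx: Optional[int], frequent: List[int], infrequent: List[int]
) -> Optional[int]:
    """Map an index in the grouped space back to an original category index.

    Assumed grouped layout: all frequent categories in their original order,
    then a single column standing for every infrequent category.
    """
    if internal_idx is None:
        # Nothing is dropped for this feature.
        return None
    if internal_idx < len(frequent):
        # A frequent category was dropped: look up its original position.
        return frequent[internal_idx]
    # The combined infrequent column was dropped. Assumption: report it
    # through the first infrequent category's original index.
    return infrequent[0]


categories = ["a", "b", "c", "d"]
counts = [1, 10, 2, 12]  # "a" and "c" fall below min_frequency=5
frequent, infrequent = split_by_frequency(counts, min_frequency=5)

# Internally the grouped columns are ["b", "d", <infrequent>], so dropping
# the first grouped column (internal index 0) actually drops "b".
public_idx = remap_drop_idx(0, frequent, infrequent)
dropped_category = categories[public_idx]
```

In this sketch the internal index 0 and the public index 1 both refer to "b", which is the distinction the PR draws between _drop_idx_internal and drop_idx_.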