Skip to content

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #25550

What does this implement/fix? Explain your changes.

This PR adds a _drop_idx_internal to OneHotEncoder that is used to drop the categories. _drop_idx_internal was already precomputed to take into account the grouped infrequent categories.

The public drop_idx_ attribute needs to be remapped to reference back to the category that was actually dropped. There are tests in this PR to assert this behavior.

Any other comments?

I was not able to think of a simpler way to do this without adding another attribute.

@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Feb 10, 2023
@thomasjpfan thomasjpfan added this to the 1.2.2 milestone Feb 10, 2023
@jeremiedbb
Copy link
Member

I'm not very familiar with all the one hot encorder internals, so I'm having a hard time reviewing the implementation :)
But the test seems to capture the behavior we expect.

I think it would be worth documenting more precisely the interaction between drop and infrequent categories. i.e. that infrequent categories are first grouped and then drop is resolved. In particular, "if_binary" applies to the categories after they've been grouped. This might not be very intuitive.

Other than that I can't see a way around having a separate attribute either. So I'd say LGTM.

ping @glemaitre or @ogrisel who might be more familiar with the internals of ohe

@jeremiedbb jeremiedbb mentioned this pull request Mar 7, 2023
12 tasks
@glemaitre glemaitre self-requested a review March 7, 2023 17:05
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I don't see a more straightforward way to handle the remapping. I propose only minor changes.

Since we are discussing about a remapping, just wondering if a private dict together with a property could make allow to store of a single attribute. But I did not think if it was actually feasible.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @thomasjpfan.

Here are just two nitpicks.

jeremiedbb and others added 2 commits March 8, 2023 12:03
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@jeremiedbb jeremiedbb merged commit 1380e3c into scikit-learn:main Mar 8, 2023
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Mar 8, 2023
…egories (scikit-learn#25589)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
jeremiedbb added a commit that referenced this pull request Mar 8, 2023
…egories (#25589)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:preprocessing To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OneHotEncoder drop_idx_ attribute description in presence of infrequent categories
4 participants