Support Selective OneHotEncoder #89

Closed
thismlguy opened this issue Mar 13, 2018 · 1 comment

Comments

@thismlguy

Currently the OneHotEncoder (OHE) encodes every distinct value, but a feature can have thousands of them, which blows up the width of the training data. It would be good to have a control that limits the number of columns generated by OHE.

Some solutions from the discussions on slack:

  1. Have a parameter "N" which defines the number of columns and most occurring N categories will be kept.
  2. Have the list of required categories be defined in the featurespec. The user will define a list of N required categories and only those will be one-hot encoded; all others are ignored.
  3. Have a percentile defined, e.g. keep all categories accounting for say 90% of the observations.

From initial discussions it seemed that options 1 and 3 are harder to implement than option 2. Open to discussion.
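To make the three options concrete, here is a minimal Python sketch of how each one could choose the categories to encode. It is illustrative only and is not this project's API: the function names `top_n_categories`, `coverage_categories`, and `one_hot` are hypothetical, and it assumes exact counts that fit in memory, which the real transformer may not have.

```python
from collections import Counter


def top_n_categories(values, n):
    """Option 1: keep the n most frequent categories."""
    return [cat for cat, _ in Counter(values).most_common(n)]


def coverage_categories(values, coverage):
    """Option 3: keep the most frequent categories until they account
    for at least `coverage` (a fraction such as 0.9) of all observations."""
    counts = Counter(values)
    total = sum(counts.values())
    kept, seen = [], 0
    for cat, count in counts.most_common():
        if seen >= coverage * total:
            break
        kept.append(cat)
        seen += count
    return kept


def one_hot(value, categories):
    """Encode one value against a fixed category list.

    Option 2 passes a user-defined list straight in; a value that is
    not in the list becomes an all-zero row.
    """
    return [1.0 if value == cat else 0.0 for cat in categories]


# Toy data: "a" appears 6 times, "b" 3 times, "c" once.
values = ["a"] * 6 + ["b"] * 3 + ["c"]
kept = top_n_categories(values, 2)      # option 1
row = one_hot("c", kept)                # "c" was dropped, so all zeros
```

In all three cases the encoder itself is the same; only the way the category list is obtained differs. Option 2 needs no pass over the data, while options 1 and 3 need category counts first.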

@thismlguy thismlguy reopened this Mar 13, 2018
@nevillelyh nevillelyh added enhancement New feature or request help wanted Extra attention is needed labels Mar 13, 2018
@richwhitjr
Contributor

richwhitjr commented Mar 14, 2018

Option 1 can also be done without too much effort using some of the heavy hitter monoids we use in other transformers. Option 3 can be difficult, since we don't know up front how large to make the sketchmap so that it picks up all of the counts needed to calculate the distribution.
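As a rough illustration of the heavy hitter idea, the sketch below uses the Misra-Gries algorithm in Python. This is a stand-in chosen for brevity and is an assumption on my part: it is not the heavy hitter monoid or the sketchmap mentioned above, and the name `misra_gries` is hypothetical. It finds candidate frequent categories in one pass using a bounded number of counters.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters.

    Any item that occurs more than len(stream) / k times is guaranteed
    to be among the returned keys. The returned counts are lower bounds,
    not exact frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy data: "a" appears 6 times, "b" 3 times, "c" once.
stream = ["a"] * 6 + ["b"] * 3 + ["c"]
candidates = misra_gries(stream, 3)
```

The counter budget `k` must be fixed before the data is seen, and the counts that come back are only approximate. That is comfortable for a fixed top N, but a coverage target such as 90% depends on how many categories the distribution needs, which is not known up front.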

fallonchen added a commit that referenced this issue Mar 14, 2018