Support Selective OneHotEncoder #89

thismlguy · 2018-03-13T15:35:09Z

Currently the OneHotEncoder (OHE) encodes every distinct value, but a feature may have thousands of them, which blows up the training data. It would be good to have a control that limits the number of columns generated by OHE.

Some solutions from the discussions on Slack:

1. Have a parameter "N" which defines the number of columns; only the N most frequent categories are kept.
2. Have the list of required categories defined in the feature spec. The user defines a list of N required categories and only those are one-hot encoded; all others are ignored.
3. Have a percentile defined, e.g. keep the categories that together account for, say, 90% of the observations.
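To make the three options concrete, here is a minimal sketch in plain Python rather than in the library's own language. It uses exact in-memory counts, and every name in it (`top_n_categories`, `coverage_categories`, `one_hot`) is hypothetical and chosen for illustration only; it is not the library's API.

```python
from collections import Counter


def top_n_categories(values, n):
    """Option 1: keep the n most frequent categories."""
    return [cat for cat, _ in Counter(values).most_common(n)]


def coverage_categories(values, coverage):
    """Option 3: keep the most frequent categories until they account for
    at least `coverage` (a fraction in (0, 1]) of all observations."""
    counts = Counter(values)
    total = sum(counts.values())
    kept, seen = [], 0
    for cat, count in counts.most_common():
        if seen >= coverage * total:
            break
        kept.append(cat)
        seen += count
    return kept


def one_hot(value, categories):
    """Encode one value against a fixed category list. With option 2 the
    list comes straight from the user; values outside it are ignored,
    which here means an all-zero row."""
    return [1.0 if value == cat else 0.0 for cat in categories]


values = ["a"] * 6 + ["b"] * 3 + ["c"]
top2 = top_n_categories(values, 2)
covered = coverage_categories(values, 0.9)
```

In this toy data `"a"` and `"b"` cover 9 of 10 observations, so option 1 with N = 2 and option 3 with 90% select the same categories. Option 2 skips the counting step entirely, since the category list is supplied up front.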

From the initial discussions it seemed that options 1 and 3 are harder to implement than option 2. Open to discussion.

richwhitjr · 2018-03-14T15:38:31Z

Option 1 can also be done without too much effort using some of the heavy hitter monoids we use in other transformers. Option 3 can be difficult, since we don't know in advance how large to make the sketchmap to pick up all of the counts needed to calculate the distribution.
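As a rough illustration of the heavy hitter idea, the sketch below uses the Misra-Gries summary in plain Python. This is a stand-in technique, not the monoid referred to above, and the names `misra_gries` and `approx_top_n` are hypothetical. It keeps at most `k - 1` counters regardless of how many distinct categories the stream contains.

```python
def misra_gries(stream, k):
    """Approximate heavy hitters using at most k - 1 counters.
    Stored counts never exceed the true counts."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def approx_top_n(stream, n, k):
    """Rank the surviving counters and keep the n largest."""
    counters = misra_gries(stream, k)
    ranked = sorted(counters.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, _ in ranked[:n]]


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
summary = misra_gries(stream, 4)
top2 = approx_top_n(stream, 2, 4)
```

The summary size `k` has to be fixed before the data is seen, and low-frequency categories are dropped or undercounted along the way. That is enough to rank the top N, but it hints at why a percentile cutoff is harder: the cumulative distribution over all categories is no longer available.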

thismlguy closed this as completed Mar 13, 2018

thismlguy reopened this Mar 13, 2018

nevillelyh added the enhancement and help wanted labels Mar 13, 2018

fallonchen added a commit that referenced this issue Mar 14, 2018

add TopNOneHotEncoder, fix #89

8413470

fallonchen added a commit that referenced this issue Mar 14, 2018

add TopNOneHotEncoder, fix #89

bc4ea78

nevillelyh closed this as completed in b7f0559 Mar 14, 2018

nevillelyh assigned nevillelyh and fallonchen and unassigned nevillelyh Mar 29, 2018
