Currently the one-hot encoder (OHE) encodes every distinct value, but there might be thousands of them, which will blow up the training data. It would be good to have a control that limits the number of columns generated by OHE.
Some solutions from the discussions on Slack:
1. Have a parameter "N" that defines the number of columns; only the N most frequent categories will be kept.
2. Have the list of required categories defined in the featurespec. The user defines a list of N required categories and only those are one-hot encoded; all others are ignored.
3. Have a percentile defined, e.g. keep the categories that together account for, say, 90% of the observations.
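To make the three options concrete, here is a minimal Python sketch of how each one could pick the categories to keep. It is only an illustration using exact in-memory counts; the function names (`top_n_categories`, `coverage_categories`, `one_hot`) are hypothetical and are not part of any existing API.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def top_n_categories(values: Iterable[str], n: int) -> List[str]:
    """Option 1: keep the n most frequent categories."""
    return [cat for cat, _ in Counter(values).most_common(n)]


def coverage_categories(values: Iterable[str], coverage: float) -> List[str]:
    """Option 3: keep the most frequent categories until they account
    for at least `coverage` (e.g. 0.9) of all observations."""
    counts = Counter(values)
    total = sum(counts.values())
    kept: List[str] = []
    seen = 0
    for cat, count in counts.most_common():
        if seen >= coverage * total:
            break
        kept.append(cat)
        seen += count
    return kept


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Encode against a fixed category list (option 2 passes a
    user-defined list directly); unlisted values become all zeros."""
    return [1.0 if value == cat else 0.0 for cat in categories]


# Toy column: "a" appears 6 times, "b" 3 times, "c" once.
column = ["a"] * 6 + ["b"] * 3 + ["c"]
kept_top = top_n_categories(column, 2)
kept_cov = coverage_categories(column, 0.8)
```

In all three cases the encoder itself is unchanged: only the category list it encodes against differs, and any value outside that list produces an all-zero row.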
From initial discussions, it seemed that #1 and #3 are harder to implement than #2. Open to discussion.
#1 can also be done without too much effort using some of the heavy hitter monoids we use in other transformers. #3 can be difficult, since we don't know in advance how large to make the sketchmap to pick up all of the counts needed to calculate the distribution.
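As a rough illustration of the heavy-hitter idea, the sketch below uses the Misra-Gries algorithm in Python. This is a stand-in chosen for brevity, not the monoid used in the other transformers, and `misra_gries` and `max_counters` are hypothetical names. It keeps a bounded number of counters, which is enough to surface the most frequent categories, but the counts it returns are underestimates, which hints at why computing an exact share of observations is the harder problem.

```python
from typing import Dict, Iterable


def misra_gries(values: Iterable[str], max_counters: int) -> Dict[str, int]:
    """Approximate heavy hitters using at most `max_counters` counters.

    Any value occurring more than len(values) / (max_counters + 1) times
    is guaranteed to be in the result. Returned counts are lower bounds
    on the true counts, not exact frequencies.
    """
    counters: Dict[str, int] = {}
    for value in values:
        if value in counters:
            counters[value] += 1
        elif len(counters) < max_counters:
            counters[value] = 1
        else:
            # No free counter: decrement all and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream: "a" appears 6 times, "b" 3 times, "c" once.
stream = ["a"] * 6 + ["b"] * 3 + ["c"]
candidates = misra_gries(stream, max_counters=2)
```

Here the frequent categories survive with only two counters, but their counts have been decremented along the way, so the result cannot say exactly what fraction of the data they cover.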