CountFeaturizer for categorical data #5853
I think this is a good idea. Many people have asked for this. The open question is what format the categorical variables should come in. I guess not yet one-hot-encoded, so the groupings stay clear. A boolean mask or a set of indices would then need to be provided to specify which columns to encode. This is "easy" but it's a new estimator, so probably not for a first-time contribution. |
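To make the format question concrete, here is a minimal sketch of the two options; `CountFeaturizer` and its `categorical_features` parameter are hypothetical names for illustration, not an existing scikit-learn API:

```python
import numpy as np

# Input kept as raw (not one-hot-encoded) columns so the groupings stay clear.
X = np.array([["a", 0.5],
              ["b", 1.2],
              ["a", 3.1]], dtype=object)

# Option 1: a boolean mask marking the categorical columns.
categorical_mask = np.array([True, False])

# Option 2: integer indices of the categorical columns.
categorical_indices = [0]

# A transformer could then accept either form (hypothetical API):
# enc = CountFeaturizer(categorical_features=categorical_mask)
# Xt = enc.fit_transform(X, y)
```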
I am quite interested in adding this in. |
@chenhe95 indeed |
The supervised variant from Microsoft can also be blended with a prior marginal probability of the target for rare categories, to limit overfitting on those. This is what is called BayesEncoding in this package: http://hccencoding-project.readthedocs.io/en/latest/ This "average target value" encoder can also be used for regression: instead of averaging the probability of target classes, we directly average the raw target variable values. I think this kind of supervised categorical encoding is very useful for ensembles of trees. |
Correction: the log-odds transform in the Microsoft doc actually does the prior smoothing. |
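As a rough illustration of the smoothing idea, here is a minimal sketch of a prior-blended ("Bayes"-style) target encoder. The `weight` hyperparameter and the blending formula are assumptions of this sketch, not the exact formulas from the packages linked above:

```python
import numpy as np
import pandas as pd

def smoothed_target_encode(categories, y, weight=10.0):
    """Blend per-category target means with the global prior.

    ``weight`` (an assumed hyperparameter) controls how strongly rare
    categories are shrunk toward the global mean, limiting overfitting.
    For classification, pass y as 0/1 indicators of one class; for
    regression, pass the raw target values directly, as described above.
    """
    df = pd.DataFrame({"cat": categories, "y": y})
    prior = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["sum", "count"])
    encoding = (stats["sum"] + weight * prior) / (stats["count"] + weight)
    return df["cat"].map(encoding).to_numpy(), encoding

# Regression example: average the raw target values per category.
cats = np.array(["a", "a", "b", "b", "b", "c"])
y = np.array([1.0, 3.0, 2.0, 2.0, 5.0, 10.0])
encoded, mapping = smoothed_target_encode(cats, y, weight=2.0)
print(mapping)  # the rare category "c" is pulled toward the global mean
```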
This seems interesting. Will take a look - thank you. |
This slide deck is very interesting as well: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750 |
Maybe |
I find Featurizer hard to understand. But I suppose that won't be a problem once people get used to the name. |
Can I work on this feature? |
@venkyyuvy Sorry, I have an implementation of this and am putting the final touches on it before submitting a PR. My overall plan is to split this into two PRs: one for regression targets and another for classification targets. The major blocker is that I cannot seem to find a dataset that showcases how a target encoder (for regression) is useful. Have you had experience with using this encoder successfully? I thought this encoder would have comparable performance in tree ensembles but with faster training times. In my experimenting, I see faster training times but worse performance. My implementation uses a multilevel partial pooling encoding scheme as shown in #9614 (comment). My other concern is that I do not know if this encoder meets our inclusion criteria. |
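For readers unfamiliar with partial pooling, here is a hedged sketch of the general idea for a regression target. The moment-based variance estimates below are an assumption of this sketch; the exact scheme is the one referenced in #9614:

```python
import numpy as np
import pandas as pd

def partial_pooling_encode(categories, y):
    """Sketch of a multilevel partial pooling encoder (regression).

    Each category mean is shrunk toward the grand mean: categories with
    many samples keep most of their own mean, while small ones are
    pulled toward the prior.
    """
    df = pd.DataFrame({"cat": categories, "y": y})
    grand_mean = df["y"].mean()
    grouped = df.groupby("cat")["y"]
    counts = grouped.count()
    means = grouped.mean()
    # Within-category variance (pooled) and between-category variance,
    # estimated with simple moments (an assumption of this sketch).
    sigma2 = grouped.var(ddof=0).fillna(0.0).mean()
    tau2 = max(means.var(ddof=0), 1e-12)
    # Shrinkage factor: approaches 1 as the category count grows.
    lam = counts * tau2 / (counts * tau2 + sigma2 + 1e-12)
    encoding = lam * means + (1 - lam) * grand_mean
    return df["cat"].map(encoding).to_numpy()
```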
Thanks for your update @thomasjpfan. My kind request is that if you have already started on a feature, please mention that in the thread. Sorry, I don't have personal experience with using target encoders. Could you please explain what you mean when you say …? Looking forward to your PR. Thank you |
Better training times but worse on metrics such as |
I have used a data transformation based on count data with some success on kaggle:
https://www.kaggle.com/c/sf-crime/forums/t/15836/predicting-crime-categories-with-address-featurization-and-neural-nets
This is similar to what Azure does: https://msdn.microsoft.com/en-us/library/azure/dn913056.aspx
I've also found that adding an extra column giving the overall frequency of each individual label across all predictive categories adds useful information.
The implementation would use a contingency_matrix to first calculate the frequencies, then add Laplacian noise to avoid overfitting, and finally return the new features.
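A minimal sketch of that plan, using plain NumPy to build the contingency counts rather than any existing scikit-learn helper; the `noise_scale` knob and the extra overall-frequency column are assumptions of this sketch:

```python
import numpy as np

def count_featurize(column, y, noise_scale=1.0, random_state=None):
    """Per-category class counts with Laplacian noise, returned as
    class frequencies plus an overall category-frequency column."""
    rng = np.random.default_rng(random_state)
    cats, cat_idx = np.unique(column, return_inverse=True)
    classes, cls_idx = np.unique(y, return_inverse=True)

    # Contingency table: rows are categories, columns are target classes.
    counts = np.zeros((len(cats), len(classes)))
    np.add.at(counts, (cat_idx, cls_idx), 1)

    # Laplacian noise to limit overfitting; clip so counts stay valid.
    counts = np.clip(counts + rng.laplace(scale=noise_scale, size=counts.shape), 0, None)

    # Per-category class frequencies.
    freqs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)

    # Extra column: overall frequency of each category (see above).
    total = np.bincount(cat_idx) / len(column)
    return np.column_stack([freqs[cat_idx], total[cat_idx]])

# Toy example with one categorical column and a two-class target.
X_col = np.array(["a", "a", "b", "b", "b"])
y = np.array([0, 1, 0, 0, 0])
print(count_featurize(X_col, y, noise_scale=0.5, random_state=0))
```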
Is there any interest in including something like this in sklearn?
Best