CountFeaturizer for categorical data #5853

Closed
papadopc opened this issue Nov 16, 2015 · 13 comments · Fixed by #25334

@papadopc

I have used a data transformation based on count data with some success on Kaggle:
https://www.kaggle.com/c/sf-crime/forums/t/15836/predicting-crime-categories-with-address-featurization-and-neural-nets

This is similar to what Azure does: https://msdn.microsoft.com/en-us/library/azure/dn913056.aspx
I've also found that adding an extra column giving the overall frequency of each individual label, across all target classes, adds useful information.

The implementation would use a contingency matrix to first calculate the frequencies, then add Laplacian noise to avoid overfitting, and finally return the new features.

Is there any interest in including something like this in sklearn?

Best
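
A minimal sketch of that idea (a hypothetical helper, not an actual sklearn API), using pandas to build the contingency matrix and NumPy for the Laplacian noise; the function name and `noise_scale` parameter are made up for illustration:

```python
import numpy as np
import pandas as pd

def count_featurize(X_cat, y, noise_scale=1.0, seed=None):
    """Map each category value to its per-class co-occurrence counts,
    with Laplacian noise added to limit overfitting, plus one column
    for the overall frequency of the category."""
    rng = np.random.default_rng(seed)
    # Contingency matrix: rows = category values, columns = target classes.
    counts = pd.crosstab(pd.Series(X_cat, name="cat"), pd.Series(y, name="class"))
    noisy = counts + rng.laplace(scale=noise_scale, size=counts.shape)
    noisy["total"] = counts.sum(axis=1)  # frequency of each label over all classes
    return noisy.loc[np.asarray(X_cat)].to_numpy()

X_cat = np.array(["a", "b", "a", "c", "b", "a"])
y = np.array([0, 1, 0, 1, 1, 0])
features = count_featurize(X_cat, y, noise_scale=0.5, seed=0)  # shape (6, 3)
```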

@amueller amueller added the New Feature, Easy, and Need Contributor labels Oct 7, 2016
@amueller
Member

amueller commented Oct 7, 2016

I think this is a good idea; many people have asked for it. The question is what format the categorical variable should be in. I would guess not yet one-hot-encoded, so the groupings are clear. A boolean mask or column indices would then need to be provided to specify which columns to encode.

This is "easy", but it's a new estimator, so probably not a first-time contribution.
I don't know what a good name would be. I also think probabilities might be better than counts.
Something in the name should suggest this is for categorical variables. It should work on the same data as the new / enhanced OneHotEncoder.
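
To make the column-selection question concrete, a rough sketch of what such an estimator's interface could look like (the class name, the `encode_columns` parameter, and the counts-only behavior are all hypothetical):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CountFeaturizer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: expects raw (not one-hot) categoricals and an
    `encode_columns` boolean mask or array of column indices."""

    def __init__(self, encode_columns=None):
        self.encode_columns = encode_columns

    def fit(self, X, y):
        cols = np.asarray(self.encode_columns)
        self.idx_ = np.flatnonzero(cols) if cols.dtype == bool else cols
        self.classes_ = np.unique(y)
        # One lookup table per encoded column: category value -> per-class counts.
        self.maps_ = [
            {v: [np.sum((X[:, j] == v) & (y == c)) for c in self.classes_]
             for v in np.unique(X[:, j])}
            for j in self.idx_
        ]
        return self

    def transform(self, X):
        # Unseen categories would raise a KeyError in this simplified sketch.
        blocks = [np.array([m[v] for v in X[:, j]], dtype=float)
                  for j, m in zip(self.idx_, self.maps_)]
        return np.hstack(blocks)
```

With pandas input, ColumnTransformer could handle the column routing instead; the mask parameter here just mirrors the options discussed above.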

@chenhe95
Contributor

I am quite interested in adding this.
From what I read in the linked Microsoft article, for each class y_i in unique(Y_labels), we add a new column with the "probability" that the corresponding sample x_i belongs to y_i (except it's a count instead of a probability).
So one additional column per class? Is my understanding correct?

@amueller
Member

@chenhe95 indeed
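
For concreteness, a tiny worked illustration of the one-column-per-class reading, on made-up data:

```python
import pandas as pd

X_cat = ["a", "a", "b", "b", "b"]
y = [0, 1, 0, 0, 1]
print(pd.crosstab(pd.Series(X_cat, name="cat"), pd.Series(y, name="class")))
# class  0  1
# cat
# a      1  1
# b      2  1
# Every sample with x_i == "b" would get the two new feature values (2, 1).
```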

@ogrisel
Member

ogrisel commented Jun 13, 2018

The supervised variant from Microsoft can also be blended with a prior marginal probability of the target for rare categories, to limit overfitting on those. This is what is called BayesEncoding in this package: http://hccencoding-project.readthedocs.io/en/latest/

This "average target value" encoder can also be used for regression: instead of averaging the probabilities of the target classes, we directly average the raw target values.

I think this kind of supervised categorical encoding is very useful for ensembles of trees. CountFeaturizer is probably not a suitable name for the supervised encoding schemes. We can probably use a dedicated class for the supervised case. Maybe TargetCategoricalEncoder?
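
A sketch of that blending, roughly in the spirit of the BayesEncoding scheme linked above (the helper name and the prior-strength parameter `m` are mine; the exact formula in that package may differ). It works unchanged for regression, since it averages raw target values:

```python
import numpy as np
import pandas as pd

def smoothed_target_encode(X_cat, y, m=10.0):
    """Shrink each category's mean target toward the global mean;
    rare categories (small counts) stay close to the prior."""
    df = pd.DataFrame({"cat": X_cat, "y": y})
    prior = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return smoothed.loc[df["cat"]].to_numpy()
```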

@ogrisel
Member

ogrisel commented Jun 13, 2018

Correction: the log-odds transform in the Microsoft doc actually does the prior smoothing.
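
For reference, one common smoothed log-odds form (an assumption on my part; the Azure doc's exact constants and parameterization may differ):

```python
import numpy as np

def smoothed_log_odds(pos, neg, prior=0.5, strength=10.0):
    # Pseudo-counts strength * prior pull rare categories toward the prior
    # before taking the log-odds of the positive class.
    p = (pos + strength * prior) / (pos + neg + strength)
    return np.log(p / (1.0 - p))
```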

@chenhe95
Contributor

This seems interesting. Will take a look - thank you.

@ogrisel
Member

ogrisel commented Jun 13, 2018

This slide deck is very interesting as well: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750

@ogrisel
Member

ogrisel commented Jun 13, 2018

> We can probably use a dedicated class for the supervised case. Maybe TargetCategoricalEncoder

Maybe TargetFeaturizer is fine.

@jnothman
Member

jnothman commented Jun 13, 2018 via email

@amueller amueller removed the Easy label Aug 6, 2019
@venkyyuvy
Contributor

Can I work on this feature?

@thomasjpfan thomasjpfan self-assigned this May 23, 2020
@thomasjpfan
Member

@venkyyuvy Sorry, I have an implementation of this and am putting the final touches on it before submitting a PR. My overall plan is to split it into two PRs: one for regression targets and another for classification targets.

The major blocker is that I cannot seem to find a dataset that showcases how a target encoder (for regression) is useful. Have you had experience using this encoder successfully?

I thought this encoder would give comparable performance in tree ensembles but with faster training times. In my experiments, I see faster training times but worse performance. My implementation uses a multilevel partial pooling encoding scheme as shown in #9614 (comment).

My other concern is that I do not know whether this encoder meets our inclusion criteria.
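
For readers following along, a rough sketch of one multilevel partial-pooling scheme (empirical-Bayes-style shrinkage; the actual scheme in #9614 may differ in its variance estimates):

```python
import numpy as np
import pandas as pd

def partial_pooling_encode(X_cat, y):
    """Shrink each category mean toward the global mean by a factor driven
    by the category count and the within/between-category variances."""
    df = pd.DataFrame({"cat": X_cat, "y": y})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "var", "count"])
    within_var = stats["var"].fillna(0.0).mean()  # avg variance inside categories
    between_var = stats["mean"].var()             # variance of the category means
    lam = stats["count"] / (stats["count"] + within_var / max(between_var, 1e-12))
    encoded = lam * stats["mean"] + (1.0 - lam) * global_mean
    return encoded.loc[df["cat"]].to_numpy()
```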

@venkyyuvy
Contributor

venkyyuvy commented May 24, 2020

Thanks for your update @thomasjpfan. My kind request: if you have already started on a feature, please mention it in the thread.

Sorry, I don't have personal experience with using a target encoder.

Could you please explain what you mean by performance, since you already mentioned that you are able to get better training times?

Looking forward to your PR. Thank you

@thomasjpfan
Member

> Could you please explain what you mean by performance, since you already mentioned that you are able to get better training times?

Better training times, but worse on metrics such as r2_score or accuracy.
