
CountFeaturizer for categorical data #5853

Open
papadopc opened this issue Nov 16, 2015 · 9 comments

Comments

@papadopc papadopc commented Nov 16, 2015

I have used a data transformation based on count data with some success on Kaggle:
https://www.kaggle.com/c/sf-crime/forums/t/15836/predicting-crime-categories-with-address-featurization-and-neural-nets

This is similar to what Azure does: https://msdn.microsoft.com/en-us/library/azure/dn913056.aspx
I've also found that adding an extra column giving the frequency of each individual label across all predictive categories adds useful information.

The implementation would use a contingency matrix to first calculate the per-category frequencies, then add Laplacian noise to avoid overfitting, and finally return the new features.
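The proposal above can be sketched roughly as follows. This is a minimal illustration, not an agreed API: the function name, the `noise_scale` parameter, and the decision to noise the counts directly are all assumptions.

```python
import numpy as np

def count_featurize(x, y, noise_scale=0.0, rng=None):
    """Replace categorical column x with one count column per target class.

    counts[i, j] = number of rows whose category is cats[i] and whose
    label is classes[j]. With noise_scale > 0, Laplacian noise is added
    to limit overfitting (illustrative sketch, not sklearn API).
    """
    rng = np.random.default_rng(rng)
    cats, x_idx = np.unique(x, return_inverse=True)
    classes, y_idx = np.unique(y, return_inverse=True)
    counts = np.zeros((len(cats), len(classes)))
    np.add.at(counts, (x_idx, y_idx), 1)  # contingency matrix
    if noise_scale > 0:
        counts = counts + rng.laplace(scale=noise_scale, size=counts.shape)
    return counts[x_idx]  # one row of per-class counts per sample

x = np.array(["a", "a", "b", "b", "b"])
y = np.array([0, 1, 0, 0, 1])
print(count_featurize(x, y))
# rows for category "b" become [2., 1.]: two class-0 and one class-1 example
```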

Is there any interest in including something like this in sklearn?

Best

@amueller (Member) commented Oct 7, 2016

I think this is a good idea; many people have asked for this. The question is what format the categorical variable should be in. I guess not yet one-hot-encoded, so that the groupings are clear. A boolean matrix or indices would then need to be provided to specify which columns to encode.

This is "easy", but it's a new estimator, so probably not for a first-time contribution.
I don't know what a good name would be. I also think probabilities might be better than counts.
Something in the name should suggest that this is for categorical variables. It should work on the same data as the new / enhanced OneHotEncoder.

@chenhe95 (Contributor) commented Oct 21, 2016

I am quite interested in adding this.
From what I read in the linked Microsoft article: for each class y_i in unique(Y_labels), we add a new column containing the "probability" that the corresponding x_i belongs to y_i (except it is a count rather than a probability).
So one additional column per class? Is my understanding correct?

@amueller (Member) commented Oct 21, 2016

@chenhe95 indeed

@ogrisel (Member) commented Jun 13, 2018

The supervised variant from Microsoft can also be blended with the prior marginal probability of the target for rare categories, to limit overfitting on those. This is what is called BayesEncoding in this package: http://hccencoding-project.readthedocs.io/en/latest/

This "average target value" encoder can also be used for regression: instead of averaging the probabilities of the target classes, we directly average the raw target values.
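The regression variant can be sketched like this. Names are illustrative; falling back to the global target mean for unseen categories is one possible choice, not part of the original proposal.

```python
import numpy as np

def target_mean_encode(x_train, y_train, x_new):
    """Encode each category by the mean of the raw target values
    observed for it in the training data (regression variant)."""
    cats, idx = np.unique(x_train, return_inverse=True)
    sums = np.zeros(len(cats))
    counts = np.zeros(len(cats))
    np.add.at(sums, idx, y_train)
    np.add.at(counts, idx, 1)
    lookup = dict(zip(cats, sums / counts))
    overall = y_train.mean()  # fallback for unseen categories (a choice)
    return np.array([lookup.get(c, overall) for c in x_new])

x = np.array(["a", "a", "b"])
y = np.array([1.0, 3.0, 5.0])
print(target_mean_encode(x, y, np.array(["a", "b", "c"])))
# → [2. 5. 3.]  ("a" averages to 2, "b" to 5, unseen "c" gets the global mean 3)
```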

I think this kind of supervised categorical encoding is very useful for ensembles of trees. CountFeaturizer is probably not a suitable name for the supervised encoding schemes. We can probably use a dedicated class for the supervised case. Maybe TargetCategoricalEncoder?

@ogrisel (Member) commented Jun 13, 2018

Correction: the log-odds transform in the Microsoft doc actually does the prior smoothing.

@chenhe95 (Contributor) commented Jun 13, 2018

This seems interesting. Will take a look - thank you.

@ogrisel (Member) commented Jun 13, 2018

This slide deck is very interesting as well: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750

@ogrisel (Member) commented Jun 13, 2018

> We can probably use a dedicated class for the supervised case. Maybe TargetCategoricalEncoder

Maybe TargetFeaturizer is fine.

@jnothman (Member) commented Jun 13, 2018
