
Feature selection for categorical variables #8480

Open
amueller opened this issue Mar 1, 2017 · 11 comments

@amueller (Member) commented Mar 1, 2017

Currently, feature selection on categorical variables is hard.
With one-hot encoding we can select individual categories, but we cannot easily remove a whole variable.
I think this is something we should be able to do.

One way would be to have feature selection methods that are aware of categorical variables - I guess SelectFromModel(RandomForestClassifier()) would do that after we add categorical variable support.

We should also have some simpler, test-based methods, though.
Maybe f_regression and f_classif could be extended so that they can take categorical features into account?

Alternatively we could try to remove groups of features that correspond to the same original variable. That seems theoretically possible if we add feature name support, but using feature names for this would be putting semantics in the names which I don't entirely like. Doing it "properly" would require adding meta-data about the features, similarly to sample_props, only on the other axis. That seems pretty heavy-handed though.
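A minimal sketch of the limitation described above, with made-up data (`OneHotEncoder`, `SelectKBest`, and `f_classif` are real scikit-learn APIs; the dataset is hypothetical):

```python
# Sketch of the problem: after one-hot encoding, selection operates on
# individual dummy columns, so a variable can end up only partially kept.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
# One categorical variable with four levels, 200 samples (made-up data).
color = rng.choice(["red", "green", "blue", "black"], size=200).reshape(-1, 1)
y = (color.ravel() == "red").astype(int)  # target depends on "red" only

X = OneHotEncoder().fit_transform(color).toarray()  # 4 dummy columns
sel = SelectKBest(f_classif, k=2).fit(X, y)

# Only 2 of the 4 dummy columns survive: the original variable is neither
# fully kept nor fully removed.
print(sel.get_support())
```

There is no built-in way to say "keep or drop all four `color` columns together", which is the gap this issue is about.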

@jnothman (Member) commented Mar 1, 2017 (comment minimized)

@jnothman (Member) commented Mar 1, 2017 (comment minimized)

@amueller (Member, Author) commented Mar 3, 2017

That's what I meant by the "heavy-handed" approach. I'm not sure how we would pass the grouping information around.

@amueller (Member, Author) commented Mar 31, 2017

Hm, I guess using hierarchical column indices would solve the problem, if we could use data frames...

@amueller (Member, Author) commented Feb 7, 2019

Thank you, past me, for opening this issue that I just wanted to open again.

I think with the ColumnTransformer we now have a way to do this using separate score functions. That should be pretty straightforward, right?
We couldn't easily do model-based selection with that (I think?), but having f_regression_categorical or something like that for a feature union should be good.

Interesting question, though: what should the input look like? (This relates to the metadata question above.)
If we assume everything is one-hot encoded, we need metadata about the groups.
My "easy" solution would be to pass the data through the ColumnTransformer before one-hot encoding. But that means either the scoring function needs to call OrdinalEncoder (or OneHotEncoder) internally, or we have to ask the user to apply OrdinalEncoder first (which seems a bit weird, but also not that weird).
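One way the "separate score function" idea could look. The name `f_regression_categorical` and its internals are an assumption here, not an existing scikit-learn function; `OrdinalEncoder`, `SelectKBest`, and `scipy.stats.f_oneway` are real APIs, and the data are made up:

```python
import numpy as np
from scipy import stats
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest

def f_regression_categorical(X, y):
    """Hypothetical score function: one-way ANOVA of the continuous
    target y, grouped by each (ordinal-encoded) categorical column."""
    X = np.asarray(X)
    F, p = [], []
    for j in range(X.shape[1]):
        groups = [y[X[:, j] == v] for v in np.unique(X[:, j])]
        f, pv = stats.f_oneway(*groups)
        F.append(f)
        p.append(pv)
    return np.asarray(F), np.asarray(p)

rng = np.random.RandomState(0)
informative = rng.choice(["a", "b"], size=300)      # predictive variable
noise = rng.choice(["x", "y", "z"], size=300)       # uninformative variable
X_raw = np.column_stack([informative, noise])
y = 2.0 * (informative == "a") + rng.normal(size=300)

pipe = make_pipeline(OrdinalEncoder(),
                     SelectKBest(f_regression_categorical, k=1))
X_sel = pipe.fit_transform(X_raw, y)
# Whole variables are kept or dropped, never individual categories.
```

With this shape, the user does the ordinal encoding up front (the second option mooted above), and the scorer only ever sees one column per original variable.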

@jnothman (Member) commented Feb 7, 2019 (comment minimized)

@caiohamamura commented Apr 8, 2019

Wouldn't f_classif expect categorical explanatory variables and a continuous response variable?

When I manually convert the categorical variable to integers, the integer values themselves are somehow used to compute the F-score. Isn't that undesirable behaviour? Shouldn't the F-score be the same regardless of which labels are used to represent the categories?

That is the behaviour I get when I run the same data through scipy.stats.f_oneway or statsmodels.formula.api.ols.
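A tiny hand-made example of the behaviour being described (the data are made up; `f_classif` is the real scikit-learn function). Relabelling one category changes the score, because f_classif treats the integer codes as continuous measurements:

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Three categories coded 0/1/2; the target separates category 2 from the rest.
cat = np.array([0] * 10 + [1] * 10 + [2] * 10)
y = np.array([0] * 20 + [1] * 10)

codes_a = cat.reshape(-1, 1).astype(float)                          # codes 0, 1, 2
codes_b = np.where(cat == 2, 10, cat).reshape(-1, 1).astype(float)  # relabel 2 -> 10

F_a, _ = f_classif(codes_a, y)
F_b, _ = f_classif(codes_b, y)
# Same categories, different integer labels, different F-scores.
print(F_a[0], F_b[0])
```

By contrast, an ANOVA that uses the category only as a grouping factor (e.g. scipy.stats.f_oneway with a continuous response grouped by category) is unaffected by how the categories are labelled, since the codes never enter the arithmetic.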

@jnothman (Member) commented Apr 9, 2019 (comment minimized)

@caiohamamura commented Apr 9, 2019 (comment minimized)

@jnothman (Member) commented Apr 9, 2019 (comment minimized)

@thanasissdr commented Aug 30, 2019

Maybe I'm misunderstanding something here, but from what I've read so far I agree with @caiohamamura. Please correct me if I'm wrong: f_classif seems to apply the f_oneway ANOVA test, and as far as I know the independent variable should be categorical and the response variable continuous.
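For what it's worth, f_classif actually works the other way around: it runs one ANOVA per feature column, grouping the samples by the class label y, so the features are treated as continuous and y as the categorical factor. A quick check against scipy, on made-up data:

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(42)
X = rng.normal(size=(60, 1))          # one continuous feature (made-up)
y = rng.choice([0, 1, 2], size=60)    # categorical target, 3 classes

F_sklearn, _ = f_classif(X, y)
F_scipy, _ = stats.f_oneway(X[y == 0, 0], X[y == 1, 0], X[y == 2, 0])
# f_classif(X, y) matches f_oneway with y as the grouping variable.
```

That is exactly why feeding integer-coded categoricals into it as X gives label-dependent scores, as reported above.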

Projects: Categorical (To do)
4 participants