PolynomialFeatures: allow user defined combinations of features #19533

Open
dhimmel opened this issue Feb 23, 2021 · 3 comments

Comments

@dhimmel

dhimmel commented Feb 23, 2021

As of v0.24.1, sklearn.preprocessing.PolynomialFeatures has three options that determine which combinations of features are generated:

  1. degree: the maximum number of features to combine into a polynomial feature
  2. interaction_only: filters out any combinations that include the same feature multiple times
  3. include_bias: adds a column of ones

These are nice options, but they cannot capture every use case for generating polynomial feature combinations. For example, my data has 4 features: a, b, c, d. I want to transform these features into a, ab, ac, ad. I didn't see any way to achieve this with PolynomialFeatures directly, so instead I created a subsequent step in my pipeline to select a subset of columns from the PolynomialFeatures output.
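For reference, that workaround can be sketched roughly as follows. This is a minimal illustration, not the exact code from my pipeline: the helper `select_columns` and the index list `KEEP` are made up for the example, and the hard-coded indices assume the column order PolynomialFeatures produces for 4 inputs with these settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

# With degree=2, interaction_only=True, include_bias=False and 4 inputs
# (a, b, c, d), the output columns are ordered:
#   a, b, c, d, ab, ac, ad, bc, bd, cd
KEEP = [0, 4, 5, 6]  # a, ab, ac, ad (hypothetical name, illustration only)


def select_columns(X, columns):
    """Keep only the requested output columns (hypothetical helper)."""
    return X[:, columns]


pipeline = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    FunctionTransformer(select_columns, kw_args={"columns": KEEP}),
)

X = np.array([[2.0, 3.0, 5.0, 7.0]])  # one row: a=2, b=3, c=5, d=7
Xt = pipeline.fit_transform(X)  # columns: a, ab, ac, ad
```

The drawback is that the indices in `KEEP` silently depend on the number of input features and on the PolynomialFeatures settings, which is part of why a direct option would be nicer.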

Letting users specify any combination of features would be a general purpose solution. For example, I propose supporting something like:

# combinations by feature name (for situations when feature names are available)
PolynomialFeatures(combinations=[("a",), ("a", "b"), ("a", "c"), ("a", "d")])

# combinations by index
PolynomialFeatures(combinations=[(0,), (0, 1), (0, 2), (0, 3)])

# another set of combinations by index that isn't currently possible
PolynomialFeatures(combinations=[(0,), (1,), (0, 0), (1, 1, 1), (0, 1)])

Does this make sense? Is there some other way of generating custom combinations of polynomial features in a pipeline that I am overlooking?
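To make the intended semantics concrete, here is a rough sketch of what index-based combinations would compute, written as a plain NumPy function. The name `product_features` and its signature are assumptions for illustration only; nothing like this exists in scikit-learn as far as I know.

```python
import numpy as np


def product_features(X, combinations):
    """Return one column per combination: the product of the indexed columns.

    Hypothetical helper, not a scikit-learn API. A repeated index such as
    (0, 0) yields a power of that feature.
    """
    X = np.asarray(X, dtype=float)
    columns = [np.prod(X[:, list(combo)], axis=1) for combo in combinations]
    return np.column_stack(columns)


combos = [(0,), (1,), (0, 0), (1, 1, 1), (0, 1)]
X = np.array([[2.0, 3.0], [1.0, 4.0]])
Xt = product_features(X, combos)  # columns: x0, x1, x0^2, x1^3, x0*x1
```

A function like this could presumably be dropped into a pipeline today by wrapping it in FunctionTransformer and passing the combinations through `kw_args`, though that loses the feature name handling a built-in option could offer.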

@dhimmel
Author

dhimmel commented Feb 23, 2021

One challenge is the user might not always know what features exist at some intermediate stage of a pipeline where PolynomialFeatures is applied. Therefore, perhaps combinations could also be a function that takes a list of feature names (or even just the number of features) and returns an iterable of combinations. This would allow combinations to be determined dynamically based on the input features.

This is more complex, and shouldn't detract from solving the case where the user does know all the features of the input data and would like to provide specific static combinations.
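A rough sketch of how the callable variant might behave, assuming feature names are available at fit time. Both `anchor_interactions` and `resolve_combinations` are hypothetical names invented for this example; the point is only that static and callable inputs can be resolved to the same index tuples.

```python
def anchor_interactions(feature_names):
    """Example callable: the first feature alone, then paired with each other feature."""
    anchor, *others = feature_names
    yield (anchor,)
    for other in others:
        yield (anchor, other)


def resolve_combinations(combinations, feature_names):
    """Turn name-based combinations (static or callable) into index tuples.

    Hypothetical helper showing what an estimator might do at fit time.
    """
    if callable(combinations):
        combinations = combinations(feature_names)
    position = {name: i for i, name in enumerate(feature_names)}
    return [tuple(position[name] for name in combo) for combo in combinations]


names = ["a", "b", "c", "d"]
dynamic = resolve_combinations(anchor_interactions, names)
static = resolve_combinations([("a",), ("a", "b")], names)
```

Here the callable reproduces the a, ab, ac, ad example without the user needing to know in advance how many features reach this step.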

@ogrisel
Member

ogrisel commented Feb 23, 2021

Indeed, specifying feature combinations by name is probably safer than by position. But this would require propagating feature metadata through the transformers in a pipeline.

This is a use case to keep in mind for SLEP015 scikit-learn/enhancement_proposals#48

@jnothman
Member

jnothman commented Feb 23, 2021 via email
