whitespace is a terrible separator for feature names in PolynomialFeatures.get_feature_names #10742

Open
amueller opened this issue Mar 2, 2018 · 9 comments

Comments

@amueller
Member

amueller commented Mar 2, 2018

If your feature names contain whitespace, it's currently hard to see which output features are interactions.
Maybe we should make the separator an option of get_feature_names, or use something like *?

@mohamed-ali
Contributor

mohamed-ali commented Mar 28, 2018

@amueller I will try to reproduce the issue. If it is confirmed, I will open a PR that fixes it by adding the separator as an option.

@mohamed-ali
Contributor

Here's an example that reproduces the issue:

>>> import pandas as pd 
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A B", "C", "D"])
>>> df
   A B  C  D
0    1  2  3
1    4  5  6
2    7  8  9
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(2)
>>> poly.fit(df)
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)
>>> poly.transform(df)
array([[  1.,   1.,   2.,   3.,   1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,   6.,  16.,  20.,  24.,  25.,  30.,  36.],
       [  1.,   7.,   8.,   9.,  49.,  56.,  63.,  64.,  72.,  81.]])
>>> poly.get_feature_names() 
['1', 'x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
>>> poly.get_feature_names(input_features=df.columns.tolist()) 
['1', 'A B', 'C', 'D', 'A B^2', 'A B C', 'A B D', 'C^2', 'C D', 'D^2']

When a feature name contains whitespace (here, the feature "A B"), the output is ambiguous: 'A B C' could be read as the interaction of "A B" with "C", or as the interaction of "A", "B" and "C".
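The ambiguity can be shown without scikit-learn at all. The snippet below is only a minimal sketch: `join_terms` is a hypothetical helper (not part of any library) that mimics how the term names of one output feature are joined, and it shows that two different sets of terms collapse to the same string when the separator is a space.

```python
def join_terms(terms, separator=" "):
    """Hypothetical helper: join the term names of one output feature."""
    return separator.join(terms)


# Interaction of the two features "A B" and "C".
two_way = join_terms(["A B", "C"])
# Interaction of the three features "A", "B" and "C".
three_way = join_terms(["A", "B", "C"])

# With a space separator, both produce the same name.
print(two_way == three_way)

# With "*" as the separator, these two cases are distinguishable.
print(join_terms(["A B", "C"], "*") == join_terms(["A", "B", "C"], "*"))
```

The first comparison is true and the second is false, which is the readability gain being discussed in this thread.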

@mohamed-ali
Contributor

Changing the code here from

name = " ".join("%s^%d" % (input_features[ind], exp)
                if exp != 1 else input_features[ind]
                for ind, exp in zip(inds, row[inds]))

to this

name = "*".join("%s^%d" % (input_features[ind], exp)
                if exp != 1 else input_features[ind]
                for ind, exp in zip(inds, row[inds]))

removes the ambiguity.

Here's the above example run with the suggested change:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> import pandas as pd 
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A B", "C", "D"])
>>> df
   A B  C  D
0    1  2  3
1    4  5  6
2    7  8  9
>>> poly = PolynomialFeatures(2)
>>> poly.fit(df)
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)
>>> poly.transform(df)
array([[  1.,   1.,   2.,   3.,   1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,   6.,  16.,  20.,  24.,  25.,  30.,  36.],
       [  1.,   7.,   8.,   9.,  49.,  56.,  63.,  64.,  72.,  81.]])
>>> poly.get_feature_names()
['1', 'x0', 'x1', 'x2', 'x0^2', 'x0*x1', 'x0*x2', 'x1^2', 'x1*x2', 'x2^2']
>>> poly.get_feature_names(input_features=df.columns.tolist())
['1', 'A B', 'C', 'D', 'A B^2', 'A B*C', 'A B*D', 'C^2', 'C*D', 'D^2']

@jnothman
Member

jnothman commented Mar 28, 2018 via email

@mohamed-ali
Contributor

@jnothman That might be true, but at least 'A B*C' adds a bit of readability compared to 'A B C'.
Can you think of something else that would be more readable?

I thought about these:

  • '"A B"*"C"'
  • '(A B)*(C)'

But I think they are ugly. We could also raise a warning and replace any whitespace within a feature name with an underscore:

  • 'A B C' => 'A_B C' or
  • 'A B C' => 'A_B*C'

@amueller what do you think?
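The warn-and-replace idea above could look roughly like the following. This is a sketch under stated assumptions, not scikit-learn code: `sanitize_feature_names` is a hypothetical helper, the warning category and message are made up for illustration, and each whitespace character is replaced by exactly one underscore.

```python
import re
import warnings


def sanitize_feature_names(input_features):
    """Hypothetical pre-processing step; not part of scikit-learn.

    Replaces every whitespace character in each name with "_" and warns
    once if at least one name had to be changed.
    """
    original = list(input_features)
    cleaned = [re.sub(r"\s", "_", name) for name in original]
    if cleaned != original:
        warnings.warn(
            "Whitespace in feature names was replaced with '_'.",
            UserWarning,
        )
    return cleaned


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned_names = sanitize_feature_names(["A B", "C", "D"])

print(cleaned_names)
print(len(caught))
```

The cleaned names could then be passed as `input_features`, so that 'A B C' would come out as 'A_B C'. Note that this changes the user's names, which is a trade-off of this approach.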

@jnothman
Member

jnothman commented Mar 29, 2018 via email

@mohamed-ali
Contributor

@jnothman, I agree that there is no perfect solution to this, but I think we should at least raise a warning when input_features contains a name with whitespace.

@mohamed-ali
Contributor

@amueller what do you think?

@euanmacinnes

Add a custom separator='' parameter to the function, so that developers can set it to whatever they want. I wouldn't bother with also processing the A^2 part; leave that as a formatting example afterwards. The default should be a space, for backwards compatibility.
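A separator parameter along these lines might be sketched as follows. This is a standalone illustration, not the scikit-learn implementation: `polynomial_feature_names` and its `separator` argument are hypothetical, and the function only reproduces the naming order seen in the examples earlier in this thread. The `^` exponent formatting is left untouched, and the default stays a space so existing output would not change.

```python
from collections import Counter
from itertools import combinations_with_replacement


def polynomial_feature_names(input_features, degree=2, separator=" "):
    """Hypothetical stand-in for get_feature_names with a separator option."""
    names = []
    n_features = len(input_features)
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(n_features), d):
            if not combo:
                names.append("1")  # bias term (degree 0)
                continue
            # Count how often each feature index occurs: that is its exponent.
            # combo is sorted, so the Counter keeps features in index order.
            exponents = Counter(combo)
            names.append(separator.join(
                "%s^%d" % (input_features[ind], exp)
                if exp != 1 else input_features[ind]
                for ind, exp in exponents.items()
            ))
    return names


columns = ["A B", "C", "D"]
default_names = polynomial_feature_names(columns)
star_names = polynomial_feature_names(columns, separator="*")
print(default_names)
print(star_names)
```

With the default separator this yields the same names as the original reproduction ('A B C', 'A B D', ...); with `separator="*"` it yields the names from the modified run ('A B*C', 'A B*D', ...).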
