Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] add get_feature_names to PolynomialFeatures #6372

Merged
merged 4 commits into from
Feb 25, 2016

Conversation

amueller
Copy link
Member

Fixes #6185, replaces #6216

@amueller
Copy link
Member Author

ping @jakevdp ;)

@@ -1182,6 +1182,35 @@ def powers_(self):
return np.vstack(np.bincount(c, minlength=self.n_input_features_)
for c in combinations)

def get_feature_names(self, input_features=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been suggesting for a while that this is an appropriate way to deal with feature names in extractor/transformer pipelines. I'm happy to see this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Very nice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been wanting to do it but didn't have time. I'm just writing the book chapter about preprocessing and I'm embarrassed not having that feature ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, who knew writing books would be good for something?

@jnothman
Copy link
Member

LGTM.

@jnothman
Copy link
Member

I qualify that LGTM. I think we need to decide whether x_1, x_2, ... is the appropriate naming scheme. If we are to go with get_feature_names transforming its input feature name list, the same convention will apply in, e.g. feature selectors and feature agglomeration.

A further small issue: should these be unicodes in Python 2? We should at least test that when unicode input names are passed, an error is not triggered.

----------
input_features : list of string, length n_features, optional
String names for input features if available. By default,
"x0", "x1", ... "x_n_features" is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have underscores (x_0, x_1) to match code below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be cleaner without underscores? I find things like x_1^2 x_2^1 harder to visually parse than x1^2 x2^1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... well I was thinking about latex. But maybe that's silly.

@amueller
Copy link
Member Author

I changed the naming scheme to x%d for easier visual parsing. I think x for data is good. We could also do x[%d], which is what we use in trees.

@amueller
Copy link
Member Author

added a unicode test. Is that what you had in mind @jnothman ?

# test some unicode
poly = PolynomialFeatures(degree=1, include_bias=True).fit(X)
feature_names = poly.get_feature_names([u"\u0001F40D", u"\u262E", u"\u05D0"])
assert_array_equal(["1", u"\u0001F40D^1", u"\u262E^1", u"\u05D0^1"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit awkward that 1 is not unicode. Can we find a simple convention for such things?

@jnothman
Copy link
Member

We could also do x[%d], which is what we use in trees.

That consistency (I assume you mean with visualisation) might be worthwhile, though I know that the use of the first subscript (as opposed to X[:, %d]) confuses people, despite the lowercase x.

@amueller
Copy link
Member Author

hm, I can't reproduce the error... it seems to be in powers_. Weird.

@amueller
Copy link
Member Author

powers_ is never tested and is broken in numpy 0.16.1 (fixed in 0.16.2), because it used the bincount of an empty sequence

@amueller
Copy link
Member Author

should be fixed. I have no opinion re x[0] vs x0, The second has less "noise" I guess?

@jakevdp
Copy link
Member

jakevdp commented Feb 20, 2016

I would prefer x0 vs x[0], because it's half the number of characters and leads to easier-to-read variable names.

@amueller
Copy link
Member Author

appveyor failure due to #4914.
@jakevdp @jnothman +1?

@jakevdp
Copy link
Member

jakevdp commented Feb 22, 2016

+1 for merge

@jnothman
Copy link
Member

+1

@jnothman jnothman changed the title [MRG] add get_feature_names to PolynomialFeatures [MRG+2] add get_feature_names to PolynomialFeatures Feb 23, 2016

Returns
-------
output_feature_names : list of string, length n_output_features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nitpick here:
Maybe there should be a "." in the end of this line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conventionally not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see.
So this line:
https://github.com/amueller/scikit-learn/blob/poly_feature_names/sklearn/preprocessing/data.py#L1586
ends with a "." because it's a new sentence used to describe output?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is description, not type.

@amueller
Copy link
Member Author

should be good now.

@jakevdp
Copy link
Member

jakevdp commented Feb 24, 2016

Once tests pass I'd say we can merge

@jakevdp
Copy link
Member

jakevdp commented Feb 25, 2016

Thanks @amueller!

jakevdp added a commit that referenced this pull request Feb 25, 2016
[MRG+2] add get_feature_names to PolynomialFeatures
@jakevdp jakevdp merged commit 7895d38 into scikit-learn:master Feb 25, 2016
@amueller amueller deleted the poly_feature_names branch May 19, 2017 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants