ENH get column names by default in PDP when passing data… #15429

Merged
merged 8 commits into from Nov 7, 2019

Conversation

@glemaitre glemaitre commented Nov 1, 2019

Follow-up to #14028.
Partially addresses #14969.

This removes the need to specify feature_names when passing a pandas DataFrame: X.columns.tolist() is used by default.

Contributor Author

@glemaitre glemaitre commented Nov 1, 2019

Member

@NicolasHug NicolasHug left a comment

Mostly looks good

doc/whats_new/v0.22.rst (outdated, resolved)
sklearn/inspection/_partial_dependence.py (resolved)
@NicolasHug NicolasHug added this to the 0.22 milestone Nov 1, 2019
@glemaitre glemaitre changed the title ENH get column names by default in PDP when passing dataframe [MRG] ENH get column names by default in PDP when passing dataframe Nov 4, 2019
Member

@adrinjalali adrinjalali left a comment

Is there a way to check if the correct feature names are used in the plot?

if not(hasattr(X, '__array__') or sparse.issparse(X)):
    X = check_array(X, force_all_finite='allow-nan', dtype=np.object)
Member

@adrinjalali adrinjalali Nov 4, 2019

I feel like at some point this should be inside the check_array. Also, why not pass accept_sparse to check_array and not check it here?

Contributor Author

@glemaitre glemaitre Nov 4, 2019

> I feel like at some point this should be inside the check_array

Agreed. I think that @jorisvandenbossche intended a PR on this a while ago.

> Also, why not pass accept_sparse to check_array and not check it here?

I think that the idea was to delegate the check to the underlying pipeline.
In fact, I am not sure that we need to do any checking at all. If we only need to get a column, _safe_indexing should be smart enough to deal with a list.

Contributor Author

@glemaitre glemaitre Nov 4, 2019

Actually, the next line (n_features=X.shape[0]) will fail, so we need something other than a list.

if hasattr(X, "loc"):
    # get the column names for a pandas dataframe
    feature_names = X.columns.tolist()
Member

@adrinjalali adrinjalali Nov 4, 2019

should this not explicitly check for columns instead?

Contributor Author

@glemaitre glemaitre Nov 4, 2019

Until now, we have always duck-typed DataFrames this way.

Member

@NicolasHug NicolasHug Nov 4, 2019

Maybe it's time to have an is_dataframe helper

Member

@adrinjalali adrinjalali Nov 4, 2019

We can leave it for another PR, though. I'm happy with this PR as is.
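
The `is_dataframe` helper floated in this thread might look something like the sketch below. This is only an illustration under stated assumptions: the function body is hypothetical and is not the helper scikit-learn eventually adopted. It combines the existing `loc` duck-typing with the explicit `columns` check raised earlier in the thread. The `SimpleNamespace` objects are stand-ins so the example runs without pandas; note that a pandas Series exposes `.loc` but has no `.columns`, which is the case the second check guards against.

```python
from types import SimpleNamespace


def is_dataframe(X):
    """Return True if X looks like a pandas DataFrame (duck-typed).

    Hypothetical helper: requires both label-based indexing (.loc)
    and column labels (.columns).
    """
    return hasattr(X, "loc") and hasattr(X, "columns")


# Stand-ins that expose only the attributes the check looks at.
frame_like = SimpleNamespace(loc=object(), columns=["a", "b"])
series_like = SimpleNamespace(loc=object())  # .loc but no .columns

print(is_dataframe(frame_like))        # True
print(is_dataframe(series_like))       # False
print(is_dataframe([[1, 2], [3, 4]]))  # False
```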

Member

@NicolasHug NicolasHug left a comment

LGTM, thanks @glemaitre

> Is there a way to check if the correct feature names are used in the plot?

That'd be nice, I think @thomasjpfan would know that?

Contributor Author

@glemaitre glemaitre commented Nov 4, 2019

> That'd be nice, I think @thomasjpfan would know that?

This is already done at lines 131-132 of test_plot_partial_dependence.py.

Contributor Author

@glemaitre glemaitre commented Nov 4, 2019

Actually, my change to this test creates a DataFrame, so we no longer test with NumPy arrays. We should test both there.

Contributor Author

@glemaitre glemaitre commented Nov 4, 2019

I added back the test where the input data is a numpy array and feature_names is given.

Member

@thomasjpfan thomasjpfan commented Nov 7, 2019

Merged with master to get fix for CI. Will merge when green.

@thomasjpfan thomasjpfan changed the title [MRG] ENH get column names by default in PDP when passing dataframe ENH get column names by default in PDP when passing data… Nov 7, 2019
@thomasjpfan thomasjpfan merged commit 2e881f5 into scikit-learn:master Nov 7, 2019
20 checks passed
Member

@thomasjpfan thomasjpfan commented Nov 7, 2019

Thank you @glemaitre !
