Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] ENH Add support for dataframe in PDP #14028

Merged
merged 110 commits into from Oct 31, 2019

Conversation

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Jun 5, 2019

build on:

This PR adds support for:

  • pandas DataFrame
  • ColumnTransformer as a preprocessing within a Pipeline.

Additional features for future PRs:

  • Request of PDP for augmented features. For instance, one might compute the interaction between features (e.g. PolynomialFeatures) and would be interested in the PDP of these new features. It seems that we will need support for get_feature_names for the pipeline to ease such use case.
  • Support of categorical data (bar plot should be OK for those).
@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Jun 6, 2019

Some questions regarding the expected behaviour:

  1. When using a pipeline, do we expect values to reflect the data scale after or before preprocessing. Values before preprocessing will require support of inverse_transform which will be challenging for some transformer (e.g., ColumnTransformer, PolynomialFeature, etc.);
  2. In the case of data augmentation within a pipeline (e.g., PolynomialFeature), one would be interested to know the interaction of an augmented feature.

@glemaitre glemaitre closed this Jun 6, 2019
@glemaitre glemaitre reopened this Jun 6, 2019
@jnothman
Copy link
Member

@jnothman jnothman commented Jun 6, 2019

@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Jun 6, 2019

If a pipeline is passed, we should be quantifying in terms of its input features... If the user wants otherwise they should slice it.

After couple of iterations this morning is what I thought as well. This will make thing so difficult otherwise. get_feature_names will be handy in case a user need to slice it.

@NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Jun 7, 2019

Feel free to ping when you need reviews! (I'll look at #14035)

@glemaitre glemaitre changed the title [WIP] Add support for dataframe in PDP [MRG] Add support for dataframe in PDP Jul 23, 2019
@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Jul 23, 2019

I think this is good to be reviewed @thomasjpfan @NicolasHug

I will probably add an example to illustrate to we can use the ColumnTransformer

@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Oct 28, 2019

I dropped the support and put a meaningful error message depending of the type of X

Copy link
Member

@ogrisel ogrisel left a comment

I started to review but I have now the following question:

sklearn/inspection/_partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_partial_dependence.py Outdated Show resolved Hide resolved
@adrinjalali adrinjalali added this to In progress in Meeting Issues via automation Oct 29, 2019
@adrinjalali adrinjalali moved this from In progress to Review in progress in Meeting Issues Oct 29, 2019
Meeting Issues automation moved this from Review in progress to Reviewer approved Oct 30, 2019
Copy link
Member

@ogrisel ogrisel left a comment

LGTM.

I think the example https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html should be updated to use a data frame and to pass features names instead of feature indices to make it more readable. But this can be done in a later PR if your prefer. As you wish.

``X`` is used both to generate a grid of values for the
``features``, and to compute the averaged predictions when
method is 'brute'.
features : list or array-like of int
features : array-like of {int, str}
The target features for which the partial dependency should be
computed.
Copy link
Member

@ogrisel ogrisel Oct 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make it explicit that it should be either a single feature or a pair of 2 features.

Also "target features" does not really mean anything.

May I suggest the following: "The feature or pair of interacting features for which the partial dependency should be computed."

@ogrisel
Copy link
Member

@ogrisel ogrisel commented Oct 30, 2019

@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Oct 30, 2019

I updated the examples. However, be aware that we still have to specify features and features_names. I only updated partial_dependence and not plot_partial_dependence. So the next upcoming PR is to make plot_partial_dependence infer the features_names for dataframe such that one has only to give a list of features names.

@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Oct 31, 2019

@NicolasHug Can we merge this one.

@glemaitre
Copy link
Contributor Author

@glemaitre glemaitre commented Oct 31, 2019

@thomasjpfan as well :P

@thomasjpfan thomasjpfan merged commit 7ea7861 into scikit-learn:master Oct 31, 2019
19 checks passed
Meeting Issues automation moved this from Reviewer approved to Done Oct 31, 2019
@NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Oct 31, 2019

Thanks @glemaitre !

@glemaitre glemaitre moved this from WAITING FOR REVIEW to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Oct 31, 2019
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to MERGED in Guillaume's pet Oct 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet