Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] ENH Add support for dataframe in PDP #14028

Merged
merged 110 commits into from Oct 31, 2019

Conversation

@glemaitre
Copy link
Contributor

glemaitre commented Jun 5, 2019

build on:

This PR adds support for:

  • pandas DataFrame
  • ColumnTransformer as a preprocessing within a Pipeline.

Additional features for future PRs:

  • Request of PDP for augmented features. For instance, one might compute the interaction between features (e.g. PolynomialFeatures) and would be interested in the PDP of these new features. It seems that we will need support for get_feature_names for the pipeline to ease such use case.
  • Support of categorical data (bar plot should be OK for those).
@glemaitre glemaitre force-pushed the glemaitre:pdp_dataframe branch from 7ae21e6 to 35325f2 Jun 5, 2019
@glemaitre glemaitre force-pushed the glemaitre:pdp_dataframe branch from 35325f2 to f2035fe Jun 5, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Jun 6, 2019

Some questions regarding the expected behaviour:

  1. When using a pipeline, do we expect values to reflect the data scale after or before preprocessing. Values before preprocessing will require support of inverse_transform which will be challenging for some transformer (e.g., ColumnTransformer, PolynomialFeature, etc.);
  2. In the case of data augmentation within a pipeline (e.g., PolynomialFeature), one would be interested to know the interaction of an augmented feature.
@glemaitre glemaitre closed this Jun 6, 2019
@glemaitre glemaitre reopened this Jun 6, 2019
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Jun 6, 2019

@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Jun 6, 2019

If a pipeline is passed, we should be quantifying in terms of its input features... If the user wants otherwise they should slice it.

After couple of iterations this morning is what I thought as well. This will make thing so difficult otherwise. get_feature_names will be handy in case a user need to slice it.

@NicolasHug

This comment has been minimized.

Copy link
Contributor

NicolasHug commented Jun 7, 2019

Feel free to ping when you need reviews! (I'll look at #14035)

glemaitre added 8 commits Jul 18, 2019
fix
@glemaitre glemaitre changed the title [WIP] Add support for dataframe in PDP [MRG] Add support for dataframe in PDP Jul 23, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Jul 23, 2019

I think this is good to be reviewed @thomasjpfan @NicolasHug

I will probably add an example to illustrate to we can use the ColumnTransformer

glemaitre added 4 commits Jul 23, 2019
DOC
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 28, 2019

I dropped the support and put a meaningful error message depending of the type of X

Copy link
Member

ogrisel left a comment

I started to review but I have now the following question:

sklearn/inspection/_partial_dependence.py Outdated Show resolved Hide resolved
sklearn/inspection/_partial_dependence.py Outdated Show resolved Hide resolved
@adrinjalali adrinjalali added this to In progress in Meeting Issues via automation Oct 29, 2019
@adrinjalali adrinjalali moved this from In progress to Review in progress in Meeting Issues Oct 29, 2019
Meeting Issues automation moved this from Review in progress to Reviewer approved Oct 30, 2019
Copy link
Member

ogrisel left a comment

LGTM.

I think the example https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html should be updated to use a data frame and to pass features names instead of feature indices to make it more readable. But this can be done in a later PR if your prefer. As you wish.

``X`` is used both to generate a grid of values for the
``features``, and to compute the averaged predictions when
method is 'brute'.
features : list or array-like of int
features : array-like of {int, str}
The target features for which the partial dependency should be
computed.

This comment has been minimized.

Copy link
@ogrisel

ogrisel Oct 30, 2019

Member

Maybe make it explicit that it should be either a single feature or a pair of 2 features.

Also "target features" does not really mean anything.

May I suggest the following: "The feature or pair of interacting features for which the partial dependency should be computed."

@ogrisel

This comment has been minimized.

glemaitre added 3 commits Oct 30, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 30, 2019

I updated the examples. However, be aware that we still have to specify features and features_names. I only updated partial_dependence and not plot_partial_dependence. So the next upcoming PR is to make plot_partial_dependence infer the features_names for dataframe such that one has only to give a list of features names.

@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 31, 2019

@NicolasHug Can we merge this one.

@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 31, 2019

@thomasjpfan as well :P

@thomasjpfan thomasjpfan merged commit 7ea7861 into scikit-learn:master Oct 31, 2019
17 of 19 checks passed
17 of 19 checks passed
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python 1 fixed alert
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc artifact Link to 0/doc/_changed.html
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 97.15%)
Details
codecov/project Absolute coverage decreased by -1.12% but relative coverage increased by +2.84% compared to 132ad99
Details
scikit-learn.scikit-learn Build #20191030.72 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda_mkl) Linux pylatest_conda_mkl succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
Meeting Issues automation moved this from Reviewer approved to Done Oct 31, 2019
@NicolasHug

This comment has been minimized.

Copy link
Contributor

NicolasHug commented Oct 31, 2019

Thanks @glemaitre !

@glemaitre glemaitre moved this from WAITING FOR REVIEW to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Oct 31, 2019
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to MERGED in Guillaume's pet Oct 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
You can’t perform that action at this time.