
Partial Dependence Plots for Random Forests. #4405

Closed
Autodidact24 opened this issue Mar 18, 2015 · 18 comments · Fixed by #12599

Comments

@Autodidact24 commented Mar 18, 2015

Does scikit-learn have any capacity for partial dependence plots and associated data arrays for random forest analyses?
I can find the plot for GradientBoostingRegressor here http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html.

Doing the same for RF outputs:

File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/partial_dependence.py", line 239, in plot_partial_dependence
    raise ValueError('gbrt has to be an instance of BaseGradientBoosting')
ValueError: gbrt has to be an instance of BaseGradientBoosting
@DonBeo commented Apr 27, 2015

I have the same problem. It would be nice to have partial dependence plots for random forests or extra trees.

@dchudz commented Jun 1, 2015

Is there any particular reason this was implemented only for gradient boosted models? Seems like partial dependence plots should be a very general idea.

@amueller (Member) commented Jun 1, 2015

Not really, I think. We should probably add a more general helper.

@olivermueller commented Sep 23, 2015

I'm running into the same problem. It would be nice to have partial dependence plots for random forests (or any other classifier).

@trevorstephens (Contributor) commented Sep 23, 2015

I'll work on this one.

@sniemi commented Oct 22, 2015

+1, this would be very valuable. Any update?

@trevorstephens (Contributor) commented Oct 22, 2015

I expect to be pushing a WIP PR this weekend.

@pauljacksonrodgers commented Oct 29, 2015

The partial dependence can be computed efficiently for GBMs, but estimating it for a generic model (by actually simulating predictions) would be slow without some sampling. Is the suggestion here to use an efficient method to compute the PDP for random forests and decision trees, or to implement a "naive" method that estimates the PDP for any model?

The latter is slow but has the added advantage of being able to accommodate a pipeline, too.
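For reference, the "naive" method is straightforward to sketch with nothing but NumPy. The function and toy model below are hypothetical illustrations, not part of scikit-learn:

```python
import numpy as np

def naive_partial_dependence(predict, X, feature, grid):
    """Model-agnostic partial dependence: for each grid value v, overwrite
    column `feature` of X with v and average the predictions. This needs
    one full prediction pass per grid point, which is why it is slow, but
    it works for any callable -- including a fitted Pipeline's predict."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # force the feature to the grid value
        averages.append(predict(X_mod).mean())
    return np.asarray(averages)

# Toy additive "model" standing in for any estimator's predict method.
toy_predict = lambda X: X[:, 0] ** 2 + X[:, 1]
X = np.array([[0.0, 1.0], [0.0, 3.0]])
print(naive_partial_dependence(toy_predict, X, feature=0, grid=[0.0, 1.0, 2.0]))
# -> [2. 3. 6.]
```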

@trevorstephens (Contributor) commented Nov 1, 2015

@pauljacksonrodgers Yes, partial dependence plots for a generic model can take some time in 2D, but I should have a WIP PR up tomorrow that helps, with an "estimated" option that runs rather quickly on any model. Almost there! A bit more involved than I originally expected, but one week late in open source is essentially really, really, really, ridiculously early :-)

%timeit results for the current implementation, 1D, for a GBM on the breast_cancer binary classification dataset (~500 obs):

recursion
1000 loops, best of 3: 983 µs per loop
exact
100 loops, best of 3: 4.98 ms per loop
estimated
1000 loops, best of 3: 567 µs per loop

and on 2D:

recursion
1000 loops, best of 3: 1.71 ms per loop
exact
10 loops, best of 3: 24.9 ms per loop
estimated
1000 loops, best of 3: 1.2 ms per loop

on the boston dataset for 1D regression (similar shape):

recursion
1000 loops, best of 3: 957 µs per loop
exact
100 loops, best of 3: 3.23 ms per loop
estimated
1000 loops, best of 3: 483 µs per loop

and for 2D:

recursion
1000 loops, best of 3: 1.64 ms per loop
exact
100 loops, best of 3: 15.5 ms per loop
estimated
1000 loops, best of 3: 996 µs per loop

where "exact" is the predict_proba call you refer to as potentially slow (it is), "recursion" is the current implementation, and "estimated" is a little magic using dataset means. Stay tuned.

(where 1D/2D refers to the number of variables represented by the pdplot)
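One plausible reading of the "dataset means" shortcut described above, sketched with hypothetical names (the actual PR may differ): hold every other feature at its dataset mean, so a single len(grid)-row predict call replaces len(grid) passes over the whole dataset.

```python
import numpy as np

def mean_based_pdp(predict, X, feature, grid):
    """Cheap PDP approximation: fix all other features at their dataset
    means and predict once per grid value. Much faster than the exact
    method, but the result can diverge from the true partial dependence
    when the model has strong feature interactions."""
    X = np.asarray(X, dtype=float)
    rows = np.tile(X.mean(axis=0), (len(grid), 1))
    rows[:, feature] = grid  # vary only the feature of interest
    return predict(rows)

toy_predict = lambda X: X[:, 0] + X[:, 1]  # additive toy model
X = np.array([[0.0, 2.0], [2.0, 4.0]])     # column means: [1.0, 3.0]
print(mean_based_pdp(toy_predict, X, feature=0, grid=[0.0, 1.0]))
# -> [3. 4.]
```

For a purely additive model like the toy above, this matches the exact PDP up to a constant; with interactions it is only an estimate.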

@thommiano commented Oct 14, 2016

Any status updates on this?

@DonBeo commented Oct 14, 2016

I think partial dependence plots should be a general function available for every class with a predict method.

@brityboy commented Nov 8, 2016

Yes please, partial dependence plots for random forests would be much appreciated.

@jnothman (Member) commented Nov 8, 2016

@brityboy your comments may be welcome at #5653. @trevorstephens has put in a great effort there, and it's probably something you can play with now, but it's going to take more work in code, documentation and review.

@darthdeus commented Apr 20, 2018

Is there any update on this? I just ran into the same issue of

gbrt has to be an instance of BaseGradientBoosting

while trying to use xgboost. I might be missing something, but why would it matter what classifier is being used?

@lucyhuo commented May 20, 2018

Same issue here.

@amueller (Member) commented May 20, 2018

@lucyhuo @darthdeus the current implementation in sklearn only works with sklearn's own gradient boosting estimators, not third-party models like xgboost. We are working on a general implementation at #5653.

@DrEhrfurchtgebietend commented Jun 11, 2018

Hi @amueller, just want to clarify the current state. I am using 0.19.1-2 on Python 3.5. I hit the check

if not isinstance(gbrt, BaseGradientBoosting):

on line 123 of sklearn\ensemble\partial_dependence.py when using BaggingClassifier, which is definitely a sklearn estimator. So the current implementation seems to work only for BaseGradientBoosting, not all sklearn estimators. Does #5653 intend to expand this to all estimators, or just all sklearn estimators? Is there a release plan?

@jnothman (Member) commented Jun 12, 2018
