Does scikit-learn have any capacity for partial dependence plots and associated data arrays for random forest analyses?
Doing the same for RF outputs:
File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/partial_dependence.py", line 239, in plot_partial_dependence
    raise ValueError('gbrt has to be an instance of BaseGradientBoosting')
ValueError: gbrt has to be an instance of BaseGradientBoosting
The partial dependence can be computed efficiently for GBMs, but estimating it for a generic model (by actually simulating predictions) would be slow without some sampling. Is the suggestion here to find an efficient method for computing the PDP on random forests and decision trees, or to implement a "naive" method that estimates the PDP for any model?
The latter is slow but has the added advantage of being able to accommodate a pipeline, too.
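For reference, the "naive" approach could look roughly like the sketch below: for each grid value, overwrite the feature of interest in every row, predict, and average the predictions. This is only an illustration in plain Python, not the scikit-learn implementation; `naive_partial_dependence` and `toy_predict` are hypothetical names, and `toy_predict` stands in for any model's `predict` function.

```python
from statistics import mean


def naive_partial_dependence(predict, X, feature, grid):
    """Brute-force partial dependence of `predict` on column `feature`.

    `predict` is any callable mapping a list of rows to a list of numbers
    (hypothetical interface; a fitted model's `predict` method would fit).
    """
    pd_values = []
    for value in grid:
        # Copy every row, replacing the feature of interest with the grid value.
        X_mod = [row[:feature] + [value] + row[feature + 1:] for row in X]
        # Average the predictions over all rows of the modified data.
        pd_values.append(mean(predict(X_mod)))
    return pd_values


def toy_predict(X):
    # Hypothetical stand-in model: a fixed linear function of two features.
    return [2.0 * row[0] + row[1] for row in X]


X = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
grid = [0.0, 1.0, 2.0]
pd_curve = naive_partial_dependence(toy_predict, X, feature=0, grid=grid)
print(pd_curve)  # [3.0, 5.0, 7.0]
```

The cost is one full prediction pass per grid point (and per pair of grid points for a 2D plot), which is why this gets slow. Averaging over a random subsample of rows instead of all of `X` would be one way to trade accuracy for speed.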
@pauljacksonrodgers Yes, the partial plots on a generic model can take some time on 2D plots, but I should have a WIP PR up tomorrow that adds an "estimated" option that runs rather quickly on any model. Almost there! A bit more involved than I originally expected, but one week late in open source is essentially really, really, really, ridiculously early :-)
current %timeit implementation on 1D for a GBM on the
and on 2D:
and for 2D:
where "exact" is the
(where 1D/2D refers to the number of variables represented by the pdplot)
Hi @amueller, I just want to clarify the current state. I am using 0.19.1-2 on Python 3.5. I hit the check
on line 123 of sklearn\ensemble\partial_dependence.py when using BaggingClassifier, which is definitely a sklearn estimator. So the current implementation seems to work only for BaseGradientBoosting, not for all sklearn estimators. Does #5653 intend to expand this to all estimators, or only to all sklearn estimators? Is there a release plan?