refactor plot_dependence and implement fast version of pdp #85
Conversation
I just added some comments.
Thanks! I will try to fix the mypy errors this week. If you think you can do it, go ahead. And if you have more tips to share, they are more than welcome.
Ok! Here is some initial feedback from a user perspective. I tried the changes in https://www.pymc.io/projects/examples/en/latest/case_studies/BART_introduction.html The ICE plots also work but take much longer (3 min on my Intel Mac), though maybe that is expected. I tried adding more trees to see if it would capture more complexity; it did not, although the results look a bit better as the ranges are more reasonable. So I think the plots are working as expected. The linear response issue might be related to the model predictions in the linear response case, right?
Right, ICE plots are expected to be much slower: they require a lot of predictions. I think the only way to accelerate them is to improve the inner workings of the tree methods, and that probably requires rewriting them in Cython, Rust, or something similar. There is something funny going on with the linear response predictions; I am still not sure what exactly. If we take the mean of the values stored at the leaf nodes, the plots look reasonable. So yes, there is something with the predictions that we need to fix.
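To make the cost difference concrete, here is a minimal sketch of why ICE is so much slower than PDP: ICE needs one full prediction pass per grid value over every observation, and PDP is just the average of those curves. The `predict` callable and function names below are illustrative, not the actual pymc-bart internals.

```python
import numpy as np

def ice_curves(predict, X, var_idx, grid):
    """Individual conditional expectation curves.

    For each grid value we replace column `var_idx` of X and call `predict`,
    so the total cost is n_obs * n_grid model predictions.
    `predict` is a hypothetical callable mapping an (n, d) array to (n,) values.
    """
    curves = np.empty((X.shape[0], len(grid)))
    for j, g in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, var_idx] = g          # intervene on one predictor
        curves[:, j] = predict(X_mod)  # one full prediction pass per grid value
    return curves

def pdp_curve(predict, X, var_idx, grid):
    """The partial dependence curve is the mean of the ICE curves."""
    return ice_curves(predict, X, var_idx, grid).mean(axis=0)
```

With a linear predictor, the PDP along variable `j` recovers the slope of that variable plus a constant offset from the averaged remaining covariates, which is a handy sanity check.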
    p_d += weight * node.value
else:
    # this produces nonsensical results
    p_d += weight * ((params[0] + params[1] * x[node.idx_split_variable]) / m)
One quick (naive) question... when fitting the linear model, we use as covariates `X[idx_data_point, selected_predictor]` (see https://github.com/pymc-devs/pymc-bart/blob/main/pymc_bart/pgbart.py#L433) ... how does this map to `x[node.idx_split_variable]`? Aren't we missing the selected predictor specification?
What we call `selected_predictor` here, in `X[idx_data_point, selected_predictor]`, is equivalent to what we call `idx_split_variable` here, in `x[node.idx_split_variable]`. In the first case `X` is a matrix and has many `idx_data_point`s; in the second case `x` is a vector, i.e. a single data point.
By doing `node.value.mean()` we are using the mean of the stored/fitted values; when doing `p_d += weight * ((params[0] + params[1] * x[node.idx_split_variable]) / m)` we are generating new predictions. This seems to indicate the fitting is OK, but the predictions are not.
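The two prediction modes being contrasted can be sketched as follows. This is a simplified stand-in for the leaf-node logic, not the actual pymc-bart `Node` class: the attribute and parameter names (`value`, `params`, `idx_split_variable`, `m`) mirror the snippet above but the class itself is hypothetical.

```python
import numpy as np

class Leaf:
    """Toy leaf node holding the fitted values from sampling and the
    (intercept, slope) parameters of the per-leaf linear fit."""
    def __init__(self, values, params, idx_split_variable):
        self.value = np.asarray(values)
        self.params = params  # (intercept, slope)
        self.idx_split_variable = idx_split_variable

def predict_mean(leaf):
    """Mode 1: use the mean of the values stored at the leaf
    (this is the version that produces reasonable plots)."""
    return leaf.value.mean()

def predict_linear(leaf, x, m):
    """Mode 2: evaluate the fitted linear response at a new point x,
    scaled by the number of trees m as in the sum-of-trees model
    (this is the version that currently misbehaves)."""
    intercept, slope = leaf.params
    return (intercept + slope * x[leaf.idx_split_variable]) / m
```

Comparing the two functions on the same leaf is a quick way to see how far the linear extrapolation drifts from the stored values.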
Got it! Thanks!
@aloctavodia I was just looking into the ICE plots above... they seem shifted on the y-axis, as they have the same shape but can be negative ... I do not think this is expected, right? Their mean should be the pdp, right?
The scale will be off unless you specify the argument
This implements a faster version of the partial dependence plots (~15x faster). I also took the opportunity to clean up the code in utils. `plot_dependence` is deprecated in favor of two separate functions, `plot_pdp` and `plot_ice`.