[MRG] MNT remove duplicated call to children_impurity() #18203
cdef double impurity_left
cdef double impurity_right

self.children_impurity(&impurity_left, &impurity_right)
This is the call that was removed.
Thank you @NicolasHug!
I think that with this PR, we can avoid duplicate calculations.
self.criterion.children_impurity(&best.impurity_left,
&best.impurity_right)
best.improvement = self.criterion.impurity_improvement(
impurity, best.impurity_left, best.impurity_right)
IIUC, moving the call to children_impurity before impurity_improvement allows us to pass in the impurity of the children. Therefore, we avoid these duplicate calculations:
scikit-learn/sklearn/tree/_criterion.pyx
Lines 197 to 200 in 395d6c1
cdef double impurity_left
cdef double impurity_right
self.children_impurity(&impurity_left, &impurity_right)
Am I right?
yes
LGTM so!
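The reordering discussed in this thread can be sketched in plain Python. This is only an illustration: `ToyCriterion`, `improvement_old`, `improvement_new`, `split_old` and `split_new` are hypothetical names, not scikit-learn's API, and the improvement formula is simplified (unit weights, no normalization by the total number of samples).

```python
def variance(values):
    """Population variance, used here as a toy impurity measure."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


class ToyCriterion:
    """Hypothetical stand-in for a regression criterion (not scikit-learn's)."""

    def __init__(self, left, right):
        self.left, self.right = left, right
        self.children_impurity_calls = 0  # counts how often the work is done

    def children_impurity(self):
        self.children_impurity_calls += 1
        return variance(self.left), variance(self.right)

    def improvement_old(self, impurity):
        # Old shape: the children impurities are recomputed internally.
        impurity_left, impurity_right = self.children_impurity()
        return self.improvement_new(impurity, impurity_left, impurity_right)

    def improvement_new(self, impurity, impurity_left, impurity_right):
        # New shape: the caller passes the children impurities in.
        n_left, n_right = len(self.left), len(self.right)
        n = n_left + n_right
        return (impurity
                - (n_left / n) * impurity_left
                - (n_right / n) * impurity_right)


def split_old(criterion, impurity):
    improvement = criterion.improvement_old(impurity)  # first call (hidden)
    left, right = criterion.children_impurity()        # second call
    return improvement, left, right


def split_new(criterion, impurity):
    left, right = criterion.children_impurity()        # single call
    improvement = criterion.improvement_new(impurity, left, right)
    return improvement, left, right
```

Both orderings return the same values in this sketch; the only difference is that the old one computes the children impurities twice per split.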
Co-authored-by: Juan Carlos Alfaro Jiménez <JuanCarlos.Alfaro@uclm.es>
@@ -1305,23 +1308,3 @@ cdef class FriedmanMSE(MSE):
self.weighted_n_left * total_sum_right)

return diff * diff / (self.weighted_n_left * self.weighted_n_right)

cdef double impurity_improvement(self, double impurity) nogil:
What is the reason for this change here?
Are we calling only the proxy_impurity_improvement?
uhm it seems that with this change we will use the MSE impurity improvement
Good catch, thanks. I didn't realize Friedman MSE used a non-conventional improvement. That sent me down a rabbit hole, wondering whether this makes sense at all, but that's irrelevant for this PR. I put it back.
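For readers who want something concrete to compare, here is a small sketch of the improvement score from Friedman's gradient boosting paper next to the plain reduction in sum of squared errors. It assumes unit sample weights, and the function names are hypothetical; it does not reproduce scikit-learn's code or its normalization.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def friedman_improvement(left, right):
    # Friedman's score with unit weights:
    # n_left * n_right / (n_left + n_right) * (mean_left - mean_right) ** 2
    n_left, n_right = len(left), len(right)
    diff = mean(left) - mean(right)
    return n_left * n_right / (n_left + n_right) * diff ** 2


def sse_reduction(left, right):
    # Un-normalized least-squares improvement: parent SSE minus children SSE.
    return sse(left + right) - sse(left) - sse(right)
```

With unit weights the two functions agree up to floating-point error, so any difference in practice would have to come from the weights or from how each score is normalized, which this sketch does not cover.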
Went down the same rabbit hole. :)
How did you get out? I'm still super confused about so many things. Like does friedman_mse make sense outside of GBDTs, and do we really want to allow an MAE splitting criterion when we already have the LAD loss... so many questions lol
I am not planning to get out anytime soon. There aren't many references to this criterion outside of https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
Yeah... Here are my notes so far. I'll open an issue when I have a better idea of this all, but I'm happy to sync with you prior!
Does it really make sense to allow a criterion to be passed to GBDT? All trees
should be built via LS anyway. Friedman does mention e.g. for LAD that trees
could be built with LAD criterion but LS is just much faster.
WTF is friedman_mse?
- It was introduced here: [MRG] Gradient Boosting enhancements #2570
- It used to be the hardcoded and non-overridable default for all GBDT
- Then [MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor #6667 introduced LAD
criterion for all trees, and started exposing a criterion param to all
models.
- Does it really make sense to allow it for all tree models??
- It was introduced in the context of multiclass GB.
- Also WTF are the weights? Our implementation completely differs from what the
paper defines. The weights in the paper aren't sample weights, they're the hessians
🥕 ?
@NicolasHug I was thinking/doubting about friedman_mse myself and would appreciate an issue if you come up with one.
LGTM
LGTM. @NicolasHug Nice catch!
@NicolasHug Would you like a what's new entry? It's supposedly a slight performance improvement.
I think we are fine merging as it is. @NicolasHug Feel free to open an issue on the Friedman MSE. There are some relics in the code that only some old wizard knows about :) (even git blame does not help there)
…t-learn#18203) Co-authored-by: Juan Carlos Alfaro Jiménez <JuanCarlos.Alfaro@uclm.es>
In the tree Splitter we call impurity_improvement() and then children_impurity(). But impurity_improvement() itself will call children_impurity() internally, so this PR removes the duplicated work.
CC @lorentzenchr @thomasjpfan