Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] MNT remove duplicated call to children_impurity() #18203

Merged
merged 5 commits into from
Aug 21, 2020

Conversation

NicolasHug
Copy link
Member

In the tree Splitter we call impurity_improvement() and then children_impurity().

But impurity_improvement() itself will call children_impurity() internally so this PR removes the duplicated work.

CC @lorentzenchr @thomasjpfan

cdef double impurity_left
cdef double impurity_right

self.children_impurity(&impurity_left, &impurity_right)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the call that was removed.

Copy link
Member

@alfaro96 alfaro96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @NicolasHug!

I think that with this PR, we can avoid duplicate calculations.

sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
Comment on lines 437 to +440
self.criterion.children_impurity(&best.impurity_left,
&best.impurity_right)
best.improvement = self.criterion.impurity_improvement(
impurity, best.impurity_left, best.impurity_right)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, moving the call to children_impurity before impurity_improvement allow to pass the impurity of the children. Therefore, we avoid these duplicate calculations:

cdef double impurity_left
cdef double impurity_right
self.children_impurity(&impurity_left, &impurity_right)

Am I right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so!

Co-authored-by: Juan Carlos Alfaro Jiménez <JuanCarlos.Alfaro@uclm.es>
@@ -1305,23 +1308,3 @@ cdef class FriedmanMSE(MSE):
self.weighted_n_left * total_sum_right)

return diff * diff / (self.weighted_n_left * self.weighted_n_right)

cdef double impurity_improvement(self, double impurity) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this change here?
Are we calling only the proxy_impurity_improvement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm it seems that with this change we will use the MSE impurity improvement

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks. I didn't realize Friedman MSE used a non-conventional improvement. Which makes me going through a rabbit hole, wondering whether this makes sense at all, but that's irrelevant for this PR. I put it back

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went down the same rabbit hole. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did you get out? I'm still super confused about so many things. Like does friedman_mse make sense outside of GBDTs, and do we really want to allow a MAE splitting criteria when we already have the LAD loss... so many questions lol

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not planning to get out anytime soon. There isn't many references of this criterion outside of https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... Here are my notes so far. I'll open an issue when I have a better idea of this all, but I'm happy to sync with you prior!

Does it really make sense to allow a criterion to be passed to GBDT? All trees
should be built via LS anyway. Friedman does mention e.g. for LAD that trees
could be built with LAD criterion but LS is just much faster.

WTF is friedman_mse?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥕 ?
@NicolasHug I was thinking/doubting about friedman_mse myself and would appreciate an issue if you come up with one.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1305,23 +1308,3 @@ cdef class FriedmanMSE(MSE):
self.weighted_n_left * total_sum_right)

return diff * diff / (self.weighted_n_left * self.weighted_n_right)

cdef double impurity_improvement(self, double impurity) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went down the same rabbit hole. :)

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @NicolasHug Nice catch!

@lorentzenchr
Copy link
Member

@NicolasHug Would you like a what's new entry? It's supposedly a slight performance improvement.

@glemaitre
Copy link
Member

@NicolasHug Would you like a what's new entry? It's supposedly a slight performance improvement.

I think we are fine merging as it is.

@NicolasHug Feel free to open an issue on the Friedman MSE. They are some relics in the code that only some old wizard knows about :) (even git blame does not help there)

@glemaitre glemaitre merged commit 22f232e into scikit-learn:master Aug 21, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…t-learn#18203)

Co-authored-by: Juan Carlos Alfaro Jiménez <JuanCarlos.Alfaro@uclm.es>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants