Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] MNT remove duplicated call to children_impurity() #18203

Merged
merged 5 commits into from
Aug 21, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
4 changes: 3 additions & 1 deletion sklearn/tree/_criterion.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,9 @@ cdef class Criterion:
cdef void children_impurity(self, double* impurity_left,
double* impurity_right) nogil
cdef void node_value(self, double* dest) nogil
cdef double impurity_improvement(self, double impurity) nogil
cdef double impurity_improvement(self, double impurity_parent,
double impurity_left,
double impurity_right) nogil
cdef double proxy_impurity_improvement(self) nogil

cdef class ClassificationCriterion(Criterion):
Expand Down
31 changes: 18 additions & 13 deletions sklearn/tree/_criterion.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,9 @@ cdef class Criterion:
return (- self.weighted_n_right * impurity_right
- self.weighted_n_left * impurity_left)

cdef double impurity_improvement(self, double impurity) nogil:
cdef double impurity_improvement(self, double impurity_parent,
double impurity_left,
double impurity_right) nogil:
"""Compute the improvement in impurity

This method computes the improvement in impurity when a split occurs.
Expand All @@ -186,24 +188,25 @@ cdef class Criterion:

Parameters
----------
impurity : double
The initial impurity of the node before the split
impurity_parent : double
The initial impurity of the parent node before the split

impurity_left : double
The impurity of the left child

impurity_right : double
The impurity of the right child

Return
------
double : improvement in impurity after the split occurs
"""

cdef double impurity_left
cdef double impurity_right

self.children_impurity(&impurity_left, &impurity_right)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the call that was removed.


return ((self.weighted_n_node_samples / self.weighted_n_samples) *
(impurity - (self.weighted_n_right /
self.weighted_n_node_samples * impurity_right)
- (self.weighted_n_left /
self.weighted_n_node_samples * impurity_left)))
(impurity_parent - (self.weighted_n_right /
self.weighted_n_node_samples * impurity_right)
- (self.weighted_n_left /
self.weighted_n_node_samples * impurity_left)))


cdef class ClassificationCriterion(Criterion):
Expand Down Expand Up @@ -1306,7 +1309,9 @@ cdef class FriedmanMSE(MSE):

return diff * diff / (self.weighted_n_left * self.weighted_n_right)

cdef double impurity_improvement(self, double impurity) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this change here?
Are we calling only the proxy_impurity_improvement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm it seems that with this change we will use the MSE impurity improvement

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks. I didn't realize Friedman MSE used a non-conventional improvement. Which makes me going through a rabbit hole, wondering whether this makes sense at all, but that's irrelevant for this PR. I put it back

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went down the same rabbit hole. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did you get out? I'm still super confused about so many things. Like does friedman_mse make sense outside of GBDTs, and do we really want to allow a MAE splitting criteria when we already have the LAD loss... so many questions lol

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not planning to get out anytime soon. There isn't many references of this criterion outside of https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... Here are my notes so far. I'll open an issue when I have a better idea of this all, but I'm happy to sync with you prior!

Does it really make sense to allow a criterion to be passed to GBDT? All trees
should be built via LS anyway. Friedman does mention e.g. for LAD that trees
could be built with LAD criterion but LS is just much faster.

WTF is friedman_mse?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥕 ?
@NicolasHug I was thinking/doubting about friedman_mse myself and would appreciate an issue if you come up with one.

cdef double impurity_improvement(self, double impurity_parent, double
impurity_left, double impurity_right) nogil:
# Note: none of the arguments are used here
cdef double* sum_left = self.sum_left
cdef double* sum_right = self.sum_right

Expand Down
16 changes: 10 additions & 6 deletions sklearn/tree/_splitter.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -434,9 +434,10 @@ cdef class BestSplitter(BaseDenseSplitter):

self.criterion.reset()
self.criterion.update(best.pos)
best.improvement = self.criterion.impurity_improvement(impurity)
self.criterion.children_impurity(&best.impurity_left,
&best.impurity_right)
best.improvement = self.criterion.impurity_improvement(
impurity, best.impurity_left, best.impurity_right)
Comment on lines 437 to +440
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, moving the call to children_impurity before impurity_improvement allow to pass the impurity of the children. Therefore, we avoid these duplicate calculations:

cdef double impurity_left
cdef double impurity_right
self.children_impurity(&impurity_left, &impurity_right)

Am I right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so!


# Respect invariant for constant features: the original order of
# element in features[:n_known_constants] must be preserved for sibling
Expand Down Expand Up @@ -745,9 +746,10 @@ cdef class RandomSplitter(BaseDenseSplitter):

self.criterion.reset()
self.criterion.update(best.pos)
best.improvement = self.criterion.impurity_improvement(impurity)
self.criterion.children_impurity(&best.impurity_left,
&best.impurity_right)
best.improvement = self.criterion.impurity_improvement(
impurity, best.impurity_left, best.impurity_right)

# Respect invariant for constant features: the original order of
# element in features[:n_known_constants] must be preserved for sibling
Expand Down Expand Up @@ -1293,9 +1295,10 @@ cdef class BestSparseSplitter(BaseSparseSplitter):

self.criterion.reset()
self.criterion.update(best.pos)
best.improvement = self.criterion.impurity_improvement(impurity)
self.criterion.children_impurity(&best.impurity_left,
&best.impurity_right)
best.improvement = self.criterion.impurity_improvement(
impurity, best.impurity_left, best.impurity_right)

# Respect invariant for constant features: the original order of
# element in features[:n_known_constants] must be preserved for sibling
Expand Down Expand Up @@ -1504,10 +1507,10 @@ cdef class RandomSparseSplitter(BaseSparseSplitter):

if current_proxy_improvement > best_proxy_improvement:
best_proxy_improvement = current_proxy_improvement
current.improvement = self.criterion.impurity_improvement(impurity)

self.criterion.children_impurity(&current.impurity_left,
&current.impurity_right)
current.improvement = self.criterion.impurity_improvement(
impurity, current.impurity_left, current.impurity_right)
best = current

# Reorganize into samples[start:best.pos] + samples[best.pos:end]
Expand All @@ -1521,9 +1524,10 @@ cdef class RandomSparseSplitter(BaseSparseSplitter):

self.criterion.reset()
self.criterion.update(best.pos)
best.improvement = self.criterion.impurity_improvement(impurity)
self.criterion.children_impurity(&best.impurity_left,
&best.impurity_right)
best.improvement = self.criterion.impurity_improvement(
impurity, best.impurity_left, best.impurity_right)

# Respect invariant for constant features: the original order of
# element in features[:n_known_constants] must be preserved for sibling
Expand Down