[MRG+1] correct condition in decision tree construction #7441

Merged
merged 4 commits into from Sep 29, 2016

Conversation

Projects
None yet
6 participants
@nelson-liu
Contributor

nelson-liu commented Sep 15, 2016

Reference Issue

Continuation of #7340

What does this implement/fix? Explain your changes.

Removes an unused leaf condition in _tree.pyx.

Any other comments?

ping @jnothman, who I previously discussed this with


This change is Reviewable

@nelson-liu nelson-liu changed the title from remove unused condition in decision tree construction to [MRG] remove unused condition in decision tree construction Sep 15, 2016

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 15, 2016

Member

While I think these conditions are logically inappropriate, just in case I'm wrong, have you confirmed that they are never executed at least by the test suite?

Member

jnothman commented Sep 15, 2016

While I think these conditions are logically inappropriate, just in case I'm wrong, have you confirmed that they are never executed at least by the test suite?

@nelson-liu

This comment has been minimized.

Show comment
Hide comment
@nelson-liu

nelson-liu Sep 15, 2016

Contributor

just verified; in the original code, never is weighted_n_node_samples < min_weight_leaf True after the is_leaf conditional changed here.

Contributor

nelson-liu commented Sep 15, 2016

just verified; in the original code, never is weighted_n_node_samples < min_weight_leaf True after the is_leaf conditional changed here.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 15, 2016

Member

I mean with something like printing a message if the condition is ever true
here

On 16 September 2016 at 07:12, Nelson Liu notifications@github.com wrote:

just verified; in the original code, never is weighted_n_node_samples <
min_weight_leaf True after the is_leaf conditional changed here.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6yo7jjjxDvHfGJuRipCQDXJ3Dqh-ks5qqbTOgaJpZM4J-TZK
.

Member

jnothman commented Sep 15, 2016

I mean with something like printing a message if the condition is ever true
here

On 16 September 2016 at 07:12, Nelson Liu notifications@github.com wrote:

just verified; in the original code, never is weighted_n_node_samples <
min_weight_leaf True after the is_leaf conditional changed here.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6yo7jjjxDvHfGJuRipCQDXJ3Dqh-ks5qqbTOgaJpZM4J-TZK
.

@nelson-liu

This comment has been minimized.

Show comment
Hide comment
@nelson-liu

nelson-liu Sep 15, 2016

Contributor

yes, that's what i did (and nothing was printed). sorry for being unclear.

Contributor

nelson-liu commented Sep 15, 2016

yes, that's what i did (and nothing was printed). sorry for being unclear.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 15, 2016

Member

Great, thanks. LGTM

On 16 September 2016 at 08:46, Nelson Liu notifications@github.com wrote:

yes, that's what i did (and nothing was printed). sorry for being unclear.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6zNg3-eqW__PJocp5O2LgZZVKYyhks5qqcqogaJpZM4J-TZK
.

Member

jnothman commented Sep 15, 2016

Great, thanks. LGTM

On 16 September 2016 at 08:46, Nelson Liu notifications@github.com wrote:

yes, that's what i did (and nothing was printed). sorry for being unclear.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6zNg3-eqW__PJocp5O2LgZZVKYyhks5qqcqogaJpZM4J-TZK
.

@jnothman jnothman changed the title from [MRG] remove unused condition in decision tree construction to [MRG+1] remove unused condition in decision tree construction Sep 15, 2016

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Sep 15, 2016

Member

would be great to have @glouppe of @jmschrei look at this.

Member

amueller commented Sep 15, 2016

would be great to have @glouppe of @jmschrei look at this.

@glouppe

This comment has been minimized.

Show comment
Hide comment
@glouppe

glouppe Sep 16, 2016

Member

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

Member

glouppe commented Sep 16, 2016

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Sep 16, 2016

Member

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important? Can you create by hand a dataset where you would expect that the weight would cause a difference in the splits and confirm?

Member

jmschrei commented Sep 16, 2016

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important? Can you create by hand a dataset where you would expect that the weight would cause a difference in the splits and confirm?

@nelson-liu

This comment has been minimized.

Show comment
Hide comment
@nelson-liu

nelson-liu Sep 16, 2016

Contributor

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important?

yes, that's very possible. Are said "cases where this might be important" just when you have a dataset with sample_weight such that sample_weight.sum() < min_weight_leaf? or is there something else that i am missing?

Contributor

nelson-liu commented Sep 16, 2016

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important?

yes, that's very possible. Are said "cases where this might be important" just when you have a dataset with sample_weight such that sample_weight.sum() < min_weight_leaf? or is there something else that i am missing?

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Sep 16, 2016

Member

Yeah, but sample_weight for the samples in the current node, not the entire dataset. I haven't read through the test cases recently, but do we have a test where we should make a split on weighted data (not unweighted data), but don't because of min_weight_leaf?

Member

jmschrei commented Sep 16, 2016

Yeah, but sample_weight for the samples in the current node, not the entire dataset. I haven't read through the test cases recently, but do we have a test where we should make a split on weighted data (not unweighted data), but don't because of min_weight_leaf?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 17, 2016

Member

What does it mean logically to say that something should be a leaf if its weight is so small that it violates the min_weight_leaf condition?

Member

jnothman commented Sep 17, 2016

What does it mean logically to say that something should be a leaf if its weight is so small that it violates the min_weight_leaf condition?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 21, 2016

Member

@glouppe:

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

This shouldn't be possible through the public interface, which sets min_weight_fraction_leaf, not min_weight_leaf, and requires the former to be at most .5 * sample_weight.sum().

(If this isn't a valid change to make, then surely changing < to <= is required in these conditions.)

Member

jnothman commented Sep 21, 2016

@glouppe:

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

This shouldn't be possible through the public interface, which sets min_weight_fraction_leaf, not min_weight_leaf, and requires the former to be at most .5 * sample_weight.sum().

(If this isn't a valid change to make, then surely changing < to <= is required in these conditions.)

@raghavrv

This comment has been minimized.

Show comment
Hide comment
@raghavrv

raghavrv Sep 21, 2016

Member

IIRC Nothing would break even if you remove the whole set of conditions as the splitter will check it again.

The reason why this check is given here is to check and avoid entering the splitter at all if we know it is a leaf for sure. And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf...

just verified; in the original code, never is weighted_n_node_samples < min_weight_leaf True after the is_leaf conditional changed here.

Where does "here" refer to? How did you verify?

The flow, IIUC, is as follows

Assume min_weight_leaf=0.1 and min_samples_leaf=1
Assume a particular node's sample weights to be [0.05, 0.05, 0.05, 0.05, 0.05]
The weighted_n_node_samples for that node is now 0.25

  • Let that node be the one that is currently popped from stack.
  • Since one can split the data into 2 parts having weights at right [0.05, 0.05, 0.05] and left [0.05, 0.05] , it is allowed to enter the splitter.
  • Assume the splitter splits it that way. Splitter tells us it is not a leaf and gives us the 2 partitions.
  • The right/left partitions are stacked.
  • The left is popped. Now since the cumulative weights for the left partition is 0.1, there is no way we can split the data without violating the min_weight_leaf=0.1.
  • Hence the condition that you just removed (if corrected as mentioned above) marks it as leaf before it enters the splitter and saves us some computational time.
Member

raghavrv commented Sep 21, 2016

IIRC Nothing would break even if you remove the whole set of conditions as the splitter will check it again.

The reason why this check is given here is to check and avoid entering the splitter at all if we know it is a leaf for sure. And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf...

just verified; in the original code, never is weighted_n_node_samples < min_weight_leaf True after the is_leaf conditional changed here.

Where does "here" refer to? How did you verify?

The flow, IIUC, is as follows

Assume min_weight_leaf=0.1 and min_samples_leaf=1
Assume a particular node's sample weights to be [0.05, 0.05, 0.05, 0.05, 0.05]
The weighted_n_node_samples for that node is now 0.25

  • Let that node be the one that is currently popped from stack.
  • Since one can split the data into 2 parts having weights at right [0.05, 0.05, 0.05] and left [0.05, 0.05] , it is allowed to enter the splitter.
  • Assume the splitter splits it that way. Splitter tells us it is not a leaf and gives us the 2 partitions.
  • The right/left partitions are stacked.
  • The left is popped. Now since the cumulative weights for the left partition is 0.1, there is no way we can split the data without violating the min_weight_leaf=0.1.
  • Hence the condition that you just removed (if corrected as mentioned above) marks it as leaf before it enters the splitter and saves us some computational time.
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 21, 2016

Member

And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf

I had originally thought that condition inappropriate (@nelson-liu had suggested it), but now I think you might be right. If that condition is true, there is no way to split the points at that node such that both splits satisfy min_weight_leaf, is there?

Yes, I now think this is appropriate as a fast path. However, this will change the fitted trees, so we should note that in what's new.

Thanks.

Member

jnothman commented Sep 21, 2016

And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf

I had originally thought that condition inappropriate (@nelson-liu had suggested it), but now I think you might be right. If that condition is true, there is no way to split the points at that node such that both splits satisfy min_weight_leaf, is there?

Yes, I now think this is appropriate as a fast path. However, this will change the fitted trees, so we should note that in what's new.

Thanks.

@jnothman jnothman changed the title from [MRG+1] remove unused condition in decision tree construction to [MRG] remove unused condition in decision tree construction Sep 22, 2016

@raghavrv

This comment has been minimized.

Show comment
Hide comment
@raghavrv

raghavrv Sep 22, 2016

Member

No it won't change the fitted trees, as before the splitter used to mark it as leaf. Now after the corrected condition, it will not enter the splitter at all. And yes with the corrected condition in place there is no way to split it such that min_weight_leaf is not violated...

Member

raghavrv commented Sep 22, 2016

No it won't change the fitted trees, as before the splitter used to mark it as leaf. Now after the corrected condition, it will not enter the splitter at all. And yes with the corrected condition in place there is no way to split it such that min_weight_leaf is not violated...

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 22, 2016

Member

Yes, that will change the trees due to random_state.

On 22 September 2016 at 10:07, Raghav RV notifications@github.com wrote:

No it won't change the fitted trees, as before the splitter used to mark
it as leaf. Now after the corrected condition, it will not enter the
splitter at all. And yes with the corrected condition in place there is no
way to split it such that min_weight_leaf is not violated...


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz60j4jCAoADkX6G-eomh3qWVusDwuks5qscavgaJpZM4J-TZK
.

Member

jnothman commented Sep 22, 2016

Yes, that will change the trees due to random_state.

On 22 September 2016 at 10:07, Raghav RV notifications@github.com wrote:

No it won't change the fitted trees, as before the splitter used to mark
it as leaf. Now after the corrected condition, it will not enter the
splitter at all. And yes with the corrected condition in place there is no
way to split it such that min_weight_leaf is not violated...


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz60j4jCAoADkX6G-eomh3qWVusDwuks5qscavgaJpZM4J-TZK
.

@raghavrv

This comment has been minimized.

Show comment
Hide comment
@raghavrv

raghavrv Sep 22, 2016

Member

Wow. Good catch! I totally missed that. Indeed the tree will change!

Member

raghavrv commented Sep 22, 2016

Wow. Good catch! I totally missed that. Indeed the tree will change!

@nelson-liu

This comment has been minimized.

Show comment
Hide comment
@nelson-liu

nelson-liu Sep 28, 2016

Contributor

following up with this; should we change the condition to weighted_n_node_samples < 2 * min_weight_leaf as suggested by @raghavrv ?

Contributor

nelson-liu commented Sep 28, 2016

following up with this; should we change the condition to weighted_n_node_samples < 2 * min_weight_leaf as suggested by @raghavrv ?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 28, 2016

Member

I think so. (Sorry for changing my mind!)

On 29 September 2016 at 00:14, Nelson Liu notifications@github.com wrote:

following up with this; should we change the condition to weighted_n_node_samples
< 2 * min_weight_leaf as suggested by @raghavrv
https://github.com/raghavrv ?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6y804A24W9B4d6ZJfPk_r9SRKhNcks5qunZbgaJpZM4J-TZK
.

Member

jnothman commented Sep 28, 2016

I think so. (Sorry for changing my mind!)

On 29 September 2016 at 00:14, Nelson Liu notifications@github.com wrote:

following up with this; should we change the condition to weighted_n_node_samples
< 2 * min_weight_leaf as suggested by @raghavrv
https://github.com/raghavrv ?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7441 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6y804A24W9B4d6ZJfPk_r9SRKhNcks5qunZbgaJpZM4J-TZK
.

@nelson-liu

This comment has been minimized.

Show comment
Hide comment
@nelson-liu

nelson-liu Sep 28, 2016

Contributor

no need to be sorry, I'm glad we were able to come to a better solution. pushed the new change.

Contributor

nelson-liu commented Sep 28, 2016

no need to be sorry, I'm glad we were able to come to a better solution. pushed the new change.

sklearn/tree/_tree.pyx
@@ -218,8 +218,8 @@ cdef class DepthFirstTreeBuilder(TreeBuilder):
is_leaf = ((depth >= max_depth) or

This comment has been minimized.

@jnothman

jnothman Sep 28, 2016

Member

Do you mind removing all the extraneous parentheses too? :/

@jnothman

jnothman Sep 28, 2016

Member

Do you mind removing all the extraneous parentheses too? :/

This comment has been minimized.

@nelson-liu

nelson-liu Sep 28, 2016

Contributor

yeah, sorry about that.

@nelson-liu

nelson-liu Sep 28, 2016

Contributor

yeah, sorry about that.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 28, 2016

Member

please change PR title

Member

jnothman commented Sep 28, 2016

please change PR title

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 28, 2016

Member

actually, I will, seeing as I'm adding a +1. LGTM

Member

jnothman commented Sep 28, 2016

actually, I will, seeing as I'm adding a +1. LGTM

@jnothman jnothman changed the title from [MRG] remove unused condition in decision tree construction to [MRG+1] remove unused condition in decision tree construction Sep 28, 2016

@jnothman jnothman changed the title from [MRG+1] remove unused condition in decision tree construction to [MRG+1] correct condition in decision tree construction Sep 28, 2016

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Sep 28, 2016

Member

What I'd like to see is a what's new entries that says it's more efficient but trees using the parameter will change.

Member

jnothman commented Sep 28, 2016

What I'd like to see is a what's new entries that says it's more efficient but trees using the parameter will change.

nelson-liu added some commits Sep 15, 2016

@nelson-liu

This comment has been minimized.

Show comment
Hide comment
@nelson-liu

nelson-liu Sep 28, 2016

Contributor

What I'd like to see is a what's new entries that says it's more efficient but trees using the parameter will change.

@jnothman thanks for changing the title for me. I added a what's new entry and removed the extra parens.

Contributor

nelson-liu commented Sep 28, 2016

What I'd like to see is a what's new entries that says it's more efficient but trees using the parameter will change.

@jnothman thanks for changing the title for me. I added a what's new entry and removed the extra parens.

@raghavrv

This comment has been minimized.

Show comment
Hide comment
@raghavrv

raghavrv Sep 29, 2016

Member

Thanks for the change. This looks good to me. I think it would be better if @glouppe gave his final +1 and merge :)

Member

raghavrv commented Sep 29, 2016

Thanks for the change. This looks good to me. I think it would be better if @glouppe gave his final +1 and merge :)

@glouppe

This comment has been minimized.

Show comment
Hide comment
@glouppe

glouppe Sep 29, 2016

Member

LGTM too! Merging

Member

glouppe commented Sep 29, 2016

LGTM too! Merging

@glouppe glouppe merged commit 2f4b661 into scikit-learn:master Sep 29, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
+ - Edited criterion for leaf nodes in decision tree criterion by declaring a
+ node as a leaf if the weighted number of samples at the node is less than
+ 2 * the minimum weight specified to be at a node. This makes growth more
+ efficient, but trees using parameters that modify the weight at each leaf

This comment has been minimized.

@jnothman

jnothman Sep 30, 2016

Member

I think this is a confusing way of stating it. Will fix it up in master.

@jnothman

jnothman Sep 30, 2016

Member

I think this is a confusing way of stating it. Will fix it up in master.

This comment has been minimized.

@nelson-liu

nelson-liu Sep 30, 2016

Contributor

thanks @jnothman

This comment has been minimized.

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+1] correct condition in decision tree construction (#7441)
* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

[MRG+1] correct condition in decision tree construction (#7441)
* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] correct condition in decision tree construction (#7441)
* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] correct condition in decision tree construction (#7441)
* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment