
[MRG+1] correct condition in decision tree construction #7441

Merged
merged 4 commits into scikit-learn:master from nelson-liu:remove_wrong_tree_condition on Sep 29, 2016

Conversation

@nelson-liu (Contributor) commented Sep 15, 2016

Reference Issue

Continuation of #7340

What does this implement/fix? Explain your changes.

Removes an unused leaf condition in _tree.pyx.

Any other comments?

ping @jnothman, who I previously discussed this with



@nelson-liu changed the title from "remove unused condition in decision tree construction" to "[MRG] remove unused condition in decision tree construction" on Sep 15, 2016
@jnothman (Member) commented Sep 15, 2016

While I think these conditions are logically inappropriate, just in case I'm wrong, have you confirmed that they are never executed at least by the test suite?

@nelson-liu (Contributor, Author) commented Sep 15, 2016

just verified; in the original code, weighted_n_node_samples < min_weight_leaf is never True after the is_leaf conditional changed here.

@jnothman (Member) commented Sep 15, 2016

I mean with something like printing a message if the condition is ever true here.


@nelson-liu (Contributor, Author) commented Sep 15, 2016

yes, that's what I did (and nothing was printed). Sorry for being unclear.

@jnothman (Member) commented Sep 15, 2016

Great, thanks. LGTM


@jnothman changed the title from "[MRG] remove unused condition in decision tree construction" to "[MRG+1] remove unused condition in decision tree construction" on Sep 15, 2016
@amueller (Member) commented Sep 15, 2016

would be great to have @glouppe or @jmschrei look at this.

@glouppe (Member) commented Sep 16, 2016

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

@jmschrei (Member) commented Sep 16, 2016

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important? Can you create by hand a dataset where you would expect that the weight would cause a difference in the splits and confirm?

@nelson-liu (Contributor, Author) commented Sep 16, 2016

Is it possible the test suite just isn't comprehensive enough to test cases where this might be important?

Yes, that's very possible. Are said "cases where this might be important" just when you have a dataset with sample_weight such that sample_weight.sum() < min_weight_leaf, or is there something else that I am missing?

@jmschrei (Member) commented Sep 16, 2016

Yeah, but sample_weight for the samples in the current node, not the entire dataset. I haven't read through the test cases recently, but do we have a test where we should make a split on weighted data (not unweighted data), but don't because of min_weight_leaf?
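
To make that concrete, here is a hypothetical example (not taken from the scikit-learn test suite; the data, weights, and parameter values are illustrative) of a weighted dataset where min_weight_fraction_leaf should block a split that the unweighted fit makes:

```python
# Hypothetical sketch: the unweighted tree splits, but the sample weights make
# every candidate split violate the minimum leaf weight, so the weighted tree
# should stay a single node.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
# Three samples carry almost no weight, so any split leaves one side with far
# less than min_weight_fraction_leaf of the total weight (0.4 * 1.03 = 0.412).
sample_weight = np.array([1.0, 0.01, 0.01, 0.01])

unweighted = DecisionTreeClassifier(min_weight_fraction_leaf=0.4,
                                    random_state=0).fit(X, y)
weighted = DecisionTreeClassifier(min_weight_fraction_leaf=0.4,
                                  random_state=0).fit(X, y, sample_weight=sample_weight)

print(unweighted.tree_.node_count)  # 3: the split at x <= 1.5 keeps weight 2.0 on each side
print(weighted.tree_.node_count)    # 1: no split gives both children >= 0.412 weight
```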

@jnothman (Member) commented Sep 17, 2016

What does it mean logically to say that something should be a leaf if its weight is so small that it violates the min_weight_leaf condition?

@jnothman (Member) commented Sep 21, 2016

@glouppe:

Hmmm can you explain why we should remove it? What if you make a test with sample_weight such that sample_weight.sum() < min_weight_leaf from the beginning?

This shouldn't be possible through the public interface, which sets min_weight_fraction_leaf, not min_weight_leaf, and requires the former to be at most 0.5, so the derived min_weight_leaf is at most 0.5 * sample_weight.sum().

(If this isn't a valid change to make, then surely changing < to <= is required in these conditions.)
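
For readers following along, a rough sketch of the fit-time conversion being described; derive_min_weight_leaf is an illustrative helper name, not the actual scikit-learn code path:

```python
# Rough sketch: min_weight_leaf is derived from min_weight_fraction_leaf at fit
# time, and the fraction is validated to lie in [0, 0.5], so min_weight_leaf
# can never exceed 0.5 * sample_weight.sum().
import numpy as np

def derive_min_weight_leaf(min_weight_fraction_leaf, sample_weight, n_samples):
    # Illustrative helper, not the real scikit-learn code path.
    if not 0.0 <= min_weight_fraction_leaf <= 0.5:
        raise ValueError("min_weight_fraction_leaf must be in [0, 0.5]")
    if sample_weight is None:
        total_weight = float(n_samples)  # every sample implicitly has weight 1
    else:
        total_weight = float(np.sum(sample_weight))
    return min_weight_fraction_leaf * total_weight

print(derive_min_weight_leaf(0.1, None, 100))                     # 10.0
print(derive_min_weight_leaf(0.5, np.array([0.2, 0.3, 0.5]), 3))  # 0.5
```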

@raghavrv (Member) commented Sep 21, 2016

IIRC, nothing would break even if you remove the whole set of conditions, as the splitter will check it again.

The reason this check is here is to avoid entering the splitter at all when we already know the node is a leaf. And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf...

just verified; in the original code, weighted_n_node_samples < min_weight_leaf is never True after the is_leaf conditional changed here.

Where does "here" refer to? How did you verify?

The flow, IIUC, is as follows:

Assume min_weight_leaf=0.1 and min_samples_leaf=1.
Assume a particular node's sample weights to be [0.05, 0.05, 0.05, 0.05, 0.05].
The weighted_n_node_samples for that node is now 0.25.

  • Let that node be the one that is currently popped from the stack.
  • Since one can split the data into 2 parts with weights [0.05, 0.05, 0.05] on the right and [0.05, 0.05] on the left, it is allowed to enter the splitter.
  • Assume the splitter splits it that way. The splitter tells us it is not a leaf and gives us the 2 partitions.
  • The right/left partitions are stacked.
  • The left is popped. Now, since the cumulative weight of the left partition is 0.1, there is no way we can split it further without violating min_weight_leaf=0.1.
  • Hence the condition that you just removed (if corrected as mentioned above) marks it as a leaf before it enters the splitter and saves us some computational time.
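
A small plain-Python sketch of that walkthrough; can_enter_splitter is an illustrative name for the proposed fast-path check, not code from _tree.pyx:

```python
# Plain-Python sketch of the walkthrough above; can_enter_splitter is an
# illustrative name for the proposed fast path, not the Cython builder.
min_weight_leaf = 0.1

def can_enter_splitter(weighted_n_node_samples, min_weight_leaf):
    # Only worth entering the splitter if the node carries enough weight that
    # both children could still satisfy min_weight_leaf.
    return weighted_n_node_samples >= 2 * min_weight_leaf

# Parent node with weights [0.05] * 5 -> weighted_n_node_samples = 0.25.
print(can_enter_splitter(0.25, min_weight_leaf))  # True: 0.10 / 0.15 is a legal split
# Left child with weights [0.05, 0.05] -> weighted_n_node_samples = 0.10.
print(can_enter_splitter(0.10, min_weight_leaf))  # False: any split drops one side below 0.10
```
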
@jnothman (Member) commented Sep 21, 2016

And I think the condition should instead be weighted_n_node_samples < 2 * min_weight_leaf

I had originally thought that condition inappropriate (@nelson-liu had suggested it), but now I think you might be right. If that condition is true, there is no way to split the points at that node such that both splits satisfy min_weight_leaf, is there?

Yes, I now think this is appropriate as a fast path. However, this will change the fitted trees, so we should note that in what's new.

Thanks.

@jnothman changed the title from "[MRG+1] remove unused condition in decision tree construction" to "[MRG] remove unused condition in decision tree construction" on Sep 22, 2016
@raghavrv (Member) commented Sep 22, 2016

No, it won't change the fitted trees: before, the splitter used to mark it as a leaf; now, with the corrected condition, it will not enter the splitter at all. And yes, with the corrected condition in place, there is no way to split it such that min_weight_leaf is not violated...

@jnothman (Member) commented Sep 22, 2016

Yes, that will change the trees due to random_state.


@raghavrv (Member) commented Sep 22, 2016

Wow. Good catch! I totally missed that. Indeed the tree will change!

@nelson-liu (Contributor, Author) commented Sep 28, 2016

Following up on this: should we change the condition to weighted_n_node_samples < 2 * min_weight_leaf as suggested by @raghavrv?

@jnothman (Member) commented Sep 28, 2016

I think so. (Sorry for changing my mind!)


@nelson-liu (Contributor, Author) commented Sep 28, 2016

No need to be sorry; I'm glad we were able to come to a better solution. Pushed the new change.

@@ -218,8 +218,8 @@ cdef class DepthFirstTreeBuilder(TreeBuilder):

is_leaf = ((depth >= max_depth) or

@jnothman (Member) commented on the diff, Sep 28, 2016

Do you mind removing all the extraneous parentheses too? :/

@nelson-liu (Author, Contributor) commented on the diff, Sep 28, 2016

yeah, sorry about that.
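
For readers without the diff open, an approximate plain-Python rendering of what the corrected leaf test boils down to; the names mirror the _tree.pyx locals, but this is a sketch, not the exact merged Cython:

```python
# Approximate plain-Python rendering of the corrected leaf test in the
# depth-first builder; a sketch rather than the merged Cython.
def node_is_leaf(depth, n_node_samples, weighted_n_node_samples,
                 max_depth, min_samples_split, min_samples_leaf, min_weight_leaf):
    return (depth >= max_depth or
            n_node_samples < min_samples_split or
            n_node_samples < 2 * min_samples_leaf or
            weighted_n_node_samples < 2 * min_weight_leaf)

# Matching the walkthrough earlier in the thread: a node of weight 0.10 with
# min_weight_leaf=0.1 is marked a leaf before ever entering the splitter.
print(node_is_leaf(depth=1, n_node_samples=2, weighted_n_node_samples=0.10,
                   max_depth=10, min_samples_split=2, min_samples_leaf=1,
                   min_weight_leaf=0.1))  # True
```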

@jnothman (Member) commented Sep 28, 2016

please change PR title

@jnothman (Member) commented Sep 28, 2016

actually, I will, seeing as I'm adding a +1. LGTM

@jnothman changed the title from "[MRG] remove unused condition in decision tree construction" to "[MRG+1] remove unused condition in decision tree construction" on Sep 28, 2016
@jnothman changed the title from "[MRG+1] remove unused condition in decision tree construction" to "[MRG+1] correct condition in decision tree construction" on Sep 28, 2016
@jnothman (Member) commented Sep 28, 2016

What I'd like to see is a what's new entry that says it's more efficient, but trees using the parameter will change.

nelson-liu added 4 commits Sep 15, 2016
@nelson-liu force-pushed the nelson-liu:remove_wrong_tree_condition branch from 640d413 to 35ed851 on Sep 28, 2016
@nelson-liu (Contributor, Author) commented Sep 28, 2016

What I'd like to see is a what's new entry that says it's more efficient, but trees using the parameter will change.

@jnothman thanks for changing the title for me. I added a what's new entry and removed the extra parens.

@raghavrv (Member) commented Sep 29, 2016

Thanks for the change. This looks good to me. I think it would be better if @glouppe gave his final +1 and merged :)

@glouppe (Member) commented Sep 29, 2016

LGTM too! Merging

@glouppe merged commit 2f4b661 into scikit-learn:master on Sep 29, 2016
3 checks passed
ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
- Edited criterion for leaf nodes in decision tree criterion by declaring a
node as a leaf if the weighted number of samples at the node is less than
2 * the minimum weight specified to be at a node. This makes growth more
efficient, but trees using parameters that modify the weight at each leaf

@jnothman (Member) commented on the diff, Sep 30, 2016

I think this is a confusing way of stating it. Will fix it up in master.

@nelson-liu (Author, Contributor) commented on the diff, Sep 30, 2016

thanks @jnothman


TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…#7441)

* remove unused condition in decision tree construction

* edit is_leaf condition for min_weight_leaf

* edit ordering of statements

* remove extra parens and add whats new