
[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided #7301

Merged
11 commits merged into scikit-learn:master on Sep 28, 2016

6 participants
@nelson-liu
Contributor

nelson-liu commented Aug 31, 2016

Reference Issue

Fixes #6945, previous PR at #6947

What does this implement/fix? Explain your changes.

min_weight_fraction_leaf should work even when sample_weight is not given (in which case samples are assumed to have equal weight). I also tweaked the description of min_weight_fraction_leaf to make its purpose a bit clearer.
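
To make the intended behavior concrete, here is a minimal usage sketch (not part of the PR; the dataset and parameter values are arbitrary): fitting without sample_weight should now respect min_weight_fraction_leaf as if every sample had weight 1.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# No sample_weight is passed, so all samples are treated as having weight 1;
# min_weight_fraction_leaf should still constrain the minimum leaf weight.
clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.1, random_state=0)
clf.fit(iris.data, iris.target)

# the tree should be smaller than with the default min_weight_fraction_leaf=0.0
print(clf.tree_.node_count)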

Any other comments?

if sample_weight is None:
    min_weight_leaf = int(ceil(self.min_weight_fraction_leaf *
                               n_samples))
    min_samples_leaf = max(min_samples_leaf, min_weight_leaf)


@nelson-liu

nelson-liu Aug 31, 2016

Contributor

I'm not sure whether it's better to keep this block as it is, or

min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples)))
min_weight_leaf = 0 # (seeing as min_weight_leaf is unnecessary when sample_weight is None)

or to simply omit the min_samples_leaf = max(...) line and just set min_weight_leaf, since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.
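
A minimal sketch of that last option, written as a hypothetical standalone helper rather than the actual fit() code: derive only min_weight_leaf and leave min_samples_leaf untouched, since the Criterion enforces both thresholds independently.

from math import ceil

import numpy as np

def leaf_thresholds(min_weight_fraction_leaf, n_samples,
                    min_samples_leaf=1, sample_weight=None):
    # Hypothetical helper, for illustration only.
    if sample_weight is None:
        # treat every sample as having weight 1
        # (the int(ceil(...)) rounding is dropped later in the thread)
        min_weight_leaf = int(ceil(min_weight_fraction_leaf * n_samples))
    else:
        min_weight_leaf = min_weight_fraction_leaf * np.sum(sample_weight)
    # no max(...) against min_samples_leaf: both values are handed to the
    # Criterion, which applies each of them on its own
    return min_samples_leaf, min_weight_leaf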


@ogrisel

ogrisel Aug 31, 2016

Member

I don't understand what you mean by:

or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.

But I think I am fine with the first two suggestions.


@nelson-liu

nelson-liu Aug 31, 2016

Contributor

ok, i'll just leave it as it is.


@amueller

amueller Aug 31, 2016

Member

The max seems odd to me. Why not just remove it?


@nelson-liu

nelson-liu Aug 31, 2016

Contributor

Yeah, that's what I was trying to propose in the last sentence of my original comment (it was probably unclear, sorry). I haven't tested it, but it should work, and it seems a bit more logical too.


@amueller

amueller Aug 31, 2016

Member

I think I'm for that.


@nelson-liu

nelson-liu Aug 31, 2016

Contributor

Yeah, it seems like removing the call to max works as well. The max here is implicit, because it's implemented in Criterion anyway.

# test case with no weights passed in
total_weight = X.shape[0]
for max_leaf_nodes, frac in product((None, 1000), np.linspace(0, 0.5, 6)):


@ogrisel

ogrisel Aug 31, 2016

Member

What is the duration of this test? If it's more than a couple hundred milliseconds, please reduce to np.linspace(0, 0.5, 3).


@nelson-liu

nelson-liu Aug 31, 2016

Contributor

Runs in .106ms on my machine, a late-2013 base MacBook Pro.


@ogrisel

ogrisel Aug 31, 2016

Member

Ok that's fine then.

@ogrisel


Member

ogrisel commented Aug 31, 2016

LGTM.

@ogrisel ogrisel changed the title from [MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided to [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided Aug 31, 2016

@ogrisel ogrisel added this to the 0.18 milestone Aug 31, 2016

@jnothman


Member

jnothman commented Aug 31, 2016

You don't currently test interaction between min_samples_leaf and this

@jnothman jnothman modified the milestone: 0.18 Aug 31, 2016

@nelson-liu


Contributor

nelson-liu commented Aug 31, 2016

You don't currently test interaction between min_samples_leaf and this

Good idea, their interaction is simply that the max is the bound, right?

@amueller


Member

amueller commented Aug 31, 2016

maybe test that with no sample weights, the two parameters have the same effect?
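
A rough sketch of the kind of test being suggested (the dataset, the fraction, and the structural comparison are illustrative assumptions, not code from the PR; as discussed below, the two parameters turn out not to coincide in general):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
frac = 0.1  # 0.1 * 150 samples = 15, an integer

# no sample_weight: every sample implicitly has weight 1
by_weight = DecisionTreeClassifier(
    min_weight_fraction_leaf=frac, random_state=0).fit(iris.data, iris.target)
by_count = DecisionTreeClassifier(
    min_samples_leaf=frac, random_state=0).fit(iris.data, iris.target)

# a coarse structural comparison; a real test might compare the trees node by node
print(by_weight.tree_.node_count, by_count.tree_.node_count)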

@nelson-liu


Contributor

nelson-liu commented Aug 31, 2016

@amueller So I'm actually of the opinion now that they shouldn't have the same effect... I pushed a test that checks whether they are the same; it crashes and burns, but in a justified manner, I think.

First, I changed the min_weight_leaf calculation formula to actually provide the minimum weight as opposed to using the same formula as min_samples_leaf. The parameter deals with weights, so we shouldn't be rounding up and turning it into an int (thus representing the min samples in the leaf). To further reinforce this, the Cython builder classes actually take min_weight_leaf as a double. This also mirrors the implementation in the else clause.
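
A sketch of the formula change being described, again as a hypothetical standalone helper: min_weight_leaf stays a float (a weight), mirroring the existing else clause, rather than being rounded up like min_samples_leaf.

import numpy as np

def min_weight_leaf(min_weight_fraction_leaf, n_samples, sample_weight=None):
    # Hypothetical helper, for illustration only.
    if sample_weight is None:
        # uniform weights of 1.0, so the total weight equals n_samples
        return min_weight_fraction_leaf * n_samples
    # weighted case: fraction of the total sample weight
    return min_weight_fraction_leaf * np.sum(sample_weight)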

So onto why they shouldn't have the same effect:
When calculating whether a node is a leaf in _tree.pyx, the following conditional is used:

is_leaf = ((depth >= max_depth) or
           (n_node_samples < min_samples_split) or
           (n_node_samples < 2 * min_samples_leaf) or
           (weighted_n_node_samples < min_weight_leaf))

Say that we fit on a dataset with 5 samples and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1, the comparisons will be (assuming this is the first split, so n_node_samples = weighted_n_node_samples = 5):

for min_samples_leaf: 5 < 2 * int(ceil(.1 * 5)) = 2 * 1 = 2
for min_weight_leaf:  5 < .1 * 5 = .5

Which is very different. This approach does have the downside that, in this case, setting min_weight_leaf = .1 essentially does nothing, because weighted_n_node_samples will always be greater than the resulting threshold of .5 (it is always at least 1). However, I think that's staying true to the parameter definition.

I think we should take this approach, but I feel like a warning might be good if min_weight_fraction_leaf * sum(sample_weights) < min(sample_weights). What do you think?

edit: there are cases where the two parameters have the same effect, but they do not always.
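
A hedged sketch of the warning floated above, reading the condition as "the derived weight threshold is smaller than the smallest individual weight, so it can never rule out a leaf"; this is an illustration of the idea, not code from the PR.

import warnings

import numpy as np

def warn_if_ineffective(min_weight_fraction_leaf, sample_weight):
    # Hypothetical check, for illustration only.
    sample_weight = np.asarray(sample_weight, dtype=float)
    threshold = min_weight_fraction_leaf * sample_weight.sum()
    if threshold < sample_weight.min():
        warnings.warn("min_weight_fraction_leaf is so small that every "
                      "candidate leaf already satisfies it; the parameter "
                      "has no effect.")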

@nelson-liu


Contributor

nelson-liu commented Sep 1, 2016

There are probably some hand-crafted values that could yield equal values... I suppose those could serve as a useful test? Not sure if it is necessary, though. What do you think?

@nelson-liu


Contributor

nelson-liu commented Sep 4, 2016

Since the two parameters don't give the same effect on uniformly weighted data when frac * total_weight == int(ceil(frac * total_weight)) (and this seems incorrect), I've opened an issue for it at #7338. If we decide to adopt the solutions in that issue to give the two parameters an equivalent interpretation, I'll verify that the test (that the two trees are equal) works. If not, I don't think it's a suitable test...

It does seem reasonable for the two grown trees to be equal under this scenario, though.

@nelson-liu


Contributor

nelson-liu commented Sep 5, 2016

This is ready to be looked at again; I'm removing the +1 because there have been significant changes to the tests.

@nelson-liu nelson-liu changed the title from [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided to [MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 5, 2016

@raghavrv


Member

raghavrv commented Sep 6, 2016

Say that we fit on a dataset with 5 samples, and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1

I am sorry but I don't get what you are trying to say. min_samples_leaf is an int and min_weight_leaf is a float. Why are you giving a float value to min_samples_leaf?

@@ -796,7 +796,8 @@ class RandomForestClassifier(ForestClassifier):
     min_weight_fraction_leaf : float, optional (default=0.)
         The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        leaf node where weights are determined by ``sample_weight`` provided


@raghavrv

raghavrv Sep 6, 2016

Member

I'd write it as:

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.
@nelson-liu


Contributor

nelson-liu commented Sep 6, 2016

I am sorry but I don't get what you are trying to say. min_samples_leaf is an int and min_weight_leaf is a float. Why are you giving a float value to min_samples_leaf?

min_samples_leaf can actually be a float, from the documentation:

If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
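
For example, applying that rule with arbitrary numbers:

from math import ceil

n_samples = 5
min_samples_leaf = 0.1  # interpreted as a percentage of n_samples
print(int(ceil(min_samples_leaf * n_samples)))  # -> 1 sample required per leaf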

@amueller


Member

amueller commented Sep 6, 2016

On 08/31/2016 07:52 PM, Nelson Liu wrote:

Say that we fit on a dataset with 5 samples and provide no sample_weight at fit (thus uniform weighting). If we set min_weight_leaf = .1 and min_samples_leaf = .1, the comparisons will be (assuming this is the first split, so n_node_samples = weighted_n_node_samples = 5):

I was talking about the case where min_weight_leaf * n_samples is an integer. It seems to me that in this case they should have the same behavior.
I'm surprised about the 2* with min_samples_leaf but not with min_weight_leaf. Why is that?

@nelson-liu


Contributor

nelson-liu commented Sep 6, 2016

I was talking about the case where min_weight_leaf * n_samples is an integer. It seems to me that in this case they should have the same behavior.

@jnothman and I discussed this a bit in #7338 (specifically #7338 (comment)); could you check it out?

I'm surprised about the 2* with min_samples_leaf but not with min_weight_leaf. Why is that?

I believe it's because weights can be split, but samples cannot.

@jnothman


Member

jnothman commented Sep 6, 2016

I'm surprised about the 2* with min_samples_leaf but not with min_weight_leaf. Why is that?

The min_weight_leaf condition shouldn't be there. It's plain wrong. The min_samples_leaf condition is only an optimisation, and the same kind of optimisation can't be easily achieved for weight. The real work happens in the splitter.

@jnothman


Member

jnothman commented Sep 7, 2016

I'm tempted to follow the second option at #6945 and either raise an error or a warning if sample_weight is None and min_weight_fraction_leaf is not None.

@jnothman


Member

jnothman commented Sep 7, 2016

But I'm okay with the "assume sample_weight=1" solution too..?

@nelson-liu


Contributor

nelson-liu commented Sep 7, 2016

I'm tempted to follow the second option at #6945 and either raise an error or a warning if sample_weight is None and min_weight_fraction_leaf is not None.

That solution would definitely be far easier to implement, but I think that assuming sample_weight=1 is more intuitive. However, it's equally counterintuitive that the results for min_weight_fraction_leaf do not match up with those for min_samples_leaf in this case. Ideally, the best thing to do would be to fix what's going on in the splitter/tree to make the two grown trees the same, but doing that seems a bit out of scope for this PR. I'm fine with leaving it as it is or raising a warning/error; what do others think?

@jnothman


Member

jnothman commented Sep 8, 2016

No, the difference is not a bug and should not be fixed.


@jnothman


Member

jnothman commented Sep 8, 2016

I'm happy to accept what you implemented too.


@nelson-liu


Contributor

nelson-liu commented Sep 8, 2016

No, the difference is not a bug and should not be fixed.

ok

I'm happy to accept what you implemented too.

In that vein, I've addressed @raghavrv's comments. Perhaps this would be good for 0.18?

-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node.


@jnothman

jnothman Sep 8, 2016

Member

It is worth noting, "Samples have equal weight when sample_weight is not provided, but min_samples_leaf is more efficient."

@jnothman jnothman changed the title from [MRG] Fix min_weight_fraction_leaf to work when sample_weights are not provided to [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 26, 2016

leaf node.
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided, but


@jnothman

jnothman Sep 26, 2016

Member

I think we should now drop "but min_samples_leaf is more efficient" if we've realised we can use the 2 * min_weight_leaf change.

@jnothman


Member

jnothman commented Sep 26, 2016

This actually has @ogrisel's +1 above as well. So it's got +2 assuming nothing substantial has changed since then. Also LGTM

@jnothman jnothman changed the title from [MRG+1] Fix min_weight_fraction_leaf to work when sample_weights are not provided to [MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided Sep 26, 2016

@jnothman


Member

jnothman commented Sep 26, 2016

Pending that minor change, I should say.

@nelson-liu


Contributor

nelson-liu commented Sep 27, 2016

Sorry for getting back to this so late; I've been a bit busy recently. I pushed the docstring changes that were requested; is there anything else needed?

@jnothman


Member

jnothman commented Sep 27, 2016

A what's new entry is needed. Please put under 0.19, as I think this has
missed the boat for 0.18.


nelson-liu added some commits Aug 31, 2016

@nelson-liu


Contributor

nelson-liu commented Sep 28, 2016

@jnothman thanks for the reminder. I put this under "Enhancements"; do you think "Bug Fixes" would be better?

@jnothman


Member

jnothman commented Sep 28, 2016

If you state "previously it was silently ignored", then it belongs in bug fixes :)

@nelson-liu


Contributor

nelson-liu commented Sep 28, 2016

@jnothman Good point, that info is important. Added and moved to bug fixes.

@jnothman jnothman merged commit da118d0 into scikit-learn:master Sep 28, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@jnothman


Member

jnothman commented Sep 28, 2016

thanks @nelson-liu

@amueller, I've currently assumed this is not for 0.18. Feel free to backport and move the what's new if you disagree.

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided (scikit-learn#7301)

* fix min_weight_fraction_leaf when sample_weights is None

* fix flake8 error

* remove added newline and unnecessary assignment

* remove max bc it's implemented in cython and add interaction test

* edit weight calculation formula and add test to check equality

* remove test that sees if two parameter build the same tree

* reword min_weight_fraction_leaf docstring

* clarify uniform weight in forest docstrings

* update docstrings for all classes

* add what's new entry

* move whatsnew entry to bug fixes and explain previous behavior

amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 14, 2016

[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided (scikit-learn#7301)

# Conflicts:
#	doc/whats_new.rst

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided (scikit-learn#7301)

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided (scikit-learn#7301)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided (scikit-learn#7301)