New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] FIX enforce deterministic behaviour in BaseBagging #9723

Merged
merged 20 commits into from Jul 19, 2018

Conversation

5 participants
@glemaitre
Contributor

glemaitre commented Sep 9, 2017

Reference Issue

closes #9524

What does this implement/fix? Explain your changes.

Enforce deterministic results when random_state is set in a BaggingClassifier.
The sample_indices needed to be masked to emulate the same processing than
with estimator_samples_ property.

Any other comments?

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Sep 9, 2017

Contributor

@jnothman I forgot about this one.

@ogrisel you might want to give a look as well since you saw the issue during the sprint in Euroscipy.

Contributor

glemaitre commented Sep 9, 2017

@jnothman I forgot about this one.

@ogrisel you might want to give a look as well since you saw the issue during the sprint in Euroscipy.

@jnothman

I wonder whether this in practice degrades the quality of the bagging.

Certainly, it needs a mention in what's new under changed models, and for the subtlety of the change to be described: "The base estimator may be presented with the same features in a different order to in previous versions; now the features will always be in input order."

Show outdated Hide outdated sklearn/ensemble/bagging.py Outdated
@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Sep 10, 2017

Contributor

I wonder whether this in practice degrades the quality of the bagging.

Could you elaborate on it. I don't really see what it will degrade?

Contributor

glemaitre commented Sep 10, 2017

I wonder whether this in practice degrades the quality of the bagging.

Could you elaborate on it. I don't really see what it will degrade?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Oct 17, 2017

Member

Hmm... This emailed comment disappeared...:

I just meant that sometimes estimators could be sensitive to feature order. The new version is like randomly sampling then sorting, which is not quite as random. But probably negligible effect.

Member

jnothman commented Oct 17, 2017

Hmm... This emailed comment disappeared...:

I just meant that sometimes estimators could be sensitive to feature order. The new version is like randomly sampling then sorting, which is not quite as random. But probably negligible effect.

@jnothman

LGTM

Show outdated Hide outdated doc/whats_new/v0.20.rst Outdated
@@ -721,6 +722,34 @@ def test_estimators_samples():
assert_array_almost_equal(orig_coefs, new_coefs)
def test_estimators_samples_deterministic():
# This test is a regression test to check that with a random step

This comment has been minimized.

@jnothman

jnothman Oct 17, 2017

Member

Might be worth mentioning the issue #9524

@jnothman

jnothman Oct 17, 2017

Member

Might be worth mentioning the issue #9524

@jnothman jnothman changed the title from [MRG] FIX enforce deterministic behaviour in BaseBagging to [MRG+1] FIX enforce deterministic behaviour in BaseBagging Oct 17, 2017

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Oct 20, 2017

Member

@glemaitre can you fix the conflicts?

Member

lesteve commented Oct 20, 2017

@glemaitre can you fix the conflicts?

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Oct 20, 2017

Contributor

@lesteve done and addressed the @jnothman comments

Contributor

glemaitre commented Oct 20, 2017

@lesteve done and addressed the @jnothman comments

@lesteve

My main worry is that using a mask rather than an array of indices only works if indices does not have repeated values, does it not? This is not necessarily the case if bootstrap=True I think.

Show outdated Hide outdated doc/whats_new/v0.20.rst Outdated
Show outdated Hide outdated sklearn/ensemble/bagging.py Outdated
@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Oct 20, 2017

Contributor

My main worry is that using a mask rather than an array of indices only works if indices does not have repeated values, does it not? This is not necessarily the case if bootstrap=True I think.

True, I did not see it at first. So we need to modify the estimators_samples_ such that it always gives the indices instead of a mask. In this case, it seems that it will affect the OOB:

https://github.com/glemaitre/scikit-learn/blob/ece9e81742f8cd40fa48b5ccae63d36edb3c4e18/sklearn/ensemble/bagging.py#L594

I am not really familiar with this part, but it seems that it should not change anything in the computation itself.

@lesteve am I correct and go forward or I missed something?

Contributor

glemaitre commented Oct 20, 2017

My main worry is that using a mask rather than an array of indices only works if indices does not have repeated values, does it not? This is not necessarily the case if bootstrap=True I think.

True, I did not see it at first. So we need to modify the estimators_samples_ such that it always gives the indices instead of a mask. In this case, it seems that it will affect the OOB:

https://github.com/glemaitre/scikit-learn/blob/ece9e81742f8cd40fa48b5ccae63d36edb3c4e18/sklearn/ensemble/bagging.py#L594

I am not really familiar with this part, but it seems that it should not change anything in the computation itself.

@lesteve am I correct and go forward or I missed something?

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Oct 20, 2017

Member

@lesteve am I correct and go forward or I missed something?

You probably understand the code better than I do ...

Member

lesteve commented Oct 20, 2017

@lesteve am I correct and go forward or I missed something?

You probably understand the code better than I do ...

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Oct 21, 2017

Member

Good catch, @lesteve.

Member

jnothman commented Oct 21, 2017

Good catch, @lesteve.

@jnothman jnothman changed the title from [MRG+1] FIX enforce deterministic behaviour in BaseBagging to [MRG] FIX enforce deterministic behaviour in BaseBagging Oct 21, 2017

@glemaitre glemaitre changed the title from [MRG] FIX enforce deterministic behaviour in BaseBagging to [WIP] FIX enforce deterministic behaviour in BaseBagging Oct 23, 2017

@glemaitre glemaitre changed the title from [WIP] FIX enforce deterministic behaviour in BaseBagging to [MRG] FIX enforce deterministic behaviour in BaseBagging Oct 23, 2017

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Oct 23, 2017

Contributor

ping @jnothman @lesteve

I think this solution is fine. However, I wanted to have you thoughts regarding a potential backward compatibility. I would consider a bug that estimators_samples_ was returning a mask instead of the index.

What are you thoughts on that, shall we raise anything regarding the change of behaviour?

Contributor

glemaitre commented Oct 23, 2017

ping @jnothman @lesteve

I think this solution is fine. However, I wanted to have you thoughts regarding a potential backward compatibility. I would consider a bug that estimators_samples_ was returning a mask instead of the index.

What are you thoughts on that, shall we raise anything regarding the change of behaviour?

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Oct 25, 2017

Contributor

ping @lesteve @jnothman I added a non-regression test for the bagging.
The PR is good to be reviewed.

Contributor

glemaitre commented Oct 25, 2017

ping @lesteve @jnothman I added a non-regression test for the bagging.
The PR is good to be reviewed.

@jnothman

Need an API changes entry in what's new.

There are still Attributes sections of docstrings that are not up-to-date.

Show outdated Hide outdated sklearn/ensemble/tests/test_bagging.py Outdated
Show outdated Hide outdated sklearn/ensemble/tests/test_bagging.py Outdated
@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Oct 31, 2017

Contributor

@jnothman @lesteve fix the last glitches

Contributor

glemaitre commented Oct 31, 2017

@jnothman @lesteve fix the last glitches

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Oct 31, 2017

Member

Ordinarily this would reference the PR and contributor

Member

jnothman commented on doc/whats_new/v0.20.rst in 453a78a Oct 31, 2017

Ordinarily this would reference the PR and contributor

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Oct 31, 2017

Member

I might not have had enough coffee but I'm struggling to see what's going on here.

Member

amueller commented Oct 31, 2017

I might not have had enough coffee but I'm struggling to see what's going on here.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Oct 31, 2017

Member

Ah, so estimator_samples_ used to be a boolean mask, which didn't take into account repeated samples of data points, so it wasn't the same process... got it... really not enough coffee... but can you maybe say that in the whatsnew or somewhere?
(the issue title etc made me think this has something to do with random state, which threw me off)

Member

amueller commented Oct 31, 2017

Ah, so estimator_samples_ used to be a boolean mask, which didn't take into account repeated samples of data points, so it wasn't the same process... got it... really not enough coffee... but can you maybe say that in the whatsnew or somewhere?
(the issue title etc made me think this has something to do with random state, which threw me off)

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Oct 31, 2017

Member

we're breaking backward compatibility of the attribute, right? Are we considering this a bug-fix?

Member

amueller commented Oct 31, 2017

we're breaking backward compatibility of the attribute, right? Are we considering this a bug-fix?

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Oct 31, 2017

Contributor

but can you maybe say that in the whatsnew or somewhere?

It does not cost anything. I think that I could make it clearer in the Change API entry

Contributor

glemaitre commented Oct 31, 2017

but can you maybe say that in the whatsnew or somewhere?

It does not cost anything. I think that I could make it clearer in the Change API entry

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Oct 31, 2017

Contributor

we're breaking backward compatibility of the attribute, right? Are we considering this a bug-fix?

It was also one of my question: #9723 (comment)

IMHO, I would call this a bug-fix since returning indices is what I would expect. Otherwise this is impossible to reproduce the training. In addition, this is still possible to get the samples by converting the indices to a mask.

WDYT @amueller

Contributor

glemaitre commented Oct 31, 2017

we're breaking backward compatibility of the attribute, right? Are we considering this a bug-fix?

It was also one of my question: #9723 (comment)

IMHO, I would call this a bug-fix since returning indices is what I would expect. Otherwise this is impossible to reproduce the training. In addition, this is still possible to get the samples by converting the indices to a mask.

WDYT @amueller

glemaitre added some commits Oct 31, 2017

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre
Contributor

glemaitre commented Nov 28, 2017

@glemaitre glemaitre added this to the 0.20 milestone Jun 8, 2018

@ogrisel ogrisel added this to PRs tagged in scikit-learn 0.20 Jul 16, 2018

@GaelVaroquaux

This comment has been minimized.

Show comment
Hide comment
@GaelVaroquaux

GaelVaroquaux Jul 17, 2018

Member

I fixed the conflicts

Member

GaelVaroquaux commented Jul 17, 2018

I fixed the conflicts

@GaelVaroquaux

LGTM.

+1 for merge if CI is green.

@GaelVaroquaux GaelVaroquaux changed the title from [MRG] FIX enforce deterministic behaviour in BaseBagging to [MRG+1] FIX enforce deterministic behaviour in BaseBagging Jul 17, 2018

@GaelVaroquaux GaelVaroquaux changed the title from [MRG+1] FIX enforce deterministic behaviour in BaseBagging to [MRG+2] FIX enforce deterministic behaviour in BaseBagging Jul 17, 2018

@GaelVaroquaux

This comment has been minimized.

Show comment
Hide comment
@GaelVaroquaux

GaelVaroquaux Jul 18, 2018

Member

@glemaitre : can you merge with master, to fix the CI.

Member

GaelVaroquaux commented Jul 18, 2018

@glemaitre : can you merge with master, to fix the CI.

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Jul 19, 2018

Contributor

@GaelVaroquaux This is green.

Contributor

glemaitre commented Jul 19, 2018

@GaelVaroquaux This is green.

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Jul 19, 2018

Contributor

@rth if you are there, you can merge this one.

Contributor

glemaitre commented Jul 19, 2018

@rth if you are there, you can merge this one.

@GaelVaroquaux

This comment has been minimized.

Show comment
Hide comment
@GaelVaroquaux

GaelVaroquaux Jul 19, 2018

Member

Merging.

Member

GaelVaroquaux commented Jul 19, 2018

Merging.

@GaelVaroquaux GaelVaroquaux merged commit a365714 into scikit-learn:master Jul 19, 2018

7 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 95.3%)
Details
codecov/project 95.31% (+<.01%) compared to f158e2d
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

scikit-learn 0.20 automation moved this from PRs tagged to Done Jul 19, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment