[MRG+2] FIX adaboost estimators not randomising correctly #7411

Merged
merged 5 commits on Sep 23, 2016

Conversation

@jnothman
Member

jnothman commented Sep 13, 2016

Fixes #7408, where adaboost estimators were given the same random state initialisation, resulting in poor performance.

At the same time, I realised the need for setting nested random_states. I think this should be a public utility. This PR also ensures nested random_state is set in ensembles and in common tests.

Caveats:

  • AdaBoost results will differ from previously, by necessity
  • Bagging results will differ from previously, unnecessarily, but reducing code duplication and fixing an issue of nested estimators. Objections welcome.
  • testing random_states (via set_random_state) will differ from previously
  • sklearn.utils.randomize_estimator is a bad name for the new utility to set nested random_state. Suggestions welcome!
  • sklearn.utils.testing.set_random_state has a better name, but is subtly and valuably different from the new function:
    • it defaults to random_state=0 rather than system-wide random state
    • it ignores warnings
  • CV splitters should support {get,set}_params() if they are to be affected by the new utility. (If and when they do get this support, it could change the states set for other parts of the estimator by the new utility, due to parameter iteration order.) PR welcome, IMO
  • other things with random_state but without {get,set}_params are not affected by the new utility; this is noted in its docstring.
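The proposed utility's core logic can be sketched with the stdlib alone. The `Toy` class and the `set_nested_random_state` name below are hypothetical stand-ins for sklearn's `get_params`/`set_params` machinery, not the actual PR code:

```python
import random

class Toy:
    """Hypothetical estimator with sklearn-style get_params/set_params."""
    def __init__(self, random_state=None, sub=None):
        self.random_state = random_state
        self.sub = sub

    def get_params(self, deep=True):
        params = {'random_state': self.random_state, 'sub': self.sub}
        if deep and self.sub is not None:
            for k, v in self.sub.get_params(deep=True).items():
                params['sub__' + k] = v
        return params

    def set_params(self, **params):
        for key, value in params.items():
            if '__' in key:
                name, _, sub_key = key.partition('__')
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, key, value)
        return self

def set_nested_random_state(estimator, random_state=None):
    """Seed every random_state param, nested ones included, with a
    distinct integer drawn from a single source of randomness."""
    rng = random.Random(random_state)
    to_set = {}
    for key in sorted(estimator.get_params(deep=True)):
        if key == 'random_state' or key.endswith('__random_state'):
            to_set[key] = rng.randrange(2**31)
    estimator.set_params(**to_set)

est = Toy(sub=Toy())
set_nested_random_state(est, random_state=0)
print(est.random_state, est.sub.random_state)  # two distinct integers
```

Because the integer seeds come from one seeded generator, re-running with the same `random_state` reproduces the same nested assignment.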
@amueller

Member

amueller commented Sep 13, 2016

Hm, looks good.
I'm a bit confused; I thought something similar recently happened in BaggingEstimator, but I might have dreamt that...

random_state = check_random_state(random_state)
to_set = {}
for key in sorted(estimator.get_params(deep=True)):
    if key == 'random_state' or key.endswith('_random_state'):

@amueller

amueller Sep 13, 2016

Member

why do we want / need this?

@jnothman

jnothman Sep 13, 2016

Member

Pipelines have random states too! This isn't necessarily the only way to do this, but having integer random_states in each component means that each unit can be replicated.

@amueller

amueller Sep 13, 2016

Member

There the application would be trying to make an estimator deterministic. That is not really related to the issue we're trying to fix here, is it?

@jnothman

jnothman Sep 13, 2016

Member

The random ensembles currently initialise each estimator with an integer random state derived from their local random state. However, currently they only use the random_state param. They should be setting every random_state param.
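The seeding pattern described here can be sketched with the stdlib; `make_member_seeds` is a hypothetical name, not an sklearn function. Each ensemble member gets its own integer seed derived from the ensemble's random state, so the ensemble stays reproducible while its members differ:

```python
import random

MAX_SEED = 2**31 - 1

def make_member_seeds(ensemble_random_state, n_estimators):
    """Derive one integer seed per ensemble member from the ensemble's
    own random state."""
    rng = random.Random(ensemble_random_state)
    return [rng.randrange(MAX_SEED) for _ in range(n_estimators)]

seeds = make_member_seeds(0, 5)
print(seeds)                 # reproducible: same list for the same state
print(len(set(seeds)) == 5)  # members differ (with high probability)
```

The bug under discussion amounts to passing the same seed to every member instead of one draw per member.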

@amueller

amueller Sep 13, 2016

Member

hm... true from a reproducibility perspective. The question is a bit how much of this can / should we put on the user.... Or on the Pipeline? The contract could also be "setting random_state on an object makes it deterministic". Then it would be the Pipelines problem. But that might be too much magic? It would be a pretty clear contract, though.

@GaelVaroquaux

GaelVaroquaux via email Sep 13, 2016

Member

This comment has been minimized.

@amueller

amueller Sep 13, 2016

Member

If an object depends on something that we don't control, then that object breaks the contract and it's not our fault ;)

@jnothman

jnothman Sep 13, 2016

Member

Too much magic and highly impractical. It means every meta-estimator or wrapper needs a random_state param and needs to manage it to do just the same as this... Here we manage things like dict iteration order invariance; you're going to need a shared helper function anyway. We already define nested parameter management and the random_state convention. Let's use it.

That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs

@amueller

amueller Sep 14, 2016

Member

That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs

That already happens.

Hm but shouldn't this only match __random_state if we want to match the sub-estimator random state?

@jnothman

jnothman Sep 14, 2016

Member

I thought there would be no harm in allowing something to have multiple distinct parameters ending _random_state, but also including __random_state. No, it's not tested; yes, I can change it.

@@ -77,6 +77,9 @@ def _make_estimator(self, append=True):
        estimator.set_params(**dict((p, getattr(self, p))
                                    for p in self.estimator_params))
        if random_state is not None:
            randomize_estimator(estimator, random_state)

@amueller

amueller Sep 13, 2016

Member

why not implement the logic here?

@jnothman

jnothman Sep 13, 2016

Member

Can implement the logic here, except that set_random_state for testing should probably make use of this same logic, so why not make it a public util?

@jnothman

jnothman Sep 13, 2016

Member

You're right that we can implement it as sklearn.ensemble.base._set_random_state for now and get it merged without giving it a public name, ignoring the shortcomings of testing.set_random_state. What do you think?

@amueller

amueller Sep 14, 2016

Member

ok, let's make a private helper?

@jnothman

Member

jnothman commented Sep 14, 2016

@amueller and @GaelVaroquaux , let me know if you think I should make this more conservative, i.e. to only affect the ensembles.

@jnothman

Member

jnothman commented Sep 14, 2016

Pushing a more conservative version. Adding a what's new entry in case you want to include the fix in 0.18-final.

@jnothman jnothman added this to the 0.18 milestone Sep 14, 2016

@jnothman jnothman removed the Enhancement label Sep 14, 2016

@ogrisel

Besides the following minor comments, LGTM.

random_state = check_random_state(random_state)
to_set = {}
for key in sorted(estimator.get_params(deep=True)):
    if key == 'random_state' or key.endswith('_random_state'):

@ogrisel

ogrisel Sep 20, 2016

Member

Shouldn't it be a double underscore: key.endswith('__random_state') instead?

@jnothman

jnothman Sep 20, 2016

Member

@amueller (I think) also questioned this. I wanted, but did not test, this to be applicable to the hypothetical case that an estimator had multiple random_states for multiple purposes. Is that silly?

I suppose I'll change it so it doesn't look wrong...

@ogrisel

ogrisel Oct 3, 2016

Member

The multiple random_states case sounds like a YAGNI to me.
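The two matching rules under discussion differ only for a hypothetical parameter like `extra_random_state` on a single estimator; nested parameters always carry the double underscore, so both rules catch them:

```python
# Parameter keys as returned by a deep get_params(); 'extra_random_state'
# is a hypothetical second randomness parameter on one estimator.
keys = ['random_state', 'sub__random_state', 'extra_random_state']

def matches(key, suffix):
    return key == 'random_state' or key.endswith(suffix)

broad = [k for k in keys if matches(k, '_random_state')]
narrow = [k for k in keys if matches(k, '__random_state')]

print(broad)   # ['random_state', 'sub__random_state', 'extra_random_state']
print(narrow)  # ['random_state', 'sub__random_state']
```

The narrower rule the PR settled on touches only the canonical `random_state` and nested `<name>__random_state` keys.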

ensemble._make_estimator(append=False)
assert_equal(3, len(ensemble))
assert_equal(3, len(ensemble.estimators_))
assert_true(isinstance(ensemble[0], Perceptron))
assert_equal(ensemble[0].random_state, None)
assert_true(isinstance(ensemble[1].random_state, int))

@ogrisel

ogrisel Sep 20, 2016

Member

For completeness you could add:

    assert_true(isinstance(ensemble[2].random_state, int))
@@ -113,6 +113,11 @@ def test_iris():
        assert score > 0.9, "Failed with algorithm %s and score = %f" % \
            (alg, score)
        # Check we used multiple estimators
        assert_true(len(clf.estimators_) > 1)

@ogrisel

ogrisel Sep 20, 2016

Member

Better use assert_greater(len(clf.estimators_), 1) to get more informative error messages (when using nosetests).

# Check we used multiple estimators
assert_true(len(clf.estimators_) > 1)
# Check for distinct random states (see issue #7408)
assert_true(len(set(est.random_state for est in clf.estimators_)) > 1)

@ogrisel

ogrisel Sep 20, 2016

Member
assert_greater(len(set(est.random_state for est in clf.estimators_)), 1)
# Check we used multiple estimators
assert_true(len(reg.estimators_) > 1)
# Check for distinct random states (see issue #7408)
assert_true(len(set(est.random_state for est in reg.estimators_)) > 1)

@ogrisel

ogrisel Sep 20, 2016

Member

assert_greater

@jnothman

Member

jnothman commented Sep 20, 2016

Changes made, thanks @ogrisel. Am I leaving the what's new in 0.18 or has that train passed?

@jnothman

Member

jnothman commented Sep 22, 2016

rebasing and marking MRG+1 due to @ogrisel's LGTM.

@jnothman jnothman changed the title from [MRG] FIX adaboost estimators not randomising correctly to [MRG+1] FIX adaboost estimators not randomising correctly Sep 22, 2016

# Check we used multiple estimators
assert_greater(len(clf.estimators_), 1)
# Check for distinct random states (see issue #7408)
assert_greater(len(set(est.random_state

@amueller

amueller Sep 22, 2016

Member

shouldn't they be equal to len(clf.estimators_) (with high probability)?
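The stricter assertion suggested here, sketched against a stand-in ensemble (`Stub` is hypothetical): with integer seeds drawn from a 2**31 space, ten members sharing a state is vanishingly unlikely, so equality is a reasonable check:

```python
import random

class Stub:
    """Stand-in for a fitted ensemble member."""
    def __init__(self, random_state):
        self.random_state = random_state

rng = random.Random(0)
estimators_ = [Stub(rng.randrange(2**31)) for _ in range(10)]

states = {est.random_state for est in estimators_}
assert len(states) > 1                  # the original, weak check
assert len(states) == len(estimators_)  # the stricter check proposed here
```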

# Check we used multiple estimators
assert_true(len(reg.estimators_) > 1)
# Check for distinct random states (see issue #7408)
assert_greater(len(set(est.random_state for est in reg.estimators_)), 1)

@amueller

amueller Sep 22, 2016

Member

same here

@amueller

Member

amueller commented Sep 22, 2016

LGTM apart from minor nitpick in tests.

@NelleV

NelleV approved these changes Sep 22, 2016

LGTM

@NelleV NelleV changed the title from [MRG+1] FIX adaboost estimators not randomising correctly to [MRG+2] FIX adaboost estimators not randomising correctly Sep 22, 2016

@GaelVaroquaux

Member

GaelVaroquaux commented Sep 23, 2016

LGTM.

Waiting for a couple of comments by @amueller in the tests before merging (on the len of unique random_state). But everything is fine by me. +1 for merge

@jnothman

Member

jnothman commented Sep 23, 2016

Tests improved. Thanks for the reviews.

@jnothman

Member

jnothman commented Sep 23, 2016

Merging. @amueller, please backport.

@jnothman jnothman merged commit 32d1236 into scikit-learn:master Sep 23, 2016

0 of 3 checks passed
  • ci/circleci: CircleCI is running your tests
  • continuous-integration/appveyor/pr: Waiting for AppVeyor build to complete
  • continuous-integration/travis-ci/pr: The Travis CI build is in progress
@jnothman

Member

jnothman commented Sep 23, 2016

(Actually, now I'm wondering if I should have merged before CIs were green... I checked that those more specific tests passed locally. Please excuse my unjustified hurry.)

@amueller

Member

amueller commented Sep 23, 2016

would have probably been better to wait, but if master fails we'll know ;)

@amueller

Member

amueller commented Sep 23, 2016

I'll backport everything all at once, when we're ready to release, I think.

amueller added a commit that referenced this pull request Sep 25, 2016

[MRG+2] FIX adaboost estimators not randomising correctly (#7411)
* FIX adaboost estimators not randomising correctly

(fixes #7408)

FIX ensure nested random_state is set in ensembles

* DOC add what's new

* Only affect *__random_state, not *_random_state for now

* TST More informative assertions for ensemble tests

* More specific testing of different random_states

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411)

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016

Merge tag '0.18' into releases
* tag '0.18': (1286 commits)
  [MRG + 1] More versionadded everywhere! (scikit-learn#7403)
  minor doc fixes
  fix lbfgs rename (scikit-learn#7503)
  minor fixes to whatsnew
  fix scoring function table
  fix rebase messup
  DOC more what's new subdivision
  DOC Attempt to impose some order on What's New 0.18
  no fixed width within bold
  REL changes for release in 0.18.X branch (scikit-learn#7414)
  [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325)
  DOC: Added Nested Cross Validation Example (scikit-learn#7111)
  Sync docstring and definition default argument in kneighbors (scikit-learn#7476)
  added contributors for 0.18, minor formatting fixes.
  Fix typo in whats_new.rst
  [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411)
  Addressing issue scikit-learn#7468. (scikit-learn#7472)
  Reorganize README
  clean up deprecation warning stuff in common tests
  [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438)
  ...

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016

Merge branch 'releases' into dfsg
* releases: (1286 commits)

yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016

Merge branch 'dfsg' into debian
* dfsg: (1286 commits)

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

[MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411)

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411)