
[MRG] Fix SGD non deterministic behavior #13422

Merged (47 commits) on Apr 4, 2019

Conversation

ClemDoum
Contributor

@ClemDoum ClemDoum commented Mar 8, 2019

What this fixes:

While trying to make my intent classifier training deterministic in my NLU lib, I noticed that I couldn't, because the BaseSGDClassifier._fit_multiclass method is non-deterministic.

To reproduce, initialize the SGDClassifier with a RandomState instance:

from sklearn.linear_model import SGDClassifier
from sklearn.utils import check_random_state


def sgd_test():
    seed = 1
    random_state = check_random_state(seed)

    num_classes = 3
    num_examples = 10
    x = random_state.normal(0.0, 1, size=(num_examples, num_classes))
    y = random_state.randint(num_classes, size=num_examples)

    num_iter = 1000
    ref_coef = None
    for i in range(num_iter):
        print("Iter %s" % i)
        random_state = check_random_state(seed)
        clf = SGDClassifier(
            random_state=random_state, n_jobs=num_classes).fit(x, y)
        if ref_coef is None:
            ref_coef = clf.coef_.tolist()
        assert ref_coef == clf.coef_.tolist()


if __name__ == '__main__':
    sgd_test()

I added a small print statement just before the joblib threads pass their seed to the plain_sgd or average_sgd function. On my machine this gives the following output (the seeds will differ on your machine, but the assertion will eventually fail):

Iter 0
Thread: 0, seed 1791095845
Thread: 1, seed 2135392491
Thread: 2, seed 946286476
Iter 1
Thread: 0, seed 1791095845
Thread: 1, seed 2135392491
Thread: 2, seed 946286476
Iter 2
Thread: 0, seed 1791095845
Thread: 1, seed 2135392491
Thread: 2, seed 946286476
Iter 3
Thread: 0, seed 1791095845
Thread: 1, seed 2135392491
Thread: 2, seed 946286476
Iter 4
Thread: 0, seed 1791095845
Thread: 1, seed 2135392491
Thread: 2, seed 946286476
Iter 5
Thread: 0, seed 1791095845
Thread: 1, seed 2135392491
Thread: 2, seed 946286476
Iter 6
Thread: 0, seed 1791095845
Thread: 2, seed 2135392491
Thread: 1, seed 946286476
Traceback (most recent call last):
  File "test_sgd_seed.py", line 27, in <module>
    sgd_test()
  File "test_sgd_seed.py", line 23, in sgd_test
    assert ref_coef == clf.coef_.tolist()
AssertionError

Process finished with exit code 1

What is happening is that the joblib threads share the BaseSGDClassifier.random_state and set a seed in the fit_binary function before it's passed to the plain_sgd or average_sgd function. Depending on the order in which the threads reach the seed-setting code, the output of the SGD can differ.

The other reason is that the estimator's random state was not passed to the make_dataset function.

I think the bug was not noticeable in the unit tests because the SGD estimators were initialized with int random states. In this case, the input of the check_random_state function inside fit_binary is the integer seed, so each thread actually gets the exact same random state and then samples the exact same random seed for the SGD.
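This masking effect can be seen directly with the public check_random_state utility (a minimal sketch, separate from the PR's actual code paths):

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer seed yields a *fresh* RandomState on every call, so every
# thread that re-seeds from it draws the exact same value.
seeds_from_int = [check_random_state(42).randint(2 ** 31) for _ in range(3)]
assert len(set(seeds_from_int)) == 1

# A shared RandomState instance is returned as-is, so each draw advances
# the shared stream -- and under threads, the draw order is up to the
# scheduler, which is exactly the non-determinism described above.
rs = np.random.RandomState(42)
seeds_from_instance = [check_random_state(rs).randint(2 ** 31) for _ in range(3)]
assert len(set(seeds_from_instance)) == 3
```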

How this fixes the bug:

  • I added an optional seed argument to the fit_binary function, which defaults to None. If the seed is not None it is used; otherwise the seed is drawn from the estimator's random_state. This allows the job seeds to be set before the jobs are dispatched to the threads, which avoids the non-deterministic behavior
  • I passed the estimator's random state to the make_dataset function
  • I had to change a few seeds and doctests here and there to make the unit tests pass again
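The first bullet can be sketched with the standard library alone. Here fit_binary and fit_multiclass are hypothetical stand-ins for the real per-class fit, not the PR's actual code; the point is only the seed-drawing order:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

MAX_INT = np.iinfo(np.int32).max

def fit_binary(class_idx, seed):
    # The seed arrives as an explicit argument, so the result no longer
    # depends on which thread touches a shared RandomState first.
    rng = np.random.RandomState(seed)
    return rng.permutation(5).tolist()

def fit_multiclass(master_seed, n_jobs=3):
    random_state = np.random.RandomState(master_seed)
    # Draw one seed per job up front, in the main thread, in a fixed
    # order -- before any work is dispatched to the pool.
    seeds = random_state.randint(MAX_INT, size=n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(fit_binary, range(n_jobs), seeds))

assert fit_multiclass(0) == fit_multiclass(0)  # deterministic across runs
```

ThreadPoolExecutor.map returns results in submission order, so thread scheduling cannot reorder the output either.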

Help needed

  • Do you have any idea how to properly test this fix?

(Now also fixes #5015.)


While fixing the initial bug, another bug was found and fixed: our_rand_r in the Cython code of sklearn/utils/seq_dataset.pyx was not behaving consistently across platforms; see the full bug description and fix below.
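Cross-platform issues of this kind boil down to integer width: a C-style generator is only portable if every intermediate value is truncated to an explicit fixed width. A hypothetical 32-bit xorshift (not scikit-learn's actual our_rand_r) illustrates the pattern in Python, where the mask plays the role of a C uint32:

```python
UINT32_MASK = 0xFFFFFFFF  # force every intermediate into 32 bits

def xorshift32(state):
    # Without the masks, a platform (or language) with wider integers
    # would carry extra high bits through the shifts and produce a
    # different stream -- the same class of bug as a 32- vs 64-bit long.
    state ^= (state << 13) & UINT32_MASK
    state ^= state >> 17
    state ^= (state << 5) & UINT32_MASK
    return state & UINT32_MASK

# The stream is fully determined by the seed, on every platform.
s = 1
stream = []
for _ in range(4):
    s = xorshift32(s)
    stream.append(s)
```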

@ClemDoum ClemDoum changed the title [MRG] Fix SGD non deterministic behabvior [WIP] Fix SGD non deterministic behabvior Mar 8, 2019
@ClemDoum
Copy link
Contributor Author

ClemDoum commented Mar 8, 2019

Windows unit tests seem to fail, I need to investigate a bit

Member

@jnothman jnothman left a comment


Thanks for this. I agree the nondeterminism is a problem

dataset, intercept_decay = make_dataset(X, y_i, sample_weight)

# XXX should have random_state_!
random_state = random_state if random_state is not None \
Member


Not sure we need to support retrieving the random_state from the estimator. Let's just use check_random_state and keep it simple.

@ClemDoum ClemDoum changed the title [WIP] Fix SGD non deterministic behabvior [WIP] Fix SGD non deterministic behavior Mar 11, 2019
@ClemDoum ClemDoum force-pushed the fix/sgd-random-state branch 3 times, most recently from 4c4d02d to a5fbfe6 Compare March 11, 2019 13:23
@ClemDoum
Contributor Author

@jnothman thanks, I made the change and used check_random_state instead of the estimator random state.
I'm still not sure why the output differs on Windows though.

@jnothman
Member

Is it possible that MAX_INT is not platform-independent?
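It can be: a C long is 32-bit on Windows but 64-bit on most Linux/macOS builds, so a MAX_INT derived from a platform-dependent type differs across OSes. NumPy's fixed-width dtypes make the portable alternative easy to check (illustrative, not the PR's actual definition):

```python
import numpy as np

# Fixed-width dtypes are identical on every platform:
assert np.iinfo(np.int32).max == 2 ** 31 - 1
assert np.iinfo(np.int64).max == 2 ** 63 - 1

# Pinning the constant to an explicit width sidesteps the
# 32- vs 64-bit `long` discrepancy entirely:
MAX_INT = np.iinfo(np.int32).max
print(MAX_INT)  # 2147483647 on every platform
```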

@ClemDoum ClemDoum force-pushed the fix/sgd-random-state branch 2 times, most recently from e9e94e8 to ab03321 Compare March 13, 2019 13:21
@ClemDoum
Contributor Author

OK I think I fixed it

Member

@jnothman jnothman left a comment


Under the assumption that there's no reasonable way to write a regression test, this LGTM.

@agramfort
Member

@ClemDoum you'll need to rebase on current master

@jnothman jnothman added this to the 0.21 milestone Apr 4, 2019
@codecov

codecov bot commented Apr 4, 2019

Codecov Report

Merging #13422 into master will decrease coverage by 0.26%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13422      +/-   ##
==========================================
- Coverage   96.65%   96.39%   -0.27%     
==========================================
  Files         376      377       +1     
  Lines       69711    69869     +158     
==========================================
- Hits        67381    67347      -34     
- Misses       2330     2522     +192
Impacted Files Coverage Δ
sklearn/kernel_approximation.py 100% <ø> (ø) ⬆️
sklearn/linear_model/passive_aggressive.py 100% <ø> (ø) ⬆️
...earn/linear_model/tests/test_passive_aggressive.py 100% <ø> (ø) ⬆️
sklearn/linear_model/tests/test_logistic.py 99.89% <ø> (ø) ⬆️
sklearn/linear_model/perceptron.py 100% <ø> (ø) ⬆️
sklearn/linear_model/tests/test_sgd.py 99.62% <100%> (ø) ⬆️
sklearn/utils/tests/test_random.py 98.9% <100%> (+0.03%) ⬆️
sklearn/linear_model/stochastic_gradient.py 98.58% <100%> (+0.01%) ⬆️
sklearn/tests/test_multioutput.py 100% <100%> (ø) ⬆️
sklearn/utils/tests/test_seq_dataset.py 100% <100%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93e09aa...5bdf6c8. Read the comment docs.

@jnothman jnothman merged commit 01bc8b1 into scikit-learn:master Apr 4, 2019
@jnothman
Member

jnothman commented Apr 4, 2019

Thank you @ClemDoum. Sorry I forgot to merge it when I approved.

@TomDLT
Member

TomDLT commented Apr 4, 2019

@ClemDoum you'll need to rebase on current master

Side note: since GitHub now has a nice squash-and-merge button, we prefer merging master into the branch instead of rebasing; it makes the discussion easier to follow.

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
@lesteve
Member

lesteve commented Jul 29, 2019

Does @ClemDoum or anyone else involved in this discussion remember the reason for pip install sphinx-gallery>=0.2,<0.3 : https://github.com/scikit-learn/scikit-learn/pull/13422/files#diff-fce1b4a04283d583b7c31114e4a4fc24

I was not able to find any reference to this in the discussion ...

@ClemDoum
Contributor Author

@lesteve yes, sorry, I didn't mention it in the PR, but if I remember correctly sphinx-gallery==0.3.0 was broken on PyPI (sphinx-gallery/sphinx-gallery#459). Since the sphinx-gallery version was unpinned, the CI was broken, so I had to pin the version to make it pass.

@lesteve
Member

lesteve commented Jul 30, 2019

Right that makes sense, thanks a lot!

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Nov 14, 2022
This removes the -Woverflow warnings observed
when building scikit-learn.

RAND_R_MAX is the max value for uint8; incrementing
it causes an overflow (hence the warning).

Elements were originally mentioned in
scikit-learn#13422 (comment)
but seem to have been left unreviewed.

I think this commit fixes the implementation, yet
it comes with backwards-incompatible results, and
tests for implementations relying on `our_rand_r`
fail because results are now different.

I see several alternatives to remove the warning
while having the tests pass:

  - preferred solution: adapt the test suite to the
    new results so that all tests pass, and acknowledge
    the change of behavior for impacted user-facing APIs
    in the changelog
  - accept the quirk of this implementation but hardcode
    and rename the effective constant
  - silence the -Woverflow warning by other means

Relates to:
scikit-learn#13422
scikit-learn#24895

Successfully merging this pull request may close these issues.

our_rand_r is duplicated and should not be seeded with 0
5 participants