[Ray SGD] Support num_steps continue training #11142

amogkam · 2020-09-30T21:20:08Z

Currently, when training or validating with the num_steps parameter set, each call resets the data loaders to the beginning. This PR makes sure that subsequent training or validation calls consume the entire data loader before resetting it. It also adds functionality to automatically cycle around the data loader if more steps of training or validation are still needed.
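
To illustrate the intended behavior, here is a minimal sketch in plain Python. It is not the code from this PR: the `ResumableLoader` class and its `take` method are hypothetical names made up for illustration, and a `range` stands in for a real data loader. The idea is to keep one iterator alive between calls and only re-create it once it is exhausted.

```python
class ResumableLoader:
    """Hypothetical helper (not part of Ray SGD): keeps one iterator alive across calls."""

    def __init__(self, loader):
        self.loader = loader      # any re-iterable object, e.g. a data loader
        self._iterator = None     # created lazily and kept between calls

    def take(self, num_steps):
        """Return the next num_steps batches, cycling around when the loader runs out."""
        batches = []
        while len(batches) < num_steps:
            fresh = self._iterator is None
            if fresh:
                self._iterator = iter(self.loader)
            try:
                batches.append(next(self._iterator))
            except StopIteration:
                if fresh:
                    # A brand-new iterator yielded nothing: avoid looping forever.
                    raise ValueError("loader is empty")
                # Loader fully consumed: reset it and continue from the start.
                self._iterator = None
        return batches


loader = ResumableLoader(range(5))   # stand-in for a loader with 5 batches
first = loader.take(3)               # starts at the beginning
second = loader.take(3)              # continues where the first call stopped, then wraps
print(first, second)
```

With this sketch, the second call picks up at batch 3 instead of restarting at batch 0, and wraps around to the start only after batch 4 has been consumed.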

Why are these changes needed?

Related issue number

Closes #8907

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

python/ray/util/sgd/tests/test_torch.py

richardliaw

Looks good, just one comment about the test.

Does this all get reset when there are changes in the worker pool?

amogkam · 2020-10-01T22:54:21Z

Yes, that's a good point. This does get reset whenever the worker pool size changes. I'll raise an issue for this, but I think it requires some more thought on how to handle it.

amogkam · 2020-10-03T06:17:10Z

@richardliaw finally...tests are passing

amogkam added 3 commits September 30, 2020 13:52

done

d31b77b

formatting

c03fec7

set distributed sampler properly

81e5561

amogkam requested a review from richardliaw September 30, 2020 21:20

amogkam assigned richardliaw Sep 30, 2020

richardliaw reviewed Oct 1, 2020

python/ray/util/sgd/tests/test_torch.py (outdated, resolved)

richardliaw approved these changes Oct 1, 2020

richardliaw added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 1, 2020

simplify tests

038bb17

amogkam added 7 commits October 1, 2020 15:55

formatting

094d41b

fix failing image models test

3275d94

Merge branch 'master' of https://github.com/ray-project/ray into num-…

f5c5c07

…samples

try fix test

9bc6496

test pass

722e509

lint

595946c

update docstring

a633c3e

amogkam added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Oct 3, 2020

richardliaw merged commit 6325a97 into ray-project:master Oct 3, 2020

amogkam mentioned this pull request Oct 5, 2020

[Ray SGD] Iterator state should be maintained when workers resize #11214

Closed
