[Ray SGD] Support num_steps continue training #11142

Merged 11 commits into ray-project:master on Oct 3, 2020

Contributor

@amogkam amogkam commented Sep 30, 2020

Currently, when training or validating with the num_steps parameter set, each call resets the data loaders to the beginning. This PR makes subsequent training or validation calls continue from where the previous call stopped, so the entire data loader is consumed before it is reset. It also automatically cycles back to the start of the data loader when a call still needs more training or validation steps than remain.
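The intended behavior can be illustrated with a minimal, framework-free sketch. This is not the code from this PR or any Ray SGD API: `ResumableLoader` and its `take` method are hypothetical names used only for illustration, and a plain Python list stands in for a data loader. The sketch assumes the essential idea is to keep one iterator alive across calls and recreate it only when it is exhausted.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


class ResumableLoader:
    """Hypothetical wrapper that keeps one iterator alive across calls."""

    def __init__(self, data: Iterable[T]) -> None:
        self._data = data
        self._iterator: Optional[Iterator[T]] = None

    def take(self, num_steps: int) -> List[T]:
        """Return the next num_steps items, resuming from the previous call."""
        batches: List[T] = []
        fresh = False  # True right after the iterator has been (re)created
        while len(batches) < num_steps:
            if self._iterator is None:
                self._iterator = iter(self._data)
                fresh = True
            try:
                batches.append(next(self._iterator))
                fresh = False
            except StopIteration:
                if fresh:
                    # A brand-new iterator yielded nothing: avoid looping forever.
                    raise ValueError("data is empty")
                # Fully consumed: reset so the next pass starts from the beginning.
                self._iterator = None
        return batches


loader = ResumableLoader([0, 1, 2, 3, 4])
print(loader.take(3))  # [0, 1, 2]
print(loader.take(3))  # [3, 4, 0]  (finishes the pass, then cycles around)
print(loader.take(2))  # [1, 2]
```

With the old behavior described above, every call would have started again at item 0, so items 3 and 4 would never be seen when num_steps is 3.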

Why are these changes needed?

Related issue number

Closes #8907

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@richardliaw richardliaw left a comment

looks good, just one comment about test.

Does this all get reset when there are changes in the worker pool?

@richardliaw richardliaw added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 1, 2020
Contributor Author

amogkam commented Oct 1, 2020

Yes, that's a good point. This does get reset whenever the worker pool size changes. I'll raise an issue for this, but I think how to handle it requires some more thought.

Contributor Author

amogkam commented Oct 3, 2020

@richardliaw finally...tests are passing

@amogkam amogkam added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Oct 3, 2020
@richardliaw richardliaw merged commit 6325a97 into ray-project:master Oct 3, 2020
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[raysgd] num_steps does not allow you to sample without replacement
2 participants