Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SGD] Callback API for SGD+Tune #11316

Merged
merged 17 commits into from
Oct 15, 2020
Merged

Conversation

amogkam
Copy link
Contributor

@amogkam amogkam commented Oct 9, 2020

Documentation updates to come later.

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -678,6 +715,7 @@ def step(self):
train_stats = self.trainer.train(max_retries=10, profile=True)
validation_stats = self.trainer.validate(profile=True)
stats = merge_dicts(train_stats, validation_stats)
self._iter += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this state supposed to be saved upon checkpoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes good catch. Added it to save_checkpoint and load_checkpoint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now I know what you're doing here, I think we already provide self.training_iteration in trainable, so you don't actually need to save/manage this stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it. Changed it to using self.training_iteration and removed the duplicate metadata checkpointing.

Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice; seems like a nice small change. Left a couple questions/comments. Have you showed this to users?

@amogkam amogkam requested a review from krfricke October 12, 2020 17:13
@amogkam
Copy link
Contributor Author

amogkam commented Oct 12, 2020

@richardliaw Yes I ran it by Richard (Dendra) and he said he really liked it.

Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few last comments for changes

@richardliaw richardliaw added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 13, 2020
@amogkam amogkam removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 15, 2020
@amogkam
Copy link
Contributor Author

amogkam commented Oct 15, 2020

image
image
Updated docstrings

Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! thanks for cleaning this up.

@richardliaw richardliaw merged commit 38eb614 into ray-project:master Oct 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants