-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] Add multinode sync test #23229
Conversation
# Conflicts: # python/ray/autoscaler/_private/fake_multi_node/docker_monitor.py # python/ray/autoscaler/_private/fake_multi_node/node_provider.py # python/ray/autoscaler/_private/fake_multi_node/test_utils.py
@@ -798,6 +798,9 @@ def restore(self, trial) -> None: | |||
else: | |||
logger.debug("Trial %s: Attempting restore from %s", trial, value) | |||
if trial.uses_cloud_checkpointing or not trial.sync_on_checkpoint: | |||
# If using cloud checkpointing, trial will get cp from cloud. | |||
# If not syncing to driver, assume it has access to the cp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and what guarantees that this is the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing (in fact we sometimes run into errors when this is misconfigured). I've added the comment for future reference (as I find this part confusing). We should probably have better guarantees in the future.
Overall looks good, thanks Kai! Just some comments on documentation for future maintenance of these tests. |
@@ -158,6 +158,16 @@ some limitations still remain: | |||
|
|||
3. In docker-in-docker setups, a careful setup has to be followed to make the fake multinode docker provider work (see below). | |||
|
|||
Shared directories within the docker environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this!
Why are these changes needed?
This adds a multinode checkpoint/restore test for Ray Tune. This covers some of the functionality of the release tests, but in a more controlled environment. In a follow-up PR, we should test (mocked) cloud checkpointing, too.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.