[AIR] Remove head node syncing as the default storage option #37142
…g checkpoint Signed-off-by: Justin Yu <justinvyu@anyscale.com>
python/ray/tune/syncer.py
@@ -787,6 +791,9 @@ def _remote_trial_logdir(self, trial: "Trial"):
    def _sync_trial_dir(
        self, trial: "Trial", force: bool = False, wait: bool = True
    ) -> bool:
        if not os.environ.get(REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE):
            return False
Do we need the same error message here?
Oh I see, this is for artifact syncing?
If a user is running multi-node without checkpointing, but still saves artifacts:
- on_checkpoint does not get called.
- Trial artifacts would no longer be pulled to the head node.
We can do one of these options:
- Detect if the trial wrote any artifacts. If so, raise an error on this callback.
- Just log a warning the first time sync_trial_dir gets called for a remote trial and does a no-op, saying something like "Syncing to head node is disabled. If you are writing any trial artifacts, they will not be persisted. Please do X or Y." Logging on every sync_trial_dir call would spam the logs too much.
I think option 2 is safer.
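For illustration, option 2 could look roughly like the sketch below. This is not the code from the PR: the class name HeadNodeSyncGate, the trial_is_remote parameter, and the string value assigned to REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE are all hypothetical, and the actual sync step is replaced by a placeholder return value. It only shows the "warn once, then no-op silently" pattern described above.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical value: the real string behind this constant is not shown in the diff.
REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE = "EXAMPLE_REENABLE_SYNC_TO_HEAD_NODE"


class HeadNodeSyncGate:
    """Hypothetical helper: no-op when syncing is disabled, warning only once."""

    def __init__(self) -> None:
        self._warned = False

    def sync_trial_dir(self, trial_is_remote: bool) -> bool:
        # Any non-empty value of the env var re-enables the deprecated behavior.
        if os.environ.get(REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE):
            return True  # placeholder for the real sync call

        # Warn on the first skipped sync of a remote trial only,
        # so repeated calls do not spam the logs.
        if trial_is_remote and not self._warned:
            logger.warning(
                "Syncing to head node is disabled. If you are writing any "
                "trial artifacts, they will not be persisted."
            )
            self._warned = True
        return False
```

Keeping the flag on the instance means the warning appears once per syncer object rather than once per trial, which matches the goal of avoiding log spam. Note that os.environ.get returns a string, so any non-empty value (including "0") counts as enabled in this sketch.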
…artifacts case Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ng release test Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…disable_head_node_sync
LGTM pending tests passing. Can you run all of the rllib release tests, though (name:rllib_.*)? It seems you only ran the SAC ones.
This reverts commit a0c5954. Revert changes to release tests yaml Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Ran the full suite of ml and rllib release tests by cherry-picking this PR onto the release branch. There were too many confounding factors on master to tell which failures were caused by my PR and which were not.
See here: https://buildkite.com/ray-project/release-tests-pr/builds/45251#_
All errors here are either unrelated to the PR or come from unstable tests, and some have already been resolved (e.g., a problematic core PR has been reverted).
cc: @avnishn for the tests, @sofianhnaide for approval
…ject#37142) Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ject#37142) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
…ject#37142) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: harborn <gangsheng.wu@intel.com>
…ject#37142) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
…ject#37142) Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
See the REP and GitHub Issue for context:
Here are the possible scenarios:
Error message:
TODO
Follow-up PR
Related issue number
Checks
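- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.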