Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIR] Remove head node syncing as the default storage option #37142

Merged
merged 49 commits into from
Jul 13, 2023

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Jul 6, 2023

Why are these changes needed?

See the REP and Github Issue for context:

Here are the possible scenarios:

  1. Single-node experiment: No error raised / no behavior change. Default syncing is disabled by default, which should not change any behavior. Before, we launched a syncing operation, but it was a no-op since it tried to sync a local directory to itself.
  2. Multi-node experiment + no checkpointing: A warning will be logged the first time AIR tries to pull the directory of a trial running on a remote node. All metrics get streamed to the driver, so those are still persisted to the head node. This really only affects trial artifacts, which are no longer pulled to the head node.
  3. Multi-node experiment + checkpoints + no cloud storage: Error raised upon the first checkpoint of a trial living on a worker node.
  4. Multi-node experiment + checkpoints + cloud storage: No error raised / no behavior change.

Error message:

E               DeprecationWarning: Ray AIR no longer supports the synchronization of checkpoints and other artifacts from worker nodes to the head node. This means that the checkpoints and artifacts saved by trials scheduled on worker nodes will not be accessible during the run (e.g., resuming from a checkpoint after a failure) or after the run (e.g., loading the checkpoint of a trial that ran on an already terminated worker node).
E               
E               To fix this issue, configure AIR to use either:
E               (1) Cloud storage: `RunConfig(storage_path='s3://your/bucket')`
E               (2) A network filesystem mounted on all nodes: `RunConfig(storage_path='/mnt/path/to/nfs_storage')`
E               See this Github issue for more details on transitioning to cloud storage/NFS as well as an explanation on why this functionality is being removed: https://github.com/ray-project/ray/issues/37177
E               
E               Other temporary workarounds:
E               - If you want to avoid errors/warnings and continue running with syncing explicitly turned off, set `RunConfig(SyncConfig(syncer=None))`
E               - Or, to re-enable the head node syncing behavior, set the environment variable RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1
E                 - **Note that this functionality will tentatively be hard-deprecated in Ray 2.7.** See the linked issue for the latest information.

TODO

  • Use the env var for any legacy tests
  • Fix all broken release tests
  • E2E test?
  • Link a Github issue explaining this decision

Follow-up PR

  • Update docs
    • sync_on_checkpoint is useless if head node syncing is disabled
    • syncer=None is not needed anymore for NFS, since it's disabled by default now

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…g checkpoint

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -787,6 +791,9 @@ def _remote_trial_logdir(self, trial: "Trial"):
def _sync_trial_dir(
self, trial: "Trial", force: bool = False, wait: bool = True
) -> bool:
if not os.environ.get(REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE):
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the same error message here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, this is for artifact syncing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user is running multi-node without checkpointing, but still saves artifacts:

  • on_checkpoint does not get called
  • Trial artifacts would no longer be pulled to the head node.

We can do one of these options:

  1. Detect if the trial wrote any artifacts. If so, raise an error on this callback.
  2. Just log a warning on the first time sync_trial_dir gets called for a remote trial and does a no-op, saying something like "Syncing to head node is disabled. If you are writing any trial artifacts, they will not be persisted. Please do X or Y." Logging on every sync_trial_dir will spam the logs too much.

I think option 2 is safer.

@matthewdeng matthewdeng requested a review from ericl July 6, 2023 18:21
@matthewdeng matthewdeng added release-blocker P0 Issue that blocks the release Ray 2.6 labels Jul 6, 2023
…artifacts case

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…ng release test

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from krfricke July 7, 2023 17:53
Copy link
Member

@avnishn avnishn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pending tests passing. Can you run all of the rllib release tests though: name:rllib_.*

it seems you only ran the SAC ones.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This reverts commit a0c5954.

Revert changes to release tests yaml

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu
Copy link
Contributor Author

justinvyu commented Jul 12, 2023

Ran the full suite of ml and rllib release tests by picking this PR onto the release branch. There were too many confounding factors on master to figure out what failures were caused by my PR vs. not. See here: https://buildkite.com/ray-project/release-tests-pr/builds/45251#_

All errors here are either unrelated to the PR or unstable, and some have been resolved (ex: a problem core PR has been reverted)

cc: @avnishn for the tests, @sofianhnaide for approval

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 13, 2023
@matthewdeng matthewdeng merged commit 054e00f into ray-project:master Jul 13, 2023
70 of 79 checks passed
@justinvyu justinvyu deleted the air/disable_head_node_sync branch July 17, 2023 17:01
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 24, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 28, 2023
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…ject#37142)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…ject#37142)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ject#37142)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…ject#37142)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ray 2.6 release-blocker P0 Issue that blocks the release tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants