
[train+tune] Local directory refactor (3/n): Revert to async experiment state snapshotting #43689

Merged: 150 commits merged into ray-project:master from justinvyu:handle_upload_timeout on Mar 9, 2024

Conversation

@justinvyu (Contributor) commented on Mar 4, 2024

Why are these changes needed?

A hanging or failing driver file upload currently blocks or fails the Tune control loop, even though all trials may be running fine. This regression is a side effect of #43403, which changed the behavior to increase the freshness of the experiment state files in storage. (See the "Experiment checkpoint saving and uploading now happens synchronously." bullet point in that PR description.) Prior to that PR, we did driver syncing asynchronously: if it failed, we'd catch and log the error; if it hung, we'd time out after 30 minutes and log a warning.

This PR switches back to asynchronous driver syncing and adds back the num_to_keep forced experiment-state snapshot mitigation. After this PR, the only change compared to the 2.9.3 behavior is the part shown in strikethrough (a rough sketch of the resulting control flow follows the list):

  • Write the "local" copy of these experiment state files every TUNE_GLOBAL_CHECKPOINT_S seconds (every ~10 seconds).
  • If CheckpointConfig(num_to_keep) is set, then force a new sync to be launched if any trial has reported more than num_to_keep checkpoints since the last sync. Force a new sync by waiting on the previous one first (a blocking call) and launching a new one.
    • NOTE: This is a hack to keep the driver experiment state as consistent as possible with the per-trial checkpoint folders, so that trials don't end up referencing checkpoints that num_to_keep has already deleted. It is pretty unusable with more than ~10 trials and num_to_keep=1, since it blocks the execution loop for too long. This should be reworked soon.
    • This PR adds this mitigation back, since other solutions are too risky at the moment, and it's better to just fall back to the existing behavior.
  • Else:
    • Try to launch a sync up task to storage after writing the files to local. If we have already synced up within the last sync_period (default = 5 minutes), then skip the sync up.
    • If the previous sync task is still running, skip launching a new sync up task.
  • At the end of the experiment, wait for the latest sync up if one is running. Then, launch a final driver sync up and wait for that one to also finish.
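For illustration only, here is a minimal Python sketch of the control flow described in the list above. All names here (ExperimentStateSyncer, _write_local_experiment_state, the syncer's sync_up()/wait() semantics) are assumptions for the sketch, not the actual Ray Tune implementation:

    import time

    class ExperimentStateSyncer:
        """Sketch (assumed names) of periodic + forced experiment-state snapshotting."""

        def __init__(self, syncer, checkpoint_period_s=10.0, sync_period_s=300.0):
            self._syncer = syncer                            # launches async sync-up tasks
            self._checkpoint_period_s = checkpoint_period_s  # ~TUNE_GLOBAL_CHECKPOINT_S
            self._sync_period_s = sync_period_s              # driver sync period (default 5 min)
            self._last_snapshot_time = float("-inf")
            self._last_sync_time = float("-inf")
            self._should_force_sync_up = False               # set by the num_to_keep mitigation

        def _write_local_experiment_state(self):
            # Placeholder: write the "local" copy of the experiment state files.
            pass

        def maybe_snapshot_and_sync(self, force: bool = False):
            now = time.monotonic()
            if not force and now - self._last_snapshot_time < self._checkpoint_period_s:
                return
            self._write_local_experiment_state()
            self._last_snapshot_time = now

            if force or self._should_force_sync_up:
                # Forced path (num_to_keep mitigation): block on any in-flight sync,
                # then always launch a fresh async sync-up.
                self._syncer.wait()
                self._syncer.sync_up()
                self._last_sync_time = now
                self._should_force_sync_up = False
            elif now - self._last_sync_time >= self._sync_period_s:
                # Best-effort path: sync_up() is assumed to return False (skip) if a
                # previous sync task is still running.
                if self._syncer.sync_up():
                    self._last_sync_time = now

        def on_experiment_end(self):
            # Wait for any in-flight sync, then launch a final sync and wait for it too.
            self._syncer.wait()
            self._syncer.sync_up()
            self._syncer.wait()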

Related issue number

Closes #43746
Closes #43748
Closes #43747

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu requested a review from @woshiyyya on March 7, 2024 at 21:58
"Saving experiment state to storage at "
f"'{self._storage.experiment_fs_path}' failed with exception: ",
exc_info=True,
)

if force:
@woshiyyya (Member) commented on Mar 8, 2024:

Please correct me if I'm wrong:

self._storage.syncer.sync_up() returns False if there's an ongoing sync; otherwise it launches a new sync and returns True. (Nit: maybe update this behavior in the docstring? Although it's an internal API.)

The way we achieve a "force sync_up" is to wait for the ongoing sync to finish, so that we can always trigger a new sync when we call sync_up() afterwards.

Whether launched_sync is True (a new sync was launched) or False (there's an ongoing sync), we never interrupt the current sync; we just let it finish before the timeout (1800 s).

@justinvyu (Contributor, Author) replied:

Yes, this is correct. We never interrupt an existing sync, but we do a blocking wait on the existing sync if force=True.
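As a hedged sketch of that force path (assuming syncer.wait() blocks until the in-flight sync finishes or fails, applying its own 1800 s timeout internally, and syncer.sync_up() returns True only when it launches a new sync):

    def force_sync_up(syncer) -> bool:
        """Sketch: never interrupt an in-flight sync; wait on it, then relaunch."""
        try:
            # Blocking wait on the previous sync-up; the syncer is assumed to
            # enforce its own timeout (1800 s) internally.
            syncer.wait()
        except Exception as e:
            # Old 2.9.x-style behavior: catch/log sync failures instead of
            # failing the control loop.
            print(f"Previous experiment-state sync failed or timed out: {e}")
        # With no sync in flight anymore, sync_up() can launch a new one.
        return syncer.sync_up()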

"The previous sync of the experiment directory to the cloud "
f"failed with the error: {str(e)}\nSyncing will be retried."
"Experiment state snapshotting has been triggered multiple "
f"times in the last {self._excessive_sync_threshold} seconds. "
@woshiyyya (Member) commented on Mar 8, 2024:

_excessive_sync_threshold = TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S = 5

This is just a warning that tells users not to checkpoint too frequently in general. Why do we specifically mention num_to_keep and the force logic here? Even with force=False, it's still possible to hit this warning if checkpointing happens too frequently.

@woshiyyya (Member) commented on Mar 8, 2024:

nit: Also, why not suggest increasing the sync_up period instead of reducing the warning period?

@justinvyu (Contributor, Author) replied:

If force=False, the checkpointing period is max(10, auto-adjusted period), so we shouldn't run into excessive syncs in the default case.

So, this message is really just for the forced case.
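For illustration, a sketch of the period logic being described; the constant names mirror those quoted above, but the 20x auto-adjustment factor and function names are assumptions for the sketch:

    import time

    TUNE_GLOBAL_CHECKPOINT_S = 10.0
    TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S = 5.0

    def snapshot_period(last_snapshot_duration_s: float) -> float:
        # Non-forced case: never snapshot more often than every 10 s, and back off
        # further if writing the snapshot itself is slow (assumed 20x factor).
        return max(TUNE_GLOBAL_CHECKPOINT_S, 20.0 * last_snapshot_duration_s)

    def is_excessive_forced_sync(last_forced_sync_time_s: float) -> bool:
        # Forced snapshots (the num_to_keep path) bypass the period above, so in
        # practice this warning fires only for them.
        return (
            time.monotonic() - last_forced_sync_time_s
            < TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S
        )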

@justinvyu (Contributor, Author) replied:

Re: "why not suggest increasing the sync_up period instead of reducing the warning period?"

num_to_keep will always cause experiment snapshots to be forced, which disregards the checkpoint period, so increasing the value of TUNE_GLOBAL_CHECKPOINT_S doesn't actually help here. The only fixes are to increase num_to_keep, or to accept this and suppress the warning. 🙁

Review thread on the per-trial checkpoint counting that sets the force flag:

    self._trial_num_checkpoints_since_last_sync[trial]
    >= self._sync_every_n_trial_checkpoints
    ):
    self._should_force_sync_up = True
@woshiyyya (Member) commented on Mar 8, 2024:

I see. This is the workaround for num_to_keep.

_sync_every_n_trial_checkpoints = CheckpointConfig.num_to_keep
https://github.com/justinvyu/ray/blob/a2fb4423906874ed988a860291c398721d54b736/python/ray/tune/execution/tune_controller.py#L319

We only enable a forced sync-up every num_to_keep checkpoints, and that's why the excessive-checkpoint warning is only raised when num_to_keep is set.
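A minimal sketch of the bookkeeping around that snippet (the surrounding class and method names are assumptions; the threshold comes from CheckpointConfig.num_to_keep):

    from collections import defaultdict

    class ForcedSyncTracker:
        """Sketch of the num_to_keep forced-sync bookkeeping."""

        def __init__(self, num_to_keep: int):
            self._sync_every_n_trial_checkpoints = num_to_keep
            self._trial_num_checkpoints_since_last_sync = defaultdict(int)
            self._should_force_sync_up = False

        def on_trial_checkpoint(self, trial) -> None:
            # Count checkpoints reported per trial since the last experiment-state sync.
            self._trial_num_checkpoints_since_last_sync[trial] += 1
            if (
                self._trial_num_checkpoints_since_last_sync[trial]
                >= self._sync_every_n_trial_checkpoints
            ):
                # Any single trial reporting num_to_keep new checkpoints forces the
                # next experiment-state snapshot/sync, regardless of the period.
                self._should_force_sync_up = True

        def on_sync_launched(self) -> None:
            # Reset once a sync has actually been launched.
            self._trial_num_checkpoints_since_last_sync.clear()
            self._should_force_sync_up = False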

@woshiyyya (Member) commented on Mar 8, 2024:

Currently the experiment sync_up behavior is affected by the per-trial CheckpointManager behavior.

One idea: in the future, we should let each trial maintain its own latest checkpoint path, and have the experiment state only keep the statuses of all the trials. Then we don't have to worry about the checkpoint mismatch problem between the experiment state and the per-trial checkpoint folders.
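Purely as a hedged sketch of that proposed direction (not part of this PR; all names are illustrative), the experiment-level snapshot would stop carrying per-trial checkpoint paths at all:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class TrialState:
        # Owned and persisted by the trial itself, next to its checkpoints, so it
        # can be updated atomically with checkpoint creation/deletion.
        latest_checkpoint_path: Optional[str] = None

    @dataclass
    class ExperimentState:
        # The experiment snapshot only tracks trial statuses; it never references a
        # checkpoint that a per-trial CheckpointManager may have already deleted.
        trial_statuses: Dict[str, str] = field(default_factory=dict)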

@justinvyu (Contributor, Author) replied:

Yep that is it 😢

@justinvyu (Contributor, Author) replied:

Re: "Currently the experiment sync_up behavior is affected by the per-trial CheckpointManager behavior."

Yes, this is a design flaw that we just have to work around for now. We definitely want to apply this lesson when designing a new system.

@woshiyyya (Member) commented on Mar 8, 2024:

Overall looks good to me! Left some comments

@justinvyu changed the title from "[train+tune] Local directory refactor (3/n): Add back timeout+failure handling on experiment checkpointing" to "[train+tune] Local directory refactor (3/n): Revert to async experiment state snapshotting" on Mar 8, 2024.
@thomasdesr (Contributor) left a comment:

For usage proto changes

@justinvyu merged commit 7064a66 into ray-project:master on Mar 9, 2024 (8 of 9 checks passed).
@justinvyu deleted the handle_upload_timeout branch on March 9, 2024.