[train+tune] Local directory refactor (3/n): Revert to async experiment state snapshotting #43689
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
"Saving experiment state to storage at "
f"'{self._storage.experiment_fs_path}' failed with exception: ",
exc_info=True,
)

if force:
Please correct me if I'm wrong: `self._storage.syncer.sync_up()` returns `False` if there's an ongoing sync; otherwise it launches a new sync and returns `True`. (Nit: maybe document this behavior in the docstring? Although it's an internal API.)
The way we achieve a "force sync_up" is to wait for the ongoing sync to finish, so that we can always trigger a new sync when we call `sync_up()` later.
Whether `launched_sync` is `True` (launched a new sync) or `False` (there's an ongoing sync), we never interrupt the current sync; we just let it finish before the timeout (1800s).
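To make the semantics above concrete, here is a minimal sketch of a syncer with these properties. All names (`_SketchSyncer`, `_do_sync`, etc.) are illustrative stand-ins, not the actual Ray API; only the described behavior is modeled: `sync_up()` returns `False` while a sync is in flight and launches a new background sync otherwise, and "force" means blocking on the in-flight sync rather than interrupting it.

```python
import threading
import time

SYNC_TIMEOUT_S = 1800  # the 30-minute wait timeout mentioned above (assumption)


class _SketchSyncer:
    """Hypothetical stand-in modeling the described sync_up() semantics."""

    def __init__(self):
        self._thread = None

    def _do_sync(self):
        time.sleep(0.2)  # placeholder for the actual upload work

    def sync_up(self) -> bool:
        # An ongoing sync is never interrupted; a second call is a no-op.
        if self._thread is not None and self._thread.is_alive():
            return False
        self._thread = threading.Thread(target=self._do_sync, daemon=True)
        self._thread.start()
        return True

    def wait(self, timeout: float = SYNC_TIMEOUT_S):
        # "Force" path: block until the in-flight sync finishes (or times out).
        if self._thread is not None:
            self._thread.join(timeout)


syncer = _SketchSyncer()
launched = syncer.sync_up()    # True: no sync was running, one was launched
blocked = syncer.sync_up()     # False: the previous sync is still in flight
syncer.wait()                  # force semantics: wait it out, don't interrupt
relaunched = syncer.sync_up()  # True: a fresh sync can now be launched
```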
Yes, this is correct. We never interrupt an existing sync, but we do a blocking wait on the existing sync if `force=True`.
"The previous sync of the experiment directory to the cloud "
f"failed with the error: {str(e)}\nSyncing will be retried."

"Experiment state snapshotting has been triggered multiple "
f"times in the last {self._excessive_sync_threshold} seconds. "
`_excessive_sync_threshold = TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S = 5`
This is just a warning telling users not to checkpoint too frequently in general. Why do we specifically mention `num_to_keep` and the `force` logic here? If `force=False`, it's still possible to hit this warning by checkpointing too frequently.
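The warning condition being discussed can be sketched as follows. The class and method names are hypothetical; the sketch only models the described rule: warn when a snapshot is triggered again within the threshold window of the previous one.

```python
import time

EXCESSIVE_SYNC_THRESHOLD_S = 5  # mirrors the default of 5 seconds quoted above


class SnapshotRateWarner:
    """Hypothetical sketch: flag snapshots triggered too close together."""

    def __init__(self, threshold_s: float = EXCESSIVE_SYNC_THRESHOLD_S):
        self._threshold_s = threshold_s
        self._last_snapshot_time = None

    def on_snapshot(self, now: float = None) -> bool:
        """Record a snapshot; return True if it should trigger the warning."""
        now = time.monotonic() if now is None else now
        excessive = (
            self._last_snapshot_time is not None
            and now - self._last_snapshot_time < self._threshold_s
        )
        self._last_snapshot_time = now
        return excessive


w = SnapshotRateWarner()
first = w.on_snapshot(now=0.0)    # False: first snapshot, nothing to compare
second = w.on_snapshot(now=2.0)   # True: within 5s of the previous snapshot
third = w.on_snapshot(now=30.0)   # False: enough time has elapsed
```

Note that nothing in this rule inspects `force` directly, which is the reviewer's point: any sufficiently frequent snapshotting trips it.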
Nit: Also, why not suggest increasing the `sync_up` period instead of reducing the warning period?
If `force=False`, the checkpointing period is `max(10, auto-adjusted period)`, so we shouldn't run into excessive syncs in the default case. So this message is just for the forced case.
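A minimal sketch of that period selection, assuming (as described above) that non-forced snapshots are clamped to a 10-second floor and that the auto-adjusted period grows when snapshots are slow. The function name and the backoff rule shown are illustrative, not the actual implementation.

```python
MIN_CHECKPOINT_PERIOD_S = 10.0  # floor for non-forced snapshots (assumption)


def checkpoint_period_s(auto_adjusted_period_s: float) -> float:
    """Non-forced snapshots never run more often than every 10 seconds."""
    return max(MIN_CHECKPOINT_PERIOD_S, auto_adjusted_period_s)


fast = checkpoint_period_s(2.0)    # clamped up to the 10s floor
slow = checkpoint_period_s(45.0)   # slow snapshotting backs off further
```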
> why not suggest increasing the sync_up period instead of reducing the warning period?

`num_to_keep` will always cause experiment snapshots to be forced, which disregards the checkpoint period, so increasing the value of `TUNE_GLOBAL_CHECKPOINT_S` doesn't actually do anything. The only fixes are to increase `num_to_keep` or just accept this and suppress the warning. 🙁
self._trial_num_checkpoints_since_last_sync[trial]
>= self._sync_every_n_trial_checkpoints
):
    self._should_force_sync_up = True
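For context, the counter logic around the snippet above can be sketched like this: once any single trial has reported `_sync_every_n_trial_checkpoints` (i.e. `num_to_keep`) checkpoints since the last experiment-state sync, a forced sync is flagged. This is a simplified, hypothetical reconstruction for illustration, not the actual `TuneController` code.

```python
from collections import defaultdict


class _ForceSyncTracker:
    """Hypothetical sketch of the per-trial checkpoint counting above."""

    def __init__(self, sync_every_n_trial_checkpoints: int):
        self._sync_every_n = sync_every_n_trial_checkpoints
        self._counts = defaultdict(int)
        self.should_force_sync_up = False

    def on_trial_checkpoint(self, trial_id: str):
        # Count checkpoints per trial since the last experiment-state sync.
        self._counts[trial_id] += 1
        if self._counts[trial_id] >= self._sync_every_n:
            self.should_force_sync_up = True

    def on_synced(self):
        # A completed sync resets the counters and the force flag.
        self._counts.clear()
        self.should_force_sync_up = False


tracker = _ForceSyncTracker(sync_every_n_trial_checkpoints=2)  # e.g. num_to_keep=2
tracker.on_trial_checkpoint("trial_a")
before = tracker.should_force_sync_up  # False: only 1 checkpoint so far
tracker.on_trial_checkpoint("trial_a")
after = tracker.should_force_sync_up   # True: hit the num_to_keep threshold
```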
I see. This is the workaround for `num_to_keep`:
`_sync_every_n_trial_checkpoints = CheckpointConfig.num_to_keep`
https://github.com/justinvyu/ray/blob/a2fb4423906874ed988a860291c398721d54b736/python/ray/tune/execution/tune_controller.py#L319
We only enable a force sync up every `num_to_keep` checkpoints, and that's why the excessive checkpoint warning is only raised when `num_to_keep` is set.
Currently the experiment `sync_up` behavior is affected by the per-trial `CheckpointManager` behavior.
One idea: in the future, we should let each trial maintain its own latest checkpoint path, with the experiment state only keeping the status of all the trials. Then we don't have to worry about the checkpoint mismatch problem between the experiment state and the per-trial checkpoint folders.
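A rough sketch of that proposed split, with entirely hypothetical names: each trial owns its latest checkpoint path, and the experiment-level snapshot carries only trial statuses, so it can never disagree with a trial's checkpoint folder.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TrialState:
    """Hypothetical: the trial, not the experiment, owns its checkpoint path."""
    trial_id: str
    status: str = "PENDING"
    latest_checkpoint_path: Optional[str] = None


@dataclass
class ExperimentState:
    trials: Dict[str, TrialState] = field(default_factory=dict)

    def snapshot(self) -> Dict[str, str]:
        # The experiment snapshot records only statuses; checkpoint paths
        # live with the trials, avoiding the mismatch described above.
        return {t.trial_id: t.status for t in self.trials.values()}


exp = ExperimentState()
exp.trials["t1"] = TrialState(
    "t1", status="RUNNING", latest_checkpoint_path="s3://bucket/t1/ckpt_3"
)
snapshot = exp.snapshot()
```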
Yep that is it 😢
> Currently the exp sync_up behavior is affected by the per-trial CheckpointManager behavior.

Yes, this is a design flaw that we just have to work around for now. We definitely want to learn this lesson when designing a new system.
Overall looks good to me! Left some comments.
For usage proto changes
Why are these changes needed?
A hanging/failing driver file upload will currently block/fail the tune control loop, even though all trials may be running fine. This regression is a side-effect of #43403, which made a behavior change to increase the freshness of the experiment state files in storage. (See the "Experiment checkpoint saving and uploading now happens synchronously." bullet point in that PR description.) Prior to that PR, we would do driver syncing asynchronously -- if it failed, we'd catch and log the error; if it hung, we'd timeout after 30 minutes and log a warning.
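The pre-regression behavior described above (run the driver upload in the background; log and retry on failure; warn and move on after a 30-minute timeout on hangs) can be sketched as follows. `launch_driver_sync`, `wait_for_sync`, and `upload_fn` are hypothetical names for illustration, not Ray's actual API.

```python
import logging
import threading

logger = logging.getLogger(__name__)
DRIVER_SYNC_TIMEOUT_S = 1800  # the 30-minute timeout mentioned above


def launch_driver_sync(upload_fn):
    """Run the driver upload off the control loop's thread.

    A failure is caught and logged instead of crashing the loop.
    """
    result = {"error": None}

    def _run():
        try:
            upload_fn()
        except Exception as e:
            result["error"] = e
            logger.warning("Driver sync failed: %s. Syncing will be retried.", e)

    t = threading.Thread(target=_run, daemon=True)
    t.start()
    return t, result


def wait_for_sync(thread, timeout=DRIVER_SYNC_TIMEOUT_S):
    thread.join(timeout)
    if thread.is_alive():  # hung sync: warn after the timeout, then move on
        logger.warning("Driver sync timed out after %ss.", timeout)


def _failing_upload():
    raise RuntimeError("boom")


t, result = launch_driver_sync(_failing_upload)
wait_for_sync(t, timeout=5)
# The control loop keeps running; the failure was only logged.
```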
This PR switches back to asynchronous driver syncing and adds back the `num_to_keep` forceful experiment state snapshot mitigation. After this PR, the change compared to what we had in 2.9.3 is the part in strikethrough:

- Snapshot the experiment state every `TUNE_GLOBAL_CHECKPOINT_S` seconds (every ~10 seconds).
- If `CheckpointConfig(num_to_keep)` is set, force a new sync to be launched if any trial has reported more than `num_to_keep` checkpoints since the last sync. Force a new sync by waiting on the previous one first (a blocking call) and launching a new one. (This is problematic for `num_to_keep=1` since it blocks the execution loop for too long. This should be reworked soon.)
- ~~If we have already synced up within the last `sync_period` (default = 5 minutes), then skip the sync up.~~

Related issue number
Closes #43746
Closes #43748
Closes #43747
Checks

- I've signed off every commit (by using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.