[air] pyarrow.fs persistence (7/n): ray.train.Checkpoint restore: Auto-recovery fault tolerance #38141

Merged

Conversation

justinvyu
Contributor

@justinvyu justinvyu commented Aug 4, 2023

Why are these changes needed?

This PR handles the auto-restoration direction of fault tolerance for the new Checkpoint API:

  • The latest _TrainingResult(checkpoint, metrics) data saved in the trial state on the driver gets sent to the workers for restoration.
  • No checkpoint data gets downloaded during restoration.
  • The user can access the checkpoint with to_directory and as_directory (see the usage sketch after this list).
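
For illustration, here is a minimal sketch of how a worker-side training function might pick up the restored checkpoint. It assumes the new `ray.train.get_checkpoint()` / `Checkpoint.as_directory()` API; the `epoch.txt` file is a made-up example of state that an earlier checkpoint could have saved.

```python
import os

import ray.train


def train_fn(config):
    # On auto-recovery, the driver sends the latest _TrainingResult's
    # checkpoint handle back to the workers; nothing is downloaded until
    # the worker actually asks for the data.
    checkpoint = ray.train.get_checkpoint()
    start_epoch = 0
    if checkpoint is not None:
        # as_directory() materializes the checkpoint into a local directory
        # and cleans it up when the last user of that directory exits.
        with checkpoint.as_directory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "epoch.txt")) as f:
                start_epoch = int(f.read())  # assumes this file was saved earlier
    # ... resume the training loop from start_epoch, reporting checkpoints as usual.
```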

This PR also fixes a race condition in as_directory: the deletion lock must be set before the internal call to to_directory. Otherwise, worker 1 can exit the context and delete the directory while worker 2 is still waiting for the download to finish; by the time worker 1 releases the download lock, the directory has already been deleted, and worker 2 errors out.
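
To make the ordering concrete, here is a simplified, hypothetical sketch of the fixed behavior (not Ray's actual implementation, which coordinates across worker processes rather than threads): each user of the directory registers its deletion lock before the potentially slow to_directory() call, and the directory is only removed once the last holder exits.

```python
import contextlib
import shutil
import threading
from collections import defaultdict

# Hypothetical in-process stand-in for the per-worker deletion locks.
_holders = defaultdict(int)
_registry_lock = threading.Lock()


@contextlib.contextmanager
def as_directory(checkpoint):
    key = checkpoint.path  # identifies the checkpoint's target directory
    # Take the deletion lock BEFORE the (possibly slow, shared) download,
    # so a worker that finishes first cannot delete the directory while
    # another worker is still waiting inside to_directory().
    with _registry_lock:
        _holders[key] += 1
    local_dir = None
    try:
        local_dir = checkpoint.to_directory()  # may block on a shared download
        yield local_dir
    finally:
        with _registry_lock:
            _holders[key] -= 1
            last_holder = _holders[key] == 0
        if last_holder and local_dir is not None:
            shutil.rmtree(local_dir, ignore_errors=True)
```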

Other comments

Here were some other ideas for restoring the checkpoint index:

  1. Store it inside the _TrainingResult when saving the checkpoint. Then, pass this index along with the checkpoint all the way to the worker. Use the index to initialize the starting checkpoint number.
  2. Save it inside the Trial storage context. The trial storage context saved on the driver never sets the checkpoint_index, because that indexing is handled entirely on the trainable/worker. This is what we're doing now: the driver's trial.storage.current_checkpoint_index gets incremented on every reported checkpoint, to stay in sync with the worker/trainable (see the sketch after this list).
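
A rough sketch of option (2)'s bookkeeping, with a hypothetical driver-side hook name (the real integration point lives in Tune's result handling):

```python
def on_training_result(trial, training_result):
    """Hypothetical driver-side hook, called for every result a worker reports."""
    if training_result.checkpoint is not None:
        # The worker/trainable owns the actual checkpoint numbering; the driver
        # only mirrors it so that, after a restore, the index it hands out
        # matches what the worker would have used next.
        trial.storage.current_checkpoint_index += 1
```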

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Update to use the new checkpoint id attribute

Add todo comment to remove legacy path

Fix lint

Fix lint for session.py

Fix lint for storage.py
@justinvyu justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 7, 2023
@justinvyu justinvyu requested a review from ericl August 7, 2023 21:26
@justinvyu
Contributor Author

@ericl This one is ready for a 2nd round. I then have 2 more lined up to finish restoration for trainers.

@ericl
Contributor

ericl commented Aug 7, 2023

Just a small remaining comment.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 7, 2023
Contributor

@ericl ericl left a comment


Lgtm

@justinvyu justinvyu added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 8, 2023
@justinvyu
Contributor Author

@ericl Good to merge now

@ericl ericl merged commit d13ba07 into ray-project:master Aug 8, 2023
68 of 72 checks passed
@justinvyu justinvyu deleted the air/persistence/restore_new_checkpoint_autoft branch August 8, 2023 18:50
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023