[air] pyarrow.fs persistence (7/n): ray.train.Checkpoint restore: Auto-recovery fault tolerance #38141

Merged

Conversation

justinvyu
Contributor

@justinvyu justinvyu commented Aug 4, 2023

Why are these changes needed?

This PR handles the auto-restoration direction of fault tolerance for the new Checkpoint API:

  • The latest _TrainingResult(checkpoint, metrics) data saved in the trial state on the driver gets sent to the workers for restoration.
  • No checkpoint data gets downloaded during restoration.
  • The user can access the checkpoint with to_directory and as_directory (see the usage sketch after this list).
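
For illustration, here is a minimal sketch of how a worker-side training function might pick up the restored checkpoint. It assumes the new `ray.train.get_checkpoint()` / `Checkpoint.as_directory()` API; the `epoch.txt` file is a made-up example of state that an earlier checkpoint could have saved.

```python
import os

import ray.train


def train_fn(config):
    # On auto-recovery, the driver sends the latest _TrainingResult's
    # checkpoint handle back to the workers; nothing is downloaded until
    # the worker actually asks for the data.
    checkpoint = ray.train.get_checkpoint()
    start_epoch = 0
    if checkpoint is not None:
        # as_directory() materializes the checkpoint into a local directory
        # and cleans it up when the last user of that directory exits.
        with checkpoint.as_directory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "epoch.txt")) as f:
                start_epoch = int(f.read())  # assumes this file was saved earlier
    # ... resume the training loop from start_epoch, reporting checkpoints as usual.
```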

This PR also fixes a race condition in as_directory: the deletion lock must be set before the internal call to to_directory. Otherwise, worker 1 can exit the context and delete the directory while worker 2 is still waiting for the download to finish; by the time worker 1 releases the download lock, the directory has already been deleted, and worker 2 errors out.
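
To make the ordering concrete, here is a simplified, hypothetical sketch of the fixed behavior (not Ray's actual implementation, which coordinates across worker processes rather than threads): each user of the directory registers its deletion lock before the potentially slow to_directory() call, and the directory is only removed once the last holder exits.

```python
import contextlib
import shutil
import threading
from collections import defaultdict

# Hypothetical in-process stand-in for the per-worker deletion locks.
_holders = defaultdict(int)
_registry_lock = threading.Lock()


@contextlib.contextmanager
def as_directory(checkpoint):
    key = checkpoint.path  # identifies the checkpoint's target directory
    # Take the deletion lock BEFORE the (possibly slow, shared) download,
    # so a worker that finishes first cannot delete the directory while
    # another worker is still waiting inside to_directory().
    with _registry_lock:
        _holders[key] += 1
    local_dir = None
    try:
        local_dir = checkpoint.to_directory()  # may block on a shared download
        yield local_dir
    finally:
        with _registry_lock:
            _holders[key] -= 1
            last_holder = _holders[key] == 0
        if last_holder and local_dir is not None:
            shutil.rmtree(local_dir, ignore_errors=True)
```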

Other comments

Here were some other ideas for restoring the checkpoint index:

  1. Store it inside the _TrainingResult when saving the checkpoint. Then, pass this index along with the checkpoint all the way to the worker. Use the index to initialize the starting checkpoint number.
  2. Save it inside the Trial storage context. The trial storage context saved on the driver never sets the checkpoint_index, because that indexing is handled entirely on the trainable/worker. This is what we're doing now: the driver's trial.storage.current_checkpoint_index gets incremented on every reported checkpoint, to stay in sync with the worker/trainable (see the sketch after this list).
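
A rough sketch of option (2)'s bookkeeping, with a hypothetical driver-side hook name (the real integration point lives in Tune's result handling):

```python
def on_training_result(trial, training_result):
    """Hypothetical driver-side hook, called for every result a worker reports."""
    if training_result.checkpoint is not None:
        # The worker/trainable owns the actual checkpoint numbering; the driver
        # only mirrors it so that, after a restore, the index it hands out
        # matches what the worker would have used next.
        trial.storage.current_checkpoint_index += 1
```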

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Update to use the new checkpoint id attribute

Add todo comment to remove legacy path

Fix lint

Fix lint for session.py

Fix lint for storage.py
@justinvyu justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 7, 2023
@justinvyu justinvyu requested a review from ericl August 7, 2023 21:26
@justinvyu
Contributor Author

@ericl This one is ready for a 2nd round. I then have 2 more lined up to finish restoration for trainers.

@ericl
Contributor

ericl commented Aug 7, 2023

Just a small remaining comment.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 7, 2023
Contributor

@ericl ericl left a comment


Lgtm

@justinvyu justinvyu added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 8, 2023
@justinvyu
Contributor Author

@ericl Good to merge now

@ericl ericl merged commit d13ba07 into ray-project:master Aug 8, 2023
68 of 72 checks passed
@justinvyu justinvyu deleted the air/persistence/restore_new_checkpoint_autoft branch August 8, 2023 18:50
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023