
[air] pyarrow.fs persistence (3/n): Introduce new Checkpoint API #37925

Merged

Conversation

justinvyu (Contributor)

Why are these changes needed?

This PR introduces the new Checkpoint API (based on the prototype PR #36969). It also adds a set of simplified unit tests for the checkpoint class, covering multiple types of checkpoint path/filesystem inputs.
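
For orientation, a minimal sketch of the kinds of path/filesystem inputs the unit tests cover. The constructor arguments `path` and `filesystem` are assumed from the diff excerpts discussed below; this is illustrative, not the final API:

```python
import pyarrow.fs

from ray.train.checkpoint import Checkpoint  # module added in this PR (see namespace discussion below)

# 1) Plain local path: the filesystem is inferred (resolves to a local filesystem).
local_ckpt = Checkpoint(path="/tmp/my_experiment/checkpoint_000001")

# 2) URI: filesystem and path are resolved via pyarrow.fs.FileSystem.from_uri(path).
uri_ckpt = Checkpoint(path="s3://my-bucket/my_experiment/checkpoint_000001")

# 3) Explicit (path, filesystem) pair, e.g. a custom or mocked filesystem in tests.
fs, fs_path = pyarrow.fs.FileSystem.from_uri("s3://my-bucket/my_experiment/checkpoint_000001")
explicit_ckpt = Checkpoint(path=fs_path, filesystem=fs)
```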

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Comment on lines 196 to +198
_create_directory(fs=fs, fs_path=fs_path)
_pyarrow_fs_copy_files(local_path, fs_path, destination_filesystem=fs)
return
justinvyu (Contributor, Author):

Fix: if `exclude` is not passed, we were previously falling through to the code below, even though we should just execute this block and return.
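
For illustration, a rough sketch of the intended control flow. The enclosing function name and signature are hypothetical; `_create_directory` and `_pyarrow_fs_copy_files` are the helpers shown in the excerpt above and are defined elsewhere in the module:

```python
def _upload_local_dir_to_fs_path(local_path, fs, fs_path, exclude=None):
    # Hypothetical enclosing function, reconstructed only to show the fix.
    if not exclude:
        # No exclude patterns: copy the whole directory and stop here.
        _create_directory(fs=fs, fs_path=fs_path)
        _pyarrow_fs_copy_files(local_path, fs_path, destination_filesystem=fs)
        return  # previously, execution fell through to the exclude handling below

    # Exclude patterns were passed: filter the files before uploading (omitted).
    ...
```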

if path and not filesystem:
    self.filesystem, self.path = pyarrow.fs.FileSystem.from_uri(path)

# The UUID is generated by hashing the combination of the file system type
ericl (Contributor):

I wonder if this could potentially be dangerous if the data was updated somehow. What if we made it purely randomly generated (presumably it gets carried along whenever the Checkpoint is passed to different workers within this class)?

justinvyu (Contributor, Author) commented Jul 31, 2023:

What if we just generated this uuid whenever to_directory gets called, and didn't keep it as an attribute? That way, we always use the latest path/filesystem rather than what it was at initialization.

If it's a random uuid, then we no longer de-duplicate downloads to the same directory.
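
For context, a rough sketch of the two options being weighed here (hypothetical helper names; not the actual implementation):

```python
import hashlib
import uuid

import pyarrow.fs


def deterministic_uuid(fs: pyarrow.fs.FileSystem, fs_path: str) -> str:
    # Hash the filesystem type and path: two Checkpoint objects pointing at the
    # same location share a uuid, so downloads to the same local directory are
    # de-duplicated across workers.
    return hashlib.md5(f"{fs.type_name}:{fs_path}".encode()).hexdigest()


def random_uuid() -> str:
    # Purely random: every Checkpoint instance gets its own uuid (and hence its
    # own local download directory), so identical data may be downloaded twice.
    return uuid.uuid4().hex
```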

justinvyu (Contributor, Author):

@ericl I think the random uuid idea also works in most use cases.

ds.map_batches(Predictor, fn_args=(result.checkpoint,))  # <-- each map_batches worker uses the same checkpoint with the same uuid

Trainer(resume_from_checkpoint=result.checkpoint)  # <-- each Train worker downloads a checkpoint with the same uuid

The only case it doesn't cover is multiple processes creating separate checkpoints that point to the same location.

I'm ok with either way; what do you think?

ericl (Contributor):

Ok, let's go with the random uuid then, because I think this is what is currently implemented in the air.Checkpoint code.

justinvyu (Contributor, Author):

Oh, air.Checkpoint currently uses a canonical uuid for URI checkpoints (so it's the same as the implementation I have now).

ericl (Contributor):

Oh, I see. Hmm, I feel it's a bit risky to use that, so I would still prefer to generate random ones to start with, at least.

justinvyu (Contributor, Author):

Ok, changed, ptal!

ericl (Contributor) left a comment:

One question on whether we can use a random UUID instead.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 31, 2023
@justinvyu justinvyu removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 31, 2023
@justinvyu justinvyu requested a review from ericl July 31, 2023 22:54
@justinvyu justinvyu changed the title [air] pyarrow.fs persistence: Introduce new Checkpoint API [air] pyarrow.fs persistence (3/n): Introduce new Checkpoint API Jul 31, 2023
@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 1, 2023
@ericl ericl merged commit 9d9f482 into ray-project:master Aug 1, 2023
50 of 54 checks passed
@justinvyu justinvyu deleted the air/persistence/new_checkpoint_api branch August 1, 2023 06:09
pcmoritz (Contributor) commented Aug 1, 2023:

Why do we suddenly have a public ray.train.checkpoint namespace? Please put it into _internal and only expose it as ray.train.Checkpoint, see ray-project/enhancements#36

matthewdeng (Contributor):

@pcmoritz why should this be moved to _internal? Wouldn't the typical pattern (sketched below) be to:

  1. Define Checkpoint in train/checkpoint.py
  2. Import ray.train.checkpoint.Checkpoint in train/__init__.py and include it in __all__?
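
A minimal sketch of that pattern, assuming the file layout described above (illustrative only):

```python
# train/checkpoint.py
class Checkpoint:
    """Checkpoint backed by a (path, filesystem) pair."""


# train/__init__.py
from ray.train.checkpoint import Checkpoint

__all__ = ["Checkpoint"]
```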

ericl (Contributor) commented Aug 1, 2023:

The main thing is avoiding redundant public aliases for the same class, right? So we don't want both ray.train.checkpoint.Checkpoint and ray.train.Checkpoint to exist at the same time.

ericl (Contributor) commented Aug 1, 2023:

Though, I don't think we are consistently following this throughout the codebase. For example, in Ray Data, we have ray.data.Dataset as well as ray.data.dataset.Dataset.

ericl pushed a commit that referenced this pull request Aug 3, 2023
…ection (#37888)

This PR:
1. Uses the storage context to upload the new `ray.train.Checkpoint` (from #37925) directly from the Train worker.
2. Gets checkpoint reporting to work in the save direction, simplifying the checkpoint handling logic to avoid the Train `CheckpointManager` and use a single, simplified checkpoint manager (from #37962).
3. Updates the e2e test to check for worker-uploaded checkpoints.

### Follow-ups needed

1. `Trial` path resolution is still messed up (using the legacy path), causing some issues with the custom fs test case. That test case skips some assertions at the moment. This fix is up next.
2. Trial restoration is explicitly disabled at the moment. This is up next as well.
3. Artifacts are currently being synced by the driver due to the train worker living on the same node, which is why it passes in the test case. This upload should be done from the worker, and the test case should be updated to check that.
4. The `on_checkpoint` hook for `tune.Callback` takes in a `_TrackedCheckpoint`. Currently, I skip invoking the callbacks -- TBD what to expose to the user callbacks here.
5. Checkpoints cannot be ordered based on auto-filled metrics at the moment, only user-specified metrics. Ex: `CheckpointConfig(checkpoint_score_attribute="training_iteration", mode="min")`