
[air] pyarrow.fs persistence (4/n): Introduce a simplified checkpoint manager #37962

Merged

Conversation

@justinvyu justinvyu commented Aug 1, 2023

Why are these changes needed?

This PR condenses the functionality of ray.air._internal.checkpoint_manager into a few simplified classes:

  1. Introduces a new _TrainingResult in place of the old _TrackedCheckpoint, which holds the (checkpoint, metrics) pair reported by the user.
  2. The logic of _CheckpointManager is stripped down to only handle the top-K heap of checkpoints, including deleting the checkpoints that fall out of the top K.
    • Previously, the old _CheckpointManager also handled "committing" checkpoints to disk. This was mostly for in-memory checkpoints, which don't apply to the new persistence mode. This behavior has been removed.
    • The implementation of delete_fn is now standardized using the pyarrow fs (a rough sketch follows this list).
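
As a rough illustration of the last point, here is a minimal sketch of a pyarrow-based delete helper. The helper name _delete_fs_path and its (fs, fs_path) signature appear later in this discussion; the body below is an assumption for illustration, not the exact internals.

import pyarrow.fs

def _delete_fs_path(fs: pyarrow.fs.FileSystem, fs_path: str) -> None:
    # Checkpoints are directories; only attempt deletion if the path exists
    # on the given filesystem, so local and remote storage behave the same.
    if fs.get_file_info(fs_path).type != pyarrow.fs.FileType.NotFound:
        fs.delete_dir(fs_path)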

This is a prerequisite cleanup for #37888, which shouldn't deal with the old _TrackedCheckpoint interface anymore. The old _TrackedCheckpoint holds onto the node that a checkpoint lives on + implements "commit" / "delete" and has a bunch of other functionality that doesn't apply anymore.

Future cleanup / considerations

  • We should remove ray.air._internal.checkpoint_manager, ray.train._internal.checkpoint, and ray.tune.execution.checkpoint_manager.
  • This checkpoint manager can also be moved from the driver to the trial process. (That is the direction we're trying to head with the experiment layering effort.)

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -0,0 +1,191 @@
from pathlib import Path
justinvyu (Contributor Author):

This was adapted from ray/air/test_checkpoint_manager.py

@@ -0,0 +1,218 @@
import heapq
justinvyu (Contributor Author):

This was adapted from ray/air/_internal/checkpoint_manager.py

and len(self._top_k_checkpoints) > self._checkpoint_config.num_to_keep
):
worst_checkpoint = heapq.heappop(self._top_k_checkpoints)
self._maybe_delete_checkpoint(worst_checkpoint.tracked_checkpoint)
Contributor:

Why not clean up checkpoints right away instead of deferring?

justinvyu (Contributor Author):

This is behavior from before: we always keep the latest checkpoint around, even if it's not a top-k checkpoint. Fault-tolerance restoration currently restores from the latest checkpoint.

justinvyu (Contributor Author):

@ericl Are you ok with keeping the existing functionality for now? This test walks through an example.

Contributor:

Hmm, I'm talking more about the implementation. I think now's a good time to clean up some of this code. I believe we can replace this class with the following, which is much simpler and unlikely to be much slower:

# Note: typing imports added; CheckpointConfig, _TrackedCheckpoint,
# _delete_fs_path, and _get_checkpoint_score are assumed to come from the
# existing Ray internals.
from typing import List, Optional

class _CheckpointManager:

    def __init__(
        self,
        checkpoint_config: Optional[CheckpointConfig],
        latest_checkpoint_index: int = 0,
    ):
        self._checkpoint_config = checkpoint_config or CheckpointConfig()
        self._checkpoints: List[_TrackedCheckpoint] = []
        self._latest_checkpoint_index = latest_checkpoint_index

    def register_checkpoint(self, tracked_checkpoint: _TrackedCheckpoint):
        self._checkpoints.append(tracked_checkpoint)

        if self._checkpoint_config.num_to_keep:
            # Could cache score as optimization if we wanted
            candidates = sorted(self._checkpoints[:-1], key=self._get_checkpoint_score)
            while candidates and len(self._checkpoints) > self._checkpoint_config.num_to_keep:
                # Drop the worst-scoring checkpoint (never the latest one,
                # which is excluded from the candidates above).
                worst = candidates.pop(0)
                self._checkpoints.remove(worst)
                checkpoint = worst.checkpoint
                _delete_fs_path(fs=checkpoint.filesystem, fs_path=checkpoint.path)

    @property
    def best_checkpoint(self) -> Optional[_TrackedCheckpoint]:
        if not self._checkpoints:
            return None
        return sorted(self._checkpoints, key=self._get_checkpoint_score)[-1]

    @property
    def latest_checkpoint(self) -> Optional[_TrackedCheckpoint]:
        if not self._checkpoints:
            return None
        return self._checkpoints[-1]

justinvyu (Contributor Author):

OK, I simplified it in a slightly different way that doesn't need to sort the list. PTAL

There was also a slight behavioral change from the proposal above: the (K+1)-th checkpoint wouldn't be deleted if the best checkpoint were always excluded from the candidates.
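
For context, here is a minimal sketch of the heap-based variant described above. The class name, method names, and score argument are assumptions for illustration, and the print call stands in for the real pyarrow-based deletion; this is not the merged implementation.

import heapq
import itertools
from typing import List, Optional, Tuple

class _TopKCheckpointManager:
    """Illustrative only: keep the num_to_keep best checkpoints plus the latest one."""

    def __init__(self, num_to_keep: Optional[int] = None):
        self._num_to_keep = num_to_keep
        self._counter = itertools.count()  # tie-breaker for equal scores
        # Min-heap of (score, insertion order, checkpoint path).
        self._heap: List[Tuple[float, int, str]] = []
        self._latest: Optional[str] = None
        self._pending_delete: Optional[str] = None

    def register_checkpoint(self, checkpoint_path: str, score: float) -> None:
        # A previously spared "latest" checkpoint can be deleted now that a
        # newer checkpoint has arrived.
        if self._pending_delete is not None:
            self._delete(self._pending_delete)
            self._pending_delete = None

        self._latest = checkpoint_path
        heapq.heappush(self._heap, (score, next(self._counter), checkpoint_path))

        if self._num_to_keep is not None and len(self._heap) > self._num_to_keep:
            _, _, worst = heapq.heappop(self._heap)
            if worst == self._latest:
                # Never delete the latest checkpoint: fault-tolerance
                # restoration resumes from it. Defer deletion instead.
                self._pending_delete = worst
            else:
                self._delete(worst)

    def _delete(self, checkpoint_path: str) -> None:
        # Placeholder for the real deletion, which goes through the pyarrow fs.
        print(f"Deleting checkpoint at {checkpoint_path}")

In this sketch, the latest checkpoint is only deleted once a newer one supersedes it, matching the restoration behavior discussed earlier in this thread.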

@ericl ericl added the @author-action-required label Aug 1, 2023
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu removed the @author-action-required label Aug 1, 2023
@justinvyu justinvyu requested a review from ericl August 1, 2023 20:51
@justinvyu (Contributor Author):

@ericl Any thoughts on a new name for _TrackedCheckpoint? I'm using it in the next PR and aliasing it to _NewTrackedCheckpoint, but I think this name is confusing, and it's clearer if we just choose a new name for it.

I'm thinking of changing it to TrainingResult, since it's an object that holds the (metrics, checkpoint) reported from the user (a rough sketch follows the list below):

  • We can use this in place of ray.train._internal.session.TrainingResult (which is pretty much the same thing).
  • This is in line with unifying the 2 session reports.
  • This is also in line with combining the train and save steps in the Tune control loop.
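
For reference, the object being described is roughly the following. This is an assumed sketch for illustration, not the exact definition from the PR.

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class _TrainingResult:
    # The checkpoint reported by the user (e.g. a ray.train.Checkpoint);
    # assumed Optional here in case only metrics are reported.
    checkpoint: Optional[Any]
    # The metrics dict reported alongside the checkpoint.
    metrics: Dict[str, Any]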

ericl commented Aug 1, 2023 via email

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu added the tests-ok label Aug 2, 2023
@ericl ericl merged commit f19e027 into ray-project:master Aug 2, 2023
48 of 54 checks passed
@justinvyu justinvyu deleted the air/persistence/simplified_ckpt_manager branch August 2, 2023 18:17
ericl pushed a commit that referenced this pull request Aug 3, 2023
…ection (#37888)

This PR:
1. Uses the storage context to upload the new `ray.train.Checkpoint` (from #37925)
directly from the Train worker.
2. Gets checkpoint reporting to work in the save direction, simplifying the checkpoint handling logic to avoid the Train `CheckpointManager` and use a single, simplified checkpoint manager (from #37962).
3. Updates the e2e test to check for worker-uploaded checkpoints.

### Follow-ups needed

1. `Trial` path resolution is still messed up (using the legacy path), causing some issues with the custom fs test case. That test case skips some assertions at the moment. This fix is up next.
2. Trial restoration is explicitly disabled at the moment. This is up next as well.
3. Artifacts are currently being synced by the driver due to the train worker living on the same node, which is why it passes in the test case. This upload should be done from the worker, and the test case should be updated to check that.
4. The `on_checkpoint` hook for `tune.Callback` takes in a `_TrackedCheckpoint`. Currently, I skip invoking the callbacks -- TBD what to expose to the user callbacks here.
5. Checkpoints cannot be ordered based on auto-filled metrics at the moment, only user specified metrics. Ex: `CheckpointConfig(checkpoint_score_attribute="training_iteration", mode="min")`
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…nt manager (ray-project#37962)

NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…ection (ray-project#37888)

harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…nt manager (ray-project#37962)

harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…ection (ray-project#37888)

arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…nt manager (ray-project#37962)

arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ection (ray-project#37888)

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…nt manager (ray-project#37962)

vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…ection (ray-project#37888)