[air] pyarrow.fs persistence (4/n): Introduce a simplified checkpoint manager #37962
Conversation
@@ -0,0 +1,191 @@
from pathlib import Path
This was adapted from ray/air/test_checkpoint_manager.py
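As a rough illustration (not the PR's actual test file), here is a self-contained toy of the kind of top-k retention behavior such a test exercises; the TopKTracker class below is hypothetical and exists only for this sketch.

import heapq


class TopKTracker:
    """Toy stand-in for a checkpoint manager that keeps the K best checkpoints."""

    def __init__(self, num_to_keep: int):
        self.num_to_keep = num_to_keep
        self._heap = []  # min-heap of (score, checkpoint_id)

    def register(self, checkpoint_id: str, score: float) -> None:
        heapq.heappush(self._heap, (score, checkpoint_id))
        # Evict the lowest-scoring checkpoint once we exceed the budget.
        if len(self._heap) > self.num_to_keep:
            heapq.heappop(self._heap)

    def kept(self) -> set:
        return {checkpoint_id for _, checkpoint_id in self._heap}


def test_keeps_top_k():
    tracker = TopKTracker(num_to_keep=2)
    for i, score in enumerate([0.1, 0.9, 0.5, 0.7]):
        tracker.register(f"ckpt_{i}", score)
    # Only the two highest-scoring checkpoints remain.
    assert tracker.kept() == {"ckpt_1", "ckpt_3"}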
@@ -0,0 +1,218 @@
import heapq
This was adapted from ray/air/_internal/checkpoint_manager.py
                and len(self._top_k_checkpoints) > self._checkpoint_config.num_to_keep
            ):
                worst_checkpoint = heapq.heappop(self._top_k_checkpoints)
                self._maybe_delete_checkpoint(worst_checkpoint.tracked_checkpoint)
Why not clean up checkpoints right away instead of deferring?
This is behavior from before: we always keep the latest checkpoint around, even if it's not a top k checkpoint. Restoration from fault tolerance restores from the latest checkpoint at the moment.
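To make this retention rule concrete, here is a toy model of "top-k plus always keep the latest" -- illustrative only and independent of the PR's actual classes; the retained helper below is hypothetical.

from typing import List, Set, Tuple


def retained(history: List[Tuple[str, float]], num_to_keep: int) -> Set[str]:
    """Return the checkpoint ids kept: the top-k by score, plus the latest."""
    top_k = sorted(history, key=lambda c: c[1], reverse=True)[:num_to_keep]
    latest_id = history[-1][0]
    return {cid for cid, _ in top_k} | {latest_id}


history = [("ckpt_0", 0.9), ("ckpt_1", 0.8), ("ckpt_2", 0.1)]
# ckpt_2 scores worst, but it is the latest checkpoint, so it is retained
# alongside the top 2 until a newer checkpoint arrives.
assert retained(history, num_to_keep=2) == {"ckpt_0", "ckpt_1", "ckpt_2"}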
@ericl Are you ok with keeping the existing functionality for now? This test walks through an example.
Hmm I'm more talking about the implementation. I think now's a good time to clean up some of this code. I believe we can replace this class with the following, which is much simpler and unlikely to be much slower:
class _CheckpointManager:
    def __init__(
        self,
        checkpoint_config: Optional[CheckpointConfig],
        latest_checkpoint_index: int = 0,
    ):
        self._checkpoint_config = checkpoint_config or CheckpointConfig()
        self._checkpoints: List[_TrackedCheckpoint] = []
        self._latest_checkpoint_index = latest_checkpoint_index

    def register_checkpoint(self, tracked_checkpoint: _TrackedCheckpoint):
        self._checkpoints.append(tracked_checkpoint)
        if self._checkpoint_config.num_to_keep:
            # Could cache score as optimization if we wanted.
            # The latest checkpoint is excluded, so it is never deleted here.
            candidates = sorted(self._checkpoints[:-1], key=self._get_checkpoint_score)
            while candidates and len(self._checkpoints) > self._checkpoint_config.num_to_keep:
                worst = candidates.pop(0)
                # Also drop it from the tracked list so the size check converges.
                self._checkpoints.remove(worst)
                checkpoint = worst.checkpoint
                _delete_fs_path(fs=checkpoint.filesystem, fs_path=checkpoint.path)

    @property
    def best_checkpoint(self) -> Optional[_TrackedCheckpoint]:
        if not self._checkpoints:
            return None
        return sorted(self._checkpoints, key=self._get_checkpoint_score)[-1]

    @property
    def latest_checkpoint(self) -> Optional[_TrackedCheckpoint]:
        if not self._checkpoints:
            return None
        return self._checkpoints[-1]
Ok, simplified it in a slightly different way that doesn't need to sort the list. PTAL
There was also a slight behavioral change from above: the K+1th checkpoint wouldn't be deleted if the best checkpoint is always excluded from the candidates.
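For context, one way to avoid re-sorting on every registration is to key a min-heap on the checkpoint score while always sparing the latest checkpoint. The sketch below only illustrates that idea with made-up names (Ckpt, HeapTopK); it is not the code that landed in this PR.

import heapq
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Ckpt:
    path: str
    score: float


class HeapTopK:
    def __init__(self, num_to_keep: int):
        self.num_to_keep = num_to_keep
        self._tie = itertools.count()  # tie-breaker so heapq never compares Ckpt objects
        self._heap: List[Tuple[float, int, Ckpt]] = []
        self.latest: Optional[Ckpt] = None
        self.deleted: List[str] = []

    def register(self, ckpt: Ckpt) -> None:
        self.latest = ckpt
        heapq.heappush(self._heap, (ckpt.score, next(self._tie), ckpt))
        while len(self._heap) > self.num_to_keep:
            score, tie, worst = heapq.heappop(self._heap)
            if worst is self.latest:
                # The latest checkpoint is never deleted, even if it scores worst.
                heapq.heappush(self._heap, (score, tie, worst))
                break
            self.deleted.append(worst.path)  # a real manager would delete the files here


mgr = HeapTopK(num_to_keep=2)
for i, s in enumerate([0.9, 0.8, 0.1, 0.7]):
    mgr.register(Ckpt(path=f"ckpt_{i}", score=s))
# ckpt_2 (score 0.1) survived while it was the latest, then was evicted once ckpt_3 arrived.
assert mgr.deleted == ["ckpt_2"]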
@ericl Any thoughts on a new name for `_TrackedCheckpoint`? I'm using it in the next PR and aliasing it to `_NewTrackedCheckpoint`, but I think this name is confusing, and it's clearer if we just choose a new name for it.

I'm thinking of changing it to `TrainingResult`, since it's an object that holds the (metrics, checkpoint) reported from the user:
- We can use this in place of `ray.train._internal.session.TrainingResult` (which is pretty much the same thing).
- This is in line with unifying the 2 session reports.
- This is also in line with combining the train and save steps in the Tune control loop.
+1, was thinking the name is a bit confusing too
…ection (#37888) This PR:
1. Uses the storage context to upload the new `ray.train.Checkpoint` (from #37925) directly from the Train worker.
2. Gets checkpoint reporting to work in the save direction, simplifying the checkpoint handling logic to avoid the Train `CheckpointManager` and use a single, simplified checkpoint manager (from #37962).
3. Updates the e2e test to check for worker-uploaded checkpoints.

### Follow-ups needed
1. `Trial` path resolution is still messed up (using the legacy path), causing some issues with the custom fs test case. That test case skips some assertions at the moment. This fix is up next.
2. Trial restoration is explicitly disabled at the moment. This is up next as well.
3. Artifacts are currently being synced by the driver because the train worker lives on the same node, which is why the test case passes. This upload should be done from the worker, and the test case should be updated to check that.
4. The `on_checkpoint` hook for `tune.Callback` takes in a `_TrackedCheckpoint`. Currently, I skip invoking the callbacks -- TBD what to expose to the user callbacks here.
5. Checkpoints cannot be ordered based on auto-filled metrics at the moment, only user-specified metrics. Ex: `CheckpointConfig(checkpoint_score_attribute="training_iteration", mode="min")` (see the sketch below).
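For reference, point 5 above refers to configurations along these lines. This is a hedged sketch: the import path and argument names (e.g. `mode` vs. `checkpoint_score_order`) shifted between `ray.air` and `ray.train` around this release, so check the version you're on.

from ray.air import CheckpointConfig, RunConfig

# Keep only the 2 checkpoints with the lowest user-reported "loss".
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    )
)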
Why are these changes needed?
This PR condenses the functionality of `ray.air._internal.checkpoint_manager` into a few simplified classes:
- `_TrainingResult` in place of the old `_TrackedCheckpoint`, which holds onto a `(checkpoint, metrics)` checkpoint result reported by the user.
- `_CheckpointManager` is stripped down to only handle the top K heap of checkpoints. This includes deleting the bottom checkpoints.
- `_CheckpointManager` also handled "committing" checkpoints to disk. This was mostly for in-memory checkpoints, which don't apply to the new persistence mode. This behavior has been removed.
- `delete_fn` is now standardized using the pyarrow fs.

This is a prerequisite cleanup for #37888, which shouldn't deal with the old `_TrackedCheckpoint` interface anymore. The old `_TrackedCheckpoint` holds onto the node that a checkpoint lives on + implements "commit" / "delete" and has a bunch of other functionality that doesn't apply anymore.
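To make the description above concrete, here is a minimal sketch of the two ideas: a (checkpoint, metrics) result object and a delete helper that goes through a pyarrow filesystem. The field names and helper signature are assumptions for illustration, not necessarily the exact code in this PR.

from dataclasses import dataclass
from typing import Any, Dict

import pyarrow.fs


@dataclass
class _TrainingResult:
    """A (checkpoint, metrics) pair reported by the user."""

    checkpoint: Any  # a ray.train.Checkpoint (filesystem + path) in the real code
    metrics: Dict[str, Any]


def _delete_fs_path(fs: pyarrow.fs.FileSystem, fs_path: str) -> None:
    # pyarrow's delete_dir removes the directory and its contents recursively and
    # works the same way for local directories and remote storage such as S3.
    fs.delete_dir(fs_path)


result = _TrainingResult(checkpoint=None, metrics={"loss": 0.12, "training_iteration": 3})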
Future cleanup / considerations
- `ray.air._internal.checkpoint_manager`, `ray.train._internal.checkpoint`, and `ray.tune.execution.checkpoint_manager`.

Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If adding a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.