
[air] pyarrow.fs persistence (5/n): ray.train.Checkpoint save direction #37888

Merged

Conversation

@justinvyu (Contributor) commented Jul 28, 2023

Why are these changes needed?

This PR:

  1. Uses the storage context to upload the new ray.train.Checkpoint (from [air] pyarrow.fs persistence (3/n): Introduce new Checkpoint API #37925)
    directly from the Train worker.
  2. Gets checkpoint reporting to work in the save direction, simplifying the checkpoint handling logic to bypass the Train CheckpointManager and use a single, simplified checkpoint manager (from [air] pyarrow.fs persistence (4/n): Introduce a simplified checkpoint manager #37962). A sketch of this save path follows after this list.
  3. Updates the e2e test to check for worker-uploaded checkpoints.
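
For orientation, here is a minimal sketch of the save path this PR wires up, written against the post-2.7 `ray.train` API. The `TorchTrainer`, worker count, file names, and `storage_path` are illustrative assumptions, not taken from this PR:

```python
import os
import tempfile

import ray.train
from ray.train import Checkpoint, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_fn(config):
    rank = ray.train.get_context().get_world_rank()
    for epoch in range(3):
        metrics = {"epoch": epoch, "loss": 1.0 / (epoch + 1)}
        # Write checkpoint data to a local temp dir and hand it to report();
        # on the new persistence path, the Train worker itself uploads it to
        # `storage_path` via the storage context (no driver-side sync).
        with tempfile.TemporaryDirectory() as tmpdir:
            with open(os.path.join(tmpdir, f"model-rank{rank}.txt"), "w") as f:
                f.write(f"weights after epoch {epoch}")
            ray.train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))


trainer = TorchTrainer(
    train_fn,
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(storage_path="/tmp/ray_results", name="demo_run"),
)
result = trainer.fit()
```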

Follow-ups needed

  1. Trial path resolution still uses the legacy path, which causes some issues with the custom fs test case. That test case skips some assertions at the moment. See: [air] pyarrow.fs persistence (6/n): Fix Trial + Experiment paths to use the StorageContext #38057
  2. Trial restoration is explicitly disabled at the moment. See PRs 7-9.
  3. Artifacts are currently being synced by the driver because the Train worker lives on the same node, which is why the test case passes. This upload should be done from the worker, and the test case should be updated to check that.
  4. The on_checkpoint hook for tune.Callback takes in a _TrackedCheckpoint. Currently, I skip invoking the callbacks -- TBD what to expose to the user callbacks here.
  5. Checkpoints cannot be ordered based on auto-filled metrics at the moment, only user-specified metrics. Ex: CheckpointConfig(checkpoint_score_attribute="training_iteration", mode="min") See: https://github.com/ray-project/ray/pull/38141/files#diff-c8a0efbd48da8eaa5c945c0423d35eaf8797a31b1f82178e5c567614c130d8ebR511-R512

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Update to use the new checkpoint id attribute

Add todo comment to remove legacy path

… -> driver

Fix lint

Fix lint

…persistence/storage_context_to_worker_temp
@justinvyu force-pushed the air/persistence/storage_context_to_worker branch from 50c7afa to ee4ccbd on July 28, 2023 17:21
python/ray/train/_internal/session.py (outdated review thread, resolved)
python/ray/train/_internal/session.py (outdated review thread, resolved)

# Save the rank of the worker that created this checkpoint.
metadata.update({CHECKPOINT_RANK_KEY: self.world_rank})
checkpoint_to_report.set_metadata(metadata)
Contributor:
Hmm, we shouldn't set this on the actual checkpoint, right?

Contributor Author (justinvyu):

This is the internal metadata that I want to attach to a checkpoint. We should be able to know the metrics that were reported with a checkpoint (useful for tracking top k checkpoints).

We currently have a few metadata concepts that should be cleaned up:

  1. The old .metadata.pkl file. This contained the checkpoint class information and was used for framework-specific checkpoints, but it's not needed anymore.
  2. The new Checkpoint.set_metadata. User-facing metadata (see the sketch after this list).
  3. Internal metadata that was previously stored in the _TrackedCheckpoint. This is pretty much just the metrics reported with the checkpoint. This PR adds it as part of the user-facing metadata.
    • Q: Would it be better to separate 2 and 3? Ex: we can dump internal metadata in a separate .internal_metadata.json file.
  4. Internal metadata that is stored in the .tune_metadata. This consists of trainable state and can probably be removed in the new path.
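
To make (2) and (3) concrete, here is a small sketch using the metadata accessors on the new Checkpoint (assuming `set_metadata` / `get_metadata` as introduced in the new Checkpoint API PR); the metadata keys are made up for illustration:

```python
import tempfile

from ray.train import Checkpoint

with tempfile.TemporaryDirectory() as tmpdir:
    checkpoint = Checkpoint.from_directory(tmpdir)

    # (2) User-facing metadata attached directly to the checkpoint.
    checkpoint.set_metadata({"preprocessor": "standard_scaler"})

    # (3) In this PR, the session also merges the reported metrics and the
    # worker rank into this same metadata dict before persisting.
    print(checkpoint.get_metadata())  # -> {"preprocessor": "standard_scaler"}
```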

ericl (Contributor) commented Jul 30, 2023:

That metadata is available in the Result object as metrics though, right? It's weird to put a couple of metrics in Checkpoint, which should be only opaque user data.

Afaik, the result metadata is stored in the trial dir separately already.

Contributor:

For clarity, I believe the structure we want is:

Result
   metrics (incl tune internal metadata/metrics)
   Checkpoint 
       user data
       user metadata 
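
A hedged sketch of how that split looks from the user side, assuming the `ray.train.Result` and `Checkpoint` attributes discussed in this thread (`metrics`, `checkpoint`, `path`, `get_metadata`):

```python
from ray.train import Result


def summarize(result: Result) -> None:
    # Metrics, including Tune's internal/auto-filled fields, live on the Result.
    print((result.metrics or {}).get("training_iteration"))

    # The checkpoint carries only user data plus user-set metadata.
    if result.checkpoint is not None:
        print(result.checkpoint.path)
        print(result.checkpoint.get_metadata())
```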

Contributor Author (justinvyu):

Yep, the full set of results is saved in the trial dir, but I think a checkpoint should still know the set of metrics that was reported along with it. This info is currently stored in a few places:

  • The experiment-state.json file holds onto these _TrackedCheckpoint objects, which hold onto "metadata", i.e. the metrics reported with the checkpoint. This lives at the experiment-level directory, and we're trying to move away from storing data in that file.
  • The .tune_metadata file stores certain parts of state, such as the training iteration, and it lives directly within the checkpoint folder.

I'm thinking we can do:

trial_dir/
    result.json -- all metrics
    checkpoint_0/
        .internal_metadata.json <-- metrics & other metadata we populate, associated with the checkpoint
        .user_metadata.json
        user data ...

Contributor:

It's redundant with result.json etc., right? In result.json, we can store "checkpoint_number": "00001" etc. to connect which checkpoint is associated with which metrics.

Anyway, we shouldn't discuss this within the scope of 2.7; it's a new feature that we can tack on later once we finish the GA APIs as specced out.

python/ray/train/_internal/session.py (outdated review thread, resolved)
python/ray/train/_internal/session.py (outdated review thread, resolved)
python/ray/train/checkpoint.py (outdated review thread, resolved)
python/ray/train/tests/test_new_persistence.py (outdated review thread, resolved)
    results, decode_checkpoint_fn=self._backend._decode_data
)
if _use_storage_context():
    self._process_checkpoint_results(results)
Contributor:

Do we still need to involve the checkpoint manager to delete old checkpoints, etc?

Contributor Author (justinvyu):

The Train checkpoint manager is already disabled on purpose and basically doesn't do anything. Tune's checkpoint manager is the one that actually keeps track of the heap of top-K checkpoints and handles deletion. I think we should keep it this way for now and fully avoid the Train checkpoint manager.

Here's where it will happen:

https://github.com/ray-project/ray/pull/37888/files/ee4ccbd8be5457291a2032826e8a30db4df1f72f#diff-918fa19407dbc59ec621b7fbc9e35cf991659dc3b6657b5f4c46534d8dfb73a3R54-R66

python/ray/train/_internal/session.py (review thread, resolved)
@ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Jul 28, 2023
…persistence/new_checkpoint
Comment on lines +33 to +34
if TYPE_CHECKING:
    from ray.train._checkpoint import Checkpoint
Contributor Author (justinvyu):

This circular dependency is:

  • ray.train._internal.storage -> ray.train._checkpoint -> ray.train._internal.storage._download_fs_path

We can solve this by moving the filesystem utils to a different file.
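
For readers unfamiliar with the idiom, here is a generic sketch of how `TYPE_CHECKING` breaks such a cycle; the function below is a made-up stand-in, not the actual storage module code:

```python
# Hypothetical module illustrating the idiom used in ray.train._internal.storage.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers (mypy, pyright), never at runtime,
    # so importing this module does not trigger the circular import.
    from ray.train._checkpoint import Checkpoint


def _upload_checkpoint(checkpoint: "Checkpoint") -> None:
    # At runtime the annotation stays a string, so no import is needed here.
    ...
```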

Comment on lines +527 to +529
from ray.train._internal.checkpoint_manager import (
    _CheckpointManager as _NewCheckpointManager,
)
Contributor Author (justinvyu):

This circular dependency is caused by:

ray.train -> ray.train.trainer.TrainingIterator -> ray.tune -> train._internal.CheckpointManager -> ray.train.CheckpointConfig

This will be fixed when TrainingIterator no longer needs to depend on tune.TrainableUtil. The code already has a TODO for deletion.
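
Until then, the function-local import shown in the diff above is the standard workaround; a short sketch of the pattern (the helper function and constructor call are illustrative assumptions):

```python
def _build_checkpoint_manager(checkpoint_config):
    # Hypothetical helper: deferring the import to call time (rather than
    # module import time) means the enclosing module can be loaded without
    # immediately pulling in ray.train._internal.checkpoint_manager and,
    # through it, the rest of the import cycle described above.
    from ray.train._internal.checkpoint_manager import (
        _CheckpointManager as _NewCheckpointManager,
    )

    # The constructor arguments here are illustrative assumptions.
    return _NewCheckpointManager(checkpoint_config=checkpoint_config)
```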

@justinvyu requested a review from ericl on August 2, 2023 21:54
@justinvyu removed the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Aug 2, 2023
ericl (Contributor) left a comment:

One idea on simplifying the report() case for multi-worker, but looks good to me either way.

"Report (metrics, checkpoint) to the Tune session:\n"
f" metrics={metrics}\n checkpoint={checkpoint}"
)
train.report(metrics, checkpoint=checkpoint)
Contributor:

Btw, a way to make this perhaps easier to read is to not use the public report API for this inner reporting, but to call a private _report that skips the repeated checkpoint upload.

However, this is a minor comment.
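
A purely hypothetical sketch of that suggestion; the class and method names below are made up for illustration and are not actual Ray internals:

```python
class _SessionSketch:
    """Hypothetical stand-in for the session object; not Ray internals."""

    def __init__(self):
        self._reported = []

    def report(self, metrics: dict, checkpoint=None) -> None:
        # Public path: persists the checkpoint, then records the result.
        self._persist(checkpoint)
        self._report(metrics, checkpoint)

    def _report(self, metrics: dict, checkpoint=None) -> None:
        # Private path suggested above: the Train worker has already
        # uploaded the checkpoint, so only bookkeeping happens here.
        self._reported.append((metrics, checkpoint))

    def _persist(self, checkpoint) -> None:
        # Placeholder for the upload step the private path skips.
        pass
```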

Contributor Author (justinvyu):

@ericl I think that makes sense -- I think this will be easier to do when we unify the 2 sessions -- just introduce a private method on the session that does the Train -> Tune communication.

Let's merge with what we have for now

Contributor Author (justinvyu):

Actually, hold off on that; I need to rebase and raise the test timeout limit.

@ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Aug 2, 2023
…persistence/new_checkpoint
@justinvyu added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") and removed the @author-action-required label on Aug 3, 2023
@ericl merged commit f7910f8 into ray-project:master on Aug 3, 2023
71 of 73 checks passed
@justinvyu deleted the air/persistence/storage_context_to_worker branch on August 3, 2023 20:29
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Labels: tests-ok
3 participants