
[air] pyarrow.fs persistence (5/n): ray.train.Checkpoint save direction #37888

Merged

Conversation

@justinvyu (Contributor) commented Jul 28, 2023

Why are these changes needed?

This PR:

  1. Uses the storage context to upload the new ray.train.Checkpoint (from [air] pyarrow.fs persistence (3/n): Introduce new Checkpoint API #37925)
    directly from the Train worker.
  2. Gets checkpoint reporting to work in the save direction, simplifying the checkpoint handling logic to bypass the Train CheckpointManager and use a single, simplified checkpoint manager (from [air] pyarrow.fs persistence (4/n): Introduce a simplified checkpoint manager #37962). A sketch of this save path follows after this list.
  3. Updates the e2e test to check for worker-uploaded checkpoints.
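
For orientation, here is a minimal sketch of the save path this PR wires up, written against the post-2.7 `ray.train` API. The `TorchTrainer`, worker count, file names, and `storage_path` are illustrative assumptions, not taken from this PR:

```python
import os
import tempfile

import ray.train
from ray.train import Checkpoint, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_fn(config):
    rank = ray.train.get_context().get_world_rank()
    for epoch in range(3):
        metrics = {"epoch": epoch, "loss": 1.0 / (epoch + 1)}
        # Write checkpoint data to a local temp dir and hand it to report();
        # on the new persistence path, the Train worker itself uploads it to
        # `storage_path` via the storage context (no driver-side sync).
        with tempfile.TemporaryDirectory() as tmpdir:
            with open(os.path.join(tmpdir, f"model-rank{rank}.txt"), "w") as f:
                f.write(f"weights after epoch {epoch}")
            ray.train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))


trainer = TorchTrainer(
    train_fn,
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(storage_path="/tmp/ray_results", name="demo_run"),
)
result = trainer.fit()
```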

Follow-ups needed

  1. Trial path resolution still uses the legacy path, which causes some issues with the custom fs test case. That test case skips some assertions at the moment. See: [air] pyarrow.fs persistence (6/n): Fix Trial + Experiment paths to use the StorageContext #38057
  2. Trial restoration is explicitly disabled at the moment. See PRs 7-9.
  3. Artifacts are currently being synced by the driver because the Train worker lives on the same node, which is why the test case passes. This upload should be done from the worker, and the test case should be updated to check that.
  4. The on_checkpoint hook for tune.Callback takes in a _TrackedCheckpoint. Currently, I skip invoking the callbacks -- TBD what to expose to the user callbacks here.
  5. Checkpoints cannot be ordered based on auto-filled metrics at the moment, only user-specified metrics. Ex: CheckpointConfig(checkpoint_score_attribute="training_iteration", mode="min") See: https://github.com/ray-project/ray/pull/38141/files#diff-c8a0efbd48da8eaa5c945c0423d35eaf8797a31b1f82178e5c567614c130d8ebR511-R512

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Update to use the new checkpoint id attribute

Add todo comment to remove legacy path

… -> driver

Fix lint

Fix lint

…persistence/storage_context_to_worker_temp
@justinvyu force-pushed the air/persistence/storage_context_to_worker branch from 50c7afa to ee4ccbd on July 28, 2023 17:21
python/ray/train/_internal/session.py (outdated review thread, resolved)
python/ray/train/_internal/session.py (outdated review thread, resolved)

# Save the rank of the worker that created this checkpoint.
metadata.update({CHECKPOINT_RANK_KEY: self.world_rank})
checkpoint_to_report.set_metadata(metadata)
Contributor:
Hmm, we shouldn't set this on the actual checkpoint, right?

Contributor Author (justinvyu):

This is the internal metadata that I want to attach to a checkpoint. We should be able to know the metrics that were reported with a checkpoint (useful for tracking top k checkpoints).

We currently have a few metadata concepts that should be cleaned up:

  1. The old .metadata.pkl file. This contained the checkpoint class information and was used for framework-specific checkpoints, but it's not needed anymore.
  2. The new Checkpoint.set_metadata. User-facing metadata (see the sketch after this list).
  3. Internal metadata that was previously stored in the _TrackedCheckpoint. This is pretty much just the metrics reported with the checkpoint. This PR adds it as part of the user-facing metadata.
    • Q: Would it be better to separate 2 and 3? Ex: we can dump internal metadata in a separate .internal_metadata.json file.
  4. Internal metadata that is stored in the .tune_metadata. This consists of trainable state and can probably be removed in the new path.
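
To make (2) and (3) concrete, here is a small sketch using the metadata accessors on the new Checkpoint (assuming `set_metadata` / `get_metadata` as introduced in the new Checkpoint API PR); the metadata keys are made up for illustration:

```python
import tempfile

from ray.train import Checkpoint

with tempfile.TemporaryDirectory() as tmpdir:
    checkpoint = Checkpoint.from_directory(tmpdir)

    # (2) User-facing metadata attached directly to the checkpoint.
    checkpoint.set_metadata({"preprocessor": "standard_scaler"})

    # (3) In this PR, the session also merges the reported metrics and the
    # worker rank into this same metadata dict before persisting.
    print(checkpoint.get_metadata())  # -> {"preprocessor": "standard_scaler"}
```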

ericl (Contributor) commented Jul 30, 2023:

That metadata is available in the Result object as metrics though, right? It's weird to put a couple of metrics in Checkpoint, which should be only opaque user data.

Afaik, the result metadata is stored in the trial dir separately already.

Contributor:

For clarity, I believe the structure we want is:

Result
   metrics (incl tune internal metadata/metrics)
   Checkpoint 
       user data
       user metadata 
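
A hedged sketch of how that split looks from the user side, assuming the `ray.train.Result` and `Checkpoint` attributes discussed in this thread (`metrics`, `checkpoint`, `path`, `get_metadata`):

```python
from ray.train import Result


def summarize(result: Result) -> None:
    # Metrics, including Tune's internal/auto-filled fields, live on the Result.
    print((result.metrics or {}).get("training_iteration"))

    # The checkpoint carries only user data plus user-set metadata.
    if result.checkpoint is not None:
        print(result.checkpoint.path)
        print(result.checkpoint.get_metadata())
```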

Contributor Author (justinvyu):

Yep, the full set of results is saved in the trial dir, but I think a checkpoint should still know the set of metrics that was reported along with it. This info is currently stored in a few places:

  • The experiment-state.json file holds onto these _TrackedCheckpoint objects, which hold onto "metadata", i.e. the metrics reported with the checkpoint. This lives at the experiment-level directory, and we're trying to move away from storing data in that file.
  • The .tune_metadata file stores certain parts of state, such as the training iteration, and it lives directly within the checkpoint folder.

I'm thinking we can do:

trial_dir/
    result.json -- all metrics
    checkpoint_0/
        .internal_metadata.json <-- metrics & other metadata we populate, associated with the checkpoint
        .user_metadata.json
        user data ...

Contributor:

It's redundant with result.json etc., right? In result.json, we can store "checkpoint_number": "00001" etc. to connect which checkpoint is associated with which metrics.

Anyway, we shouldn't discuss this within the scope of 2.7; it's a new feature that we can tack on later once we finish the GA APIs as specced out.

python/ray/train/_internal/session.py (outdated review thread, resolved)
python/ray/train/_internal/session.py (outdated review thread, resolved)
python/ray/train/checkpoint.py (outdated review thread, resolved)
python/ray/train/tests/test_new_persistence.py (outdated review thread, resolved)
    results, decode_checkpoint_fn=self._backend._decode_data
)
if _use_storage_context():
    self._process_checkpoint_results(results)
Contributor:

Do we still need to involve the checkpoint manager to delete old checkpoints, etc?

Contributor Author (justinvyu):

The Train checkpoint manager is already disabled on purpose and basically doesn't do anything. Tune's checkpoint manager is the one that actually keeps track of the heap of top-K checkpoints and handles deletion. I think we should keep it this way for now and fully avoid the Train checkpoint manager.

Here's where it will happen:

https://github.com/ray-project/ray/pull/37888/files/ee4ccbd8be5457291a2032826e8a30db4df1f72f#diff-918fa19407dbc59ec621b7fbc9e35cf991659dc3b6657b5f4c46534d8dfb73a3R54-R66

python/ray/train/_internal/session.py (review thread, resolved)
@ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Jul 28, 2023
…persistence/new_checkpoint
Comment on lines +33 to +34
if TYPE_CHECKING:
    from ray.train._checkpoint import Checkpoint
Contributor Author (justinvyu):

This circular dependency is:

  • ray.train._internal.storage -> ray.train._checkpoint -> ray.train._internal.storage._download_fs_path

We can solve this by moving the filesystem utils to a different file.
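
For readers unfamiliar with the idiom, here is a generic sketch of how `TYPE_CHECKING` breaks such a cycle; the function below is a made-up stand-in, not the actual storage module code:

```python
# Hypothetical module illustrating the idiom used in ray.train._internal.storage.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers (mypy, pyright), never at runtime,
    # so importing this module does not trigger the circular import.
    from ray.train._checkpoint import Checkpoint


def _upload_checkpoint(checkpoint: "Checkpoint") -> None:
    # At runtime the annotation stays a string, so no import is needed here.
    ...
```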

Comment on lines +527 to +529
from ray.train._internal.checkpoint_manager import (
    _CheckpointManager as _NewCheckpointManager,
)
Contributor Author (justinvyu):

This circular dependency is caused by:

ray.train -> ray.train.trainer.TrainingIterator -> ray.tune -> train._internal.CheckpointManager -> ray.train.CheckpointConfig

This will be fixed when TrainingIterator no longer needs to depend on tune.TrainableUtil. The code already has a TODO for deletion.
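
Until then, the function-local import shown in the diff above is the standard workaround; a short sketch of the pattern (the helper function and constructor call are illustrative assumptions):

```python
def _build_checkpoint_manager(checkpoint_config):
    # Hypothetical helper: deferring the import to call time (rather than
    # module import time) means the enclosing module can be loaded without
    # immediately pulling in ray.train._internal.checkpoint_manager and,
    # through it, the rest of the import cycle described above.
    from ray.train._internal.checkpoint_manager import (
        _CheckpointManager as _NewCheckpointManager,
    )

    # The constructor arguments here are illustrative assumptions.
    return _NewCheckpointManager(checkpoint_config=checkpoint_config)
```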

@justinvyu requested a review from ericl on August 2, 2023 21:54
@justinvyu removed the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Aug 2, 2023
ericl (Contributor) left a comment:

One idea on simplifying the report() case for multi-worker, but looks good to me either way.

"Report (metrics, checkpoint) to the Tune session:\n"
f" metrics={metrics}\n checkpoint={checkpoint}"
)
train.report(metrics, checkpoint=checkpoint)
Contributor:

Btw, a way to make this perhaps easier to read is to not use the public report API for this inner reporting, but to call a private _report that skips the repeated checkpoint upload.

However, this is a minor comment.
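
A purely hypothetical sketch of that suggestion; the class and method names below are made up for illustration and are not actual Ray internals:

```python
class _SessionSketch:
    """Hypothetical stand-in for the session object; not Ray internals."""

    def __init__(self):
        self._reported = []

    def report(self, metrics: dict, checkpoint=None) -> None:
        # Public path: persists the checkpoint, then records the result.
        self._persist(checkpoint)
        self._report(metrics, checkpoint)

    def _report(self, metrics: dict, checkpoint=None) -> None:
        # Private path suggested above: the Train worker has already
        # uploaded the checkpoint, so only bookkeeping happens here.
        self._reported.append((metrics, checkpoint))

    def _persist(self, checkpoint) -> None:
        # Placeholder for the upload step the private path skips.
        pass
```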

Contributor Author (justinvyu):

@ericl I think that makes sense -- I think this will be easier to do when we unify the 2 sessions -- just introduce a private method on the session that does the Train -> Tune communication.

Let's merge with what we have for now

Contributor Author (justinvyu):

Actually, hold off on that; I need to rebase and raise the test timeout limit.

@ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Aug 2, 2023
…persistence/new_checkpoint
@justinvyu added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") and removed the @author-action-required label on Aug 3, 2023
@ericl merged commit f7910f8 into ray-project:master on Aug 3, 2023
71 of 73 checks passed
@justinvyu deleted the air/persistence/storage_context_to_worker branch on August 3, 2023 20:29
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Labels: tests-ok
3 participants