[air] pyarrow.fs persistence: Remove dependence on rank 0 worker checkpoint reporting #38523

Merged
merged 5 commits into ray-project:master on Aug 17, 2023

Conversation

@justinvyu justinvyu (Contributor) commented Aug 16, 2023

Why are these changes needed?

This PR removes the need for the rank 0 worker to report a checkpoint in order for a checkpoint to be tracked by Train/Tune.

Before (ray <= 2.6):

```python
def train_fn_per_worker(config):
    ...
    tmpdir = tempfile.mkdtemp()
    if session.get_world_rank() == 0:
        # write to tmpdir
        # global rank 0 MUST report. otherwise it's as if you didn't checkpoint
        checkpoint = Checkpoint.from_directory(...)
    else:
        # create an "empty" checkpoint...
        # otherwise, if you just reported None, we throw an error
        # even worse, if you report a dict checkpoint here... unknown territory
        checkpoint = Checkpoint.from_directory(...)
    session.report(..., checkpoint=checkpoint)
```

After:

```python
def train_fn_per_worker(config):
    ...
    # ANY combination of workers can report a checkpoint
    if train.get_context().get_world_rank() in [2, 4, 6]:
        with tempfile.TemporaryDirectory() as tempdir:
            # write to tempdir
            train.report(metrics, checkpoint=Checkpoint.from_directory(tempdir))
    else:
        train.report(metrics)
```

Note: the reported metrics are still pulled from the global rank 0 worker (same behavior as before). This PR does not remove that restriction.
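
For illustration, a minimal sketch of what that note means in practice. The metric values and function body here are made up; the point is only that the dict reported by global rank 0 is the one Train/Tune records as the trial's result:

```python
import ray.train as train


def train_fn_per_worker(config):
    rank = train.get_context().get_world_rank()
    # Every worker calls report(), but the metrics surfaced to Tune come from
    # the global rank 0 worker; other ranks' metric dicts are not the ones
    # recorded for the trial.
    train.report({"loss": 0.1 * rank, "rank": rank})
```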

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

```python
# All workers reported a checkpoint to the same fs path, so there's
# no need to report multiple checkpoints to Tune.
at_least_one_reported_checkpoint = any(
    result.checkpoint is not None for result in results
)
```
Contributor

Is `results` gathered synchronously from all workers (i.e., is this a barrier)?

Contributor Author

Yep. Results are returned by the training iterator via `ray.get([worker.get_next_result.remote() for worker in workers])`.
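
A rough, self-contained sketch of that gather pattern (`_Worker` and `_TrainingResult` are illustrative stand-ins, not Ray Train's actual internal classes):

```python
from dataclasses import dataclass
from typing import Optional

import ray


@dataclass
class _TrainingResult:
    metrics: dict
    checkpoint: Optional[str] = None


@ray.remote
class _Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def get_next_result(self) -> _TrainingResult:
        # Every rank returns a result; only some ranks attach a checkpoint.
        ckpt = "some/fs/path" if self.rank in (2, 4) else None
        return _TrainingResult(metrics={"rank": self.rank}, checkpoint=ckpt)


workers = [_Worker.remote(rank) for rank in range(8)]

# ray.get() on the whole list blocks until every worker has produced its
# result for this iteration, which is what makes it act like a barrier.
results = ray.get([worker.get_next_result.remote() for worker in workers])

at_least_one_reported_checkpoint = any(
    result.checkpoint is not None for result in results
)
assert at_least_one_reported_checkpoint
```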

Contributor

@ericl ericl left a comment

This change makes sense to me, but could you elaborate on the motivation/use cases for this? Are there new edge cases this would expose, or is it a pretty straightforward win-win?

@ericl ericl added the @author-action-required label Aug 16, 2023
```python
        path=tune_session.storage.checkpoint_fs_path,
    )
    if at_least_one_reported_checkpoint
    else None
```
Contributor

Should we validate/assert that all the checkpoints point to the same path?
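
For reference, a hypothetical shape of such a check, written against the `results` and checkpoint objects from the hunk above (not code from this PR):

```python
def _validate_single_checkpoint_path(results) -> None:
    # Every worker that reported a checkpoint should point at the same
    # filesystem path as the one the driver registers with Tune.
    paths = {r.checkpoint.path for r in results if r.checkpoint is not None}
    assert len(paths) <= 1, (
        f"Workers reported checkpoints at different paths: {paths}"
    )
```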

@justinvyu
Contributor Author

I think it's purely win-win here. The motivating use case is with deepspeed checkpoints:

  • Lightning saves multiple ranks' DeepSpeed checkpoint shards to a single directory, and it's hard to de-aggregate them, so our integration should just have each local rank 0 worker upload the directory instead of every worker doing it. The global rank 0 worker may not be a local rank 0 worker, and without this PR, the checkpoint wouldn't get tracked by Tune internally if global rank 0 doesn't report (see the sketch below).

I think this is better for UX as well -- we don't enforce which worker can report a checkpoint.

@woshiyyya can elaborate more here.
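
A minimal sketch of that local-rank-0 pattern under the new behavior (the DeepSpeed save call is elided, and `ckpt_dir` and `metrics` are placeholders; this is not the actual Lightning integration code):

```python
import ray.train as train
from ray.train import Checkpoint


def train_fn_per_worker(config):
    ...
    metrics = {"loss": 0.0}  # placeholder
    ckpt_dir = "/tmp/deepspeed_ckpt"  # hypothetical node-local directory

    # Every local rank writes its own DeepSpeed shard into ckpt_dir, e.g.:
    # engine.save_checkpoint(ckpt_dir)  # elided

    if train.get_context().get_local_rank() == 0:
        # Only one worker per node reports the directory. With this PR, the
        # checkpoint is tracked even if global rank 0 is not one of the
        # reporting workers.
        train.report(metrics, checkpoint=Checkpoint.from_directory(ckpt_dir))
    else:
        train.report(metrics)
```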

@justinvyu justinvyu added the tests-ok label and removed the @author-action-required label Aug 17, 2023
@justinvyu justinvyu requested a review from ericl August 17, 2023 16:00
```python
)

checkpoint = (
    NewCheckpoint(
```
Contributor Author

One consequence of this: if users end up using our new framework checkpoints, the checkpoint loses its class here and always ends up as a generic checkpoint.
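
To illustrate the point (a hypothetical subclass; this shows the behavior being described, not code from the PR):

```python
from ray.train import Checkpoint


class MyFrameworkCheckpoint(Checkpoint):
    """Hypothetical framework-specific subclass with extra helpers."""

    def load_model(self):
        ...


# A worker may report a MyFrameworkCheckpoint, but because the driver rebuilds
# the checkpoint from the storage path alone, what gets tracked is a plain
# Checkpoint; the subclass (and its helpers) is not preserved.
tracked = Checkpoint(path="/tmp/exp/trial/checkpoint_000000")
assert type(tracked) is Checkpoint
```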

Member

@woshiyyya woshiyyya left a comment


Thanks @justinvyu! This change is mostly for distributed checkpointing. In DeepSpeed, one worker on each node should report the checkpoint, and that worker is not necessarily global rank 0.

This change gives us a more general and flexible interface: as long as one of the workers reports a checkpoint, we can always track it correctly.

@ericl ericl merged commit e1208a7 into ray-project:master Aug 17, 2023
42 of 44 checks passed
@justinvyu justinvyu deleted the air/persistence/rank0_ckpt branch August 17, 2023 19:04
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023

…eckpoint reporting (ray-project#38523)
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023

…eckpoint reporting (ray-project#38523)