[REP] Consolidated persistence API for Ray Train/Tune #35
Conversation
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Thanks for putting this together and adding the implementation recommendations! In the motivation, you should consider adding the prevalence of larger and larger models in generative AI and LLMs.
Good point; indeed, that is one of the reasons persistence has become an increasing pain point over the past year.
Generally I'm very much in favor of simplifying this.
Note that removing the support for automatic syncing will change the way users interact with the code. Until now, code "just worked" in distributed setups without specifying a storage path.
Now, fault-tolerant distributed training will not work out of the box on distributed setups without additional configuration from the user side.
This means that code that works locally may have to be changed to work on the cloud. We remove complexity from the library setup and push the responsibility to the cluster setup and configuration, i.e., to the user.
In many cases, this is preferable: synchronization of large model checkpoints was very inefficient, and was thus a footgun that could lead to problems. By making the storage configuration explicit, we can remove inefficient behavior (syncing large checkpoints via the object store).
It's just something to be aware of: our examples will all have to include a storage path definition, or, if users rely on the RAY_STORAGE environment variable, we have to call this out clearly in the docs to avoid a frustrating onboarding experience.
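To illustrate the configuration precedence being discussed, here is a minimal sketch of how an explicit storage path might interact with the RAY_STORAGE environment-variable fallback. The `resolve_storage_path` helper is hypothetical, for illustration only; it is not Ray's actual implementation.

```python
import os

def resolve_storage_path(storage_path=None):
    # Hypothetical resolution order: an explicitly configured path wins,
    # then the RAY_STORAGE environment variable; otherwise None, meaning
    # fault-tolerant distributed persistence is not configured.
    if storage_path is not None:
        return storage_path
    return os.environ.get("RAY_STORAGE")

# Explicit configuration takes precedence over the environment variable.
os.environ["RAY_STORAGE"] = "s3://team-bucket/ray-results"
print(resolve_storage_path("s3://my-bucket/results"))  # s3://my-bucket/results
print(resolve_storage_path())  # s3://team-bucket/ray-results
```

Under a scheme like this, examples that omit the explicit path still work on clusters where RAY_STORAGE is set, which is exactly why the docs need to mention the variable prominently.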
Thanks for the comments @krfricke @woshiyyya; they should be incorporated now.
Thanks for the update! One last question remains from my side.
Thanks for the clarifications.
@zhe-thoughts this can be merged
This REP proposes standardizing on the recently introduced `storage_path` option for Ray Train/Tune, deprecating other legacy persistence paths.