Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REP] Consolidated persistence API for Ray Train/Tune #35

Merged
merged 11 commits into from
Jul 20, 2023

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Jul 7, 2023

This REP proposes standardizing on the recently introduced storage_path option for Ray Train/Tune, deprecating other legacy persistence paths.

ericl added 3 commits July 6, 2023 17:22
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
@pcmoritz
Copy link
Collaborator

pcmoritz commented Jul 9, 2023

Thanks for putting this together and adding the implementation recommendations!

In the motivation you should consider adding the prevalence of larger and larger models in generative AI and LLMs.

Signed-off-by: Eric Liang <ekhliang@gmail.com>
@ericl
Copy link
Contributor Author

ericl commented Jul 9, 2023

In the motivation you should consider adding the prevalence of larger and larger models in generative AI and LLMs.

Good point, indeed that is one of the reasons why persistence has become increasingly a pain point over the past year.

Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I'm very much in favor of simplifying this.

Note that removing the support for automatic syncing will change the way users interact with the code. Until now, code "just worked" in distributed setups without specifying a storage path.

Now, fault-tolerant distributed training will not work out of the box on distributed setups without additional configuration from the user side.

This means that the code that works locally may have to be changed to work on the cloud. We remove complexity from the library setup and push the responsibility to the cluster setup and configuration, i.e. to the user.

In many cases, this is preferred - synchronization of large model checkpoints was very inefficient, and was thus a footgun that could lead to problems. By making the storage configuration explicit we can remove inefficient behavior (syncing large checkpoints via the object store).

It's just something to be aware of - our examples will all have to include a storage path definition, or, if users rely on the RAY_STORAGE environment variable, they have to be aware of this in the docs to avoid a frustrating onboarding experience.

reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
Signed-off-by: Eric Liang <ekhliang@gmail.com>
@ericl
Copy link
Contributor Author

ericl commented Jul 11, 2023

Thanks for the comments @krfricke @woshiyyya , they should be incorporated now.

reps/2023-06-06-simplify-sync.md Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! One last question remaining from my side.

reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarifications

Signed-off-by: Eric Liang <ekhliang@gmail.com>
reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Show resolved Hide resolved
reps/2023-06-06-simplify-sync.md Outdated Show resolved Hide resolved
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Eric Liang <ekhliang@gmail.com>
@ericl
Copy link
Contributor Author

ericl commented Jul 20, 2023

@zhe-thoughts this can be merged

@zhe-thoughts zhe-thoughts merged commit f70c41c into main Jul 20, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants