[REP] Consolidated persistence API for Ray Train/Tune #35
Conversation
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Thanks for putting this together and adding the implementation recommendations! In the motivation, you should consider adding the prevalence of larger and larger models in generative AI and LLMs.
Good point; indeed, that is one of the reasons persistence has become an increasing pain point over the past year.
Generally I'm very much in favor of simplifying this.
Note that removing the support for automatic syncing will change the way users interact with the code. Until now, code "just worked" in distributed setups without specifying a storage path.
Now, fault-tolerant distributed training will not work out of the box on distributed setups without additional configuration from the user side.
This means that code that works locally may have to be changed to work on the cloud. We remove complexity from the library setup and push the responsibility to the cluster setup and configuration, i.e., to the user.
In many cases, this is preferable: synchronization of large model checkpoints was very inefficient, and was thus a footgun that could lead to problems. By making the storage configuration explicit, we can remove inefficient behavior (syncing large checkpoints via the object store).
It's just something to be aware of: our examples will all have to include a storage path definition, or, if users rely on the RAY_STORAGE environment variable, we have to call this out clearly in the docs to avoid a frustrating onboarding experience.
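To illustrate the configuration precedence being discussed, here is a minimal sketch of how an explicit storage path might interact with the RAY_STORAGE environment-variable fallback. The `resolve_storage_path` helper is hypothetical, for illustration only; it is not Ray's actual implementation.

```python
import os

def resolve_storage_path(storage_path=None):
    # Hypothetical resolution order: an explicitly configured path wins,
    # then the RAY_STORAGE environment variable; otherwise None, meaning
    # fault-tolerant distributed persistence is not configured.
    if storage_path is not None:
        return storage_path
    return os.environ.get("RAY_STORAGE")

# Explicit configuration takes precedence over the environment variable.
os.environ["RAY_STORAGE"] = "s3://team-bucket/ray-results"
print(resolve_storage_path("s3://my-bucket/results"))  # s3://my-bucket/results
print(resolve_storage_path())  # s3://team-bucket/ray-results
```

Under a scheme like this, examples that omit the explicit path still work on clusters where RAY_STORAGE is set, which is exactly why the docs need to mention the variable prominently.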
Thanks for the comments @krfricke @woshiyyya; they should be incorporated now.
Thanks for the update! One last question remains from my side.
Thanks for the clarifications.
@zhe-thoughts this can be merged
This REP proposes standardizing on the recently introduced `storage_path` option for Ray Train/Tune, deprecating other legacy persistence paths.