[air] pyarrow.fs persistence: Pass StorageContext to Train workers (2/n) #37909

justinvyu
Contributor

Why are these changes needed?

This PR passes a storage context to the Train workers of a DataParallelTrainer, in preparation for using its storage_filesystem to upload checkpoints and artifacts directly when train.report is called.

The storage context is passed along the chain Trial -> Trainable -> DataParallelTrainer -> Train worker, but it is ignored at the trial/trainable level. The Trainable syncing paths are disabled when the new persistence mode is enabled, so for now class and function trainables do not sync their checkpoints and artifacts in that mode. Follow-up PRs will fix this by unifying (1) the train.report implementation and (2) function and class trainables.
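To make the pass-through concrete, here is a minimal sketch of the idea, not the code in this PR. `ToyStorageContext`, `ToyTrainer`, `ToyWorker`, and the `checkpoint_000000` directory naming are all hypothetical, and `shutil.copytree` on a local directory is a stand-in for copying through a `pyarrow.fs` filesystem. The intermediate layer only forwards the context; the worker is the one that uses it when it reports.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class ToyStorageContext:
    """Hypothetical stand-in for the storage context described above."""

    storage_root: Path  # stand-in for a storage filesystem plus base path
    experiment_name: str
    trial_name: str

    def checkpoint_dir(self, index: int) -> Path:
        # The directory naming scheme here is an assumption for illustration.
        return (
            self.storage_root
            / self.experiment_name
            / self.trial_name
            / f"checkpoint_{index:06d}"
        )


class ToyWorker:
    """The only layer that actually uses the context."""

    def __init__(self, storage: ToyStorageContext) -> None:
        self.storage = storage
        self._num_reports = 0

    def report(self, local_checkpoint_dir: Path) -> Path:
        # Upload directly at report time; copytree creates missing parents.
        dest = self.storage.checkpoint_dir(self._num_reports)
        shutil.copytree(local_checkpoint_dir, dest)
        self._num_reports += 1
        return dest


class ToyTrainer:
    """Forwards the context to its workers without using it itself."""

    def __init__(self, storage: ToyStorageContext, num_workers: int) -> None:
        self.workers: List[ToyWorker] = [
            ToyWorker(storage) for _ in range(num_workers)
        ]


def run_demo() -> Path:
    root = Path(tempfile.mkdtemp())
    storage = ToyStorageContext(root / "storage", "exp", "trial_0")
    trainer = ToyTrainer(storage, num_workers=2)

    # Write a fake local checkpoint, then report it from the first worker.
    local = root / "local_ckpt"
    local.mkdir()
    (local / "model.txt").write_text("weights")
    return trainer.workers[0].report(local)
```

Calling `run_demo()` returns the destination directory under the storage root, with `model.txt` copied into it. With a real remote filesystem the copy step would go through that filesystem instead of `shutil`.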

This PR also fixes a race condition in which the training thread could start before the checkpoint path was set on all workers. Without the added on_init_session, accessing session.checkpoint_uri immediately at the start of the training loop would return None. This is not a serious problem, but it is unintended behavior, and it caused the test case I added to fail.
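The ordering fix can be illustrated with a small threading sketch. This is an assumption-laden toy, not Ray's implementation: `ToySession`, `launch`, and the callback signature given to `on_init_session` are hypothetical, and the URI is a made-up placeholder. The point is only that every session is fully initialized before any training thread is started.

```python
import threading
from typing import Callable, List, Optional


class ToySession:
    """Hypothetical per-worker session that owns the training thread."""

    def __init__(self, train_fn: Callable[["ToySession"], None]) -> None:
        self.checkpoint_uri: Optional[str] = None
        self._thread = threading.Thread(target=train_fn, args=(self,))

    def start(self) -> None:
        self._thread.start()

    def join(self) -> None:
        self._thread.join()


def launch(
    num_workers: int,
    on_init_session: Callable[[ToySession], None],
) -> List[Optional[str]]:
    """Return the checkpoint_uri each training loop sees on its first line."""
    seen: List[Optional[str]] = []
    lock = threading.Lock()

    def train_fn(session: ToySession) -> None:
        # Read the URI immediately, as a training loop might.
        with lock:
            seen.append(session.checkpoint_uri)

    sessions = [ToySession(train_fn) for _ in range(num_workers)]

    # Step 1: initialize every session before any thread exists.
    for session in sessions:
        on_init_session(session)

    # Step 2: only now start the training threads.
    for session in sessions:
        session.start()
    for session in sessions:
        session.join()
    return seen


def set_uri(session: ToySession) -> None:
    # Placeholder URI, purely illustrative.
    session.checkpoint_uri = "memory://exp/trial_0"
```

With `launch(2, set_uri)`, both training functions observe the URI rather than None, because initialization strictly precedes `start()`. If the threads were started first and the URI assigned afterwards, the first read could race with the assignment.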

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Update to use the new checkpoint id attribute

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Add todo comment to remove legacy path

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
python/ray/train/_internal/backend_executor.py (review comment: outdated, resolved)
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2023
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…persistence/storage_context_to_worker_pt1
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Fix lint

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu force-pushed the air/persistence/storage_context_to_worker_pt1 branch from c40170e to fcaf3ba Compare July 28, 2023 23:36
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@matthewdeng matthewdeng merged commit 2424388 into ray-project:master Jul 29, 2023
71 of 74 checks passed
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…s (2/n) (ray-project#37909)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…s (2/n) (ray-project#37909)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…s (2/n) (ray-project#37909)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…s (2/n) (ray-project#37909)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
@justinvyu justinvyu deleted the air/persistence/storage_context_to_worker_pt1 branch September 13, 2023 18:55
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…s (2/n) (ray-project#37909)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
3 participants