
[train+tune] Local directory refactor (1/n): Write launcher state files (tuner.pkl, trainer.pkl) directly to storage #43369

Merged

justinvyu merged 32 commits into ray-project:master from justinvyu:upload_pkl_directly on Mar 1, 2024

Conversation

@justinvyu (Contributor) commented Feb 23, 2024

Why are these changes needed?

This PR updates `Trainer`s and the `Tuner` to upload their state directly to `storage_path`, rather than dumping it in a local directory and relying on driver syncing to upload it.

This removes the dependency of these pkl files on `_get_defaults_results_dir`, the directory that defaults to `~/ray_results` and can be overridden by environment variables.
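To illustrate the idea, here is a minimal sketch of writing launcher state straight to storage, assuming a pyarrow-compatible `storage_path` (the helper name `save_launcher_state` is hypothetical, not the PR's actual code):

```python
import pickle

import pyarrow.fs


def save_launcher_state(state: object, storage_path: str, filename: str) -> None:
    """Pickle a launcher state object (e.g. for tuner.pkl) directly to storage."""
    # Resolve the filesystem and path from the URI: local dir, s3://, gs://, etc.
    fs, fs_path = pyarrow.fs.FileSystem.from_uri(storage_path)
    fs.create_dir(fs_path, recursive=True)
    # Write straight to storage -- no ~/ray_results staging copy, no driver sync.
    with fs.open_output_stream(f"{fs_path}/{filename}") as f:
        f.write(pickle.dumps(state))


# Usage sketch:
# save_launcher_state(tuner_state, "s3://my-bucket/ray_results/my_run", "tuner.pkl")
```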

TODOs for follow-up PRs

There's currently an issue where multiple runs with the same `RunConfig(name)` can produce conflicting versions of `tuner.pkl`. This can lead to a bug where the files needed to restore a run can be overwritten by a mismatched version. See #43369 (comment) for more details.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Commits (each signed off by Justin Yu <justinvyu@anyscale.com>):

  • update trainer._save usage in test
  • …entrypoints
Code excerpt under review:

self._run_config.name
or StorageContext.get_experiment_dir_name(self.converted_trainable)
)
storage = StorageContext(
Contributor commented:
Should we be trying to instantiate this StorageContext only in one place and then pass it around?

@justinvyu (Contributor, Author) replied Feb 26, 2024:
This is hard to do, since we'd need to pass the context through from `BaseTrainer.fit` to the public `Tuner` interface.

One alternative we discussed is having a global storage context, but that'd also require `get_or_create_storage_context` logic to account for users coming in through any of the 3 entrypoints (`Tuner`, `Trainer`, `tune.run`).

I will clarify that the usage here is just to access the path and the filesystem, so that we don't re-implement that logic.
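For reference, a hypothetical sketch of that global-context alternative (illustrative only; not what this PR implements, and the `StorageContext` import path is an internal API that may vary by Ray version):

```python
from typing import Optional

from ray.train._internal.storage import StorageContext  # internal API

_storage_context: Optional[StorageContext] = None


def get_or_create_storage_context(*args, **kwargs) -> StorageContext:
    # Lazily build a single shared context so all three entrypoints
    # (Tuner, Trainer, tune.run) resolve the experiment path/filesystem once.
    global _storage_context
    if _storage_context is None:
        _storage_context = StorageContext(*args, **kwargs)
    return _storage_context
```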

Resolved thread: python/ray/tune/impl/tuner_internal.py
Further commits (same sign-off):

  • fix lint
  • …pickled one is instead)
@@ -713,7 +713,7 @@ def create_trainable_with_params():
 )
 return trainable_with_params

-exp_name = "restore_with_params"
+exp_name = f"restore_with_params-{use_function_trainable=}"
@justinvyu (Contributor, Author) commented:
This change was required to avoid a bug where:

  • Both tests would use the same experiment name and save things to the same driver staging dir.
  • tuner.pkl gets downloaded to the staging dir when the test first runs with use_function_trainable=True.
  • Then, when the second run happens, it writes the correct tuner.pkl to storage, but the subsequent driver sync re-uploads the stale tuner.pkl that was downloaded earlier.
  • This caused an error when restoring the second run.

This problem gets fixed in the follow-up PR by using unique staging directories for each new Ray Train experiment (sketched below).
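A minimal sketch of what such unique staging directories could look like (the helper name and timestamp format are assumptions, not the follow-up PR's actual code):

```python
import time
from pathlib import Path


def get_staging_dir(base_dir: Path, experiment_name: str) -> Path:
    # Suffix the staging dir with a launch timestamp so a second run with the
    # same experiment name never reuses (and re-uploads) a stale tuner.pkl.
    timestamp = time.strftime("%Y-%m-%d_%H-%M-%S")
    staging_dir = base_dir / f"{experiment_name}_{timestamp}"
    staging_dir.mkdir(parents=True, exist_ok=True)
    return staging_dir
```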

Member commented:
Users should never run Tune twice in the same job? I think this will no longer be a problem after your second PR, which separates driver staging directories with a timestamp.

@woshiyyya (Member) left a review:
Looks good to me. Left some minor comments.

@@ -42,14 +42,6 @@
 _SELF = "self"


-_TUNER_FAILED_MSG = (
Member commented:
Why do we no longer send this failure message?

@justinvyu (Contributor, Author) replied:
This only caught `TuneError`, which only gets raised from `tune.run(raise_on_failed_trial=True)`. However, the `Tuner` always passes `False` for this flag, so this error message never actually gets printed:

raise_on_failed_trial=False,
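A self-contained illustration of why that branch was dead code (simplified stand-ins, not Ray's actual source):

```python
class TuneError(Exception):
    pass


def run(raise_on_failed_trial: bool = True) -> None:
    trial_failed = True  # pretend a trial failed
    if trial_failed and raise_on_failed_trial:
        raise TuneError("Trials did not complete")


def tuner_fit() -> None:
    try:
        # The Tuner hard-codes raise_on_failed_trial=False ...
        run(raise_on_failed_trial=False)
    except TuneError:
        # ... so this branch, which printed _TUNER_FAILED_MSG, is unreachable.
        print("_TUNER_FAILED_MSG")


tuner_fit()  # completes without ever printing the failure message
```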


@@ -302,7 +302,6 @@ def test_run_config_in_trainer_and_tuner(
 assert not (tmp_path / "trainer").exists()
 assert both_msg not in caplog.text
 else:
-assert tuner._local_tuner.get_run_config() == RunConfig()
Member commented:
Is it because we inject a default storage context into the run config if not specified?

@justinvyu (Contributor, Author) replied:
This is because I set `run_config.name` in the trainer and then pass that along to the tuner, so the tuner's run config is no longer the default `RunConfig`. This ensures we don't generate the experiment name multiple times, which could lead to different folders being used by the trainer vs. the tuner (see the sketch below).
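A simplified sketch of that flow (the function name is illustrative; `StorageContext.get_experiment_dir_name` is the generator referenced in the code excerpt earlier in the review, and the import path is an internal API):

```python
from ray.train import RunConfig
from ray.train._internal.storage import StorageContext  # internal API


def resolve_run_config(run_config: RunConfig, trainable) -> RunConfig:
    # Generate the experiment name exactly once, on the trainer side, so the
    # tuner receives the same name and both write to one experiment folder.
    if not run_config.name:
        run_config.name = StorageContext.get_experiment_dir_name(trainable)
    return run_config
```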

@justinvyu justinvyu merged commit 58edaa4 into ray-project:master Mar 1, 2024
9 checks passed
@justinvyu justinvyu deleted the upload_pkl_directly branch March 1, 2024 19:12
hebiao064 pushed a commit to hebiao064/ray that referenced this pull request Mar 12, 2024
…es (`tuner.pkl`, `trainer.pkl`) directly to storage (ray-project#43369)

This PR updates `Trainer`s and the `Tuner` to upload their state directly to `storage_path`, rather than dumping it in a local directory and relying on driver syncing to upload it.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>