Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Improve excessive syncing warning and make some deprecations #45210

Merged
merged 11 commits into from
May 20, 2024

Conversation

justinvyu
Copy link
Contributor

Why are these changes needed?

This PR makes it so that the "Experiment state snapshotting has been triggered multiple..." warning message is less confusing and does not always trigger at the end of training.

  • Previously, the wording made it seem like this was an error, which makes it confusing when it's printed out along with an actual checkpointing error.
  • The warning would also often happen at the end of training, since we always trigger a forced driver sync at the end of a run, where a previous sync may have happened really recently. For example, a simple script like this will print out the error:
from ray import tune

kwargs = {"kwarg1": 1, "kwarg2": 2}


def train_fn(config, **trainable_kwargs):
    print(config, trainable_kwargs)


tune.Tuner(tune.with_parameters(train_fn, **kwargs)).fit()
Trial train_fn_a9462_00000 completed after 0 iterations at 2024-05-08 15:38:18. Total running time: 1s
2024-05-08 15:38:18,757 WARNING experiment_state.py:205 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).
2024-05-08 15:38:18,758 INFO tune.py:1007 -- Wrote the latest version of all result files and experiment state to '/Users/justin/ray_results/train_fn_2024-05-08_15-38-13' in 0.0031s.

This PR also makes some deprecations: TUNE_RESULT_DIR, RAY_AIR_LOCAL_CACHE_DIR, local_dir

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


)
if os.environ.get("TUNE_RESULT_DIR"):
warnings.warn(ENV_VAR_DEPRECATION_MESSAGE.format("TUNE_RESULT_DIR"))
raise DeprecationWarning(ENV_VAR_DEPRECATION_MESSAGE.format("TUNE_RESULT_DIR"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought is that it might be odd to raise if an environment variable is set, but I'm not sure what the best practice is, so this is probably fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering keeping it a warning for another release before just removing it.

Pros:

  • No breaking user code

Cons:

  • Implicit behavior change that users may be relying on will happen without warning / with no suggested fix

python/ray/tune/execution/experiment_state.py Show resolved Hide resolved
Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner May 16, 2024 21:30
@can-anyscale
Copy link
Collaborator

w00t I just rebase your PR on top of the latest master to avoid a bug in microcheck, hope you don't mind, thankks

@justinvyu justinvyu added the go add ONLY when ready to merge, run all tests label May 17, 2024
@justinvyu justinvyu merged commit baff597 into ray-project:master May 20, 2024
7 checks passed
@justinvyu justinvyu deleted the fix_warning_wording branch May 20, 2024 20:15
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
…ay-project#45210)

Make the "Experiment state snapshotting has been triggered multiple..." warning message is less confusing, and remove the false positive log at the end of every run. Also makes some deprecations of `TUNE_RESULT_DIR`,
`RAY_AIR_LOCAL_CACHE_DIR`, `local_dir` legacy settings.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
…ay-project#45210)

Make the "Experiment state snapshotting has been triggered multiple..." warning message is less confusing, and remove the false positive log at the end of every run. Also makes some deprecations of `TUNE_RESULT_DIR`,
`RAY_AIR_LOCAL_CACHE_DIR`, `local_dir` legacy settings.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
…ay-project#45210)

Make the "Experiment state snapshotting has been triggered multiple..." warning message is less confusing, and remove the false positive log at the end of every run. Also makes some deprecations of `TUNE_RESULT_DIR`,
`RAY_AIR_LOCAL_CACHE_DIR`, `local_dir` legacy settings.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
GabeChurch pushed a commit to GabeChurch/ray that referenced this pull request Jun 11, 2024
…ay-project#45210)

Make the "Experiment state snapshotting has been triggered multiple..." warning message is less confusing, and remove the false positive log at the end of every run. Also makes some deprecations of `TUNE_RESULT_DIR`,
`RAY_AIR_LOCAL_CACHE_DIR`, `local_dir` legacy settings.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Signed-off-by: gchurch <gabe1church@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants