Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Fix hyperband r calculation and stopping #39157

Merged
merged 5 commits into from
Sep 8, 2023

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Aug 31, 2023

Why are these changes needed?

This PR fixes two flaws in the current hyperband implementation.

1. Bug in the r calculation.

#1620 introduced a minimum constraint in the r calculation during successive halving with r = min(r, max_t - prev_r). It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed.

E.g. for s=1, max_t=8 and eta=2, we get r0 = 4. Then r = r0 * 2 = 8. With the current formula, we then get r1 = min(r, max_t - r0) = min(8, 8-4) = 4. Thus, r0 = r1 and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed.

I believe the correct fix here is to set r = min(r, max_t). I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper.

2. Stopping of "overstepped" trials.

The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than r_i and r_(i+1), it can hang forever. This is because "good" trials are only continued if bracket.continue_trial(t) returns True. However, if the trial already overstepped r_(i+1), this can return False (specifically when stop_last_trials=True). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever.

This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition.

Related issue number

Closes #37605

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 2 commits August 31, 2023 13:16
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these fixes make sense. I'm not so familiar with our hyperband implementation so would need to spend more time on this for a full review.

If tests are passing, happy to get this in now (if aiming for 2.7)

@@ -276,7 +276,7 @@ def _process_bracket(
# kill bad trials
self._num_stopped += len(bad)
for t in bad:
if t.status == Trial.PAUSED:
if t.status == Trial.PAUSED or t.is_saving:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May lose the checkpoint here if we stop while saving?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is avoiding the situation when the trial is in the middle of pausing:

            if should_checkpoint:
                self._cached_trial_decisions[trial.trial_id] = TrialScheduler.PAUSE
                future_result = self._schedule_trial_save(
                    trial=trial, storage=CheckpointStorage.PERSISTENT
                )
                trial.temporary_state.saving_to = future_result

where the status is technically still "RUNNING" while the save is resolving.

I think it's fine to miss the checkpoint here. The trial is bad anyway and is going to be early-terminated.

@krfricke
Copy link
Contributor Author

krfricke commented Sep 7, 2023

I think these fixes make sense. I'm not so familiar with our hyperband implementation so would need to spend more time on this for a full review.

If tests are passing, happy to get this in now (if aiming for 2.7)

Tests seem to be passing fine - and yes, let's pick for 2.7 as it fixes behavior relating to the new storage path

@krfricke krfricke added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. v2.7.0-pick labels Sep 7, 2023
@krfricke krfricke merged commit 212cd04 into ray-project:master Sep 8, 2023
59 of 63 checks passed
@krfricke krfricke deleted the tune/hyperband-stop branch September 8, 2023 08:36
krfricke added a commit to krfricke/ray that referenced this pull request Sep 8, 2023
This PR fixes two flaws in the current hyperband implementation.

#### 1. Bug in the `r` calculation.

ray-project#1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed.

E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed.

I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper.

#### 2. Stopping of "overstepped" trials.

The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever.

This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition.

Signed-off-by: Kai Fricke <kai@anyscale.com>
GeneDer pushed a commit that referenced this pull request Sep 9, 2023
This PR fixes two flaws in the current hyperband implementation.

#### 1. Bug in the `r` calculation.

#1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed.

E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed.

I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper.

#### 2. Stopping of "overstepped" trials.

The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever.

This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition.

Signed-off-by: Kai Fricke <kai@anyscale.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
This PR fixes two flaws in the current hyperband implementation.

#### 1. Bug in the `r` calculation.

ray-project#1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed.

E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed.

I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper.

#### 2. Stopping of "overstepped" trials.

The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever.

This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
This PR fixes two flaws in the current hyperband implementation.

#### 1. Bug in the `r` calculation.

ray-project#1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed.

E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed.

I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper.

#### 2. Stopping of "overstepped" trials.

The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever.

This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Tune] HyperBand and ASHA not scheduling trials correctly.
2 participants