[tune] Fix hyperband r calculation and stopping #39157

krfricke · 2023-08-31T11:31:04Z

Why are these changes needed?

This PR fixes two flaws in the current hyperband implementation.

1. Bug in the `r` calculation.

#1620 introduced a minimum constraint in the r calculation during successive halving with r = min(r, max_t - prev_r). It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed.

E.g. for s=1, max_t=8 and eta=2, we get r0 = 4. Then r = r0 * 2 = 8. With the current formula, we then get r1 = min(r, max_t - r0) = min(8, 8-4) = 4. Thus, r0 = r1 and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed.

I believe the correct fix here is to set r = min(r, max_t). I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper.

2. Stopping of "overstepped" trials.

The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than r_i and r_(i+1), it can hang forever. This is because "good" trials are only continued if bracket.continue_trial(t) returns True. However, if the trial already overstepped r_(i+1), this can return False (specifically when stop_last_trials=True). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever.

This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition.

Related issue number

Closes #37605

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>

…yperband-stop

justinvyu

I think these fixes make sense. I'm not so familiar with our hyperband implementation so would need to spend more time on this for a full review.

If tests are passing, happy to get this in now (if aiming for 2.7)

justinvyu · 2023-09-06T18:12:38Z

python/ray/tune/schedulers/hyperband.py

@@ -276,7 +276,7 @@ def _process_bracket(
            # kill bad trials
            self._num_stopped += len(bad)
            for t in bad:
-                if t.status == Trial.PAUSED:
+                if t.status == Trial.PAUSED or t.is_saving:


May lose the checkpoint here if we stop while saving?

This change is avoiding the situation when the trial is in the middle of pausing:

if should_checkpoint: self._cached_trial_decisions[trial.trial_id] = TrialScheduler.PAUSE future_result = self._schedule_trial_save( trial=trial, storage=CheckpointStorage.PERSISTENT ) trial.temporary_state.saving_to = future_result

where the status is technically still "RUNNING" while the save is resolving.

I think it's fine to miss the checkpoint here. The trial is bad anyway and is going to be early-terminated.

krfricke · 2023-09-07T08:59:18Z

I think these fixes make sense. I'm not so familiar with our hyperband implementation so would need to spend more time on this for a full review.

If tests are passing, happy to get this in now (if aiming for 2.7)

Tests seem to be passing fine - and yes, let's pick for 2.7 as it fixes behavior relating to the new storage path

@richardliaw

This PR fixes two flaws in the current hyperband implementation. #### 1. Bug in the `r` calculation. ray-project#1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed. E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed. I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper. #### 2. Stopping of "overstepped" trials. The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever. This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition. Signed-off-by: Kai Fricke <kai@anyscale.com>

@richardliaw

This PR fixes two flaws in the current hyperband implementation. #### 1. Bug in the `r` calculation. #1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed. E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed. I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper. #### 2. Stopping of "overstepped" trials. The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever. This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition. Signed-off-by: Kai Fricke <kai@anyscale.com>

@richardliaw

This PR fixes two flaws in the current hyperband implementation. #### 1. Bug in the `r` calculation. ray-project#1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed. E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed. I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper. #### 2. Stopping of "overstepped" trials. The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever. This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>

@richardliaw

This PR fixes two flaws in the current hyperband implementation. #### 1. Bug in the `r` calculation. ray-project#1620 introduced a minimum constraint in the `r` calculation during successive halving with `r = min(r, max_t - prev_r)`. It's unclear to me where this is coming from (cc @richardliaw), but in my opinion this is flawed. E.g. for `s=1`, `max_t=8` and `eta=2`, we get `r0 = 4`. Then `r = r0 * 2 = 8`. With the current formula, we then get `r1 = min(r, max_t - r0) = min(8, 8-4) = 4`. Thus, `r0 = r1` and the bracket already "finished" after 4 iterations and should terminate all trials. Or in other words, none of the trials in this bracket will ever proceed. I believe the correct fix here is to set `r = min(r, max_t)`. I couldn't find a reference implementation for comparison, but it logically makes sense and seems to match the formula behavior described in the paper. #### 2. Stopping of "overstepped" trials. The first bug revealed a second shortcoming in the current implementation. When a trial reports a timestep that is higher than `r_i` and `r_(i+1)`, it can hang forever. This is because "good" trials are only continued if `bracket.continue_trial(t)` returns `True`. However, if the trial already overstepped `r_(i+1)`, this can return `False` (specifically when `stop_last_trials=True`). In that case, a paused or running trial will not be terminated nor continued, and instead hang forever. This second case is fixed in this PR by introducing another clause in the processing of "good" trials that checks for this condition. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Victor <vctr.y.m@example.com>

Kai Fricke added 2 commits August 31, 2023 13:16

[tune] Fix hyperband r calculation and stopping

fefb19a

Signed-off-by: Kai Fricke <kai@anyscale.com>

fix copy/paste error

a9fe1bd

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke requested a review from justinvyu August 31, 2023 11:31

krfricke assigned justinvyu Aug 31, 2023

krfricke mentioned this pull request Aug 31, 2023

[Tune] HyperBand and ASHA not scheduling trials correctly. #37605

Closed

krfricke and others added 3 commits September 5, 2023 09:54

Merge branch 'master' into tune/hyperband-stop

7a50471

Merge branch 'master' into tune/hyperband-stop

e5e2ebd

Merge remote-tracking branch 'origin/tune/hyperband-stop' into tune/h…

8d4c50f

…yperband-stop

justinvyu reviewed Sep 6, 2023

View reviewed changes

krfricke added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. v2.7.0-pick labels Sep 7, 2023

krfricke requested a review from justinvyu September 7, 2023 13:56

justinvyu approved these changes Sep 7, 2023

View reviewed changes

krfricke merged commit 212cd04 into ray-project:master Sep 8, 2023
59 of 63 checks passed

krfricke deleted the tune/hyperband-stop branch September 8, 2023 08:36

GeneDer mentioned this pull request Sep 8, 2023

[2.7.0/tune] Fix hyperband r calculation and stopping (#39157) #39451

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[tune] Fix hyperband r calculation and stopping #39157

[tune] Fix hyperband r calculation and stopping #39157

krfricke commented Aug 31, 2023 •

edited

Loading

justinvyu left a comment

justinvyu Sep 6, 2023

krfricke Sep 7, 2023

krfricke commented Sep 7, 2023

[tune] Fix hyperband r calculation and stopping #39157

[tune] Fix hyperband r calculation and stopping #39157

Conversation

krfricke commented Aug 31, 2023 • edited Loading

Why are these changes needed?

1. Bug in the r calculation.

2. Stopping of "overstepped" trials.

Related issue number

Checks

justinvyu left a comment

Choose a reason for hiding this comment

justinvyu Sep 6, 2023

Choose a reason for hiding this comment

krfricke Sep 7, 2023

Choose a reason for hiding this comment

krfricke commented Sep 7, 2023

krfricke commented Aug 31, 2023 •

edited

Loading

1. Bug in the `r` calculation.