[RLlib] Bug fix: Failed EnvRunners are not restored if there is no local EnvRunner. #54091

sven1977 · 2025-06-25T15:59:33Z

Bug fix: Failed EnvRunners are not restored if there is no local EnvRunner.

The existing logic restored restarted EnvRunners from the local EnvRunner.
If there was no local EnvRunner, the Algorithm.restore_env_runners method returned early without restoring anything.

The fix is to instead collect the module state from the Learners and the local connector pipelines, and then sync the newly recreated EnvRunners with it.
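
For readers unfamiliar with the code path, the behavior described above can be sketched roughly as follows. This is a simplified toy model, not the actual RLlib implementation: `ToyEnvRunner`, the free function `restore_env_runners`, its parameters, and the state keys `"rl_module"` and `"connectors"` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyEnvRunner:
    """Hypothetical stand-in for an EnvRunner; it only holds a state dict."""

    state: Dict[str, Any] = field(default_factory=dict)

    def get_state(self) -> Dict[str, Any]:
        return dict(self.state)

    def set_state(self, state: Dict[str, Any]) -> None:
        self.state = dict(state)


def restore_env_runners(
    restarted: List[ToyEnvRunner],
    local_env_runner: Optional[ToyEnvRunner],
    learner_module_state: Dict[str, Any],
    connector_state: Dict[str, Any],
) -> int:
    """Sync freshly restarted runners; return how many were synced."""
    if not restarted:
        return 0
    if local_env_runner is not None:
        # Previous path: the local EnvRunner is the source of the state.
        state = local_env_runner.get_state()
    else:
        # Fallback path (assumed shape): there is no local EnvRunner, so the
        # state is assembled from the Learner-side module state and the
        # local connector pipelines instead of returning early.
        state = {
            "rl_module": learner_module_state,
            "connectors": connector_state,
        }
    for runner in restarted:
        runner.set_state(state)
    return len(restarted)


# Usage: two restarted runners, no local EnvRunner available.
runners = [ToyEnvRunner(), ToyEnvRunner()]
num_restored = restore_env_runners(
    runners, None, {"w": 1.0}, {"obs_filter": "mean_std"}
)
```

The point of the sketch is only the branch structure: the missing local EnvRunner selects a different state source rather than aborting the restore.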

Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Copilot

Pull Request Overview

This PR fixes a bug where failed EnvRunners were not restored when there was no local EnvRunner available by revising the restoration logic.

Removed the early return that prevented restoration when both env_runner_group.local_env_runner and self.env_runner were missing.
Introduced a fallback mechanism that synchronizes EnvRunner state from the LearnerGroup when no local EnvRunner exists.

Comments suppressed due to low confidence (2)

rllib/algorithms/algorithm.py:1888

The early return was removed to allow fallback restoration. Consider asserting that env_runner_group is always non-null (based on its type hint) to avoid potential errors if an unexpected value is passed.

            A list of EnvRunner indices that have been restored during the call of

rllib/algorithms/algorithm.py:1936

In the fallback block that syncs state from the learner group, consider adding error handling or an assertion to ensure that 'self.learner_group' is non-null before its usage to avoid potential errors.

            else:
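
The guard suggested in the second suppressed comment could look something like the sketch below. `require_learner_group` is a hypothetical helper written for illustration; it is not part of RLlib, and the PR does not show whether such a check was added.

```python
from typing import Any, Optional


def require_learner_group(learner_group: Optional[Any]) -> Any:
    """Return learner_group, or raise a descriptive error if it is missing."""
    if learner_group is None:
        # Fail loudly instead of hitting an AttributeError further down.
        raise RuntimeError(
            "Cannot restore EnvRunners: there is no local EnvRunner and no "
            "LearnerGroup to pull the module state from."
        )
    return learner_group
```

Raising an explicit error, rather than using a bare `assert`, keeps the check active even when Python runs with assertions disabled.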

simonsays1980

LGTM. I guess the same has to be done for the OfflineEvaluationRunner as well.

fix

5fabb87

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Copilot AI review requested due to automatic review settings June 25, 2025 15:59

sven1977 requested a review from a team as a code owner June 25, 2025 15:59

sven1977 enabled auto-merge (squash) June 25, 2025 15:59

github-actions bot added the go label Jun 25, 2025

Copilot AI reviewed Jun 25, 2025

sven1977 added 2 commits June 27, 2025 10:55

wip

fa728dd

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

2c9e606

Signed-off-by: sven1977 <svenmika1977@gmail.com>

github-actions bot disabled auto-merge June 27, 2025 15:01

sven1977 assigned simonsays1980 Jun 27, 2025

sven1977 added rllib rllib-checkpointing-or-recovery rllib-envrunners labels Jun 27, 2025

sven1977 enabled auto-merge (squash) June 27, 2025 15:04

sven1977 disabled auto-merge June 27, 2025 15:04

sven1977 added 4 commits June 28, 2025 10:27

fix

16db2dc

Signed-off-by: sven1977 <svenmika1977@gmail.com>

TEST

52a94bc

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into fix_…

5c13235

…env_runners_not_restarting_on_new_stack

TEST

1bc44f6

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 added the tests-ok label Jun 29, 2025

simonsays1980 approved these changes Jun 30, 2025

sven1977 merged commit 73c8df7 into ray-project:master Jul 1, 2025
5 checks passed

sven1977 deleted the fix_env_runners_not_restarting_on_new_stack branch July 1, 2025 15:16

elliot-barn pushed a commit that referenced this pull request Jul 2, 2025

[RLlib] Bug fix: Failed EnvRunners are not restored if there is no lo…

36523ae

…cal EnvRunner. (#54091) Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
