-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[RLlib] Bug fix: Failed EnvRunners are not restored if there is no local EnvRunner. #54091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Bug fix: Failed EnvRunners are not restored if there is no local EnvRunner. #54091
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a bug where failed EnvRunners were not restored when there was no local EnvRunner available by revising the restoration logic.
- Removed the early return that prevented restoration when both env_runner_group.local_env_runner and self.env_runner were missing.
- Introduced a fallback mechanism that synchronizes EnvRunner state from the LearnerGroup when no local EnvRunner exists.
Comments suppressed due to low confidence (2)
rllib/algorithms/algorithm.py:1888
- The early return was removed to allow fallback restoration. Consider asserting that env_runner_group is always non-null (based on its type hint) to avoid potential errors if an unexpected value is passed.
A list of EnvRunner indices that have been restored during the call of
rllib/algorithms/algorithm.py:1936
- In the fallback block that syncs state from the learner group, consider adding error handling or an assertion to ensure that 'self.learner_group' is non-null before its usage to avoid potential errors.
else:
…env_runners_not_restarting_on_new_stack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I guess the same has to be done for the OfflineEvaluationRunner
as well.
…cal EnvRunner. (#54091) Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Bug fix: Failed EnvRunners are not restored if there is no local EnvRunner.
The existing logic would restore restarted EnvRunners from the local one.
If there was no local one, the
Algorithm.restore_env_runners
method would early-out.The fix is to properly collect the module state from the Learners and local connector pipelines, instead and then sync the newly recreated EnvRunners.
Why are these changes needed?
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.