-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Add on_workers_recreated
callback to Algorithm.
#40354
[RLlib] Add on_workers_recreated
callback to Algorithm.
#40354
Conversation
…e64_add_on_worker_created_callback
…e64_add_on_worker_created_callback
…llback' into issue64_add_on_worker_created_callback
…e64_add_on_worker_created_callback
on_worker_created
callback to Algorithm/WorkerSet.on_workers_recreated
callback to Algorithm.
rllib/algorithms/callbacks.py
Outdated
Note that any "worker" inside the algorithm's `self.worker` and | ||
`self.evaluation_workers` WorkerSets are instances of a subclass of EnvRunner. | ||
|
||
.. code-block:: python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use code-blocks anymore, we use testcode
and testoutput
these days!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no! :D Will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
for id_ in worker_ids_2: | ||
# A newly created worker: It's recreated counter should be 1. | ||
if id_ not in original_worker_ids: | ||
self.assertTrue(algo._counters[f"worker_{id_}_recreated"] == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this not become flakey in case a worker gets created >1 times? We iterate three times so i'd expect that after each iteration, workers might be recreated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine a recreated worker gets a new ID, hence us checking, which are the new healthy worker IDs right before this loop.
rllib/evaluation/worker_set.py
Outdated
# TODO(jungong) : switch to True once Algorithm is migrated. | ||
healthy_only=False, | ||
local_worker: bool = True, | ||
healthy_only: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity: Why can we switch to True and what migration was Jun's comment directed at?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good one :) I think now that Algorithm uses the ActorManager (within WorkerSet), we can switch the default behavior to only operate on the healthy workers. Which makes sense: You wouldn't want to "ping" an already failed worker with an apply()
request as this one would most certainly also fail or - even worse - cause a timeout or hang.
All fixed, please take another look. Thanks a ton @ArturNiederfahrenhorst ! |
…llback' into issue64_add_on_worker_created_callback
Signed-off-by: Sven Mika <svenmika1977@gmail.com>
…llback' into issue64_add_on_worker_created_callback # Conflicts: # rllib/algorithms/callbacks.py # rllib/algorithms/tests/test_callbacks.py
…e64_add_on_worker_created_callback
Signed-off-by: Sven Mika <sven@anyscale.io>
This is a pick of part of #40354 that disable rllib:test_a3c tests in release branch. This particular test has been very flaky that it is blocking release branch as well. Signed-off-by: can <can@anyscale.com>
Add
on_workers_recreated
callback to DefaultCallbacks to be used inAlgorithm
.Why are these changes needed?
Sometimes, when a local- or remote worker is restarted, users would like to be able to immediately send some information to these workers (e.g. the global steps sampled from the Algorithm's counters). Therefore, we added this new callback event.
The new
on_workers_recreated()
callback is triggered after the Algorithm has checked, whether any workers (training ones underAlgorithm.workers
and evaluation ones underAlgorithm.evaluation_workers
) have failed, via theWorkerSet.restore_workers()
API.Here is the docstring of the new callback method. Note the need to use the
.apply()
method when dealing with a remote worker, as the callback is called from the local algorithm process and NOT by the worker process itself:Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.