-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Cleanup ActorManager
and WorkerSet
: Make all mark_healthy
/healthy_only
method args True
by default.
#44993
[RLlib] Cleanup ActorManager
and WorkerSet
: Make all mark_healthy
/healthy_only
method args True
by default.
#44993
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Some comments in regard to potential influence on sample timout settings.
timeout_seconds=self.config.worker_restore_timeout_s, | ||
# Bring back actor after successful state syncing. | ||
mark_healthy=True, | ||
timeout_seconds=self.config.env_runner_restore_timeout_s, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this timeout and restoration time in general influence in any way the sampling (blocking it or slowing down for example)? I ask b/c in this case the sample_timeout_s
could lead to no samples returned by synchronous_parallel_sample
and lead to errors.
env_runner_health_probe_timeout_s: Max amount of time we should spend | ||
waiting for health probe calls to finish. Health pings are very cheap, | ||
so the default is 1 minute. | ||
env_runner_health_probe_timeout_s: Max amount of time in seconds, we should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well: Any influence on/by sample_timeout_s
?
Signed-off-by: Sven Mika <svenmika1977@gmail.com>
…y`/`healthy_only` method args `True` by default. (ray-project#44993)
Cleanup
ActorManager
andWorkerSet
: Make allmark_healthy
/healthy_only
method argsTrue
by default.Also renamed
WorkerSet.__actor_manager
not super-private anymore (super-privates make it impossible to debug properly!).Reasoning:
The default behavior of a
WorkerSet.foreach_...
call should be:timeout
(sometimes a minute) for workers that we already know are NOT healthy.mark_healthy
should always be True anyways! If any worker that is being sent q request - for whatever reason - then responds properly, we should mark it healthy, unless there is a good reason NOT to do so.restore_workers
API. This method will send a special "ping" request to all unhealthy (only) workers (with a now shortened timeout of 30sec (before 60s)) and mark them healthy, if they respond well.Why are these changes needed?
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.