[RLlib] Eval workers use async req manager. #27390
Conversation
In this design, Sven, is there a requirement to enforce a timeout on remote requests?
If so, we don't support that today, but in theory we could.
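A per-request timeout could, in principle, be layered on top of the in-flight request tracking. Here is a minimal, self-contained sketch of the idea; the class and method names are hypothetical (not RLlib's actual `AsyncRequestsManager` API), and `concurrent.futures` stands in for Ray actors (with Ray, the same pattern would poll via `ray.wait(in_flight_refs, timeout=0)`):

```python
import concurrent.futures
import time

# Hypothetical sketch: track each in-flight request with its submit time
# and stop waiting for any request that exceeds a per-request timeout.
class TimedAsyncRequests:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
        self.in_flight = {}  # future -> submit timestamp

    def submit(self, fn, *args):
        self.in_flight[self.pool.submit(fn, *args)] = time.monotonic()

    def get_ready(self):
        ready, now = [], time.monotonic()
        for fut in list(self.in_flight):
            if fut.done():
                del self.in_flight[fut]
                ready.append(fut.result())
            elif now - self.in_flight[fut] > self.timeout_s:
                # Timed out: stop tracking this request (the worker may
                # still finish, but we no longer wait for its result).
                fut.cancel()
                del self.in_flight[fut]
        return ready


mgr = TimedAsyncRequests(timeout_s=2.0)
mgr.submit(lambda: "sample_batch")
time.sleep(0.2)  # give the (instant) task time to finish
ready = mgr.get_ready()
```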
rllib/algorithms/algorithm.py
Outdated
```python
    self._evaluation_async_req_manager is not None
    and worker_set is getattr(self, "evaluation_workers", None)
):
    self._evaluation_async_req_manager.remove_workers(removed_workers)
```
nice
Pretty nice!! I feel proud that we have built utils to make this task easier.
There are a bunch of nits and a couple of meaningful comments in evaluate_v2(); please take a look and see if they make sense.
```python
@@ -293,6 +294,9 @@ def default_logger_creator(config):

        # Evaluation WorkerSet and metrics last returned by `self.evaluate()`.
        self.evaluation_workers: Optional[WorkerSet] = None
        # If evaluation duration is "auto", use a AsyncRequestsManager to be more
```
Update the comment? "If enable_async_evaluation is True ..."
done
```python
    self.config.get("framework") in ["tf2", "tfe"]
    and not tf.executing_eagerly()
):
    tf1.enable_eager_execution()
```
I feel like this is still useful? E.g. if an eval worker is scheduled to a node that doesn't have eager turned on.
Resolved?
Ah, thanks. Yeah, it's really not needed here. `evaluate` is never called from within a thread (yes, it could be run on a different node/process because it's on some remote eval worker, but never inside a thread).
rllib/algorithms/algorithm.py
Outdated
```python
@@ -551,6 +555,13 @@ def setup(self, config: PartialAlgorithmConfigDict):
            logdir=self.logdir,
        )

        if self.config["evaluation_with_async_requests"]:
```
Nit: just a naming suggestion, `enable_async_evaluation` may be more expressive.
Yeah, but all our eval settings start with evaluation_...
But fair enough, now that we have config objects with type-safe property names and method args, this may not be so much of a problem anymore :)
done
We should make this the default as soon as possible so users don't get confused about the difference between `evaluation_parallel_to_training` and `enable_async_evaluation`.
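For context, the two settings are orthogonal and can be combined. A purely illustrative config fragment (the key names appear in this PR's discussion; the values and the presence of the other keys are assumptions, not RLlib defaults):

```python
# Illustrative config fragment only, not an authoritative RLlib config.
config = {
    # Run evaluation concurrently with the training step (in a thread):
    "evaluation_parallel_to_training": True,
    # Additionally use the async request manager for the eval workers,
    # so a slow or failed eval worker cannot block the evaluation loop:
    "enable_async_evaluation": True,
    "evaluation_interval": 1,
    "evaluation_num_workers": 2,
}
```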
rllib/algorithms/algorithm.py
Outdated
```python
@@ -941,6 +943,199 @@ def duration_fn(num_units_done):
        # Also return the results here for convenience.
        return self.evaluation_metrics

    @ExperimentalAPI
    def _evaluate_v2(
```
Nit: another naming suggestion. With the config name, we could then call this `_async_evaluate(self)`?
Renamed to `_evaluate_async()`.
```python
@@ -2359,6 +2560,15 @@ def _run_one_training_iteration(self) -> Tuple[ResultDict, "TrainIterCtx"]:
        Returns:
            The results dict from the training iteration.
        """
        # In case we are training (in a thread) parallel to evaluation,
        # we may have to re-enable eager mode here (gets disabled in the
        # thread).
```
I think `tf1.enable_eager_execution()` throws an exception if it is called after any other TF operations. Why do we disable it?
It doesn't get disabled explicitly, but for some reason, any new thread you create starts with "regular" (non-eager) TF and we have to enable it. We do the same in the learning threads of IMPALA and APEX. Sorry, I never had the time to investigate why it starts this way. However, even if we started with eager enabled, we would still have to check and DISABLE it in case we don't want eager mode :) So it would be the same "hassle".
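The per-thread behavior described here can be mimicked without TF at all: state that each freshly spawned thread must re-enable for itself. A small, self-contained sketch; the two flag functions are stand-ins for `tf.executing_eagerly()` / `tf1.enable_eager_execution()`, not the real TF calls:

```python
import threading

# Stand-ins for the TF eager-mode flag: thread-local state that a new
# thread does NOT inherit, mirroring how new threads start in graph
# (non-eager) mode in TF1-style code.
_state = threading.local()

def executing_eagerly():
    return getattr(_state, "eager", False)

def enable_eager_execution():
    _state.eager = True

enable_eager_execution()  # main thread: eager is on
seen = {}

def training_thread():
    # The new thread starts without the flag, so check and re-enable it
    # here (the same check-and-enable done in RLlib's learner threads).
    seen["before"] = executing_eagerly()
    if not executing_eagerly():
        enable_eager_execution()
    seen["after"] = executing_eagerly()

t = threading.Thread(target=training_thread)
t.start()
t.join()
```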
```python
        # Evaluation does not run for every step.
        # Save evaluation metrics on trainer, so it can be attached to
        # subsequent step results as latest evaluation result.
        self.evaluation_metrics = {"evaluation": metrics}
```
Let's move this block into `_run_one_evaluation()`? Then we don't have to duplicate it in both evaluate_v2() and evaluate().
I think this won't work. What if the user wants to call `evaluate` manually?
rllib/algorithms/algorithm.py
Outdated
```python
    time_started = time.time()
    timed_out = True

    while time.time() - time_started < self.config["evaluation_sample_timeout_s"]:
```
I think this should just be `while True`. In this mode we won't be looking at `evaluation_sample_timeout_s`, which is the time we wait for a single round of eval episodes, and in this mode there is no concept of rounds. And we don't actually have to track the `timed_out` variable either.
What if the episodes are too long (no horizon set on the evaluation config and `batch_mode="complete_episodes"`)?
Fair, in this case, we are screwed anyways.
removed
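With the timeout removed, the resulting loop might look roughly like this self-contained sketch; the manager class and all names here are hypothetical stand-ins for the actual `_evaluate_async` internals, not RLlib's real API:

```python
# Hypothetical sketch of the async eval loop with the per-round timeout
# removed: keep polling ready results until enough units are collected.
class FakeManager:
    def __init__(self, units_per_request):
        self.units_per_request = units_per_request
        self.pending = 0

    def submit_to_idle_workers(self):
        self.pending += 1

    def get_ready(self):
        # Pretend every pending request finishes by the next poll.
        n, self.pending = self.pending, 0
        return [{"units": self.units_per_request} for _ in range(n)]

def run_async_eval(manager, units_to_do):
    units_done, results = 0, []
    while units_done < units_to_do:  # no timeout, no rounds
        for result in manager.get_ready():
            units_done += result["units"]
            results.append(result)
        manager.submit_to_idle_workers()  # keep workers busy
    return results

results = run_async_eval(FakeManager(units_per_request=10), units_to_do=25)
```

Note the loop can overshoot slightly (here 30 units for a budget of 25), which connects to the "a bit more data than intended" discussion below.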
rllib/algorithms/algorithm.py
Outdated
```python
            timed_out = False
            break

        round_ += 1
```
Rename to `_round`? I don't know if the style guide suggests a trailing `_` for any type of variable.
done
```python
    seq_no == self._evaluation_weights_seq_number
    and (
        i * (1 if unit == "episodes" else rollout_len * num_envs)
        < units_left_to_do
```
Do we really care about this? Usually folks don't complain if we have a bit more data than intended? :)
"usually" Yeah, but this is cleaner, no? :D
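The condition being discussed gates each in-flight result on both freshness (weights seq-no) and the remaining unit budget. A self-contained sketch of that filter (the parameter names are illustrative, not the exact RLlib variables):

```python
def accept_result(seq_no, current_seq_no, i, unit,
                  rollout_len, num_envs, units_left_to_do):
    # Accept the i-th in-flight result only if it was sampled with the
    # current weights AND its units still fit the remaining budget
    # (1 unit per episode; rollout_len * num_envs units per timestep batch).
    units = 1 if unit == "episodes" else rollout_len * num_envs
    return seq_no == current_seq_no and i * units < units_left_to_do
```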
rllib/algorithms/algorithm.py
Outdated
```python
        f"{unit} done)"
    )

    if timed_out and log_once("evaluation_timeout"):
```
we can get rid of this warning for async mode.
done, no more timeouts in async mode
rllib/algorithms/algorithm.py
Outdated
```python
):
    raise ValueError(
        "Local evaluation OR evaluation without input reader OR evaluation "
        "with only a local eval worker not supported in combination "
```
What's the difference between "Local evaluation" and "evaluation with only a local eval worker"?
"Local evaluation" is when you use the "local worker" of the regular WorkerSet (`self.workers`) for evaluation.
Clarified the error message.
```python
@@ -216,6 +216,68 @@ def _do_test_fault_fatal(self, alg, config, fail_eval=False):
        self.assertRaises(Exception, lambda: a.train())
        a.stop()

    def _do_test_fault_fatal_but_recreate(self, alg, config, eval_only=False):
```
Is eval_only=False ever used?
Good point: Removed the arg altogether.
```python
    weights = {pid: actual_weights[i] for i, pid in enumerate(weights.keys())}
    # Only update our weights, if no seq no given OR given seq no is different
    # from ours
    if weights_seq_no is None or weights_seq_no != self.weights_seq_no:
```
Nit: If we have a sequence number, it would probably be expected behaviour to update if it increments.
But that's what we do, no? See the line `self.weights_seq_no = weights_seq_no` below. We even update if the passed-in seq-no is None, such that the next time any non-None seq-no is passed, we update again and, after that, have a non-None seq-no again. :)
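The update rule being described can be isolated into a tiny, self-contained sketch; the class and attribute names here are illustrative, mirroring the seq-no condition in the quoted snippet rather than the actual RLlib worker class:

```python
class EvalWorkerSketch:
    """Illustrative stand-in for an eval worker's weight-update logic."""
    def __init__(self):
        self.weights = None
        self.weights_seq_no = None

    def set_weights(self, weights, weights_seq_no=None):
        # Update if no seq-no was given OR the given one differs from ours.
        if weights_seq_no is None or weights_seq_no != self.weights_seq_no:
            self.weights = weights
        # Record the (possibly None) seq-no, so the next non-None seq-no
        # passed in will trigger another update.
        self.weights_seq_no = weights_seq_no

w = EvalWorkerSketch()
w.set_weights("w1", weights_seq_no=5)  # 5 != None -> update
w.set_weights("w2", weights_seq_no=5)  # same seq-no -> skipped
w.set_weights("w3")                    # no seq-no given -> always update
```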
I have a question regarding concurrent calls to recreate_failed_workers. Other than that: Nothing major.
Hey @ArturNiederfahrenhorst, please take another look. All your questions have been addressed now. Thanks!
lgtm
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Lint still failing?
Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Why are these changes needed?
Add new config option: `enable_evaluation_v2`. This is an experimental setting. If True:

Related issue number

Checks
- I've signed off my commits (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.