
[RLlib] Eval workers use async req manager. #27390

Merged

Conversation

@sven1977 sven1977 (Contributor) commented Aug 2, 2022

Adds a new config option: enable_evaluation_v2. This is an experimental setting.

If True:

  • Evaluation workers are organized inside an AsyncRequestsManager object, regardless of the eval settings (e.g. parallel eval and training, different eval durations, etc.).
  • This makes the eval step more robust against eval worker failures and/or environment-related delays and pauses (e.g. if an env has to restart or re-connect to some server, which may take hours).
  • The AsyncRequestsManager is queried during the evaluation step for results, and each returned result is checked for being up-to-date in terms of the weights used for that sample request. If the weights used for a sample request are outdated, the sample result is discarded (see the sketch below).
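To make the flow concrete, here is a self-contained toy sketch of that pattern (plain Python, with a thread pool standing in for remote eval workers; none of the names below are RLlib's actual API):

```python
# Toy version of: fire async sample requests, tag each with the weights
# sequence number in use, and discard results that come back with an
# outdated sequence number.
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def sample(worker_id, weights_seq_no):
    """Pretend to collect a rollout; returns (seq_no_used, num_timesteps)."""
    time.sleep(random.uniform(0.01, 0.05))  # simulate env stepping / network delays
    return weights_seq_no, 100


current_seq_no = 1  # bumped whenever new weights are broadcast to the eval workers
units_done = 0

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(sample, i, current_seq_no) for i in range(4)]
    current_seq_no += 1  # weights get updated while some requests are still in flight
    futures += [pool.submit(sample, i, current_seq_no) for i in range(4)]

    for fut in as_completed(futures):
        seq_no, num_timesteps = fut.result()
        if seq_no != current_seq_no:
            continue  # stale sample: collected with outdated weights -> discard
        units_done += num_timesteps

print(f"Kept {units_done} timesteps that were sampled with the current weights.")
```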

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@avnishn avnishn (Contributor) left a comment:

In this design, Sven, is there a requirement to enforce a timeout on remote requests?

If so, we don't support that today, but in theory we could.
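For reference, a minimal sketch of how a per-request timeout could be layered on top of plain Ray object refs via ray.wait()'s timeout argument (illustrative only, not something this PR adds):

```python
import time

import ray

ray.init()


@ray.remote
def sample():
    # Stand-in for a slow eval worker (e.g. an env that has to reconnect).
    time.sleep(2.0)
    return "sample_batch"


ref = sample.remote()
# Wait at most 0.5s; requests that don't finish in time stay in `not_ready`.
ready, not_ready = ray.wait([ref], timeout=0.5)
if ready:
    print(ray.get(ready[0]))
else:
    print("Sample request timed out; drop it or poll for it again next round.")
```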

self._evaluation_async_req_manager is not None
and worker_set is getattr(self, "evaluation_workers", None)
):
self._evaluation_async_req_manager.remove_workers(removed_workers)
Contributor:

nice

@gjoliver gjoliver (Member) left a comment:

Pretty nice!! I feel proud that we have built utils to make this task easier.
There are a bunch of nits, and a couple of meaningful comments in evaluate_v2(); please take a look and see if they make sense.

@@ -293,6 +294,9 @@ def default_logger_creator(config):

# Evaluation WorkerSet and metrics last returned by `self.evaluate()`.
self.evaluation_workers: Optional[WorkerSet] = None
# If evaluation duration is "auto", use a AsyncRequestsManager to be more
Member:

update the comment? if enable_async_evaluation is True ...

Contributor (Author):

done

self.config.get("framework") in ["tf2", "tfe"]
and not tf.executing_eagerly()
):
tf1.enable_eager_execution()
Member:

I feel like this is still useful? if an eval worker is scheduled to a node that doesn't have eager turned on.

Contributor:

Resolved?

Contributor (Author):

Ah, thanks. Yeah, it's really not needed here. evaluate is never called from within a thread (yes, it could be run on a different node/process b/c it's on some remote eval worker, but never inside a thread).

@@ -551,6 +555,13 @@ def setup(self, config: PartialAlgorithmConfigDict):
logdir=self.logdir,
)

if self.config["evaluation_with_async_requests"]:
Member:

nit nit nit, just a naming suggestion: enable_async_evaluation may be more expressive.

Contributor (Author):

Yeah, but all our eval settings start with evaluation_...
But fair enough, now that we have config objects with type-safe property names and method args, this may not be so much of a problem anymore :)

Contributor (Author):

done

Contributor (Author):

We should make this the default as fast as possible, so users don't get confused about the difference between evaluation_parallel_to_training and enable_async_evaluation.

@@ -941,6 +943,199 @@ def duration_fn(num_units_done):
# Also return the results here for convenience.
return self.evaluation_metrics

@ExperimentalAPI
def _evaluate_v2(
Member:

nit nit nit, another naming suggestion, with the config name, we can then call this _async_evaluate(self)?

Contributor (Author):

Renamed to _evaluate_async().

We should make this the default as fast as possible, so users don't get confused about the difference between evaluation_parallel_to_training and enable_async_evaluation.

@@ -2359,6 +2560,15 @@ def _run_one_training_iteration(self) -> Tuple[ResultDict, "TrainIterCtx"]:
Returns:
The results dict from the training iteration.
"""
# In case we are training (in a thread) parallel to evaluation,
# we may have to re-enable eager mode here (gets disabled in the
# thread).
Member:

I think tf1.enable_eager_execution() throws an exception if it is called after any other tf operations.
Why do we disable it?

Contributor (Author):

It doesn't get disabled explicitly, but for some reason any new thread you create starts with "regular" (non-eager) tf, and we have to enable it. We do the same in the learner threads of IMPALA and APEX. Sorry, I never had the time to investigate why threads start this way. However, even if they started with eager enabled, we would still check and DISABLE it in case we don't want eager mode :) So it would be the same "hassle".
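A minimal sketch of the pattern being described here, under the assumption that tf1/tf come from RLlib's try_import_tf() helper (the thread body itself is illustrative):

```python
import threading

from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


def learner_thread_fn(config):
    # New threads may come up in graph (non-eager) mode even when the algorithm
    # runs tf2 eagerly, so eager execution is re-enabled before any tf ops run.
    if config.get("framework") in ["tf2", "tfe"] and not tf.executing_eagerly():
        tf1.enable_eager_execution()
    # ... run the learner / evaluation loop ...


threading.Thread(target=learner_thread_fn, args=({"framework": "tf2"},)).start()
```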

# Evaluation does not run for every step.
# Save evaluation metrics on trainer, so it can be attached to
# subsequent step results as latest evaluation result.
self.evaluation_metrics = {"evaluation": metrics}
Member:

let's move this block into _run_one_evaluation()? then we don't have to duplicate this in both evaluate_v2() and evaluate()

Contributor (Author):

I think this won't work. What if the user wants to call evaluate manually?

time_started = time.time()
timed_out = True

while time.time() - time_started < self.config["evaluation_sample_timeout_s"]:
Member:

I think this should just be while True.
In this mode we won't be looking at evaluation_sample_timeout_s, which is the time we wait for a single round of eval episodes; in this mode, there is no concept of rounds.
And we don't have to track the timed_out variable either, actually.

Contributor (Author):

What if the episodes are too long (no horizon set on the evaluation config and batch_mode=complete_episodes)?
Fair, in that case we are screwed anyway.

Contributor (Author):

removed

timed_out = False
break

round_ += 1
Member:

rename to _round? I don't know if the style guide suggests a trailing _ for any type of variable.

Contributor (Author):

done

seq_no == self._evaluation_weights_seq_number
and (
i * (1 if unit == "episodes" else rollout_len * num_envs)
< units_left_to_do
Member:

do we really care about this? usually folks don't complain if we have a bit more data than intended? :)

Contributor (Author):

"usually" Yeah, but this is cleaner, no? :D

f"{unit} done)"
)

if timed_out and log_once("evaluation_timeout"):
Member:

we can get rid of this warning for async mode.

Contributor (Author):

done, no more timeouts in async mode

):
raise ValueError(
"Local evaluation OR evaluation without input reader OR evaluation "
"with only a local eval worker not supported in combination "
Contributor:

What's the difference between "Local evaluation" and "evaluation with only a local eval worker"?

Contributor (Author):

"Local evaluation" is when you use the "local worker" (of the regular WorkerSet (self.workers)) for evaluation.

Clarified the error message.

@@ -216,6 +216,68 @@ def _do_test_fault_fatal(self, alg, config, fail_eval=False):
self.assertRaises(Exception, lambda: a.train())
a.stop()

def _do_test_fault_fatal_but_recreate(self, alg, config, eval_only=False):
Contributor:

Is eval_only=False ever used?

Contributor (Author):

Good point: Removed the arg altogether.

weights = {pid: actual_weights[i] for i, pid in enumerate(weights.keys())}
# Only update our weights, if no seq no given OR given seq no is different
# from ours
if weights_seq_no is None or weights_seq_no != self.weights_seq_no:
Contributor:

Nit: If we have a sequence number, it would probably be expected behaviour to update if it increments.

Contributor (Author):

But that's what we do, no? Below, in the line

self.weights_seq_no = weights_seq_no

we even update if the passed-in seq-no is None, such that the next time a non-None seq-no is passed, we update again and, after that, have a non-None seq-no again. :)
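A compact restatement of that behavior as a toy class (not the actual RolloutWorker code):

```python
class ToyWorker:
    def __init__(self):
        self.weights = None
        self.weights_seq_no = None

    def set_weights(self, weights, weights_seq_no=None):
        # Update only if no seq-no was given OR it differs from the one we hold.
        if weights_seq_no is None or weights_seq_no != self.weights_seq_no:
            self.weights = weights
        # Always remember the incoming seq-no (possibly None), so the next call
        # that carries a real seq-no triggers an update again.
        self.weights_seq_no = weights_seq_no
```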

@ArturNiederfahrenhorst ArturNiederfahrenhorst (Contributor) left a comment:

I have a question regarding concurrent calls to recreate_failed_workers. Other than that: Nothing major.

@sven1977 sven1977 (Contributor, Author):

Hey @ArturNiederfahrenhorst, please take another look. All your questions have been addressed now. Thanks!

@ArturNiederfahrenhorst ArturNiederfahrenhorst (Contributor) left a comment:

lgtm

@richardliaw richardliaw (Contributor):

Lint still failing?

@sven1977 sven1977 merged commit 436c89b into ray-project:master Aug 16, 2022
@sven1977 sven1977 deleted the eval_workers_use_async_req_manager branch June 2, 2023 20:18