
[RLlib] New ConnectorV3 API #05: PPO runs in single-agent mode in this API stack #42272

Conversation

@sven1977 (Contributor) commented Jan 9, 2024

EnvRunners support new ConnectorV3 API; PPO runs in single-agent mode in this API stack
This PR:

  • Adds a new config key, train_batch_size_per_learner, to better distinguish between the total effective batch size and the batch size per (GPU) Learner worker (see the config sketch after this list).
  • Makes large changes to the PPO algorithm when run with the new API stack + EnvRunners:
    • Forwards episode data directly from the EnvRunner(s) to the Learner worker(s) without having to form a MultiAgentBatch first.
    • Removes the need for PPO's forward_exploration to perform a value-function pass. This is an essential improvement in code quality, as we now have full separation between the sampling and learning worlds: the EnvRunner (sampling world) no longer has to think about what the PPOLearner (learning world) might need and only computes actions for the next env step.
    • All vf computations, GAE, and advantage computations have been moved to the Learner side and are now performed in a batched fashion (on all provided episodes at once). Having the episodes still intact on the Learner side helps reduce the complexity of these computations.
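
A minimal usage sketch of the new key (assuming it is set via `AlgorithmConfig.training()` like the other batch-size options; the exact setter is not shown in this excerpt):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Batch size per (GPU) Learner worker. With, e.g., 4 Learner workers, the
    # total effective train batch size would be 4 * 2000 = 8000 timesteps.
    .training(train_batch_size_per_learner=2000)
)
```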

Benchmark results:
Learns Pong in ~5 min via the examples/connectors/connector_v2_frame_stacking.py example script:

Args: --num-gpus=8 --num-env-runners=95 --framework=torch

on commit: 790a537

Trial status: 1 RUNNING
Current time: 2024-01-12 12:41:55. Total running time: 7min 0s
Logical resource usage: 96.0/96 CPUs, 8.0/8 GPUs (0.0/1.0 accelerator_type:V100)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name            status       iter     total time (s)       ts     reward     episode_reward_max     episode_reward_min     episode_len_mean     episodes_this_iter │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_env_0b2b7_00000   RUNNING       226             362.71   904000      19.62                     21                      9              1728.96                      0 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 added the do-not-merge ("Do not merge this PR!"), rllib-newstack, and rllib-oldstack-cleanup ("Issues related to cleaning up classes, utilities on the old API stack") labels on Jan 9, 2024
@@ -550,24 +638,3 @@ def training_step(self) -> ResultDict:
self.workers.local_worker().set_global_vars(global_vars)

return train_results

def postprocess_episodes(

@sven1977 (Contributor, Author):

No longer needed here. Episodes are sent directly to Learner(s) as-is.
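
A hedged pseudocode sketch of the resulting data flow (the sampling helper and the `episodes=` keyword are placeholders, not the actual RLlib APIs):

```python
def training_step(self):
    # EnvRunners now return lists of Episode objects instead of a
    # pre-built MultiAgentBatch ...
    episodes = self._sample_episodes_from_env_runners()  # placeholder helper
    # ... and the episodes are handed to the Learner group unchanged.
    # Batching, vf-predictions, GAE, and advantages are all computed on
    # the Learner side (see `_preprocess_train_data` below).
    return self.learner_group.update(episodes=episodes)
```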

@@ -39,6 +47,78 @@ def build(self) -> None:
)
)

@override(Learner)
def _preprocess_train_data(

@sven1977 (Contributor, Author):

Note: Only called on the new API stack + EnvRunners.

if not episodes:
return batch, episodes

# Make all episodes one ts longer in order to just have a single batch

@sven1977 (Contributor, Author):

New way to do GAE (see the sketch after this list):

  • Elongate all episodes by one artificial timestep.
  • Perform vf-predictions AND bootstrap value predictions in one single batch (possible because of the extra timestep).
    • Use the learner connector to make sure this forward pass is done using the correct (custom?) batch format.
  • Remove the extra timesteps from the episodes (and the computed advantages).
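
A self-contained sketch of the batched-GAE idea above, using the standard GAE recursion (the NumPy layout, default gamma/lambda values, and the omission of termination masking are simplifications for illustration, not the actual Learner-connector code):

```python
import numpy as np

def gae_advantages(rewards, vf_preds, bootstrap_value, gamma=0.99, lambda_=0.95):
    """GAE for one episode, given 1D vf-preds and a bootstrap value prediction."""
    # Append the bootstrap value, i.e. the "one artificial extra timestep"
    # described above, so a delta exists for the last real step as well.
    values = np.append(vf_preds, bootstrap_value)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lambda_ * gae
        advantages[t] = gae
    # Everything tied to the artificial timestep is dropped again; only the
    # real steps' advantages (and value targets) are used for training.
    return advantages
```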

SampleBatch.VF_PREDS,
SampleBatch.ACTION_DIST_INPUTS,
]
return self.output_specs_inference()

@sven1977 (Contributor, Author):

Simplified.

@@ -40,6 +40,11 @@ def _forward_exploration(self, batch: NestedDict) -> Dict[str, Any]:
the policy distribution to be used for computing KL divergence between the old
policy and the new policy during training.
"""
# TODO (sven): Make this the only behavior once PPO has been migrated
# to new API stack (including EnvRunners!).
if self.config.model_config_dict.get("uses_new_env_runners"):

@sven1977 (Contributor, Author):

Temporary hack to make sure the RLModule knows when it still has to compute vf-preds via forward_exploration (old and hybrid API stacks).
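
A hedged sketch of that branch (`_pi_out` / `_vf_out` are placeholders for the module's actual policy- and value-head calls):

```python
from ray.rllib.policy.sample_batch import SampleBatch

def _forward_exploration(self, batch):
    out = {SampleBatch.ACTION_DIST_INPUTS: self._pi_out(batch)}
    # New API stack + EnvRunners: no value-function pass during sampling;
    # vf-preds are computed later on the Learner, batched over whole episodes.
    if not self.config.model_config_dict.get("uses_new_env_runners"):
        # Old and hybrid stacks still expect vf-preds from this pass.
        out[SampleBatch.VF_PREDS] = self._vf_out(batch)
    return out
```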

@@ -272,6 +281,40 @@ def __init__(
# the final results dict in the `self.compile_update_results()` method.
self._metrics = defaultdict(dict)

@OverrideToImplementCustomLogic_CallToSuperRecommended

@sven1977 (Contributor, Author):

Moved here for better ordering of methods (used to be all the way at the bottom of the class).


# Build learner connector pipeline used on this Learner worker.
# TODO (sven): Support multi-agent cases.
if self.config.uses_new_env_runners and not self.config.is_multi_agent():

@sven1977 (Contributor, Author):

For now, the Learner connector is only used on the new API stack + EnvRunners in the single-agent case (without it, PPO on the new stack would not learn).

@@ -87,7 +86,13 @@ def __iter__(self):
def get_len(b):
return len(b[SampleBatch.SEQ_LENS])

n_steps = int(

@sven1977 (Contributor, Author):

Bug fix: when slicing a BxT batch, we now slice along the B axis (with the correct slice size). See the toy illustration below.
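
A toy illustration of the B- vs. T-axis distinction (shapes made up for the example):

```python
import numpy as np

# A padded batch of B=4 sequences, each T=5 timesteps long, with 3 features.
obs = np.zeros((4, 5, 3))

# Correct: a minibatch of 2 sequences is a slice along the B axis ...
minibatch = obs[:2]      # shape (2, 5, 3)

# ... whereas slicing along T keeps all 4 sequences but truncates them,
# silently producing the wrong effective slice size:
truncated = obs[:, :2]   # shape (4, 2, 3)
```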

return value

data = tree.map_structure(map_, self)
infos = self.pop(SampleBatch.INFOS, None)

@sven1977 (Contributor, Author):

Simplifications.

# we return the values here and slice them separately
# TODO(Artur): Clean this hack up.
return value
return value[start_padded:stop_padded]

@sven1977 (Contributor, Author):

Simplifications.


@kouroshHakha (Contributor) left a comment:

Stamp.

@sven1977 sven1977 merged commit e03dd6e into ray-project:master Jan 19, 2024
9 checks passed
@Mark2000 (Contributor) commented May 2, 2024

@sven1977 Could you speak more to why GAE support was dropped for APPO in this release?

        # IMPALA and APPO need vtrace (A3C Policies no longer exist).
        if not self.vtrace:
            raise ValueError(
                "IMPALA and APPO do NOT support vtrace=False anymore! Set "
                "`config.training(vtrace=True)`."
            )
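
For reference, a minimal config sketch that satisfies the quoted check (standard APPOConfig usage):

```python
from ray.rllib.algorithms.appo import APPOConfig

# APPO/IMPALA on this stack require vtrace to stay enabled.
config = APPOConfig().training(vtrace=True)
```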
