[RLlib] DQN Rainbow on new API stack: RLModule and Catalog together with TorchNoisyMLP. #43199
Conversation
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
… Epsilon-greedy. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
… noisy layers. Introduced a new 'TorchNoisyMLP' that includes a 'NoisyLinear' layer in the same way as 'nn.Linear', such that we can keep the design we have for the new stack. Furthermore, included '**kwargs' in the ctor of 'PrioritizedReplayBuffer' such that arguments in 'replay_buffer_config' don't raise an exception. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
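The commit above describes a NoisyLinear-style layer (the noisy-networks idea from Fortunato et al.). As a minimal, hedged sketch of a factorized-Gaussian noisy linear layer, the class and attribute names below are illustrative assumptions, not the PR's actual implementation:

```python
import math
import torch
from torch import nn


class NoisyLinearSketch(nn.Module):
    """Minimal factorized-Gaussian noisy linear layer (illustration only)."""

    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable means and noise scales for weights and biases.
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.w_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.b_mu = nn.Parameter(torch.empty(out_features))
        self.b_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.w_mu, -bound, bound)
        nn.init.uniform_(self.b_mu, -bound, bound)
        nn.init.constant_(self.w_sigma, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.b_sigma, sigma0 / math.sqrt(in_features))

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        # Noise-shaping function from the noisy-nets paper: f(x) = sgn(x) * sqrt(|x|).
        return x.sign() * x.abs().sqrt()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Factorized noise: one noise vector per input dim, one per output dim.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.w_mu + self.w_sigma * torch.outer(eps_out, eps_in)
        bias = self.b_mu + self.b_sigma * eps_out
        return nn.functional.linear(x, weight, bias)


layer = NoisyLinearSketch(4, 2)
out = layer(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 2])
```

Because the noise scales are learnable parameters, the module can drop-in replace an `nn.Linear` of the same shape, which is what lets the new-stack encoder/head design stay unchanged.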
…lay_buffer.py' to branch 'dqn-rainbow-training-step'. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
@@ -303,7 +303,7 @@ def build(self) -> None:
        # the user the option to run on the gpu of their choice, so we enable that
        # option here via the local gpu id scaling config parameter.
        if self._distributed:
-           devices = get_devices()
+           devices = get_device()
            assert len(devices) == 1, (
                "`get_devices()` should only return one cuda device, "
fix this comment here as well: get_device() should only return one ....
@abstractmethod
@OverrideToImplementCustomLogic
def _qf(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType]:
Nice and clean API. Like it! :)
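The `_qf` hook above is an abstract, override-to-customize method. As a hedged sketch of the pattern only (the surrounding class, the `TensorType` alias, and the returned key are assumptions for illustration, not RLlib's actual classes):

```python
from abc import ABC, abstractmethod
from typing import Dict

# Stand-in for RLlib's TensorType; purely illustrative.
TensorType = float


class QModuleSketch(ABC):
    """Sketch of an RLModule-style base class exposing an abstract Q-function."""

    @abstractmethod
    def _qf(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType]:
        """Compute Q-values for a batch (overridden per framework/architecture)."""


class ConstantQModule(QModuleSketch):
    def _qf(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType]:
        # Toy override: return a constant "Q-value" regardless of the batch.
        return {"qf_preds": 0.0}


print(ConstantQModule()._qf({"obs": 1.0}))  # {'qf_preds': 0.0}
```

The appeal of this API is that each concrete module (plain DQN head, dueling head, distributional head) only has to supply its own `_qf` while the rest of the forward logic stays shared.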
…remaining input args to the configs of the catalog. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ded all remaining arguments to the encoder and head configs in the catalog. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
I just left some comments for discussion to bring this module into its final version.
input_dims=self.latent_dims,
hidden_layer_dims=self._model_config_dict["post_fcnet_hiddens"],
hidden_layer_activation=self.af_and_vf_head_activation,
# TODO (simon): No clue where we get this from.
@sven1977 Where do we actually get these config settings? These are not given in models/catalog.py.
I think they only exist in the new model config objects, so thus far, they are not really reachable by users going through the traditional Catalog model_config_dict.
output_layer_activation="linear",
output_layer_dim=output_layer_dim,
# TODO (simon): Check where we get this from.
# output_layer_use_bias=self._model_config_dict["output_layer_use_bias"],
Same here.
output_layer_use_bias=self._model_config_dict["output_layer_use_bias"],
# TODO (sven, simon): Should these initializers rather the fcnet
@sven1977 Do we really use the post_fcnet settings here in the encoder already, or should these only go into the heads? I see this also in the default settings in core/models/catalog.py:get_encoder_config and in the other new-stack algorithms. Let's make a decision here and go with it from here.
I would think so.
Let's say the encoder is a 3-layer CNN with an additional [512, 512] head on top (after flattening the last CNN layer's output). Then we need the post_fcnet_hiddens here to set up these last two 512-layers. Each head (Af, Vf) can then still have its own config with its own fcnet_hiddens.
@sven1977 Makes sense. In the case of the new catalog, we have no fully connected layers in the CNN encoder; only the MLP encoder has fully connected hidden and output layers.
I am still wondering because in models/catalog.py we only have fcnet_hiddens and post_fcnet_hiddens where one can define such layer dimensions. As we use both in the encoder, the heads can only use the same.
For example: in the catalog we use the encoder_latent_dim, which is defined to be either the value given in the model_config_dict (so directly by the user) or fcnet_hiddens[-1].
exploit_actions = action_dist.to_deterministic().sample()

# Apply epsilon-greedy exploration.
# TODO (simon): Implement sampling for nested spaces.
@sven1977 How is this actually best managed in the new stack? Do we support this already?
We don't yet. After SAC/DQN/APEX-DQN are done, we should write an off-policy algo that can handle nested and arbitrary action spaces.
Didn't we also have online algorithms with complex action spaces in the old stack?
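For the flat (non-nested) case the discussion is about, epsilon-greedy selection itself is simple. A minimal, hedged sketch over a single action space; the function name and signature are illustrative, not RLlib's API:

```python
import random


def epsilon_greedy(q_values: list, epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick a uniformly random action index,
    otherwise the greedy (argmax-Q) one. Sketch only, flat spaces only."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy: index of the maximal Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
# With epsilon=0.0 the choice is always greedy.
print(epsilon_greedy([0.1, 0.9, 0.3], 0.0, rng))  # 1
```

Supporting nested spaces (the TODO above) would mean applying this leaf-wise over a dict/tuple space, which is exactly the part that does not exist yet.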
# Apply epsilon-greedy exploration.
# TODO (simon): Implement sampling for nested spaces.
# TODO (simon): Implement different epsilon and schedules.
@sven1977 For this we need to resolve the model_config_dict (or, better named, RLModule configuration) setup. Right now epsilon is passed in via the exploration_config, which does not exist for the new stack. Instead, we need to define new parameters there.
Yeah, forget about exploration_config. It will not come back :) .
For now, let's make epsilon available as a model_config_dict key that behaves similarly to learning_rate_schedule (fixed float OR list of "scheduling" tuples). We then create a ray.rllib.utils.schedules.scheduler::Scheduler object (which already exists and is well tested in the new stack) inside the RLModule, use the (newly added?) timestep to figure out the current epsilon, and then use that to sample in forward_exploration (all other forward methods should not use this).
Yes, and let's chat at some point about Module configs in general :)
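The "fixed float OR list of scheduling tuples" idea suggested above can be sketched as follows. This is a hedged illustration with linear interpolation between breakpoints; RLlib's actual Scheduler class may resolve specs differently:

```python
def epsilon_at(spec, timestep: int) -> float:
    """Resolve an epsilon spec at a given timestep.

    spec is either a fixed float, or a list of (timestep, value) breakpoints
    that are linearly interpolated (sketch of the scheduling-tuples idea;
    not RLlib's actual Scheduler API).
    """
    if isinstance(spec, (int, float)):
        return float(spec)
    points = sorted(spec)
    if timestep <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if timestep <= t1:
            frac = (timestep - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    # Past the last breakpoint: hold the final value.
    return points[-1][1]


schedule = [(0, 1.0), (10_000, 0.05)]
print(epsilon_at(schedule, 5_000))  # 0.525
```

The RLModule's forward_exploration would then call something like this with the current timestep and feed the resulting epsilon into the action-sampling step.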
return output

# TODO (simon): Maybe returning results in a dict of dicts:
@sven1977 I am interested in your opinion on using a nested dictionary here (not necessarily a NestedDict ;) ) instead of the plain one.
Plain one. I'm trying hard right now to get rid of both NestedDict and SampleBatch/MultiAgentBatch. Wherever normal dicts already work, let's use them to simplify the code base.
I think what's left to do:
- Learner.update() only accepts a MultiAgentBatch.
- RLModule forward passes only accept NestedDict, but I'm not 100% sure; normal dicts might also work already here. We just have to convert to tensors first, of course, but that's something NestedDict also does not do automatically.
So far, normal dicts work. I tried this out for PPO, SAC, and DQN Rainbow.
In regard to the output_specs being plain: the ActorCriticEncoder works with a nested one (e.g. output[ENCODER][CRITIC]).
… In addition added timesteps to the 'forward_exploration' calls in the 'EnvRunner's. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
LGTM.
…nitions of the other algorithms (PPO, SAC, etc.). Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…' such that schedulers on remote workers can be updated. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Why are these changes needed?
This is the third part of moving the DQN Rainbow algorithm to our new stack and towards using the EnvRunner API. See #43196 and #43198 for the other parts. This PR introduces the DQNRainbowRLModule together with its catalog. The module implements a dueling architecture and distributional Q-learning. Furthermore, it comes with noisy networks, introducing a new TorchNoisyMLP (in Encoder and Head versions) that makes use of a NoisyLinear layer similar to nn.Linear, so that we can use the same design as for the other torch encoders and heads. The corresponding configurations for these networks are introduced as well.

Related issue number
Closes #37777
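The dueling architecture mentioned in the description combines a state-value stream with a mean-centered advantage stream. As a hedged sketch of the standard aggregation (plain Python, not the PR's actual tensor code):

```python
def dueling_q(value: float, advantages: list) -> list:
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Subtracting the mean advantage keeps V and A identifiable (sketch only;
    the real module operates on batched tensors per action)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


print(dueling_q(1.0, [0.0, 1.0, 2.0]))  # [0.0, 1.0, 2.0]
```

In the distributional variant, the same aggregation is applied per support atom before the softmax over the return distribution.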
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.