
[RLlib] DQN Rainbow on new API stack: RLModule and Catalog together with TorchNoisyMLP. #43199

Merged
merged 25 commits into ray-project:master from dqn-rainbow-rl-module on Feb 28, 2024

Conversation


@simonsays1980 simonsays1980 commented Feb 15, 2024

Why are these changes needed?

This is the third part of moving the DQN Rainbow algorithm to our new stack and towards using the EnvRunner API. See #43196 and #43198 for the other parts. This PR introduces the DQNRainbowRLModule together with its catalog. The module implements a dueling architecture and distributional Q-learning.
Furthermore, it comes with noisy networks, introducing a new TorchNoisyMLP (in Encoder and Head versions) that makes use of a NoisyLinear layer, similar to nn.Linear, so that we can keep the same design as for the other torch encoders and heads. Corresponding configurations are introduced together with the networks.
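
For readers not familiar with noisy networks: a NoisyLinear layer replaces the deterministic weights of nn.Linear with learnable mean and sigma parameters plus Gaussian noise that is resampled on every forward pass. Below is a minimal, stand-alone sketch of such a layer using factorized Gaussian noise (Fortunato et al., 2017); it only illustrates the idea and is not the exact TorchNoisyMLP/NoisyLinear implementation added in this PR:

```python
import torch
import torch.nn as nn


class NoisyLinear(nn.Module):
    """Illustrative drop-in replacement for nn.Linear with factorized noise."""

    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        # Learnable mean and sigma parameters for weights and bias.
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.w_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.b_mu = nn.Parameter(torch.empty(out_features))
        self.b_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers (not learned, resampled on every forward pass).
        self.register_buffer("eps_in", torch.zeros(in_features))
        self.register_buffer("eps_out", torch.zeros(out_features))
        bound = 1.0 / in_features ** 0.5
        nn.init.uniform_(self.w_mu, -bound, bound)
        nn.init.uniform_(self.b_mu, -bound, bound)
        nn.init.constant_(self.w_sigma, sigma0 * bound)
        nn.init.constant_(self.b_sigma, sigma0 * bound)

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        # Noise-shaping function f(x) = sign(x) * sqrt(|x|).
        return x.sign() * x.abs().sqrt()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Factorized Gaussian noise: one vector per input and per output dim.
        self.eps_in.normal_()
        self.eps_out.normal_()
        eps_w = torch.outer(self._f(self.eps_out), self._f(self.eps_in))
        eps_b = self._f(self.eps_out)
        weight = self.w_mu + self.w_sigma * eps_w
        bias = self.b_mu + self.b_sigma * eps_b
        return nn.functional.linear(x, weight, bias)
```

Because such a layer has the same call signature as nn.Linear, an MLP built from it can follow the same encoder/head design as the existing torch networks.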

Related issue number

Closes #37777

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

… Epsilon-greedy.

… noisy layers. Introduced a new 'TorchNoisyMLP' that includes a 'NoisyLinear' layer in the same way as 'nn.Linear' such that we can keep the design we have for the new stack. Furthermore, included '**kwargs' in the ctor of 'PrioritizedReplayBuffer' such that arguments in 'replay_buffer_config' don't raise an exception.

…lay_buffer.py' to branch 'dqn-rainbow-training-step'.

@sven1977 sven1977 self-assigned this Feb 20, 2024
@sven1977 sven1977 changed the title DQN Rainbow RLModule and Catalog together with TorchNoisyMLP. [RLlib] DQN Rainbow on new API stack: RLModule and Catalog together with TorchNoisyMLP. Feb 20, 2024
@sven1977 sven1977 marked this pull request as ready for review February 20, 2024 13:57
@@ -303,7 +303,7 @@ def build(self) -> None:
         # the user the option to run on the gpu of their choice, so we enable that
         # option here via the local gpu id scaling config parameter.
         if self._distributed:
-            devices = get_devices()
+            devices = get_device()
             assert len(devices) == 1, (
                 "`get_devices()` should only return one cuda device, "
Contributor

fix this comment here as well: get_device() should only return one ....


@abstractmethod
@OverrideToImplementCustomLogic
def _qf(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType]:
Contributor

Nice and clean API. Like it! :)
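
(For context on the dueling part of this PR: a concrete _qf implementation typically combines a value stream and an advantage stream into per-action Q-values. The snippet below is a generic, stand-alone sketch of that combine step only; tensor names and shapes are assumptions, not the module's actual code.)

```python
import torch


def dueling_q_values(value: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # value: (B, 1), advantages: (B, num_actions) -> Q: (B, num_actions)
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)  (standard dueling combine).
    return value + advantages - advantages.mean(dim=-1, keepdim=True)


q = dueling_q_values(torch.zeros(2, 1), torch.randn(2, 5))
```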

…remaining input args to the configs of the catalog.

…ded all remaining arguments to the encoder and head configs in the catalog.

Collaborator Author

@simonsays1980 simonsays1980 left a comment

Just left some comments for discussion to bring this module into a final version.

input_dims=self.latent_dims,
hidden_layer_dims=self._model_config_dict["post_fcnet_hiddens"],
hidden_layer_activation=self.af_and_vf_head_activation,
# TODO (simon): No clue where we get this from.
Collaborator Author

@sven1977 Where do we actually get these config settings? These are not given in models/catalog.py.

Contributor

I think they only exist in the new model config objects, so thus far, they are not really reachable by users going through the traditional Catalog model_config_dict.

output_layer_activation="linear",
output_layer_dim=output_layer_dim,
# TODO (simon): Check where we get this from.
# output_layer_use_bias=self._model_config_dict["output_layer_use_bias"],
Collaborator Author

Same here.

output_layer_use_bias=self._model_config_dict[
"output_layer_use_bias"
],
# TODO (sven, simon): Should these initializers rather the fcnet
Collaborator Author

@sven1977 Do we really already use the post_fcnet settings here in the encoder, or should these only go into the heads? I see this also in the default settings in core/models/catalog.py:get_encoder_config and in the other new-stack algorithms. Let's make a decision here and go with it from here on.

Contributor

I would think so.
Let's say the encoder is a 3-layer CNN with an additional [512, 512] head on top (after flattening the last CNN layer's output).
Then we need the post_fcnet_hiddens here to set up these last two 512-layers.

Then each head (Af, Vf) can still have its own config with its own fcnet_hiddens.
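
For illustration, a hypothetical model_config_dict following that split could look like the sketch below (the key names mirror the classic Catalog config; the concrete values are made up):

```python
model_config_dict = {
    # 3-layer CNN encoder: [out_channels, [kernel, kernel], stride] per layer.
    "conv_filters": [[32, [8, 8], 4], [64, [4, 4], 2], [64, [3, 3], 1]],
    # Two 512-unit layers on top of the flattened CNN output -- the
    # "[512, 512] head on top" that still belongs to the encoder.
    "post_fcnet_hiddens": [512, 512],
    "post_fcnet_activation": "relu",
    # Each head (advantage and value stream) then gets its own MLP.
    "fcnet_hiddens": [256, 256],
    "fcnet_activation": "relu",
}
```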

Collaborator Author

@sven1977 What you write makes sense. In the case of the new catalog we have no fully connected layers in the CNN encoder; only the MLP encoder has fully connected layers in its hidden and output layers.

I am still wondering because in models/catalog.py we only have fcnet_hiddens and post_fcnet_hiddens to define such layer dimensions. Since we use both in the encoder, the heads can only reuse the same values.

For example: the catalog uses encoder_latent_dim, which is defined to be either the value given in the model_config_dict (i.e., directly by the user) or fcnet_hiddens[-1].
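
As a sketch of that fallback (not the exact catalog code):

```python
# The user-provided encoder_latent_dim wins; otherwise fall back to the
# last entry of fcnet_hiddens.
latent_dim = model_config_dict.get("encoder_latent_dim") or model_config_dict["fcnet_hiddens"][-1]
```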

exploit_actions = action_dist.to_deterministic().sample()

# Apply epsilon-greedy exploration.
# TODO (simon): Implement sampling for nested spaces.
Collaborator Author

@sven1977 How is this actually best managed in the new stack? Do we support this already?

Contributor

We don't yet. After SAC/DQN/APEX-DQN are done, we should write an off-policy algo that can handle nested and arbitrary action spaces.

Collaborator Author

Didn't we also have online algorithms with complex action spaces in the old stack?


# Apply epsilon-greedy exploration.
# TODO (simon): Implement sampling for nested spaces.
# TODO (simon): Implement different epsilon and schedules.
Collaborator Author

@sven1977 For this we need to resolve the model_config_dict (or, better named, the RLModule configuration setup). Right now, epsilon is passed in via the exploration_config, which does not exist in the new stack. Instead, we need to define new parameters there.

Contributor

Yeah, forget about exploration_config. This will not come back :).

For now, let's make epsilon available as a model_config_dict key that behaves similarly to learning_rate_schedule (fixed float OR list of "scheduling" tuples). We then create a ray.rllib.utils.schedules.scheduler::Scheduler object (it already exists and is well tested on the new stack) inside the RLModule and use the (newly added?) timestep to figure out the current epsilon, then use that to sample in forward_exploration (all other forward methods should not use this).
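
A rough sketch of that idea (the schedule helper below is hand-rolled for illustration only; the real implementation would use RLlib's Scheduler utility, and all names here are assumptions):

```python
import torch


def current_epsilon(schedule, timestep: int) -> float:
    """Fixed float OR piecewise-linear interpolation over [(timestep, value), ...]."""
    if isinstance(schedule, float):
        return schedule
    for (t0, v0), (t1, v1) in zip(schedule, schedule[1:]):
        if t0 <= timestep < t1:
            frac = (timestep - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    return schedule[-1][1]


def epsilon_greedy(q_values: torch.Tensor, epsilon: float) -> torch.Tensor:
    """Argmax actions, each replaced by a random action with probability epsilon."""
    greedy = q_values.argmax(dim=-1)
    random_actions = torch.randint_like(greedy, high=q_values.shape[-1])
    explore = torch.rand(greedy.shape) < epsilon
    return torch.where(explore, random_actions, greedy)


# E.g., with a model_config_dict entry like "epsilon": [(0, 1.0), (10_000, 0.05)],
# _forward_exploration() would derive epsilon from the current timestep and sample:
eps = current_epsilon([(0, 1.0), (10_000, 0.05)], timestep=2_500)
actions = epsilon_greedy(torch.randn(4, 6), epsilon=eps)
```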

Contributor

Yes, and let's chat at some point about Module configs in general :)


return output

# TODO (simon): Maybe returning results in a dict of dicts:
Collaborator Author

@sven1977 I am interested in your opinion on using a nested dictionary here (not necessarily a NestedDict ;) ) instead of the plain one.

Contributor

Plain one. I'm trying hard right now to get rid of both NestedDict and SampleBatch/MultiAgentBatch. Wherever normal dicts already work, let's use them to simplify the code base.

Contributor

I think what's left to do:

  • Learner.update() only accepts a MultiAgentBatch
  • RLModule forward passes only accept NestedDict, but I'm not 100% sure; normal dicts might already work here as well. We just have to convert to tensors first, of course, but that's something NestedDict also doesn't do automatically.

Collaborator Author

So far, normal dicts work. I tried this out for PPO, SAC, and DQN Rainbow.

Regarding the output_specs being plain: the ActorCriticEncoder works with a nested one (e.g. output[ENCODER][CRITIC]).

… In addition added timesteps to the 'forward_exploration' calls in the 'EnvRunner's.

Contributor

@sven1977 sven1977 left a comment

LGTM.

…nitions of the other algorithms (PPO, SAC, etc.).

…' such that schedulers on remote workers can be updated.

@sven1977 sven1977 merged commit 73a5264 into ray-project:master Feb 28, 2024
9 checks passed
@simonsays1980 simonsays1980 deleted the dqn-rainbow-rl-module branch February 28, 2024 12:22
Successfully merging this pull request may close these issues.

[RLlib] Enabling RLModule by default on DQN