[RLlib] DQN Rainbow on new API stack: RLModule and Catalog together with TorchNoisyMLP. #43199
Conversation
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
… Epsilon-greedy. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
… noisy layers. Introduced a new 'TorchNoisyMLP' that includes a 'NoisyLinear' layer in the same way as 'nn.Linear', such that we can keep the design we have for the new stack. Furthermore, included '**kwargs' in the ctor of 'PrioritizedReplayBuffer' such that arguments in 'replay_buffer_config' don't raise an exception. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
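The commit above describes a NoisyLinear-style layer (the noisy-networks idea from Fortunato et al.). As a minimal, hedged sketch of a factorized-Gaussian noisy linear layer, the class and attribute names below are illustrative assumptions, not the PR's actual implementation:

```python
import math
import torch
from torch import nn


class NoisyLinearSketch(nn.Module):
    """Minimal factorized-Gaussian noisy linear layer (illustration only)."""

    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable means and noise scales for weights and biases.
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.w_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.b_mu = nn.Parameter(torch.empty(out_features))
        self.b_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.w_mu, -bound, bound)
        nn.init.uniform_(self.b_mu, -bound, bound)
        nn.init.constant_(self.w_sigma, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.b_sigma, sigma0 / math.sqrt(in_features))

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        # Noise-shaping function from the noisy-nets paper: f(x) = sgn(x) * sqrt(|x|).
        return x.sign() * x.abs().sqrt()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Factorized noise: one noise vector per input dim, one per output dim.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.w_mu + self.w_sigma * torch.outer(eps_out, eps_in)
        bias = self.b_mu + self.b_sigma * eps_out
        return nn.functional.linear(x, weight, bias)


layer = NoisyLinearSketch(4, 2)
out = layer(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 2])
```

Because the noise scales are learnable parameters, the module can drop-in replace an `nn.Linear` of the same shape, which is what lets the new-stack encoder/head design stay unchanged.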
…lay_buffer.py' to branch 'dqn-rainbow-training-step'. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
@@ -303,7 +303,7 @@ def build(self) -> None:
        # the user the option to run on the gpu of their choice, so we enable that
        # option here via the local gpu id scaling config parameter.
        if self._distributed:
-           devices = get_devices()
+           devices = get_device()
            assert len(devices) == 1, (
                "`get_devices()` should only return one cuda device, "
fix this comment here as well: get_device() should only return one ....
@abstractmethod
@OverrideToImplementCustomLogic
def _qf(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType]:
Nice and clean API. Like it! :)
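The `_qf` hook above is an abstract, override-to-customize method. As a hedged sketch of the pattern only (the surrounding class, the `TensorType` alias, and the returned key are assumptions for illustration, not RLlib's actual classes):

```python
from abc import ABC, abstractmethod
from typing import Dict

# Stand-in for RLlib's TensorType; purely illustrative.
TensorType = float


class QModuleSketch(ABC):
    """Sketch of an RLModule-style base class exposing an abstract Q-function."""

    @abstractmethod
    def _qf(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType]:
        """Compute Q-values for a batch (overridden per framework/architecture)."""


class ConstantQModule(QModuleSketch):
    def _qf(self, batch: Dict[str, TensorType]) -> Dict[str, TensorType]:
        # Toy override: return a constant "Q-value" regardless of the batch.
        return {"qf_preds": 0.0}


print(ConstantQModule()._qf({"obs": 1.0}))  # {'qf_preds': 0.0}
```

The appeal of this API is that each concrete module (plain DQN head, dueling head, distributional head) only has to supply its own `_qf` while the rest of the forward logic stays shared.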
…remaining input args to the configs of the catalog. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ded all remaining arguments to the encoder and head configs in the catalog. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
I just left some comments for discussion to bring this module into its final version.
input_dims=self.latent_dims,
hidden_layer_dims=self._model_config_dict["post_fcnet_hiddens"],
hidden_layer_activation=self.af_and_vf_head_activation,
# TODO (simon): No clue where we get this from.
@sven1977 Where do we actually get these config settings? These are not given in models/catalog.py.
I think they only exist in the new model config objects, so thus far, they are not really reachable by users going through the traditional Catalog model_config_dict.
output_layer_activation="linear",
output_layer_dim=output_layer_dim,
# TODO (simon): Check where we get this from.
# output_layer_use_bias=self._model_config_dict["output_layer_use_bias"],
Same here.
output_layer_use_bias=self._model_config_dict["output_layer_use_bias"],
# TODO (sven, simon): Should these initializers rather the fcnet
@sven1977 Do we really use the post_fcnet settings here in the encoder already, or should these only go into the heads? I see this also in the default settings in core/models/catalog.py:get_encoder_config and in the other new-stack algorithms. Let's make a decision here and go with it from here.
I would think so.
Let's say the encoder is a 3-layer CNN with an additional [512, 512] head on top (after flattening the last CNN layer's output). Then we need the post_fcnet_hiddens here to set up these last two 512-layers. Each head (Af, Vf) can then still have its own config with its own fcnet_hiddens.
@sven1977 Makes sense. In the case of the new catalog, we have no fully connected layers in the CNN encoder; only the MLP encoder has fully connected hidden and output layers.
I am still wondering because in models/catalog.py we only have fcnet_hiddens and post_fcnet_hiddens where one can define such layer dimensions. As we use both in the encoder, the heads can only use the same.
For example: in the catalog we use the encoder_latent_dim, which is defined to be either the value given in the model_config_dict (so directly by the user) or fcnet_hiddens[-1].
exploit_actions = action_dist.to_deterministic().sample()

# Apply epsilon-greedy exploration.
# TODO (simon): Implement sampling for nested spaces.
@sven1977 How is this actually best managed in the new stack? Do we support this already?
We don't yet. After SAC/DQN/APEX-DQN are done, we should write an off-policy algo that can handle nested and arbitrary action spaces.
Didn't we also have online algorithms with complex action spaces in the old stack?
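For the flat (non-nested) case the discussion is about, epsilon-greedy selection itself is simple. A minimal, hedged sketch over a single action space; the function name and signature are illustrative, not RLlib's API:

```python
import random


def epsilon_greedy(q_values: list, epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick a uniformly random action index,
    otherwise the greedy (argmax-Q) one. Sketch only, flat spaces only."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy: index of the maximal Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
# With epsilon=0.0 the choice is always greedy.
print(epsilon_greedy([0.1, 0.9, 0.3], 0.0, rng))  # 1
```

Supporting nested spaces (the TODO above) would mean applying this leaf-wise over a dict/tuple space, which is exactly the part that does not exist yet.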
# Apply epsilon-greedy exploration.
# TODO (simon): Implement sampling for nested spaces.
# TODO (simon): Implement different epsilon and schedules.
@sven1977 For this we need to resolve the model_config_dict (or, better named, RLModule configuration) setup. Right now epsilon is passed in via the exploration_config, which does not exist for the new stack. Instead, we need to define new parameters there.
Yeah, forget about exploration_config. It will not come back :) .
For now, let's make epsilon available as a model_config_dict key that behaves similarly to learning_rate_schedule (fixed float OR list of "scheduling" tuples). We then create a ray.rllib.utils.schedules.scheduler::Scheduler object (which already exists and is well tested in the new stack) inside the RLModule, use the (newly added?) timestep to figure out the current epsilon, and then use that to sample in forward_exploration (all other forward methods should not use this).
Yes, and let's chat at some point about Module configs in general :)
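The "fixed float OR list of scheduling tuples" idea suggested above can be sketched as follows. This is a hedged illustration with linear interpolation between breakpoints; RLlib's actual Scheduler class may resolve specs differently:

```python
def epsilon_at(spec, timestep: int) -> float:
    """Resolve an epsilon spec at a given timestep.

    spec is either a fixed float, or a list of (timestep, value) breakpoints
    that are linearly interpolated (sketch of the scheduling-tuples idea;
    not RLlib's actual Scheduler API).
    """
    if isinstance(spec, (int, float)):
        return float(spec)
    points = sorted(spec)
    if timestep <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if timestep <= t1:
            frac = (timestep - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    # Past the last breakpoint: hold the final value.
    return points[-1][1]


schedule = [(0, 1.0), (10_000, 0.05)]
print(epsilon_at(schedule, 5_000))  # 0.525
```

The RLModule's forward_exploration would then call something like this with the current timestep and feed the resulting epsilon into the action-sampling step.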
return output

# TODO (simon): Maybe returning results in a dict of dicts:
@sven1977 I am interested in your opinion on using a nested dictionary here (not necessarily a NestedDict ;) ) instead of the plain one.
Plain one. I'm trying hard right now to get rid of both NestedDict and SampleBatch/MultiAgentBatch. Wherever normal dicts already work, let's use them to simplify the code base.
I think what's left to do:
- Learner.update() only accepts a MultiAgentBatch.
- RLModule forward passes only accept NestedDict, but I'm not 100% sure; normal dicts might also work already here. We just have to convert to tensors first, of course, but that's something NestedDict also does not do automatically.
So far, normal dicts work. I tried this out for PPO, SAC, and DQN Rainbow.
In regard to the output_specs being plain: the ActorCriticEncoder works with a nested one (e.g. output[ENCODER][CRITIC]).
… In addition added timesteps to the 'forward_exploration' calls in the 'EnvRunner's. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
LGTM.
…nitions of the other algorithms (PPO, SAC, etc.). Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…' such that schedulers on remote workers can be updated. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Why are these changes needed?
This is the third part of moving the DQN Rainbow algorithm to our new stack and towards using the EnvRunner API. See #43196 and #43198 for the other parts. This PR introduces the DQNRainbowRLModule together with its catalog. The module implements a dueling architecture and distributional Q-learning. Furthermore, it comes with noisy networks, introducing a new TorchNoisyMLP (in Encoder and Head versions) that makes use of a NoisyLinear layer similar to nn.Linear, so that we can use the same design as for the other torch encoders and heads. The corresponding configurations for these networks are introduced as well.

Related issue number
Closes #37777
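The dueling architecture mentioned in the description combines a state-value stream with a mean-centered advantage stream. As a hedged sketch of the standard aggregation (plain Python, not the PR's actual tensor code):

```python
def dueling_q(value: float, advantages: list) -> list:
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Subtracting the mean advantage keeps V and A identifiable (sketch only;
    the real module operates on batched tensors per action)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


print(dueling_q(1.0, [0.0, 1.0, 2.0]))  # [0.0, 1.0, 2.0]
```

In the distributional variant, the same aggregation is applied per support atom before the softmax over the return distribution.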
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.