
[RLlib] SAC on new API stack (w/ EnvRunner and ConnectorV2): training_step() method added to SACAlgorithm. #42571

Merged (18 commits into ray-project:master, Feb 9, 2024)

Conversation

simonsays1980 (Collaborator) commented Jan 22, 2024:

Why are these changes needed?

All major RL algorithms should be transferred over to the new stack and use the 'EnvRunner' for environment interactions.

Related issue number

Closes #37778

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…ining logic for 'SAC' using 'EnvRunner' and the new stack. Furthermore generated an example to tune in succeeding commits.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
simonsays1980 changed the title Added 'training_step' to 'SACAlgorithm' to implement sampling and training with EnvRunner and new stack → Added training_step to SACAlgorithm to implement sampling and training with EnvRunner and new stack (Jan 22, 2024)
sven1977 changed the title Added training_step to SACAlgorithm to implement sampling and training with EnvRunner and new stack → [RLlib] SAC on new API stack (w/ EnvRunner and ConnectorV2): training_step() method added to SACAlgorithm. (Jan 23, 2024)
sven1977 marked this pull request as ready for review on January 23, 2024, 14:20
.training(
    lr=3e-4,
    target_entropy="auto",
    # TODO (simon): Implement n-step.
sven1977 (Contributor):

We should do this inside the episode buffer:

When calling buffer.sample(), the buffer would pick a timestep in some episode and do the n-step on the fly. This way we can:

  • still only store each individual obs/action/reward (saves memory)
  • support n-step with a range of n (picked at random)

simonsays1980 (Collaborator, author), Jan 23, 2024:

As far as I see it, we only need to use batch_length_T=self.config.n_step + 1 in buffer.sample() and then drop all steps between the first and the last, correct?

Do we want to let the buffer know what an n_step is? The way the buffer samples right now, it samples all n_step + 1 timesteps anyway and would need to drop the ones in between as well.

sven1977 (Contributor):

We also need to add all the rewards in between.

I do think it's best to do n-step in the buffer. It's the natural location to do it (right after the raw/space-saving episode data is retrieved). If we did it before, we would have to store all possible combinations of n-step samples. If we did it afterwards (e.g. in training_step), we would send more data through the network (well, at least if the buffer were on another node, e.g. for APEX algos??).

But happy to hear your opinion or objections to this.

simonsays1980 (Collaborator, author):

Your arguments fully make sense. I was thinking of keeping the buffer as free as possible from algorithmic details. I was thinking about a BufferToLearnerConnector that does these kinds of transformations on the sampled data - this one could also run on the nodes, and it would leave the buffer with some very basic things to do: store experiences and sample them (either shuffled or in sequences, with or without prioritization). I see, however, that the way from buffer to learner goes through the algorithm, so at least the naming of the connector would not be that favourable.

The thing I see is that we cannot sample n-step from the buffer without sampling all experiences in the sequence and then dropping the in-between ones after we have accumulated the rewards. And that would also work in a connector, I guess. Another possibility is to provide buffers with connectors (like the EnvRunner) and filter their samples through them.

All of these are simply ideas - not fully thought through. But it could be that in the near future we need customizable sampling from experience buffers.

},
num_steps_sampled_before_learning_starts=256,
model={
"initial_alpha": 1.001,
sven1977 (Contributor):

Ah, I see :) here is the initial_alpha (directly defined inside the model config).

Let's make a TODO that we have to revisit this problem: config properties that the Module must know about need to be written to the model_config_dict automatically.

simonsays1980 (Collaborator, author):

Oh sorry, that one is obsolete. That was before I used the nn.Parameter directly inside of the TorchLearner._get_tensor_variable. Let's behave as if it is not there. I will remove it in the next commit.

simonsays1980 (Collaborator, author):

The initial_alpha is now used in SACLearner.build(), where SACLearner.curr_log_alpha is defined.
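A minimal sketch of that mechanism, assuming the plain PyTorch way of holding a learnable log-alpha (this is not the actual SACLearner code; initial_alpha stands in for the config value):

import torch

initial_alpha = 1.0  # would come from the algorithm config

# The entropy coefficient is optimized in log space: `curr_log_alpha` is the
# learnable parameter, and alpha itself is recovered via exp() in the loss.
curr_log_alpha = torch.nn.Parameter(
    torch.log(torch.tensor(initial_alpha, dtype=torch.float32))
)
alpha = torch.exp(curr_log_alpha)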

)

if self.uses_new_env_runners:
    self.model.update({"initial_alpha": self.initial_alpha})
sven1977 (Contributor):

This should not go here. We should never change the config inside validate() (it should only ever warn or error, but never tamper with the contents of an AlgorithmConfig object).

Actually, let's use this opportunity to fix this problem. Let's brainstorm solutions together in a chat :) How do we get AlgorithmConfig properties/settings into the model_config dict in a user friendly manner?

Also, this parameter should probably go into the SACLearner, it has nothing to do with any RLModule forward pass and is only used in the loss.

simonsays1980 (Collaborator, author):

Yes, I think this will be a general matter as some other algorithms will need it.

simonsays1980 (Collaborator, author):

Yup it is in the SACLearner.


# If `RolloutWorker` (i.e. the old stack) is used, run the DQN training
# step as before.
if use_rollout_worker or not self.config._enable_new_api_stack:
sven1977 (Contributor):

replace w/:

if self.config.uses_new_env_runners:
    ...

simonsays1980 (Collaborator, author):

Just for clarification: uses_new_env_runners can be used here, as the SingleAgentEnvRunners expect an RLModule. What if someone uses the EnvRunner as a base class, but not the RLModule?

sven1977 (Contributor):

Correct. I actually haven't thought about this combination (EnvRunner w/o RLModule). But I don't think we should support it at this point. If EnvRunner AND NOT RolloutWorker -> then there must be an RLModule (always under the property env_runner.module).

sven1977 (Contributor):

But you have a point: EnvRunner is the base class.
I guess self.uses_new_env_runners should only be True if we are using a sub-class of SingleAgentEnvRunner OR MultiAgentEnvRunner.
Hmm, this hybrid stack is getting tough to maintain :p

simonsays1980 (Collaborator, author):

I think we should clearly point out that the new stack contains 'RLModule', 'Learner', and 'AgentEnvRunner's, and that hybrid combinations are not covered. So you either do Policy/RolloutWorker or RLModule/EnvRunner. That makes it cleaner and makes maintenance easier, imo.

sven1977 (Contributor):

Sure, but you already have some users (more or less deeply) invested in this hybrid stack (RLModule + Learner, BUT no EnvRunner), so we don't want to force them to convert their Connectors/Policy/etc. stack to the new API yet.

We can only deprecate the hybrid stack (at least for PPO) once we fully support multi-agent on it (a few days away).

# Run multiple training iterations.
for _ in range(sample_and_train_weight):
    # Sample training batch from replay_buffer.
    train_dict = self.local_replay_buffer.sample(
sven1977 (Contributor):

Let's implement the n-step logic in the existing episode buffer (add an n_step arg to its constructor, allowing for a single int or a 2-tuple (pick uniformly from that range)).

Then the buffer would create the n-step samples on-the-fly during its sample() method: pick an episode, pick a ts in the episode, pick an n-step value from the given range, get the obs at ts+n (this would be the NEXT_OBS), sum up all the rewards between ts and ts+n -> done.
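A rough sketch of that on-the-fly sampling, assuming episodes expose plain observations/actions/rewards lists (this is not the actual EpisodeReplayBuffer code; it also applies per-step discounting, which the loss discussion further below relies on):

import random

def sample_one_n_step(episodes, n_step=(1, 3), gamma=0.99):
    """Pick an episode, a start timestep, and an n; build one n-step transition.

    Assumes the conventional layout where rewards[t] is the reward received
    after taking actions[t] in observations[t].
    """
    eps = random.choice(episodes)
    n = random.randint(*n_step)
    # Last valid start index such that observations[t + n] still exists.
    t = random.randint(0, len(eps["actions"]) - n)
    # Sum up the (discounted) rewards between ts and ts+n.
    rew = sum(gamma**k * eps["rewards"][t + k] for k in range(n))
    return {
        "obs": eps["observations"][t],
        "actions": eps["actions"][t],
        "rewards": rew,
        "new_obs": eps["observations"][t + n],
        "n_step": n,  # needed later to discount the bootstrap term by gamma**n
    }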

simonsays1980 (Collaborator, author), Jan 23, 2024:

I think we need to sum up rewards starting at index 1 instead of 0, correct? The reason is that the episode buffer stores, for each observation, the reward that came with it (i.e. the reward received during the transition to it), so $(o_t, r_t, a_{t-1}, i_t)$ is a single experience sampled from the replay buffer.

So for 1-step we would need $r_{t+1} + \gamma Q(o_{t+1}, a_{t+1})$, and as we sample $a_{t+1}$, we only need two consecutive timesteps sampled from the buffer.

Then for 2-step: $r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(o_{t+2}, a_{t+2})$, where again $a_{t+2}$ is sampled from the action distribution. But here we sum up the rewards from the second and the third consecutive timestep of the 3-tuple $(r_t, r_{t+1}, r_{t+2})$.

sven1977 (Contributor), Jan 23, 2024:

I think index 0 for rewards is still correct.

o0 = reset obs
a0 = action taken after reset
r0 = reward received after taking action a0 (in state o0)
...

So the to-be-returned experience tuple from the buffer for n-step=2 would be:
(obs=o0, act=a0, rew=r0+r1, next_obs=o2)

Does this make sense?

For n-step=1 (default), this would simplify to the "normal" experience tuple:
(obs=o0, act=a0, reward=r0, next_obs=o1)

simonsays1980 (Collaborator, author):

It's correct that, for each observation, we have its corresponding action chosen in that state.

I am not 100% sure, but looking into lines 260 and 265 of the EpisodeReplayBuffer, I think the reward is the one that came with the observation, i.e. for the reset obs o0 it would be a dummy reward of r0=0.0, like:

o0 = reset obs
a0 = action taken after reset
r0 = dummy reward 0.0
o1 = obs after choosing in o0 action a0
a1 = action taken in o1
r1 = reward received AFTER taking action a0 in o0
...
ot = obs after choosing in o(t-1) action a(t-1)
at = action taken in ot
rt = reward received AFTER taking action a(t-1) in o(t-1)


Following this logic the buffer would need to return

(obs=o0, act=a0, rew=r1 + r2, next_obs=o2)

and for the "normal" 1-step:

(obs=o0, act=a0, rew=r1, next_obs=o1)
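A small sketch of the resulting index shift under this convention (hypothetical arrays, not the buffer's actual internals; discounting added for completeness):

# rewards[t] arrived WITH observations[t], i.e. it was received for
# actions[t - 1]; rewards[0] is the 0.0 dummy added at reset.
def n_step_tuple(observations, actions, rewards, t, n, gamma=0.99):
    return {
        "obs": observations[t],
        "actions": actions[t],
        # Start summing at t + 1, not at t: r_{t+1} ... r_{t+n}.
        "rewards": sum(gamma**k * rewards[t + 1 + k] for k in range(n)),
        "new_obs": observations[t + n],
    }

# For n=1 and t=0 this reduces to (obs=o0, act=a0, rew=r1, next_obs=o1).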

simonsays1980 (Collaborator, author):

@sven1977 As this is a crucial stage in all off-policy algorithms, can you take another look at the rewards in my logic above, please? I think we have to use the next rewards when calculating, e.g., targets in Q-learning with this sampling in the EpisodeReplayBuffer.

sven1977 (Contributor):

You are right: this particular EpisodeReplayBuffer does indeed do that (for DreamerV3's very specific needs); it adds a dummy reward (r0) right after o0, and the first actual reward received is then r1.

But either way, we should probably implement a sub-class of this buffer with its own sample method that caters more to SAC's (or off-policy algos' in general) needs, e.g. including the n-step mechanism. What do you think? I feel like the current EpisodeReplayBuffer's sample() method is too much geared toward DreamerV3's train batch requirements.

simonsays1980 (Collaborator, author):

Regarding your comments here and the one below: I will sit down tomorrow on #42369 and rewrite this sampling. I will reuse the SegmentTrees and try to keep as much of the logic as possible from our PrioritizedReplayBuffer.

simonsays1980 (Collaborator, author):

Okay, the prioritized episode buffer with n-step sampling is now ready for review in PR #42832. Let's combine all the parts and apply SAC to the tuned examples.

SampleBatch.REWARDS: train_dict[SampleBatch.REWARDS][:, 1],
SampleBatch.TERMINATEDS: train_dict["is_terminated"][:, 0],
SampleBatch.TRUNCATEDS: train_dict["is_truncated"][:, 0],
"is_first": train_dict["is_first"][:, 0],
sven1977 (Contributor):

I don't think we need these flags here (I believe only DreamerV3 does for now).

simonsays1980 (Collaborator, author):

What about the Q target when the first timestep in the tuple is already the last one? Then the second one comes from the beginning of another episode, correct?

For the masked target Q I used: (1.0 - batch[SampleBatch.TERMINATEDS]) * (1.0 - batch["is_last"]) * q_target_next

sven1977 (Contributor):

We just have to be careful: "is_last" does NOT mean the episode is done. If "is_last" is True, it could mean:

  • we simply are at the end of a batch row (of len=T), but still in the middle of an episode.
  • we are in the middle of a batch row, but the episode was truncated (not terminated!). Hence we should NOT set the q-value target to 0.0 at these locations (but would need to compute the actual Q-values for the observations found there).

But again, I don't think we should use this buffer class, but subclass it and override the sample() method. We can definitely reuse the EpisodeReplayBuffer's handling of adding episode chunks, but its sampling method would not be suitable for SAC, imo.
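A sketch of that distinction in the TD target (tensor names are assumptions, not the actual loss code): only true terminations zero out the bootstrap term, while truncated rows and mere batch-row boundaries keep bootstrapping from the next observation's Q-value.

import torch

def td_target(rewards, terminateds, q_next, gamma=0.99, n_step=1):
    # Mask the bootstrap term only where the episode truly terminated;
    # truncated rows keep their Q-value instead of being forced to 0.0.
    return rewards + gamma**n_step * (1.0 - terminateds.float()) * q_next

# Example: the second item terminated, so only its n-step reward remains.
targets = td_target(
    rewards=torch.tensor([0.5, 1.0]),
    terminateds=torch.tensor([False, True]),
    q_next=torch.tensor([2.0, 3.0]),
)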

"is_last": train_dict["is_last"][:, 0],
}
)
train_batch = train_batch.as_multi_agent()
sven1977 (Contributor):

Wouldn't Learner do this multi-agent conversion already automatically?

simonsays1980 (Collaborator, author):

Hmmm, good question. I'll check.

simonsays1980 (Collaborator, author):

Nope, LearnerGroup._update() expects a MultiAgentBatch. We could put the conversion into LearnerGroup._update().

sven1977 (Contributor):

Yeah, let's put it into LearnerGroup._update().

Only if batch is not None should we check and convert if necessary.
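Roughly what that guard could look like (a sketch only, assuming LearnerGroup._update() receives the batch as an argument; not the actual implementation):

from ray.rllib.policy.sample_batch import MultiAgentBatch

def ensure_multi_agent(batch):
    """Convert a (possibly single-agent) batch only if one was passed in."""
    if batch is not None and not isinstance(batch, MultiAgentBatch):
        batch = batch.as_multi_agent()
    return batch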

simonsays1980 (Collaborator, author):

@sven1977 Moving this into LearnerGroup._update() does not work b/c when using LearnerGroup.update_from_batch(), the policy batches are already iterated over there.

There is a comment to move this iteration into the Learners. Should we make this move now?

simonsays1980 (Collaborator, author):

We cannot move this to LearnerGroup._update(), as LearnerGroup.update_from_batch() already iterates over policy_batches.

There is a comment that we should move this filtering into the individual Learners. Should we do it now? Where do we best place it, Learner.update_from_batch()?

sven1977 (Contributor):

Yeah, happy to make this change now. Let's do this in Learner.

simonsays1980 (Collaborator, author):

After taking a deeper look into it, it appears to me that we have to invest more thought into where exactly, as Learner.update_on_batch_or_episodes() already makes use of MultiAgentBatches and throws an error if the MA-batch does not contain the module. On the other side, there is this comment in the LearnerGroup:

# TODO (sven): Move this filtering of input data into individual Learners.
#  It might be that the postprocessing of batch/episodes on each Learner
#  requires the non-trainable modules' data. 

which tells me that there might be use cases where the data from agents with one module is needed in the Learner of other modules - which would then make the error thrown in the Learner obsolete. So let's take a minute in a chat to think about which use cases we want to support and then adapt the Learner correspondingly.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
sven1977 (Contributor):
Could we also have a new learning test for the added pendulum tuned_example file in the rllib/BUILD file?
We probably shouldn't merge this before it's confirmed learning.

sven1977 self-assigned this on Jan 25, 2024
simonsays1980 (Collaborator, author):
Could we also have a new learning test for the added pendulum tuned_example file in the rllib/BUILD file? We probably shouldn't merge this before it's confirmed learning.

Yes, we definitely should. But we first have to merge the different branches of SAC, i.e. SACLearner, SACModule, SACAlgorithm - and ideally also the PrioritizedEpisodeReplayBuffer (PERB). Then we can run the new stack on the tuned example test.

…e not used.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…he different modules together in the 'training_step' of 'SAC'. Wrote a reduce function to get the TD-errors out of the train results and modified the utility function for updating priorities.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…oid index errors.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ed'.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ld stack example.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ce' is not anymore available by 'air'.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…s changed in favour of concatenating episode chunks in the replay buffers when they were already finalized but not done).

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
for _ in range(store_weight):
# Otherwise, we use the new stack.
with self._timers[SAMPLE_TIMER]:
if self.workers.num_remote_workers() <= 0:
sven1977 (Contributor):

Actually, let's fix this for good: let's use the existing (old-stack) synchronous_parallel_sample() method from rllib/execution/rollout_ops.py.

We will have to slightly enhance this method to support the new stack (with EnvRunners returning episodes instead of batches). But the advantage will then be that we get a fault-tolerant parallel sampling step!

We can then also change PPO's training_step method to use this utility again for the new stack.
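For reference, a sketch of how the old-stack utility is typically invoked (wrapped in a helper here for illustration; keyword names follow the existing rollout_ops module, and on the new stack it would have to return lists of episodes instead of a concatenated batch):

from ray.rllib.execution.rollout_ops import synchronous_parallel_sample

def sample_train_batch(workers, train_batch_size):
    # Fault-tolerant, parallel sampling across all remote (and/or the local)
    # workers of the given WorkerSet.
    return synchronous_parallel_sample(
        worker_set=workers,
        max_env_steps=train_batch_size,
        concat=True,
    )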

simonsays1980 (Collaborator, author):

Alright. Shall we do this in a new PR?

sven1977 (Contributor):

Fine with me, but let's add a TODO here, then.

local_worker=False,
)
# Perform SAC postprocessing (Prio weights) on a (flattened)
# list of Episodes.
sven1977 (Contributor):

erase this comment?

simonsays1980 (Collaborator, author):

Do we need to do any preprocessing here? Does a user still pass in processing functions in the new stack?

# TODO (simon): Implement n-step adjustment.
# + (self.config["gamma"] ** self.config["n_step"]) * q_next_masked
+ self.config.gamma * q_next_masked
+ (self.config.gamma**self.config.n_step) * q_next_masked
sven1977 (Contributor):

Nice!

I see, so supporting a uniform n-step range (uniformly sampling n-step values between x and y) will not be that simple, I guess, b/c of this loss term here. We would have to know, for each item in the batch, what the exact n-step was.

simonsays1980 (Collaborator, author):

Exactly. We would need to provide this in EpisodeReplayBuffer.sample(). But this is not impossible; in fact, it could be added easily. We just need to default it to n_step, and if an n-step value is sampled, it gets overridden.
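A sketch of how such a per-item n_step could enter the target (tensor names and values are purely illustrative):

import torch

gamma = 0.99
rewards = torch.tensor([0.1, 0.7, 0.3])       # already-summed n-step returns
n_steps = torch.tensor([1.0, 3.0, 2.0])       # actual n used per batch item
terminateds = torch.tensor([0.0, 0.0, 1.0])
q_next = torch.tensor([1.5, 2.0, 0.8])

# Discount the bootstrap term per item with the n that was actually sampled.
q_target = rewards + gamma**n_steps * (1.0 - terminateds) * q_next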

@@ -260,7 +250,8 @@ def compute_loss_for_module(
critic_loss = torch.mean(
# TODO (simon): Introduce priority weights when episode buffer is ready.
# batch[PRIO_WEIGHTS] *
torch.nn.HuberLoss(reduction="none", delta=1.0)(
batch["weights"]
sven1977 (Contributor):

Why don't we use the PRIO_WEIGHTS constant anymore?

simonsays1980 (Collaborator, author):

Yeah, I changed this b/c I was of the opinion that this is, in general, a weighted loss, and PRIO_WEIGHTS is a bit misleading to me, as these are actually importance sampling weights. But I am fine with both - it is simply a suggestion.

We could also define a new constant like WEIGHTS that could be reused by any algorithm that weights the items in its loss.

sven1977 (Contributor):

Got it, makes sense! Another option could be "loss_mask", which I think I'm already using in PPO. It's a weighting between 0.0 and 1.0, correct?
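A small sketch of why these are importance-sampling weights rather than a 0/1 mask (values are illustrative; the constant name was still under discussion here):

import torch

td_error = torch.tensor([0.2, -1.5, 0.4])
# Importance-sampling correction from prioritized replay: real-valued,
# typically in (0, 1], whereas a loss mask would only contain 0.0 or 1.0.
weights = torch.tensor([0.9, 0.4, 1.0])

huber = torch.nn.HuberLoss(reduction="none", delta=1.0)
critic_loss = torch.mean(weights * huber(td_error, torch.zeros_like(td_error)))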

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
sven1977 (Contributor) left a review:

LGTM now. Thanks @simonsays1980 !

sven1977 (Contributor) commented Feb 7, 2024:

Some learning tests are failing. Could you take a look? cc: @simonsays1980

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
sven1977 merged commit ee87ece into ray-project:master on Feb 9, 2024
9 checks passed
sven1977 (Contributor) commented Feb 9, 2024:

Awesome work @simonsays1980 !! Thanks for this PR.

Successfully merging this pull request may close these issues:

  • [RLlib] Enabling RLModule by default on SAC

2 participants: sven1977, simonsays1980