
[RLlib] SAC on new API stack (w/ EnvRunner and ConnectorV2): training_step() method added to SACAlgorithm. #42571

Merged (18 commits into ray-project:master, Feb 9, 2024)

Conversation

simonsays1980 (Collaborator) commented Jan 22, 2024:

Why are these changes needed?

All major RL algorithms should be transferred over to the new stack and use the 'EnvRunner' for environment interactions.

Related issue number

Closes #37778

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…ining logic for 'SAC' using 'EnvRunner' and the new stack. Furthermore generated an example to tune in succeeding commits.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
simonsays1980 changed the title Added 'training_step' to 'SACAlgorithm' to implement sampling and training with EnvRunner and new stack → Added training_step to SACAlgorithm to implement sampling and training with EnvRunner and new stack (Jan 22, 2024)
sven1977 changed the title Added training_step to SACAlgorithm to implement sampling and training with EnvRunner and new stack → [RLlib] SAC on new API stack (w/ EnvRunner and ConnectorV2): training_step() method added to SACAlgorithm. (Jan 23, 2024)
sven1977 marked this pull request as ready for review on January 23, 2024, 14:20
.training(
    lr=3e-4,
    target_entropy="auto",
    # TODO (simon): Implement n-step.
sven1977 (Contributor):

We should do this inside the episode buffer:

When calling buffer.sample(), the buffer would pick a timestep in some episode and do the n-step on the fly. This way we can:

  • still only store each individual obs/action/reward (saves memory)
  • support n-step with a range of n (picked at random)

simonsays1980 (Collaborator, author), Jan 23, 2024:

As far as I see it, we only need to use batch_length_T=self.config.n_step + 1 in buffer.sample() and then drop all steps between the first and the last, correct?

Do we want to let the buffer know what an n_step is? The way the buffer samples right now, it samples all n_step + 1 timesteps anyway and would need to drop the ones in between as well.

sven1977 (Contributor):

We also need to add all the rewards in between.

I do think it's best to do n-step in the buffer. It's the natural location to do it (right after the raw/space-saving episode data is retrieved). If we did it before, we would have to store all possible combinations of n-step samples. If we did it afterwards (e.g. in training_step), we would send more data through the network (well, at least if the buffer were on another node, e.g. for APEX algos??).

But happy to hear your opinion or objections to this.

simonsays1980 (Collaborator, author):

Your arguments fully make sense. I was thinking of keeping the buffer as free as possible from algorithmic details. I was thinking about a BufferToLearnerConnector that does these kinds of transformations on the sampled data - this one could also run on the nodes, and it would leave the buffer with some very basic things to do: store experiences and sample them (either shuffled or in sequences, with or without prioritization). I see, however, that the way from buffer to learner goes through the algorithm, so at least the naming of the connector would not be that favourable.

The thing I see is that we cannot sample n-step from the buffer without sampling all experiences in the sequence and then dropping the in-between ones after we have accumulated the rewards. And that would also work in a connector, I guess. Another possibility is to provide buffers with connectors (like the EnvRunner) and filter their samples through them.

All of these are simply ideas - not fully thought through. But it could be that in the near future we need customizable sampling from experience buffers.

},
num_steps_sampled_before_learning_starts=256,
model={
"initial_alpha": 1.001,
sven1977 (Contributor):

Ah, I see :) here is the initial_alpha (directly defined inside the model config).

Let's make a TODO that we have to revisit this problem: config properties that the Module must know about need to be written to the model_config_dict automatically.

simonsays1980 (Collaborator, author):

Oh sorry, that one is obsolete. That was before I used the nn.Parameter directly inside of the TorchLearner._get_tensor_variable. Let's behave as if it is not there. I will remove it in the next commit.

simonsays1980 (Collaborator, author):

The initial_alpha is now used in SACLearner.build(), where SACLearner.curr_log_alpha is defined.
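A minimal sketch of that mechanism, assuming the plain PyTorch way of holding a learnable log-alpha (this is not the actual SACLearner code; initial_alpha stands in for the config value):

import torch

initial_alpha = 1.0  # would come from the algorithm config

# The entropy coefficient is optimized in log space: `curr_log_alpha` is the
# learnable parameter, and alpha itself is recovered via exp() in the loss.
curr_log_alpha = torch.nn.Parameter(
    torch.log(torch.tensor(initial_alpha, dtype=torch.float32))
)
alpha = torch.exp(curr_log_alpha)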

)

if self.uses_new_env_runners:
    self.model.update({"initial_alpha": self.initial_alpha})
sven1977 (Contributor):

This should not go here. We should never change the config inside validate() (it should only ever warn or error, but never tamper with the contents of an AlgorithmConfig object).

Actually, let's use this opportunity to fix this problem. Let's brainstorm solutions together in a chat :) How do we get AlgorithmConfig properties/settings into the model_config dict in a user friendly manner?

Also, this parameter should probably go into the SACLearner, it has nothing to do with any RLModule forward pass and is only used in the loss.

simonsays1980 (Collaborator, author):

Yes, I think this will be a general matter as some other algorithms will need it.

simonsays1980 (Collaborator, author):

Yup it is in the SACLearner.


# If `RolloutWorker` (i.e. the old stack) is used, run the DQN training
# step as before.
if use_rollout_worker or not self.config._enable_new_api_stack:
sven1977 (Contributor):

replace w/:

if self.config.uses_new_env_runners:
    ...

simonsays1980 (Collaborator, author):

Just for clarification: uses_new_env_runners can be used here, as the SingleAgentEnvRunners expect an RLModule. What if someone uses the EnvRunner as a base class, but not the RLModule?

sven1977 (Contributor):

Correct. I actually haven't thought about this combination (EnvRunner w/o RLModule). But I don't think we should support it at this point. If EnvRunner AND NOT RolloutWorker -> then there must be an RLModule (always under the property env_runner.module).

sven1977 (Contributor):

But you have a point: EnvRunner is the base class.
I guess self.uses_new_env_runners should only be True if we are using a sub-class of SingleAgentEnvRunner OR MultiAgentEnvRunner.
Hmm, this hybrid stack is getting tough to maintain :p

simonsays1980 (Collaborator, author):

I think we should clearly point out that the new stack contains 'RLModule', 'Learner', and 'AgentEnvRunner's, and that hybrid combinations are not covered. So you either do Policy/RolloutWorker or RLModule/EnvRunner. That makes it cleaner and makes maintenance easier, imo.

sven1977 (Contributor):

Sure, but you already have some users (more or less deeply) invested in this hybrid stack (RLModule + Learner, BUT no EnvRunner), so we don't want to force them to convert their Connectors/Policy/etc. stack to the new API yet.

We can only deprecate the hybrid stack (at least for PPO) once we fully support multi-agent on it (a few days away).

# Run multiple training iterations.
for _ in range(sample_and_train_weight):
    # Sample training batch from replay_buffer.
    train_dict = self.local_replay_buffer.sample(
sven1977 (Contributor):

Let's implement the n-step logic in the existing episode buffer (add an n_step arg to its constructor, allowing for a single int or a 2-tuple (pick uniformly from that range)).

Then the buffer would create the n-step samples on-the-fly during its sample() method: pick an episode, pick a ts in the episode, pick an n-step value from the given range, get the obs at ts+n (this would be the NEXT_OBS), sum up all the rewards between ts and ts+n -> done.
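A rough sketch of that on-the-fly sampling, assuming episodes expose plain observations/actions/rewards lists (this is not the actual EpisodeReplayBuffer code; it also applies per-step discounting, which the loss discussion further below relies on):

import random

def sample_one_n_step(episodes, n_step=(1, 3), gamma=0.99):
    """Pick an episode, a start timestep, and an n; build one n-step transition.

    Assumes the conventional layout where rewards[t] is the reward received
    after taking actions[t] in observations[t].
    """
    eps = random.choice(episodes)
    n = random.randint(*n_step)
    # Last valid start index such that observations[t + n] still exists.
    t = random.randint(0, len(eps["actions"]) - n)
    # Sum up the (discounted) rewards between ts and ts+n.
    rew = sum(gamma**k * eps["rewards"][t + k] for k in range(n))
    return {
        "obs": eps["observations"][t],
        "actions": eps["actions"][t],
        "rewards": rew,
        "new_obs": eps["observations"][t + n],
        "n_step": n,  # needed later to discount the bootstrap term by gamma**n
    }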

simonsays1980 (Collaborator, author), Jan 23, 2024:

I think we need to sum up rewards starting at index 1 instead of 0, correct? The reason is that the episode buffer stores, for each observation, the reward that came with it (i.e. the reward received during the transition to it), so $(o_t, r_t, a_{t-1}, i_t)$ is a single experience sampled from the replay buffer.

So for 1-step we would need $r_{t+1} + \gamma Q(o_{t+1}, a_{t+1})$, and as we sample $a_{t+1}$, we only need two consecutive timesteps sampled from the buffer.

Then for 2-step: $r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(o_{t+2}, a_{t+2})$, where again $a_{t+2}$ is sampled from the action distribution. But here we sum up the rewards from the second and the third consecutive timestep of the 3-tuple $(r_t, r_{t+1}, r_{t+2})$.

sven1977 (Contributor), Jan 23, 2024:

I think index 0 for rewards is still correct.

o0 = reset obs
a0 = action taken after reset
r0 = reward received after taking action a0 (in state o0)
...

So the to-be-returned experience tuple from the buffer for n-step=2 would be:
(obs=o0, act=a0, rew=r0+r1, next_obs=o2)

Does this make sense?

For n-step=1 (default), this would simplify to the "normal" experience tuple:
(obs=o0, act=a0, reward=r0, next_obs=o1)

simonsays1980 (Collaborator, author):

It's correct that, for each observation, we have its corresponding action chosen in that state.

I am not 100% sure, but looking into lines 260 and 265 of the EpisodeReplayBuffer, I think the reward is the one that came with the observation, i.e. for the reset obs o0 it would be a dummy reward of r0=0.0, like:

o0 = reset obs
a0 = action taken after reset
r0 = dummy reward 0.0
o1 = obs after choosing in o0 action a0
a1 = action taken in o1
r1 = reward received AFTER taking action a0 in o0
...
ot = obs after choosing in o(t-1) action a(t-1)
at = action taken in ot
rt = reward received AFTER taking action a(t-1) in o(t-1)


Following this logic the buffer would need to return

(obs=o0, act=a0, rew=r1 + r2, next_obs=o2)

and for the "normal" 1-step:

(obs=o0, act=a0, rew=r1, next_obs=o1)
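A small sketch of the resulting index shift under this convention (hypothetical arrays, not the buffer's actual internals; discounting added for completeness):

# rewards[t] arrived WITH observations[t], i.e. it was received for
# actions[t - 1]; rewards[0] is the 0.0 dummy added at reset.
def n_step_tuple(observations, actions, rewards, t, n, gamma=0.99):
    return {
        "obs": observations[t],
        "actions": actions[t],
        # Start summing at t + 1, not at t: r_{t+1} ... r_{t+n}.
        "rewards": sum(gamma**k * rewards[t + 1 + k] for k in range(n)),
        "new_obs": observations[t + n],
    }

# For n=1 and t=0 this reduces to (obs=o0, act=a0, rew=r1, next_obs=o1).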

simonsays1980 (Collaborator, author):

@sven1977 As this is a crucial stage in all off-policy algorithms, can you take another look at the rewards in my logic above, please? I think we have to use the next rewards when calculating, e.g., targets in Q-learning with this sampling in the EpisodeReplayBuffer.

sven1977 (Contributor):

You are right: this particular EpisodeReplayBuffer does indeed do that (for DreamerV3's very specific needs); it adds a dummy reward (r0) right after o0, and the first actual reward received is then r1.

But either way, we should probably implement a sub-class of this buffer with its own sample method that caters more to SAC's (or off-policy algos' in general) needs, e.g. including the n-step mechanism. What do you think? I feel like the current EpisodeReplayBuffer's sample() method is too much geared toward DreamerV3's train batch requirements.

simonsays1980 (Collaborator, author):

Regarding your comments here and the one below: I will sit down tomorrow on #42369 and rewrite this sampling. I will reuse the SegmentTrees and try to keep as much of the logic as possible from our PrioritizedReplayBuffer.

simonsays1980 (Collaborator, author):

Okay, the prioritized episode buffer with n-step sampling is now ready for review in PR #42832. Let's combine all the parts and apply SAC to the tuned examples.

SampleBatch.REWARDS: train_dict[SampleBatch.REWARDS][:, 1],
SampleBatch.TERMINATEDS: train_dict["is_terminated"][:, 0],
SampleBatch.TRUNCATEDS: train_dict["is_truncated"][:, 0],
"is_first": train_dict["is_first"][:, 0],
sven1977 (Contributor):

I don't think we need these flags here (I believe only DreamerV3 does for now).

simonsays1980 (Collaborator, author):

What about the Q target when the first timestep in the tuple is already the last one? Then the second one comes from the beginning of another episode, correct?

For the masked target Q I used: (1.0 - batch[SampleBatch.TERMINATEDS]) * (1.0 - batch["is_last"]) * q_target_next

sven1977 (Contributor):

We just have to be careful: "is_last" does NOT mean the episode is done. If "is_last" is True, it could mean:

  • we simply are at the end of a batch row (of len=T), but still in the middle of an episode.
  • we are in the middle of a batch row, but the episode was truncated (not terminated!). Hence we should NOT set the q-value target to 0.0 at these locations (but would need to compute the actual Q-values for the observations found there).

But again, I don't think we should use this buffer class, but subclass it and override the sample() method. We can definitely reuse the EpisodeReplayBuffer's handling of adding episode chunks, but its sampling method would not be suitable for SAC, imo.
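A sketch of that distinction in the TD target (tensor names are assumptions, not the actual loss code): only true terminations zero out the bootstrap term, while truncated rows and mere batch-row boundaries keep bootstrapping from the next observation's Q-value.

import torch

def td_target(rewards, terminateds, q_next, gamma=0.99, n_step=1):
    # Mask the bootstrap term only where the episode truly terminated;
    # truncated rows keep their Q-value instead of being forced to 0.0.
    return rewards + gamma**n_step * (1.0 - terminateds.float()) * q_next

# Example: the second item terminated, so only its n-step reward remains.
targets = td_target(
    rewards=torch.tensor([0.5, 1.0]),
    terminateds=torch.tensor([False, True]),
    q_next=torch.tensor([2.0, 3.0]),
)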

"is_last": train_dict["is_last"][:, 0],
}
)
train_batch = train_batch.as_multi_agent()
sven1977 (Contributor):

Wouldn't Learner do this multi-agent conversion already automatically?

simonsays1980 (Collaborator, author):

Hmmm, good question. I'll check.

simonsays1980 (Collaborator, author):

Nope, LearnerGroup._update() expects a MultiAgentBatch. We could put the conversion into LearnerGroup._update().

sven1977 (Contributor):

Yeah, let's put it into LearnerGroup._update().

Only if batch is not None should we check and convert if necessary.
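Roughly what that guard could look like (a sketch only, assuming LearnerGroup._update() receives the batch as an argument; not the actual implementation):

from ray.rllib.policy.sample_batch import MultiAgentBatch

def ensure_multi_agent(batch):
    """Convert a (possibly single-agent) batch only if one was passed in."""
    if batch is not None and not isinstance(batch, MultiAgentBatch):
        batch = batch.as_multi_agent()
    return batch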

simonsays1980 (Collaborator, author):

@sven1977 Moving this into LearnerGroup._update() does not work b/c when using LearnerGroup.update_from_batch(), the policy batches are already iterated over there.

There is a comment to move this iteration into the Learners. Should we make this move now?

simonsays1980 (Collaborator, author):

We cannot move this to LearnerGroup._update(), as LearnerGroup.update_from_batch() already iterates over policy_batches.

There is a comment that we should move this filtering into the individual Learners. Should we do it now? Where do we best place it, Learner.update_from_batch()?

sven1977 (Contributor):

Yeah, happy to make this change now. Let's do this in Learner.

simonsays1980 (Collaborator, author):

After taking a deeper look into it, it appears to me that we have to invest more thought into where exactly, as Learner.update_on_batch_or_episodes() already makes use of MultiAgentBatches and throws an error if the MA-batch does not contain the module. On the other side, there is this comment in the LearnerGroup:

# TODO (sven): Move this filtering of input data into individual Learners.
#  It might be that the postprocessing of batch/episodes on each Learner
#  requires the non-trainable modules' data. 

which tells me that there might be use cases where the data from agents with one module is needed in the Learner of other modules - which would then make the error thrown in the Learner obsolete. So let's take a minute in a chat to think about which use cases we want to support and then adapt the Learner correspondingly.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
sven1977 (Contributor):
Could we also have a new learning test for the added pendulum tuned_example file in the rllib/BUILD file?
We probably shouldn't merge this before it's confirmed learning.

sven1977 self-assigned this on Jan 25, 2024
simonsays1980 (Collaborator, author):
Could we also have a new learning test for the added pendulum tuned_example file in the rllib/BUILD file? We probably shouldn't merge this before it's confirmed learning.

Yes, we definitely should. But we first have to merge the different branches of SAC, i.e. SACLearner, SACModule, SACAlgorithm - and ideally also the PrioritizedEpisodeReplayBuffer (PERB). Then we can run the new stack on the tuned example test.

…e not used.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…he different modules together in the 'training_step' of 'SAC'. Wrote a reduce function to get the TD-errors out of the train results and modified the utility function for updating priorities.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…oid index errors.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ed'.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ld stack example.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ce' is not anymore available by 'air'.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…s changed in favour of concatenating episode chunks in the replay buffers when they were already finalized but not done).

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
for _ in range(store_weight):
# Otherwise, we use the new stack.
with self._timers[SAMPLE_TIMER]:
if self.workers.num_remote_workers() <= 0:
sven1977 (Contributor):

Actually, let's fix this for good: let's use the existing (old-stack) synchronous_parallel_sample() method from rllib/execution/rollout_ops.py.

We will have to slightly enhance this method to support the new stack (with EnvRunners returning episodes instead of batches). But the advantage will then be that we get a fault-tolerant parallel sampling step!

We can then also change PPO's training_step method to use this utility again for the new stack.
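For reference, a sketch of how the old-stack utility is typically invoked (wrapped in a helper here for illustration; keyword names follow the existing rollout_ops module, and on the new stack it would have to return lists of episodes instead of a concatenated batch):

from ray.rllib.execution.rollout_ops import synchronous_parallel_sample

def sample_train_batch(workers, train_batch_size):
    # Fault-tolerant, parallel sampling across all remote (and/or the local)
    # workers of the given WorkerSet.
    return synchronous_parallel_sample(
        worker_set=workers,
        max_env_steps=train_batch_size,
        concat=True,
    )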

simonsays1980 (Collaborator, author):

Alright. Shall we do this in a new PR?

sven1977 (Contributor):

Fine with me, but let's add a TODO here, then.

local_worker=False,
)
# Perform SAC postprocessing (Prio weights) on a (flattened)
# list of Episodes.
sven1977 (Contributor):

erase this comment?

simonsays1980 (Collaborator, author):

Do we need to do any preprocessing here? Does a user still pass in processing functions in the new stack?

# TODO (simon): Implement n-step adjustment.
# + (self.config["gamma"] ** self.config["n_step"]) * q_next_masked
+ self.config.gamma * q_next_masked
+ (self.config.gamma**self.config.n_step) * q_next_masked
sven1977 (Contributor):

Nice!

I see, so supporting a uniform n-step range (uniformly sampling n-step values between x and y) will not be that simple, I guess, b/c of this loss term here. We would have to know, for each item in the batch, what the exact n-step was.

simonsays1980 (Collaborator, author):

Exactly. We would need to provide this in EpisodeReplayBuffer.sample(). But this is not impossible; in fact, it could be added easily. We just need to default it to n_step, and if an n-step value is sampled, it gets overridden.
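A sketch of how such a per-item n_step could enter the target (tensor names and values are purely illustrative):

import torch

gamma = 0.99
rewards = torch.tensor([0.1, 0.7, 0.3])       # already-summed n-step returns
n_steps = torch.tensor([1.0, 3.0, 2.0])       # actual n used per batch item
terminateds = torch.tensor([0.0, 0.0, 1.0])
q_next = torch.tensor([1.5, 2.0, 0.8])

# Discount the bootstrap term per item with the n that was actually sampled.
q_target = rewards + gamma**n_steps * (1.0 - terminateds) * q_next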

@@ -260,7 +250,8 @@ def compute_loss_for_module(
critic_loss = torch.mean(
# TODO (simon): Introduce priority weights when episode buffer is ready.
# batch[PRIO_WEIGHTS] *
torch.nn.HuberLoss(reduction="none", delta=1.0)(
batch["weights"]
sven1977 (Contributor):

Why don't we use the PRIO_WEIGHTS constant anymore?

simonsays1980 (Collaborator, author):

Yeah, I changed this b/c I was of the opinion that this is, in general, a weighted loss, and PRIO_WEIGHTS is a bit misleading to me, as these are actually importance sampling weights. But I am fine with both - it is simply a suggestion.

We could also define a new constant like WEIGHTS that could be reused by any algorithm that weights the items in its loss.

sven1977 (Contributor):

Got it, makes sense! Another option could be "loss_mask", which I think I'm already using in PPO. It's a weighting between 0.0 and 1.0, correct?
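A small sketch of why these are importance-sampling weights rather than a 0/1 mask (values are illustrative; the constant name was still under discussion here):

import torch

td_error = torch.tensor([0.2, -1.5, 0.4])
# Importance-sampling correction from prioritized replay: real-valued,
# typically in (0, 1], whereas a loss mask would only contain 0.0 or 1.0.
weights = torch.tensor([0.9, 0.4, 1.0])

huber = torch.nn.HuberLoss(reduction="none", delta=1.0)
critic_loss = torch.mean(weights * huber(td_error, torch.zeros_like(td_error)))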

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
sven1977 (Contributor) left a review:

LGTM now. Thanks @simonsays1980 !

sven1977 (Contributor) commented Feb 7, 2024:

Some learning tests are failing. Could you take a look? cc: @simonsays1980

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
sven1977 merged commit ee87ece into ray-project:master on Feb 9, 2024
9 checks passed
sven1977 (Contributor) commented Feb 9, 2024:

Awesome work @simonsays1980 !! Thanks for this PR.

Successfully merging this pull request may close these issues:

  • [RLlib] Enabling RLModule by default on SAC

2 participants: sven1977, simonsays1980