[RLlib] Rename some items in episode APIs. #44669
Conversation
LGTM. The set of cached "begin" values appears not to be used.
self._hanging_extra_model_outputs_end = defaultdict(dict)
self._hanging_rewards_end = defaultdict(float)

# In case of a `cut()` or `slice()`, we also need to store the hanging actions,
Dumb question: Why do we need these extra buffers? Can't we just copy over the `_hanging_actions_end` to the successor? When we concatenate them again, `self` has these hanging actions, and the first observation in the successor episode for the agent will be used to store the now-valid transition.
Not a dumb question at all!
An example of where it is crucial to have both these caches:
obs = [
a0: 0, a1: 0 # ts=0
a0: 1 # ts=1
a0: 2 # ts=2
]
a1's action at ts=0 is now hanging, b/c there is no follow-up observation after its first one. Thus, when we `cut()` (or `slice()`), we put this hanging action in the "before" buffer:
successor = MAEps(
a0: 2, <- first observation
self._hanging_actions_before[a1] = [a1's hanging action]
)
Then we continue adding timesteps to the successor chunk (the one returned by `cut()`). Let's assume that in the successor, eventually, a new obs comes in for a1.
successor = MAEps(
a0: 2, <- first observation
a0: 3,
...
a0: 5, a1: 5
self._hanging_actions_before[a1] = [a1's hanging action]
)
At this point, we can't do anything with the hanging action (before), as it belongs to an observation (the original one of a1 at ts=0 in the previous chunk) that is NOT part of the successor anymore. It's long gone, so to speak. We thus simply leave this hanging action (before) as-is and ONLY use it to check against a preceding chunk when we do `concat_episodes()`.
Furthermore, assume that we continue the episode and there is another sequence of missing observations for a1, in which case we now also need to store the hanging action at the end, for the time when the next obs for a1 DOES come in. However, this hanging action at the end should not be confused with the hanging action at the beginning; they are from different timesteps and have nothing to do with each other.
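The two caches described above can be sketched with a minimal, hypothetical stand-in class. The attribute names mirror the PR's `_hanging_actions_end` / `_hanging_actions_before`, but the class and its `cut()` logic are illustrative only, not RLlib's implementation:

```python
from collections import defaultdict


class EpisodeChunk:
    """Hypothetical, minimal stand-in for MultiAgentEpisode's action caches."""

    def __init__(self):
        # Actions taken after an agent's most recent observation in THIS
        # chunk; they become valid transitions once the agent's next obs
        # arrives.
        self._hanging_actions_end = defaultdict(list)
        # Actions inherited from the predecessor chunk on a cut(); ONLY used
        # to validate against that predecessor during concatenation.
        self._hanging_actions_before = defaultdict(list)

    def cut(self):
        """Returns a successor chunk; end-hanging actions move to 'before'."""
        successor = EpisodeChunk()
        for agent_id, actions in self._hanging_actions_end.items():
            successor._hanging_actions_before[agent_id] = list(actions)
        return successor


ep = EpisodeChunk()
ep._hanging_actions_end["a1"].append("act_ts0")  # a1 acted, no follow-up obs.
succ = ep.cut()
# The inherited action sits in the 'before' cache of the successor ...
assert succ._hanging_actions_before["a1"] == ["act_ts0"]
# ... leaving the 'end' cache free for NEW hanging actions later on:
succ._hanging_actions_end["a1"].append("act_ts5")
```

Keeping the two caches separate is exactly what lets the "before" action survive untouched for the later concatenation check while new end-hanging actions accumulate.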
        # agents have at least one observation to look back to.
-        for agent_id, agent_actions in self._agent_buffered_actions.items():
+        for agent_id, agent_actions in self._hanging_actions_end.items():
            assert self.env_t_to_agent_t[agent_id].get(-1) == self.SKIP_ENV_TS_TAG
Is this always true? What if an agent stepped in the last timestep but did not receive an observation? Do we then denote this as no-step (`SKIP_ENV_TS_TAG`) and count the step at the timestep the missing observation comes in?
Correct.
As long as the agent does NOT receive a new obs, all steps get marked with `SKIP_ENV_TS_TAG`.
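A hedged sketch of this bookkeeping (the skip-tag value and the plain-list mapping are assumptions for illustration; RLlib's actual `env_t_to_agent_t` structure may differ):

```python
SKIP_ENV_TS_TAG = "S"  # hypothetical stand-in value for RLlib's skip marker

# Env-timestep -> agent-timestep mapping for a1: it observes at env ts=0
# (agent ts 0), then acts but receives no new observation at env ts=1 and 2.
env_t_to_agent_t = {"a1": [0, SKIP_ENV_TS_TAG, SKIP_ENV_TS_TAG]}

# While a1's action is still hanging, the LAST entry is the skip tag --
# the invariant that the assertion in the diff relies on:
assert env_t_to_agent_t["a1"][-1] == SKIP_ENV_TS_TAG

# Once a new obs for a1 comes in (say at env ts=3), the agent step is
# counted at that timestep:
env_t_to_agent_t["a1"].append(1)
assert env_t_to_agent_t["a1"][-1] == 1
```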
@@ -1547,7 +1557,7 @@ def get_state(self) -> Dict[str, Any]:
        Returns: A dictionary containing picklable data for a
            `MultiAgentEpisode`.
        """
-        # TODO (simon): Add the buffers.
+        # TODO (simon): Add the agent caches.
Shall we add the caches to the state?
Not yet. Do we need the states somewhere already?
When storing replay buffers?
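As a rough sketch of what adding the caches to the state could look like (hypothetical class and key names; not the actual RLlib `get_state()` implementation):

```python
import pickle
from collections import defaultdict


class EpisodeState:
    """Hypothetical holder for the hanging-reward caches."""

    def __init__(self):
        self._hanging_rewards_end = defaultdict(float)
        self._hanging_rewards_begin = defaultdict(float)

    def get_state(self):
        # Convert defaultdicts to plain dicts so the state stays picklable.
        return {
            "hanging_rewards_end": dict(self._hanging_rewards_end),
            "hanging_rewards_begin": dict(self._hanging_rewards_begin),
        }

    @classmethod
    def from_state(cls, state):
        ep = cls()
        ep._hanging_rewards_end.update(state["hanging_rewards_end"])
        ep._hanging_rewards_begin.update(state["hanging_rewards_begin"])
        return ep


ep = EpisodeState()
ep._hanging_rewards_end["a1"] = 0.5
# Round-trip through pickle, as a replay buffer might do:
restored = EpisodeState.from_state(pickle.loads(pickle.dumps(ep.get_state())))
assert restored._hanging_rewards_end["a1"] == 0.5
```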
    ) -> float:
        """Returns all-agent return.

        Args:
-            consider_buffer: Whether we should also consider
-                buffered rewards when calculating the overall return. Agents might
+            consider_hanging_rewards: Whether we should also consider
`include_hanging_rewards`?
Fixed in next PR.
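For illustration, a standalone sketch of the parameter's semantics (hypothetical free function; the real method lives on `MultiAgentEpisode` and reads its internal caches):

```python
def get_return(agent_returns, hanging_rewards_end, include_hanging_rewards=False):
    """Sums per-agent returns, optionally adding rewards that are still
    hanging (i.e., received after an agent's last observation)."""
    total = sum(agent_returns.values())
    if include_hanging_rewards:
        total += sum(hanging_rewards_end.values())
    return total


returns = {"a0": 3.0, "a1": 1.0}
hanging = {"a1": 0.5}  # a1 received a reward but no follow-up obs yet.
assert get_return(returns, hanging) == 4.0
assert get_return(returns, hanging, include_hanging_rewards=True) == 4.5
```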
Rename some items in episode APIs.

- `MultiAgentEpisode._agent_buffered_[actions/rewards/extra_model_outputs]` -> `_hanging_[actions/rewards/extra_model_outputs]_end`
- Added `_hanging_[actions/rewards/extra_model_outputs]_begin` to allow for more accurate cuts and slices of `MultiAgentEpisode`. Right now, when slicing through a sequence of, e.g., agent1 NOT receiving observations but already having acted/received rewards, these hanging actions/rewards should be included in the successor chunk (but at the beginning). Only this way can proper concatenation be guaranteed; otherwise, individual timesteps might get lost entirely.

Note that these properties are not actually used yet in this PR. A follow-up PR will fix the behavior of `MultiAgentEpisode`'s `cut()`, `slice()`, and getters.

Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.