
[RLlib] Rename some items in episode APIs. #44669

Merged

Conversation

Contributor

@sven1977 sven1977 commented Apr 11, 2024

Rename some items in episode APIs.

  • MultiAgentEpisode._agent_buffered_[actions/rewards/extra_model_outputs] -> _hanging_[actions/rewards/extra_model_outputs]_end
  • Add: _hanging_[actions/rewards/extra_model_outputs]_begin to allow for more accurate cuts and slices of MultiAgentEpisode. Right now, when slicing through a sequence in which e.g. agent1 does NOT receive observations but has already acted and received rewards, these hanging actions/rewards should be included in the successor chunk (but at the beginning). Only this way can proper concatenation be guaranteed; otherwise, individual timesteps might get lost entirely. Note that these properties are not actually used yet in this PR. A follow-up PR will fix the behavior of MultiAgentEpisode's cut(), slice(), and getters. See the sketch below this list.
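A minimal sketch of the renamed and newly added caches. The `*_end` names and their defaultdict types appear in this PR's diff; the `*_begin` initializations and the wrapper class are illustrative assumptions, not the actual RLlib code:

from collections import defaultdict

class MultiAgentEpisodeSketch:
    def __init__(self):
        # Formerly `_agent_buffered_[actions|rewards|extra_model_outputs]`:
        # data that arrived after an agent's last observation and thus
        # "hangs" at the end of this chunk.
        self._hanging_actions_end = defaultdict(list)
        self._hanging_rewards_end = defaultdict(float)
        self._hanging_extra_model_outputs_end = defaultdict(dict)

        # New in this PR (not consumed yet): data hanging at the *beginning*
        # of a successor chunk after a cut()/slice(), kept only so that a
        # later concat_episodes() can restore otherwise lost transitions.
        self._hanging_actions_begin = defaultdict(list)
        self._hanging_rewards_begin = defaultdict(float)
        self._hanging_extra_model_outputs_begin = defaultdict(dict)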

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 11, 2024
Collaborator

@simonsays1980 simonsays1980 left a comment


LGTM. The set of cached "begin" values appears not to be used yet.

self._hanging_extra_model_outputs_end = defaultdict(dict)
self._hanging_rewards_end = defaultdict(float)

# In case of a `cut()` or `slice()`, we also need to store the hanging actions,
Collaborator


Dumb question: Why do we need these extra buffers? Can't we just copy over the _hanging_actions_end to the successor?
When we concatenate them again self has these hanging actions and the first observation in the successor episode for the agent will be used to store the now valid transition.

Contributor Author


Not a dumb question at all!
An example of where it is crucial to have both these caches:

obs = [
   a0: 0, a1: 0   # ts=0
   a0: 1     # ts=1
   a0: 2   # ts=2
]

a1's action at ts=0 is now hanging, because there is no follow-up observation after its first one. Thus, when we cut() (or slice()), we put this hanging action in the "before" buffer:

successor = MAEps(
   a0: 2,  <- first observation

   self._hanging_actions_before[a1] = [a1's hanging action]
)

Then we continue adding timesteps to the successor chunk (the one returned by cut()). Let's assume that in the successor, eventually, a new obs comes in for a1.

successor = MAEps(
   a0: 2,  <- first observation
   a0: 3,
   ...
   a0: 5, a1: 5

   self._hanging_actions_before[a1] = [a1's hanging action]
)

At this point, we can't do anything with the hanging action (before), as it belongs to an observation (the original one of a1 at ts=0 in the previous chunk) that is NOT part of the successor anymore. It's long gone, so to speak. We thus simply leave this hanging action (before) as-is and ONLY use it to check against a preceding chunk when we do concat_episodes().

Furthermore, assume that we continue the episode and there is another sequence of missing observations for a1, in which case we now also need to store a hanging action at the end for the time when the next obs for a1 DOES come in. However, this hanging action at the end should not be confused with the hanging action at the beginning; they are from different timesteps and have nothing to do with each other.
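A toy version of this scenario. Attribute names follow this PR (`_hanging_actions_begin` / `_hanging_actions_end`), but the chunks are plain dicts here, not the actual RLlib implementation:

from collections import defaultdict

original = {
    # a1 never receives another obs after ts=0, so its ts=0 action hangs
    # at the end of this chunk.
    "observations": {"a0": [0, 1, 2], "a1": [0]},
    "_hanging_actions_end": defaultdict(list, {"a1": ["a1_action_ts0"]}),
}

# After cut(): the successor starts with a0's obs at ts=2. a1's hanging
# action cannot be paired with any observation inside the successor (its
# matching obs lives in the previous chunk), so it sits in the *begin*
# cache and is never consumed by the successor itself.
successor = {
    "observations": {"a0": [2]},
    "_hanging_actions_begin": defaultdict(list, {"a1": ["a1_action_ts0"]}),
    # Filled independently if a1 acts again without receiving an obs:
    "_hanging_actions_end": defaultdict(list),
}

# Only concat_episodes(original, successor) would look at
# successor["_hanging_actions_begin"], matching it against the hanging end
# data of the preceding chunk so the a1 transition at ts=0 is not lost.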

# agents have at least one observation to look back to.
for agent_id, agent_actions in self._agent_buffered_actions.items():
for agent_id, agent_actions in self._hanging_actions_end.items():
assert self.env_t_to_agent_t[agent_id].get(-1) == self.SKIP_ENV_TS_TAG
Collaborator


Is this always true? What if an agent stepped in the last timestep but did not receive an observation? Do we then denote this as no-step (SKIP_ENV_TS_TAG) and count the step at the timestep the missing observation comes in?

Contributor Author


Correct.
As long as the agent does NOT receive a new obs, all steps get marked with "SKIP_ENV_TS_TAG".
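A toy illustration of that bookkeeping. `SKIP_ENV_TS_TAG` and `env_t_to_agent_t` exist in RLlib, but the tag value and the plain-dict representation below are placeholders for illustration only:

SKIP_ENV_TS_TAG = "S"  # placeholder value

# Mapping env timestep -> agent timestep for a1 in the example above:
env_t_to_agent_t_a1 = {
    0: 0,                # a1 received its first obs at env ts=0
    1: SKIP_ENV_TS_TAG,  # env stepped, but no obs for a1
    2: SKIP_ENV_TS_TAG,  # still no obs for a1
    # The next env ts at which a1 actually receives an obs maps to agent ts=1.
}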

@@ -1547,7 +1557,7 @@ def get_state(self) -> Dict[str, Any]:
Returns: A dictionary containing pickleable data for a
`MultiAgentEpisode`.
"""
# TODO (simon): Add the buffers.
# TODO (simon): Add the agent caches.
Collaborator


Shall we add the caches to the state?

Contributor Author


Not yet. Do we need the states somewhere already?
When storing replay buffers?

) -> float:
"""Returns all-agent return.

Args:
consider_buffer: Whether we should also consider
buffered rewards wehn calculating the overall return. Agents might
consider_hanging_rewards: Whether we should also consider
Collaborator


include_hanging_rewards?

Contributor Author


Fixed in next PR.

@sven1977 sven1977 merged commit 9bebefb into ray-project:master Apr 11, 2024
5 checks passed
@sven1977 sven1977 deleted the rename_some_items_in_episode_apis branch April 11, 2024 13:36
harborn pushed a commit to harborn/ray that referenced this pull request Apr 18, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.