[RLlib] New ConnectorV2 API #02: SingleAgentEpisode enhancements. #41075

Conversation

sven1977
Contributor

@sven1977 sven1977 commented Nov 10, 2023

This PR is the 2nd in the "enhanced/new ConnectorV2 API" series:

  • Enhances the SingleAgentEpisode class: more consistent API names and additional convenience getter APIs for obs, actions, etc.; removes the state property from Episodes (state is now just another extra_model_outputs subkey).
  • Adds the concept of a "lookback buffer" inside an ongoing episode, to prepare connectors for looking back at any data users would like (and to replace the trajectory view API).
  • Cleans up some minor things in the code.
  • Allows for nested action and observation spaces in episodes.
  • Removes SingleAgentGymEnvRunner (only used for testing; mostly the same as our now-standard SingleAgentEnvRunner). Merged and activated its test cases.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -16,7 +20,6 @@ def __init__(
actions: List[ActType] = None,
rewards: List[SupportsFloat] = None,
infos: List[Dict] = None,
states=None,
Contributor Author

State outputs are no longer needed as a separate field. They are treated just like any other extra model output (e.g. as a (possibly nested) dict under the STATE_OUT key).
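A rough sketch of the new shape (the constant's value and import path below are assumptions, not taken from this PR):

# Sketch only: model state travels as just another (possibly nested)
# extra-model-output. The actual STATE_OUT constant lives in RLlib;
# its import path may differ, so we define a stand-in here.
STATE_OUT = "state_out"

extra_model_outputs = {
    "action_logp": -0.23,
    STATE_OUT: {"h": [0.1, 0.2], "c": [0.0, 0.1]},  # e.g. LSTM h/c state
}

# Instead of a dedicated `episode.states` field, downstream code reads:
state = extra_model_outputs[STATE_OUT]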

@@ -101,8 +102,7 @@ def __init__(
self.t = self.t_started = (
t_started if t_started is not None else max(len(self.observations) - 1, 0)
)
if self.t_started < len(self.observations) - 1:
self.t = len(self.observations) - 1
self._len_pre_buffer = len(self.rewards)
Contributor Author

Added the concept of a "lookback buffer" inside an ongoing episode.
This allows custom connectors to look back at previous data up to a certain (user-defined) number of timesteps, e.g. to add "prev. reward" or "prev. 5 actions" to a model's input.
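A hedged sketch of what this could look like from a connector's perspective (the getter name get_rewards() and the negative-index semantics are assumptions based on this description):

from ray.rllib.env.single_agent_episode import SingleAgentEpisode

# Construct an ongoing episode that already holds some data; part of that
# history can serve as the lookback window for connectors.
episode = SingleAgentEpisode(
    observations=[0, 1, 2, 3],
    actions=[0, 1, 2],
    rewards=[0.0, 1.0, 2.0],
)

# A custom connector could then ask for the "prev. 2 rewards" relative to
# the current timestep (assumed getter name and index semantics):
prev_rewards = episode.get_rewards(indices=[-2, -1])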

@@ -128,7 +126,7 @@ def concat_episode(self, episode_chunk: "SingleAgentEpisode"):
from both episodes.
"""
assert episode_chunk.id_ == self.id_
assert not self.is_done
assert not self.is_done and not self.is_numpy
Contributor Author

@sven1977 sven1977 Nov 14, 2023

For simplicity, we assume that the Episode is still in "list format" (not numpy'ized yet).
We might have to change this concat_episode() API in the future, but right now it's only used inside DreamerV3's replay buffer (and in some test cases) anyway.
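A minimal sketch of the usage pattern under that assumption (the chunk construction args are guesses; only id_, is_done, is_numpy, and concat_episode() appear in the diff):

from ray.rllib.env.single_agent_episode import SingleAgentEpisode

# Two successive chunks of the same episode (construction details assumed).
chunk_a = SingleAgentEpisode(
    id_="ep_0", observations=[0, 1], actions=[0], rewards=[0.1],
)
chunk_b = SingleAgentEpisode(
    id_="ep_0", observations=[1, 2], actions=[1], rewards=[0.2], t_started=1,
)

# concat_episode() assumes list format (not yet numpy'ized) and not done:
assert not chunk_a.is_done and not chunk_a.is_numpy
chunk_a.concat_episode(chunk_b)  # chunk_a now holds data from both chunks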


# Validate.
self.validate()

def add_initial_observation(
def add_env_reset(
Contributor Author

Renamed for clarity:

  1. add_env_reset(): Adds all the data returned by an env.reset() call.
  2. add_env_step(): Adds all the data returned by an env.step() call.
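In sketch form (keyword argument names are assumed from the diff context, not verbatim):

import gymnasium as gym
from ray.rllib.env.single_agent_episode import SingleAgentEpisode

env = gym.make("CartPole-v1")
episode = SingleAgentEpisode()

# 1) Feed the env.reset() return values into the episode.
obs, info = env.reset()
episode.add_env_reset(observation=obs, infos=info)

# 2) Feed each env.step() return value into the episode.
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
episode.add_env_step(
    observation=obs,
    action=action,
    reward=reward,
    infos=info,
    terminated=terminated,
    truncated=truncated,
)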

# TODO (sven): Do we have to call validate here? It is our own function
# that manipulates the object.
self.validate()

def add_timestep(
def add_env_step(
Contributor Author

See above.

self.extra_model_outputs[k] = [v]
else:
self.extra_model_outputs[k].append(v)
self.extra_model_outputs[k].append(v)
Contributor Author

Simplified via defaultdict.
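The simplification being referenced, as a standalone sketch:

from collections import defaultdict

outputs = [("action_logp", -0.1), ("action_logp", -0.2)]

# Before: manual "key exists?" branching.
extra_model_outputs = {}
for k, v in outputs:
    if k not in extra_model_outputs:
        extra_model_outputs[k] = [v]
    else:
        extra_model_outputs[k].append(v)

# After: defaultdict(list) collapses both branches into a single append.
extra_model_outputs = defaultdict(list)
for k, v in outputs:
    extra_model_outputs[k].append(v)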

"""

self.observations = np.array(self.observations)
self.actions = np.array(self.actions)
self.observations = batch(self.observations)
Contributor Author

Allow for nested obs/action spaces.
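A small sketch of why np.array() is insufficient here (batch() refers to RLlib's space utility; the exact import path is an assumption):

import numpy as np
from ray.rllib.utils.spaces.space_utils import batch  # assumed path

# A list of nested-dict observations, as produced by a Dict obs space.
observations = [
    {"pos": np.array([0.0, 1.0]), "vel": np.array([0.1])},
    {"pos": np.array([1.0, 2.0]), "vel": np.array([0.2])},
]

# np.array(observations) would create an object array; batch() instead
# stacks each leaf, preserving the nested structure:
stacked = batch(observations)
# -> {"pos": array of shape (2, 2), "vel": array of shape (2, 1)}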

self.render_images = np.array(self.render_images, dtype=np.uint8)
for k, v in self.extra_model_outputs.items():
self.extra_model_outputs[k] = np.array(v)
self.extra_model_outputs[k] = batch(v)
Contributor Author

Allow for complex (nested) model outputs (especially now that states are part of these extra model outputs).

Contributor

Can we use batch() and not np.array conversion everywhere? That would allow us to unit-test batch(), make sure its behavior is predictable, and re-use it everywhere.

Contributor Author

The argument against it is that this would be overkill (we know that rewards are only a list of floats, never complex structs). But yes, batch() should work on these as well, of course. There is a proper unit test for batch(), which was added recently.

Contributor Author

Solved: I added an extra test for batch/unbatch on simple structs AND used batch() everywhere in this method (even on rewards).
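For simple structs, the expected behavior would be (sketch, same assumed import path as above):

from ray.rllib.utils.spaces.space_utils import batch  # assumed path

# On a plain list of floats, batch() should reduce to a simple stack,
# i.e. behave like np.array(rewards):
rewards = [0.5, 1.0, 1.5]
stacked = batch(rewards)
assert stacked.shape == (3,)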

for k in extra_model_output_keys
},
)
def get_observations(self, indices: Optional[Union[int, List[int], slice]] = None) -> Any:
Contributor Author

Added these very practical new APIs to get data from the episode in a user-friendly fashion.
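Sketch of the getter semantics implied by the signature above (single int, list of ints, or a slice; return shapes are assumptions):

from ray.rllib.env.single_agent_episode import SingleAgentEpisode

episode = SingleAgentEpisode(
    observations=[0, 1, 2, 3],
    actions=[0, 1, 2],
    rewards=[0.1, 0.2, 0.3],
)

last_obs = episode.get_observations(-1)         # single index
first_two = episode.get_observations([0, 1])    # list of indices
window = episode.get_observations(slice(1, 3))  # slice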

)

@staticmethod
Contributor Author

@sven1977 sven1977 Nov 14, 2023

We'll try to get rid of SampleBatch eventually (it's kind of an overloaded mess). There is no application currently that requires constructing an episode from an existing SampleBatch (only the other way around: Episode -> SampleBatch).

@sven1977 sven1977 changed the title [RLlib] Preparatory PR: Make EnvRunners use (enhanced) Connector API (#02: SingleAgentEpisode enhancements) [RLlib] New ConnectorV2 API #02: SingleAgentEpisode enhancements. (#41074) Nov 17, 2023
gym.register(
"custom-env-v0",
partial(
if (
Contributor Author

This is a bug fix: without it, passing a class to config.environment(env=[some class]) does not work (only strings do).
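For context, a sketch of the call pattern this fix enables (the env class below is hypothetical; PPOConfig().environment() is standard RLlib API):

import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig

class MyEnv(gym.Env):
    """Hypothetical minimal env used only to illustrate the fix."""
    def __init__(self, config=None):
        self.observation_space = gym.spaces.Discrete(2)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        return 0, {}

    def step(self, action):
        return 0, 1.0, True, False, {}

# Before the fix, only registered string env IDs worked here; now a class does:
config = PPOConfig().environment(env=MyEnv)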

# TODO (simon): Check, if this works for the default
# stateful encoders.
initial_state={k: s[i] for k, s in states.items()},
self._episodes[i].add_env_reset(
Contributor Author

Cleaner naming of these Episode methods:

  • add_env_reset
  • add_env_step

Both add the return values of the corresponding gym.Env calls to the episode.

@@ -95,6 +96,11 @@ def __init__(self, config: "AlgorithmConfig", **kwargs):
self._ts_since_last_metrics: int = 0
self._weights_seq_no: int = 0

# TODO (sven): This is a temporary solution. STATE_OUTs
Contributor Author

Temporary fix: We need the new connectors to make this work without having to keep self._states around here. The PRs for this are lined up and rely on this one being merged first.
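Roughly, the temporary pattern described here (all names below are assumptions; the follow-up connector PRs would remove this bookkeeping):

class _EnvRunnerStateCacheSketch:
    """Hypothetical sketch of the temporary self._states workaround."""

    def __init__(self, num_envs: int):
        # One cached model state per sub-env; fed back as the next STATE_IN.
        self._states = [None] * num_envs

    def on_model_output(self, env_index: int, state_out) -> None:
        # Stash the model's STATE_OUT for this sub-env ...
        self._states[env_index] = state_out

    def next_state_in(self, env_index: int):
        # ... and hand it back on the next forward pass.
        return self._states[env_index]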

@@ -71,8 +71,7 @@ def test_init(self):
rewards = []
actions = []
infos = []
extra_model_outputs = []
states = np.random.random(10)
extra_model_outputs = {"extra_1": [], "state_out": np.random.random()}
Contributor Author

Fixed the tests: state_out is now treated as just another extra_model_output.

@kouroshHakha
Contributor

The RLlib tests are still failing.

@sven1977 sven1977 merged commit d6d2dee into ray-project:master Nov 30, 2023
8 of 15 checks passed