[rllib] Access to global (env) observations #8965
Comments
@ramondalmau In principle this is possible with lockstep mode if you're directly implementing a custom policy. However, the current TF/TorchPolicy templates will only receive data for agents of that policy, so this optimization isn't possible there. We're working on improving the amount of data accessible from policies with a new trajectory view paradigm though (cc @sven1977). Currently it's targeted more at accessing historical data from the trajectory of the same agents, but there's no reason it can't also expose data from agents of other policies, I think.
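For reference, here is a minimal config sketch (the policy name and spaces are placeholders, not from this issue) of enabling lockstep replay mode; with lockstep, the sample batches of agents that stepped together stay aligned, which is the prerequisite for a directly implemented policy to look at other agents' data:

```python
from gym.spaces import Box, Discrete

# Placeholder spaces, just to make the sketch self-contained.
obs_space = Box(low=-1.0, high=1.0, shape=(4,))
act_space = Discrete(2)

config = {
    "multiagent": {
        # All agents share one policy in this sketch.
        "policies": {
            "shared_policy": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: "shared_policy",
        # Replay the experiences of simultaneously acting agents together.
        "replay_mode": "lockstep",
    },
}
```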
Got it. Yes indeed, the TF/TorchPolicy will only receive data from agents using the same policy.

Long story short: unlike typical MARL, where agents act 'tactically' (they act knowing only their CURRENT state, even if they aim at maximising the long-term return), in the problem I want to address each agent has a 'plan', and agents act 'strategically' based on these plans.

Attached is a very simplified representation of the problem with 3 agents. Agents operate in an environment defined by a grid of cells (3x3 cells in the example). Each agent knows in which cells it will be in the next time steps (let us assume a look-ahead horizon of 3 for the sake of clarity). Each cell has a maximum capacity of resources (e.g., 2 agents at the same time). Fortunately, based on the plans, each cell can know how many agents plan to be inside it over the look-ahead horizon. For instance, for cell (0,0) (in the bottom left) the sequence of 'occupancy' is (1,0,0): one agent in the first time step, and 0 in the second and third.

My plan is: each cell passes its sequence of occupancy through a bi-directional RNN (BRNN). This BRNN generates as many hidden states as there are time steps in the look-ahead horizon (3 in the example). Each hidden state should capture the occupancy information over the whole time horizon; this is why I am using a BRNN. The observation of each agent is the sequence of hidden states of the cells it will traverse according to its plan (see example). Then, agents can communicate to decide how they act so as to avoid overloads (i.e., occupancy exceeding the resources of the cells).

IMPORTANT: this is a very simplified representation of the problem. In practice, the cells include additional information. Furthermore, each agent also has some parameters that condition the policy before taking the action.

In my previous post, the 'global observation' refers to the sequence of occupancy for each one of the cells. My plan was then to use torch.gather such that each agent selects the corresponding hidden states based on the sequence of cells it traverses (see the sketch below). I have been thinking a lot about how to implement this in RLlib, and it is not obvious :). Last night I was thinking about using a hierarchical environment, in which the agents of the higher level are the cells, ...
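A rough PyTorch sketch of the occupancy-BRNN plus torch.gather idea described above (grid size, horizon, hidden size and the plans tensor are all made-up toy values):

```python
import torch
import torch.nn as nn

num_cells = 9      # 3x3 grid, flattened
horizon = 3        # look-ahead horizon
hidden_size = 8

# Occupancy sequence per cell: (num_cells, horizon, 1) -- how many agents
# plan to be in each cell at each future time step.
occupancy = torch.randint(0, 3, (num_cells, horizon, 1)).float()

# Bi-directional GRU: every one of the `horizon` hidden states of a cell
# summarises that cell's whole occupancy sequence (forward + backward pass).
brnn = nn.GRU(input_size=1, hidden_size=hidden_size,
              batch_first=True, bidirectional=True)
cell_hidden, _ = brnn(occupancy)       # (num_cells, horizon, 2 * hidden_size)

# Each agent's plan: the (flattened) cell index it occupies at each step.
plans = torch.tensor([[0, 1, 2],
                      [3, 4, 5],
                      [6, 7, 8]])       # (num_agents, horizon)

# For agent a and step t, pick the hidden state of cell plans[a, t] at step t.
idx = plans.unsqueeze(-1).expand(-1, -1, 2 * hidden_size)
agent_obs = torch.gather(cell_hidden, 0, idx)   # (num_agents, horizon, 2 * hidden_size)
```

Each row of `agent_obs` would then be the per-agent observation fed into the communication/policy model.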
Hi @ramondalmau. Yes, I'm working on more general access options to the entire episode by the Model (it will be able to pick what exactly will be in its "view" in terms of past timesteps of any previous env and model outputs). So far, I had been thinking only about same-policy access, but yeah, this is a good point: why not allow one policy to access other policies' observations and outputs as well? I will factor this into the current project.
Dear all
I am trying to build a system that looks like the attached figure:
Essentially, I have a set of N agents which interact in a multi-agent environment. At every time step, each agent has its own observation, and the agents can communicate with each other using a differentiable communication protocol before taking the joint action. Note that I am working under the paradigm of centralised learning and decentralised execution. In order to allow for such communication, I will use the recently added lockstep replay_mode #7341 (thank you for your effort).
The novelty is that the communication is conditioned on a global observation of the environment (red box in the figure), which is NOT specific to any single agent but shared by all of them. The naive solution is to include a copy of the global observation in the observation of each individual agent. Then, I would have access to the global observation in the policy/vf model. This should work, but it is not efficient, because I would be creating N-1 redundant copies of the same data.
If the number of agents (N) and the size of the global observation space were small, the naive solution would be a good choice. When N and/or the size of the global observation space is large, however, it may be infeasible in terms of memory.
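To make the naive solution concrete, here is a small sketch (space sizes invented) of what each agent's observation would look like with the redundant global copy:

```python
from gym.spaces import Box, Dict

# Invented sizes: each agent's local observation plus one full copy of the
# global observation, repeated for every one of the N agents.
local_obs_space = Box(low=-1.0, high=1.0, shape=(4,))
global_obs_space = Box(low=0.0, high=10.0, shape=(64,))

agent_obs_space = Dict({
    "own_obs": local_obs_space,
    "global_obs": global_obs_space,
})

def build_obs(agent_local_obs, global_obs):
    # Every agent receives the same `global_obs` array -> N-1 redundant copies.
    return {"own_obs": agent_local_obs, "global_obs": global_obs}
```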
My question is: does the current implementation of RLlib allow working with this architecture? I was thinking of creating a dummy agent whose only job is to expose the global environment state to all other agents (so it neither contributes to updating the policy nor takes actions in the environment). However, as far as I know, the observation space of all agents using the same policy must be the same. Is there any possible solution to this problem?
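For illustration, this is roughly what I have in mind for the dummy-agent idea (a sketch only; the agent/policy names and space sizes are invented): give the global-state agent its own policy so that its observation space does not have to match the others, and exclude it from training.

```python
from gym.spaces import Box, Discrete

agent_obs_space = Box(low=-1.0, high=1.0, shape=(4,))
agent_act_space = Discrete(2)
global_obs_space = Box(low=0.0, high=10.0, shape=(64,))
noop_act_space = Discrete(1)       # the dummy agent never really acts

config = {
    "multiagent": {
        "policies": {
            # Regular agents share one trainable policy.
            "agent_policy": (None, agent_obs_space, agent_act_space, {}),
            # Dummy agent only exposes the global state.
            "global_policy": (None, global_obs_space, noop_act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: (
            "global_policy" if agent_id == "global_agent" else "agent_policy"
        ),
        # Do not optimise the dummy agent's policy.
        "policies_to_train": ["agent_policy"],
    },
}
```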
Many thanks in advance
Ramon