[rllib] Access to global (env) observations #8965
Comments
@ramondalmau In principle this is possible with lockstep mode if you're directly implementing a custom policy. However, the current TF/TorchPolicy templates will only receive data for agents of that policy, so this optimization isn't possible there. We're working on improving the amount of data accessible from policies with a new trajectory view paradigm though (cc @sven1977). Currently it's targeted more at accessing historical data from the trajectory of the same agents, but there's no reason it can't also expose data from agents of other policies, I think.
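For reference, here is a minimal config sketch (the policy name and spaces are placeholders, not from this issue) of enabling lockstep replay mode; with lockstep, the sample batches of agents that stepped together stay aligned, which is the prerequisite for a directly implemented policy to look at other agents' data:

```python
from gym.spaces import Box, Discrete

# Placeholder spaces, just to make the sketch self-contained.
obs_space = Box(low=-1.0, high=1.0, shape=(4,))
act_space = Discrete(2)

config = {
    "multiagent": {
        # All agents share one policy in this sketch.
        "policies": {
            "shared_policy": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: "shared_policy",
        # Replay the experiences of simultaneously acting agents together.
        "replay_mode": "lockstep",
    },
}
```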
Got it. Yes indeed, the TF/TorchPolicy will only receive data from agents using the same policy.

Long story short: unlike typical MARL, where agents act 'tactically' (they act knowing only their CURRENT state, even if they aim at maximising the long-term return), in the problem I want to address each agent has a 'plan', and agents act 'strategically' based on these plans.

Attached is a very simplified representation of the problem with 3 agents. Agents operate in an environment defined by a grid of cells (3x3 cells in the example). Each agent knows in which cells it will be in the next time steps (let us assume a look-ahead horizon of 3 for the sake of clarity). Each cell has a maximum capacity of resources (e.g., 2 agents at the same time). Fortunately, based on the plans, each cell can know how many agents plan to be inside it over the look-ahead horizon. For instance, for cell (0,0) (in the bottom left) the sequence of 'occupancy' is (1,0,0): one agent in the first time step, and 0 in the second and third.

My plan is: each cell passes its sequence of occupancy through a bi-directional RNN (BRNN). This BRNN generates as many hidden states as there are time steps in the look-ahead horizon (3 in the example). Each hidden state should capture the occupancy information over the whole time horizon; this is why I am using a BRNN. The observation of each agent is the sequence of hidden states of the cells it will traverse according to its plan (see example). Then, agents can communicate to decide how they act so as to avoid overloads (i.e., occupancy exceeding the resources of the cells).

IMPORTANT: this is a very simplified representation of the problem. In practice, the cells include additional information. Furthermore, each agent also has some parameters that condition the policy before taking the action.

In my previous post, the 'global observation' refers to the sequence of occupancy for each one of the cells. My plan was then to use torch.gather such that each agent selects the corresponding hidden states based on the sequence of cells it traverses (see the sketch below). I have been thinking a lot about how to implement this in RLlib, and it is not obvious :). Last night I was thinking about using a hierarchical environment, in which the agents of the higher level are the cells, ...
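A rough PyTorch sketch of the occupancy-BRNN plus torch.gather idea described above (grid size, horizon, hidden size and the plans tensor are all made-up toy values):

```python
import torch
import torch.nn as nn

num_cells = 9      # 3x3 grid, flattened
horizon = 3        # look-ahead horizon
hidden_size = 8

# Occupancy sequence per cell: (num_cells, horizon, 1) -- how many agents
# plan to be in each cell at each future time step.
occupancy = torch.randint(0, 3, (num_cells, horizon, 1)).float()

# Bi-directional GRU: every one of the `horizon` hidden states of a cell
# summarises that cell's whole occupancy sequence (forward + backward pass).
brnn = nn.GRU(input_size=1, hidden_size=hidden_size,
              batch_first=True, bidirectional=True)
cell_hidden, _ = brnn(occupancy)       # (num_cells, horizon, 2 * hidden_size)

# Each agent's plan: the (flattened) cell index it occupies at each step.
plans = torch.tensor([[0, 1, 2],
                      [3, 4, 5],
                      [6, 7, 8]])       # (num_agents, horizon)

# For agent a and step t, pick the hidden state of cell plans[a, t] at step t.
idx = plans.unsqueeze(-1).expand(-1, -1, 2 * hidden_size)
agent_obs = torch.gather(cell_hidden, 0, idx)   # (num_agents, horizon, 2 * hidden_size)
```

Each row of `agent_obs` would then be the per-agent observation fed into the communication/policy model.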
Hi @ramondalmau. Yes, I'm working on more general access options to the entire episode by the Model (it will be able to pick what exactly will be in its "view" in terms of past timesteps of any previous env and model outputs). So far, I had been thinking only about same-policy access, but yeah, this is a good point: why not allow one policy to access other policies' observations and outputs as well? I will factor this into the current project.
Dear all
I am trying to build a system that looks like the attached figure:
Essentially, I have a set of N agents which interact in a multi-agent environment. At every time step, each agent has its own observation, and the agents can communicate with each other using a differentiable communication protocol before taking the joint action. Note that I am working under the paradigm of centralised learning and decentralised execution. In order to allow for such communication, I will use the recently added lockstep replay_mode #7341 (thank you for your effort).
The novelty is that the communication is conditioned on a global observation of the environment (red box in the figure), which is NOT specific to any single agent but shared by all of them. The naive solution is to include a copy of the global observation in the observation of each individual agent. Then, I would have access to the global observation in the policy/vf model. This should work, but it is not efficient, because I would be creating N-1 redundant copies of the same data.
If the number of agents (N) and the size of the global observation space were small, the naive solution would be a good choice. When N and/or the size of the global observation space is large, however, it may be infeasible in terms of memory.
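To make the naive solution concrete, here is a small sketch (space sizes invented) of what each agent's observation would look like with the redundant global copy:

```python
from gym.spaces import Box, Dict

# Invented sizes: each agent's local observation plus one full copy of the
# global observation, repeated for every one of the N agents.
local_obs_space = Box(low=-1.0, high=1.0, shape=(4,))
global_obs_space = Box(low=0.0, high=10.0, shape=(64,))

agent_obs_space = Dict({
    "own_obs": local_obs_space,
    "global_obs": global_obs_space,
})

def build_obs(agent_local_obs, global_obs):
    # Every agent receives the same `global_obs` array -> N-1 redundant copies.
    return {"own_obs": agent_local_obs, "global_obs": global_obs}
```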
My question is: does the current implementation of RLlib allow working with this architecture? I was thinking of creating a dummy agent whose only job is to expose the global environment state to all other agents (so it neither contributes to updating the policy nor takes actions in the environment). However, as far as I know, the observation space of all agents using the same policy must be the same. Is there any possible solution to this problem?
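For illustration, this is roughly what I have in mind for the dummy-agent idea (a sketch only; the agent/policy names and space sizes are invented): give the global-state agent its own policy so that its observation space does not have to match the others, and exclude it from training.

```python
from gym.spaces import Box, Discrete

agent_obs_space = Box(low=-1.0, high=1.0, shape=(4,))
agent_act_space = Discrete(2)
global_obs_space = Box(low=0.0, high=10.0, shape=(64,))
noop_act_space = Discrete(1)       # the dummy agent never really acts

config = {
    "multiagent": {
        "policies": {
            # Regular agents share one trainable policy.
            "agent_policy": (None, agent_obs_space, agent_act_space, {}),
            # Dummy agent only exposes the global state.
            "global_policy": (None, global_obs_space, noop_act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: (
            "global_policy" if agent_id == "global_agent" else "agent_policy"
        ),
        # Do not optimise the dummy agent's policy.
        "policies_to_train": ["agent_policy"],
    },
}
```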
Many thanks in advance
Ramon