[rllib] Custom model for multi-agent environment: access to all states #7341
@janblumenkamp you can try these two options: (1) Grouping the agents: https://ray.readthedocs.io/en/latest/rllib-env.html#grouping-agents (2) Reshape the batch in your custom model from (batch, data) to (batch, agent_id, data) for processing (i.e., do the grouping in the model only). You will need some way of figuring out the agent id for each batch entry.
Hi Eric,
Yes, that would work great! I guess you will end up evaluating the 'shared' layer once per agent, but the overhead there should be minimal.
Thank you!
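Option (2) above can be sketched in a few lines. This is an illustrative stand-in, not RLlib code: it assumes rows arrive ordered agent 0..n-1 within each env step (which, as discussed later in this thread, is not guaranteed), so it checks that assumption via an agent id shipped inside the observation. All names and dimensions are invented.

```python
import numpy as np

num_agents = 3
feat_dim = 4
batch_size = 6  # 2 env steps x 3 agents

# Agent id carried inside each observation row, plus the feature payload.
agent_ids = np.tile(np.arange(num_agents), batch_size // num_agents)
features = np.random.rand(batch_size, feat_dim).astype(np.float32)

# Verify the assumed ordering before reshaping.
assert (agent_ids.reshape(-1, num_agents) == np.arange(num_agents)).all()

# Regroup (batch, data) -> (env_steps, num_agents, data) for joint processing.
grouped = features.reshape(-1, num_agents, feat_dim)
print(grouped.shape)  # (2, 3, 4)
```

The grouped tensor can then be fed through a shared layer once per env step instead of once per agent row.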
Hi Eric,
but in the agent grouping documentation, it says
Do I still have the advantages of multiple single agents (i.e., more distributed experience) if I use agent grouping, or is the grouped super-agent literally treated as one big agent with a huge observation and action space? I assume the former is the case?
It's the latter; it really is one big super-agent. You could potentially still do an architectural decomposition within the super-agent model, though (i.e., to emulate certain multi-agent architectures).
(I hope it is okay to follow up again in this issue since the question is again closely related.)
Please ignore my previous question, I had a misconception of how the batching works.
So far I followed this strategy, but now that I add LSTMs, it seems that the overhead adds up significantly, especially for larger numbers of agents.
Probably I will have to do something like this then. I will not only need to figure out each agent's id, but also which elements in the batch come from the same time step, right? Is it guaranteed that consecutive elements in the batch come from the same environment and the same time step? Even if I don't allow any early termination of agents, I sometimes have cases where I receive a batch that contains fewer entries than I have agents (here I printed the batch size from inside my model):
(Here I have 5 agents in my environment; the batches initially sometimes contain 20 elements and then only 1, but there are exactly 20 of those size-1 batches, and the agent ID I am sending in the observation indicates that those 20 batches are one size-20 batch disassembled into multiple smaller batches. Why is that happening? Should I just discard those batches?)
Hmm, what exactly is being printed out here? The batches should get merged together to the train batch size eventually, but you might see smaller fragments during processing. Also, note that batch.count always measures the number of environment steps, which could be much lower than the sum of agent steps.
Oh, the "size 1" batches are probably from inference if you're printing from your model; that's normal.
Oh I see, that makes sense, then it's coming from the evaluation I am running during training! I guess reshaping the batch in the model won't work then. Do I have to group the agents then? For PyTorch that will only be possible after #8101 is finalized. It looks like that will happen very soon, but just out of curiosity, what other options do I have? Just to recap: basically I want n actions for n agents from a single shared observation space (and thus perform n policy updates for each observation). I probably could implement a standard gym environment (so not use the MultiAgentEnv) that provides the shared observation space and takes a MultiDiscrete action. My model would take this observation and provide the actions. I could then take this MultiDiscrete action in my own RolloutWorker, similar to this example, to perform n policy updates? Or is there an easier way?
There is one workaround, which is to have the policies execute forward() through the "shared" communication layer separately. Basically the shared unit will be evaluated N times for N policies instead of once, so this is more computationally expensive, but it should give the same results. You'll have to make sure the observation space contains the observations of all other agents, though. Edit: example, suppose I have agents A and B, obs_a, obs_b, pi_a, pi_b, and shared_layer. Independent agents:
With shared layer:
This is basically a version of a grouped agent that trades efficiency for simplicity.
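The original comment's two code snippets were lost in this scrape; the workaround can be sketched as follows. This is a hedged stand-in: cnn, shared_layer, and pi are toy functions, not real networks. Every policy receives the observations of ALL agents and re-runs the shared layer itself, so the shared unit is evaluated once per policy (N times total), but every agent sees the identical communication output.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn(obs):                 # per-agent feature extractor (stand-in)
    return obs * 2.0

def shared_layer(feats):      # shared communication layer (here: mean-pool)
    return feats.mean(axis=0)

def pi(own_feat, comm):       # policy head combining own features + comm
    return own_feat + comm

obs_a, obs_b = rng.random(4), rng.random(4)

# Each policy independently builds the full feature stack and communication:
feats = np.stack([cnn(obs_a), cnn(obs_b)])
act_a = pi(cnn(obs_a), shared_layer(feats))  # evaluated inside pi_a
act_b = pi(cnn(obs_b), shared_layer(feats))  # re-evaluated inside pi_b
```

Because both policies compute shared_layer over the same stacked features, the duplicated evaluation is redundant work but produces identical results.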
Not sure if I understand; can this be implemented entirely in the model definition, or does this require changes in RLlib? If it can be implemented in the model, it surely would require a normal Gym env instead of the MultiAgentEnv? If I have only a single policy and multiple agents, would the shared unit only be evaluated once with your proposed solution?
Now I am trying this: I have a
The example above involves both (1) altering the env to provide the required obs, and (2) altering the model to implement the desired communication.
I'm not sure I understand this. Why not instead give the full observation to each agent, as in the above example? Then each agent's policy can filter out the desired data without needing to mess with the batch dimension (no guarantees on how that dimension is organized).
I'm sorry, maybe I am completely misunderstanding how this works... I highly appreciate your patience with me! Let me reiterate what I want to do: in my use case, I have a single policy for n agents (Level 1: multiple agents, shared policy, 'homogeneous agents') and I want them to share information at every time step. The idea is that each agent receives the output features of a CNN applied to the observations of all other agents at the same time step and processes this feature vector to an action in the shared layer.

If I use the MultiAgentEnv, each individual agent's observation would be fed to the model as one batch entry, right? This means that at every time step, each agent has to compute the CNN output vector of every other agent: at t=0, agent 0 evaluates the CNN on the observation of itself and the n-1 other agents, agent 1 then computes the same CNN outputs for itself and the other agents, etc. This means I have to include the observations of all other agents in every observation, as you said. This is fine if I have, say, five agents, but the training will slow down almost linearly with the number of agents due to the redundant evaluations, or am I missing something? So far, this was not a big problem and the training generally works.

Now I am adding an LSTM between the CNN and the shared layer (to share the output features of the LSTM among agents). Now not only the CNN but also the LSTM is evaluated n times at every time step. Most importantly, in order to evaluate the LSTM for all other agents, each agent has to maintain a redundant copy of the recurrent state of the other agents, right? This is my main concern. Besides wasting memory and computation resources, wouldn't this mess with the learning if the states that each agent maintains for all other agents diverge, since it's the same policy for all agents?
I tried to visualize this architecture. As I see it, this problem can only be solved with something in between the MultiAgentEnv and the standard Gym Env: I want a single observation each time step (while maintaining the LSTM state for n agents) but output n actions from this single observation, while updating the policy in the same way as if I had five subsequent observations (i.e., I don't want a single super-agent, since I want to be able to change the number of agents during training). Here is also a small visualization of how I thought this might work. Does this make sense?
Fair, makes sense!
(deleted an older post that was wrong)
Yep, makes sense that it doesn't scale to a very large number of agents, since it's O(n^2) work as noted. It also makes sense that the super-agent approach doesn't work if the number of agents varies dynamically over time (unless you use padding hacks, etc.).
Yep. So I think one way of doing this is treating this just as an optimization problem. Logically, we want to implement the first figure you have. However, in many cases we can, under the hood, change our computation graph to actually execute the second figure. This should be possible since forward/backward passes are batched, if we are willing to peek across the batch dimension to see what computation can be shared. I'm thinking you can try something like this; suppose we have a batch of data that looks like this (two env steps):
Naively you'd compute the output as
But you can instead compute it like this:
Does this seem workable? Ideally RLlib would make this easy (cc @sven1977), but I think it's doable manually for now. There's also the question of duplication of observations, but that could be solvable with better data representations (also, compression algos are really good at de-duplicating data).
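The example batch and the two computation variants from the comment above did not survive the scrape; the de-duplication idea can be sketched like this. It is an illustrative stand-in (the "shared layer" is just a mean over a step's rows): ship a per-env-step identifier inside the observation, run the shared computation once per unique step id, then scatter the result back to all rows of the batch.

```python
import numpy as np

step_ids = np.array([7, 7, 7, 9, 9, 9])     # 2 env steps, 3 agents each
obs = np.random.rand(6, 4)

uniq, inverse = np.unique(step_ids, return_inverse=True)

# Expensive shared computation: once per unique env step instead of per row.
shared_per_step = np.stack([obs[step_ids == u].mean(axis=0) for u in uniq])

# Broadcast the shared result back to the original (batch, data) layout.
shared_full = shared_per_step[inverse]
print(shared_full.shape)  # (6, 4)
```

In a real model the same inverse-index trick works with torch.unique(return_inverse=True), which is the "easier in torch" route mentioned a few comments down.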
Thank you very much Eric, this sounds great! I will give it a try!
Yeah, I think the super-agent approach is almost equivalent... You can implement the same data flow within a single policy if that works for the env. You are right that there would be a single VF estimate for the baseline though, which would be one difference. That might reduce training efficiency.
Hmmm, but in order to implement your proposed solution, I would have to assume a certain spatial arrangement in the batch, and as you said there are no guarantees that the data is aligned like that, right? I think what you proposed is essentially what I suggested here. At the very least I would have to know which rows in the batch are from the same time step of the same environment instance?
You don't need to assume a spatial arrangement; you can inspect the data and de-duplicate based on whether you see the same identical obs (you could include extra data, like a "timestep id", in the obs to make this easier). Might be a bit convoluted in TF (i.e., gather_nd), easier in torch.
I see. I now tried the super-agent approach again and could not really replicate the performance of my current naive implementation. Probably both the single VF estimate and, more importantly, summing the reward signal over all agents contribute to that (PPO has no way of knowing which individual action resulted in a reward for an individual agent, right? And this would probably be even worse for a larger number of agents).

I also tried implementing your proposed solution. I added a random identifier to the observation for each time step so I can identify batch rows of the same time step in the model. Then I can pick the unique values from those identifiers and compute the shared layer only for them. Afterwards, I can transform this back to the original batch size. This works if I have no recurrent layer. If I add one, I also have to keep track of the time dimension.

This solution feels a bit hacky, to be honest... If observations from the same time step are spread across multiple batches, it will not work (and I believe independent states would then again be maintained for the same agents). Is there absolutely no way to implement my second architecture more directly? I'd imagine a model that takes a super-observation and provides n actions and n value function estimates, which can then be used to perform n policy updates. Can't this be done by implementing my own RolloutWorker, or is that for a different use case?
Great! I really like the proposed solution @ericl, and in my humble opinion it is the way to go. It will allow many of the already existing algorithms in RLlib to be used with minor changes. Looking forward to using this :) Just a minor question: let us imagine a scenario in which a team of agents wants to achieve a cooperative goal, thus maximising a shared reward function (e.g., the sum or mean of individual rewards). All of them may share the same policy (homogeneous agents), but they communicate through a differentiable protocol before jointly taking the action. Yet, NOT ALL agents are "active" during the entire episode. That is, some agents can start contributing to the common goal in the middle of the episode, and some others can stop contributing before the end. Nevertheless, those agents entering late / exiting early still want to achieve the highest long-term shared reward. I am wondering if this new implementation will help to deal with such a problem :O
@ramondalmau here's how I would tackle that one:
Here, the env can produce a global reward and give it to each agent.
This part should be doable once we add the agent id array and allow lockstep replay.
IIUC this is the key difficulty: that you want to give the final shared reward to exited agents. One way of handling this is to not send done=True for exited agents until the env itself finishes. Then, you can have the env send done to all agents, discounting the reward appropriately depending on the time delay between agent exit and env exit.
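The suggestion above can be sketched in a few lines. This is an illustrative stand-in (the function name and the numbers are invented): instead of keeping exited agents alive with no-op steps, the env hands each exiting agent a single final reward, discounted for the delay between its exit and the episode end.

```python
gamma = 0.99  # standard discount factor (value assumed for illustration)

def delayed_terminal_reward(final_reward, exit_step, terminal_step):
    """Discount the shared terminal reward back to the agent's exit step."""
    delay = terminal_step - exit_step
    return (gamma ** delay) * final_reward

# An agent exits at t=40; the episode ends at t=100 with shared reward 10.0.
r = delayed_terminal_reward(10.0, exit_step=40, terminal_step=100)
```

This keeps the temporal discounting equivalent to what the agent would have received had it stayed in the episode collecting zero reward until the end.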
Hey @ericl sorry for the late reply :) I am approaching the problem a little bit differently:
- Each agent a receives a team reward r at every time step t (which is the sum of individual rewards), even if the agent already exited. The reason is that what an agent did before exiting may have an effect on the team return, which is what I want to maximise! If I only consider the rewards that the agent received while it was active, I do not really achieve cooperative behavior.
- When an agent exits the environment, I do not consider its actions anymore in the 'step' of the environment (I completely ignore it because it already finished), and I fill its observation with 0s (ensuring that observing all 0s is never possible if the agent is active, of course).
- After generating a batch of experiences from the environment, the return (or advantage in my case) of each agent (which is identical for all of them because they share the reward) is computed taking into account all the rewards, from time 0 to the end of the batch, even if the agent exited before.
- However, in the loss function I mask the samples for which an agent was not active (which I know because the observation was all 0s), meaning that these samples do not contribute to updating the policy.
I only see one problem: imagine a hypothetical case with a batch size of 200 and 10 agents. Each batch therefore actually contains 200 time steps * 10 agents = 2000 samples. However, some of these samples may be masked and not considered when updating the policy. This means that, actually, the size of the batch is not fixed at 2000 but will depend on how many agents were active in those 200 steps :) I do not know if it was clear... but my conclusion is that having agents that exit early, together with a batch size that refers to time steps and not actual valid samples, may lead to dangerous dynamic batch sizes.
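The masking strategy described above can be sketched as follows. This is an illustrative stand-in with made-up values: rows where the agent was inactive (all-zero observation, by the convention in this thread) are excluded from the policy loss, which is why the effective batch size varies.

```python
import numpy as np

obs = np.array([[1., 2.], [0., 0.], [3., 4.]])   # row 1: exited agent
per_sample_loss = np.array([0.5, 9.9, 1.5])

mask = np.any(obs != 0.0, axis=1)                # True where agent was active
masked_loss = (per_sample_loss * mask).sum() / mask.sum()
print(masked_loss)  # 1.0
```

Note the division by mask.sum() rather than by the nominal batch size, which is exactly the dynamic-batch-size effect the comment worries about.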
I would recommend an optimization to not need the masking strategy, by calculating the discounted final reward in the env. Then, you only need to have one "final" observation for each agent, instead of a long sequence of noops. That should avoid the problem you mentioned but still apply proper temporal discounting.
Yes, this is definitely a good proposal. Many thanks @ericl. I had a look at your recent replay implementation, and it will help me avoid the with_agents_groups wrappers and solve the above-mentioned issue by computing the discounted reward in the environment.
Thanks for all the efforts so far, Eric, very helpful! Those changes look really good!
Hmm, I think the main barrier would be that we would have to re-execute the observation function at training time. Currently, it's run only during rollout and we save the observed output as numpy arrays. This would require some tricky plumbing changes in RLlib. I'm also not sure how batching would work.
What if we let information sharing be entirely the model's (and therefore the user's) job? In addition to the agent grouping, we would have an independent model grouping configuration. Different (groups of) agents may have different policies but share the same model. It's then up to the user to have different policy networks for different agents in the model, if that is desired. For each model grouping, there is a separate optimizer. Depending on the model grouping, the policy losses are combined and the models are optimized accordingly.
Dear MARL friends, I hope you are doing well. Ramon
Hi all, I have been working on MARL with differentiable communication between agents, and I just stumbled on this ticket. We came up with exactly the same solution, which consists in implementing a "super agent" (that we call a "central agent": CentralPPO, CentralDQN), with all the tricks described so well by @janblumenkamp in his message:
It works, with a few difficulties at evaluation time like #10228. I've seen that the replay modes (independent/lockstep) are now integrated into Ray 0.8.7, but I was wondering where we stand regarding the second point raised by @ericl:
Generally, how far are we from enabling differentiable communication with built-in agents? Many thanks!
Hi Thomas, last time I talked to Sven, he told me that according to the current schedule, the trajectory view API for multi-agent use cases will be tackled after the Ray summit in October.
Hi Jan, thanks for the quick reply and congrats on the paper, I will definitely read it and see how it relates to other approaches like Graph Convolutional RL, STMARL, MARL for networked system control, Intention Propagation, etc., which all revolve around the same idea of differentiable communication channels. Thanks for the info; we are still using a super-agent as well, but I would love to get rid of the hacks and workarounds, as the problem we are trying to solve is very complex (large scale, long horizon, highly cooperative, continuous, dynamic neighbourhoods, agents appearing and disappearing during the episode, potentially formulated with heterogeneous agents, ...) and the hacks to accommodate all of this keep piling up. Good luck with your research!
What is currently the best way to go about sharing states or any other information between agents with a dynamic number of agents? I define a maximum number of active agents in the environment, but during an episode agents can finish, and some time later a new agent will start. So the total number of agents continues increasing; however, there will never be more active agents than allowed.
I've built a custom centralised critic RNN model that receives two inputs: one to predict the next action, which only contains agent-specific observations, and one to compute the value, which contains all observations from all agents. Still using rllib version
I also modified
My complete observation space looks as follows:
In my environment at the end of the
It's training extremely slowly, and I can't tell if it's because I've essentially created two LSTM models? I'm also not sure if I should be populating
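The two-branch idea described above can be sketched as follows. This is a hedged stand-in, not the poster's actual model: the linear maps below replace the two RNN branches, and all names and dimensions are invented. The action branch sees only the agent's own observation, while the value branch sees the concatenated observations of all agents (the centralised-critic pattern).

```python
import numpy as np

own_dim, n_agents, n_actions = 4, 3, 2
rng = np.random.default_rng(0)
W_pi = rng.random((own_dim, n_actions))            # action-branch "network"
W_vf = rng.random((own_dim * n_agents, 1))         # value-branch "network"

def forward(own_obs, all_obs):
    logits = own_obs @ W_pi                        # action branch: own obs only
    value = (all_obs.reshape(-1) @ W_vf)[0]        # value branch: global obs
    return logits, value

all_obs = rng.random((n_agents, own_dim))
logits, value = forward(all_obs[0], all_obs)
print(logits.shape)  # (2,)
```

With two recurrent branches, each would also need its own hidden state, which doubles the recurrent bookkeeping and is one plausible reason for the slowdown the comment describes.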
This issue can hopefully be closed once #10884 is done :) |
Do you guys have a rough idea when #10884 will be finished? I am currently using @janblumenkamp's awesome workarounds. However, I don't want to build things twice if I can help it. Thanks!
To add to this, this is the project/repository that resulted from this thread on my side. As a working minimal example with a more recent Ray version, I have created this repository. It's a toy problem that serves as a reference implementation for the changes that are due to be done in RLlib. I talked to Sven recently, and the plan is to hopefully get this done over the next few weeks :) EDIT: Just an update regarding my minimal example: it now supports both continuous and discrete action spaces, and I have cleaned up the trainer implementation quite a bit; it should be much clearer now. Let me know if you have any questions.
Hi @ericl @janblumenkamp. This whole thread was very helpful, thanks for the detailed explanations from both of you! I am currently in the process of migrating a project to the RLlib framework, and I had some doubts about some of the points in your discussion.
My doubts revolve around the Agent grouping mechanism
Thank you so much again! I can't wait to onboard to RLlib!
Hi @Rohanjames1997!
Hi @janblumenkamp! Assuming I had no inter-agent communication, could you answer my previous questions? And an additional one: Thanks again! And congratulations on the paper! It was a great read! 😄
What is your question?
My goal is to learn a single policy that is deployed to multiple agents (i.e. all agents learn the same policy, but are able to communicate with each other through a shared neural network). RLlib's multi-agent interface works with the dict indicating an action for each individual agent.
It is not entirely clear to me how my custom model is supposed to obtain the current state after the last time step for all agents at once (it appears to me that RLlib calls the forward function in my subclass inherited from TorchModelV2 for each agent individually and passes the state for each agent into the state argument of the forward function). tl;dr, if this is my custom model:
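The custom model code block did not survive in this thread. As an illustrative stand-in, here is a toy class mimicking the shape of RLlib's TorchModelV2 interface (forward(input_dict, state, seq_lens)); the linear layer and "state update" are fake numpy placeholders, not real torch code, and all names are invented.

```python
import numpy as np

class ToyAgentModel:
    def __init__(self, obs_dim, num_outputs, state_dim):
        self.W = np.random.rand(obs_dim, num_outputs)
        self.state_dim = state_dim

    def get_initial_state(self):
        return [np.zeros(self.state_dim, dtype=np.float32)]

    def forward(self, input_dict, state, seq_lens):
        # RLlib calls this per policy with that agent's own batch and state,
        # which is exactly the limitation the question below is about.
        logits = input_dict["obs"] @ self.W
        new_state = [state[0] + 1.0]          # placeholder recurrent update
        return logits, new_state

model = ToyAgentModel(obs_dim=4, num_outputs=2, state_dim=8)
logits, state = model.forward(
    {"obs": np.random.rand(5, 4)}, model.get_initial_state(), None)
print(logits.shape)  # (5, 2)
```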
Then how do I manage to predict the logits for all of my n agents at once while having access to the current state of all my agents? Am I supposed to use variable-sharing? Is #4748 describing this exact problem? If so, is there any progress?