
[rllib] Custom model for multi-agent environment: access to all states #7341

Closed
janblumenkamp opened this issue Feb 27, 2020 · 54 comments
Labels: enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical) · question (Just a question :)) · rllib (RLlib related issues)

Comments

@janblumenkamp
Contributor

What is your question?

My goal is to learn a single policy that is deployed to multiple agents (i.e. all agents learn the same policy, but are able to communicate with each other through a shared neural network). RLlib's multi-agent interface works with a dict specifying an action for each individual agent.

It is not entirely clear to me how my custom model is supposed to obtain the current state after the last time step for all agents at once (it appears to me that RLlib calls the forward function in my subclass inherited from TorchModelV2 for each agent individually and passes the state for each agent into the state argument of the forward function).

tl;dr, if this is my custom model:

import torch.nn as nn

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override


class AdaptedVisionNetwork(TorchModelV2, nn.Module):
    """Generic vision network."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        self._cur_value = None
        # ... NN model definition (self.predict, self._logits, self._value_branch, ...)

    @override(TorchModelV2)
    def forward(self, input_dict, state, seq_lens):
        # Called per (batched) policy evaluation; `state` holds the RNN state, if any.
        features = self.predict(input_dict["obs"].float())
        logits = self._logits(features)
        self._cur_value = self._value_branch(features).squeeze(1)
        return logits, state

    @override(TorchModelV2)
    def value_function(self):
        assert self._cur_value is not None, "must call forward() first"
        return self._cur_value

Then how do I manage to predict the logits for all of my n agents at once while having access to the current state of all my agents? Am I supposed to use variable-sharing? Is #4748 describing this exact problem? If so, is there any progress?

@janblumenkamp janblumenkamp added the question Just a question :) label Feb 27, 2020
@ericl
Contributor

ericl commented Feb 27, 2020

@janblumenkamp you can try these two options:

(1) Grouping the agents: https://ray.readthedocs.io/en/latest/rllib-env.html#grouping-agents
Then you can write a single policy/model that computes all the agents' logits at once and can implement the desired information-sharing structure in the model.

(2) Reshape the batch in your custom model from (batch, data) to (batch, agent_id, data) for processing (i.e., do the grouping in the model only). You will need some way of figuring out the agent id for each batch entry.
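For concreteness, option (2) could look roughly like this in PyTorch (an illustrative sketch only; regroup_by_agent is a made-up helper, and it assumes each env step contributes exactly one row per agent in a fixed order, which, as discussed later in this thread, RLlib does not guarantee):

import torch

def regroup_by_agent(obs: torch.Tensor, n_agents: int) -> torch.Tensor:
    """Reshape a flat (batch, data) tensor into (batch // n_agents, n_agents, data).

    Assumes each environment step contributed exactly one row per agent and that
    rows from the same step are contiguous and ordered by agent id.
    """
    assert obs.shape[0] % n_agents == 0, "batch size must be a multiple of n_agents"
    return obs.view(obs.shape[0] // n_agents, n_agents, obs.shape[1])

# usage inside a custom model's forward(), e.g.:
# grouped = regroup_by_agent(input_dict["obs"], n_agents=5)  # then share information across dim 1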

@janblumenkamp
Contributor Author

Hi Eric,
in the meantime, I found this example. The states of the other agents are basically treated as part of the observation space, which seems to be the simplest solution to me and is similar to your second proposed option. Would that make sense, or do you see any downsides/problems with doing that (in particular in terms of performance)?
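For reference, such an observation space could be sketched like this (illustrative names and sizes only, not taken from the linked example): each agent observes its own state, the states of all agents, and its own index so the model can tell which entry is "self".

import numpy as np
from gym import spaces

n_agents = 5     # hypothetical number of agents
state_dim = 16   # hypothetical per-agent state size

observation_space = spaces.Dict({
    "own_state": spaces.Box(-np.inf, np.inf, shape=(state_dim,), dtype=np.float32),
    "all_states": spaces.Box(-np.inf, np.inf, shape=(n_agents, state_dim), dtype=np.float32),
    "agent_index": spaces.Discrete(n_agents),
})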

@ericl
Contributor

ericl commented Feb 27, 2020

Yes, that would work great! I guess you will end up evaluating the 'shared' layer once per agent but the overhead there should be minimal.

@janblumenkamp
Contributor Author

Thank you!

@janblumenkamp
Contributor Author

Hi Eric,
I have a follow-up question: In this blog post, you write

decomposing the actions and observations of a single monolithic agent into multiple simpler agents not only reduces the dimensionality of agent inputs and outputs, but also effectively increases the amount of training data generated per step of the environment

but in the agent grouping documentation, it says

RLlib treats agent groups like a single agent with a Tuple action and observation space.

Do I still have the advantages of multiple single agents (i.e. more distributed experience) if I use agent grouping, or is the grouped super-agent literally treated as one big agent with a huge observation and action space? I assume the former is the case?

@ericl
Contributor

ericl commented Mar 23, 2020

or is the grouped super-agent literally treated as one big agent with a huge observation and action space

It's the latter, it really is one big super-agent. You could potentially still do an architectural decomposition within the super agent model though (i.e., to emulate certain multi-agent architectures).
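As a rough illustration of such a decomposition inside the super-agent model (a hedged sketch, not RLlib code; all names and sizes are made up), the grouped Tuple observation can be stacked to (batch, n_agents, obs_dim), encoded per agent with shared weights, pooled into a communication "message", and decoded into per-agent logits:

import torch
import torch.nn as nn

class SuperAgentNet(nn.Module):
    """Per-agent encoder with shared weights, a pooled message, and a shared head
    that produces logits for every agent of the super-agent at once."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_actions)  # own features + pooled message

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, n_agents, obs_dim) -- the grouped Tuple observation, stacked.
        feats = self.encoder(obs)                  # (batch, n_agents, hidden)
        message = feats.mean(dim=1, keepdim=True)  # (batch, 1, hidden) shared information
        message = message.expand_as(feats)         # broadcast the message to every agent
        logits = self.head(torch.cat([feats, message], dim=-1))
        return logits                              # (batch, n_agents, n_actions)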

@janblumenkamp
Contributor Author

(I hope it is okay to follow up again in this issue since the question is again closely related.)
How would one go about incorporating RNNs into the shared neural network? Right now I have one policy and multiple agents; the observation contains the state of all agents and an ID for the corresponding agent (exactly as in my example posted here). In my model I process the states of all the other agents and know from the transmitted ID which state belongs to this agent (since forward is called for each agent, the first part is, as you pointed out before, redundant across agents). If I now add RNNs (in my example directly after running self.cnn_model), I need a shared state among all agents, right? Otherwise each agent would maintain a different internal state for all the other agents? Or is this not a problem?

@janblumenkamp
Contributor Author

Please ignore my previous question, I had a misconception of how the batching works.

Yes, that would work great! I guess you will end up evaluating the 'shared' layer once per agent but the overhead there should be minimal.

So far I followed this strategy, but now that I add LSTMs it seems that the overhead adds up significantly, especially for larger numbers of agents.

(2) Reshape the batch in your custom model from (batch, data) to (batch, agent_id, data) for processing (i.e., do the grouping in the model only). You will need some way of figuring out the agent id for each batch entry.

Probably I will have to do something like this then. I will not only need to figure out each agent's id, but also which elements in the batch come from the same time step, right? Is it guaranteed that consecutive elements in the batch come from the same environment and the same time step? Even if I don't allow any early termination of agents, I sometimes receive a batch that contains fewer entries than I have agents (here I printed the batch size from inside my model):

(pid=86346) 1
(pid=86346) 1
(pid=86346) 1
(pid=86346) 1
(pid=86347) 20
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86347) 1
(pid=86327) 700
(pid=86327) 700
(pid=86327) 600
(pid=86327) 600

(Here I have 5 agents in my environment. The batches initially sometimes contain 20 elements and then only 1, but there are exactly 20 of those size-1 batches, and the agent ID I am sending in the observation indicates that those 20 batches are one size-20 batch disassembled into multiple smaller batches - why is that happening? Should I just discard those batches?)
Is it guaranteed that consecutive entries in the batch are from the same time step, or could there be a possibility that this is not the case?

@janblumenkamp janblumenkamp reopened this Apr 20, 2020
@ericl
Contributor

ericl commented Apr 20, 2020

Hmm what exactly is being printed out here? The batches should get merged together to the train batch size eventually, but you might see smaller fragments during processing.

Also, note that batch.count always measures number of environment steps, which could be much lower than the sum of agent steps.

@ericl
Contributor

ericl commented Apr 20, 2020

Oh, the "size 1" batches are probably from inference if you're printing from your model - that's normal.

@janblumenkamp
Contributor Author

Oh I see, that makes sense - then it's coming from the evaluation I am running during training! I guess reshaping the batch in the model won't work then. Do I have to group the agents? For PyTorch that will only be possible after #8101 is finalized. It looks like that will happen very soon, but just out of curiosity, what other options do I have? Just to recap: basically I want n actions for n agents from a single shared observation space (and thus to perform n policy updates for each observation). I could probably implement a standard gym environment (so not use the MultiAgentEnv) that provides the shared observation space and takes a MultiDiscrete action. My model would take this observation and provide the actions. I could then take this MultiDiscrete action in my own RolloutWorker, similar to this example, to perform n policy updates? Or is there an easier way?
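For reference, the "standard gym env with a super-observation and MultiDiscrete action" idea could be declared roughly like this (an illustrative sketch only; sizes are made up):

import numpy as np
from gym import spaces

n_agents, n_actions, obs_dim = 5, 4, 16  # hypothetical sizes

# One "super" observation containing all agents' observations ...
observation_space = spaces.Box(-np.inf, np.inf, shape=(n_agents, obs_dim), dtype=np.float32)
# ... and one MultiDiscrete action containing one sub-action per agent.
action_space = spaces.MultiDiscrete([n_actions] * n_agents)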

@ericl
Contributor

ericl commented Apr 20, 2020

There is one workaround, which is to have the policies execute forward() through the "shared" communication layer separately. Basically the shared unit will be evaluated N times for N policies instead of once, so this is more computationally expensive, but should give the same results.

You'll have to make sure the observation space contains the observations of all other agents though.

Edit: example, suppose I have agents A, B, obs_a, obs_b, pi_a, pi_b, and shared_layer.

Independent agents

obs_a -> pi_a(obs_a) -> action for A
obs_b -> pi_b(obs_b) -> action for B

With shared layer

(obs_a, obs_b) -> pi_a(obs_a, shared_layer(obs_b)) -> action for A
(obs_b, obs_a) -> pi_b(obs_b, shared_layer(obs_a)) -> action for B

This is basically a version of a grouped agent that trades efficiency for simplicity.
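A minimal PyTorch sketch of this workaround (illustrative only; CommPolicyNet and the layer sizes are made up): each agent's policy re-evaluates the shared layer on the other agents' observations inside its own forward pass.

import torch
import torch.nn as nn

class CommPolicyNet(nn.Module):
    """pi(own_obs, shared_layer(other_obs)): the shared unit is re-evaluated once
    per agent, trading efficiency for simplicity as described above."""

    def __init__(self, own_dim: int, other_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.shared_layer = nn.Sequential(nn.Linear(other_dim, hidden), nn.ReLU())
        self.pi = nn.Sequential(nn.Linear(own_dim + hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_actions))

    def forward(self, own_obs: torch.Tensor, other_obs: torch.Tensor) -> torch.Tensor:
        # own_obs: (batch, own_dim), other_obs: (batch, other_dim)
        shared = self.shared_layer(other_obs)
        return self.pi(torch.cat([own_obs, shared], dim=-1))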

@janblumenkamp
Contributor Author

Not sure if I understand, can this be implemented entirely in the model definition or does this require changes in RLlib? If it can be implemented in the model, it surely would require a normal Gym env instead of the MultiAgentEnv? If I have only a single policy and multiple agents, the shared unit would only be evaluated once with your proposed solution?

@janblumenkamp
Contributor Author

Now I am trying this: I have a MultiAgentEnv in which every observation contains all agents' observations as well as an id for each agent. My model now receives the batched observations for all agents. The idea was to filter out only the observations from the agent with id 0 from the whole batch in the model (so basically ignore all agents' observations except one agent's), then compute the actions for all agents from that single observation in forward and return that batch. This would only work as long as the batch size is a multiple of the number of agents and as long as the batch contains the same number of samples from each agent. Unfortunately, the latter is not necessarily true - it seems that the samples in a batch can be completely scrambled (but not always). Why is that the case?

@ericl
Contributor

ericl commented Apr 24, 2020

Not sure if I understand, can this be implemented entirely in the model definition or does this require changes in RLlib?

The example above involves both (1) altering the env to provide the required obs, and (2) altering the model to implement the desired communication.

My model now receives the batched observations for all agents. The idea was to filter out only the observations from the agent of id 0 from the whole batch in the model (so basically ignore all agent's observations except one agent),

I'm not sure I understand this. Why not instead give the full observation to each agent as in the above example? Then, each agent's policy can filter out the desired data without needing to mess with the batch dimension (no guarantees on how that dimension is organized).

@janblumenkamp
Contributor Author

I'm sorry, maybe I am completely misunderstanding how this works... I highly appreciate your patience with me! Let me reiterate what I want to do:

In my use case, I have a single policy for n agents (Level 1: multiple agents, shared policy - 'homogeneous agents') and I want them to share information at every time step. The idea is that each agent receives the CNN output features of the observations of all other agents at the same time step and processes this feature vector into an action in the shared layer. If I use the MultiAgentEnv, each individual agent's observation is fed to the model as one batch entry, right? This means that at every time step, each agent has to compute the CNN output vector of every other agent: at t=0, agent 0 evaluates the CNN features for itself and the n-1 other agents, agent 1 then computes the same CNN features for itself and the other agents, etc. This means I have to include the observations of all other agents in every observation, as you said. This is fine if I have something like five agents, but the training will slow down almost linearly with the number of agents due to the redundant evaluations, or am I missing something?

So far, this was not a big problem and the training generally works. Now I am adding an LSTM between the CNN and the shared layer (to share the output features of the LSTM among agents). Now, not only the CNN but also the LSTM is evaluated n times at every time step. Most importantly, in order to evaluate the LSTM for all other agents, each agent has to maintain a redundant copy of the recurrent state of the other agents, right? This is my main concern. Besides wasting memory and computation, wouldn't this interfere with learning if the states that each agent maintains for all the other agents diverge, given that it's the same policy for all agents? I tried to visualize this architecture:
[Figure: masters-project-arch-0]

As I see it, this problem can only be solved with something in between the MultiAgentEnv and the standard Gym env: I want a single observation each time step (while maintaining the LSTM state for n agents) but output n actions from this single observation, while updating the policy in the same way as if I had five subsequent observations (i.e., I don't want a single super-agent, since I want to be able to change the number of agents during training). Here is also a small visualization of how I thought this might work. Does this make sense?
[Figure: masters-project-arch-1]

no guarantees on how that dimension is organized

Fair, makes sense!

@ericl
Contributor

ericl commented Apr 24, 2020

(deleted an older post that was wrong)

This is fine if I have like five agents, but the training will slow down almost linearly with the number of agents due to the redundant evaluations, or am I missing something?

Yep, makes sense that it doesn't scale to a very large number of agents, since it's O(n^2) work as noted. It also makes sense that the super-agent approach doesn't work if the number of agents is varying dynamically over time (unless you use padding hacks, etc).

Also here a small visualization of how I thought this might work. Does this make sense?

Yep. So I think one way of doing this is treating this just as an optimization problem. So logically, we want to implement the first figure you have.

However, in many cases we can, under the hood, change our computation graph to actually execute the second figure. This should be possible since forward/backward passes are batched, provided we are willing to peek across the batch dimension to see what computation can be shared.

I'm thinking you can try something like this, suppose we have a batch of data that looks like this (two env steps):

[
    [obs_a1, [obs_a1, obs_b1, obs_c1]],
    [obs_b1, [obs_a1, obs_b1, obs_c1]],
    [obs_c1, [obs_a1, obs_b1, obs_c1]],
    [obs_a2, [obs_a2, obs_b2, obs_c2]],
    [obs_b2, [obs_a2, obs_b2, obs_c2]],
    [obs_c2, [obs_a2, obs_b2, obs_c2]],
]

Naively you'd compute the output as

[
    policy(obs_a1, shared_layer([obs_a1, obs_b1, obs_c1])),
    policy(obs_b1, shared_layer([obs_a1, obs_b1, obs_c1])),
    policy(obs_c1, shared_layer([obs_a1, obs_b1, obs_c1])),
    policy(obs_a2, shared_layer([obs_a2, obs_b2, obs_c2])),
    policy(obs_b2, shared_layer([obs_a2, obs_b2, obs_c2])),
    policy(obs_c2, shared_layer([obs_a2, obs_b2, obs_c2])),
]

But you can instead compute it like this:

stage1_out = [
    shared_layer([obs_a1, obs_b1, obs_c1]),
    shared_layer([obs_a2, obs_b2, obs_c2]),
]

stage2_out = [
    policy(obs_a1, stage1_out[0]),
    policy(obs_b1, stage1_out[0]),
    policy(obs_c1, stage1_out[0]),
    policy(obs_a2, stage1_out[1]),
    policy(obs_b2, stage1_out[1]),
    policy(obs_c2, stage1_out[1]),
]

Does this seem workable? Ideally RLlib would make this easy (cc @sven1977 ) but I think it's doable manually for now.

There's also the question of duplication of observations, but that could be solvable with better data representations (also, compression algos are really good at de-duplicating data).

@ericl ericl added enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks labels Apr 24, 2020
@janblumenkamp
Contributor Author

Thank you very much Eric, this sounds great! I will give it a try!
But would the problem really be easier if I went for a fixed number of agents? Would the super-agent approach really be equivalent, i.e. would the policy benefit from the experience of multiple agents and internally output only a single action, so that I would have the flexibility to deploy it to any number of agents? To me it seemed like this wouldn't work, because I would have to provide a single value function estimate for each action tuple, so I assumed that a single policy is trained that outputs n actions at once (which would make the training much harder, wouldn't it?).

@ericl
Contributor

ericl commented Apr 24, 2020

Yeah I think the super agent approach is almost equivalent... You can implement the same data flow within a single policy if that works for the env.

You are right that there would be a single VF estimate for the baseline though, which would be one difference. That might reduce training efficiency.

@janblumenkamp
Contributor Author

Hmmm, but in order to implement your proposed solution I would have to assume a certain spatial arrangement in the batch and as you said there are no guarantees that the data is aligned like that, right? I think what you proposed is essentially what I suggested here. At the very least I would have to know which rows in the batch are from the same time step of the same environment instance?

@ericl
Contributor

ericl commented Apr 24, 2020

You don't need to assume a spatial arrangement; you can inspect the data and de-duplicate based on whether you see the same identical obs.

(You could include extra data, like a "timestep id", in the obs to make this easier.)

Might be a bit convoluted in TF (i.e., gather_nd), easier in torch.
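In torch, the de-duplication could be sketched roughly as follows (illustrative only; two_stage_forward is a made-up helper and assumes a per-step id has been added to the observation, as suggested above):

import torch

def two_stage_forward(own_obs, all_obs, step_ids, shared_layer, policy):
    """Evaluate shared_layer once per unique env step instead of once per batch row.

    own_obs:  (batch, own_dim)        - each agent's own observation
    all_obs:  (batch, n_agents, dim)  - the joint observation, duplicated per agent
    step_ids: (batch,)                - identifier marking rows from the same env step
    """
    unique_ids, inverse = torch.unique(step_ids, return_inverse=True)
    # Find the first batch row belonging to each unique env step.
    first_row = torch.full((unique_ids.shape[0],), -1, dtype=torch.long)
    for row, group in enumerate(inverse.tolist()):
        if first_row[group] < 0:
            first_row[group] = row
    stage1_out = shared_layer(all_obs[first_row])   # one evaluation per env step
    return policy(own_obs, stage1_out[inverse])     # broadcast the result back to every row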

@janblumenkamp
Contributor Author

janblumenkamp commented Apr 26, 2020

I see. I now tried the super-agent approach again and could not really replicate the performance of my current naive implementation. Probably both the single VF estimate and, more importantly, summing the reward signal over all agents contribute to that (PPO has no way of knowing which individual action resulted in an individual agent's reward, right? And this would probably be even worse for a larger number of agents).

I also tried implementing your proposed solution. I added a random identifier to the observation for each time step so I can identify batch rows from the same time step in the model. Then I can pick the unique values among those identifiers and compute the shared layer only for them. Afterwards, I can transform this back to the original batch size. This works if I have no recurrent layer. If I add one, I also have to keep track of the time dimension with seq_lens, which is doable, but it gets very messy...

This solution feels a bit hacky, to be honest... If observations from the same time step are spread across multiple batches it will not work (and I believe independent states would then again be maintained for the same agents). Is there absolutely no way to implement my second architecture more directly? I'd imagine a model that takes a super-observation and provides n actions and n value function estimates which can then be used to perform n policy updates. Can't this be done by implementing my own RolloutWorker or is that for a different use case?

@ramondalmau

Great! I really like the proposed solution @ericl, and in my humble opinion it is the way to go. This will make it possible to use many of the existing algorithms in RLlib with minor changes. Looking forward to using this :)

Just a minor question: let us imagine a scenario in which a team of agents wants to achieve a cooperative goal, thus maximising a shared reward function (e.g., the sum or mean of individual rewards). All of them may share the same policy (homogeneous agents), but they communicate through a differentiable protocol before jointly taking the action. Yet NOT ALL agents are "active" during the entire episode. That is, some agents can start contributing to the common goal in the middle of the episode, and others can stop contributing before the end. Nevertheless, the agents entering late / exiting early still want to maximise the long-term shared reward. I am wondering if this new implementation will help to deal with such a problem :O

@ericl
Contributor

ericl commented Jun 9, 2020

@ramondalmau here's how I would tackle that one:

shared reward function

Here, the env can produce a global reward and give it to each agent.

All of them may share the same policy (homogeneous agents), but they communicate through a differentiable protocol before jointly taking the action.

This part should be doable once we add the agent id array and allow lockstep replay.

Nevertheless, those agents entering late / exiting early still want to achieve the most long term shared reward.

IIUC this is the key difficulty: you want to give the final shared reward to exited agents. One way of handling this is to not send done=True for exited agents until the env itself finishes. Then you can have the env send done to all agents, discounting the reward appropriately depending on the time delay between agent exit and env exit.
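A small sketch of the discounting part (illustrative only; the helper and its arguments are made up): exited agents receive done=False and zero reward until the env finishes, and at env exit they receive the final shared reward discounted by how long ago they exited.

def discounted_final_rewards(shared_reward, gamma, env_timestep, exit_times):
    """Reward handed to each exited agent when done=True is finally sent at env exit.

    exit_times maps agent_id -> the timestep at which that agent exited; an agent
    that exited `delay` steps before the env finished gets gamma**delay * shared_reward.
    """
    return {agent: (gamma ** (env_timestep - t_exit)) * shared_reward
            for agent, t_exit in exit_times.items()}

# e.g. discounted_final_rewards(10.0, 0.99, env_timestep=100, exit_times={"agent_3": 80})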

@ramondalmau

ramondalmau commented Jun 11, 2020

Hey @ericl sorry for the late reply :)

I am approaching the problem a little bit differently:

  • Each agent a receives a team reward r at every time step t (the sum of individual rewards), even if the agent has already exited. The reason is that what an agent did before exiting may have an effect on the team return, which is what I want to maximise! If I only consider the rewards that the agent received while it was active, I do not really achieve cooperative behavior.

  • When an agent exits the environment, I no longer consider its actions in the environment's 'step' (I completely ignore it because it has already finished), and I fill its observation with 0s (ensuring, of course, that observing all 0s is never possible while the agent is active).

  • After generating a batch of experiences from the environment, the return (or advantage in my case) of each agent (which is identical for all of them because they share the reward) is computed taking into account all the rewards, from time 0 to the end of the batch, even if the agent exited before.

  • However, in the loss function I mask the samples for which an agent was not active (which I know because the observation was all 0s), meaning that these samples do not contribute to updating the policy.

I only see one problem: imagine a hypothetical case with a batch size of 200 and 10 agents. Each batch then actually contains 200 time steps * 10 agents = 2000 samples. However, some of these samples may be masked and not considered when updating the policy. This means that the effective batch size is not fixed at 2000 but depends on how many agents were active in those 200 steps :)

I do not know if that was clear... but my conclusion is that having agents that exit early, combined with a batch size that refers to time steps rather than actual valid samples, may lead to dangerously dynamic batch sizes.
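For reference, the masking step could look roughly like this (a sketch under the all-zeros-observation convention described above; names are illustrative):

import torch

def masked_policy_loss(per_sample_loss: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
    """Mask out samples from inactive agents before averaging the loss.

    An all-zero observation marks an agent that has already exited, so those
    rows must not contribute to the update.
    per_sample_loss: (batch,)   obs: (batch, obs_dim)
    """
    active = (obs.abs().sum(dim=-1) > 0).float()  # 1.0 for rows of active agents
    # Normalize by the number of *active* samples, not the nominal batch size.
    return (per_sample_loss * active).sum() / active.sum().clamp(min=1.0)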

@ericl
Contributor

ericl commented Jun 11, 2020 via email

@ramondalmau

Yes. This is definitely a good proposal. Many thanks @ericl
Unfortunately, so far I was not able to apply a similar strategy because I was using the with_agent_groups wrapper. Therefore, each sample in the batch was composed of the observations of all agents, similar to the solution in #7341 (comment)

I had a look at your recent replay implementation, and it will help me avoid the with_agent_groups wrapper and solve the above-mentioned issue by computing the discounted reward in the environment.

@janblumenkamp
Contributor Author

Thanks for all the efforts so far Eric, very helpful! Those changes look really good!
I was wondering what exactly stands in the way of having a differentiable communication channel between different policies now. Is it only the observation function or is there something else?

@ericl
Contributor

ericl commented Jun 19, 2020

Hmm, I think the main barrier would be that we would have to re-execute the observation function at training time. Currently, it's run only during rollout and we save the observed output as numpy arrays. This would require some tricky plumbing changes in RLlib. I'm also not sure how batching would work.

@janblumenkamp
Contributor Author

What if we let the information sharing be entirely the model's (and therefore the user's) job? In addition to the agent grouping, we would have an independent model-grouping configuration. Different (groups of) agents may have different policies but share the same model. It is then up to the user to have different policy networks for different agents in the model if that is desired. For each model grouping, there is a separate optimizer. Depending on the model grouping, the policy losses are combined and the models are optimized accordingly.
Essentially this would mean that a policy is separated from the model, so I am not sure whether that is really the best solution, but it would allow vast flexibility in multi-agent settings and make differentiable communication channels very easy to implement directly (and only) inside the model. Or am I missing something?

@ramondalmau

Dear MARL friends

I hope you are doing well
Just a small question: is it possible to use lockstep mode with multi-agent PPO?
Kind regards

Ramon

@ThomasLecat
Contributor

Hi all,

I have been working on MARL with differentiable communication between agents and I just stumbled on this ticket. We came up with exactly the same solution, which consists of implementing a "super agent" (that we call "central agent": CentralPPO, CentralDQN), with all the tricks described so well by @janblumenkamp in his message:

That makes a lot of sense, that is exactly what I needed! I implemented it and it yields slightly better results than the naive implementation, but faster and with less memory required!

...

Just for reference. Please let me know if you have any tips or if this can be done with fewer modifications. I am looking forward to when something like this is possible in the future more easily :)

It works, with a few difficulties at evaluation time like #10228.

I've seen that the replay modes (independent/lockstep) are now integrated into Ray 0.8.7, but I was wondering where we stand regarding the second point raised by @ericl:

The replay mode option by itself isn't enough. In a custom model, you still need to be able to identify all the co-executing agents at a timestep. This can be done if we also pass in ("episode_id, "agent_id", and "timestep") arrays to model.forward() that identify the origin of each batch element. @sven1977 is exploring automatically making these available in the model input_dict.

Generally, how far are we from enabling differentiable communication with built-in agents?
@janblumenkamp, are you still using the "super agent" approach?

Many thanks!

@janblumenkamp
Contributor Author

Hi Thomas, last time I talked to Sven he told me that according to the current schedule, the trajectory view API for multi-agent use cases will be tackled after the Ray summit in October.
We stuck to the super-agent approach and it worked quite well for us, but it involves a few not-so-nice hacks and workarounds, so I am definitely still looking forward to the multi-agent trajectory API.
If you are interested, this is our paper, which is the result of this ticket.

@ThomasLecat
Contributor

ThomasLecat commented Sep 10, 2020

Hi Jan, thanks for the quick reply and congrats on the paper! I will definitely read it and see how it relates to other approaches like Graph Convolutional RL, STMARL, MARL for networked system control, Intention Propagation, etc., which all revolve around the same idea of differentiable communication channels.

Thanks for the info. We are still using a super-agent as well, but I would love to get rid of the hacks and workarounds, as the problem we are trying to solve is very complex (large scale, long horizon, highly cooperative, continuous, dynamic neighbourhoods, agents appearing and disappearing during the episode, potentially formulated with heterogeneous agents, ...) and the hacks to accommodate all of this keep piling up.

Good luck with your research!

@OnTheRicky

What is currently the best way to go about sharing states or any other information between agents with a dynamic number of agents?

I define a maximum number of active agents in the environment, but during an episode agents can finish and some time later a new agent will start. So the total number of agents keeps increasing; however, there will never be more active agents than allowed.

@OnTheRicky

I've built a custom centralised-critic RNN model that receives two inputs: one to predict the next action, which only contains agent-specific observations, and one to compute the value, which contains the observations of all agents.

Still using rllib version 0.8.4

# Imports assumed for Ray/RLlib 0.8.x (module paths may differ in later versions):
import tensorflow as tf
from ray.rllib.models.tf.recurrent_tf_modelv2 import RecurrentTFModelV2
from ray.rllib.policy.rnn_sequencing import add_time_dimension
from ray.rllib.utils.annotations import override


class CentralisedCriticModel(RecurrentTFModelV2):
    """Multi-agent model that implements a centralised value function."""

    def __init__(self,
                 obs_space,
                 action_space,
                 num_outputs,
                 model_config,
                 name,
                 hidden_size=256,
                 cell_size=64):
        super(CentralisedCriticModel, self).__init__(obs_space,
                                                     action_space,
                                                     num_outputs,
                                                     model_config,
                                                     name)
        self.cell_size = cell_size

        # Define input layers
        action_input_layer = tf.keras.layers.Input(
            shape=(None,18), name="action_inputs")
        value_input_layer = tf.keras.layers.Input(
            shape=(None,obs_space.shape[0]), name="value_inputs")
        state_in_h = tf.keras.layers.Input(shape=(cell_size, ), name="h")
        state_in_c = tf.keras.layers.Input(shape=(cell_size, ), name="c")
        seq_in = tf.keras.layers.Input(shape=(), name="seq_in", dtype=tf.int32)

        # Preprocess observation with a hidden layer and send to LSTM cell
        dense_action = tf.keras.layers.Dense(
            hidden_size, activation=tf.nn.relu, name="dense_action")(action_input_layer)
        dense_value = tf.keras.layers.Dense(
            hidden_size, activation=tf.nn.relu, name="dense_value")(value_input_layer)

        action_lstm_out, state_h, state_c = tf.keras.layers.LSTM(
            cell_size, return_sequences=True, return_state=True, name="action_lstm")(
                inputs=dense_action,
                mask=tf.sequence_mask(seq_in),
                initial_state=[state_in_h, state_in_c])

        value_lstm_out, state_h, state_c = tf.keras.layers.LSTM(
            cell_size, return_sequences=True, return_state=True, name="value_lstm")(
                inputs=dense_value,
                mask=tf.sequence_mask(seq_in),
                initial_state=[state_in_h, state_in_c])

        # Postprocess LSTM output with another hidden layer and compute values
        action_logits = tf.keras.layers.Dense(
            self.num_outputs,
            activation=tf.keras.activations.linear,
            name="action_logits")(action_lstm_out)

        values = tf.keras.layers.Dense(
            1, activation=None, name="values")(value_lstm_out)

        # Create the RNN model
        self.rnn_model = tf.keras.Model(
            inputs=[action_input_layer, value_input_layer, seq_in, state_in_h, state_in_c],
            outputs=[action_logits, values, state_h, state_c])
        self.register_variables(self.rnn_model.variables)
        self.rnn_model.summary()

I also modified forward() and forward_rnn()

    def forward(self, input_dict, state, seq_lens):
        """Adds time dimension to batch before sending inputs to forward_rnn().
        You should implement forward_rnn() in your subclass."""
        assert seq_lens is not None

        action_padded_inputs = tf.concat(
            [input_dict["obs"]["capacity"], input_dict["obs"]["sensors"]], axis=1)
        value_padded_inputs = input_dict["obs_flat"]
        output, new_state = self.forward_rnn(
            add_time_dimension(action_padded_inputs, seq_lens=seq_lens),
            add_time_dimension(value_padded_inputs, seq_lens=seq_lens),
            state,
            seq_lens)
        return tf.reshape(output, [-1, self.num_outputs]), new_state

    @override(RecurrentTFModelV2)
    def forward_rnn(self, action_input_dict, value_input_dict, state, seq_lens):
        model_out, self._value_out, h, c = self.rnn_model(
            [action_input_dict, value_input_dict, seq_lens] + state)
        return model_out, [h, c]

My complete observation space looks as follows:

self.observation_space = spaces.Dict({"sensors":spaces.Box(low=-0.1, high=1.1, shape=(len(self.sensors),),dtype=np.float64),
                                      "capacity":spaces.Box(low=-(n_agents*failure_t)+capacity,high=capacity,shape=(1,),dtype=np.float64),
                                      "other_sensors":spaces.Box(low=-0.1, high=1.1, shape=(3,len(self.sensors),),dtype=np.float64)})

In my environment, at the end of step(), I then fill other_sensors for each agent using the observations of all other agents, or fill it with zeros if agents are missing.

It trains extremely slowly, and I can't tell whether that's because I've essentially created two LSTM models.

I'm also not sure whether I should populate other_sensors in my observations at the end of step() or whether I should use the observation_fn API.

@janblumenkamp
Contributor Author

This issue can hopefully be closed once #10884 is done :)

@wullli

wullli commented Jan 7, 2021

Do you guys have a rough idea when #10884 will be finished? I am currently using @janblumenkamp 's awesome workarounds. However, I don't want to build things twice if I can help it. Thanks!

@janblumenkamp
Contributor Author

janblumenkamp commented Jan 30, 2021

To add to this, as another working example, this is my project/repository that resulted from this thread.

As a working minimal example with a more recent Ray version, I have created this repository. It's a toy problem that serves as a reference implementation for the changes that are due to be done in RLlib. I talked to Sven recently and the plan is to hopefully get this done over the next few weeks :)

EDIT: Just an update regarding my minimal example: it now supports both continuous and discrete action spaces, and I have cleaned up the trainer implementation quite a bit; it should be much clearer now. Let me know if you have any questions.

@ericl ericl added this to the RLlib Bugs milestone Mar 11, 2021
@ericl ericl removed the rllib label Mar 11, 2021
@Rohanjames1997

Hi @ericl @janblumenkamp. This whole thread was very helpful, thanks for the detailed explanations from both of you!

I am currently in the process of migrating a project to the RLlib framework, and I had some doubts about some of the points in your discussion.
Here's some context before I begin:

  • Just like @janblumenkamp, I too have a single policy for n agents, and I want them to share information at every time step. I'm currently using a GNN where each node is an agent, and message passing rounds occur during forward passes.
  • The number of agents (n) varies for each episode (but it's constant within an episode)

My doubts revolve around the Agent grouping mechanism

or is the grouped super-agent literally treated as one big agent with a huge observation and action space

It's the latter, it really is one big super-agent. You could potentially still do an architectural decomposition within the super agent model though (i.e., to emulate certain multi-agent architectures).

  1. If I have a single policy for all my agents, is it necessary to provide a grouping mechanism? I'm guessing it isn't required.
    1.1 In this case, would I be allowed to have a varying number of agents?
  2. For this homogeneous policy case, what exactly is the benefit of subclassing the MultiAgentEnv class? Is it somehow related to the part of the blog where it says this? 👇

First, decomposing the actions and observations of a single monolithic agent into multiple simpler agents not only reduces the dimensionality of agent inputs and outputs, but also effectively increases the amount of training data generated per step of the environment.

  3. If I had more than one policy, would the number of agents per policy be constrained by the super-agent mechanism of concatenating agents? Are there cases where the number of agents per policy can be dynamic?

Thank you so much again! I can't wait to onboard to RLlib!

@janblumenkamp
Contributor Author

Hi @Rohanjames1997!
Have a look at the discussion further down in this thread. You can't use the MultiAgentEnv, and grouping also does not help if you want to run backpropagation through communication. Check out my minimal example:
https://github.com/janblumenkamp/rllib_multi_agent_demo
It involves many ugly hacks (most notably, formulating the multi-agent env as one standard gym super-observation and super-action space that contains the observations and actions for a fixed number of agents - in your case, maybe you can just mask out the agents you don't need - and also passing rewards for each agent through the info dict to the trainer). I will update it to Ray 1.3.0 soon!
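For reference, the "rewards through the info dict" part of that workaround can be sketched like this (illustrative only; the helper name is made up):

import numpy as np

def package_step_result(per_agent_obs, per_agent_rewards, done):
    """Pack a multi-agent step into the single-agent Gym API used by the
    super-agent workaround: one stacked observation, one scalar reward, and
    the per-agent rewards carried along in the info dict so a custom trainer
    can still perform one policy update per agent."""
    super_obs = np.stack(per_agent_obs)               # (n_agents, obs_dim)
    scalar_reward = float(np.sum(per_agent_rewards))  # what RLlib sees directly
    info = {"rewards": list(per_agent_rewards)}       # what the custom trainer uses
    return super_obs, scalar_reward, done, info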

@Rohanjames1997

Hi @janblumenkamp !
Thank you so much for the link! I had missed that part of the discussion. I shall probably implement something very similar.

Assuming I had no inter-agent communication, could you answer my previous questions?

And an additional one:
Since #10884 is still in progress, is it right to say that RLlib's MultiAgentEnv class currently does not support graph neural networks (since GNNs involve communication by default)?

Thanks again! And congratulations on the paper! It was a great read! 😄

@richardliaw richardliaw added the rllib RLlib related issues label Oct 5, 2021
@avnishn avnishn closed this as completed Apr 8, 2023