
How to support multi-agent reinforcement learning #121

Open
3 of 8 tasks
youkaichao opened this issue Jul 9, 2020 · 34 comments · Fixed by #122 or #207
Labels: discussion (Discussion of a typical issue), good first issue (Good for newcomers), MARL (Temporary label to group all things MARL)

Comments

@youkaichao
Collaborator

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website, and in particular read the known issues
  • I have searched through the issue tracker and issue categories for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, torch, sys
    print(tianshou.__version__, torch.__version__, sys.version, sys.platform)
@youkaichao
Collaborator Author

This issue can be used to track the design of the multi-agent reinforcement learning implementation.

@youkaichao
Collaborator Author

youkaichao commented Jul 9, 2020

After some pilot study, I find that there are three paradigms of multi-agent reinforcement learning:

  1. simultaneous move: at each timestep, all the agents take their actions (example: MOBA games)

  2. cyclic move: players take actions in turn (example: the game of Go)

  3. conditional move: at each timestep, the environment conditionally selects an agent to take an action (example: the Pig game)

The problem is how to transform these paradigms into the following standard RL procedure:

action = policy(state)
next_state, reward = env.step(action)

For simultaneous moves, the solution is simple: we can just add a num_agent dimension to state, action, and reward. Nothing else needs to change.
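
As a toy illustration of this convention (the dummy policy and env_step below are placeholders, not Tianshou code), everything simply carries a leading num_agents dimension:

import numpy as np

num_agents, obs_dim, act_dim = 3, 4, 2

def policy(state):
    """Map a (num_agents, obs_dim) state to a (num_agents, act_dim) action."""
    return np.zeros((state.shape[0], act_dim))

def env_step(action):
    """Return next_state of shape (num_agents, obs_dim) and reward of shape (num_agents,)."""
    return np.zeros((num_agents, obs_dim)), np.zeros(num_agents)

state = np.zeros((num_agents, obs_dim))   # one observation per agent
action = policy(state)                    # one action per agent
next_state, reward = env_step(action)     # one reward per agent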

For 2 & 3 (cyclic move & conditional move), an elegant solution is:

action = policy(state, agent_id)
next_state, next_agent_id, reward = env.step(action)

By constructing a new state state_ma = {'state': state, 'agent_id': agent_id}, we can essentially go back to the standard case:

action = policy(state_ma)
next_state_ma, reward = env.step(action)

Just be careful that reward here can contain the rewards for all players (the action of one player may affect all players).

Usually, the set of legal actions varies with the state, so it is more convenient to use state_ma = {'state': state, 'legal_actions': legal_actions, 'agent_id': agent_id}.
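
As a rough sketch (not Tianshou code; the toy board game below is made up for illustration), a turn-based environment exposing this state_ma interface could look like:

import numpy as np

class TurnBasedEnvSketch:
    """Toy turn-based environment following the state_ma convention above."""

    def __init__(self, num_agents=2, num_actions=9):
        self.num_agents = num_agents
        self.num_actions = num_actions

    def reset(self):
        self.board = np.zeros(self.num_actions, dtype=int)
        self.current_agent = 0
        return self._obs()

    def step(self, action):
        self.board[action] = self.current_agent + 1
        # reward may contain entries for *all* players: one player's move can affect everyone
        reward = np.zeros(self.num_agents)
        done = not (self.board == 0).any()
        self.current_agent = (self.current_agent + 1) % self.num_agents
        return self._obs(), reward, done, {}

    def _obs(self):
        return {
            'state': self.board.copy(),
            'legal_actions': np.flatnonzero(self.board == 0),
            'agent_id': self.current_agent,
        }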

@duburcqa
Collaborator

duburcqa commented Jul 9, 2020

I don't think there is a need for multi-agent reinforcement learning in the short term. To me, the priority is to improve the current functionality. There are major flaws in the current implementation, especially regarding the efficiency of distributed sampling, which are far more critical to handle than adding new features. The core has to be robust before building on top of it.

Yet, of course, it is still interesting to discuss the implementation of future features!

@Trinkle23897
Collaborator

I don't think there is a need for multi-agent reinforcement learning in the short term. To me, the priority is to improve the current functionality. There are major flaws in the current implementation, especially regarding the efficiency of distributed sampling, which are far more critical to handle than adding new features. The core has to be robust before building on top of it.

Yet, of course, it is still interesting to discuss the implementation of future features!

This feature does not change any of the current code and is also compatible with the 2d-buffer. I think it is independent of what you said.

@Trinkle23897 Trinkle23897 linked a pull request Jul 9, 2020 that will close this issue
@Trinkle23897 Trinkle23897 added this to In progress in v1.0 roadmap Jul 9, 2020
@duburcqa
Collaborator

duburcqa commented Jul 9, 2020

I think it is independent of what you said.

Of course it is! The workforce being limited, someone working on a specific feature necessarily affects / slows down the development of the others. That's the point of prioritizing the development of some features with respect to others: it is often not because of interdependencies, but rather because of the limited workforce.

But obviously, in the setup of an open-source project, anyone is free to work on the features they want to.

@Trinkle23897 Trinkle23897 added the enhancement Feature that is not a new algorithm or an algorithm enhancement label Jul 9, 2020
@Trinkle23897 Trinkle23897 added this to TODO in Issue/PR Categories via automation Jul 9, 2020
@Trinkle23897 Trinkle23897 moved this from TODO to Algorithms in Issue/PR Categories Jul 9, 2020
@youkaichao
Collaborator Author

I gave an example of playing Tic Tac Toe in the test case, without modifying the core code.

The next step seems unclear. I do not know how people in MARL typically train their models, especially how they sample experience. Some possible ideas:

  1. each player learns only from its own experience
  2. each player learns from the experience of all players

Which one is commonly used? Or both? Or are there other paradigms? This should be clarified by experts in the MARL area.

@youkaichao
Collaborator Author

Issue #136 is a great discussion on multi-agent RL with simultaneous moves. The conclusion is that it can be handled without modifying the core of Tianshou: just inherit from some class and re-implement one function with a few lines, depending on the specific scenario.

@Trinkle23897 Trinkle23897 pinned this issue Jul 16, 2020
@Trinkle23897 Trinkle23897 added the good first issue Good for newcomers label Jul 16, 2020
@p-veloso

p-veloso commented Jul 16, 2020

I am not an expert in MARL and I have just discovered Tianshou. With that said, here are some thoughts based on the papers I have been reading recently. There are many workflows for MARL with different training (centralized/decentralized), execution (centralized/decentralized), types of agents (homogeneous/heterogeneous), task settings (cooperative/competitive/mixed), types of rewards (individual/global), etc. Here are some common MARL types:

  1. independent policies (- -)
  2. shared individual policy (parameter sharing, see Working with agent dimension in multi-agent workflows based on single policy (parameter sharing) #136) (+ +)
  3. centralized training and decentralized execution (+ -)
  4. centralized training and execution (- +)

The problem with 1 is that it tends not to converge and might require managing different networks in training and execution (heterogeneous agents). Type 2 and Type 3 are strong trends in MARL. Type 2 can use the algorithms already implemented in Tianshou with small changes in the policy or collector. Type 3 would require implementing specific algorithms (e.g., QMIX, MADDPG, or COMA), as in RLlib. I think type 4 can be implemented using Tianshou with no changes, but in practice it is hard to scale, as the joint action space grows exponentially with the number of agents.

Therefore, type 2 might be a good start. If there is a large interest in making Tianshou a MARL library, maybe it is worth developing 3.

@youkaichao
Collaborator Author

youkaichao commented Jul 17, 2020

There are many workflows for MARL with different training (centralized/decentralized), execution (centralized/decentralized), types of agents (homogeneous/heterogeneous), task settings (cooperative/competitive/mixed), types of rewards (individual/global), etc.

Thank you for your comment and the overall description of MARL algorithms. It seems MARL has many variants and it is difficult to support them all at once. We will have to support MARL step by step.

@youkaichao
Collaborator Author

Some updates:

  1. to support centralized training and decentralized execution, one can inherit from the tianshou.policy.MultiAgentPolicyManager class and implement the train and eval functions to act differently in the different modes.

  2. allow agents to see the state of other agents during training: wrap the environment to return the state of other agents in info (see the sketch below).
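
For point 2, a minimal wrapper sketch (gym.Wrapper and the per-agent observation layout are assumptions for illustration only):

import gym

class OtherAgentStateWrapper(gym.Wrapper):
    """Expose every agent's state via info during (centralized) training."""

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        # assume obs stacks all agents' states along the first dimension
        info['all_agent_states'] = obs
        return obs, rew, done, info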

@Trinkle23897 Trinkle23897 added discussion Discussion of a typical issue and removed enhancement Feature that is not a new algorithm or an algorithm enhancement labels Jul 28, 2020
@Trinkle23897 Trinkle23897 linked a pull request Sep 8, 2020 that will close this issue
v1.0 roadmap automation moved this from In progress to Done Sep 8, 2020
@Trinkle23897 Trinkle23897 reopened this Sep 8, 2020
v1.0 roadmap automation moved this from Done to TODO Sep 8, 2020
@Trinkle23897 Trinkle23897 moved this from TODO to In progress in v1.0 roadmap Sep 13, 2020
@jkterry1
Contributor

You guys might want to take a look at this: https://github.com/PettingZoo-Team/PettingZoo

@Trinkle23897 Trinkle23897 unpinned this issue Mar 7, 2021
@p-veloso

p-veloso commented Mar 31, 2021

I noticed that the cheat sheet for RL states that Tianshou supports simultaneous moves (case 1):

"for simultaneous move, the solution is simple: we can just add a num_agent dimension to state, action, and reward. Nothing else is going to change".

Question: Is it possible to use the MultiAgentPolicyManager to work with

  • a varied number of agents (each simulation has a different number of agents)
  • parameter sharing (every agent shares the same policy, so the agent data can be merged in the batch for training)
  • simultaneous actions?

Last year (#136) I had to "tweak" the policy and net classes by

  • merging num agents and batch
  • adapting the operations
  • unmerging batch and num agents

...but it would be ideal to use the new class for that.

@Trinkle23897
Collaborator

varied number of agents (each simulation has a different number of agents)

Sorry about that, that is actually beyond the current scope. I haven't come up with a good design choice for this kind of requirement...

@p-veloso

p-veloso commented Apr 1, 2021

Isn't that just a matter of adding an n_agent dimension to the actions and observations, like I did last year, and adapting the operations in the policy accordingly? When you reset the environment, it will set a new number of agents for that episode.

@Trinkle23897
Collaborator

hmm yep, you're right

@Trinkle23897 Trinkle23897 moved this from In progress to TODO in v1.0 roadmap Apr 5, 2021
@p-veloso

p-veloso commented Apr 6, 2021

Assuming that I can fix the number of agents for a certain period of training, does the class MultiAgentPolicyManager support parameter sharing (single policy for multiple agents) and simultaneous actions? If not, are there other classes that can support it?

@Trinkle23897
Collaborator

Can you pass the same reference into MAPM, i.e., MultiAgentPolicyManager([policy_1, policy_1, policy_2])? I think that would be fine to some extent.
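
For example, a hedged sketch of this parameter-sharing setup (using RandomPolicy only as a stand-in; the MultiAgentPolicyManager constructor arguments vary across tianshou versions):

from tianshou.policy import MultiAgentPolicyManager, RandomPolicy

# agents 1 and 2 share the same policy object (and therefore the same
# parameters); agent 3 keeps its own policy
shared = RandomPolicy()
other = RandomPolicy()
manager = MultiAgentPolicyManager([shared, shared, other])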

@p-veloso

p-veloso commented Apr 6, 2021

That might work, but I think it will train the same network with separate batches, right?

        # from MultiAgentPolicyManager.learn: each policy learns only from its own agent's slice of the batch
        results = {}
        for policy in self.policies:
            data = batch[f"agent_{policy.agent_id}"]
            if not data.is_empty():
                out = policy.learn(batch=data, **kwargs)
                for k, v in out.items():
                    results["agent_" + str(policy.agent_id) + "/" + k] = v
        return results

@Trinkle23897
Collaborator

Exactly. That should be a tiny issue, but I think it would be fine for the agents to learn, though it is a little bit inefficient.

@p-veloso

p-veloso commented Apr 6, 2021

Thanks for the quick reply, @Trinkle23897 .
As I mentioned before, last year I changed the code of the policies and networks to manage the extra agent dimension in the batch. The problem with that approach is that I had to change the code for each specific algorithm, so it might be tricky to compare multiple algorithms. Do you think that customizing the MultiAgentPolicyManager would be a more general solution in my case? Or would I still have to deal with specific changes in the policy and network classes?

@Trinkle23897
Collaborator

I had to change the code for the specific algorithm

I don't quite understand. Do you mean you use different algorithms in the same MAPM? The current implementation only accepts a list of on-policy algorithms or a list of off-policy algorithms. There would be no code changes if you use something like [dqn_agent, c51_agent, c51_agent].

@p-veloso

p-veloso commented Apr 6, 2021

No. I am using a single policy.

What I mean is that last year I changed the policy of PPO and the neural network to deal with parameter sharing and simultaneous actions. I changed the code of tianshou to manage the additional dimension in my setting.

  • The environment sends observations, rewards, etc. with the number of agents as the first dimension: (n_agents, ...).
  • This results in a batch with shape (batch_dim, n_agents, ...).
  • The policy and networks use (batch x n_agents, ...) for updates and learning.
  • The resulting actions are then reshaped back to (batch, n_agents, ...) to be returned to the environments.

These changes were made in

  1. PPO:
    compute_episodic_return: change how m is calculated
    learn: change how ratio and u are calculated

  2. NETWORK
    forward: merge the batch and agent dimensions, do the forward pass, unmerge the batch and agent dimensions

My point is that it would be tricky to do that for multiple algorithms. So I am curious whether I could address these changes only by customizing my own multi-agent policy manager (similar to the MultiAgentPolicyManager, but with one policy and with the changes mentioned above done directly in its methods).

@Trinkle23897
Collaborator

Yeah, you can definitely do that. In MAPM the only thing you need to do is to re-organize the Batch into buffer-style data (reshape or flatten so that the first dimension is n_agent * bsz; also pay attention to the done flag), and then it would be the same as single-agent PPO.
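
A sketch of the kind of reshape meant here, in plain NumPy (the field names and shapes are illustrative):

import numpy as np

def flatten_agent_dim(arr):
    """Collapse (bsz, n_agent, ...) into (bsz * n_agent, ...)."""
    return arr.reshape(arr.shape[0] * arr.shape[1], *arr.shape[2:])

bsz, n_agent, obs_dim = 8, 5, 10
obs = np.zeros((bsz, n_agent, obs_dim))
rew = np.zeros((bsz, n_agent))
done = np.zeros(bsz, dtype=bool)

flat_obs = flatten_agent_dim(obs)     # (40, 10)
flat_rew = rew.reshape(-1)            # (40,)
# the done flag has no agent dimension, so it must be repeated per agent
flat_done = np.repeat(done, n_agent)  # (40,)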

@jkterry1
Contributor

Could I offer a thought? You might want to consider using the PettingZoo parallel API (PettingZoo has two APIs). The parallel API isn't significantly different from what I understand you're proposing yours to be for 1.0, and this way you're using a standard instead of a custom one. A bunch of third-party libraries for PettingZoo already exist (e.g., p-veloso above has one), and RLlib and Stable Baselines interface with it, as do several more minor RL libraries.

@Trinkle23897
Collaborator

Could I offer a thought? You might want to consider using the PettingZoo parallel API (PettingZoo has two APIs). The parallel API isn't significantly different from what I understand you're proposing yours to be for 1.0, and this way you're using a standard instead of a custom one. A bunch of third-party libraries for PettingZoo already exist (e.g., p-veloso above has one), and RLlib and Stable Baselines interface with it, as do several more minor RL libraries.

That's pretty cool! I'm wondering if you have any interest in integrating this standard library into tianshou.

@jkterry1
Contributor

Sure we can do that! It'll probably take a few weeks though

@p-veloso

Can you pass the same reference into MAPM, i.e., MultiAgentPolicyManager([policy_1, policy_1, policy_2])? I think that would be fine to some extent.

@Trinkle23897, for now I have given up on my previous approach (manually changing the shape of the batch), because it would require tweaking some of the algorithms that rely on the order of the observations (e.g., GAE, multi-step value prediction, etc.). I am trying the approach that you cited above, but there is a problem:

  • The environment should produce all the agents' observations at once. Therefore, instead of a dict {"agent_id": ..., "obs": ..., "mask": ...} I have a list of n of these dictionaries, which results in a Batch with the wrong format.
  • I created a preprocess_fn to correct that (see below), but this ends up causing errors in other parts of the code that rely on the assumption that the length of the data equals the number of environments or workers ... whereas in reality each environment produces n_agents observations per step (i.e., one for each agent).
  • While I could start customizing the classes (e.g., BaseVectorEnv) to make it work, I feel this is a structural part of the library and it might take a lot of time to solve. Any suggestion for a simpler approach that preserves simultaneous actions?
import numpy as np
from tianshou.data import Batch

def preprocess_fn(**kwargs):
    """Convert the list-of-dicts observations into a dict-of-arrays Batch."""
    # if only obs exists -> reset
    # if obs_next/act/rew/done/policy exist -> normal step
    tag = "obs" if 'rew' not in kwargs else "obs_next"
    agent_idx_array, obs_array, mask_array = [], [], []
    for env_idx in range(len(kwargs[tag])):
        for obs_dict_idx in range(len(kwargs[tag][env_idx])):
            agent_idx_array.append(kwargs[tag][env_idx][obs_dict_idx]["agent_id"])
            obs_array.append(kwargs[tag][env_idx][obs_dict_idx][tag])
            mask_array.append(kwargs[tag][env_idx][obs_dict_idx]["mask"])
    obs_batch = Batch(
        agent_id=np.array(agent_idx_array),
        obs=np.array(obs_array),
        mask=np.array(mask_array),
    )
    new_batch = Batch()
    if 'rew' not in kwargs:
        new_batch.obs = obs_batch
    else:
        new_batch.obs_next = obs_batch
    return new_batch

@Trinkle23897
Collaborator

Trinkle23897 commented Jun 10, 2021

I went through the above discussion again. So if the environment produces all agents' steps simultaneously and only one policy is used, there's no need to use/follow MAPM. Instead, treat this environment as a normal single-agent environment, e.g.,

  • the new obs_space would be [num_agents] + original obs_space.shape,
  • the new action_space would be [num_agents] + original action_space.shape,
  • reward should be a numpy array
  • done is still a bool
  • info should be the same

and then use the normal collector/buffer workflow (a transition in the buffer stores several agents' obs/act/rew/... for that timestep). One thing you need to do is to customize your network to squash the second dimension (num_agent) into the first dimension (batch size). Since the reward is an array instead of a scalar, you should pass reward_metric into the trainer (I'm not sure whether any of the existing code needs to be tuned, like GAE, but n-step supports this feature; see #266 and 3695f12).
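
For instance, a minimal reward_metric sketch (whether to average or sum over agents depends on the task, and the exact call convention depends on the tianshou version):

import numpy as np

def reward_metric(rews):
    """Collapse per-agent rewards of shape (..., num_agents) into one scalar
    per episode/transition, e.g. the mean over agents."""
    return rews.mean(axis=-1)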

Please let me know if there's anything unclear (or maybe I misunderstood some parts lol)

@p-veloso

p-veloso commented Jun 10, 2021

one thing you need to do is to customize your network to squash the second dimension (num_agent) into the first dimension (batch size)

Yes. In a previous version of tianshou I tried to fix that with the following changes:

NETWORK
forward: merge the batch and agent dimensions, do the forward pass, unmerge the batch and agent dimensions

    def forward(self, s, state=None, info={}):
        # s arrives as (batch, num_agents, C, H, W) when the agent dimension
        # is present; merge it into the batch dimension for the forward pass
        shape = s.shape
        s = to_torch(s, device=self.device, dtype=torch.float)
        edit = len(s.shape) == 5
        if edit:
            s = s.view(shape[0] * shape[1], shape[2], shape[3], shape[4]).cuda()  # assumes a GPU
        logits = self.model(s)
        if edit:
            # restore the (batch, num_agents, ...) layout for the caller
            logits = logits.view(shape[0], shape[1], -1)
        return logits, state

But that is not enough, because the single-agent algorithms also assume that shape, so I had to either change the batch or change the algorithms directly, for example:

PPO:

  • compute_episodic_return: change how m is calculated
        ## ORIGINAL
        m = (1. - batch.done) * gamma

        ## ADAPTATION: the env gives a single done flag, so these two lines make sure done has the same shape as rew
        d = np.repeat(batch.done, batch.rew.shape[1]).reshape(batch.rew.shape)
        m = (1. - d) * gamma
  • learn: change how ratio and u are calculated
                ## ORIGINAL
                ratio = (dist.log_prob(b.act) - b.logp_old).exp().float()

                # ADAPTATION: merge the batch and agent dimensions
                s = b.adv.shape
                merged_shape = s[0] * s[1]
                b.adv, b.logp_old, b.returns, b.v = \
                    b.adv.reshape(merged_shape), \
                    b.logp_old.reshape(merged_shape), \
                    b.returns.reshape(merged_shape), \
                    b.v.reshape(merged_shape)
                value = value.reshape(s[0] * s[1])
                d_log_prob = dist.log_prob(b.act)
                d_log_prob = d_log_prob.reshape(s[0] * s[1])
                ratio = (d_log_prob - b.logp_old).exp().float()
                
                ...
                

                # AT THE END: unmerge the batch and agent dimensions
                b.adv, b.logp_old, b.returns, b.v = \
                    b.adv.reshape(s[0], s[1], s[2]), \
                    b.logp_old.reshape(s[0], s[1]),\
                    b.returns.reshape(s[0], s[1], s[2]), \
                    b.v.reshape(s[0], s[1], s[2])

As I mentioned in another post, the problems with this approach are:

  • I would have to change every algorithm
  • When I merge the first dimensions of the batch, I might have problems with parts of the algorithm that assume the merged dimension is a single sequence collected over time and not n sequences that have been "flattened".

That is why I was looking for a more general approach outside of the policies. I tried your idea of repeating the same policy in the policy manager for each agent, which is nice because each policy only deals with the batch of one agent separately. However, that

  • required a custom preprocess_fn to convert the multi-agent observation to the right format (per agent_id).
  • resulted in errors because the increase in the size of the resulting batches breaks many of the assertions, such as:

in BaseVectorEnv:
assert len(self.data) == len(ready_env_ids)
in Collector:
assert len(action) == len(id)

@Trinkle23897
Collaborator

Trinkle23897 commented Jun 11, 2021

  • compute_episodic_return: change how m is calculated
  • learn: change how ratio and u are calculated
  • I would have to change every algorithm

Yep, that's what I was doing previously. But I don't think there's a free lunch that lets a single-agent codebase support multi-agent with little modification and no performance decrease (in wall-clock time). But ... wait

in BaseVectorEnv:
assert len(self.data) == len(ready_env_ids)
in Collector:
assert len(action) == len(id)

Yeah, I thought about this approach last night. Let's say you have 4 envs and each env needs 5 agents, so you need a VectorReplayBuffer(buffer_num=20) (num_agents x num_env). However, currently the high-level modules such as the collector don't know the exact format of the given batch (e.g., how to split a batch with batch_size == 4 into 20 buffer slots).

So this leads to one natural way: construct another vector env that inherits from the existing BaseVectorEnv, where:

  • __len__ should return num_agent x num_env
  • step and reset should unpack the corresponding result ([num_env, num_agent, ...] into [env1_agent1, env1_agent2, ..., env1_agentn, ..., envn_agentn]), and also pack actions: the input action length is num_env x num_agent, so it should be reshaped to [num_env, num_agent, ...] and sent to each environment as [num_agent, ...]
  • info["env_id"] should change accordingly (from num_env to num_agent x num_env)

Therefore you can execute only num_env envs but get num_env x num_agent results at each step, without modifying the agent's code and without using MAPM.
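
A sketch of just the pack/unpack arithmetic behind this idea, in plain NumPy (the wrapper class itself and its exact BaseVectorEnv hooks are omitted):

import numpy as np

num_env, num_agent, obs_dim, act_dim = 4, 5, 3, 2

def unpack_obs(obs):
    """[num_env, num_agent, ...] -> [env1_agent1, ..., env1_agentn, ..., envn_agentn]."""
    return obs.reshape(num_env * num_agent, *obs.shape[2:])

def pack_act(action):
    """Flat [num_env * num_agent, ...] actions back to [num_env, num_agent, ...],
    so each underlying env receives a [num_agent, ...] block."""
    return action.reshape(num_env, num_agent, *action.shape[1:])

obs = np.zeros((num_env, num_agent, obs_dim))
flat_obs = unpack_obs(obs)                           # (20, 3): what the collector sees
flat_act = np.zeros((num_env * num_agent, act_dim))
per_env_act = pack_act(flat_act)                     # (4, 5, 2): what each env receives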

@p-veloso

@Trinkle23897

I spent the last day trying the different approaches. I have just gotten the "policy approach" working for DQN in the latest version of Tianshou, but I would definitely prefer a higher-level modification that can work with all the original policies. I think your suggestion is similar to what SuperSuit does ... but I had no idea how to do that in Tianshou. Thanks for the suggestion.

Just for clarification, according to your current idea, would I still need to change other parts, such as the forward pass of the neural network?

@Trinkle23897
Collaborator

Just for clarification, according to your current idea, would I still need to change other parts, such as the forward pass of the neural network?

None of them, I think.

@p-veloso

It works! Thanks again.

@benblack769

@p-veloso I just saw this, but while the SuperSuit example here: https://github.com/PettingZoo-Team/SuperSuit#parallel-environment-vectorization is for Stable Baselines, all it does is translate the parallel environment into a vector environment. Since tianshou supports vector environments out of the box for all algorithms, you should just be able to use SuperSuit's environment vectorization rather than your own custom code. If for some reason it doesn't work out of the box, feel free to raise an issue with SuperSuit asking for tianshou support.

7 participants