[rllib] Grouping of agents during parameter sharing (same policy) #7422
Comments
I didn't see the code for this in your example above. Do you mean to append some shared observation to each of the individual agent observations? (i.e., the obs space for each independent agent becomes Tuple([own_obs, other_agent_positions]))
This should already be the case for the above example. The only case where you'd want grouping is if you wanted to implement a policy with centralized execution (or centralized training, decentralized execution).
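For concreteness, a minimal sketch of what such a per-agent observation space could look like; the shapes, bounds, and NUM_OTHER_AGENTS below are illustrative, not taken from the issue:

```python
import numpy as np
from gym.spaces import Box, Tuple

NUM_OTHER_AGENTS = 3  # illustrative; the real count comes from the env config

# Each agent's own observation (e.g. position + velocity); values made up.
own_obs_space = Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)

# 2-D positions of the other agents, appended as a shared observation.
other_positions_space = Box(
    low=-1.0, high=1.0, shape=(NUM_OTHER_AGENTS, 2), dtype=np.float32)

# Combined per-agent observation space, as suggested above.
obs_space = Tuple([own_obs_space, other_positions_space])
```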
Yeah, I should have clarified that better: the shared_state variable contains all the agents' positions (and the obstacles) and is passed to every single-agent environment by reference. That is how each agent knows the positions of the others. I don't know if this is the best option, but I didn't really know how else to do it. When the environment is reset, the original shared_state is redistributed across the agents. So in my case I should be fine with this setup and with training it using the DDPG trainer?

Also, I noticed that when using the multi-agent environment with one agent versus just one single-agent environment (with the rest of the config being the same and using the same environment setup, obstacles and so on), the mean episode reward of the multi-agent setup remained very low (far in the negatives) for almost the entire training run, while it converged quickly for the single-agent environment. Is this expected behavior? I can provide some tensorboard graphs in a bit.
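A rough sketch of that shared-state-by-reference idea, with hypothetical class names (SingleAgentSim, CrowdMultiAgentEnv) and with observation/action spaces and reward logic omitted; this is not the author's actual code:

```python
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class SingleAgentSim:
    """Hypothetical stand-in for the per-agent SingleAgentEnvironment."""

    def __init__(self, agent_id, shared_state):
        self.agent_id = agent_id
        self.shared_state = shared_state  # same dict object for every agent

    def reset(self):
        self.shared_state["positions"][self.agent_id] = np.zeros(2, dtype=np.float32)
        return self._obs()

    def step(self, action):
        # Move, then write the new position back into the shared dict so the
        # other agents can observe it on their next step.
        self.shared_state["positions"][self.agent_id] += np.asarray(action, dtype=np.float32)
        return self._obs(), 0.0, False, {}

    def _obs(self):
        # The real env would also encode the other agents' positions and the
        # obstacles read from shared_state; own position only, for brevity.
        return self.shared_state["positions"][self.agent_id].copy()


class CrowdMultiAgentEnv(MultiAgentEnv):
    """Sketch: every sub-env holds a reference to the same shared_state dict."""

    def __init__(self, config):
        self.shared_state = {"positions": {}, "obstacles": config.get("obstacles", [])}
        self.agents = {
            i: SingleAgentSim(i, self.shared_state)
            for i in range(config.get("num_agents", 1))
        }

    def reset(self):
        # Clear the positions in place so all sub-envs keep pointing at the
        # same dict object after the reset ("redistributed across the agents").
        self.shared_state["positions"].clear()
        return {i: agent.reset() for i, agent in self.agents.items()}

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        for i, action in action_dict.items():
            obs[i], rew[i], done[i], info[i] = self.agents[i].step(action)
        done["__all__"] = all(done.values())
        return obs, rew, done, info
```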
Yeah, the setup looks good then.
So the episode reward will be the sum of all agent rewards within the env. To see the per-agent rewards, you can look at "policy_reward_mean". Not sure if this accounts for what you are seeing.
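For reference, this is roughly where those two metrics show up in the result dict returned by a training step; the trainer object itself is assumed to be already configured elsewhere:

```python
# `trainer` is assumed to be an already-built RLlib trainer (e.g. DDPG with
# the multi-agent config from this issue); only the result keys are RLlib's.
result = trainer.train()

# Sum of all agents' rewards per episode, averaged over recent episodes.
print(result["episode_reward_mean"])

# Per-policy mean reward, e.g. {"shared_policy": ...} in a shared-policy setup.
print(result["policy_reward_mean"])
```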
The policy_reward_mean would be the same as the episode_reward_mean, because the starting point of the multi-agent environment only uses one agent. If I use a multi-agent environment with one agent (again as the starting point, because I will apply curriculum learning), the results are not as good as when I used the single-agent environment, even though all the config/parameter settings are the same (as seen below). Shouldn't the behavior be more or less the same in these two cases? Or am I overlooking something?
[image: tensorboard]
<https://user-images.githubusercontent.com/17469736/75908370-4a939c00-5e4a-11ea-9dbc-e9439e052144.png>
That's odd. Can you try printing the rewards from the env directly and doing the average manually, to see if it differs from what rllib is reporting?
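One way to do that check, as a rough sketch (a hypothetical wrapper around the multi-agent env, not RLlib code):

```python
import numpy as np


class RewardLogger:
    """Hypothetical wrapper: accumulate raw env rewards and average manually."""

    def __init__(self, env):
        self.env = env
        self.episode_returns = []  # returns of completed episodes
        self._current = 0.0

    def reset(self):
        self._current = 0.0
        return self.env.reset()

    def step(self, action_dict):
        obs, rew, done, info = self.env.step(action_dict)
        self._current += sum(rew.values())  # sum over agents, like RLlib does
        if done.get("__all__", False):
            self.episode_returns.append(self._current)
            print("episode return:", self._current,
                  "manual mean:", float(np.mean(self.episode_returns)))
        return obs, rew, done, info
```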
For rewards, we smooth them over the last 100 episodes (the metrics_smoothing_window config).

When adding more workers, depending on the batch size, yeah, it could hurt sample efficiency. Especially with super long episodes this could be the case, since rewards are delayed and adding workers doesn't help with that.
On Thu, Mar 5, 2020, 11:22 AM Lennard Snoeks wrote:
Does the episode_reward_mean only look at episodes completed in the current iteration, or all the previous episodes in other iterations as well? Because if I only take episodes of the current iteration, the mean values are not the same. However, there is no difference between the rewards in the single-agent environment and the multi-agent one, so I don't think the problem lies there.

Sometimes one episode finishes only after 30k steps, and it is the only time in the learning process that an episode is finished (50k steps), as seen here.

[image: jfkjf]
<https://user-images.githubusercontent.com/17469736/76013747-bd6b4880-5f18-11ea-98f7-afdaf74c09b1.png>

In the single-agent environment, it takes at most 2 iterations of 1000 steps to finish the first episode, while the multi-agent one struggles to finish an episode and improve afterwards. Could it be due to exploration not being applied in the same way through the multi-agent env?

Another thing I noticed is that, when the number of workers is increased, convergence also appears later in the learning process, as seen below (in the single-agent environment with 0, 3, 5, and 7 workers).

[image: compare_workers]
<https://user-images.githubusercontent.com/17469736/76009799-171c4480-5f12-11ea-8fea-e316f018ce16.png>

Is this expected behavior when introducing extra workers?
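For reference, a sketch of the two config knobs discussed above; the key names follow RLlib's common config of that era and may differ in other versions:

```python
config = {
    # episode_reward_mean is smoothed over this many recent episodes,
    # not just the episodes of the current iteration.
    "metrics_smoothing_window": 100,
    # More rollout workers collect samples in parallel, but with very long
    # episodes and delayed rewards this alone does not speed up learning.
    "num_workers": 3,
    # Everything else (batch sizes, learning rate, exploration) is assumed
    # to be kept identical between the single- and multi-agent runs.
}
```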
Any idea about the difference between the single- and multi-agent environments though? Could it be related to exploration?
What is your question?
My goal is to create a crowd simulation with multiple agents, where each of the agents is driven by the same policy (because the agents are homogeneous). This can be achieved with variable sharing, which is easy with the multi-agent framework, in the following way.
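The code snippet that originally followed here is not preserved in this copy of the issue; the sketch below shows the usual shared-policy setup it presumably refers to, with illustrative spaces and the hypothetical CrowdMultiAgentEnv class from the sketch earlier in the thread:

```python
import ray
from gym.spaces import Box
from ray.rllib.agents.ddpg import DDPGTrainer

ray.init()

# Illustrative spaces; the real ones come from the crowd environment.
obs_space = Box(-1.0, 1.0, shape=(8,))
act_space = Box(-1.0, 1.0, shape=(2,))

config = {
    "env_config": {"num_agents": 4},
    "multiagent": {
        # A single policy entry ...
        "policies": {
            "shared_policy": (None, obs_space, act_space, {}),
        },
        # ... that every agent id maps to, i.e. all agents share parameters.
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    },
}

# CrowdMultiAgentEnv is the multi-agent env sketched earlier in the thread.
trainer = DDPGTrainer(env=CrowdMultiAgentEnv, config=config)
```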
However, the agents need to know the positions of the other agents. Each agent acts through a SingleAgentEnvironment (single-agent simulations work fine), which updates the position of that agent, and the positions of the other agents are distributed to the agents through the MultiAgentEnvironment (I don't really know if this is a correct approach; see the code for the current MultiAgentEnvironment).
The problem is that these agents need to act simultaneously, so would grouping of the agents be a good idea here (similar to #7341)? Also, how would a grouping like this be applied, and would it alter the observation/action space of the policy (which is currently the same as in the single-agent environment)?
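If grouping were wanted, RLlib exposes MultiAgentEnv.with_agent_groups for this (as in the QMIX two-step-game example); a sketch with illustrative names, reusing the hypothetical env and spaces from the earlier snippets:

```python
from gym.spaces import Box, Tuple as TupleSpace
from ray.tune.registry import register_env

NUM_AGENTS = 4  # illustrative
obs_space = Box(-1.0, 1.0, shape=(8,))
act_space = Box(-1.0, 1.0, shape=(2,))

# Put all agents into a single group so they are stepped together. The group
# then looks like one "agent" with Tuple observation/action spaces over its
# members, which is what a centralized policy would consume.
grouping = {"group_1": list(range(NUM_AGENTS))}

register_env(
    "grouped_crowd",
    lambda cfg: CrowdMultiAgentEnv(cfg).with_agent_groups(
        grouping,
        obs_space=TupleSpace([obs_space] * NUM_AGENTS),
        act_space=TupleSpace([act_space] * NUM_AGENTS),
    ),
)
```

So grouping would change the spaces the policy sees (they become Tuples over the group members), which is why, per the comment above, it is mainly needed for centralized execution rather than for plain parameter sharing.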
I plan to use curriculum learning to gradually increase the number of agents, so I'm not sure if this approach will interfere with that.