
[rllib] Grouping of agents during parameter sharing (same policy) #7422

Closed
lennardsnoeks opened this issue Mar 3, 2020 · 8 comments
Labels
question Just a question :)


@lennardsnoeks

What is your question?

My goal is to create a crowd simulation with multiple agents, where each of the agents is driven by the same policy (the agents are homogeneous). This can be achieved with parameter sharing, which is easy with the multi-agent API in the following way:

single_env = SingleAgentEnv(env_config)
obs_space = single_env.get_observation_space()
action_space = single_env.get_action_space()
config["multiagent"] = {
    "policies": {
        "policy_0": (None, obs_space, action_space, {"gamma": 0.95})
    },
    "policy_mapping_fn": lambda agent_id: "policy_0"
}

However, each agent needs to know the positions of the other agents. Each agent acts through a SingleAgentEnv (single-agent simulations work fine), which updates that agent's own position, and the positions of the other agents are distributed to the agents through the MultiAgentEnvironment (I don't really know whether this is a correct approach; see the code for the current MultiAgentEnvironment below).

import copy

from ray.rllib.env import MultiAgentEnv


class MultiAgentEnvironment(MultiAgentEnv):

    def __init__(self, env_config):
        # Shared simulation state (agent positions, obstacles); it is passed to
        # every SingleAgentEnv by reference so each agent can see the others.
        self.shared_state: SimulationState = env_config["sim_state"]
        self.env_config = env_config
        self.dones = set()

        self.load_agents()

    def load_agents(self):
        # Keep a pristine copy of the shared state so reset() can restore it.
        self.original_shared_state = copy.deepcopy(self.shared_state)

        self.agents = []
        for i, agent in enumerate(self.shared_state.agents):
            self.env_config["agent_id"] = i
            self.agents.append(SingleAgentEnv(self.env_config))

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}

        # All agents present in action_dict are stepped within the same env
        # step, i.e. they act simultaneously.
        for i, action in action_dict.items():
            obs[i], rew[i], done[i], info[i] = self.agents[i].step(action)
            if done[i]:
                self.dones.add(i)

        # End the episode as soon as any single agent is done.
        done["__all__"] = len(self.dones) > 0

        return obs, rew, done, info

    def reset(self):
        self.resetted = True
        self.dones = set()
        self.shared_state = copy.deepcopy(self.original_shared_state)

        for agent in self.agents:
            agent.load_params(self.shared_state)

        return {i: a.reset() for i, a in enumerate(self.agents)}

The problem is that these agents need to act simultaneously, so would grouping the agents be a good idea here (similar to #7341)? Also, how would such a grouping be applied, and would it alter the observation/action space of the policy (which is currently the same as that of the single-agent environment)?

I plan to use curriculum learning to gradually increase the number of agents, so I'm not sure if this approach will interfere with that.
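
(For reference, the curriculum part is usually driven from an on_train_result callback that pushes a new task into every env copy, using the callbacks-as-dict config style RLlib had around that version. A rough sketch only: set_num_agents/num_agents are hypothetical env members the env would have to provide, and the reward threshold is arbitrary.)

# Sketch only: grow the crowd once the mean reward passes a threshold.
def on_train_result(info):
    result = info["result"]
    trainer = info["trainer"]
    if result["episode_reward_mean"] > -50:  # arbitrary threshold
        trainer.workers.foreach_worker(
            lambda ev: ev.foreach_env(
                lambda env: env.set_num_agents(env.num_agents + 1)))  # hypothetical env API

config["callbacks"] = {"on_train_result": on_train_result}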

@lennardsnoeks added the question label on Mar 3, 2020
@ericl
Contributor

ericl commented Mar 3, 2020

positions of other agents are distributed to these agents through the MultiAgentEnvironment

I didn't see the code for this in your example above; do you mean appending some shared observation to each of the individual agent observations (i.e., the obs space for each independent agent becomes Tuple([own_obs, other_agent_positions]))?
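
(A minimal sketch of what such a Tuple space could look like with gym spaces; the shapes and agent count below are placeholders, not the actual dimensions of the crowd env.)

import numpy as np
from gym.spaces import Box, Tuple

num_other_agents = 3  # placeholder

# Each agent observes its own state plus the (x, y) positions of the others.
own_obs = Box(low=-np.inf, high=np.inf, shape=(4,))
other_agent_positions = Box(low=-np.inf, high=np.inf, shape=(num_other_agents, 2))
obs_space = Tuple([own_obs, other_agent_positions])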

need to act simultaneously

This should already be the case for the above example. The only case where you'd want grouping is if you wanted to implement a policy with centralized execution (or centralized training, decentralized execution).
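
(For completeness: if grouping were ever needed, a MultiAgentEnv can be wrapped with with_agent_groups, which makes the grouped agents step as one unit with Tuple observation/action spaces. A sketch reusing obs_space/action_space from the config snippet in the first post, with two agents and a made-up group name:)

from gym.spaces import Tuple

# Sketch only: group agents 0 and 1 into a single "crowd" unit. The wrapped
# env then emits one Tuple observation and expects one Tuple action per group.
grouped_env = MultiAgentEnvironment(env_config).with_agent_groups(
    groups={"crowd": [0, 1]},
    obs_space=Tuple([obs_space, obs_space]),
    act_space=Tuple([action_space, action_space]))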

@lennardsnoeks
Author

Yeah, I should have clarified that better: the shared_state variable contains all the agents' positions (and obstacles) and is passed to all the single-agent environments by reference. That is how each agent knows the positions of the others. I don't know if this is the best option, but I didn't really know how else to do it. When the environment is reset, the original shared_state is redistributed across the agents.

So in my case I should be fine with this setup and training it with the DDPG trainer?
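
(Roughly what that wiring could look like, as a sketch: the env name "crowd_ma" is just illustrative, and config is the multiagent config from the first post.)

import ray
from ray.rllib.agents.ddpg import DDPGTrainer
from ray.tune.registry import register_env

ray.init()

# Register the multi-agent env under an illustrative name.
register_env("crowd_ma", lambda cfg: MultiAgentEnvironment(cfg))

config["env"] = "crowd_ma"
config["env_config"] = env_config

trainer = DDPGTrainer(config=config)
for _ in range(100):
    print(trainer.train()["episode_reward_mean"])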

Also, I noticed that when using the multi-agent environment with one agent vs. just one single-agent environment (with the rest of the config being the same and using the same environment setup, obstacles and such), the mean episode reward of the multi-agent setup remained very low (far in the negatives) for almost the entire training run, while it converged quickly for the single-agent environment. Is this expected behavior? I can provide some TensorBoard graphs in a bit.

@ericl
Contributor

ericl commented Mar 3, 2020

Yeah the setup looks good then.

mean episode reward

So the episode reward will be the sum of all agent rewards within the env. To see the per-agent rewards, you can look at "policy_reward_mean". Not sure if this accounts for what you are seeing.
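
(Both metrics are available in the result dict returned by trainer.train(), e.g.:)

result = trainer.train()
print(result["episode_reward_mean"])             # summed over all agents per episode
print(result["policy_reward_mean"]["policy_0"])  # mean reward credited to the shared policy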

@lennardsnoeks
Author

The policy_reward_mean would be the same as the episode_reward_mean, because the starting point of the multi-agent environment only uses one agent.

If I use the multi-agent environment with one agent (again as the starting point, because I will apply curriculum learning), the results are not as good as with the single-agent environment, even though all the config/parameter settings are the same (as seen below). Shouldn't the behavior be more or less the same in these two cases? Or am I overlooking something?

[TensorBoard graph comparing the two runs]

@ericl
Contributor

ericl commented Mar 4, 2020 via email

@lennardsnoeks
Author

lennardsnoeks commented Mar 5, 2020

Does episode_reward_mean only look at episodes completed in the current iteration, or at episodes from previous iterations as well? If I only take the episodes of the current iteration, the mean values are not the same. However, there is no difference between the rewards in the single-agent environment and the multi-agent one, so I don't think the problem lies there (the multi-agent env wraps the single-agent env, and the rewards in both step functions were identical when using one agent in the multi-agent setup).

Sometimes an episode only finishes after 30k steps, and that is the only episode completed in the whole learning run (50k steps), as seen here.
[TensorBoard graph of episode completion]
In the single-agent environment, it takes at most 2 iterations of 1000 steps to finish the first episode, while the multi-agent one struggles to finish an episode and improve afterwards. Could it be due to exploration not being applied in the same way through the multi-agent env?

Another thing I noticed is that when the number of workers is increased, convergence also appears later in the learning process, as seen below (in the single-agent environment with 0, 3, 5 and 7 workers).
[TensorBoard comparison of runs with 0, 3, 5 and 7 workers]
Is this expected behavior when introducing extra workers?
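
(That kind of worker comparison can also be run as one Tune sweep — a sketch, assuming the config and registered env from earlier:)

from ray import tune

tune.run(
    "DDPG",
    config=dict(config, num_workers=tune.grid_search([0, 3, 5, 7])),
    stop={"timesteps_total": 50000})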

@ericl
Contributor

ericl commented Mar 5, 2020 via email

@lennardsnoeks
Author

Any idea about the difference between the single- and multi-agent environments, though? Could it be related to exploration?
