[rllib] Grouping of agents during parameter sharing (same policy) #7422
Comments
I didn't see the code for this in your example above. Do you mean to append some shared observation to each of the individual agent observations? (i.e., the obs space for each independent agent becomes Tuple([own_obs, other_agent_positions]))
This should already be the case for the above example. The only case where you'd want grouping is if you wanted to implement a policy with centralized execution (or centralized training, decentralized execution).
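For concreteness, a minimal sketch of what such a per-agent observation space could look like; the shapes, bounds, and NUM_OTHER_AGENTS below are illustrative, not taken from the issue:

```python
import numpy as np
from gym.spaces import Box, Tuple

NUM_OTHER_AGENTS = 3  # illustrative; the real count comes from the env config

# Each agent's own observation (e.g. position + velocity); values made up.
own_obs_space = Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)

# 2-D positions of the other agents, appended as a shared observation.
other_positions_space = Box(
    low=-1.0, high=1.0, shape=(NUM_OTHER_AGENTS, 2), dtype=np.float32)

# Combined per-agent observation space, as suggested above.
obs_space = Tuple([own_obs_space, other_positions_space])
```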
Yeah, I should have clarified that better: the shared_state variable contains all the agents' positions (and the obstacles) and is passed to every single-agent environment by reference. That is how each agent knows the positions of the others. I don't know if this is the best option, but I didn't really know how else to do it. When the environment is reset, the original shared_state is redistributed across the agents. So in my case I should be fine with this setup and with training it using the DDPG trainer?

Also, I noticed that when using the multi-agent environment with one agent versus just one single-agent environment (with the rest of the config being the same and using the same environment setup, obstacles and so on), the mean episode reward of the multi-agent setup remained very low (far in the negatives) for almost the entire training run, while it converged quickly for the single-agent environment. Is this expected behavior? I can provide some tensorboard graphs in a bit.
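A rough sketch of that shared-state-by-reference idea, with hypothetical class names (SingleAgentSim, CrowdMultiAgentEnv) and with observation/action spaces and reward logic omitted; this is not the author's actual code:

```python
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class SingleAgentSim:
    """Hypothetical stand-in for the per-agent SingleAgentEnvironment."""

    def __init__(self, agent_id, shared_state):
        self.agent_id = agent_id
        self.shared_state = shared_state  # same dict object for every agent

    def reset(self):
        self.shared_state["positions"][self.agent_id] = np.zeros(2, dtype=np.float32)
        return self._obs()

    def step(self, action):
        # Move, then write the new position back into the shared dict so the
        # other agents can observe it on their next step.
        self.shared_state["positions"][self.agent_id] += np.asarray(action, dtype=np.float32)
        return self._obs(), 0.0, False, {}

    def _obs(self):
        # The real env would also encode the other agents' positions and the
        # obstacles read from shared_state; own position only, for brevity.
        return self.shared_state["positions"][self.agent_id].copy()


class CrowdMultiAgentEnv(MultiAgentEnv):
    """Sketch: every sub-env holds a reference to the same shared_state dict."""

    def __init__(self, config):
        self.shared_state = {"positions": {}, "obstacles": config.get("obstacles", [])}
        self.agents = {
            i: SingleAgentSim(i, self.shared_state)
            for i in range(config.get("num_agents", 1))
        }

    def reset(self):
        # Clear the positions in place so all sub-envs keep pointing at the
        # same dict object after the reset ("redistributed across the agents").
        self.shared_state["positions"].clear()
        return {i: agent.reset() for i, agent in self.agents.items()}

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        for i, action in action_dict.items():
            obs[i], rew[i], done[i], info[i] = self.agents[i].step(action)
        done["__all__"] = all(done.values())
        return obs, rew, done, info
```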
Yeah, the setup looks good then.
So the episode reward will be the sum of all agent rewards within the env. To see the per-agent rewards, you can look at "policy_reward_mean". Not sure if this accounts for what you are seeing.
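For reference, this is roughly where those two metrics show up in the result dict returned by a training step; the trainer object itself is assumed to be already configured elsewhere:

```python
# `trainer` is assumed to be an already-built RLlib trainer (e.g. DDPG with
# the multi-agent config from this issue); only the result keys are RLlib's.
result = trainer.train()

# Sum of all agents' rewards per episode, averaged over recent episodes.
print(result["episode_reward_mean"])

# Per-policy mean reward, e.g. {"shared_policy": ...} in a shared-policy setup.
print(result["policy_reward_mean"])
```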
The policy_reward_mean would be the same as the episode_reward_mean, because the starting point of the multi-agent environment only uses one agent. If I use a multi-agent environment with one agent (again as the starting point, because I will apply curriculum learning), the results are not as good as when I used the single-agent environment, even though all the config/parameter settings are the same (as seen below). Shouldn't the behavior be more or less the same in these two cases? Or am I overlooking something?
[image: tensorboard]
<https://user-images.githubusercontent.com/17469736/75908370-4a939c00-5e4a-11ea-9dbc-e9439e052144.png>
That's odd. Can you try printing the rewards from the env directly and doing the average manually, to see if it differs from what rllib is reporting?
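One way to do that check, as a rough sketch (a hypothetical wrapper around the multi-agent env, not RLlib code):

```python
import numpy as np


class RewardLogger:
    """Hypothetical wrapper: accumulate raw env rewards and average manually."""

    def __init__(self, env):
        self.env = env
        self.episode_returns = []  # returns of completed episodes
        self._current = 0.0

    def reset(self):
        self._current = 0.0
        return self.env.reset()

    def step(self, action_dict):
        obs, rew, done, info = self.env.step(action_dict)
        self._current += sum(rew.values())  # sum over agents, like RLlib does
        if done.get("__all__", False):
            self.episode_returns.append(self._current)
            print("episode return:", self._current,
                  "manual mean:", float(np.mean(self.episode_returns)))
        return obs, rew, done, info
```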
For rewards, we smooth them over the last 100 episodes (the metrics_smoothing_window config).

When adding more workers, depending on the batch size, yeah, it could hurt sample efficiency. Especially with super long episodes this could be the case, since rewards are delayed and adding workers doesn't help with that.
On Thu, Mar 5, 2020, 11:22 AM Lennard Snoeks wrote:
Does the episode_reward_mean only look at episodes completed in the current iteration, or all the previous episodes in other iterations as well? Because if I only take episodes of the current iteration, the mean values are not the same. However, there is no difference between the rewards in the single-agent environment and the multi-agent one, so I don't think the problem lies there.

Sometimes one episode finishes only after 30k steps, and it is the only time in the learning process that an episode is finished (50k steps), as seen here.

[image: jfkjf]
<https://user-images.githubusercontent.com/17469736/76013747-bd6b4880-5f18-11ea-98f7-afdaf74c09b1.png>

In the single-agent environment, it takes at most 2 iterations of 1000 steps to finish the first episode, while the multi-agent one struggles to finish an episode and improve afterwards. Could it be due to exploration not being applied in the same way through the multi-agent env?

Another thing I noticed is that, when the number of workers is increased, convergence also appears later in the learning process, as seen below (in the single-agent environment with 0, 3, 5, and 7 workers).

[image: compare_workers]
<https://user-images.githubusercontent.com/17469736/76009799-171c4480-5f12-11ea-8fea-e316f018ce16.png>

Is this expected behavior when introducing extra workers?
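For reference, a sketch of the two config knobs discussed above; the key names follow RLlib's common config of that era and may differ in other versions:

```python
config = {
    # episode_reward_mean is smoothed over this many recent episodes,
    # not just the episodes of the current iteration.
    "metrics_smoothing_window": 100,
    # More rollout workers collect samples in parallel, but with very long
    # episodes and delayed rewards this alone does not speed up learning.
    "num_workers": 3,
    # Everything else (batch sizes, learning rate, exploration) is assumed
    # to be kept identical between the single- and multi-agent runs.
}
```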
Any idea about the difference between the single- and multi-agent environments though? Could it be related to exploration?
What is your question?
My goal is to create a crowd simulation with multiple agents, where each of the agents is driven by the same policy (because the agents are homogeneous). This can be achieved with variable sharing, which is easy with the multi-agent framework, in the following way.
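The code snippet that originally followed here is not preserved in this copy of the issue; the sketch below shows the usual shared-policy setup it presumably refers to, with illustrative spaces and the hypothetical CrowdMultiAgentEnv class from the sketch earlier in the thread:

```python
import ray
from gym.spaces import Box
from ray.rllib.agents.ddpg import DDPGTrainer

ray.init()

# Illustrative spaces; the real ones come from the crowd environment.
obs_space = Box(-1.0, 1.0, shape=(8,))
act_space = Box(-1.0, 1.0, shape=(2,))

config = {
    "env_config": {"num_agents": 4},
    "multiagent": {
        # A single policy entry ...
        "policies": {
            "shared_policy": (None, obs_space, act_space, {}),
        },
        # ... that every agent id maps to, i.e. all agents share parameters.
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    },
}

# CrowdMultiAgentEnv is the multi-agent env sketched earlier in the thread.
trainer = DDPGTrainer(env=CrowdMultiAgentEnv, config=config)
```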
However, the agents need to know the positions of the other agents. Each agent acts through a SingleAgentEnvironment (single-agent simulations work fine), which updates the position of that agent, and the positions of the other agents are distributed to the agents through the MultiAgentEnvironment (I don't really know if this is a correct approach; see the code for the current MultiAgentEnvironment).
The problem is that these agents need to act simultaneously, so would grouping of the agents be a good idea here (similar to #7341)? Also, how would a grouping like this be applied, and would it alter the observation/action space of the policy (which is currently the same as in the single-agent environment)?
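If grouping were wanted, RLlib exposes MultiAgentEnv.with_agent_groups for this (as in the QMIX two-step-game example); a sketch with illustrative names, reusing the hypothetical env and spaces from the earlier snippets:

```python
from gym.spaces import Box, Tuple as TupleSpace
from ray.tune.registry import register_env

NUM_AGENTS = 4  # illustrative
obs_space = Box(-1.0, 1.0, shape=(8,))
act_space = Box(-1.0, 1.0, shape=(2,))

# Put all agents into a single group so they are stepped together. The group
# then looks like one "agent" with Tuple observation/action spaces over its
# members, which is what a centralized policy would consume.
grouping = {"group_1": list(range(NUM_AGENTS))}

register_env(
    "grouped_crowd",
    lambda cfg: CrowdMultiAgentEnv(cfg).with_agent_groups(
        grouping,
        obs_space=TupleSpace([obs_space] * NUM_AGENTS),
        act_space=TupleSpace([act_space] * NUM_AGENTS),
    ),
)
```

So grouping would change the spaces the policy sees (they become Tuples over the group members), which is why, per the comment above, it is mainly needed for centralized execution rather than for plain parameter sharing.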
I plan to use curriculum learning to gradually increase the number of agents, so I'm not sure if this approach will interfere with that.