[RLlib] Rewrite PPO to use training_iteration + enable DD-PPO for Win32. #23673
Conversation
What's the relationship between this and #23686?
Doh, I did not see this open PR. This implements identical capabilities to #23686. @sven1977 it's probably worth double checking against mine to see if there are any important differences. I think the main difference is this PR still warns about bad weight/clipping scales like the original version. And looking at his, I think I actually train on
Oh no! :D Sorry @smorad, I didn't see your PR before I did something similar to this yesterday.
rllib/agents/ppo/ppo.py
Outdated
"vf_share_layers.".format(policy_id, scaled_vf_loss, policy_loss) | ||
) | ||
# Warn about bad clipping configs | ||
mean_reward = rollouts["rewards"][rollouts["agent_index"] == i].mean() |
Great idea just doing this here and in this fashion! True, we don't really need entire episodes for this avg-reward estimate.
should we log 1 in N times or something? won't this flood the training logs?
I think @smorad fixed this with a log_once if-block, which always seems like the best option for these warnings.
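For reference, a minimal sketch of such a log_once-guarded warning (assuming ray.util.debug.log_once; the policy_id and loss values are hypothetical stand-ins for the ones computed in the diff above):

import logging

from ray.util.debug import log_once  # Returns True only the first time a given key is seen.

logger = logging.getLogger(__name__)


def maybe_warn_bad_loss_scales(policy_id: str, scaled_vf_loss: float, policy_loss: float) -> None:
    # Emit the scale warning at most once per policy instead of on every iteration.
    if log_once(f"ppo_bad_loss_scale_{policy_id}"):
        logger.warning(
            f"The magnitude of the scaled vf loss for policy {policy_id} "
            f"({scaled_vf_loss:.3f}) is very different from the policy loss "
            f"({policy_loss:.3f}). This can hurt learning when "
            "vf_share_layers=True. Consider tuning vf_loss_coeff."
        )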
@@ -337,6 +337,35 @@ def __call__(self, samples: SampleBatchType) -> SampleBatchType:
        return samples


def standardize_fields(samples: SampleBatchType, fields: List[str]) -> SampleBatchType:
    """Standardize fields of the given SampleBatch."""
Actually, could we just simplify this even further and directly add the following code to the training_iteration() method:
# Standardize `advantages` values in train_batch.
for policy_id, batch in train_batch.policy_batches.items():
    if Postprocessing.ADVANTAGES not in batch:
        raise KeyError(
            f"`{Postprocessing.ADVANTAGES}` not found in SampleBatch for "
            f"policy `{policy_id}`! Maybe this policy fails to add "
            f"`{Postprocessing.ADVANTAGES}` in its `postprocess_trajectory` "
            f"method? Or this policy is not meant to learn at all and you "
            "forgot to remove it from the list under `config."
            "multiagent.policies_to_train`."
        )
    batch[Postprocessing.ADVANTAGES] = standardized(
        batch[Postprocessing.ADVANTAGES])
This would require two identical for loops, which are quite ugly IMO. This is because we need to train AFTER normalizing, not before.
Ok, fair enough. Fine with me :)
rllib/agents/ppo/ppo_torch_policy.py
Outdated
@@ -152,6 +152,7 @@ def reduce_mean_valid(t):
    mean_vf_loss = reduce_mean_valid(vf_loss_clipped)
# Ignore the value function.
else:
    value_fn_out = 0
Great fix! I guess this is already merged from the other PR.
rllib/agents/ppo/ppo.py
Outdated
rollouts = synchronous_parallel_sample(self.workers)

# Concatenate the SampleBatches from each worker into one large SampleBatch.
rollouts = SampleBatch.concat_samples(rollouts)
We might need some train_batch_size check here (a while loop, just like in rllib/agents/pg/pg.py). Otherwise, the train batch size might be too small after just a single round of parallel RolloutWorker.sample() collection.
Make it into a util that collects exactly train_batch_size samples from the available workers in truncated_episodes mode, and at least train_batch_size samples in complete_episodes mode?
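A minimal sketch of the "at least train_batch_size" variant (import paths assumed; hitting the target exactly in truncated_episodes mode would additionally require passing per-worker step limits, which this sketch does not do):

from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
from ray.rllib.policy.sample_batch import SampleBatch


def sample_min_train_batch(workers, train_batch_size: int) -> SampleBatch:
    # Keep sampling from all workers in parallel until we have collected at
    # least `train_batch_size` env steps, then merge everything into one batch.
    batches = []
    num_env_steps = 0
    while num_env_steps < train_batch_size:
        new_batches = synchronous_parallel_sample(workers)
        batches.extend(new_batches)
        num_env_steps += sum(b.env_steps() for b in new_batches)
    return SampleBatch.concat_samples(batches)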
Looks really great. Just a few nits.
I've implemented nearly all suggestions and the code trains. I have not written unit tests, and this is quite an important piece of code that should be tested, IMO. Unfortunately, I'm leaving on holiday very soon! We can either merge now and I will write tests when I return, or we can wait until I return to add tests to this PR.
nice! I vote we merge this as long as you promise to write the tests after you come back :)
LGTM. Thanks for this PR @smorad !
Fixed the linter; just waiting for tests to pass.
Fixed an older bug in multi_gpu_train_one_step (sgd_num_iter instead of num_sgd_iter :( ). This was causing PPO not to learn. It's fixed now, and I ran several tests on Pendulum and CartPole. I'm not too concerned that this broke other benchmarks, as PPO is fairly sequential in its execution pattern. Will merge as soon as all tests pass again... Also, switched training_iteration to True by default.
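To illustrate why a typo like this can fail silently (hypothetical, simplified; the actual call site is in multi_gpu_train_one_step):

# Looking up a misspelled config key with a default never raises an error;
# it just silently falls back to the default value.
config = {"num_sgd_iter": 30}

num_iters_buggy = config.get("sgd_num_iter", 1)  # Typo: always falls back to 1.
num_iters_fixed = config.get("num_sgd_iter", 1)  # Correct key: uses the configured 30.

assert num_iters_buggy == 1
assert num_iters_fixed == 30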
env.set_task(new_task)

fn = functools.partial(fn, task_fn=self.config["env_task_fn"])
self.workers.foreach_env_with_context(fn)
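For context, the env_task_fn config entry is expected to be a curriculum callable along these lines (signature assumed from RLlib's curriculum-learning example; the reward threshold and integer task are made up):

def curriculum_fn(train_results: dict, task_settable_env, env_ctx) -> int:
    # Bump the task (e.g. a difficulty level) once the mean episode reward
    # clears a threshold; otherwise keep the current task.
    current_task = task_settable_env.get_task()
    if train_results.get("episode_reward_mean", 0.0) > 200.0:
        return current_task + 1
    return current_task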
Can this time out and create errors that we don't recover from properly due to this being outside of step_attempt()?
I love the max_****_steps change; we should totally do this in other training iteration functions. Other than that, I have some minor questions, nothing pressing :)
Sorry I'm late to the party 😄
Why are these changes needed?
I'm trying to benchmark PyTorch PPO performance, but this is difficult with the execution_plan API and its associated dataflow programming paradigm. This PR enables the use of the new experimental, imperative training_iteration API. This also significantly aids readability and appears to provide a marginal (~7%) speedup.

(Benchmark plot omitted: orange is before the fix and pink is after the fix, using the config.)
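For readers unfamiliar with the new API, a condensed sketch of what an imperative PPO training step roughly looks like under it (helper names and import paths assumed; see the diff for the PR's actual code):

from ray.rllib.execution.rollout_ops import synchronous_parallel_sample, standardize_fields
from ray.rllib.execution.train_ops import multi_gpu_train_one_step
from ray.rllib.policy.sample_batch import SampleBatch


def training_iteration(self):
    # 1) Collect rollouts from all workers in parallel and merge them into one batch.
    sample_batches = synchronous_parallel_sample(self.workers)
    train_batch = SampleBatch.concat_samples(sample_batches)

    # 2) Standardize advantages before running any SGD passes.
    train_batch = standardize_fields(train_batch, ["advantages"])

    # 3) Run (multi-GPU) SGD on the collected batch.
    train_results = multi_gpu_train_one_step(self, train_batch)

    # 4) Push the updated weights back to the rollout workers.
    self.workers.sync_weights()
    return train_results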
Another benchmark on Atari (Pong), using 15 workers (plot omitted): this indicates that without the changes, PPO has a much harder time learning, but I'll need to confirm this more properly (with more seeds):
Config:
Results:
Checks
I've run scripts/format.sh to lint the changes in this PR.