
[RLlib] Rewrite PPO to use training_iteration + enable DD-PPO for Win32. #23673

Merged
merged 20 commits into ray-project:master on Apr 11, 2022

Conversation

smorad
Contributor

@smorad smorad commented Apr 2, 2022

Why are these changes needed?

I'm trying to benchmark PyTorch PPO performance, but this is difficult with the execution_plan API and its associated dataflow programming paradigm. This PR enables PPO to use the new, experimental, imperative training_iteration API. This also significantly improves readability and appears to provide a marginal (~7%) speedup.
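
For context, here is a rough, simplified sketch of what such an imperative training_iteration() step can look like. This is not the exact code in this PR; it borrows helper names that appear further down in the review (synchronous_parallel_sample, standardize_fields, SampleBatch.concat_samples), assumes a train_one_step() helper, and leaves out KL-coefficient updates and metrics handling:

from ray.rllib.execution.rollout_ops import (
    standardize_fields,
    synchronous_parallel_sample,
)
from ray.rllib.execution.train_ops import train_one_step
from ray.rllib.policy.sample_batch import SampleBatch


def training_iteration(self):
    # Collect rollouts from all workers imperatively (no dataflow operators).
    rollouts = synchronous_parallel_sample(self.workers)
    train_batch = SampleBatch.concat_samples(rollouts)

    # Standardize advantages once, before any SGD epoch runs.
    train_batch = standardize_fields(train_batch, ["advantages"])

    # Run num_sgd_iter epochs of minibatch SGD, then push the updated
    # weights back to the rollout workers.
    train_results = train_one_step(self, train_batch)
    self.workers.sync_weights()
    return train_results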

Orange is before the fix and pink is after the fix, using this config:

cartpole-ppo:
    env: CartPole-v0
    run: PPO
    stop:
        timesteps_total: 300000
    config:
        min_time_s_per_reporting: 5
        _disable_execution_plan_api: true
        # Works for both torch and tf.
        framework: torch
        gamma: 0.99
        lr: 0.0003
        num_workers: 0
        observation_filter: MeanStdFilter
        num_sgd_iter: 6
        vf_loss_coeff: 0.01
        model:
            fcnet_hiddens: [32]
            fcnet_activation: linear
            vf_share_layers: true
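
For reference, a config like this can be launched with the RLlib CLI once saved to a YAML file (the filename below is just an example):

$ rllib train -f cartpole-ppo.yaml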

(Screenshot: CartPole-v0 learning curves; orange = before the fix, pink = after.)

Another benchmark on Atari (Pong), using 15 workers. This suggests that without the changes, PPO has a much harder time learning, but I'll need to confirm this more rigorously (more seeds):

Config:

# On a single GPU, this achieves maximum reward in ~15-20 minutes.
#
# $ python train.py -f tuned_configs/pong-ppo.yaml
#
pong-ppo:
    env: PongNoFrameskip-v4
    stop:
        episode_reward_mean: 19.0
    run: PPO
    config:
        # Works for both torch and tf.
        framework: tf
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 20
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 15
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        num_gpus: 1
        model:
            dim: 42
            vf_share_layers: true

        _disable_execution_plan_api:
            grid_search: [true, false]

Results:

== Status ==
Current time: 2022-04-08 04:36:59 (running for 00:22:00.63)
Memory usage on this node: 27.1/239.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 32.0/32 CPUs, 2.0/4 GPUs, 0.0/147.65 GiB heap, 0.0/67.27 GiB objects
Result logdir: /home/ray/ray_results/pong-ppo
Number of trials: 2/2 (2 RUNNING)
+------------------------------------+----------+-------------------+-------------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
| Trial name                         | status   | loc               | _disable_execution_plan_api   |   iter |   total time (s) |      ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|------------------------------------+----------+-------------------+-------------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------|
| PPO_PongNoFrameskip-v4_203c6_00000 | RUNNING  | 10.0.67.156:17005 | True                          |    327 |          1280.29 | 3237300 |    19.02 |                   21 |                   12 |            7343.62 |
| PPO_PongNoFrameskip-v4_203c6_00001 | RUNNING  | 10.0.67.156:17006 | False                         |    638 |          2522.73 | 6316200 |    19.15 |                   21 |                   -3 |            8530.15 |
+------------------------------------+----------+-------------------+-------------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@gjoliver
Member

gjoliver commented Apr 4, 2022

what's the relationship between this and #23686?
which one should we review?

@smorad
Contributor Author

smorad commented Apr 4, 2022

what's the relationship between this and #23686? which one should we review?

Doh, I did not see this open PR. This implements identical capabilities to #23686. @sven1977 it's probably worth double-checking against mine to see if there are any important differences. I think the main difference is that this PR still warns about bad weight/clipping scales like the original version. And looking at his PR, I think I actually train on rollout_fragment_length rather than train_batch_size :P

@sven1977
Contributor

sven1977 commented Apr 5, 2022

Oh no! :D Sorry @smorad, I didn't see your PR before I did something similar yesterday.
But no problem. Yours actually looks like it solves some of the problems my PR still had (e.g. PPO reward-range checking).
Let me take a look ...

"vf_share_layers.".format(policy_id, scaled_vf_loss, policy_loss)
)
# Warn about bad clipping configs
mean_reward = rollouts["rewards"][rollouts["agent_index"] == i].mean()
Contributor

Great idea just doing this here and in this fashion! True, we don't really need entire episodes for this avg-reward estimate.

Member

should we log 1 in N times or something? won't this flood the training logs?

Contributor

I think @smorad fixed this with a log_once if-block, which always seems like the best option for these warnings.
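
For illustration, a minimal sketch of the log_once pattern being referred to; the threshold and message text are made up here (not the actual warning in this PR), and mean_reward / policy_id are the variables from the loss snippet above:

import logging

from ray.util.debug import log_once

logger = logging.getLogger(__name__)

# Emit the warning at most once per process instead of on every train step.
if abs(mean_reward) > 100.0 and log_once("ppo_reward_scale_warning"):
    logger.warning(
        f"Policy {policy_id} sees a mean reward of {mean_reward:.2f}. "
        "The value function may have trouble fitting rewards at this scale; "
        "consider tuning vf_clip_param, vf_loss_coeff, or clip_rewards."
    )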

@@ -337,6 +337,35 @@ def __call__(self, samples: SampleBatchType) -> SampleBatchType:
return samples


def standardize_fields(samples: SampleBatchType, fields: List[str]) -> SampleBatchType:
"""Standardize fields of the given SampleBatch"""
Contributor

Actually, could we just simplify this even further and directly add the following code to the training_iteration() method:

        # Standardize `advantages` values in train_batch.
        for policy_id, batch in train_batch.policy_batches.items():
            if Postprocessing.ADVANTAGES not in batch:
                raise KeyError(
                    f"`{Postprocessing.ADVANTAGES}` not found in SampleBatch for "
                    f"policy `{policy_id}`! Maybe this policy fails to add "
                    f"`{Postprocessing.ADVANTAGES}` in its `postprocess_trajectory` "
                    f"method? Or this policy is not meant to learn at all and you "
                    "forgot to remove it from the list under `config."
                    "multiagent.policies_to_train`."
                )
            batch[Postprocessing.ADVANTAGES] = standardized(
                batch[Postprocessing.ADVANTAGES])

Contributor Author

This would require two identical for loops, which are quite ugly IMO. This is because we need to train AFTER normalizing, not before.

Contributor

Ok, fair enough. Fine with me :)

@@ -152,6 +152,7 @@ def reduce_mean_valid(t):
mean_vf_loss = reduce_mean_valid(vf_loss_clipped)
# Ignore the value function.
else:
value_fn_out = 0
Contributor

Great fix! I guess this is already merged from the other PR.

rollouts = synchronous_parallel_sample(self.workers)

# Concatenate the SampleBatches from each worker into one large SampleBatch
rollouts = SampleBatch.concat_samples(rollouts)
Contributor

We might need some train_batch_size check here (a while loop, just like in rllib/agents/pg/pg.py).
Otherwise, the train batch size might be too small after just a single round of parallel RolloutWorker.sample() collection.

Member

Make it into a util that collects exactly train_batch_size samples from the available workers in truncate_episodes mode, and at least train_batch_size samples in complete_episodes mode?
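
For illustration, a minimal sketch of what such a util could look like, built only from the pieces already shown above (synchronous_parallel_sample and SampleBatch.concat_samples). It simply keeps sampling until at least train_batch_size env steps have been collected and is not the exact helper that landed in RLlib:

def sample_at_least_train_batch_size(workers, train_batch_size):
    # Keep sampling rounds from all rollout workers until at least
    # `train_batch_size` environment steps have been gathered in total.
    batches = []
    num_env_steps = 0
    while num_env_steps < train_batch_size:
        new_batches = synchronous_parallel_sample(workers)
        batches.extend(new_batches)
        num_env_steps += sum(b.count for b in new_batches)
    return SampleBatch.concat_samples(batches)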

Contributor

@sven1977 sven1977 left a comment

Looks really great. Just a few nits.

@sven1977 sven1977 mentioned this pull request Apr 5, 2022
6 tasks
@sven1977
Contributor

sven1977 commented Apr 5, 2022

@gjoliver, @smorad, we'll move forward with this PR. Please ignore #23686 (it's already closed).

This PR solves the reward-checking problem. The other PR (#23686) solves the batch-size problem, whose solution we should move into this one (see my comments above).

@smorad
Contributor Author

smorad commented Apr 5, 2022

I've implemented nearly all suggestions and the code trains. However, I have not written unit tests, and this is quite an important piece of code that should be tested IMO. Unfortunately, I'm leaving on holiday very soon! We can either merge now and I will write tests when I return, or we can wait until I return to add tests to this PR.

Member

@gjoliver gjoliver left a comment

nice! I vote we merge this as long as you promise to write the tests after you come back :)

Contributor

@sven1977 sven1977 left a comment

LGTM. Thanks for this PR @smorad !

@sven1977
Contributor

sven1977 commented Apr 7, 2022

Fixed the linter errors; just waiting for tests to pass.

@sven1977
Contributor

sven1977 commented Apr 7, 2022

Fixed an older bug in multi_gpu_train_one_step (sgd_num_iter instead of num_sgd_iter :( ). This was causing PPO not to learn. It's fixed now, and I ran several tests on Pendulum and CartPole. I'm not too concerned that this broke other benchmarks, as PPO is fairly sequential in its execution pattern. Will merge as soon as all tests pass again ...
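
(Purely as an illustration of how a misspelled config key can slip through without raising; the actual code path in multi_gpu_train_one_step may differ:)

# Hypothetical: reading a misspelled key with .get() silently falls back
# to the default instead of raising, so far fewer SGD passes run than intended.
config = {"num_sgd_iter": 10}
num_sgd_iter = config.get("sgd_num_iter", 1)  # typo -> always 1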

Also, switched training_iteration on by default.

@sven1977 sven1977 changed the title from "[RLlib] Rewrite PPO to use training_iteration" to "[RLlib] Rewrite PPO to use training_iteration + enable DD-PPO for Win32." on Apr 8, 2022
env.set_task(new_task)

fn = functools.partial(fn, task_fn=self.config["env_task_fn"])
self.workers.foreach_env_with_context(fn)
Contributor

Can this time out and create errors that we don't recover from properly, because it happens outside of step_attempt()?

Contributor

@ArturNiederfahrenhorst ArturNiederfahrenhorst left a comment

I love the max_****_steps change; we should totally do this in other training_iteration functions. Other than that I have some minor questions, nothing pressing :)
Sorry I'm late to the party 😄

@sven1977 sven1977 merged commit 0092281 into ray-project:master Apr 11, 2022