
[RLlib] Rewrite PPO to use training_iteration + enable DD-PPO for Win32. #23673

Merged
merged 20 commits into ray-project:master on Apr 11, 2022

Conversation

smorad
Contributor

@smorad smorad commented Apr 2, 2022

Why are these changes needed?

I'm trying to benchmark PyTorch PPO performance, but this is difficult with the execution_plan API and its associated dataflow programming paradigm. This PR enables PPO to use the new, experimental, imperative training_iteration API. This also significantly improves readability and appears to provide a marginal (~7%) speedup.
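
For context, here is a rough, simplified sketch of what such an imperative training_iteration() step can look like. This is not the exact code in this PR; it borrows helper names that appear further down in the review (synchronous_parallel_sample, standardize_fields, SampleBatch.concat_samples), assumes a train_one_step() helper, and leaves out KL-coefficient updates and metrics handling:

from ray.rllib.execution.rollout_ops import (
    standardize_fields,
    synchronous_parallel_sample,
)
from ray.rllib.execution.train_ops import train_one_step
from ray.rllib.policy.sample_batch import SampleBatch


def training_iteration(self):
    # Collect rollouts from all workers imperatively (no dataflow operators).
    rollouts = synchronous_parallel_sample(self.workers)
    train_batch = SampleBatch.concat_samples(rollouts)

    # Standardize advantages once, before any SGD epoch runs.
    train_batch = standardize_fields(train_batch, ["advantages"])

    # Run num_sgd_iter epochs of minibatch SGD, then push the updated
    # weights back to the rollout workers.
    train_results = train_one_step(self, train_batch)
    self.workers.sync_weights()
    return train_results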

Orange is before the fix and pink is after the fix, using this config:

cartpole-ppo:
    env: CartPole-v0
    run: PPO
    stop:
        timesteps_total: 300000
    config:
        min_time_s_per_reporting: 5
        _disable_execution_plan_api: true
        # Works for both torch and tf.
        framework: torch
        gamma: 0.99
        lr: 0.0003
        num_workers: 0
        observation_filter: MeanStdFilter
        num_sgd_iter: 6
        vf_loss_coeff: 0.01
        model:
            fcnet_hiddens: [32]
            fcnet_activation: linear
            vf_share_layers: true
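
For reference, a config like this can be launched with the RLlib CLI once saved to a YAML file (the filename below is just an example):

$ rllib train -f cartpole-ppo.yaml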

(Screenshot: CartPole-v0 learning curves; orange = before the fix, pink = after.)

Another benchmark on Atari (Pong), using 15 workers. This suggests that without the changes, PPO has a much harder time learning, but I'll need to confirm this more rigorously (more seeds):

Config:

# On a single GPU, this achieves maximum reward in ~15-20 minutes.
#
# $ python train.py -f tuned_configs/pong-ppo.yaml
#
pong-ppo:
    env: PongNoFrameskip-v4
    stop:
        episode_reward_mean: 19.0
    run: PPO
    config:
        # Works for both torch and tf.
        framework: tf
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 20
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 15
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        num_gpus: 1
        model:
            dim: 42
            vf_share_layers: true

        _disable_execution_plan_api:
            grid_search: [true, false]

Results:

== Status ==
Current time: 2022-04-08 04:36:59 (running for 00:22:00.63)
Memory usage on this node: 27.1/239.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 32.0/32 CPUs, 2.0/4 GPUs, 0.0/147.65 GiB heap, 0.0/67.27 GiB objects
Result logdir: /home/ray/ray_results/pong-ppo
Number of trials: 2/2 (2 RUNNING)
+------------------------------------+----------+-------------------+-------------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
| Trial name                         | status   | loc               | _disable_execution_plan_api   |   iter |   total time (s) |      ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|------------------------------------+----------+-------------------+-------------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------|
| PPO_PongNoFrameskip-v4_203c6_00000 | RUNNING  | 10.0.67.156:17005 | True                          |    327 |          1280.29 | 3237300 |    19.02 |                   21 |                   12 |            7343.62 |
| PPO_PongNoFrameskip-v4_203c6_00001 | RUNNING  | 10.0.67.156:17006 | False                         |    638 |          2522.73 | 6316200 |    19.15 |                   21 |                   -3 |            8530.15 |
+------------------------------------+----------+-------------------+-------------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@gjoliver
Member

gjoliver commented Apr 4, 2022

what's the relationship between this and #23686?
which one should we review?

@smorad
Contributor Author

smorad commented Apr 4, 2022

what's the relationship between this and #23686? which one should we review?

Doh, I did not see this open PR. This implements identical capabilities to #23686. @sven1977 it's probably worth double-checking against mine to see if there are any important differences. I think the main difference is that this PR still warns about bad weight/clipping scales like the original version. And looking at his PR, I think I actually train on rollout_fragment_length rather than train_batch_size :P

@sven1977
Contributor

sven1977 commented Apr 5, 2022

Oh no! :D Sorry @smorad, I didn't see your PR before I did something similar yesterday.
But no problem. Yours actually looks like it solves some of the problems my PR still had (e.g. PPO reward-range checking).
Let me take a look ...

"vf_share_layers.".format(policy_id, scaled_vf_loss, policy_loss)
)
# Warn about bad clipping configs
mean_reward = rollouts["rewards"][rollouts["agent_index"] == i].mean()
Contributor

Great idea just doing this here and in this fashion! True, we don't really need entire episodes for this avg-reward estimate.

Member

should we log 1 in N times or something? won't this flood the training logs?

Contributor

I think @smorad fixed this with a log_once if-block, which always seems like the best option for these warnings.
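
For illustration, a minimal sketch of the log_once pattern being referred to; the threshold and message text are made up here (not the actual warning in this PR), and mean_reward / policy_id are the variables from the loss snippet above:

import logging

from ray.util.debug import log_once

logger = logging.getLogger(__name__)

# Emit the warning at most once per process instead of on every train step.
if abs(mean_reward) > 100.0 and log_once("ppo_reward_scale_warning"):
    logger.warning(
        f"Policy {policy_id} sees a mean reward of {mean_reward:.2f}. "
        "The value function may have trouble fitting rewards at this scale; "
        "consider tuning vf_clip_param, vf_loss_coeff, or clip_rewards."
    )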

@@ -337,6 +337,35 @@ def __call__(self, samples: SampleBatchType) -> SampleBatchType:
return samples


def standardize_fields(samples: SampleBatchType, fields: List[str]) -> SampleBatchType:
"""Standardize fields of the given SampleBatch"""
Contributor

Actually, could we just simplify this even further and directly add the following code to the training_iteration() method:

        # Standardize `advantages` values in train_batch.
        for policy_id, batch in train_batch.policy_batches.items():
            if Postprocessing.ADVANTAGES not in batch:
                raise KeyError(
                    f"`{Postprocessing.ADVANTAGES}` not found in SampleBatch for "
                    f"policy `{policy_id}`! Maybe this policy fails to add "
                    f"`{Postprocessing.ADVANTAGES}` in its `postprocess_trajectory` "
                    f"method? Or this policy is not meant to learn at all and you "
                    "forgot to remove it from the list under `config."
                    "multiagent.policies_to_train`."
                )
            batch[Postprocessing.ADVANTAGES] = standardized(
                batch[Postprocessing.ADVANTAGES])

Contributor Author

This would require two identical for loops, which are quite ugly IMO. This is because we need to train AFTER normalizing, not before.

Contributor

Ok, fair enough. Fine with me :)

@@ -152,6 +152,7 @@ def reduce_mean_valid(t):
mean_vf_loss = reduce_mean_valid(vf_loss_clipped)
# Ignore the value function.
else:
value_fn_out = 0
Contributor

Great fix! I guess this is already merged from the other PR.

rollouts = synchronous_parallel_sample(self.workers)

# Concatenate the SampleBatches from each worker into one large SampleBatch
rollouts = SampleBatch.concat_samples(rollouts)
Contributor

We might need some train_batch_size check here (a while loop, just like in rllib/agents/pg/pg.py).
Otherwise, the train batch size might be too small after just a single round of parallel RolloutWorker.sample() collection.

Member

Make it into a util that collects exactly train_batch_size samples from the available workers in truncate_episodes mode, and at least train_batch_size samples in complete_episodes mode?
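
For illustration, a minimal sketch of what such a util could look like, built only from the pieces already shown above (synchronous_parallel_sample and SampleBatch.concat_samples). It simply keeps sampling until at least train_batch_size env steps have been collected and is not the exact helper that landed in RLlib:

def sample_at_least_train_batch_size(workers, train_batch_size):
    # Keep sampling rounds from all rollout workers until at least
    # `train_batch_size` environment steps have been gathered in total.
    batches = []
    num_env_steps = 0
    while num_env_steps < train_batch_size:
        new_batches = synchronous_parallel_sample(workers)
        batches.extend(new_batches)
        num_env_steps += sum(b.count for b in new_batches)
    return SampleBatch.concat_samples(batches)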

Contributor

@sven1977 sven1977 left a comment

Looks really great. Just a few nits.

@sven1977 sven1977 mentioned this pull request Apr 5, 2022
6 tasks
@sven1977
Contributor

sven1977 commented Apr 5, 2022

@gjoliver, @smorad, we'll move forward with this PR. Please ignore #23686 (it's already closed).

This PR solves the reward-checking problem. The other PR (#23686) solves the batch-size problem, whose solution we should move into this one (see my comments above).

@smorad
Contributor Author

smorad commented Apr 5, 2022

I've implemented nearly all suggestions and the code trains. However, I have not written unit tests, and this is quite an important piece of code that should be tested IMO. Unfortunately, I'm leaving on holiday very soon! We can either merge now and I will write tests when I return, or we can wait until I return to add tests to this PR.

Member

@gjoliver gjoliver left a comment

nice! I vote we merge this as long as you promise to write the tests after you come back :)

Contributor

@sven1977 sven1977 left a comment

LGTM. Thanks for this PR @smorad !

@sven1977
Contributor

sven1977 commented Apr 7, 2022

Fixed the linter errors; just waiting for tests to pass.

@sven1977
Contributor

sven1977 commented Apr 7, 2022

Fixed an older bug in multi_gpu_train_one_step (sgd_num_iter instead of num_sgd_iter :( ). This was causing PPO not to learn. It's fixed now, and I ran several tests on Pendulum and CartPole. I'm not too concerned that this broke other benchmarks, as PPO is fairly sequential in its execution pattern. Will merge as soon as all tests pass again ...
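
(Purely as an illustration of how a misspelled config key can slip through without raising; the actual code path in multi_gpu_train_one_step may differ:)

# Hypothetical: reading a misspelled key with .get() silently falls back
# to the default instead of raising, so far fewer SGD passes run than intended.
config = {"num_sgd_iter": 10}
num_sgd_iter = config.get("sgd_num_iter", 1)  # typo -> always 1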

Also, switched training_iteration on by default.

@sven1977 sven1977 changed the title from "[RLlib] Rewrite PPO to use training_iteration" to "[RLlib] Rewrite PPO to use training_iteration + enable DD-PPO for Win32." on Apr 8, 2022
env.set_task(new_task)

fn = functools.partial(fn, task_fn=self.config["env_task_fn"])
self.workers.foreach_env_with_context(fn)
Contributor

Can this time out and create errors that we don't recover from properly, because it happens outside of step_attempt()?

Contributor

@ArturNiederfahrenhorst ArturNiederfahrenhorst left a comment

I love the max_****_steps change; we should totally do this in other training_iteration functions. Other than that I have some minor questions, nothing pressing :)
Sorry I'm late to the party 😄

@sven1977 sven1977 merged commit 0092281 into ray-project:master Apr 11, 2022