
[RLlib] DD-PPO training iteration fn #23906

Merged
merged 12 commits into ray-project:master on Apr 19, 2022

Conversation

sven1977
Contributor

@sven1977 sven1977 commented Apr 14, 2022

DD-PPO training iteration fn implementation:

  • Add Pendulum learning test for this algo to CI
  • Atari comparative benchmarks (vs execution_plan version) pending ...

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

sample_and_update_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers(),
    ray_wait_timeout_s=1000.0,  # 0.0
Contributor


huh?

Contributor Author


Ah, sorry, I was just trying some stuff. Basically, a timeout that long makes the call synchronous :) Will revert. ...
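For context, a minimal sketch (using plain ray.wait rather than the actual asynchronous_parallel_requests utility, with a hypothetical stand-in task) of why a very large ray_wait_timeout_s effectively makes the polling loop synchronous:

import ray

ray.init()

@ray.remote
def sample_and_update():
    # Hypothetical stand-in for one worker's sample + decentralized-update step.
    return {"episode_reward_mean": 0.0}

refs = [sample_and_update.remote() for _ in range(4)]

# timeout=0.0: return immediately with whatever happens to be done already
# (the intended asynchronous polling behavior).
ready, pending = ray.wait(refs, num_returns=len(refs), timeout=0.0)

# timeout=1000.0: block (for up to 1000s) until all requests finish, which in
# practice turns the polling loop into a synchronous wait.
ready, pending = ray.wait(refs, num_returns=len(refs), timeout=1000.0)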

Contributor Author


done

Contributor


in a future commit?

Contributor Author


Huh? Ok, now it's fixed ... Forgot to push.

env: Pendulum-v1
run: DDPPO
stop:
    episode_reward_mean: -300
Contributor


debugging?

Contributor Author


Which part?

  • reward: It's able to get to -300.
  • timesteps: Yeah, it does need up to 1M sometimes. DD-PPO does not seem to be a very stable algo, especially on continuous-action tasks. Even on Atari I have yet to find a good choice of hyperparams.

Contributor


Oh, for some reason I read this as CartPole and not Pendulum, my bad.

Contributor Author


Haha, yeah -300 would be pretty bad for CartPole :)

@@ -249,6 +249,16 @@ py_test(
    args = ["--yaml-dir=tuned_examples/ppo"]
)

py_test(
Contributor Author


This task is now working properly, thanks to proper hyperparameter tuning.

Contributor


awesome! How difficult did you find it to tune hparams?

Contributor Author


It was pretty hard, actually.
It's good to start with just one worker and the exact same hparams as the respective PPO version. Then increase num_workers and, at the same time, carefully adjust:

rollout_fragment_length
num_envs_per_worker
sgd_minibatch_size
num_sgd_iter

In short, anything that affects the (per-worker) batch size and the time each worker spends on a decentralized update.
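As an illustration only (the values below are hypothetical placeholders, not the tuned values from this PR), a single-worker DD-PPO starting point in Ray Tune could look roughly like this, with the batch-related keys being the ones to re-balance as num_workers grows:

from ray import tune

# Hypothetical starting config; values are placeholders, not the ones tuned in this PR.
config = {
    "env": "Pendulum-v1",
    "framework": "torch",          # DD-PPO runs on torch
    "num_workers": 1,              # start with one worker and PPO-like hparams
    # Keys that change the per-worker batch size / per-worker SGD time and
    # therefore need re-balancing whenever num_workers is increased:
    "rollout_fragment_length": 250,
    "num_envs_per_worker": 1,
    "sgd_minibatch_size": 50,
    "num_sgd_iter": 10,
}

tune.run("DDPPO", config=config, stop={"episode_reward_mean": -300})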

Contributor

@avnishn avnishn left a comment


This looks pretty much good to me.

I think that we can find good Atari hparams, but we'll probably need some more logging info (e.g. action-distribution stddev and entropy in the case of PPO), and then we'll be able to get a sufficiently well-working DD-PPO agent.
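As a rough illustration of the kind of extra logging meant here (the helper and stat names are hypothetical, not existing RLlib keys), one could record the action distribution's stddev and entropy alongside the usual learner stats:

import torch

def extra_action_dist_stats(mean: torch.Tensor, log_std: torch.Tensor) -> dict:
    # Hypothetical helper: summarize a diagonal-Gaussian action distribution so
    # different hparam runs can be compared on exploration behavior.
    dist = torch.distributions.Normal(mean, log_std.exp())
    return {
        "action_stddev": log_std.exp().mean().item(),
        "action_entropy": dist.entropy().mean().item(),
    }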

@sven1977 sven1977 merged commit eb54236 into ray-project:master Apr 19, 2022
avnishn added a commit that referenced this pull request Apr 19, 2022
amogkam pushed a commit that referenced this pull request Apr 19, 2022
The DDPPO LR scheduler test is broken because the learner info dict returned by the training iteration function does not consistently contain learner stats for every training iteration, but the test expects that it does.

We'll need to fix the test and then re-merge.

Reverts #23906
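
For context, a minimal sketch of the kind of guard such a test could use (the result-dict keys below are illustrative and not guaranteed to match RLlib's exact layout):

def get_reported_lr(result: dict, policy_id: str = "default_policy"):
    # The training-iteration fn may skip reporting learner stats on some
    # iterations, so treat a missing entry as "no update happened" rather
    # than letting the test fail on a missing key.
    learner_info = result.get("info", {}).get("learner", {})
    stats = learner_info.get(policy_id, {}).get("learner_stats", {})
    return stats.get("cur_lr")  # None if nothing was reported this iteration

# In the LR-schedule test, only assert on iterations that actually reported stats:
# lr = get_reported_lr(trainer.train())
# if lr is not None:
#     assert lr <= previous_lr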
sven1977 added a commit that referenced this pull request Apr 20, 2022
sven1977 pushed a commit that referenced this pull request Apr 21, 2022
krfricke added a commit that referenced this pull request Apr 22, 2022
krfricke added a commit that referenced this pull request Apr 22, 2022