[RLlib] DD-PPO training iteration fn #23906
Conversation
rllib/agents/ppo/ddppo.py
Outdated
sample_and_update_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers(),
    ray_wait_timeout_s=1000.0,  # 0.0
huh?
Ah, sorry, was just trying some stuff. Basically, this makes it synchronous :) Will revert. ...
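For context, a minimal standalone sketch (not code from this PR; the actor below is hypothetical) of why a huge `ray_wait_timeout_s` makes the request loop effectively synchronous:

```python
import ray

ray.init()

@ray.remote
class Worker:
    def sample(self):
        return 42  # stand-in for a rollout result

workers = [Worker.remote() for _ in range(2)]
pending = [w.sample.remote() for w in workers]

# timeout=0.0 polls and returns immediately with whatever is ready
# (asynchronous); a very large timeout blocks until everything finishes,
# making the surrounding loop behave synchronously.
ready, not_ready = ray.wait(pending, num_returns=len(pending), timeout=1000.0)
print(ray.get(ready))
```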
done
in a future commit?
Huh? Ok, now it's fixed ... Forgot to push.
env: Pendulum-v1
run: DDPPO
stop:
    episode_reward_mean: -300
debugging?
Which part?
- reward: It's able to get to -300.
- timesteps: Yeah, it does sometimes need up to 1M. DD-PPO doesn't seem to be a very stable algo, especially on continuous-action tasks. Even on Atari I have yet to find a good choice of hyperparams.
Oh for some reason I read this as CartPole and not Pendulum, my b
Haha, yeah -300 would be pretty bad for CartPole :)
@@ -249,6 +249,16 @@ py_test(
    args = ["--yaml-dir=tuned_examples/ppo"]
)

py_test(
This task is now working properly, thanks to proper hyperparam tuning.
awesome! How difficult did you find it to tune hparams?
It was pretty hard, actually.
It's good to start with just one worker, using the exact same hparams as the respective PPO version. Then increase `num_workers` and at the same time carefully adjust:
- `rollout_fragment_length`
- `num_envs_per_worker`
- `sgd_minibatch_size`
- `num_sgd_iter`

In short, anything that affects the per-worker batch size and the time each worker spends on a decentralized update. A config sketch follows below.
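To illustrate the recipe (values are hypothetical starting points, not taken from this PR):

```python
from ray import tune

# Hypothetical single-worker starting config. As num_workers grows,
# re-balance the knobs below: together they determine the per-worker
# batch size and the duration of each decentralized update.
config = {
    "env": "Pendulum-v1",
    "num_workers": 1,                # start with 1, then scale up
    "framework": "torch",            # DD-PPO is torch-only
    "rollout_fragment_length": 200,  # per-worker samples per iteration
    "num_envs_per_worker": 1,        # more envs => larger per-worker batch
    "sgd_minibatch_size": 128,       # must fit within the per-worker batch
    "num_sgd_iter": 10,              # SGD passes per decentralized update
}
tune.run("DDPPO", config=config, stop={"episode_reward_mean": -300})
```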
This looks pretty much good to me.
I think we can find good Atari hparams, but we'll probably need some more logging info (e.g. stddev and entropy in the case of PPO); then we should be able to get a sufficiently well-working DDPPO agent. A possible logging hook is sketched below.
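One possible way to surface such stats, sketched with RLlib's `DefaultCallbacks` API (the class name and metric keys below are illustrative assumptions, not part of this PR):

```python
from ray.rllib.agents.callbacks import DefaultCallbacks

class ExtraStatsCallbacks(DefaultCallbacks):
    def on_train_result(self, *, trainer, result, **kwargs):
        # PPO reports per-policy learner stats under result["info"]["learner"].
        learner = result.get("info", {}).get("learner", {})
        for policy_id, stats in learner.items():
            entropy = stats.get("learner_stats", {}).get("entropy")
            if entropy is not None:
                result.setdefault("custom_metrics", {})[
                    f"{policy_id}/entropy"
                ] = entropy
```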
The DDPPO LR scheduler test is broken because the learner info dictionary returned by the training iteration function does not consistently contain learner info for every training iteration, but the test expects that it does. We'll need to fix the test and then re-merge. Reverts #23906.
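A hedged sketch of the mismatch described above (the config values and the guard are illustrative assumptions, not the actual test code):

```python
from ray.rllib.agents.ppo import DDPPOTrainer
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID

trainer = DDPPOTrainer(
    config={"env": "Pendulum-v1", "num_workers": 1, "framework": "torch"}
)
result = trainer.train()

# With the training-iteration fn, "learner" may be missing or empty on
# iterations in which no decentralized update completed, so a test that
# unconditionally indexes into it will break; guard for that instead.
learner_info = result["info"].get("learner") or {}
if DEFAULT_POLICY_ID in learner_info:
    cur_lr = learner_info[DEFAULT_POLICY_ID]["learner_stats"]["cur_lr"]
```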
DD-PPO training iteration fn implementation:
Why are these changes needed?
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.