
Enable Twin Delayed DDPG for RLlib DDPG agent #3353

Merged
merged 9 commits on Nov 22, 2018

Conversation

joneswong
Contributor

What do these changes do?

  • fix DDPG optimizers
  • add Twin Delayed DDPG (TD3)

The original implementation computes gradients with the actor optimizer and the critic optimizer respectively, but applies the gradients with the fake optimizer defined by the base class TF policy graph. TensorFlow optimizers all compute gradients in the same way; the momentum adjustments only take effect when the gradients are applied. Thus, the original implementation never actually used the two optimizers created by the DDPG agent itself.
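As a minimal sketch of the problem (toy variables, not the actual RLlib code): compute_gradients() yields the same gradient tensors no matter which optimizer is asked, so only the optimizer that finally calls apply_gradients() matters, and in the original code that was the base-class one.

import tensorflow as tf

actor_var = tf.get_variable("actor_var", initializer=1.0)
critic_var = tf.get_variable("critic_var", initializer=1.0)
actor_loss = actor_var ** 2
critic_loss = critic_var ** 2

actor_opt = tf.train.AdamOptimizer(1e-4)   # intended actor learning rate
critic_opt = tf.train.AdamOptimizer(1e-3)  # intended critic learning rate
base_opt = tf.train.AdamOptimizer(1e-3)    # the "fake" base-class optimizer

# gradient computation is optimizer-independent...
grads_and_vars = (
    actor_opt.compute_gradients(actor_loss, var_list=[actor_var]) +
    critic_opt.compute_gradients(critic_loss, var_list=[critic_var]))

# ...so every variable is updated with base_opt's learning rate and Adam state;
# actor_opt and critic_opt never influence the actual update
train_op = base_opt.apply_gradients(grads_and_vars)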

TD3 mainly adds three tricks to DDPG, and this PR builds TD3 on top of DDPG. The comparison is shown below:

[figure: ddpg_improvements — comparison of TD3 and DDPG on Pendulum-v0]

TD3 appears more stable than DDPG and solves Pendulum-v0 quickly in terms of sample efficiency.
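For reference, a sketch of the TD3-related options this PR layers on top of DDPG's DEFAULT_CONFIG (the option names follow the diff and discussion below; the defaults here are illustrative, not necessarily the merged values):

TD3_TRICKS = {
    # clipped double-Q: train a second ("twin") critic and use the minimum of
    # the two target Q values when forming the Bellman target
    "twin_q": True,
    # delayed policy updates: update the actor once per `policy_delay` critic updates
    "policy_delay": 2,
    # target policy smoothing: add clipped Gaussian noise to the target action
    # before evaluating the target critic(s)
    "smooth_target_policy": True,
}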

Related issue number

@ericl
Contributor

ericl commented Nov 19, 2018

Nice. Looking at the paper https://arxiv.org/pdf/1802.09477.pdf, it seems the Walker2d-v1 and Ant-v1 tasks should show a significant gain with TD3. Is it possible to include results for those as well? Otherwise, I can benchmark it later.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9448/
Test FAILed.

self.config["huber_threshold"], self.config["twin_q"])

def optimizer(self):
return self.critic_optimizer, self.actor_optimizer
Contributor

I'm a bit skeptical about the benefit of this. Could we instead have one optimizer, with loss coefficients balancing the two losses, like the other algorithms do?

# delayed policy update
"policy_delay": 1,
# target policy smoothing
"use_gaussian_noise": False,
Contributor

Call this smooth_target_policy? You can then add in the comment "this also forces the use of gaussian instead of OU noise for exploration".

@@ -16,6 +16,18 @@
# yapf: disable
# __sphinx_doc_begin__
DEFAULT_CONFIG = with_common_config({
# === Twin Delayed DDPG (TD3) and Soft Actor-Critic (SAC) tricks ===

stochastic_actions = deterministic_actions + eps * (
high_action - low_action) * exploration_value
if use_gaussian_noise:
if target_smoothing:
Contributor

If use_gaussian_noise is renamed to smooth_target_policy, perhaps this can be renamed to is_target?

Contributor Author

Here we just want to branch between OU and Gaussian noise, and in the Gaussian case we also have to distinguish the act from the target act, where is_target is an appropriate name, imo.

builder.add_feed_dict({self._is_training: True})
fetches = builder.add_fetches([
self._apply_op if self.policy_delay_count %
self.config["policy_delay"] == 0 else self._apply_ops[0],
Contributor
@ericl ericl Nov 19, 2018

Would it be more natural to multiply the loss by 0 unless policy delay mod is 0? That way it is not necessary to add these hacks in the policy graph.

I am also kind of skeptical of this optimization; it seems like you can probably just tune some loss coefficients instead (but I guess we should have it for reproducibility).

@ericl ericl self-assigned this Nov 19, 2018
@joneswong
Contributor Author

joneswong commented Nov 20, 2018

I agree with your suggestion to move the delayed policy update trick from run time into the declaration of the computation graph.

One question: without considering the delayed policy update trick, the original DDPG paper uses two optimizers so that the policy net and the value net can be updated with different learning rates. This still requires us to call apply_gradients() separately, instead of the original RLlib implementation, which uses a fake optimizer (created by the base class TFPolicyGraph) to call apply_gradients() as _apply_op.
Do you want to stay identical to the original paper, i.e., make the two optimizers actually function? If not, I will roll back optimizer() and the other related modifications. Otherwise, these modifications hack the base class TFPolicyGraph, but they are still necessary.

@ericl
Contributor

ericl commented Nov 20, 2018

Isn't scaling the loss by a factor equivalent to scaling the learning rate? If so then I think we should go with having a single optimizer, and expose loss coefficients as options instead.
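For plain SGD the two are indeed equivalent: θ ← θ − η·∇(c·L) = θ − (c·η)·∇L, so scaling a loss term by c has the same effect as scaling its learning rate by c. (With an adaptive optimizer such as Adam, the per-parameter normalization partly cancels a constant loss scale, so the correspondence is only approximate there.)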

@ericl ericl added this to Needs triage in RLlib via automation Nov 20, 2018
@ericl ericl moved this from Needs triage to High priority in RLlib Nov 20, 2018
@joneswong
Contributor Author

joneswong commented Nov 21, 2018

[figure: ddpg_improvements — updated comparison of TD3 and DDPG]

Updated. TD3 still solves Pendulum-v0 quickly.

@joneswong
Contributor Author

Let me summarize these modifications:

  • moved the "Delayed Policy Updates" trick from run time into the declaration of the computation graph, via global_step and a mod operation
  • for "Target Policy Smoothing", there are actually two Gaussian stddevs, one for the act and one for the target act, which are 0.1 and 0.2 respectively, following both the original paper and the baselines implementation (see the sketch after this list)
  • I don't have MuJoCo, so continuous-control envs other than Pendulum are not available for now; there are no funds for this. I am very sorry...
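A sketch of those two noise scales (function and argument names here are illustrative, not the merged RLlib code): exploration noise with stddev 0.1 on the behaviour action, and clipped noise with stddev 0.2 on the target action for target policy smoothing, as in the TD3 paper and baselines.

import tensorflow as tf

def add_gaussian_noise(action, is_target, low_action, high_action,
                       act_noise=0.1, target_noise=0.2, noise_clip=0.5):
    if is_target:
        # target policy smoothing: clipped noise on the target action
        noise = tf.clip_by_value(
            tf.random_normal(tf.shape(action), stddev=target_noise),
            -noise_clip, noise_clip)
    else:
        # exploration noise on the behaviour action
        noise = tf.random_normal(tf.shape(action), stddev=act_noise)
    return tf.clip_by_value(action + noise, low_action, high_action)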

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9503/
Test FAILed.

Contributor
@ericl ericl left a comment

Mostly minor comments

@@ -189,13 +239,16 @@ def __init__(self, observation_space, action_space, config):

# Action outputs
with tf.variable_scope(A_SCOPE, reuse=True):
deterministic_flag = tf.constant(value=False, dtype=tf.bool)
deterministic_flag = tf.constant(value=True, dtype=tf.bool)
Contributor

I guess the value of this doesn't matter since eps is zero?

Contributor Author
@joneswong joneswong Nov 21, 2018

Yes, but the following output_actions are used for the target Q value and for backpropagating gradients to the policy net, so "deterministic" is more natural. eps now applies only to OU. I implemented the Gaussian noise according to the original paper and baselines, where the scale of the action space and eps are not considered. Is it necessary to also apply eps to the Gaussian noise? Let me know what you'd prefer.

Contributor

Ah I see, that sounds fine to not include it.

p_tp1, deterministic_flag, zero_eps)
p_tp1,
tf.constant(value=False, dtype=tf.bool)
if self.config["smooth_target_policy"] else deterministic_flag,
Contributor

This can just be not self.config["smooth_target_policy"], since bools will be cast to tensors, right?

Contributor Author

I guessed that as well, but got "TypeError: pred must not be a Python bool".
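A minimal reproduction of that error (illustrative): tf.cond requires its predicate to be a scalar boolean tensor, not a Python bool.

import tensorflow as tf

x = tf.constant(1.0)
# tf.cond(False, lambda: x, lambda: -x)   # raises TypeError: pred must not be a Python bool
y = tf.cond(tf.constant(False), lambda: x, lambda: -x)  # OK: pred is a tf.bool tensor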

Contributor

Suggested change
if self.config["smooth_target_policy"] else deterministic_flag,
stochastic=tf.constant(not self.config["smooth_target_policy"]),

p_tp1,
tf.constant(value=False, dtype=tf.bool)
if self.config["smooth_target_policy"] else deterministic_flag,
zero_eps,
Contributor

Similarly I think you can just pass 0.0 for zero_eps.

Contributor Author

As above: if we also apply eps to the Gaussian noise, eps can be 0.0 for output_action, but it can NOT be 0.0 for output_action_estimation (i.e., the target action).

@@ -355,7 +447,8 @@ def compute_td_error(self, obs_t, act_t, rew_t, obs_tp1, done_mask,
return td_err

def reset_noise(self, sess):
sess.run(self.reset_noise_op)
if not self.config["use_gaussian_noise"]:
Contributor

This is no longer a valid flag right? Maybe we can always run this since resetting noise is harmless in the other case.

Contributor Author

My fault... replaced.

self._apply_op = self._optimizer.apply_gradients(
self._grads_and_vars,
global_step=self.global_step
if hasattr(self, "global_step") else None)
Contributor

Hm how about we replace this with tf.train.get_or_create_global_step()?

I would also add a note here that it is for TD3.
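A sketch of that suggestion (toy loss and variable, not the merged code): get_or_create_global_step() fetches or creates the graph's global step, and passing it to apply_gradients() increments it on every update, which the TD3 policy-delay mask (tf.mod(global_step, policy_delay) == 0) can then read.

import tensorflow as tf

w = tf.get_variable("w", initializer=1.0)
loss = w ** 2
optimizer = tf.train.AdamOptimizer(1e-3)
global_step = tf.train.get_or_create_global_step()
train_op = optimizer.apply_gradients(
    optimizer.compute_gradients(loss), global_step=global_step)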

# update policy net one time v.s. update critic net `policy_delay` time(s)
actor_loss_coeff = tf.to_float(
tf.equal(tf.mod(global_step, policy_delay), 0))
self.total_loss = actor_loss_coeff * self.actor_loss + self.critic_loss
Contributor

Suggested change
self.total_loss = actor_loss_coeff * self.actor_loss + self.critic_loss
self.total_loss = self.config["actor_loss_coeff"] * actor_loss_mask * self.actor_loss + self.config["critic_loss_coeff"] * self.critic_loss


# === Optimization ===
actor_lr: 0.0001
critic_lr: 0.001
Contributor

We should remove these and have actor_loss_coeff and critic_loss_coeff instead, since these no longer have any effect?

with tf.variable_scope(A_SCOPE, reuse=True):
exploration_sample = tf.get_variable(name="ornstein_uhlenbeck")
self.reset_noise_op = tf.assign(exploration_sample,
self.dim_actions * [.0])
Contributor

else:
self.reset_noise_op = tf.no_op()

@@ -205,14 +258,28 @@ def __init__(self, observation_space, action_space, config):
with tf.variable_scope(Q_SCOPE, reuse=True):
q_tp0, _ = self._build_q_network(self.obs_t, observation_space,
output_actions)
if self.config["twin_q"]:
with tf.variable_scope("twin_" + Q_SCOPE) as scope:
Contributor

Suggested change
with tf.variable_scope("twin_" + Q_SCOPE) as scope:
with tf.variable_scope(TWIN_Q_SCOPE) as scope:


self.loss = self._build_actor_critic_loss(q_t, q_tp1, q_tp0)
if self.config["twin_q"]:
with tf.variable_scope("twin_" + Q_TARGET_SCOPE) as scope:
Contributor

Suggested change
with tf.variable_scope("twin_" + Q_TARGET_SCOPE) as scope:
with tf.variable_scope(TWIN_Q_TARGET_SCOPE) as scope:

@@ -142,6 +188,9 @@ def __init__(self, observation_space, action_space, config):
self.critic_optimizer = tf.train.AdamOptimizer(
Contributor

We should remove the optimizers right?

@ericl
Contributor

ericl commented Nov 21, 2018

@joneswong I'll run some benchmarks later since we have licenses.

@@ -189,13 +239,16 @@ def __init__(self, observation_space, action_space, config):

# Action outputs
with tf.variable_scope(A_SCOPE, reuse=True):
deterministic_flag = tf.constant(value=False, dtype=tf.bool)
deterministic_flag = tf.constant(value=True, dtype=tf.bool)
zero_eps = tf.constant(value=.0, dtype=tf.float32)
output_actions = self._build_action_network(
self.p_t, deterministic_flag, zero_eps)
Contributor

Suggested change
self.p_t, deterministic_flag, zero_eps)
self.p_t, stochastic=tf.constant(True), eps=zero_eps)

@@ -189,13 +239,16 @@ def __init__(self, observation_space, action_space, config):

# Action outputs
with tf.variable_scope(A_SCOPE, reuse=True):
deterministic_flag = tf.constant(value=False, dtype=tf.bool)
deterministic_flag = tf.constant(value=True, dtype=tf.bool)
Contributor

Suggested change
deterministic_flag = tf.constant(value=True, dtype=tf.bool)

@joneswong
Contributor Author

Updated. Judging by the performance, TD3 solves the task more quickly than DDPG, and both are improved because the different learning rates now take effect. I mean that the gradients() method of DDPGPolicyGraph uses actor_loss and critic_loss separately, so we must multiply in the loss coefficients before constructing total_loss.

[figure: ddpg_improvements — comparison of TD3 and DDPG after the latest changes]

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9513/
Test FAILed.

Contributor
@ericl ericl left a comment

Looks good! Some lint errors in tests.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9520/
Test FAILed.

@ericl ericl merged commit 24bfe8a into ray-project:master Nov 22, 2018
RLlib automation moved this from Prioritized to Done Nov 22, 2018
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9527/
Test FAILed.
