
Enable Twin Delayed DDPG for RLlib DDPG agent #3353

Merged
merged 9 commits on Nov 22, 2018

Conversation

joneswong
Contributor

What do these changes do?

  • fix DDPG optimizers
  • add Twin Delayed DDPG (TD3)

The original implementation computes gradients with the actor optimizer and the critic optimizer respectively, but applies the gradients with the fake optimizer defined by the base class TF policy graph. TensorFlow optimizers all compute gradients in the same way; the momentum adjustments only take effect when the gradients are applied. Thus, the original implementation never actually used the two optimizers created by the DDPG agent itself.
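As a minimal sketch of the problem (toy variables, not the actual RLlib code): compute_gradients() yields the same gradient tensors no matter which optimizer is asked, so only the optimizer that finally calls apply_gradients() matters, and in the original code that was the base-class one.

import tensorflow as tf

actor_var = tf.get_variable("actor_var", initializer=1.0)
critic_var = tf.get_variable("critic_var", initializer=1.0)
actor_loss = actor_var ** 2
critic_loss = critic_var ** 2

actor_opt = tf.train.AdamOptimizer(1e-4)   # intended actor learning rate
critic_opt = tf.train.AdamOptimizer(1e-3)  # intended critic learning rate
base_opt = tf.train.AdamOptimizer(1e-3)    # the "fake" base-class optimizer

# gradient computation is optimizer-independent...
grads_and_vars = (
    actor_opt.compute_gradients(actor_loss, var_list=[actor_var]) +
    critic_opt.compute_gradients(critic_loss, var_list=[critic_var]))

# ...so every variable is updated with base_opt's learning rate and Adam state;
# actor_opt and critic_opt never influence the actual update
train_op = base_opt.apply_gradients(grads_and_vars)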

TD3 mainly adds three tricks to DDPG, and this PR builds TD3 on top of DDPG. The comparison is shown below:

[figure: ddpg_improvements — comparison of TD3 and DDPG on Pendulum-v0]

TD3 appears more stable than DDPG and solves Pendulum-v0 quickly in terms of sample efficiency.
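For reference, a sketch of the TD3-related options this PR layers on top of DDPG's DEFAULT_CONFIG (the option names follow the diff and discussion below; the defaults here are illustrative, not necessarily the merged values):

TD3_TRICKS = {
    # clipped double-Q: train a second ("twin") critic and use the minimum of
    # the two target Q values when forming the Bellman target
    "twin_q": True,
    # delayed policy updates: update the actor once per `policy_delay` critic updates
    "policy_delay": 2,
    # target policy smoothing: add clipped Gaussian noise to the target action
    # before evaluating the target critic(s)
    "smooth_target_policy": True,
}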

Related issue number

@ericl
Contributor

ericl commented Nov 19, 2018

Nice. Looking at the paper https://arxiv.org/pdf/1802.09477.pdf, it seems the Walker2d-v1 and Ant-v1 tasks should show a significant gain with TD3. Is it possible to include results for those as well? Otherwise, I can benchmark it later.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9448/
Test FAILed.

self.config["huber_threshold"], self.config["twin_q"])

def optimizer(self):
return self.critic_optimizer, self.actor_optimizer
Contributor

I'm a bit skeptical about the benefit of this. Could we instead have one optimizer, with loss coefficients balancing the two losses, like the other algorithms do?

# delayed policy update
"policy_delay": 1,
# target policy smoothing
"use_gaussian_noise": False,
Contributor

Call this smooth_target_policy? You can then add in the comment "this also forces the use of gaussian instead of OU noise for exploration".

@@ -16,6 +16,18 @@
# yapf: disable
# __sphinx_doc_begin__
DEFAULT_CONFIG = with_common_config({
# === Twin Delayed DDPG (TD3) and Soft Actor-Critic (SAC) tricks ===

stochastic_actions = deterministic_actions + eps * (
high_action - low_action) * exploration_value
if use_gaussian_noise:
if target_smoothing:
Contributor

If use_gaussian_noise is renamed to smooth_target_policy, perhaps this can be renamed to is_target?

Contributor Author

Here we just want to branch between OU and Gaussian noise, and in the Gaussian case we also have to distinguish the act from the target act, where is_target is an appropriate name, imo.

builder.add_feed_dict({self._is_training: True})
fetches = builder.add_fetches([
self._apply_op if self.policy_delay_count %
self.config["policy_delay"] == 0 else self._apply_ops[0],
Contributor
@ericl ericl Nov 19, 2018

Would it be more natural to multiply the loss by 0 unless policy delay mod is 0? That way it is not necessary to add these hacks in the policy graph.

I am also kind of skeptical of this optimization; it seems like you can probably just tune some loss coefficients instead (but I guess we should have it for reproducibility).

@ericl ericl self-assigned this Nov 19, 2018
@joneswong
Contributor Author

joneswong commented Nov 20, 2018

I agree with your suggestion to move the delayed policy update trick from run time into the declaration of the computation graph.

One question: without considering the delayed policy update trick, the original DDPG paper uses two optimizers so that the policy net and the value net can be updated with different learning rates. This still requires us to call apply_gradients() separately, instead of the original RLlib implementation, which uses a fake optimizer (created by the base class TFPolicyGraph) to call apply_gradients() as _apply_op.
Do you want to stay identical to the original paper, i.e., make the two optimizers actually function? If not, I will roll back optimizer() and the other related modifications. Otherwise, these modifications hack the base class TFPolicyGraph, but they are still necessary.

@ericl
Contributor

ericl commented Nov 20, 2018

Isn't scaling the loss by a factor equivalent to scaling the learning rate? If so then I think we should go with having a single optimizer, and expose loss coefficients as options instead.
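For plain SGD the two are indeed equivalent: θ ← θ − η·∇(c·L) = θ − (c·η)·∇L, so scaling a loss term by c has the same effect as scaling its learning rate by c. (With an adaptive optimizer such as Adam, the per-parameter normalization partly cancels a constant loss scale, so the correspondence is only approximate there.)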

@ericl ericl added this to Needs triage in RLlib via automation Nov 20, 2018
@ericl ericl moved this from Needs triage to High priority in RLlib Nov 20, 2018
@joneswong
Contributor Author

joneswong commented Nov 21, 2018

[figure: ddpg_improvements — updated comparison of TD3 and DDPG]

Updated. TD3 still solves Pendulum-v0 quickly.

@joneswong
Contributor Author

Let me summarize these modifications:

  • moved the "Delayed Policy Updates" trick from run time into the declaration of the computation graph, via global_step and a mod operation
  • for "Target Policy Smoothing", there are actually two Gaussian stddevs, one for the act and one for the target act, which are 0.1 and 0.2 respectively, following both the original paper and the baselines implementation (see the sketch after this list)
  • I don't have MuJoCo, so continuous-control envs other than Pendulum are not available for now; there are no funds for this. I am very sorry...
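A sketch of those two noise scales (function and argument names here are illustrative, not the merged RLlib code): exploration noise with stddev 0.1 on the behaviour action, and clipped noise with stddev 0.2 on the target action for target policy smoothing, as in the TD3 paper and baselines.

import tensorflow as tf

def add_gaussian_noise(action, is_target, low_action, high_action,
                       act_noise=0.1, target_noise=0.2, noise_clip=0.5):
    if is_target:
        # target policy smoothing: clipped noise on the target action
        noise = tf.clip_by_value(
            tf.random_normal(tf.shape(action), stddev=target_noise),
            -noise_clip, noise_clip)
    else:
        # exploration noise on the behaviour action
        noise = tf.random_normal(tf.shape(action), stddev=act_noise)
    return tf.clip_by_value(action + noise, low_action, high_action)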

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9503/
Test FAILed.

Contributor
@ericl ericl left a comment

Mostly minor comments

@@ -189,13 +239,16 @@ def __init__(self, observation_space, action_space, config):

# Action outputs
with tf.variable_scope(A_SCOPE, reuse=True):
deterministic_flag = tf.constant(value=False, dtype=tf.bool)
deterministic_flag = tf.constant(value=True, dtype=tf.bool)
Contributor

I guess the value of this doesn't matter since eps is zero?

Contributor Author
@joneswong joneswong Nov 21, 2018

Yes, but the following output_actions are used for the target Q value and for backpropagating gradients to the policy net, so "deterministic" is more natural. eps now applies only to OU. I implemented the Gaussian noise according to the original paper and baselines, where the scale of the action space and eps are not considered. Is it necessary to also apply eps to the Gaussian noise? Let me know what you'd prefer.

Contributor

Ah I see, that sounds fine to not include it.

p_tp1, deterministic_flag, zero_eps)
p_tp1,
tf.constant(value=False, dtype=tf.bool)
if self.config["smooth_target_policy"] else deterministic_flag,
Contributor

This can just be not self.config["smooth_target_policy"], since bools will be cast to tensors, right?

Contributor Author

I guessed that as well, but got "TypeError: pred must not be a Python bool".
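A minimal reproduction of that error (illustrative): tf.cond requires its predicate to be a scalar boolean tensor, not a Python bool.

import tensorflow as tf

x = tf.constant(1.0)
# tf.cond(False, lambda: x, lambda: -x)   # raises TypeError: pred must not be a Python bool
y = tf.cond(tf.constant(False), lambda: x, lambda: -x)  # OK: pred is a tf.bool tensor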

Contributor

Suggested change
if self.config["smooth_target_policy"] else deterministic_flag,
stochastic=tf.constant(not self.config["smooth_target_policy"]),

p_tp1,
tf.constant(value=False, dtype=tf.bool)
if self.config["smooth_target_policy"] else deterministic_flag,
zero_eps,
Contributor

Similarly I think you can just pass 0.0 for zero_eps.

Contributor Author

As above: if we also apply eps to the Gaussian noise, eps can be 0.0 for output_action, but it can NOT be 0.0 for output_action_estimation (i.e., the target action).

@@ -355,7 +447,8 @@ def compute_td_error(self, obs_t, act_t, rew_t, obs_tp1, done_mask,
return td_err

def reset_noise(self, sess):
sess.run(self.reset_noise_op)
if not self.config["use_gaussian_noise"]:
Contributor

This is no longer a valid flag right? Maybe we can always run this since resetting noise is harmless in the other case.

Contributor Author

My fault... replaced.

self._apply_op = self._optimizer.apply_gradients(
self._grads_and_vars,
global_step=self.global_step
if hasattr(self, "global_step") else None)
Contributor

Hm how about we replace this with tf.train.get_or_create_global_step()?

I would also add a note here that it is for TD3.
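A sketch of that suggestion (toy loss and variable, not the merged code): get_or_create_global_step() fetches or creates the graph's global step, and passing it to apply_gradients() increments it on every update, which the TD3 policy-delay mask (tf.mod(global_step, policy_delay) == 0) can then read.

import tensorflow as tf

w = tf.get_variable("w", initializer=1.0)
loss = w ** 2
optimizer = tf.train.AdamOptimizer(1e-3)
global_step = tf.train.get_or_create_global_step()
train_op = optimizer.apply_gradients(
    optimizer.compute_gradients(loss), global_step=global_step)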

# update policy net one time v.s. update critic net `policy_delay` time(s)
actor_loss_coeff = tf.to_float(
tf.equal(tf.mod(global_step, policy_delay), 0))
self.total_loss = actor_loss_coeff * self.actor_loss + self.critic_loss
Contributor

Suggested change
self.total_loss = actor_loss_coeff * self.actor_loss + self.critic_loss
self.total_loss = self.config["actor_loss_coeff"] * actor_loss_mask * self.actor_loss + self.config["critic_loss_coeff"] * self.critic_loss


# === Optimization ===
actor_lr: 0.0001
critic_lr: 0.001
Contributor

We should remove these and have actor_loss_coeff and critic_loss_coeff instead, since these no longer have any effect?

with tf.variable_scope(A_SCOPE, reuse=True):
exploration_sample = tf.get_variable(name="ornstein_uhlenbeck")
self.reset_noise_op = tf.assign(exploration_sample,
self.dim_actions * [.0])
Contributor

else:
self.reset_noise_op = tf.no_op()

@@ -205,14 +258,28 @@ def __init__(self, observation_space, action_space, config):
with tf.variable_scope(Q_SCOPE, reuse=True):
q_tp0, _ = self._build_q_network(self.obs_t, observation_space,
output_actions)
if self.config["twin_q"]:
with tf.variable_scope("twin_" + Q_SCOPE) as scope:
Contributor

Suggested change
with tf.variable_scope("twin_" + Q_SCOPE) as scope:
with tf.variable_scope(TWIN_Q_SCOPE) as scope:


self.loss = self._build_actor_critic_loss(q_t, q_tp1, q_tp0)
if self.config["twin_q"]:
with tf.variable_scope("twin_" + Q_TARGET_SCOPE) as scope:
Contributor

Suggested change
with tf.variable_scope("twin_" + Q_TARGET_SCOPE) as scope:
with tf.variable_scope(TWIN_Q_TARGET_SCOPE) as scope:

@@ -142,6 +188,9 @@ def __init__(self, observation_space, action_space, config):
self.critic_optimizer = tf.train.AdamOptimizer(
Contributor

We should remove the optimizers right?

@ericl
Contributor

ericl commented Nov 21, 2018

@joneswong I'll run some benchmarks later since we have licenses.

@@ -189,13 +239,16 @@ def __init__(self, observation_space, action_space, config):

# Action outputs
with tf.variable_scope(A_SCOPE, reuse=True):
deterministic_flag = tf.constant(value=False, dtype=tf.bool)
deterministic_flag = tf.constant(value=True, dtype=tf.bool)
zero_eps = tf.constant(value=.0, dtype=tf.float32)
output_actions = self._build_action_network(
self.p_t, deterministic_flag, zero_eps)
Contributor

Suggested change
self.p_t, deterministic_flag, zero_eps)
self.p_t, stochastic=tf.constant(True), eps=zero_eps)

@@ -189,13 +239,16 @@ def __init__(self, observation_space, action_space, config):

# Action outputs
with tf.variable_scope(A_SCOPE, reuse=True):
deterministic_flag = tf.constant(value=False, dtype=tf.bool)
deterministic_flag = tf.constant(value=True, dtype=tf.bool)
Contributor

Suggested change
deterministic_flag = tf.constant(value=True, dtype=tf.bool)

@joneswong
Contributor Author

Updated. Judging by the performance, TD3 solves the task more quickly than DDPG, and both are improved because the different learning rates now take effect. I mean that the gradients() method of DDPGPolicyGraph uses actor_loss and critic_loss separately, so we must multiply in the loss coefficients before constructing total_loss.

[figure: ddpg_improvements — comparison of TD3 and DDPG after the latest changes]

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9513/
Test FAILed.

Contributor
@ericl ericl left a comment

Looks good! Some lint errors in tests.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9520/
Test FAILed.

@ericl ericl merged commit 24bfe8a into ray-project:master Nov 22, 2018
RLlib automation moved this from Prioritized to Done Nov 22, 2018
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9527/
Test FAILed.
