DDPG parameters #146

Open
anaypat opened this issue May 30, 2017 · 18 comments
@anaypat

anaypat commented May 30, 2017

Can you please let us know the parameters used for the DDPG algorithm to reproduce the results given in the "Benchmarking Deep Reinforcement Learning for Continuous Control" paper (https://arxiv.org/pdf/1604.06778.pdf)? I couldn't find them in the supplementary material.

@cannontwo

Check here: https://arxiv.org/abs/1509.02971

@anaypat
Author

anaypat commented May 31, 2017

Thanks! However, I couldn't find some parameters such as epoch_length, max_path_length, n_epochs, etc.

Section 7 of https://arxiv.org/abs/1509.02971 mentions that "actions were not included until the 2nd hidden layer of Q". In rllab's DeterministicMLPPolicy class, I couldn't find where the actions are injected into the Q network.

It seems that the critic network architecture is different. Am I missing something?

@cannontwo

I'm not sure what you mean by referring to the DeterministicMLPPolicy, as the code implementing DDPG is here: https://github.com/openai/rllab/blob/master/rllab/algos/ddpg.py. Hopefully that helps.

@anaypat
Author

anaypat commented May 31, 2017

Sorry, I previously referred to the actor network instead of the critic network. I should have mentioned https://github.com/openai/rllab/blob/master/rllab/q_functions/continuous_mlp_q_function.py. rllab does indeed merge the action into the second hidden layer of the critic, as sketched below.
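
For anyone reading later, the structure is: the observation passes through the first hidden layer alone, and the action is concatenated with that layer's output before the second hidden layer. A minimal framework-free sketch of that forward pass (hypothetical names, not rllab's actual code):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def critic_forward(obs, action, params):
    """Q(s, a) with the action entering only at the second hidden layer."""
    h1 = relu(obs @ params["W1"] + params["b1"])    # first hidden layer sees the observation only
    h2_in = np.concatenate([h1, action], axis=-1)   # the action is merged here
    h2 = relu(h2_in @ params["W2"] + params["b2"])  # second hidden layer
    return h2 @ params["W3"] + params["b3"]         # scalar Q-value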

By the way, I still couldn't find values for epoch_length, max_path_length, n_epochs, etc.

@cannontwo

I'm not sure if there are additional parameters implied by your "etc" that are especially important to you, but epoch_length, max_path_length, and n_epochs should have minimal effect on training your RL agent. For DDPG I typically use ~5000 epochs by default, with an epoch_length/max_path_length of 1000, but you will likely want to adjust this to better fit the precise environments that you are working with.

You may notice that in the paper at https://arxiv.org/pdf/1509.02971.pdf the authors report their results in terms of the number of steps of training, which is the important metric for training results. So long as your epochs are not trivially short, you should have a wide margin for the specific parameters (epoch_length, max_path_length, n_epochs) that you mention.

Much more relevant to replicating the authors' results are the parameters to the actor and critic networks and discount values for reinforcement learning, which are reported in the paper linked above.

@anaypat
Author

anaypat commented Jun 1, 2017

This is the code that I used for the half cheetah task:

from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction


def run_task(*_):
    env = normalize(HalfCheetahEnv())

    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers.
        hidden_sizes=(400, 300)
    )

    es = OUStrategy(env_spec=env.spec)

    qf = ContinuousMLPQFunction(env_spec=env.spec)

    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
) 

[progress_half_cheetah plot]

This is the average return per epoch graph. It seems like the agent doesn't learn.

As you can see, I ran it for around 1600 epochs with 1000 steps per epoch, which is around 1.6 million steps. In https://arxiv.org/abs/1509.02971, Fig. 2, the agent learns well before 1 million steps.

Can you please suggest what the issue might be? It seems I need to change some parameters. As you suggested earlier, I kept the architecture, parameters, and update rule the same as in https://arxiv.org/abs/1509.02971 (by the way, this was the default setting).

@dementrock
Member

Can you try setting the hidden_sizes of ContinuousMLPQFunction to also (400, 300)?
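
That is, something like this (the same constructor call that shows up in the follow-up script below):

qf = ContinuousMLPQFunction(
    env_spec=env.spec,
    # match the actor's (400, 300) hidden layers; the default sizes differ
    hidden_sizes=(400, 300)
)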

@anaypat
Author

anaypat commented Jun 5, 2017

Thanks for pointing it out, @dementrock !

[it_works plot]

@anaypat
Author

anaypat commented Jun 5, 2017

Just out of curiosity: the results in https://arxiv.org/abs/1509.02971 (Figure 2) show that the algorithm converges well before 1 million steps. In the above experiment I used max_path_length=500, epoch_length=1000, and n_epochs=3000, and it seems to converge at around 2000 epochs. As I understand it, that works out to 500 * 1000 * 2000 = 1 billion steps. Can I reduce this by setting some of these parameters (max_path_length, epoch_length, n_epochs) appropriately? I'm asking because it would save me training time on other tasks as well. Please let me know if this is task dependent too. Below is the code I used to produce the result in the previous comment.

from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction


def run_task(*_):
    env = normalize(HalfCheetahEnv())

    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers
        hidden_sizes=(400, 300)
    )

    es = OUStrategy(env_spec=env.spec)

    qf = ContinuousMLPQFunction(
        env_spec=env.spec,
        hidden_sizes=(400, 300)
    )

    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
)

@dementrock
Member

epoch_length is the number of time steps per epoch rather than the number of episodes. So what you have should correspond to 1000 * 2000 = 2 million time steps. Also in the DDPG paper I think they might have used a shorter horizon, probably about 250 time steps per episode.
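
In other words (a minimal sketch of the accounting, using the numbers above; epoch_length already counts environment steps, and max_path_length only caps individual episodes):

n_epochs = 2000        # epochs actually run before convergence
epoch_length = 1000    # environment steps collected per epoch
max_path_length = 500  # upper bound on a single episode; it does not multiply the total

total_env_steps = n_epochs * epoch_length
print(total_env_steps)  # 2000000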

@anaypat
Author

anaypat commented Jun 6, 2017

@dementrock Thanks for the clarification. I'll try out a shorter horizon.

@atavakol

What about qf_weight_decay? By default it's set to zero, but in the DDPG paper it's said to be 1e-2. Is there a reason for this? The same question with regard to reward scaling: do we need to use reward scaling for DDPG in rllab, and if so, what value works well? The DDPG paper doesn't mention reward scaling.

@dementrock
Member

I've found weight decay to hurt performance sometimes, but you should experiment with both. For reward scaling use 0.1.

There's no mention in the DDPG paper because the authors implemented the environments themselves, so they could choose to scale the reward when defining them. The environments in rllab, however, were implemented with policy gradient algorithms in mind: those are batch based and can already normalize rewards within each batch.
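
Concretely, that amounts to something like the following in the scripts above (qf_weight_decay is the zero-by-default argument mentioned earlier in this thread; treat the exact values as a starting point to tune):

algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    scale_reward=0.1,     # scale rewards by 0.1, as suggested above
    qf_weight_decay=0.0,  # also try the DDPG paper's 1e-2 and compare
    # ... remaining hyperparameters as in the scripts above
)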

@atavakol

@dementrock I tried reward scaling of 0.1 and 1.0. With 0.1, for Reacher and Hopper I get divergence or plateauing at very bad returns; with 1.0 I was getting better results for both, but for Reacher the agent would start at evaluation rewards of -12 and plateau at -10 or -9, which is far from solved. Any pointers? I'm using all the parameters from the DDPG paper.

@ghost

ghost commented Jul 19, 2017

@dementrock I found that using batch normalization in the policy and value networks hinders performance, contrary to what the DeepMind paper says, on all of the tasks I tried (Half Cheetah, Swimmer, Reacher, and Walker2D). Any insight into why that is the case?

Another question I had is about include_horizon_terminal_transitions. It defaults to False in rllab, but I think DeepMind generally uses True (at least in the Atari environments)?

@dementrock
Member

@aravindsrinivas I've observed the same behavior when using batch norm. I haven't figured out why, but one reason may be that the environments they used aren't exactly the same, and their environments may require more care with normalizing activations (e.g. if the inputs have different ranges). Also, different parameterizations of the batch norm parameters will yield different behaviors when doing the soft target update (e.g. parameterizing the variance vs. the inverse of the variance).

As for include_horizon_terminal_transitions: I haven't found it to matter. When I communicated with Tim Lillicrap (first author of the DDPG paper) earlier, he indicated that they did not include such transitions.
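
For reference on the batch norm point: the soft target update blends the learned parameters into the target parameters a little at each step, so the quantity you parameterize (variance vs. inverse variance) determines what is actually being averaged. A minimal sketch, not rllab's code:

import numpy as np

def soft_update(target, source, tau=0.001):
    """Polyak-average source parameters into target parameters."""
    return {k: (1.0 - tau) * target[k] + tau * source[k] for k in target}

# Averaging a variance vs. its inverse gives different effective targets:
var, target_var = 4.0, 1.0
inv, target_inv = 1.0 / var, 1.0 / target_var
tau = 0.5  # exaggerated for illustration
print(soft_update({"v": target_var}, {"v": var}, tau)["v"])        # 2.5
print(1.0 / soft_update({"v": target_inv}, {"v": inv}, tau)["v"])  # 1.6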

@LilianaNYC

> Thanks for pointing it out, @dementrock !
>
> [it_works plot]

Hi anaypat,

I am working on something very similar but in a different environment. I was just curious how you were able to plot the average reward after training the algorithm. If you could point me in the right direction, I would appreciate it.

Thanks!
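
(For reference, a minimal sketch of one way to do this, assuming rllab's default CSV logging writes a progress.csv into the run's log directory; the path and column name below are assumptions to check against your own logs:)

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path: replace with your run's log directory.
progress = pd.read_csv("path/to/your_experiment/progress.csv")
print(progress.columns)  # inspect the logged keys first

# "AverageReturn" is the column I would expect for the evaluation return; adjust if named differently.
plt.plot(progress["AverageReturn"])
plt.xlabel("Epoch")
plt.ylabel("Average return")
plt.show()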

@DongChen06

@LilianaNYC Have you figured out how to solve this? By averaging over random seeds?
