DDPG parameters #146

Open
anaypat opened this issue May 30, 2017 · 18 comments
@anaypat

anaypat commented May 30, 2017

Can you please let us know the parameters used for the DDPG algorithm to reproduce the results given in the "Benchmarking Deep Reinforcement Learning for Continuous Control" paper (https://arxiv.org/pdf/1604.06778.pdf)? I couldn't find them in the supplementary material.

@cannontwo

Check here: https://arxiv.org/abs/1509.02971

@anaypat
Author

anaypat commented May 31, 2017

Thanks! However, I couldn't find some parameters such as epoch_length, max_path_length, n_epochs, etc.

Section 7 of https://arxiv.org/abs/1509.02971 mentions that "actions were not included until the 2nd hidden layer of Q". In rllab's DeterministicMLPPolicy class, I couldn't find where the actions are injected into the Q network.

It seems that the critic network architecture is different. Am I missing something?

@cannontwo

I'm not sure what you mean by referring to the DeterministicMLPPolicy, as the code implementing DDPG is here: https://github.com/openai/rllab/blob/master/rllab/algos/ddpg.py. Hopefully that helps.

@anaypat
Author

anaypat commented May 31, 2017

Sorry, I previously referred to the actor network instead of the critic network. I should have mentioned https://github.com/openai/rllab/blob/master/rllab/q_functions/continuous_mlp_q_function.py. rllab does indeed merge the action into the second hidden layer of the critic, as sketched below.
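
For anyone reading later, the structure is: the observation passes through the first hidden layer alone, and the action is concatenated with that layer's output before the second hidden layer. A minimal framework-free sketch of that forward pass (hypothetical names, not rllab's actual code):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def critic_forward(obs, action, params):
    """Q(s, a) with the action entering only at the second hidden layer."""
    h1 = relu(obs @ params["W1"] + params["b1"])    # first hidden layer sees the observation only
    h2_in = np.concatenate([h1, action], axis=-1)   # the action is merged here
    h2 = relu(h2_in @ params["W2"] + params["b2"])  # second hidden layer
    return h2 @ params["W3"] + params["b3"]         # scalar Q-value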

By the way, I still couldn't find values for epoch_length, max_path_length, n_epochs, etc.

@cannontwo

I'm not sure if there are additional parameters implied by your "etc" that are especially important to you, but epoch_length, max_path_length, and n_epochs should have minimal effect on training your RL agent. For DDPG I typically use ~5000 epochs by default, with an epoch_length/max_path_length of 1000, but you will likely want to adjust this to better fit the precise environments that you are working with.

You may notice that in the paper at https://arxiv.org/pdf/1509.02971.pdf the authors report their results in terms of the number of steps of training, which is the important metric for training results. So long as your epochs are not trivially short, you should have a wide margin for the specific parameters (epoch_length, max_path_length, n_epochs) that you mention.

Much more relevant to replicating the authors' results are the parameters to the actor and critic networks and discount values for reinforcement learning, which are reported in the paper linked above.

@anaypat
Author

anaypat commented Jun 1, 2017

This is the code that I used for the half cheetah task:

from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction


def run_task(*_):
    env = normalize(HalfCheetahEnv())

    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers.
        hidden_sizes=(400, 300)
    )

    es = OUStrategy(env_spec=env.spec)

    qf = ContinuousMLPQFunction(env_spec=env.spec)

    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
) 

[progress_half_cheetah plot]

This is the average return per epoch graph. It seems like the agent doesn't learn.

As you can see, I ran it for around 1600 epochs with 1000 steps per epoch, which is around 1.6 million steps. In https://arxiv.org/abs/1509.02971, Fig. 2, the agent learns well before 1 million steps.

Can you please suggest what the issue might be? It seems I need to change some parameters. As you suggested earlier, I kept the architecture, parameters, and update rule the same as in https://arxiv.org/abs/1509.02971 (by the way, this was the default setting).

@dementrock
Member

Can you try setting the hidden_sizes of ContinuousMLPQFunction to also (400, 300)?
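
That is, something like this (the same constructor call that shows up in the follow-up script below):

qf = ContinuousMLPQFunction(
    env_spec=env.spec,
    # match the actor's (400, 300) hidden layers; the default sizes differ
    hidden_sizes=(400, 300)
)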

@anaypat
Author

anaypat commented Jun 5, 2017

Thanks for pointing it out, @dementrock !

[it_works plot]

@anaypat
Author

anaypat commented Jun 5, 2017

Just out of curiosity: the results in https://arxiv.org/abs/1509.02971 (Figure 2) show that the algorithm converges well before 1 million steps. In the above experiment I used max_path_length=500, epoch_length=1000, and n_epochs=3000, and it seems to converge at around 2000 epochs. As I understand it, that works out to 500 * 1000 * 2000 = 1 billion steps. Can I reduce this by setting some of these parameters (max_path_length, epoch_length, n_epochs) appropriately? I'm asking because it would save me training time on other tasks as well. Please let me know if this is task dependent too. Below is the code I used to produce the result in the previous comment.

from rllab.algos.ddpg import DDPG
from rllab.envs.mujoco.half_cheetah_env import HalfCheetahEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction


def run_task(*_):
    env = normalize(HalfCheetahEnv())

    policy = DeterministicMLPPolicy(
        env_spec=env.spec,
        # The neural network policy should have two hidden layers
        hidden_sizes=(400, 300)
    )

    es = OUStrategy(env_spec=env.spec)

    qf = ContinuousMLPQFunction(
        env_spec=env.spec,
        hidden_sizes=(400, 300)
    )

    algo = DDPG(
        env=env,
        policy=policy,
        es=es,
        qf=qf,
        batch_size=64,
        max_path_length=500,
        epoch_length=1000,
        min_pool_size=10000,
        n_epochs=3000,
        discount=0.99,
        scale_reward=0.1,
        qf_learning_rate=1e-3,
        policy_learning_rate=1e-4,
        # Uncomment both lines (this and the plot parameter below) to enable plotting
        # plot=True,
    )
    algo.train()

run_experiment_lite(
    run_task,
    # Number of parallel workers for sampling
    n_parallel=4,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
)

@dementrock
Member

epoch_length is the number of time steps per epoch rather than the number of episodes. So what you have should correspond to 1000 * 2000 = 2 million time steps. Also in the DDPG paper I think they might have used a shorter horizon, probably about 250 time steps per episode.
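
In other words (a minimal sketch of the accounting, using the numbers above; epoch_length already counts environment steps, and max_path_length only caps individual episodes):

n_epochs = 2000        # epochs actually run before convergence
epoch_length = 1000    # environment steps collected per epoch
max_path_length = 500  # upper bound on a single episode; it does not multiply the total

total_env_steps = n_epochs * epoch_length
print(total_env_steps)  # 2000000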

@anaypat
Author

anaypat commented Jun 6, 2017

@dementrock Thanks for the clarification. I'll try out a shorter horizon.

@atavakol

What about qf_weight_decay? By default it's set to zero, but in the DDPG paper it's said to be 1e-2. Is there a reason for this? The same question with regard to reward scaling: do we need to use reward scaling for DDPG in rllab, and if so, what value works well? The DDPG paper doesn't mention reward scaling.

@dementrock
Member

I've found weight decay to hurt performance sometimes, but you should experiment with both. For reward scaling use 0.1.

There's no mention in the DDPG paper because the authors implemented the environments themselves, so they could choose to scale the reward when defining them. The environments in rllab, however, were implemented with policy gradient algorithms in mind: those are batch based and can already normalize rewards within each batch.
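
Concretely, that amounts to something like the following in the scripts above (qf_weight_decay is the zero-by-default argument mentioned earlier in this thread; treat the exact values as a starting point to tune):

algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    scale_reward=0.1,     # scale rewards by 0.1, as suggested above
    qf_weight_decay=0.0,  # also try the DDPG paper's 1e-2 and compare
    # ... remaining hyperparameters as in the scripts above
)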

@atavakol

@dementrock I tried reward scaling of 0.1 and 1.0. With 0.1, for Reacher and Hopper I get divergence or plateauing at very bad returns; with 1.0 I was getting better results for both, but for Reacher the agent would start at evaluation rewards of -12 and plateau at -10 or -9, which is far from solved. Any pointers? I'm using all the parameters from the DDPG paper.

@ghost

ghost commented Jul 19, 2017

@dementrock I found that using batch normalization in the policy and value networks hinders performance, contrary to what the DeepMind paper says, on all of the tasks I tried (Half Cheetah, Swimmer, Reacher, and Walker2D). Any insight into why that is the case?

Another question I had is about include_horizon_terminal_transitions. It defaults to False in rllab, but I think DeepMind generally uses True (at least in the Atari environments)?

@dementrock
Member

@aravindsrinivas I've observed the same behavior when using batch norm. I haven't figured out why, but one reason may be that the environments they used aren't exactly the same, and their environments may require more care with normalizing activations (e.g. if the inputs have different ranges). Also, different parameterizations of the batch norm parameters will yield different behaviors when doing the soft target update (e.g. parameterizing the variance vs. the inverse of the variance).

As for include_horizon_terminal_transitions: I haven't found it to matter. When I communicated with Tim Lillicrap (first author of the DDPG paper) earlier, he indicated that they did not include such transitions.
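
For reference on the batch norm point: the soft target update blends the learned parameters into the target parameters a little at each step, so the quantity you parameterize (variance vs. inverse variance) determines what is actually being averaged. A minimal sketch, not rllab's code:

import numpy as np

def soft_update(target, source, tau=0.001):
    """Polyak-average source parameters into target parameters."""
    return {k: (1.0 - tau) * target[k] + tau * source[k] for k in target}

# Averaging a variance vs. its inverse gives different effective targets:
var, target_var = 4.0, 1.0
inv, target_inv = 1.0 / var, 1.0 / target_var
tau = 0.5  # exaggerated for illustration
print(soft_update({"v": target_var}, {"v": var}, tau)["v"])        # 2.5
print(1.0 / soft_update({"v": target_inv}, {"v": inv}, tau)["v"])  # 1.6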

@LilianaNYC

> Thanks for pointing it out, @dementrock !
>
> [it_works plot]

Hi anaypat,

I am working on something very similar but in a different environment. I was just curious how you were able to plot the average reward after training the algorithm. If you could point me in the right direction, I would appreciate it.

Thanks!
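
(For reference, a minimal sketch of one way to do this, assuming rllab's default CSV logging writes a progress.csv into the run's log directory; the path and column name below are assumptions to check against your own logs:)

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path: replace with your run's log directory.
progress = pd.read_csv("path/to/your_experiment/progress.csv")
print(progress.columns)  # inspect the logged keys first

# "AverageReturn" is the column I would expect for the evaluation return; adjust if named differently.
plt.plot(progress["AverageReturn"])
plt.xlabel("Epoch")
plt.ylabel("Average return")
plt.show()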

@DongChen06

@LilianaNYC Have you figured out how to solve this? By averaging over random seeds?
