Question about results in the paper #43

Open · lchenat opened this issue Oct 5, 2016 · 8 comments

@lchenat (Contributor) commented Oct 5, 2016

Hi, I recently tried to reproduce the experimental results in your paper, and I found that some of my results differ from those reported. Did you use the default parameters for all algorithms when you ran the experiments?

lchenat closed this as completed Oct 5, 2016
lchenat reopened this Oct 5, 2016

@dementrock (Member)

Hi @lchenat, all the parameters should be documented in the appendix of the paper. However, it is not guaranteed that you will get exactly the same results, due to differences in random seeds. I'd be happy to assist if you observe significant discrepancies.

@lchenat (Contributor, Author) commented Oct 10, 2016

I did not find the parameters for DDPG in the appendix of the paper. I ran the following code, and the maximum per-iteration average return was no more than 2400:

import re
import numpy
import sys
from subprocess import call
from rllab.algos.vpg import VPG
from rllab.algos.tnpg import TNPG
from rllab.algos.erwr import ERWR
from rllab.algos.reps import REPS
from rllab.algos.trpo import TRPO
from rllab.algos.cem import CEM
from rllab.algos.cma_es import CMAES
from rllab.algos.ddpg import DDPG
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.misc.instrument import stub, run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction

path = "/home/data/lchenat/rllab-master/data/local/experiment/"
exp_name = "test_cartpole_again_ddpg"

stub(globals())

env = normalize(CartpoleEnv())

policy = DeterministicMLPPolicy(
    env_spec=env.spec,
    hidden_sizes=(400, 300)
)
es = OUStrategy(env_spec=env.spec)
qf = ContinuousMLPQFunction(env_spec=env.spec)
algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    n_epochs=600,
)

# delete the previous data
call(["rm", "-rf", path + exp_name])

run_experiment_lite(
    algo.train(),
    n_parallel=1,
    snapshot_mode="last",
    # seed=1,
    exp_name=exp_name,
    # plot=True,
)

@lchenat (Contributor, Author) commented Oct 10, 2016

By the way, is there a function that computes the metric defined in the paper (the average over all iterations and all trajectories)? The debug log only provides the average return for each iteration, and for some algorithms the number of trajectories per iteration is not logged.

@dementrock (Member) commented Oct 11, 2016

Also, as mentioned in the paper (we probably should have been clearer), we scaled all the rewards by 0.1 when running DDPG. Refer to https://github.com/openai/rllab/blob/master/rllab/algos/ddpg.py#L112

In general, we found this parameter to be very important, but due to time constraints we were not able to tune it extensively. You may try other values on other tasks, which may give you even better results.

Re the second question: I think we did a very crude approximation and simply averaged the results over all iterations (treating it as if all iterations had the same number of trajectories). Feel free to submit a pull request that adds additional logging.
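
For reference, a minimal sketch of that crude approximation, assuming the per-iteration average returns end up in rllab's progress.csv under an AverageReturn column (the column name and log location here are assumptions; adjust them to whatever your logger actually writes):

import csv

# Crude approximation of the paper's metric: average the per-iteration
# average returns, treating every iteration as if it contributed the same
# number of trajectories.
def approximate_metric(progress_csv_path):
    with open(progress_csv_path) as f:
        returns = [float(row["AverageReturn"]) for row in csv.DictReader(f)]
    return sum(returns) / len(returns)

print(approximate_metric(
    "/home/data/lchenat/rllab-master/data/local/experiment/"
    "test_cartpole_again_ddpg/progress.csv"))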

@lchenat (Contributor, Author) commented Oct 12, 2016

I have scaled the reward by 0.1, but I still get returns of around 2500. Are there any other parameters that I need to tune?

@dementrock (Member)

Oh, you should change the max path length in DDPG to 500. Otherwise, the optimal score is 2500!
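
For concreteness, a sketch of the DDPG construction from the script above with both suggestions applied (the 0.1 reward scaling and the 500 max path length); scale_reward and max_path_length are assumed here to be the constructor arguments that the linked ddpg.py exposes for these settings, so double-check the signature in your checkout:

algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    n_epochs=600,
    max_path_length=500,  # raises the per-episode cap, so the optimal score becomes 5000
    scale_reward=0.1,     # reward scaling used for the paper's DDPG runs
)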

@lchenat (Contributor, Author) commented Oct 13, 2016

Yes, the optimal score increased to 5000 after I changed the max path length to 500, but the average over all iterations is around 3100. Here are the average returns extracted from debug.log:

[85.1877, 22.4833, 22.2935, 22.4445, 22.561, 22.3393, 22.8141, 22.2145, 22.2697, 22.3604, 100.441, 177.388, 196.363, 183.331, 223.452, 272.554, 293.124, 407.079, 535.813, 619.828, 695.468, 872.355, 1028.65, 952.744, 645.209, 846.002, 601.686, 607.632, 656.687, 697.427, 715.399, 646.103, 646.78, 621.531, 609.173, 629.381, 598.768, 633.524, 603.093, 692.313, 627.032, 665.51, 671.895, 678.046, 721.31, 670.6, 645.387, 603.164, 594.49, 617.101, 676.009, 634.184, 627.533, 658.008, 700.695, 684.835, 622.859, 596.207, 691.321, 615.621, 612.777, 573.243, 598.272, 611.166, 596.099, 598.044, 551.066, 636.267, 740.511, 599.541, 605.533, 615.751, 710.193, 662.288, 619.205, 661.016, 582.386, 582.968, 601.911, 653.29, 617.729, 651.414, 744.331, 714.654, 658.312, 804.903, 841.202, 925.207, 855.179, 1044.97, 895.128, 936.976, 1066.89, 1406.07, 2131.26, 4021.35, 1814.43, 1877.28, 1512.61, 1993.6, 1686.47, 1991.07, 3476.89, 4138.7, 2385.71, 3379.73, 2648.44, 2970.91, 4008.72, 4683.97, 3603.48, 4999.14, 4999.04, 4998.86, 2328.25, 4534.03, 4999.28, 4999.24, 4998.56, 4283.28, 4998.47, 4998.89, 4998.86, 2223.49, 4999.18, 2702.06, 4998.8, 4998.67, 4999.02, 4998.57, 4999.6, 4998.84, 4998.5, 4998.65, 2449.9, 2153.85, 2034.24, 1275.76, 1394.86, 2258.75, 4557.9, 4998.51, 4998.52, 4998.37, 4998.73, 4998.16, 4997.71, 4997.81, 4583.94, 4998.32, 4998.46, 4998.38, 4998.21, 4804.9, 4997.79, 4998.41, 4998.03, 4998.44, 4998.26, 4998.16, 4998.07, 4998.21, 4997.73, 4998.04, 4997.81, 4998.3, 4998.33, 4998.2, 4998.27, 4998.15, 4998.6, 4998.23, 4998.63, 4998.58, 4998.57, 4999.11, 4999.32, 4999.47, 4999.41, 4790.46, 4999.45, 4999.45, 4999.57, 4999.45, 4781.79, 4999.5, 4999.46, 2834.94, 2667.89, 4999.43, 4879.07, 4999.51, 4999.5, 4256.07, 4999.24, 3749.83, 3140.73, 2184.49, 3293.37, 4276.64, 4570.93, 4549.38, 4448.15, 4999.32, 4608.16, 4999.52, 4999.38, 4999.16, 4999.43, 4790.45, 4999.54, 4724.55, 4999.43, 4627.56, 4999.58, 4999.45, 4272.88, 4999.26, 4999.38, 4784.83, 4731.7, 4696.11, 4427.15, 4165.41, 4906.99, 4422.53, 3953.47, 3692.44, 4123.02, 4571.29, 4450.07, 4999.32, 4859.32, 4999.44, 4498.9, 4895.5, 4999.22, 4589.09, 4998.88, 4733.38, 4775.73, 4999.29, 4999.18, 4640.48, 4610.55, 4935.44, 4999.2, 4883.15, 4852.51, 4900.67, 4835.74, 4500.04, 4738.27, 4531.23, 4530.79, 4999.0, 4999.18, 3974.69, 4797.54, 4998.95, 4000.32, 3699.98, 3424.3, 4998.86, 4003.68, 4878.38, 4915.73, 4763.66, 4998.63, 4688.21, 4998.92, 4926.33, 3244.25, 4507.45, 4998.75, 4998.79, 4998.45, 3060.27, 2583.36, 2717.86, 2005.12, 4911.39, 4998.91, 4998.66, 4660.82, 4789.71, 4998.43, 4998.52, 4884.03, 4541.58, 4998.37]
291 results, average 3179.83032646

The average return drops from 5000 down to 2000-3000 from time to time; is that a normal phenomenon with DDPG?

@dementrock (Member)

@lchenat The benchmark results were run over 25 million samples, to match the sample complexity used by the other algorithms. This should correspond to roughly 2500 epochs. A good approximation would be to extrapolate the performance of the last few epochs out to the same number of samples, and to compute the average return over all of that data.

I have also observed that DDPG is sometimes unstable, even on cartpole. What you're getting seems about right. One thing we didn't try was batch normalization, which we could not get working before the paper deadline; that could be a good thing to try. You can also try other reward scalings (e.g. 0.01), which might stabilize learning further.
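
A minimal sketch of that extrapolation, assuming every epoch uses the same number of samples (so 25 million samples is roughly 2500 epochs) and taking "the last few epochs" to be an arbitrary tail of 10:

# Extend the observed per-epoch average returns out to ~2500 epochs by
# repeating the mean of the last few epochs, then average everything.
def extrapolated_average(returns, target_epochs=2500, tail=10):
    tail_mean = sum(returns[-tail:]) / min(tail, len(returns))
    extended = list(returns) + [tail_mean] * max(0, target_epochs - len(returns))
    return sum(extended) / len(extended)

# e.g. extrapolated_average(average_returns), where average_returns is the
# list of 291 per-epoch returns posted above.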
