
TRPO does not work as well as the OpenAI baselines #64

Closed
befelix opened this issue Jul 27, 2017 · 4 comments
@befelix
Contributor

befelix commented Jul 27, 2017

Here are two pieces of code running on the OpenAI Gym Hopper environment. The first one uses the OpenAI baselines implementation of TRPO, while the second one uses tensorforce, both with the same parameters as far as I could tell (I manually changed the MLP to two hidden layers internally).

The baselines implementation works really well, while the tensorforce implementation does not. Their internal code is a bit of a mess, so there might be some hidden setting somewhere that helps. Do you have any experience with this? It might be worth looking into in order to understand where the performance difference comes from.

https://gist.github.com/befelix/517887234139a5d9ebb5654161fa0b54

Might be related to #26

@michaelschaarschmidt
Contributor

michaelschaarschmidt commented Jul 27, 2017

Two things:

  • They use adaptive step sizes, for which we have not implemented any heuristics yet. I think this has a very significant performance impact, and it also means that just running with the same starting configuration is not very indicative.
  • They use a shared value function, so the baseline/value-function computation is slightly different.

I think it is more likely the adaptive step sizing, and we need a principled way to do that, not a hard-coded heuristic.

Edit: They do adaptive step sizing in the line search, not in the KL divergence (both make sense). I don't see another key difference aside from all the custom MPI stuff. How many runs have you done? I don't have a MuJoCo license here right now, so I can't run this immediately.
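
For reference, here is a minimal sketch of the kind of backtracking line search that adapts the step size in TRPO-style updates. The `surrogate` and `kl` callables are hypothetical placeholders, not tensorforce or baselines APIs, and the acceptance test is simplified relative to the actual baselines code:

```python
import numpy as np

def backtracking_line_search(theta, full_step, surrogate, kl,
                             max_kl=1e-2, backtrack_ratio=0.5, max_backtracks=10):
    """Shrink the proposed step until the surrogate objective improves
    and the KL constraint is satisfied (TRPO-style acceptance test)."""
    f_old = surrogate(theta)
    for i in range(max_backtracks):
        step_frac = backtrack_ratio ** i  # 1.0, 0.5, 0.25, ...
        theta_new = theta + step_frac * full_step
        improved = surrogate(theta_new) - f_old > 0
        within_trust_region = kl(theta_new) <= max_kl
        if improved and within_trust_region:
            return theta_new  # accept the (possibly shrunk) step
    return theta  # no acceptable step found; keep the old parameters

# Example usage with toy callables:
# theta0 = np.zeros(3); step = np.ones(3)
# surrogate = lambda th: -float(np.sum((th - 0.3) ** 2))
# kl = lambda th: 0.01 * float(np.sum(th ** 2))
# theta1 = backtracking_line_search(theta0, step, surrogate, kl)
```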

On a side note, we are working on a benchmarking project to make it easier to run proper benchmarks (and we will try to solve some environments ourselves and provide results).

@michaelschaarschmidt
Contributor

Looking at your code in more detail: since our value function is a separate MLP, it needs far more updates (100, as set before), whereas they share the same network (if I recall correctly). So with the given parameters, the baseline in tensorforce would be way, way off.
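
To illustrate the point, here is a minimal sketch of fitting a stand-alone baseline by regression on empirical returns. It assumes a plain linear model in numpy for brevity (hypothetical, not the tensorforce implementation); the point is that with a separate model the number of fitting epochs matters much more than when the value head shares the policy network:

```python
import numpy as np

def fit_value_function(obs, returns, epochs=100, lr=1e-2):
    """Fit a separate linear value function V(s) = obs @ w by
    gradient descent on the mean-squared error against returns."""
    n, d = obs.shape
    w = np.zeros(d)
    for _ in range(epochs):
        pred = obs @ w
        w -= lr * (obs.T @ (pred - returns)) / n  # gradient of 0.5 * MSE
    return w

# Example: baseline values for advantage estimation
# obs = np.random.randn(1000, 11); returns = obs @ np.random.randn(11)
# w = fit_value_function(obs, returns, epochs=100)
# baseline = obs @ w
```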

@befelix
Contributor Author

befelix commented Jul 27, 2017

It looks to me like they use two different networks: https://github.com/openai/baselines/blob/master/baselines/pposgd/mlp_policy.py#L27

@AlexKuhnle
Member

So it seems that with the last commit TRPO performance is better and much more stable, at least according to first test runs on CartPole. Hence I will close this issue for now. If the more in-depth experiments that we will run in the next few days still don't work well, we'll reopen it.
