TRPO does not work as the OpenAI baselines #64
Two things:

1. I think the difference is more likely the adaptive step sizing, and we need a principled way to do that, not a hardcoded heuristic. Edit: they do adaptive step sizing in the line search, not on the KL divergence (both make sense).
2. Aside from all the custom MPI code, I don't see another key difference. How many runs have you done? I don't have a MuJoCo license here right now, so I can't test this immediately.

On a side note, we are working on a benchmarking project to make it easier to run proper benchmarks (we will also try to solve some environments ourselves and provide results).
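For reference, the adaptive step sizing in question can be sketched as a backtracking line search: shrink the proposed natural-gradient step until the KL constraint is satisfied and the surrogate objective actually improves. This is a minimal illustrative sketch (function names and signatures are hypothetical, not the actual baselines or tensorforce code):

```python
import numpy as np

def backtracking_line_search(params, full_step, surrogate_fn, kl_fn,
                             max_kl=0.01, max_backtracks=10):
    """Shrink the step until the KL constraint holds and the surrogate
    objective improves; otherwise fall back to the old parameters.
    surrogate_fn and kl_fn are placeholders for the policy's surrogate
    objective and KL divergence to the old policy."""
    base_surr = surrogate_fn(params)
    # Try the full step first, then geometrically shrink it.
    for step_frac in 0.5 ** np.arange(max_backtracks):
        candidate = params + step_frac * full_step
        improvement = surrogate_fn(candidate) - base_surr
        if kl_fn(candidate) <= max_kl and improvement > 0:
            return candidate  # accept the (possibly shrunk) step
    return params  # every candidate rejected: keep old parameters
```

A hardcoded heuristic would instead apply a fixed step fraction regardless of the measured KL, which is the kind of scheme the comment argues against.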
Looking at your code in more detail: since our value function is a separate MLP, it needs far more updates (the 100 set previously), whereas they share the same network (if I recall correctly). So with the given parameters, the baseline in tensorforce would be way, way off.
It looks to me like they use two different networks: https://github.com/openai/baselines/blob/master/baselines/pposgd/mlp_policy.py#L27
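To make the update-count point concrete, here is a toy sketch of a value function kept in its own model and fitted with many regression steps per batch. The class and its parameters are hypothetical (this is not tensorforce's or baselines' actual API); a linear model stands in for the separate MLP:

```python
import numpy as np

class SeparateValueBaseline:
    """Toy separate value-function model (linear, standing in for an MLP).
    Because it does not share features with the policy network, it relies
    on many update passes per batch to fit the returns."""

    def __init__(self, obs_dim, lr=0.05, num_updates=100):
        self.w = np.zeros(obs_dim)
        self.lr = lr
        self.num_updates = num_updates  # e.g. the 100 updates mentioned above

    def predict(self, obs):
        return obs @ self.w

    def fit(self, obs, returns):
        # Repeated least-squares gradient steps on the same batch.
        for _ in range(self.num_updates):
            error = self.predict(obs) - returns
            grad = obs.T @ error / len(returns)
            self.w -= self.lr * grad
        return float(np.mean((self.predict(obs) - returns) ** 2))
```

With only a handful of updates the fitted baseline stays far from the returns, so the advantage estimates it produces would be badly off, which is the failure mode described above.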
With the last commit, TRPO performance seems better and much more stable, at least according to first test runs on CartPole, so I will close this issue for now. If the more in-depth experiments we will run over the next few days still don't work well, we'll reopen it.
Here are two pieces of code running on the OpenAI Hopper environment. The first uses the OpenAI baselines implementation of TRPO, while the second uses tensorforce, both with the same parameters as far as I could determine (I manually changed the MLP layer size to two internally).
The baselines implementation works really well, while the tensorforce implementation does not. Their internal code is a bit of a mess, so there might be some hidden setting somewhere that helps. Do you have any experience with this? It might be worth looking into to understand where the performance difference comes from.
https://gist.github.com/befelix/517887234139a5d9ebb5654161fa0b54
Might be related to #26