
TRPO does not work as well as the OpenAI baselines #64

Closed
befelix opened this issue Jul 27, 2017 · 4 comments
@befelix
Contributor

befelix commented Jul 27, 2017

Here are two pieces of code running on the OpenAI Gym Hopper environment. The first one uses the OpenAI baselines implementation of TRPO, while the second one uses tensorforce, both with the same parameters as far as I could tell (I manually changed the MLP to two hidden layers internally).

The baselines implementation works really well, while the tensorforce implementation does not. Their internal code is a bit of a mess, so there might be some hidden setting somewhere that helps. Do you have any experience with this? It might be worth looking into in order to understand where the performance difference comes from.

https://gist.github.com/befelix/517887234139a5d9ebb5654161fa0b54

Might be related to #26

@michaelschaarschmidt
Contributor

michaelschaarschmidt commented Jul 27, 2017

Two things:

  • They use adaptive step sizes, for which we have not implemented any heuristics yet. I think this has a very significant performance impact, and it also means that just running with the same starting configuration is not very indicative.
  • They use a shared value function, so the baseline/value-function computation is slightly different.

I think it is more likely the adaptive step sizing, and we need a principled way to do that, not a hard-coded heuristic.

Edit: They do adaptive step sizing in the line search, not in the KL divergence (both make sense). I don't see another key difference aside from all the custom MPI stuff. How many runs have you done? I don't have a MuJoCo license here right now, so I can't run this immediately.
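
For reference, here is a minimal sketch of the kind of backtracking line search that adapts the step size in TRPO-style updates. The `surrogate` and `kl` callables are hypothetical placeholders, not tensorforce or baselines APIs, and the acceptance test is simplified relative to the actual baselines code:

```python
import numpy as np

def backtracking_line_search(theta, full_step, surrogate, kl,
                             max_kl=1e-2, backtrack_ratio=0.5, max_backtracks=10):
    """Shrink the proposed step until the surrogate objective improves
    and the KL constraint is satisfied (TRPO-style acceptance test)."""
    f_old = surrogate(theta)
    for i in range(max_backtracks):
        step_frac = backtrack_ratio ** i  # 1.0, 0.5, 0.25, ...
        theta_new = theta + step_frac * full_step
        improved = surrogate(theta_new) - f_old > 0
        within_trust_region = kl(theta_new) <= max_kl
        if improved and within_trust_region:
            return theta_new  # accept the (possibly shrunk) step
    return theta  # no acceptable step found; keep the old parameters

# Example usage with toy callables:
# theta0 = np.zeros(3); step = np.ones(3)
# surrogate = lambda th: -float(np.sum((th - 0.3) ** 2))
# kl = lambda th: 0.01 * float(np.sum(th ** 2))
# theta1 = backtracking_line_search(theta0, step, surrogate, kl)
```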

On a side note, we are working on a benchmarking project to make it easier to run proper benchmarks (and we will try to solve some environments ourselves and provide results).

@michaelschaarschmidt
Contributor

Looking at your code in more detail: since our value function is a separate MLP, it needs far more updates (100, as set before), whereas they share the same network (if I recall correctly). So with the given parameters, the baseline in tensorforce would be way, way off.
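
To illustrate the point, here is a minimal sketch of fitting a stand-alone baseline by regression on empirical returns. It assumes a plain linear model in numpy for brevity (hypothetical, not the tensorforce implementation); the point is that with a separate model the number of fitting epochs matters much more than when the value head shares the policy network:

```python
import numpy as np

def fit_value_function(obs, returns, epochs=100, lr=1e-2):
    """Fit a separate linear value function V(s) = obs @ w by
    gradient descent on the mean-squared error against returns."""
    n, d = obs.shape
    w = np.zeros(d)
    for _ in range(epochs):
        pred = obs @ w
        w -= lr * (obs.T @ (pred - returns)) / n  # gradient of 0.5 * MSE
    return w

# Example: baseline values for advantage estimation
# obs = np.random.randn(1000, 11); returns = obs @ np.random.randn(11)
# w = fit_value_function(obs, returns, epochs=100)
# baseline = obs @ w
```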

@befelix
Contributor Author

befelix commented Jul 27, 2017

It looks to me like they use two different networks: https://github.com/openai/baselines/blob/master/baselines/pposgd/mlp_policy.py#L27

@AlexKuhnle
Member

So it seems that with the last commit TRPO performance is better and much more stable, at least according to first test runs on CartPole. Hence I will close this issue for now. If the more in-depth experiments that we will run in the next few days still don't work well, we'll reopen it.
