[Feature Request] Implement TRPO #38

Closed · cyprienc opened this issue Aug 7, 2021 · 12 comments · Fixed by #40
Labels: enhancement (New feature or request)

cyprienc (Contributor) commented Aug 7, 2021

Hi,

I've started working on implementing TRPO: https://github.com/cyprienc/stable-baselines3-contrib/blob/master/sb3_contrib/trpo/trpo.py

I am currently facing a bug when computing the step direction and the maximal step length via the matrix-vector product with the Fisher information matrix: the denominator of beta is sometimes negative.

I suspect the Hessian in the Hessian-vector product used for the Conjugate Gradient algorithm is wrong (see implementation):

def Hpv(v, retain_graph=True):
    # Hessian-vector product: gradient of (grad_kl . v) w.r.t. the policy parameters
    jvp = (grad_kl * v).sum()
    return flat_grad(jvp, params, retain_graph=retain_graph)
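
For reference, a minimal sketch of how this Fisher-vector product is typically assembled with double backprop. The toy policy, the KL value and the damping coefficient below are placeholders, not taken from trpo.py; a common pitfall with this construction is forgetting create_graph=True when computing grad_kl.

import torch as th
import torch.nn as nn

# Toy stand-ins (not the trpo.py objects): a tiny policy and a differentiable
# mean KL between a detached "old" distribution and the current one, just so
# the double-backprop pattern below runs on its own.
policy = nn.Linear(4, 2)
logits = policy(th.randn(8, 4))
old_log_probs = th.log_softmax(logits, dim=-1).detach()
new_log_probs = th.log_softmax(logits, dim=-1)
kl_div = (old_log_probs.exp() * (old_log_probs - new_log_probs)).sum(dim=-1).mean()

params = [p for p in policy.parameters() if p.requires_grad]

# create_graph=True is essential: without it, the second backward pass
# through grad_kl has nothing to differentiate.
grads = th.autograd.grad(kl_div, params, create_graph=True)
grad_kl = th.cat([g.reshape(-1) for g in grads])

def fisher_vector_product(v, damping=0.1, retain_graph=True):
    # F v = d/dtheta [ (dKL/dtheta) . v ], plus damping * v for conditioning
    jvp = (grad_kl * v).sum()
    hvp = th.autograd.grad(jvp, params, retain_graph=retain_graph)
    return th.cat([g.reshape(-1) for g in hvp]) + damping * v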

Could I have built the graph used to compute grad_kl incorrectly?
If someone spots something out of place, please let me know.
Thanks,

Cyprien

PS: Here is a snippet to run the code:

import gym

from sb3_contrib.trpo.policies import MlpPolicy
from sb3_contrib.trpo.trpo import TRPO

env = gym.make("CartPole-v1")

model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

env.close()
cyprienc changed the title from "Implement TRPO" to "[Feature Request] Implement TRPO" on Aug 7, 2021
Miffyli (Member) commented Aug 9, 2021

This is a wild guess, but since you mentioned grad_kl: it looks like this code uses a single-sample estimate of the KL (and then averages over samples), which is known to sometimes return negative values for the KL (see the lengthy discussion and update on SB3 here). This is simply based on the "something is negative but shouldn't be" part :D .
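
To make that concrete, a small toy comparison of the two estimators (illustrative values only, not from the linked code; the second, always non-negative form is the kind of estimator the SB3 update moved to):

import torch as th

# Toy log-probabilities of the same actions under the old and the new policy
# (made-up values, just to make the comparison runnable).
old_log_prob = th.tensor([0.5, 0.3, 0.2]).log()
log_prob = th.tensor([0.45, 0.35, 0.2]).log()

# Single-sample estimator averaged over the batch: unbiased, but individual
# terms (and sometimes the whole mean) can come out negative, as it does here.
approx_kl_naive = (old_log_prob - log_prob).mean()

# Lower-variance estimator: each term exp(r) - 1 - r is >= 0, so the mean
# can never be negative.
log_ratio = log_prob - old_log_prob
approx_kl = (th.exp(log_ratio) - 1 - log_ratio).mean()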

The only other tip I can give is to look at other implementations of TRPO and see what they did, e.g. Spinning Up (alas, they too only have a TF1 version of TRPO).

araffin added the enhancement label on Aug 9, 2021
araffin self-assigned this on Aug 9, 2021
cyprienc (Contributor, Author) commented Aug 9, 2021

Hi,

I've added an assert slightly earlier, inside the conjugate gradient algorithm, but it points in the same direction: the matrix defined in Hpv is not positive-definite.
[screenshot of the failing assertion omitted]

@Miffyli Thanks for the approximation trick - neat one. I'll have a look at it (and its gradients :) ). Other implementations usually use a distribution object (custom or from one of the major frameworks) that computes the KL directly. I wanted to do the same but wasn't sure where to find a distribution object for the policy that is passed in - let me have a better look at it.

Thanks,

Cyprien

cyprienc (Contributor, Author) commented:

@araffin What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as its last output? This would give the training loop access to the underlying Distribution.distribution object, so the analytical KL divergence could be computed instead of a sample estimate.
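
As an illustration of what that would enable (plain torch.distributions objects here, not the actual SB3 classes):

import torch as th
from torch.distributions import Categorical, kl_divergence

# Old policy: parameters detached, so no gradient flows into it.
old_dist = Categorical(logits=th.tensor([[1.0, 0.5, -0.2]]))

# New policy: parameters still attached to the graph.
new_logits = th.tensor([[0.8, 0.6, -0.1]], requires_grad=True)
new_dist = Categorical(logits=new_logits)

# Exact KL(old || new), differentiable w.r.t. new_logits only.
kl = kl_divergence(old_dist, new_dist).mean()
kl.backward()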

Also, the side effect in Distribution.proba_distribution, which is called by ActorCriticPolicy._get_action_dist_from_latent, means it is not possible to compute the detached old distribution with ActorCriticPolicy.evaluate_actions inside a no_grad block: the call overwrites the parameters of the new distribution (in the image below, the parameters of distribution are replaced by the ones from old_distribution because of this side effect).
[screenshot omitted]

On a side note, PyTorch currently does not offer an easy way to "detach" a distribution, but maybe that could be implemented in SB3's Distribution class.
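
One possible workaround, sketched with a plain Categorical (detach_distribution is a hypothetical helper, not an existing PyTorch or SB3 method): rebuild the distribution from detached parameters rather than trying to detach it in place.

import torch as th
from torch.distributions import Categorical

def detach_distribution(dist: Categorical) -> Categorical:
    # Hypothetical helper: a new distribution built from detached parameters,
    # so it can safely serve as the frozen "old" distribution.
    return Categorical(logits=dist.logits.detach())

logits = th.zeros(1, 3, requires_grad=True)
dist = Categorical(logits=logits)
old_dist = detach_distribution(dist)  # no graph attached to old_dist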

Cyprien

araffin added the "Maintainers on vacation" label on Aug 13, 2021
araffin (Member) commented Aug 14, 2021

I will try to have a deeper look at it soon. In the meantime, I recommend reading part of John Schulman's thesis, notably the "Computing the Fisher-Vector Product" section ;)
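
For readers following along, the identity that section relies on, in standard TRPO notation (paraphrased, not a quote from the thesis):

% Fisher-vector product via two gradient computations, no explicit matrix:
F(\theta_{\mathrm{old}})\, v
  = \nabla_\theta \Big[ \big( \nabla_\theta \bar{D}_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta\right) \big)^{\top} v \Big] \Big|_{\theta = \theta_{\mathrm{old}}}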

What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as the last output?

probably a better idea would be to create a new method get_distribution()

in the image below, the parameters of distribution are replaced by the ones from old_distribution because of this side effect

a deepcopy should probably solve that issue, no?

EDIT: you can also take a look at the Theano implementation and the Tianshou one

cyprienc (Contributor, Author) commented Aug 17, 2021

probably a better idea would be to create a new method get_distribution()

Done

a deepcopy should probably solve that issue, no?

I used a shallow copy, but I am wondering whether it makes more sense to avoid any copy at all and do the refactoring needed to remove the side effect. Probably something for the future.


Using the PyTorch distribution did the trick. I also refined a few things to avoid numerical instabilities stemming from the CG method.
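
For reference, the kind of safeguards meant here, sketched on a generic conjugate gradient routine (illustrative only, not the exact code in the branch): an epsilon in the denominators and an early exit once the residual is small.

import torch as th

def conjugate_gradient(matrix_vector_product, b, max_iter=10, residual_tol=1e-10, eps=1e-8):
    # Solves A x = b for symmetric positive-definite A, given only v -> A v.
    x = th.zeros_like(b)
    r = b.clone()          # residual b - A x (x starts at zero)
    p = b.clone()          # search direction
    rs_old = r.dot(r)
    for _ in range(max_iter):
        Av = matrix_vector_product(p)
        alpha = rs_old / (p.dot(Av) + eps)   # eps guards against a tiny/zero denominator
        x += alpha * p
        r -= alpha * Av
        rs_new = r.dot(r)
        if rs_new < residual_tol:            # early exit: the solution is good enough
            break
        p = r + (rs_new / (rs_old + eps)) * p
        rs_old = rs_new
    return x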

How would you like to proceed @araffin ?

araffin (Member) commented Aug 17, 2021

As mentioned in the contrib contributing guide, the next step is to match published results. I would start with the PyBullet envs (I had some results in the SB2 zoo), and at the same time open a draft PR ;)

araffin (Member) commented Aug 23, 2021

As mentioned in the contrib contributing guide, the next step is to match published results. I would start with the PyBullet envs (I had some results in the SB2 zoo).

Regarding the benchmark, once you have created a fork of the rl zoo (cf. guide), I could help you to run it on a larger scale (I have access to a cluster).

cyprienc (Contributor, Author) commented Sep 8, 2021

Hi,

Sorry for the delay (holidays). I've pushed to a fork of the RL zoo: https://github.com/cyprienc/rl-baselines3-zoo
I'll need some help running it at a larger scale, since my local compute is not enough.
Let me know what I can do.
Thanks,

Cyprien

araffin (Member) commented Sep 8, 2021

Could you also open a PR?
This will make it easier to review/use ;)

cyprienc (Contributor, Author) commented Sep 8, 2021

Sure: DLR-RM/rl-baselines3-zoo#163
I'll fill in the PR description later; I just opened the PR to avoid losing time.
Thanks,

Cyprien

araffin (Member) commented Sep 8, 2021

I meant a PR to sb3 contrib...

araffin removed the "Maintainers on vacation" label on Sep 8, 2021
cyprienc (Contributor, Author) commented Sep 8, 2021

Indeed... #40

cyprienc mentioned this issue on Sep 9, 2021