[Feature Request] Implement TRPO #38

Closed · cyprienc opened this issue Aug 7, 2021 · 12 comments · Fixed by #40
Labels: enhancement (New feature or request)

cyprienc (Contributor) commented Aug 7, 2021

Hi,

I've started working on implementing TRPO: https://github.com/cyprienc/stable-baselines3-contrib/blob/master/sb3_contrib/trpo/trpo.py

I am currently facing a bug when computing the step direction and the maximal step length via the matrix-vector product with the Fisher information matrix: the denominator of beta is sometimes negative.

I suspect the Hessian in the Hessian-vector product used for the Conjugate Gradient algorithm is wrong (see implementation):

def Hpv(v, retain_graph=True):
    # Hessian-vector product: gradient of (grad_kl . v) w.r.t. the policy parameters
    jvp = (grad_kl * v).sum()
    return flat_grad(jvp, params, retain_graph=retain_graph)
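
For reference, a minimal sketch of how this Fisher-vector product is typically assembled with double backprop. The toy policy, the KL value and the damping coefficient below are placeholders, not taken from trpo.py; a common pitfall with this construction is forgetting create_graph=True when computing grad_kl.

import torch as th
import torch.nn as nn

# Toy stand-ins (not the trpo.py objects): a tiny policy and a differentiable
# mean KL between a detached "old" distribution and the current one, just so
# the double-backprop pattern below runs on its own.
policy = nn.Linear(4, 2)
logits = policy(th.randn(8, 4))
old_log_probs = th.log_softmax(logits, dim=-1).detach()
new_log_probs = th.log_softmax(logits, dim=-1)
kl_div = (old_log_probs.exp() * (old_log_probs - new_log_probs)).sum(dim=-1).mean()

params = [p for p in policy.parameters() if p.requires_grad]

# create_graph=True is essential: without it, the second backward pass
# through grad_kl has nothing to differentiate.
grads = th.autograd.grad(kl_div, params, create_graph=True)
grad_kl = th.cat([g.reshape(-1) for g in grads])

def fisher_vector_product(v, damping=0.1, retain_graph=True):
    # F v = d/dtheta [ (dKL/dtheta) . v ], plus damping * v for conditioning
    jvp = (grad_kl * v).sum()
    hvp = th.autograd.grad(jvp, params, retain_graph=retain_graph)
    return th.cat([g.reshape(-1) for g in hvp]) + damping * v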

Could I have built the graph used to compute grad_kl incorrectly?
If someone spots something out of place, please let me know.
Thanks,

Cyprien

PS: Here is a snippet to run the code:

import gym

from sb3_contrib.trpo.policies import MlpPolicy
from sb3_contrib.trpo.trpo import TRPO

env = gym.make("CartPole-v1")

model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

env.close()
cyprienc changed the title from "Implement TRPO" to "[Feature Request] Implement TRPO" on Aug 7, 2021
Miffyli (Member) commented Aug 9, 2021

This is a wild guess, but since you mentioned grad_kl: it looks like this code uses a single-sample estimate of the KL (and then averages over samples), which is known to sometimes return negative values for the KL (see the lengthy discussion and update on SB3 here). This is simply based on the "something is negative but shouldn't be" part :D .
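
To make that concrete, a small toy comparison of the two estimators (illustrative values only, not from the linked code; the second, always non-negative form is the kind of estimator the SB3 update moved to):

import torch as th

# Toy log-probabilities of the same actions under the old and the new policy
# (made-up values, just to make the comparison runnable).
old_log_prob = th.tensor([0.5, 0.3, 0.2]).log()
log_prob = th.tensor([0.45, 0.35, 0.2]).log()

# Single-sample estimator averaged over the batch: unbiased, but individual
# terms (and sometimes the whole mean) can come out negative, as it does here.
approx_kl_naive = (old_log_prob - log_prob).mean()

# Lower-variance estimator: each term exp(r) - 1 - r is >= 0, so the mean
# can never be negative.
log_ratio = log_prob - old_log_prob
approx_kl = (th.exp(log_ratio) - 1 - log_ratio).mean()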

The only other tip I can give is to look at other implementations of TRPO and see what they did, e.g. Spinning Up (alas, they too only have a TF1 version of TRPO).

araffin added the enhancement label on Aug 9, 2021
araffin self-assigned this on Aug 9, 2021
cyprienc (Contributor, Author) commented Aug 9, 2021

Hi,

I've added an assert slightly earlier, inside the conjugate gradient algorithm, but it points in the same direction: the matrix defined in Hpv is not positive-definite.
[screenshot of the failing assertion omitted]

@Miffyli Thanks for the approximation trick - neat one. I'll have a look at it (and its gradients :) ). Other implementations usually use a distribution object (custom or from one of the major frameworks) that computes the KL directly. I wanted to do the same but wasn't sure where to find a distribution object for the policy that is passed in - let me have a better look at it.

Thanks,

Cyprien

cyprienc (Contributor, Author) commented:

@araffin What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as its last output? This would give the training loop access to the underlying Distribution.distribution object, so the analytical KL divergence could be computed instead of a sample estimate.
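
As an illustration of what that would enable (plain torch.distributions objects here, not the actual SB3 classes):

import torch as th
from torch.distributions import Categorical, kl_divergence

# Old policy: parameters detached, so no gradient flows into it.
old_dist = Categorical(logits=th.tensor([[1.0, 0.5, -0.2]]))

# New policy: parameters still attached to the graph.
new_logits = th.tensor([[0.8, 0.6, -0.1]], requires_grad=True)
new_dist = Categorical(logits=new_logits)

# Exact KL(old || new), differentiable w.r.t. new_logits only.
kl = kl_divergence(old_dist, new_dist).mean()
kl.backward()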

Also, the side effect in Distribution.proba_distribution, which is called by ActorCriticPolicy._get_action_dist_from_latent, means it is not possible to compute the detached old distribution with ActorCriticPolicy.evaluate_actions inside a no_grad block: the call overwrites the parameters of the new distribution (in the image below, the parameters of distribution are replaced by the ones from old_distribution because of this side effect).
[screenshot omitted]

On a side note, PyTorch currently does not offer an easy way to "detach" a distribution, but maybe that could be implemented in SB3's Distribution class.
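
One possible workaround, sketched with a plain Categorical (detach_distribution is a hypothetical helper, not an existing PyTorch or SB3 method): rebuild the distribution from detached parameters rather than trying to detach it in place.

import torch as th
from torch.distributions import Categorical

def detach_distribution(dist: Categorical) -> Categorical:
    # Hypothetical helper: a new distribution built from detached parameters,
    # so it can safely serve as the frozen "old" distribution.
    return Categorical(logits=dist.logits.detach())

logits = th.zeros(1, 3, requires_grad=True)
dist = Categorical(logits=logits)
old_dist = detach_distribution(dist)  # no graph attached to old_dist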

Cyprien

araffin added the "Maintainers on vacation" label on Aug 13, 2021
araffin (Member) commented Aug 14, 2021

I will try to have a deeper look at it soon. In the meantime, I recommend reading part of John Schulman's thesis, notably the "Computing the Fisher-Vector Product" section ;)
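
For readers following along, the identity that section relies on, in standard TRPO notation (paraphrased, not a quote from the thesis):

% Fisher-vector product via two gradient computations, no explicit matrix:
F(\theta_{\mathrm{old}})\, v
  = \nabla_\theta \Big[ \big( \nabla_\theta \bar{D}_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta\right) \big)^{\top} v \Big] \Big|_{\theta = \theta_{\mathrm{old}}}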

What about ActorCriticPolicy.evaluate_actions returning the Distribution object directly instead of the entropy as the last output?

probably a better idea would be to create a new method get_distribution()

in the image below, the parameters of distribution are replaced by the ones from old_distribution because of this side effect

a deepcopy should probably solve that issue, no?

EDIT: you can also take a look at the Theano implementation and the Tianshou one

cyprienc (Contributor, Author) commented Aug 17, 2021

probably a better idea would be to create a new method get_distribution()

Done

a deepcopy should probably solve that issue, no?

I used a shallow copy, but I am wondering whether it makes more sense to avoid any copy at all and do the refactoring needed to remove the side effect. Probably something for the future.


Using the PyTorch distribution did the trick. I also refined a few things to avoid numerical instabilities stemming from the CG method.
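
For reference, the kind of safeguards meant here, sketched on a generic conjugate gradient routine (illustrative only, not the exact code in the branch): an epsilon in the denominators and an early exit once the residual is small.

import torch as th

def conjugate_gradient(matrix_vector_product, b, max_iter=10, residual_tol=1e-10, eps=1e-8):
    # Solves A x = b for symmetric positive-definite A, given only v -> A v.
    x = th.zeros_like(b)
    r = b.clone()          # residual b - A x (x starts at zero)
    p = b.clone()          # search direction
    rs_old = r.dot(r)
    for _ in range(max_iter):
        Av = matrix_vector_product(p)
        alpha = rs_old / (p.dot(Av) + eps)   # eps guards against a tiny/zero denominator
        x += alpha * p
        r -= alpha * Av
        rs_new = r.dot(r)
        if rs_new < residual_tol:            # early exit: the solution is good enough
            break
        p = r + (rs_new / (rs_old + eps)) * p
        rs_old = rs_new
    return x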

How would you like to proceed @araffin ?

araffin (Member) commented Aug 17, 2021

As mentioned in the contrib contributing guide, the next step is to match published results. I would start with the PyBullet envs (I had some results in the SB2 zoo), and at the same time open a draft PR ;)

araffin (Member) commented Aug 23, 2021

As mentioned in the contrib contributing guide, the next step is to match published results. I would start with the PyBullet envs (I had some results in the SB2 zoo).

Regarding the benchmark, once you have created a fork of the rl zoo (cf. guide), I could help you to run it on a larger scale (I have access to a cluster).

cyprienc (Contributor, Author) commented Sep 8, 2021

Hi,

Sorry for the delay (holidays). I've pushed to a fork of the RL zoo: https://github.com/cyprienc/rl-baselines3-zoo
I'll need some help running it at a larger scale, since my local compute is not enough.
Let me know what I can do.
Thanks,

Cyprien

araffin (Member) commented Sep 8, 2021

Could you also open a PR?
This will make it easier to review/use ;)

cyprienc (Contributor, Author) commented Sep 8, 2021

Sure: DLR-RM/rl-baselines3-zoo#163
I'll fill in the PR description later; I just opened the PR to avoid losing time.
Thanks,

Cyprien

araffin (Member) commented Sep 8, 2021

I meant a PR to sb3 contrib...

araffin removed the "Maintainers on vacation" label on Sep 8, 2021
cyprienc (Contributor, Author) commented Sep 8, 2021

Indeed... #40

cyprienc mentioned this issue on Sep 9, 2021