[Feature Request] Implement TRPO #38
This is a wild guess, given what you mentioned. The only other tip I can give is to look at other implementations of TRPO and see what they did, e.g. Spinning Up (alas, they too only have a TF1 version of TRPO).
Hi, I've added an assert slightly earlier, inside the conjugate gradient algorithm. @Miffyli Thanks for the approximation trick - neat one. I'll have a look at it (and its gradients :) ). Other implementations usually use a distribution object (custom or from one of the major frameworks) which computes the KL directly. I also wanted to do that but wasn't sure where I could find a distribution object for the policy passed - but let me have a better look at it. Thanks, Cyprien
@araffin What about …? Also, the side-effect in … On a side note, PyTorch currently doesn't allow "detaching" a distribution easily, but maybe it could be implemented in SB3's … Cyprien
I will try to have a deeper look at it soon. In the meantime, I recommend reading part of John Schulman's thesis, notably the "Computing the Fisher-Vector Product" section ;)
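(Not code from this PR, just an illustration of the trick that section describes: the Fisher-vector product `H @ v` can be obtained with two backward passes, without ever materializing the Hessian of the KL. The function name and the damping value below are assumptions.)

```python
import torch

def fisher_vector_product(kl, params, vector, damping=0.1):
    # First backward pass: gradient of the (scalar) KL w.r.t. the policy
    # parameters, keeping the graph so we can differentiate a second time
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.view(-1) for g in grads])
    # Dot the gradient with the input vector, then differentiate again:
    # d/dtheta (grad_kl . v) = H @ v, so the full Hessian is never formed
    grad_v = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_v, params)
    flat_hvp = torch.cat([g.contiguous().view(-1) for g in hvp])
    # A small damping term keeps the product numerically positive definite
    return flat_hvp + damping * vector
```

With a quadratic in place of the KL, the result is just the (damped) Hessian times the vector, which makes the construction easy to sanity-check.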
Probably a better idea would be to create a new method.
A deepcopy should probably solve that issue, no? EDIT: you can also take a look at the Theano implementation and the Tianshou one.
Done
I used a shallow copy, but I am wondering whether it makes more sense to avoid any kind of copy and do the necessary refactoring work to avoid the side-effect. Probably something for the future. Using the PyTorch distribution did the trick. I also refined a few things to avoid numerical instabilities stemming from the CG method. How would you like to proceed, @araffin?
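(For readers following along, the "PyTorch distribution" approach can be sketched as below. This is a minimal illustration with a Gaussian head, not the PR's code; the parameter names are made up. Rebuilding the old distribution from detached tensors makes it a constant in the KL, so no copy of the policy module is needed.)

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical parameters of a Gaussian policy head
mean = torch.tensor([0.5], requires_grad=True)
log_std = torch.tensor([0.0], requires_grad=True)

# Current policy distribution: gradients flow through mean/log_std
dist = Normal(mean, log_std.exp())
# "Detached" old policy: built from detached tensors, so it behaves as a
# constant and the KL gradient only flows through the current distribution
old_dist = Normal(mean.detach(), log_std.exp().detach())

kl = kl_divergence(old_dist, dist).mean()
```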
As mentioned in the contrib contributing guide, the next step is to match published results. I would start with PyBullet envs (I had some results in the SB2 zoo).
Regarding the benchmark: once you have created a fork of the RL Zoo (cf. the guide), I could help you run it at a larger scale (I have access to a cluster).
Hi, sorry for the delay (holidays). I've pushed to a fork of the RL Zoo: https://github.com/cyprienc/rl-baselines3-zoo Cyprien
Could you also open a PR?
Sure: DLR-RM/rl-baselines3-zoo#163 Cyprien
I meant a PR to SB3-Contrib...
Indeed... #40 |
Hi,
I've started working on implementing TRPO: https://github.com/cyprienc/stable-baselines3-contrib/blob/master/sb3_contrib/trpo/trpo.py
I am currently facing a bug when computing the step direction and maximal step length using the matrix-vector product with the Fisher information matrix.
The denominator of beta (the step-size coefficient) is sometimes negative.
I suspect the Hessian in the Hessian-vector product used for the conjugate gradient algorithm is wrong (see implementation).
Could I have made the graph used to compute `grad_kl` wrong? If someone spots something out of place, please let me know.
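For anyone hitting the same symptom: in plain conjugate gradient, the quantity that can go negative is the curvature term p·Ap, which must stay positive whenever A (here, the Fisher matrix) is positive definite. A minimal sketch with an explicit curvature assert (illustrative only, not this fork's implementation):

```python
import numpy as np

def conjugate_gradient(mvp, b, max_iter=10, tol=1e-10):
    """Solve A x = b where A is only available via mvp(v) = A @ v."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A @ x (x starts at zero)
    p = b.copy()          # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = mvp(p)
        curvature = p @ Ap
        # Must be positive for a positive-definite A; a negative value
        # here points at a broken Hessian-vector product (or missing damping)
        assert curvature > 0, "negative curvature: check the HVP / add damping"
        alpha = rs_old / curvature
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```

On a small symmetric positive-definite system, this converges in a handful of iterations; with a wrong HVP, the assert fires instead of silently producing a negative denominator downstream.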
Thanks,
Cyprien
PS: Here is a snippet to run the code: