Difference in the implementation of rmsprop with Tensorflow #23796
Comments
Thanks for the PR @meijieru !
I'm reopening in case other users would like to give feedback on this.
@meijieru were you able to replicate the TF result in the end? No matter what I tried with RMSProp I could not get it to work well in PyTorch. It does work pretty well in TF though.
Yeah.
Closing since the documentation has been updated.
@vincentqb @zou3519 By moving epsilon inside the sqrt, everything works like a charm. There are also mathematical proofs that adding eps after the sqrt is not enough to prevent overflow; check this paper at p. 6, bottom right. The topic of the paper is slightly different, but its conclusion also applies here. According to this issue, other people have also had problems with RMSprop's behavior in PyTorch. As far as I can see, you want the current implementation to be consistent with Adam/AdamW, but they also tend to produce …
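A minimal scalar sketch of the two epsilon placements being discussed (the rmsprop_step helper and its defaults are hypothetical illustrations, not PyTorch's actual API):

```python
import math

def rmsprop_step(param, grad, square_avg, lr=0.01, alpha=0.99, eps=1e-8,
                 eps_inside_sqrt=False):
    # Hypothetical scalar RMSprop step (no momentum) for illustration only.
    square_avg = alpha * square_avg + (1 - alpha) * grad * grad
    if eps_inside_sqrt:
        denom = math.sqrt(square_avg + eps)   # TF-style placement
    else:
        denom = math.sqrt(square_avg) + eps   # PyTorch-style placement
    return param - lr * grad / denom, square_avg
```

With the default tiny eps the two branches are nearly identical; with a large eps (as some TF training recipes use) the denominators, and hence the steps, diverge noticeably.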
The main reason the epsilon is as it is right now is backward compatibility (and consistency with others like Adam/AdamW), and we cannot change the default behavior, at least not without a lot of warnings and time. We also try to keep the optimizers as lightweight as possible so users can modify and experiment with them more easily. That being said, this is a topic that comes up often (e.g. #32545), so I'm open to having an alternative available such as you mentioned and offered in #23807 (which would need to be applied to Adam/AdamW also). What other alternatives could we consider?
@vincentqb I do agree that adding …
Is this what you mean, #26735?
I saw #26735; I think it's not enough. Maybe adding a sentence about possible overflow in FP16 training would be better.
If overflow happens in a systematic fashion, can you point me to the issue number in this case?
Another issue related to the TF vs. PyTorch rmsprop implementation: fixing epsilon alone did not work for us. Tensorflow initializes the squared-grad accumulator to ones, while PyTorch initializes it to zeros. In our experiments (A2C) we found the PyTorch variant learned faster early in the task but never converged to an optimal policy, while the TF version learns steadily and reliably. PyTorch used to initialize to ones, but this was changed years ago without much discussion (#485). I realize there is no gold standard for rmsprop and you might not want to change this, but I believe the TF version is more stable thanks to its smaller initial gradient updates.
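A quick way to see the effect of the accumulator initialization on the very first update (first_step_size is a hypothetical helper, assuming the eps-outside-sqrt form and no momentum):

```python
import math

def first_step_size(grad, init, lr=0.01, alpha=0.99, eps=1e-8):
    # Magnitude of the very first RMSprop update for a given accumulator init.
    square_avg = alpha * init + (1 - alpha) * grad * grad
    return lr * abs(grad) / (math.sqrt(square_avg) + eps)

# Zeros init: denom ~= sqrt(1 - alpha) * |grad|, so the first step is roughly
# lr / sqrt(1 - alpha) (about 10x lr here) regardless of the gradient's scale.
# Ones init: denom ~= 1 for small gradients, so the first step is ~ lr * |grad|.
```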
Since I'm cc'd on this thread: I've had great success with my variant of RMSProp that tries to stay true to the TF version. I've trained quite a number of models with excellent results, and so have quite a few others. Trying similar hparams with the PyTorch RMSProp results in unstable training and often immediate blow-ups; it's basically not usable in my trials, and I've never managed acceptable results. There are 3 main differences:
1. epsilon is applied inside the square root rather than added to its result;
2. the squared-gradient accumulator is initialized to ones rather than zeros;
3. the learning rate is applied inside the momentum-buffer update rather than to the final step.
I also tried changing the order of a few ops to match TF more closely, but I doubt that had any impact whatsoever.
The third one, the way the LR is applied to the update, is interesting but not often brought up. In steady state (of the LR) the TF and PyTorch implementations are equivalent; however, they are not when the LR changes: the TF version smooths the transition. Interestingly, many LR schedules used with rmsprop by some Google research teams change the LR quite frequently; they often have per-step or per-epoch warmup ramps and then LR decay steps every 1-3 epochs. So this difference would have an impact.
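A sketch of the two ways of applying the LR (both functions are hypothetical simplifications with momentum and a fixed, precomputed denominator):

```python
def pytorch_style(param, buf, grad, denom, lr, momentum=0.9):
    # PyTorch: the current lr multiplies the entire momentum buffer each step.
    buf = momentum * buf + grad / denom
    return param - lr * buf, buf

def tf_style(param, buf, grad, denom, lr, momentum=0.9):
    # TF: lr is folded into the buffer, so contributions from earlier steps
    # keep the lr that was active when they were accumulated.
    buf = momentum * buf + lr * grad / denom
    return param - buf, buf
```

With a constant lr the two produce identical trajectories; when the lr changes mid-training, the TF form blends old and new rates through the buffer, which is the smoothing described above.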
🚀 Feature
Recently I wanted to reproduce a tensorflow model in pytorch, and I found some differences between the tensorflow and pytorch rmsprop optimizers. The epsilon is added inside the sqrt in tensorflow, while pytorch adds it outside. This makes a difference when epsilon is large.
See chainer/chainer#4754 for reference. Maybe we could have the same option, eps_inside_sqrt, for controlling the behavior.