Difference in the implementation of rmsprop with Tensorflow #23796
Comments
Thanks for the PR @meijieru !
I'm reopening in case other users would like to give feedback on this.
@meijieru were you able to replicate the TF result in the end? No matter what I tried with RMSProp I could not get it to work well in PyTorch. It does work pretty well in TF though.
Yeah.
Closing since the documentation has been updated.
@vincentqb @zou3519 By moving epsilon inside the sqrt, everything works like a charm. There are also mathematical proofs that adding eps after the sqrt is not enough to prevent overflow; check this paper at p. 6, bottom right. The topic of the paper is slightly different, but its conclusion also applies here. According to this issue, other people have also had problems with RMSprop's behavior in PyTorch. As far as I can see, you want the current implementation to be consistent with Adam/AdamW, but they also tend to produce …
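A minimal scalar sketch of the two epsilon placements being discussed (the rmsprop_step helper and its defaults are hypothetical illustrations, not PyTorch's actual API):

```python
import math

def rmsprop_step(param, grad, square_avg, lr=0.01, alpha=0.99, eps=1e-8,
                 eps_inside_sqrt=False):
    # Hypothetical scalar RMSprop step (no momentum) for illustration only.
    square_avg = alpha * square_avg + (1 - alpha) * grad * grad
    if eps_inside_sqrt:
        denom = math.sqrt(square_avg + eps)   # TF-style placement
    else:
        denom = math.sqrt(square_avg) + eps   # PyTorch-style placement
    return param - lr * grad / denom, square_avg
```

With the default tiny eps the two branches are nearly identical; with a large eps (as some TF training recipes use) the denominators, and hence the steps, diverge noticeably.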
The main reason the epsilon is as it is right now is backward compatibility (and consistency with others like Adam/AdamW), and we cannot change the default behavior, at least not without a lot of warnings and time. We also try to keep the optimizers as lightweight as possible so users can modify and experiment with them more easily. That being said, this is a topic that comes up often (e.g. #32545), so I'm open to having an alternative available such as you mentioned and offered in #23807 (which would need to be applied to Adam/AdamW also). What other alternatives could we consider?
@vincentqb I do agree that adding …
Is this what you mean, #26735?
I saw #26735; I think it's not enough. Maybe adding a sentence about possible overflow in FP16 training would be better.
If overflow happens in a systematic fashion, can you point me to the issue number in this case?
Another issue related to the TF vs. PyTorch rmsprop implementation: fixing epsilon alone did not work for us. Tensorflow initializes the squared-grad accumulator to ones, while PyTorch initializes it to zeros. In our experiments (A2C) we found the PyTorch variant learned faster early in the task but never converged to an optimal policy, while the TF version learns steadily and reliably. PyTorch used to initialize to ones, but this was changed years ago without much discussion (#485). I realize there is no gold standard for rmsprop and you might not want to change this, but I believe the TF version is more stable thanks to its smaller initial gradient updates.
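A quick way to see the effect of the accumulator initialization on the very first update (first_step_size is a hypothetical helper, assuming the eps-outside-sqrt form and no momentum):

```python
import math

def first_step_size(grad, init, lr=0.01, alpha=0.99, eps=1e-8):
    # Magnitude of the very first RMSprop update for a given accumulator init.
    square_avg = alpha * init + (1 - alpha) * grad * grad
    return lr * abs(grad) / (math.sqrt(square_avg) + eps)

# Zeros init: denom ~= sqrt(1 - alpha) * |grad|, so the first step is roughly
# lr / sqrt(1 - alpha) (about 10x lr here) regardless of the gradient's scale.
# Ones init: denom ~= 1 for small gradients, so the first step is ~ lr * |grad|.
```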
Since I'm cc'd on this thread: I've had great success with my variant of RMSProp that tries to stay true to the TF version. I've trained quite a number of models with excellent results, and so have quite a few others. Trying similar hparams with the PyTorch RMSProp results in unstable training and often immediate blow-ups; it's basically not usable in my trials, and I've never managed acceptable results. There are 3 main differences:
1. epsilon is applied inside the square root rather than added to its result;
2. the squared-gradient accumulator is initialized to ones rather than zeros;
3. the learning rate is applied inside the momentum-buffer update rather than to the final step.
I also tried changing the order of a few ops to match TF more closely, but I doubt that had any impact whatsoever.
The third one, the way the LR is applied to the update, is interesting but not often brought up. In steady state (of the LR) the TF and PyTorch implementations are equivalent; however, they are not when the LR changes: the TF version smooths the transition. Interestingly, many LR schedules used with rmsprop by some Google research teams change the LR quite frequently; they often have per-step or per-epoch warmup ramps and then LR decay steps every 1-3 epochs. So this difference would have an impact.
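A sketch of the two ways of applying the LR (both functions are hypothetical simplifications with momentum and a fixed, precomputed denominator):

```python
def pytorch_style(param, buf, grad, denom, lr, momentum=0.9):
    # PyTorch: the current lr multiplies the entire momentum buffer each step.
    buf = momentum * buf + grad / denom
    return param - lr * buf, buf

def tf_style(param, buf, grad, denom, lr, momentum=0.9):
    # TF: lr is folded into the buffer, so contributions from earlier steps
    # keep the lr that was active when they were accumulated.
    buf = momentum * buf + lr * grad / denom
    return param - buf, buf
```

With a constant lr the two produce identical trajectories; when the lr changes mid-training, the TF form blends old and new rates through the buffer, which is the smoothing described above.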
🚀 Feature
Recently I wanted to reproduce a tensorflow model in pytorch, and I found some differences between the tensorflow and pytorch rmsprop optimizers. The epsilon is added inside the sqrt in tensorflow, while pytorch adds it outside. This makes a difference when epsilon is large.
See chainer/chainer#4754 for reference. Maybe we could have the same option, eps_inside_sqrt, for controlling the behavior.