An important correction if you want to Fine Tune an already pre-trained model with a large momentum #91108
Conversation
If you want to continue training an already pre-trained model with this optimizer using a large momentum, for example 0.99, the gradient of the first batch will be 100x more influential than all the others. This may even knock the model out of the optimum it has already found.
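To make the 100x figure concrete, here is a minimal empirical sketch (my illustration, not part of the PR; it assumes dampening is set equal to momentum, the setting discussed later in the thread):

```python
import torch

def influence(grad_at, n_steps=2000, momentum=0.99):
    """Total displacement of a parameter caused by a unit gradient
    delivered at step `grad_at` (zero gradients everywhere else)."""
    p = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([p], lr=1.0, momentum=momentum, dampening=momentum)
    for i in range(n_steps):
        p.grad = torch.ones(1) if i == grad_at else torch.zeros(1)
        opt.step()
    return p.detach().abs().item()

first = influence(grad_at=0)  # gradient seen on the very first step
later = influence(grad_at=1)  # the same gradient, one step later
print(first / later)          # ~100 with the buf = grad.clone() initialization
```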
great catch!
could a unit test be added in test_optim.py to ensure this doesn’t regress?
I think this is BC-breaking.
You're now biasing towards a zero momentum at the beginning. I could see this significantly slowing down training for some users.
We should definitely look in more detail at the impact of this change before merging it!
Not from zero, but from (1 - momentum) times the gradient; only the special first-step branch was removed. This is the normal behavior of SGD with momentum. Let's take momentum = 3/4 as an example (old version vs. fix, worked out below). It means that the training is not faster; just the contribution of the first batch is a multiple higher, and the rest of the training goes the same way.
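A reconstruction of the elided 3/4 arithmetic (my restatement of the comment's claim): with the update buf = momentum * buf + (1 - dampening) * grad, each gradient's total long-run weight is its entry coefficient times the geometric series 1 + m + m^2 + ... = 1 / (1 - m).

```python
m = 0.75  # momentum = 3/4
d = m     # dampening = momentum, the setting under discussion

geometric_sum = 1 / (1 - m)                  # 1 + m + m^2 + ... = 4.0

old_first_batch = 1.0 * geometric_sum        # buf = grad.clone(): weight 4.0
any_later_batch = (1 - d) * geometric_sum    # else branch: weight 1.0
fixed_first_batch = (1 - d) * geometric_sum  # zero init + else: weight 1.0

print(old_first_batch, any_later_batch, fixed_first_batch)  # 4.0 1.0 1.0
```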
This is an example based on my project (`df = FileToImageDF(filenames[0])`). Output log: original fine tuning, original dataset size: 468480, batch size 64, 100 epochs. You can see that the loss increased during the first 2 epochs of the fine-tuning run; that is the effect of this error. With my fixed version there is no difference between the first and the second optimizer, and the loss curve is smooth. But I do not know how to implement such a test in your test_optim.py.
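For what it's worth, a sketch of what such a regression test might look like (my hypothetical test, not an existing one in test_optim.py): after one step with a unit gradient, the momentum buffer should hold the dampened gradient rather than a plain copy of it.

```python
import torch

def test_first_momentum_buffer_is_dampened():
    p = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([p], lr=0.1, momentum=0.75, dampening=0.75)
    p.grad = torch.ones(1)
    opt.step()
    buf = opt.state[p]["momentum_buffer"]
    # With the fix the first gradient is dampened like every later one:
    # buf == (1 - dampening) * grad == 0.25, not grad.clone() == 1.0.
    assert torch.allclose(buf, torch.full((1,), 0.25))
```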
With the current version, the else had no effect because you interpolate between twice the same value. So it is the same as if you were initializing with the grad and then going from there.
The difference is in how dampening is used. For example, when you use momentum = 1 - 1/4, the effective lr is quadrupled. To prevent this effect you must also set dampening = momentum = 1 - 1/4. With this property the integral effect of each batch is the same as with momentum = 0, only more smoothed. In my fix the buffer starts as zeros, and then on the same call the old else branch is executed, `buf.mul_(momentum).add_(d_p, alpha=1 - dampening)`, so the buffer receives the dampened grad, like on all other calls. This has the same effect as writing `if buf is None:` and initializing with the dampened grad directly, but my fix has less copy-paste.
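As I read it, the fix amounts to something like the following (a paraphrase of the patch as one standalone function, not the exact diff):

```python
import torch

def momentum_update(buf, d_p, momentum, dampening):
    """One SGD momentum-buffer update with zero initialization."""
    if buf is None:
        # Start from zeros instead of buf = d_p.clone(), so the first
        # gradient goes through the same dampened path as every later one.
        buf = torch.zeros_like(d_p)
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
    return buf
```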
@albanD To illustrate the changes I made, here is an example of a network on the MNIST dataset so that you can experiment with it yourself. The fixed sgd.py file must be in the same folder:
RESULTS: torch.optim.SGD: ERROR! The loss was dramatically increased. log1[-1][0] < log2[0][0]: 0.026298949959874154 < 0.032348830696940424

P.S. By the way, in Adam, where exponential moving averages are also used, the buffers are initialized with zeros:
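For reference, the Adam initialization being alluded to looks roughly like this (paraphrased from torch/optim/adam.py of that era, with stand-in variables; Adam additionally divides by bias-correction terms so the zero start does not under-weight early gradients):

```python
import torch

p = torch.zeros(3)  # stand-in parameter
beta1, beta2, step = 0.9, 0.999, 1
state = {}

# Both moving averages start at zero...
state["exp_avg"] = torch.zeros_like(p, memory_format=torch.preserve_format)
state["exp_avg_sq"] = torch.zeros_like(p, memory_format=torch.preserve_format)

# ...and each step compensates with bias-correction factors:
bias_correction1 = 1 - beta1 ** step
bias_correction2 = 1 - beta2 ** step
```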
Hi,
Keeping this second point in mind, I'm wondering why we ended up with this different initialization. I think it would be worth digging into the git history to see if that has always been the case or if it was changed at some point from 0 to what it is today.
i haven't followed this discussion closely, but the original nesterov momentum is a bit simplified / re-derived and is from torch/optim#27 (comment) (see the linked PDF in the comment). hope this helps.
The error was already in the code when the Nesterov momentum was added there in 2017, and when using it the error just did not show itself, because Nesterov requires dampening == 0. But thanks for the link: I looked into nag.lua, the file from which clementfarabet's code was adapted, and initialization with zeros is also implemented there.
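A small check of why the bug never surfaced for Nesterov (my illustration): with dampening == 0, initializing the buffer with the gradient and initializing it with zeros then applying the update produce exactly the same first buffer, so only dampening > 0 exposes the difference.

```python
import torch

g = torch.tensor([2.0])
momentum, dampening = 0.9, 0.0  # Nesterov requires dampening == 0

clone_init = g.clone()  # the old code path: buf = grad.clone()
zero_init = torch.zeros_like(g).mul_(momentum).add_(g, alpha=1 - dampening)

print(torch.equal(clone_init, zero_init))  # True when dampening == 0
```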
@albanD However, in sgd.py the error has been present from the very beginning, from the very first commit. That was in the fourth month of development, and it is unlikely that we will now be able to find out from Adam where he was looking when he wrote this code. At least Google suggests that this is not an exact copy-paste from some other place before 19.07.2016. Perhaps it is just a rare and very old error.
Hey! From looking at this in more detail, I don't think we want to change the default here. The main reasons are the ones already raised above: the change is BC-breaking, and altering how the first steps are weighted could noticeably change the start of training for existing users.
I understand that this might not be optimal in your case though. There are a couple of things you could do here from user code, for example pre-seeding the momentum buffer yourself (a sketch follows below):
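One concrete workaround of that kind (my sketch of the idea, not a quote from the thread): pre-seed the momentum buffers with zeros in the optimizer state before resuming fine-tuning, which needs no change to torch.optim.SGD itself.

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the pre-trained model
opt = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.99, dampening=0.99)

# Pre-populate each parameter's momentum buffer so the first fine-tuning
# batch is dampened exactly like every later one:
for group in opt.param_groups:
    for p in group["params"]:
        opt.state[p]["momentum_buffer"] = torch.zeros_like(p)
```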
While the doc already explicitly states what the initial value is, we can add one line to the note about the details of our SGD vs other frameworks to highlight the initialization: https://pytorch.org/docs/stable/generated/torch.optim.SGD.html?highlight=sgd#torch.optim.SGD
Following the discussion in #91108. Pull Request resolved: #92111. Approved by: https://github.com/soumith, https://github.com/janeyx99
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
cc @vincentqb @jbschlosser @albanD @janeyx99