Add a note in the docs about the momentum formulation used in optim #1099
I have been looking at the implementation of SGD + Momentum in PyTorch and noticed something a bit different from how other packages (and papers) describe it. For the moment, let's focus solely on (classical) momentum and not Nesterov's version.
At the time of writing, the implementation reads:
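A simplified, self-contained paraphrase of the classical-momentum branch (this is a sketch, not the verbatim source; dampening, weight decay, and the Nesterov variant are omitted):

```python
import torch

def pytorch_sgd_momentum_step(p, d_p, buf, lr, momentum):
    """Simplified paraphrase of the classical-momentum branch of
    torch.optim.SGD (dampening, weight decay, and Nesterov omitted)."""
    buf.mul_(momentum).add_(d_p)   # buf <- momentum * buf + grad
    p.data.add_(buf, alpha=-lr)    # p   <- p - lr * buf
    return p, buf
```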
Mathematically, if we denote the momentum buffer by $v$ and the parameters by $p$, this updates

$$v_{t+1} = \mu v_t + g_{t+1}, \qquad p_{t+1} = p_t - \mathrm{lr} \cdot v_{t+1},$$

where $\mu$ is the momentum coefficient and $g_{t+1}$ is the gradient.
Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, Neon, etc.
Retaining the notation from above, the algorithm in that paper updates

$$v_{t+1} = \mu v_t - \mathrm{lr} \cdot g_{t+1}, \qquad p_{t+1} = p_t + v_{t+1},$$

i.e. the learning rate is folded into the velocity rather than applied to the whole step.
Lasagne employs the same rule as suggested in Sutskever for momentum.
Keras implements the same rule; a sketch of this shared update follows.
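Written in plain PyTorch for comparison (illustrative only; the actual Lasagne and Keras code targets Theano/backend ops):

```python
import torch

def sutskever_momentum_step(p, g, v, lr, momentum):
    """Sketch of the Sutskever-style update shared by Lasagne and Keras
    (illustrative only; not either library's actual source)."""
    v.mul_(momentum).add_(g, alpha=-lr)  # v <- momentum * v - lr * grad
    p.add_(v)                            # p <- p + v
    return p, v
```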
Is this disparity real, or am I missing something important?
The difference between the two implementations is not insignificant, especially when the learning rate is changed during training.
For a fixed learning rate, the two formulations are equivalent. The Torch formulation is chosen because the step size is directly proportional to the learning rate: if you decrease the learning rate, the step size decreases immediately, rather than after some number of iterations, which is generally what you want.
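To make the equivalence concrete, here is a small self-contained check (illustrative code, not from the thread): with a fixed learning rate the two recursions produce identical iterates, since the Sutskever velocity is just the Torch buffer scaled by $-\mathrm{lr}$.

```python
import torch

def torch_style(p, g, v, lr, mu):
    # PyTorch: buffer accumulates raw gradients, lr scales the whole step
    v = mu * v + g
    p = p - lr * v
    return p, v

def sutskever_style(p, g, u, lr, mu):
    # Sutskever et al.: lr is folded into the velocity itself
    u = mu * u - lr * g
    p = p + u
    return p, u

torch.manual_seed(0)
grads = [torch.randn(3) for _ in range(5)]

p1 = torch.zeros(3); v = torch.zeros(3)
p2 = torch.zeros(3); u = torch.zeros(3)
for g in grads:
    p1, v = torch_style(p1, g, v, lr=0.1, mu=0.9)
    p2, u = sutskever_style(p2, g, u, lr=0.1, mu=0.9)

print(torch.allclose(p1, p2))  # True -- identical while lr stays fixed
# If lr were changed mid-run, the trajectories would diverge: the Torch
# form rescales the entire accumulated velocity immediately, while the
# Sutskever form rescales only the new gradient contributions.
```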
I agree. My only concern was that, since the reference for the method is the Sutskever paper and there is no documentation explaining the difference, the current implementation could be a potential "gotcha" for folks moving to PyTorch from other frameworks.