Hello! I have a question about the back-gradient optimization technique. Your paper mentions this article, but reading the source code train_distill_image.py, I noticed that you couldn't use SGD with momentum (because of the influence of previous learning rates) and so had to save the neural network parameters at each forward step. So what is the advantage of your scheme over usual backpropagation?
Re momentum: There is nothing in our paper's framework that prevents using momentum. One just needs to add the forward and backward logic. Momentum is computed independently of the learning rate.
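To make that concrete, here is a minimal sketch (not the repo's actual forward/backward code) of unrolling SGD with momentum so that autograd differentiates through the inner updates; the model, the distilled data `x_syn`/`y_syn`, and the step counts are illustrative assumptions. Note that the momentum buffer update itself never touches the learning rate.

```python
# Minimal sketch: differentiable unrolled SGD-with-momentum (illustrative, not the repo's code).
import torch

torch.manual_seed(0)

# Meta-parameters we want gradients for: hypothetical distilled inputs/targets.
x_syn = torch.randn(8, 3, requires_grad=True)
y_syn = torch.randn(8, 1)

w = torch.zeros(3, 1, requires_grad=True)   # inner model weights (tiny linear model)
buf = torch.zeros_like(w)                   # momentum buffer, kept inside the graph
lr, mom = 0.1, 0.9

# Forward: a few inner SGD-with-momentum steps, kept differentiable
# via create_graph=True so the buffer updates stay in the computation graph.
for _ in range(5):
    inner_loss = ((x_syn @ w - y_syn) ** 2).mean()
    g, = torch.autograd.grad(inner_loss, w, create_graph=True)
    buf = mom * buf + g      # momentum update: independent of the learning rate
    w = w - lr * buf         # parameter update

# Backward: evaluate a toy outer loss on the trained weights and backprop
# through every inner step, including the momentum buffers.
x_real = torch.randn(16, 3)
y_real = torch.randn(16, 1)
outer_loss = ((x_real @ w - y_real) ** 2).mean()
outer_loss.backward()

print(x_syn.grad.shape)      # meta-gradient w.r.t. the distilled inputs
```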
Re back-gradient vs backprop: The literature has been using "back-gradient" to refer to backpropagation through optimization steps, often done with JVPs. So there is no difference between back-gradient and backprop through optimization steps. The article you cited is just an instance of this technique, with improvements that make it more numerically stable by exploiting momentum. Of course you can also use autograd/autodiff systems to do so; the common JVP technique just makes it more efficient.
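As a sketch of that equivalence, here is backprop through a few plain SGD steps done entirely with an off-the-shelf autodiff system; the toy quadratic objective and the choice of the learning rate as the meta-parameter are assumptions for illustration only.

```python
# Minimal sketch: "back-gradient" as ordinary backprop through unrolled SGD steps.
import torch

lr = torch.tensor(0.05, requires_grad=True)          # meta-parameter: inner learning rate
theta = torch.tensor([2.0, -3.0], requires_grad=True)
target = torch.tensor([0.5, 1.0])

# Unroll a few plain SGD steps; create_graph=True keeps each step differentiable.
for _ in range(10):
    inner_loss = ((theta - target) ** 2).sum()
    g, = torch.autograd.grad(inner_loss, theta, create_graph=True)
    theta = theta - lr * g

# Outer loss after "training"; backward() runs through all optimization steps.
outer_loss = ((theta - target) ** 2).sum()
outer_loss.backward()
print(lr.grad)                                        # d(outer_loss)/d(lr)
```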