Hello! I have a question about the back-gradient optimization technique. Your paper mentions this article, but reading the source code train_distill_image.py, I noticed that you couldn't use SGD with momentum (because of the influence of previous learning rates) and so had to save the neural network parameters at each forward step. So what is the advantage of your scheme over usual backpropagation?
Re momentum: There is nothing in our paper's framework that prevents using momentum. One just needs to add the forward and backward logic. Momentum is computed independently of the learning rate.
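To make that concrete, here is a minimal sketch (not the repo's actual forward/backward code) of unrolling SGD with momentum so that autograd differentiates through the inner updates; the model, the distilled data `x_syn`/`y_syn`, and the step counts are illustrative assumptions. Note that the momentum buffer update itself never touches the learning rate.

```python
# Minimal sketch: differentiable unrolled SGD-with-momentum (illustrative, not the repo's code).
import torch

torch.manual_seed(0)

# Meta-parameters we want gradients for: hypothetical distilled inputs/targets.
x_syn = torch.randn(8, 3, requires_grad=True)
y_syn = torch.randn(8, 1)

w = torch.zeros(3, 1, requires_grad=True)   # inner model weights (tiny linear model)
buf = torch.zeros_like(w)                   # momentum buffer, kept inside the graph
lr, mom = 0.1, 0.9

# Forward: a few inner SGD-with-momentum steps, kept differentiable
# via create_graph=True so the buffer updates stay in the computation graph.
for _ in range(5):
    inner_loss = ((x_syn @ w - y_syn) ** 2).mean()
    g, = torch.autograd.grad(inner_loss, w, create_graph=True)
    buf = mom * buf + g      # momentum update: independent of the learning rate
    w = w - lr * buf         # parameter update

# Backward: evaluate a toy outer loss on the trained weights and backprop
# through every inner step, including the momentum buffers.
x_real = torch.randn(16, 3)
y_real = torch.randn(16, 1)
outer_loss = ((x_real @ w - y_real) ** 2).mean()
outer_loss.backward()

print(x_syn.grad.shape)      # meta-gradient w.r.t. the distilled inputs
```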
Re back-gradient vs backprop: The literature has been using "back-gradient" to refer to backpropagation through optimization steps, often done with JVPs. So there is no difference between back-gradient and backprop through optimization steps. The article you cited is just an instance of this technique, with improvements that make it more numerically stable by exploiting momentum. Of course you can also use autograd/autodiff systems to do so; the common JVP technique just makes it more efficient.
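As a sketch of that equivalence, here is backprop through a few plain SGD steps done entirely with an off-the-shelf autodiff system; the toy quadratic objective and the choice of the learning rate as the meta-parameter are assumptions for illustration only.

```python
# Minimal sketch: "back-gradient" as ordinary backprop through unrolled SGD steps.
import torch

lr = torch.tensor(0.05, requires_grad=True)          # meta-parameter: inner learning rate
theta = torch.tensor([2.0, -3.0], requires_grad=True)
target = torch.tensor([0.5, 1.0])

# Unroll a few plain SGD steps; create_graph=True keeps each step differentiable.
for _ in range(10):
    inner_loss = ((theta - target) ** 2).sum()
    g, = torch.autograd.grad(inner_loss, theta, create_graph=True)
    theta = theta - lr * g

# Outer loss after "training"; backward() runs through all optimization steps.
outer_loss = ((theta - target) ** 2).sum()
outer_loss.backward()
print(lr.grad)                                        # d(outer_loss)/d(lr)
```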