# Learning Rates

Previously, we went over learning rates and parameter tuning with gradient descent. In this document, we will go over optimizers that adapt the learning rate automatically, which reduces the amount of manual parameter tuning, a lengthy trial-and-error process. Here, an optimizer is a method that updates the effective learning rate at each iteration as a function of another quantity, typically the history of past gradients.

We will go over and demo the following optimizers:
1. AdaGrad
2. RMSprop
3. Adam
4. AdaDelta
5. Nadam
6. AdaMax
7. AMSgrad

There are many other methods for computing the learning rate, and we saw a few of them in the variants we explored. Now we turn our attention to some more popular methods that are often paired with SGD in ML and AutoML systems such as Keras, Google Cloud AutoML, and scikit-learn. We will explore the techniques listed above, all of which are available in Keras. We will implement the first two using Stochastic Gradient Descent (SGD), and the rest of the optimizers are designed to be used with Nesterov's Accelerated Gradient (NAG).


#### AdaGrad and RMSprop

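The most basic adaptive learning rate method is AdaGrad. AdaGrad keeps a running sum $S$ of the squared gradients and divides the learning rate $\mu$ by the square root of that sum, so the effective step size shrinks as gradients accumulate. Starting from $S_{0} = 0$, we update by $$S_{t+1} = S_{t} + |\nabla F(w^{(t)})|^{2}$$ $$w^{(t+1)} = w^{(t)} - \frac{\mu}{\sqrt{S_{t+1} + \epsilon}} \nabla F(w^{(t)})$$ Here $\epsilon > 0$ is a small constant that prevents division by zero.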
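
The AdaGrad update can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. It assumes that $S$ starts at zero, that $S$ is updated with the current squared gradient before the step is scaled, and that the square and square root act on each coordinate separately. The function name `adagrad`, the default values, and the test function `quad_grad` are hypothetical choices made for this example.

```python
import math

def adagrad(grad, w0, lr=0.5, eps=1e-8, steps=200):
    """Run AdaGrad for a fixed number of steps, given a gradient function."""
    w = list(w0)
    S = [0.0] * len(w)  # running sum of squared gradients, one entry per coordinate
    for _ in range(steps):
        g = grad(w)
        # Accumulate the squared gradient (assumed element-wise).
        S = [s + gi * gi for s, gi in zip(S, g)]
        # Scale the learning rate by 1 / sqrt(S + eps) and take the step.
        w = [wi - lr / math.sqrt(s + eps) * gi for wi, s, gi in zip(w, S, g)]
    return w

# Hypothetical test problem: F(w) = w_0^2 + 10 * w_1^2, minimized at (0, 0).
def quad_grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w_adagrad = adagrad(quad_grad, w0=[3.0, -2.0])
print(w_adagrad)
```

Under these assumptions the first step moves every coordinate by about `lr`, regardless of how large its gradient is, because after one accumulation $\sqrt{S}$ equals the gradient magnitude. Later steps get smaller because $S$ can only grow.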


RMSprop uses the same idea, but instead of summing all past squared gradients it keeps an exponentially weighted moving average of them, so old gradients are gradually forgotten and the effective learning rate is not forced to decay monotonically. Thus our update is: $$S_{t+1} = \beta S_{t} + (1- \beta) |\nabla F(w^{(t)})|^{2}$$ $$w^{(t+1)} = w^{(t)} - \frac{\mu}{\sqrt{S_{t+1} + \epsilon}} \nabla F(w^{(t)})$$ Here, we choose $\beta \in [0,1)$. A typical choice is $\beta = 0.9$. We will apply these algorithms with SGD and NAG.
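
A matching sketch for RMSprop is given here. As before, it is only an illustration under stated assumptions: $S$ starts at zero, the moving average is updated before the step is scaled, and all operations act per coordinate. The names `rmsprop` and `quad_grad` and the default values of `lr` and `steps` are hypothetical, while `beta=0.9` follows the typical choice mentioned above.

```python
import math

def rmsprop(grad, w0, lr=0.01, beta=0.9, eps=1e-8, steps=1000):
    """Run RMSprop for a fixed number of steps, given a gradient function."""
    w = list(w0)
    S = [0.0] * len(w)  # moving average of squared gradients, one entry per coordinate
    for _ in range(steps):
        g = grad(w)
        # Exponentially weighted average: old squared gradients decay by beta.
        S = [beta * s + (1.0 - beta) * gi * gi for s, gi in zip(S, g)]
        # Scale the learning rate by 1 / sqrt(S + eps) and take the step.
        w = [wi - lr / math.sqrt(s + eps) * gi for wi, s, gi in zip(w, S, g)]
    return w

# Hypothetical test problem: F(w) = w_0^2 + 10 * w_1^2, minimized at (0, 0).
def quad_grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w_rmsprop = rmsprop(quad_grad, w0=[3.0, -2.0])
print(w_rmsprop)
```

Because $S_{t+1} \geq (1 - \beta) |\nabla F(w^{(t)})|^{2}$ in each coordinate, no single step in this sketch can move a coordinate by more than $\mu / \sqrt{1 - \beta}$. With a fixed $\mu$ the iterates are therefore not guaranteed to settle exactly on the minimizer, so in practice $\mu$ is often kept small or decayed over time.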