# Lecture 24 - Adaptive Learning Rates

# Backpropagation in a Nutshell

Suppose you have the data $\{x_i\}_{i=1}^N \in \mathbb{R}^D$ with labels $\{d_i\}_{i=1}^N \in \mathbb{R}^K$. You want to find a mapper that takes each input sample $x_i$ and maps it to its label response $d_i$, i.e. *classification*.

Consider the following objective function:

$$J(w) = \frac{1}{2} \sum_{j=1}^p e_j^2 = \frac{1}{2} \sum_{j=1}^p \left(d_j-y_j\right)^2$$

where $p$ is the dimensionality of your desired values and $e_j = d_j - y_j$ is the error on output $j$.

1. **Forward Pass:**
For any neuron $i$ receiving input signals from neurons $j$:

$$y_i = \phi\left(\sum_{j=1}^M w_{ij}x_j\right)$$

where $\phi(\bullet)$ is a pre-defined activation function. (The weights and biases of the network have been initialized to some random value.)

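2. **Backward Pass:** Compute the gradient by defining a local error $\delta_i = \partial J / \partial v_i$ at every layer and neuron, where $v_i = \sum_j w_{ij} x_j$ is the input to the activation function of neuron $i$. For a hidden neuron $i$ that feeds neurons $k$ in the next layer (for an output neuron, $\delta_i = -e_i\,\phi'(v_i)$):

$$\delta_i = \phi'(v_i)\sum_k \delta_k w_{ki}$$

$$\Delta w_{ij} = - \eta \delta_i x_j$$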

and update the parameters:

$$w_{ij}^{(n+1)} = w_{ij}^{(n)} + \Delta w_{ij}$$

where $\eta$ is the learning rate.
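The two passes can be sketched in NumPy for a tiny two-layer network. This is a minimal illustration, not a reference implementation: the layer sizes, the `tanh` hidden activation, the linear output layer and the names `forward`, `backward` and `loss` are all assumptions made for this example. The finite-difference check at the end is a common way to confirm that the backward pass matches the gradient of $J$.

```python
import numpy as np

def forward(W1, W2, x):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    v1 = W1 @ x          # hidden pre-activations v_i
    y1 = np.tanh(v1)     # hidden outputs phi(v_i)
    y2 = W2 @ y1         # output layer (phi is the identity here)
    return v1, y1, y2

def loss(W1, W2, x, d):
    """J = 1/2 * sum of squared output errors."""
    _, _, y2 = forward(W1, W2, x)
    return 0.5 * np.sum((d - y2) ** 2)

def backward(W1, W2, x, d):
    """Return dJ/dW1 and dJ/dW2 using local errors delta = dJ/dv."""
    _, y1, y2 = forward(W1, W2, x)
    delta2 = -(d - y2)                          # output layer, phi'(v) = 1
    delta1 = (1.0 - y1 ** 2) * (W2.T @ delta2)  # tanh'(v) = 1 - tanh(v)^2
    return np.outer(delta1, x), np.outer(delta2, y1)

rng = np.random.default_rng(0)
W1 = 0.5 * rng.standard_normal((3, 2))   # 2 inputs -> 3 hidden units
W2 = 0.5 * rng.standard_normal((1, 3))   # 3 hidden units -> 1 output
x, d = np.array([0.3, -0.7]), np.array([1.0])

g1, g2 = backward(W1, W2, x, d)

# Finite-difference check on one weight of the first layer.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (loss(W1_plus, W2, x, d) - loss(W1_minus, W2, x, d)) / (2 * eps)

# One gradient step with learning rate eta.
eta = 0.1
loss_before = loss(W1, W2, x, d)
loss_after = loss(W1 - eta * g1, W2 - eta * g2, x, d)
```

Comparing `numeric` with `g1[0, 1]` checks the backward pass, and comparing `loss_after` with `loss_before` shows the effect of a single update.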

# Adaptive Linear Systems

In regression, we found an analytic solution for the *optimal* weights of our regressor.

For some data $\{x_i\}_{i=1}^N$ and labels $\{d_i\}_{i=1}^N$, we found that the optimal weights are:

$$w^* = (X^TX)^{-1} X^Td$$

using the mean squared error function $J = \frac{1}{2N} \sum_{i=1}^N (d_i - w^Tx_i)^2$.

Note that if we were operating in the feature space of $X$, $\phi(X)$, the optimal weights are: 

$$w^* = (\phi(X)^T\phi(X))^{-1} \phi(X)^Td$$

Assuming the data $X$ is demeaned, $R = X^TX$ is called (up to a factor of $1/N$) the *covariance* of the input data and $P = X^Td$ is called the *cross-covariance* of the input data with the desired signal.
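As a quick numerical illustration of the closed-form solution, the sketch below builds $R$ and $P$ for synthetic data and solves $Rw = P$. The data, the noise level and the variable names are assumptions chosen for the example; `np.linalg.solve` is used instead of forming the inverse explicitly, which is generally the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.standard_normal((N, D))                 # synthetic inputs
w_true = np.array([1.5, -2.0, 0.5])             # weights used to generate labels
d = X @ w_true + 0.01 * rng.standard_normal(N)  # noisy desired values

R = X.T @ X                     # (unnormalized) covariance of the inputs
P = X.T @ d                     # (unnormalized) cross-covariance with d
w_star = np.linalg.solve(R, P)  # solves R w = P, i.e. w* = R^{-1} P

# Cross-check against NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, d, rcond=None)
```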

Such an analytic solution exists only if the model is a *linear function of the parameters*, i.e., $y = Xw$ or $y = \phi(X) w$, which is not the case for an MLP output!

* We can find a solution by performing a *search* of the performance surface (governed by the error function $J$).

# The Error Surface for a Linear Neuron

* For a linear neuron, the mean squared error function $J = \frac{1}{2N} \sum_{i=1}^N e_i^2$ is a *convex* function of the weights, and therefore we can apply convex optimization techniques to search for its *minimum*.
    * The most common method to optimize the least squares error is *steepest descent*.

* The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
    * For a linear neuron with a squared error, it is a quadratic bowl.
    * Vertical cross-sections are parabolas.
    * Horizontal cross-sections are ellipses.
    
* For multi-layer, non-linear nets the error surface is much more complicated.
    * But locally, a piece of a quadratic bowl is usually a very good approximation.

## Convergence Speed of Full-Batch Learning

* Going "downhill" reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
    * The gradient is big in the direction in which we only want to travel a small distance.
    * The gradient is small in the direction in which we want to travel a large distance.
    
* Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.

## Learning Rate

$$w^{(t+1)} = w^{(t)} - \eta \nabla J(w^{(t)})$$

* If the learning rate is big, the weights slosh to and fro across the ravine.
    * If the learning rate is too big, this oscillation diverges.
    
* What we would like to achieve:
    * Move quickly in directions with small but consistent gradients.
    * Move slowly in directions with big but inconsistent gradients.
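The effect of the learning rate can be seen on a toy quadratic bowl. The sketch below assumes $J(w) = \frac{1}{2} w^T A w$ with curvatures 1 and 100, so the horizontal cross-sections are elongated ellipses; the matrix `A`, the starting point and the three learning rates are illustrative choices, not values from the lecture.

```python
import numpy as np

A = np.diag([1.0, 100.0])   # shallow along w[0], steep along w[1]

def grad(w):
    """Gradient of J(w) = 0.5 * w^T A w."""
    return A @ w

def run(eta, steps=100):
    """Plain steepest descent from w = (1, 1) for a fixed number of steps."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

w_small = run(eta=0.001)  # stable, but barely moves along the shallow direction
w_good = run(eta=0.019)   # oscillates across the ravine, still converges
w_big = run(eta=0.021)    # oscillation across the ravine diverges

print(w_small, w_good, w_big)
```

Each step multiplies the component along a direction with curvature $\lambda$ by $(1 - \eta\lambda)$, so the steep direction diverges as soon as $\eta > 2/100$, while the shallow direction still shrinks by only about 2% per step.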

## Stochastic Gradient Descent - Online Learning

* If the dataset is redundant, the gradient on the first half is almost identical to the gradient on the second half.
    * So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
    * The extreme version of this approach updates weights after each point. It's called **online learning**.
    
* Mini-batches are usually better than online.
    * Less computation is used for updating the weights.
    * Computing the gradient for many cases simultaneously uses matrix-matrix multiplications which are very efficient, especially with GPUs.

* Mini-batches need to be balanced for classes.
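A minimal mini-batch loop for a linear neuron might look like the sketch below. The synthetic data, the function name `minibatch_sgd` and the hyperparameter values are assumptions for illustration; setting `batch_size=1` gives online learning, and `batch_size=N` gives full-batch learning.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
X = rng.standard_normal((N, D))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.1 * rng.standard_normal(N)

def minibatch_sgd(X, d, eta=0.05, batch_size=32, epochs=20):
    """Mini-batch gradient descent on the mean squared error of a linear model."""
    w = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = d[idx] - X[idx] @ w              # errors on this mini-batch
            grad = -X[idx].T @ err / len(idx)      # gradient of the batch MSE
            w = w - eta * grad
    return w

w_sgd = minibatch_sgd(X, d)
```

The gradient for a whole mini-batch is computed with a single matrix product, which is where the efficiency gain over per-sample updates comes from.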

## Full-Batch vs Mini-Batch Learning

* If we use the full gradient computed from all the training cases, **batch learning**, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
    * The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
    * Multilayer neural nets are not typical of the problems they study, so their methods may need a lot of modifications.
   
* For large neural networks with very large and highly redundant training sets, it is nearly always best to use **mini-batch learning**.
    * The mini-batches may need to be quite big when adapting fancy methods.
    * Big mini-batches are more computationally efficient.

# Mini-Batch Gradient Descent

* We start by guessing a learning rate.
    * If the error keeps getting worse or oscillates wildly, reduce the learning rate.
    * If the error is falling fairly consistently but slowly, increase the learning rate.

* We can write a simple routine to automate this way of adjusting the learning rate.

* Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
    * This removes fluctuations in the final weights caused by the variations between mini-batches.

* Turn down the learning rate when the error stops decreasing.
    * Use the error on a separate validation set.
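One possible routine for turning down the learning rate is sketched below. The class `PlateauSchedule`, its parameters (`factor`, `patience`, `min_eta`) and the list of validation errors are hypothetical and only illustrate the idea of reducing $\eta$ once the validation error stops decreasing.

```python
class PlateauSchedule:
    """Hypothetical helper: cut eta when the validation error stops improving."""

    def __init__(self, eta, factor=0.5, patience=3, min_eta=1e-6):
        self.eta = eta
        self.factor = factor        # multiplicative reduction applied to eta
        self.patience = patience    # epochs without improvement before reducing
        self.min_eta = min_eta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_error):
        """Record one validation error and return the learning rate to use next."""
        if val_error < self.best:
            self.best = val_error
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.eta = max(self.eta * self.factor, self.min_eta)
                self.bad_epochs = 0
        return self.eta

schedule = PlateauSchedule(eta=0.1, factor=0.5, patience=2)
val_errors = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70]  # made-up values
etas = [schedule.step(e) for e in val_errors]
```

With these made-up errors, the rate is halved after the two non-improving epochs following 0.7, and halved again after the two non-improving epochs following 0.69.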

# A Bag of Tricks for Mini-Batch Gradient Descent

<span style="color:blue">**Initializing the Weights**</span>

* If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
    * So they can never learn to be different features.
    * We break symmetry by initializing the weights to have small random values.
    
* If a hidden unit has a big *fan-in*, small changes on many of its incoming weights can cause the learning to overshoot.
    * We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to $1/\sqrt{\text{fan-in}}$.
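A small sketch of fan-in-aware initialization follows. It assumes zero-mean Gaussian weights whose standard deviation is $1/\sqrt{\text{fan-in}}$, so that incoming weights get smaller as the fan-in grows; the helper name `init_weights` and the layer sizes are illustrative, and other scalings are used in practice.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    """Small random weights with standard deviation 1 / sqrt(fan_in)."""
    return rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)

rng = np.random.default_rng(0)
W_small = init_weights(fan_in=10, fan_out=500, rng=rng)
W_large = init_weights(fan_in=1000, fan_out=500, rng=rng)

# Feed unit-variance inputs (200 samples each) through both layers.
V_small = W_small @ rng.standard_normal((10, 200))
V_large = W_large @ rng.standard_normal((1000, 200))

print(W_small.std(), W_large.std())   # weight scale shrinks as fan-in grows
print(V_small.std(), V_large.std())   # pre-activation scale stays comparable
```

Random values break the symmetry between hidden units, and the scaling keeps the summed input to a unit at a similar magnitude whether it has 10 or 1000 incoming weights.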
    
<span style="color:blue">**Shifting the Input (Demeaning)**</span>
* When using the steepest descent, shifting the input values makes a big difference!
    * It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
* The hyperbolic tangent produces hidden activations that are roughly zero mean.
    * In this respect it's better than the logistic function.
    
<span style="color:blue">**Scaling the Input (Unit Variance)**</span>
* When using steepest descent, scaling the input values makes a big difference.
    * It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.

<span style="color:blue">**Decorrelate the Input Components**</span>
* For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.

* There are several different ways to decorrelate inputs. A reasonable method is to use *Principal Component Analysis (PCA)*.
    * Drop the principal components with the smallest eigenvalues.
        * This achieves some dimensionality reduction.
    * Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis-aligned elliptical error surface into a circular one.

* For a circular error surface, the gradient points straight towards the minimum.
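The demeaning, decorrelation and rescaling steps can be combined into PCA whitening, sketched below on synthetic 2-D data. The data-generating choices (scales, rotation angle, offset) are assumptions for the example, and no components are dropped because the data has only two dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D inputs with very different scales along two directions.
Z = rng.standard_normal((2000, 2)) * np.array([5.0, 0.5])
theta = np.pi / 6
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
X = Z @ Rot.T + np.array([3.0, -1.0])

Xc = X - X.mean(axis=0)                # shift: zero mean per component
C = Xc.T @ Xc / len(Xc)                # covariance matrix of the inputs
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order

X_pca = Xc @ eigvecs                   # decorrelate: project on principal axes
X_white = X_pca / np.sqrt(eigvals)     # scale: unit variance per component

C_white = X_white.T @ X_white / len(X_white)
cond_before = eigvals[-1] / eigvals[0]  # ratio of largest to smallest eigenvalue
```

After whitening, `C_white` is the identity matrix up to rounding, which for a linear neuron corresponds to a circular error surface; `cond_before` measures how elongated the ellipse was originally.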

<span style="color:orange">**Common Problems**</span>

**Plateau**
* If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
    * The error derivatives for the hidden units will all become tiny and the error will not decrease.
    * This is usually a *plateau*, but it can be mistaken for a local minimum.

**Turning Down Learning Rate**
* Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.
    * So we get a quick win.
    * But then we get slower learning.
* So, don't turn down the learning rate too soon!

# Optimization for Training Networks

Here are a few ways to speed up mini-batch learning:

**Momentum**
* Instead of using the gradient to change *position* of the weight *particle*, use it to change the *velocity*.

**Adaptive Learning Rate**
* Use separate adaptive learning rates for each parameter.
* Slowly adjust the rate using the consistency of the gradient for that parameter.

**RMSProp (Root Mean Square Propagation)**
* Use separate adaptive learning rates for each parameter.
* Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.
    * This is the mini-batch version of just using the sign of the gradient (method called **RProp** designed for full-batch learning).

**ADAM (Adaptive Moment Estimation)**
* Use separate adaptive learning rates for each parameter.
* An adaptation of RMSProp that uses running averages of both the gradient (first moment) and the squared gradient (second moment).

# The Momentum Learning

* Momentum can be applied to full-batch learning or mini-batch learning.
    * Probably the most common recipe for training deep neural nets is stochastic gradient descent with mini-batches combined with momentum.
    
* Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.
    * The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
    * Its momentum makes it keep going in the previous direction.
    * We need to introduce *viscosity* so that the velocity dies off when we are getting closer to the solution.

* It damps oscillations in directions of high curvature by combining gradients with opposite signs.

* It builds up speed in directions with a gentle but consistent gradient.

\begin{align}
\Delta w_{ji}^{(t)} &= \alpha \Delta w_{ji}^{(t-1)} - \eta  \nabla J(w_{ji}^{(t)})\\
&= \alpha \Delta w_{ji}^{(t-1)} - \eta \delta_j^{(t)} y_i^{(t)}
\end{align}

The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$ which is slightly less than 1 (generally, $\alpha=0.9$).

* At the beginning of learning there may be very large gradients.
    * So it pays off to use a small momentum (e.g. $\alpha=0.5$).
    * Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. $\alpha=0.9$ or even $0.99$).
    
* This allows us to learn at a rate that would have caused divergent oscillations without momentum (case of increased learning rate only).
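The last point can be checked on a toy problem. The sketch below assumes an elongated quadratic bowl $J(w) = \frac{1}{2} w^T A w$ with curvatures 1 and 100, and writes the update so that $w \leftarrow w + \Delta w$ moves downhill. The learning rate $\eta = 0.03$ is chosen on purpose to be too large for plain gradient descent on this bowl; all names and values are illustrative.

```python
import numpy as np

A = np.diag([1.0, 100.0])   # elongated bowl: J(w) = 0.5 * w^T A w

def grad(w):
    return A @ w

def descend(eta, alpha, steps=100):
    """Gradient descent with momentum alpha (alpha = 0 gives plain descent)."""
    w = np.array([1.0, 1.0])
    dw = np.zeros_like(w)                   # velocity, starts at zero
    for _ in range(steps):
        dw = alpha * dw - eta * grad(w)     # gradient changes the velocity
        w = w + dw                          # velocity changes the position
    return w

w_plain = descend(eta=0.03, alpha=0.0)  # no momentum: diverges along w[1]
w_mom = descend(eta=0.03, alpha=0.9)    # same eta with momentum: converges

print(np.linalg.norm(w_plain), np.linalg.norm(w_mom))
```

Gradients of opposite sign along the steep direction partly cancel in the velocity, while the small but consistent gradients along the shallow direction accumulate.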

## Nesterov's Accelerated Gradient Descent

* The standard momentum *first* computes the gradient at the current location and *then* takes a big jump in the direction of the updated accumulated gradient.

* Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.
    * Inspired by the Nesterov method for optimizing convex functions.
* *First* make a big jump in the direction of the previous accumulated gradient.

* *Then* measure the gradient where you end up and make a correction.

$$\Delta w_{ji}^{(t)} = \alpha \Delta w_{ji}^{(t-1)} - \eta \nabla J\left(w_{ji}^{(t)} + \alpha \Delta w_{ji}^{(t-1)}\right)$$
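The "jump first, then correct" order can be sketched as follows, again on an assumed quadratic bowl with curvatures 1 and 100. The look-ahead point is taken as $w + \alpha \Delta w$, i.e. the position reached by following the previous velocity; the function name `nesterov` and the hyperparameters are illustrative.

```python
import numpy as np

A = np.diag([1.0, 100.0])   # J(w) = 0.5 * w^T A w

def grad(w):
    return A @ w

def nesterov(eta, alpha, steps=100):
    """Nesterov-style momentum (alpha = 0 reduces to plain gradient descent)."""
    w = np.array([1.0, 1.0])
    dw = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + alpha * dw                # first: jump along old velocity
        dw = alpha * dw - eta * grad(lookahead)   # then: correct with gradient there
        w = w + dw
    return w

w_nag = nesterov(eta=0.01, alpha=0.9)
w_gd = nesterov(eta=0.01, alpha=0.0)   # baseline without momentum

print(np.linalg.norm(w_gd), np.linalg.norm(w_nag))
```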

# (General) Adaptive Learning Rate

* In multilayer neural networks, the appropriate learning rates can vary widely between weights:
    * The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.
    * The fan-in of a unit determines the size of the "overshoot" effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.

* So use a global learning rate multiplied by an appropriate local gain that is determined empirically for each weight.

* Start with a local gain of 1 for every weight.

* Increase the local gain if the gradient for that weight does not change sign.

* Use small additive increases and multiplicative decreases.
    * This ensures that big gains decay rapidly when oscillations start.
    * If the gradient is totally random, the gain will hover around 1 when we add $\delta$ half the time and multiply by $1-\delta$ the other half.
    
$$\Delta w_{ji} = -\eta g_{ji} \nabla J(w_{ji})$$

\begin{align}
\text{If } &\left(\nabla J(w_{ji}^{(t)}) \times \nabla J(w_{ji}^{(t-1)})\right) \geq 0 \\
&\text{then } g_{ji}(t) = g_{ji}(t-1) + \delta \\
&\text{else } g_{ji}(t) = g_{ji}(t-1) \times (1-\delta)
\end{align}

* Need to limit the gains to lie in some reasonable range, e.g. $[0.1,10]$ or $[0.01,100]$.

* Use full batch learning or very large mini-batches.
    * This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
    
* Adaptive learning rates can be combined with momentum (Jacobs, 1989).
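A minimal full-batch sketch of the local-gain rule is given below, on an assumed quadratic bowl with curvatures 1 and 100. The value $\delta = 0.05$, the gain range $[0.1, 10]$, the learning rate and the function name `gd_with_gains` are illustrative assumptions.

```python
import numpy as np

A = np.diag([1.0, 100.0])   # J(w) = 0.5 * w^T A w

def grad(w):
    return A @ w

def gd_with_gains(eta, delta=0.05, steps=100, g_min=0.1, g_max=10.0):
    """Gradient descent with one adaptive gain per weight."""
    w = np.array([1.0, 1.0])
    gains = np.ones_like(w)                 # local gain of 1 for every weight
    prev_grad = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        same_sign = g * prev_grad >= 0
        # Additive increase when the sign is kept, multiplicative decrease otherwise.
        gains = np.where(same_sign, gains + delta, gains * (1.0 - delta))
        gains = np.clip(gains, g_min, g_max)
        w = w - eta * gains * g
        prev_grad = g
    return w, gains

w_gain, gains = gd_with_gains(eta=0.015)
print(w_gain, gains)
```

In this example the gain along the shallow direction keeps growing because its gradient never changes sign, while the gain along the steep direction is pushed down as soon as the weight starts to oscillate.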

# RProp

* RProp stands for *Resilient BackPropagation*.

* The magnitude of the gradient can be very different for different weights and can change during learning.
    * This makes it hard to choose a single global learning rate.
    
* For full-batch learning, we can deal with this variation by only using the sign of the gradient.
    * The weight updates are all of the same magnitude.
    * This escapes from plateaus with tiny gradients quickly.

$$\Delta w_{ji} = -\eta g_{ji}\text{sign}\left( \nabla J(w_{ji})\right)$$

\begin{align}
\text{If } &\left(\nabla J(w_{ji}^{(t)}) \times \nabla J(w_{ji}^{(t-1)})\right) \geq 0 \\
& \text{then } g_{ji}(t) = g_{ji}(t-1) \times \delta_1 \\
& \text{else } g_{ji}(t) = g_{ji}(t-1) \times \delta_2
\end{align}

* RProp combines the idea of only using the sign of the gradient with the idea of adapting the learning rate separately for each weight.
    * Increase the step size for a weight multiplicatively (by $\delta_1 > 1$, e.g. $1.2$) if the signs of its last two gradients agree.
    * Otherwise decrease the step size multiplicatively (by $\delta_2 < 1$, e.g. $0.5$).

* Use full batch learning or very big mini-batches.
   * This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
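A simplified full-batch RProp loop is sketched below on an assumed quadratic bowl with curvatures 1 and 100. The variable `step` plays the role of $\eta g_{ji}$; the factors 1.2 and 0.5, the initial step and the bounds `step_min` and `step_max` are illustrative assumptions, and refinements found in some RProp variants (such as undoing the previous step after a sign change) are left out.

```python
import numpy as np

A = np.diag([1.0, 100.0])   # J(w) = 0.5 * w^T A w

def grad(w):
    return A @ w

def rprop(steps=100, step0=0.1, up=1.2, down=0.5, step_min=1e-6, step_max=1.0):
    """Simplified RProp: sign-only updates with one adaptive step per weight."""
    w = np.array([1.0, 1.0])
    step = np.full_like(w, step0)          # per-weight step size
    prev_grad = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        agree = g * prev_grad >= 0
        step = np.clip(np.where(agree, step * up, step * down),
                       step_min, step_max)
        w = w - step * np.sign(g)          # only the sign of the gradient is used
        prev_grad = g
    return w

w_rprop = rprop()
print(w_rprop)
```

Because only the sign is used, both weights follow the same trajectory here even though their gradients differ by a factor of 100.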

# RMSProp

* RProp is equivalent to using the gradient but also dividing by the size of the gradient.
    * The problem with mini-batch RProp is that we divide by a different number for each mini-batch.

* RMSProp keeps a moving average of the squared gradient for each weight.

$$v_{ji}^{(t)} = \gamma v_{ji}^{(t-1)} + (1-\gamma) \left(\nabla J(w_{ji}^{(t)})\right)^2$$

$$w_{ji}^{(t+1)} = w_{ji}^{(t)} - \frac{\eta}{\sqrt{v_{ji}^{(t)}}} \nabla J(w_{ji}^{(t)}) $$

where $v_{ji}^{(t)}$ is the moving average of the squared gradient for weight $w_{ji}$ and $\gamma$ is its decay rate.
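The RMSProp update can be sketched as below, on an assumed quadratic bowl with curvatures 1 and 100. The small constant `eps` in the denominator is an assumption added here to avoid division by zero and is not part of the formulas above; the decay rate 0.9, the learning rate and the number of steps are illustrative.

```python
import numpy as np

A = np.diag([1.0, 100.0])   # J(w) = 0.5 * w^T A w

def grad(w):
    return A @ w

def rmsprop(eta=0.01, gamma=0.9, eps=1e-8, steps=500):
    """RMSProp: divide each gradient by a running RMS of recent gradients."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)                       # moving average of squared gradients
    for _ in range(steps):
        g = grad(w)
        v = gamma * v + (1.0 - gamma) * g ** 2
        w = w - eta * g / (np.sqrt(v) + eps)   # eps (assumed) avoids dividing by zero
    return w

w_rms = rmsprop()
print(w_rms)
```

Dividing by the running root mean square makes the size of each update roughly independent of the scale of the gradient, so the steep and shallow directions progress at a similar pace.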

# ADAM

* ADAM is a combination of RMSProp and momentum.

* ADAM keeps moving averages of both the gradient and the squared gradient.

* ADAM includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin.
    * Thus, unlike in ADAM, the RMSProp second-order moment estimate may have high bias early in training.

$$m_w^{(t+1)} = \beta_1 m_w^{(t)} + (1-\beta_1)\nabla J(w^{(t)})$$

$$v_w^{(t+1)} = \beta_2 v_w^{(t)} + (1-\beta_2)\left(\nabla J(w^{(t)})\right)^2$$

$$\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^{t+1}}\text{ (bias correction)}$$

$$\hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^{t+1}}\text{ (bias correction)}$$

$$w^{(t+1)} = w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}$$

* Commonly used values are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.
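The ADAM equations translate almost line by line into the sketch below, again on an assumed quadratic bowl with curvatures 1 and 100. The step counter `t` starts at 1 here, which corresponds to the exponent $t+1$ in the bias-correction formulas when $t$ starts at 0; the learning rate and the number of steps are illustrative.

```python
import numpy as np

A = np.diag([1.0, 100.0])   # J(w) = 0.5 * w^T A w

def grad(w):
    return A @ w

def adam(eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """ADAM: momentum on the gradient plus RMSProp-style scaling."""
    w = np.array([1.0, 1.0])
    m = np.zeros_like(w)                         # first-moment estimate
    v = np.zeros_like(w)                         # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)           # bias correction
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_adam = adam()
print(w_adam)
```

Without the bias corrections, `m` and `v` would start close to zero and the first updates would be poorly scaled.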

# Summary of Learning Methods for Neural Networks

* For small datasets (e.g. 10,000 samples) or bigger datasets without much redundancy, use a full-batch method.
    * AdaGrad, RProp, ...
    
* For big, redundant datasets use mini-batches.
    * Try gradient descent with momentum.
    * Try RMSProp.
    * Try ADAM.

*Why is there no simple recipe?*

* There are lots of different network architectures.
* Tasks differ a lot
    * Some require very accurate weights, some don't.
    * Some have many very rare cases (e.g. words).

# Recommended Reading

Chapter 8, "Optimization for Training Deep Models", from *Deep Learning* by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
* http://www.deeplearningbook.org/