# Neural Networks for Machine Learning (6)

### Overview of Mini-batch Gradient Descent

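The error surface lies in a space with one horizontal axis for each weight and one vertical axis for the error. For a linear neuron with a squared error, it is a quadratic bowl.

Full-batch learning converges slowly when the quadratic bowl is an elongated ellipse, because the direction of steepest descent then points mostly across the ravine rather than towards the minimum

- Too big a learning rate makes the weights oscillate across the ravine
- We want to move quickly in directions with small but consistent gradients
- We want to move slowly in directions with big but inconsistent gradients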
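
A minimal sketch of the oscillation problem on an elongated quadratic bowl. The function name `descend`, the curvature values, and the starting point are assumptions chosen for illustration, not values from the lecture.

```python
def descend(lr, steps=50, curvatures=(1.0, 25.0), start=(10.0, 1.0)):
    """Plain gradient descent on E(w) = 0.5 * sum(c_i * w_i**2)."""
    w = list(start)
    for _ in range(steps):
        grad = [c * wi for c, wi in zip(curvatures, w)]  # dE/dw_i = c_i * w_i
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

# Small rate: the steep direction converges, the shallow one barely moves.
slow = descend(lr=0.01)
# Larger rate: |1 - 0.09 * 25| > 1, so the steep direction oscillates and grows.
unstable = descend(lr=0.09)
```

With a single global learning rate, the steep direction limits how large the rate can be, while the shallow direction is the one that needs the large rate.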

##### Stochastic gradient descent

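If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half, so the weights can be updated after each half instead of once per full pass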

Mini-batches are usually better than online learning
- Less computation is used updating the weights (compared with online)
- The gradients for many cases can be computed simultaneously using matrix-matrix multiplies, which is especially efficient on GPUs
- Mini-batches need to be balanced for classes

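Two types of learning algorithm:
- Full gradient computed from all training cases: can be sped up with methods such as non-linear conjugate gradient
- Large NN with very large, highly redundant training sets: use mini-batch learning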

##### Basic mini-batch gradient descent algorithm

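- Guess an initial learning rate
- Adjust the learning rate based on how the error behaves: reduce it if the error gets worse or oscillates wildly, increase it if the error falls consistently but slowly
- Turn down the learning rate towards the end of mini-batch learning, when the error stops decreasing (use the error on a separate validation set)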
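
The steps above can be sketched for a one-weight linear model. Everything here is an illustrative assumption rather than the lecture's recipe: the function name `minibatch_sgd`, the synthetic data, the batch size, and the rule of halving the rate whenever the validation error fails to improve.

```python
import random

def minibatch_sgd(train, valid, lr=0.1, batch_size=10, epochs=30, decay=0.5):
    """Fit y = w * x by mini-batch gradient descent on squared error."""
    rng = random.Random(0)
    train = list(train)              # copy so the caller's list is not shuffled
    w = 0.0
    best_valid = float("inf")
    for _ in range(epochs):
        rng.shuffle(train)
        for i in range(0, len(train), batch_size):
            batch = train[i:i + batch_size]
            # Average gradient of 0.5 * (w*x - y)**2 over the mini-batch.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
        valid_error = sum((w * x - y) ** 2 for x, y in valid) / len(valid)
        if valid_error >= best_valid:
            lr *= decay              # error stopped decreasing: turn the rate down
        best_valid = min(best_valid, valid_error)
    return w, lr

# Synthetic data with true weight 3.0 and a little noise (assumed for the demo).
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
points = [(x, 3.0 * x + data_rng.gauss(0, 0.1)) for x in xs]
w, final_lr = minibatch_sgd(points[:150], points[150:])
```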

### A Bag of Tricks for Mini-batch Gradient Descent

##### Initializing the weights:

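- Initialize the weights with small random values, so that hidden units with the same inputs do not stay identical (symmetry breaking)
- Scale the initial weights in proportion to 1/sqrt(fan-in): a unit with a big fan-in needs smaller incoming weights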
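
A minimal sketch of fan-in-scaled initialization. It assumes the intended scaling is a standard deviation proportional to 1/sqrt(fan-in), so that a larger fan-in gives smaller weights. The Gaussian distribution and the function name `init_weights` are illustrative choices.

```python
import math
import random

def init_weights(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix of small random values."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(fan_in)   # assumed scaling: smaller weights for bigger fan-in
    return [[rng.gauss(0.0, scale) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = init_weights(fan_in=100, fan_out=50)
```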

##### Shifting the inputs

Shifting means adding a constant to each component of the input

- Make each component of the input vector have zero mean over the whole training set; this makes the error surface closer to a nice circle
- The hyperbolic tangent, $\tanh(x)=2\,\mathrm{logistic}(2x)-1$, produces hidden activations that are roughly zero mean

##### Scaling the inputs

Transform so that each input has unit variance
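
Shifting and scaling can be sketched together as below. The helper name `standardize` and the sample data are hypothetical, and the sketch assumes no input component is constant (a zero standard deviation would divide by zero).

```python
import math

def standardize(rows):
    """Shift each column to zero mean and scale it to unit variance."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dims)]
    return [[(r[j] - means[j]) / stds[j] for j in range(dims)] for r in rows]

# Two input components on very different scales (made-up values).
data = [[101.0, 0.1], [99.0, 0.3], [100.0, 0.2], [102.0, 0.4]]
scaled = standardize(data)
```

In practice the means and standard deviations would be computed on the training set and reused unchanged for validation and test inputs.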

##### More thorough method: decorrelate the input components

A reasonable method is PCA:
 - Drop the principal components with the smallest eigenvalues
 - Divide the remaining principal components by the square roots of their eigenvalues
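
The two steps can be written as a single transform. The notation is an assumption, not from the original notes: $\mu$ is the training-set mean, $C$ is the input covariance, $U\Lambda U^{\top}$ is its eigendecomposition, and $U_k$, $\Lambda_k$ keep only the $k$ largest eigenvalues.

$C=\frac{1}{N}\sum_{n}(x_n-\mu)(x_n-\mu)^{\top}=U\Lambda U^{\top}$  
$z=\Lambda_k^{-1/2}\,U_k^{\top}(x-\mu)$  

Under these definitions the covariance of $z$ is $\Lambda_k^{-1/2}U_k^{\top}CU_k\Lambda_k^{-1/2}=I$, so the transformed components are uncorrelated and have unit variance over the training set.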

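##### Common problems that occur in multilayer networks

1. Starting with too big a learning rate drives the weights of each hidden unit to be very big and positive or very big and negative; the error derivatives then become tiny and the error stops decreasing (a plateau that is often mistaken for a local minimum)
2. In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of the time it should be 1; the network finds this quickly and may take a long time to improve on it (another plateau that looks like a local minimum)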

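##### Be careful about turning down the learning rate

Don't turn down the learning rate too soon or too much: it reduces the random fluctuations in the error caused by different mini-batches, which gives a quick win but slower learning afterwards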

##### Four ways to speed up mini-batch learning

1. Use "momentum" -> use the gradient to change the velocity of the weights rather than their position
2. Use separate adaptive learning rates for each parameter -> slowly adjust the rate using the consistency of the gradient for that parameter
3. rmsprop -> divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight (a mini-batch version of just using the sign of the gradient)
4. Use curvature information

### The Momentum Method

- It damps oscillations in directions of high curvature by combining gradients with opposite signs
- It builds up speed in directions with a gentle but consistent gradient

##### The equations of momentum method
$v(t)=\alpha v(t-1)-\epsilon \frac{\partial{E}}{\partial{w}}(t)$  
$\Delta w(t)=v(t)=\alpha v(t-1)-\epsilon \frac{\partial{E}}{\partial{w}}(t)=\alpha \Delta w(t-1)-\epsilon \frac{\partial{E}}{\partial{w}}(t)$  
The weight change equals the current velocity, so it can also be written in terms of the previous weight change and the current gradient

##### The behavior of momentum method:
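- On a tilted plane (constant gradient), the velocity reaches a terminal value: $v(\infty)=\frac{1}{1-\alpha}\left(-\epsilon\frac{\partial{E}}{\partial{w}}\right)$
- At the beginning of learning there may be very large gradients, so it pays to start with a small momentum and raise it once the large gradients have disappeared

Use a small learning rate with a large momentum: with $\alpha$ close to 1, the step along a consistent gradient is $\frac{1}{1-\alpha}$ times larger than with $\epsilon$ alone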
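
A small numerical check of the terminal velocity on a tilted plane, where the gradient is constant. The function name `momentum_velocity` and the values of the gradient, $\epsilon$ and $\alpha$ are arbitrary assumptions for the demonstration.

```python
def momentum_velocity(grad, lr=0.01, alpha=0.9, steps=200):
    """Iterate v(t) = alpha * v(t-1) - lr * grad with a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = alpha * v - lr * grad
    return v

grad = 2.0                                   # constant slope of the plane
v_final = momentum_velocity(grad)
v_terminal = -0.01 * grad / (1 - 0.9)        # (1 / (1 - alpha)) * (-lr * grad)
```

After enough steps `v_final` agrees with `v_terminal`, which is ten times the plain gradient step for $\alpha=0.9$.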

##### Better type of momentum
- Standard: compute the gradient at the current location, then take a big jump in the direction of the updated accumulated gradient
- Better way (Nesterov momentum):
 1. Make a big jump in the direction of the previous accumulated gradient
 2. Measure the gradient where you end up and make a correction

![](pics\6-3-1.png)
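
The "jump first, then correct" ordering can be sketched as follows. The function name `nesterov_step`, the hyperparameter values, and the toy error function $E=\frac{1}{2}w^2$ are assumptions for illustration.

```python
def nesterov_step(w, v, grad_fn, lr=0.1, alpha=0.9):
    """One Nesterov-style momentum update for a single weight."""
    lookahead = w + alpha * v                     # 1. jump along the previous accumulated gradient
    v_new = alpha * v - lr * grad_fn(lookahead)   # 2. measure the gradient there and correct
    return w + v_new, v_new

# Toy problem: E(w) = 0.5 * w**2, so dE/dw = w and the minimum is at w = 0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x)
```

The only difference from standard momentum is where the gradient is evaluated: at `lookahead` rather than at `w`.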

### A Separate, Adaptive Learning Rate for Each Connection

##### The intuition behind separate adaptive learning rates
- In a multilayer net, the appropriate learning rate can vary widely between weights

##### Determine individual learning rate

$\Delta w_{ij}=-\epsilon\, g_{ij}\frac{\partial{E}}{\partial{w_{ij}}}$  
if $\frac{\partial{E}}{\partial{w_{ij}}}(t)\,\frac{\partial{E}}{\partial{w_{ij}}}(t-1)>0$  
then $g_{ij}(t)=g_{ij}(t-1)+0.05$, else $g_{ij}(t)=g_{ij}(t-1)\times 0.95$

This ensures that big gains decay rapidly when oscillations start

The gain will hover around 1 when gradients are totally random
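
The gain rule can be sketched for a single weight as below. The function names and the clamping range `[0.1, 10]` are assumptions added for illustration; the additive `0.05` and multiplicative `0.95` come from the rule above.

```python
def update_gain(gain, grad, prev_grad, lo=0.1, hi=10.0):
    """Adapt the per-weight gain from the agreement of successive gradient signs."""
    if grad * prev_grad > 0:
        gain += 0.05          # signs agree: small additive increase
    else:
        gain *= 0.95          # signs disagree: multiplicative decrease
    return min(max(gain, lo), hi)   # assumed clamp to keep gains in a sane range

def weight_delta(grad, gain, lr=0.01):
    """Weight change: -epsilon * gain * gradient."""
    return -lr * gain * grad

g_up = update_gain(1.0, grad=0.5, prev_grad=0.2)     # same sign
g_down = update_gain(2.0, grad=0.5, prev_grad=-0.2)  # sign flipped
```

Because the decrease is multiplicative, a large gain loses more in absolute terms than a small one when the gradient sign flips.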

##### Tricks for making adaptive learning rates work better
- limit gains to lie in some reasonable range
- use full batch learning or very big mini-batches
- adaptive learning rates can be combined with momentum

### Rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude

##### rprop: using only the sign of the gradient

rprop uses only the sign of the gradient, with a step size adapted separately for each weight; it can escape plateaus with tiny gradients quickly
- Increase the step size for a weight multiplicatively if the signs of its last two gradients agree
- Otherwise decrease the step size multiplicatively
- Limit step sizes to be less than 50 and more than a millionth
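
A minimal single-weight sketch of the rprop rule. The multiplicative factors `1.2` and `0.5` and the function name `rprop_step` are assumptions, since the notes do not state them; the step-size limits follow the list above.

```python
def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    """One rprop update: adapt the step size, then move by the gradient's sign."""
    if grad * prev_grad > 0:
        step = min(step * up, step_max)      # last two signs agree: grow the step
    else:
        step = max(step * down, step_min)    # otherwise: shrink the step
    sign = (grad > 0) - (grad < 0)
    return w - sign * step, step

# A large gradient and a tiny one move the weight by comparable amounts.
w_a, step_a = rprop_step(1.0, 0.1, grad=3.0, prev_grad=2.0)
w_b, step_b = rprop_step(1.0, 0.1, grad=-1e-9, prev_grad=2.0)
```

The magnitude of the gradient never enters the update, which is why tiny gradients on a plateau do not slow rprop down.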

##### Why rprop does not work with mini-batches

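It violates the central idea behind stochastic gradient descent: when the learning rate is small, the gradient is effectively averaged over successive mini-batches  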
Example:

![](pics\6-5-1.png)
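
The mismatch can be reproduced with a few lines of arithmetic. The gradient values, learning rate and step size below are assumed for illustration: one weight sees a small positive gradient on nine mini-batches and a large negative gradient on the tenth.

```python
def sign(x):
    return (x > 0) - (x < 0)

grads = [0.1] * 9 + [-0.9]    # assumed per-mini-batch gradients for one weight
lr = 0.01                     # learning rate for plain SGD
step = 0.01                   # fixed step size for the sign-only update

# Plain SGD: the gradients roughly cancel, so the weight stays about where it is.
sgd_change = sum(-lr * g for g in grads)
# Sign-only update: nine steps one way, one step back, so the weight drifts.
sign_change = sum(-step * sign(g) for g in grads)
```

Here `sgd_change` is essentially zero while `sign_change` is about eight steps in one direction, even though the averaged gradient gives no reason to move.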

##### rmsprop: A mini-batch version of rprop

![](pics\6-5-2.png)
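
A minimal single-weight sketch of the idea in the section title: divide the gradient by the square root of a running average of its squared magnitude. The decay of `0.9`, the learning rate, the small constant `eps` guarding against division by zero, and the function names are assumptions for illustration.

```python
import math

def rmsprop_step(w, mean_square, grad, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update for a single weight."""
    # Running average of the squared gradient for this weight.
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    # Divide the gradient by its recent root-mean-square magnitude.
    w -= lr * grad / (math.sqrt(mean_square) + eps)
    return w, mean_square

def final_step_size(grad, steps=200):
    """Size of the last update when the same gradient is seen repeatedly."""
    w, ms = 0.0, 0.0
    for _ in range(steps):
        w_prev = w
        w, ms = rmsprop_step(w, ms, grad)
    return abs(w - w_prev)

big = final_step_size(4.0)
small = final_step_size(0.04)
```

With a steady gradient, both `big` and `small` settle near `lr`, so the update size depends on the learning rate rather than on the raw gradient magnitude, while the running average still smooths over successive mini-batches.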

##### Further development of rmsprop

- Combine rmsprop with standard momentum
- Combine rmsprop with Nesterov momentum
- Combine rmsprop with adaptive learning rates for each connection
- Other methods related to rmsprop

##### Summary

- Small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy: use a full-batch method
 1. Conjugate gradient
 2. Adaptive learning rates
- Big, redundant datasets: use mini-batches
 1. Gradient descent with momentum
 2. rmsprop