# Neural Network Optimizers

Several options let you trade off training time against the accuracy of your neural network and deep learning models. In this module you learn about key concepts that come into play during model training, including optimizers and data shuffling. You will also gain hands-on practice using Keras, one of the go-to libraries for deep learning.

Learning Objectives
- Understand neural networks as the main building block of deep learning
- Become familiar with the optimization techniques used in neural networks, including gradient descent and data shuffling

# Optimizers and Data Shuffling

## Optimizers and Momentum

### Optimizers
We have considered approaches to gradient descent
that vary the number of data points involved in a step.

However, they have all used the standard update formula:

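$$
W := W - \alpha \cdot \nabla J
$$

Here, $\alpha$ is the learning rate and $\nabla J$ is the gradient of the loss $J$ with respect to the weights $W$.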

There are several variants of this weight update that give better performance in practice.

These successive "tweaks" each attempt to improve on the previous idea.

The resulting (often complicated) methods are referred to as "optimizers".

### Momentum

Idea: only change direction by a little bit each time.

Keeps a "running average" of the step directions, smoothing out the variation of the individual points.

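$v_t := \eta \cdot v_{t-1} + \alpha \cdot \nabla J$

$W := W - v_t$

Here, $v_t$ accumulates past gradients $\nabla J$, so it is subtracted from the weights, and $\eta$ is referred to as the "momentum".
It is generally given a value less than 1; 0.9 is a common choice.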
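
The momentum update can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `momentum_step`, the toy loss $J(w) = w^2$, and the values `lr=0.1` and `momentum=0.9` are assumptions chosen for the example. It uses the convention in which the velocity accumulates gradients and is then subtracted from the weight.

```python
def momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    """One momentum update (hypothetical helper, for illustration only)."""
    # Running average of step directions: keep most of the old velocity,
    # then add the new gradient scaled by the learning rate.
    v = momentum * v + lr * grad
    # Move the weight against the accumulated direction.
    return w - v, v


# Toy loss J(w) = w**2, whose gradient is 2 * w (assumed for this sketch).
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=2 * w)

print(w)  # very close to the minimum at w = 0
```

Because the velocity carries over between steps, the weight overshoots the minimum and oscillates before settling, which is the behaviour Nesterov momentum tries to dampen.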

![](./images/16_Momentum.png)

#### Nesterov Momentum

Idea: control "overshooting" by looking ahead.

Apply gradient only to the "non-momentum" component.


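$v_t := \eta \cdot v_{t-1} + \alpha \cdot \nabla J(W - \eta \cdot v_{t-1})$

$W := W - v_t$

The gradient is evaluated at the look-ahead point $W - \eta \cdot v_{t-1}$ rather than at $W$.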
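
A minimal sketch of the look-ahead idea, assuming the same toy loss $J(w) = w^2$ as a stand-in for a real model. The names `nesterov_step` and `toy_gradient` and the default hyperparameters are hypothetical, chosen only for this example.

```python
def toy_gradient(w):
    """Gradient of the assumed toy loss J(w) = w**2."""
    return 2 * w


def nesterov_step(w, v, grad_fn, lr=0.1, momentum=0.9):
    """One Nesterov momentum update (illustrative sketch)."""
    # Look ahead to where the momentum term alone would carry the weight.
    lookahead = w - momentum * v
    # Evaluate the gradient at the look-ahead point, not at w itself.
    v = momentum * v + lr * grad_fn(lookahead)
    return w - v, v


w, v = 5.0, 0.0
for _ in range(300):
    w, v = nesterov_step(w, v, toy_gradient)

print(w)  # very close to the minimum at w = 0
```

The only difference from plain momentum is where the gradient is evaluated; the velocity update and the weight update keep the same form.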


![](./images/17_NesterovMomentum.png)



## Popular Optimizers

### AdaGrad

Idea: scale the update for each weight separately.
1. Update frequently-updated weights less.
2. Keep running sum of previous updates.
3. Divide new updates by factor of previous sum.

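With starting point $G_i(0) = 0$:

$G_i(t) = G_i(t - 1) + \left(\dfrac{\partial J}{\partial w_i}(t)\right)^2$ -> $G_i$ will continue to increase.

$w_i := w_i - \dfrac{\alpha}{\sqrt{G_i(t)} + \epsilon} \cdot \dfrac{\partial J}{\partial w_i}(t)$ -> This leads to smaller updates each iteration.

Here, $\alpha$ is the learning rate and $\epsilon$ is a small constant that avoids division by zero.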
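
The per-weight scaling can be sketched as follows. This is an illustrative example under stated assumptions: the helper names `adagrad_step` and `toy_gradient`, the two-weight toy loss $J = 50 w_0^2 + 0.5 w_1^2$, and the values `lr=0.5` and `eps=1e-8` are all made up for the demonstration.

```python
import math


def adagrad_step(weights, grad_sq_sums, grads, lr=0.5, eps=1e-8):
    """One AdaGrad update applied to each weight separately (sketch)."""
    # Keep a running sum of squared gradients per weight; it never decreases.
    new_sums = [s + g * g for s, g in zip(grad_sq_sums, grads)]
    # Divide each weight's step by the square root of its own running sum.
    new_weights = [
        w - lr / (math.sqrt(s) + eps) * g
        for w, s, g in zip(weights, new_sums, grads)
    ]
    return new_weights, new_sums


def toy_gradient(weights):
    """Gradient of the assumed toy loss J = 50 * w0**2 + 0.5 * w1**2."""
    return [100 * weights[0], weights[1]]


weights, sums = [1.0, 1.0], [0.0, 0.0]
history = []
for _ in range(3):
    weights, sums = adagrad_step(weights, sums, toy_gradient(weights))
    history.append((list(weights), list(sums)))
    print(weights, sums)
```

On the first step both weights move by roughly `lr`, even though their gradients differ by a factor of 100, because each gradient is divided by the square root of its own accumulated sum.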

### RMSProp 

Quite similar to AdaGrad.
- Rather than using the sum of previous squared gradients, decay older gradients more than more recent ones.
- More adaptive to recent updates.
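
One common way to write this idea is to replace the running sum with an exponentially decaying average of squared gradients. The sketch below assumes that form; the name `rmsprop_step`, the decay rate of 0.9, and the artificial gradient sequence are assumptions for illustration, not values taken from this module.

```python
import math


def rmsprop_step(w, avg_sq_grad, grad, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp-style update for a single weight (illustrative sketch)."""
    # Decaying average: old squared gradients fade, recent ones dominate.
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad * grad
    # Scale the step by the inverse square root of the decaying average.
    w = w - lr / (math.sqrt(avg_sq_grad) + eps) * grad
    return w, avg_sq_grad


# One large gradient followed by many small ones (made-up sequence).
grads = [10.0] + [0.1] * 100
w, avg = 1.0, 0.0
for g in grads:
    w, avg = rmsprop_step(w, avg, g)

# What an AdaGrad-style running sum would hold for the same sequence.
running_sum = sum(g * g for g in grads)
print(avg, running_sum)
```

The decaying average ends up close to the recent squared gradient, while the running sum is still dominated by the single large gradient from the first step.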

![](./images/18_AdamOptimizer.png)

### Which Should You Use?

RMSProp and Adam seem to be quite popular. From 2012 to 2017, approximately 23% of
deep learning papers submitted to arXiv (a popular platform for research in Deep Learning)
mentioned using the Adam approach.

It can be difficult to predict in advance which will be best for a particular problem.

This is still an active area of inquiry.


## Details of Training Neural Networks

Learning Goals
- Details of training Neural Network models
- Stochastic gradient descent
- Batching approaches and terminology

Given an example (or group of examples),
we know how to compute the derivative for each weight.
1. How exactly do we update the weights?
2. How often? (After each training data point? After all the training data points?)

### What Next? - Gradient Descent

Classical approach: get derivative for entire data set, then take a step in that direction.
- Pros: Each step is informed by all the data.
- Cons: Very slow, especially as data gets big.

$W_{new} = W_{old} - lr \cdot derivative$
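
As a minimal sketch of the classical approach, the example below fits a single weight to a tiny data set, computing the derivative over every data point before each step. The data, the mean squared error loss, the learning rate, and the helper name `full_batch_gradient` are assumptions made for the illustration.

```python
def full_batch_gradient(w, xs, ys):
    """Derivative of the mean squared error of y = w * x over the whole data set."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x

w, lr = 0.0, 0.05
for _ in range(100):
    # One step per pass over the data: W_new = W_old - lr * derivative.
    w = w - lr * full_batch_gradient(w, xs, ys)

print(w)  # approaches 2.0
```

Every step here touches all four data points, which is why the cost per step grows with the size of the data set.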

### Stochastic Gradient Descent

Get derivative for just one point, and take a step in that direction.
- Steps are "less informed", but you take more of them.
- Should "balance out".
- Probably want a smaller step size.
- Also helps "regularize".

### Compromise Approach: Mini-batch

Get derivative for a "small" set of points, then take a step in that direction.
- Typical mini-batch sizes are 16 or 32.
- Strikes a balance between two extremes.
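
The three approaches differ only in how many points feed each step, so a single loop with a `batch_size` parameter can sketch all of them. This is an illustration under assumptions: the function name `minibatch_gradient_descent`, the tiny data set, the mean squared error loss, and the hyperparameters are made up for the example.

```python
def minibatch_gradient_descent(xs, ys, batch_size, lr=0.01, epochs=50):
    """Fit y = w * x, taking one step per batch (illustrative sketch)."""
    w, steps = 0.0, 0
    n = len(xs)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            batch = range(start, min(start + batch_size, n))
            # Average the derivative over the points in this batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            steps += 1
    return w, steps


xs = [float(x) for x in range(1, 9)]
ys = [2 * x for x in xs]  # generated by y = 2 * x

for batch_size in (1, 4, 8):  # single point, mini-batch, entire data set
    w, steps = minibatch_gradient_descent(xs, ys, batch_size)
    print(batch_size, steps, round(w, 4))
```

With `batch_size=1` the loop behaves like stochastic gradient descent, and with `batch_size=len(xs)` it becomes the classical full-batch approach; the smaller the batch, the more steps are taken per pass over the data.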

![](./images/19_ComparisonOfBatchingApproaches.png)

#### Batching Terminology

- Full-batch: Use entire data set to compute gradient before updating.
- Mini-batch: Use a smaller portion of the data (but more than a single example) to compute gradient before updating.
- Stochastic Gradient Descent (SGD): Use a single example to compute gradient before updating (though people sometimes use SGD to refer to mini-batch as well).

- Epoch: A single pass through all of the training data.
  - In full-batch gradient descent, there would be one step taken per epoch.
  - In SGD / online learning, there would be n steps taken per epoch (n = training set size).
  - In mini-batch, there would be (n / batch size) steps taken per epoch, rounded up when a final, smaller batch is used.
- When training, we often refer to the number of epochs needed for the model to be "trained".
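
The step counts above can be checked with a short calculation. This sketch assumes that a final, smaller batch still counts as a step; the helper name `steps_per_epoch` and the training set size of 1000 are hypothetical.

```python
import math


def steps_per_epoch(n_examples, batch_size):
    """Number of weight updates in one pass over the training data."""
    # Round up so that a final partial batch still counts as one step.
    return math.ceil(n_examples / batch_size)


n = 1000
print(steps_per_epoch(n, n))   # full-batch: 1 step per epoch
print(steps_per_epoch(n, 1))   # SGD: 1000 steps per epoch
print(steps_per_epoch(n, 32))  # mini-batch of 32: 32 steps per epoch
```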


## Data Shuffling


To avoid any cyclical movement and aid convergence,
it is recommended to shuffle the data after each epoch.

This way,
the data is not seen in the same order every time,
and the batches are not the exact same ones.
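
A minimal sketch of per-epoch shuffling, using example indices rather than real data. The helper name `epoch_batches`, the data set size of 10, the batch size of 4, and the fixed seed are assumptions for the illustration.

```python
import random


def epoch_batches(n_examples, batch_size, rng):
    """Return this epoch's batches as lists of shuffled example indices."""
    indices = list(range(n_examples))
    # Reshuffle before every epoch so the order and the batches change.
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]


rng = random.Random(42)  # seeded only to make the sketch reproducible
epochs = [epoch_batches(10, 4, rng) for _ in range(3)]
for epoch, batches in enumerate(epochs):
    print(epoch, batches)
```

Each epoch still visits every example exactly once; only the order and the grouping into batches change. In Keras, `Model.fit` exposes a `shuffle` argument for this purpose.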

![](./images/20_TrainingInAction.png)





## Transforms

### Scaling Inputs

In our discussion of backpropagation we briefly touched on the formula
for the gradient used to update the values of our weights W:

$\dfrac{\partial J}{\partial W^{(i)}} = (\hat y - y) \cdot a^{(i)}$

And at each iteration of gradient descent:

$W_{new} = W_{old} - lr \cdot derivative$

When $i = 0$, $a^{(0)}$ is the input $X$, so the input values are part of the derivative used to compute $W_{new}$.

This means that if we do not normalize the input values,
the weights attached to inputs with larger values will update much more quickly than those attached to inputs with smaller values.

This imbalance can greatly slow down the speed at which our model converges.
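
The imbalance is easy to see with a couple of numbers. The sketch below assumes a single training example with two hypothetical input features on very different scales, plus made-up values for the prediction and the target; `weight_gradients` is an illustrative helper, not part of any library.

```python
def weight_gradients(inputs, y_hat, y):
    """First-layer gradient (y_hat - y) * x for each input value x."""
    return [(y_hat - y) * x for x in inputs]


# Two unscaled features: one near 0.5, one near 2000 (assumed values).
grads = weight_gradients([0.5, 2000.0], y_hat=1.2, y=1.0)
ratio = grads[1] / grads[0]
print(grads, ratio)
```

The same prediction error produces a gradient about 4000 times larger for the second weight, purely because of the scale of its input.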

#### Ways to Scale Inputs
Linear scaling to the interval [0,1]:

$x_i=\dfrac{x_i-x_{min}}{x_{max}-x_{min}}$

Linear scaling to the interval [-1,1]:

$x_i=2\left(\dfrac{x_i-x_{min}}{x_{max}-x_{min}}\right)-1$
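
Both scalings can be sketched in plain Python. The function names and the sample values are hypothetical, and the sketch assumes the feature is not constant (otherwise $x_{max} - x_{min}$ would be zero).

```python
def scale_to_unit_interval(values):
    """Linear scaling to [0, 1]; assumes max(values) != min(values)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def scale_to_symmetric_interval(values):
    """Linear scaling to [-1, 1], built on the [0, 1] scaling."""
    return [2 * u - 1 for u in scale_to_unit_interval(values)]


ages = [18.0, 30.0, 42.0, 66.0]  # made-up feature values
print(scale_to_unit_interval(ages))       # [0.0, 0.25, 0.5, 1.0]
print(scale_to_symmetric_interval(ages))  # [-1.0, -0.5, 0.0, 1.0]
```

In practice, $x_{min}$ and $x_{max}$ are usually computed on the training data and then reused unchanged for validation and test data.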