In [1]:
from utils import *

## Optimization for Training Deep Models
This chapter focuses on one particular case of optimization: finding the parameters $\boldsymbol{\theta}$ of a neural network that significantly reduce a cost function $J(\boldsymbol{\theta})$, which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

### Momentum Optimization

Regular Gradient Descent simply updates the weights $\theta$ by directly subtracting the gradient of the cost function $J(\theta)$ with respect to the weights ($\nabla_\theta J(\theta)$) multiplied by the learning rate $\eta$. The equation is $\theta\leftarrow\theta -\eta\nabla_\theta J(\theta)$. It does not care what the earlier gradients were. If the local gradient is tiny, it goes very slowly.

Momentum aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. It cares a great deal about what the previous gradients were: at each iteration, it adds the local gradient to the *momentum vector* $\mathbf{m}$ (multiplied by the learning rate $\eta$), and it updates the weights by simply subtracting this momentum vector. 

In other words, the gradient is used as an acceleration, not as a speed. To simulate some sort of friction and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, simply called the *momentum*, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

\begin{eqnarray}
&&\mathbf{m}\leftarrow \beta\mathbf{m} + \eta\nabla_\theta J(\theta)\\
&&\theta \leftarrow \theta - \mathbf{m}
\end{eqnarray}
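
To make the update rule concrete, here is a minimal pure-Python sketch (not the TensorFlow implementation) that applies plain Gradient Descent and Momentum optimization to a toy one-dimensional cost $J(\theta) = \frac{1}{2}\theta^2$. The cost function, the starting point and the helper names `grad`, `gradient_descent` and `momentum_descent` are assumptions made for illustration.

```python
def grad(theta):
    # Gradient of the toy cost J(theta) = 0.5 * theta**2 (illustrative assumption).
    return theta

def gradient_descent(theta, eta, n_steps):
    # theta <- theta - eta * grad(theta)
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

def momentum_descent(theta, eta, beta, n_steps):
    # m <- beta * m + eta * grad(theta), then theta <- theta - m
    m = 0.0
    for _ in range(n_steps):
        m = beta * m + eta * grad(theta)
        theta = theta - m
    return theta

theta_gd = gradient_descent(1.0, eta=0.01, n_steps=100)
theta_mom = momentum_descent(1.0, eta=0.01, beta=0.9, n_steps=100)
print('distance from the minimum, plain GD:', abs(theta_gd))
print('distance from the minimum, momentum:', abs(theta_mom))
```

With such a small learning rate, the momentum run should end much closer to the minimum at 0 after the same number of steps, because gradients that keep pointing the same way accumulate in $\mathbf{m}$.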

In [3]:
learning_rate = 0.01

In [4]:
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)

### Nesterov Accelerated Gradient
Nesterov Accelerated Gradient measures the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. The only difference from vanilla Momentum optimization is that the gradient is measured at $\theta - \beta\mathbf{m}$ rather than at $\theta$ (the sign is negative because, in the convention used here, the weights are updated by subtracting $\mathbf{m}$).

\begin{eqnarray}
\mathbf{m} &\leftarrow & \beta\mathbf{m} + \eta\nabla_\theta J(\theta-\beta \mathbf{m})\\
\theta & \leftarrow & \theta - \mathbf{m}
\end{eqnarray}

This small tweak works because in general the momentum vector will be pointing in the right direction (i.e. toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than using the gradient at the original position. One can interpret Nesterov momentum as attempting to add a *correction factor* to the standard method of momentum.
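
A minimal pure-Python sketch of the look-ahead step, again on the toy cost $J(\theta) = \frac{1}{2}\theta^2$ (the cost and the names `grad` and `nesterov_descent` are assumptions for illustration). Because the weights are updated with $\theta\leftarrow\theta-\mathbf{m}$, the look-ahead point used in the code is $\theta - \beta\mathbf{m}$.

```python
def grad(theta):
    # Gradient of the toy cost J(theta) = 0.5 * theta**2 (illustrative assumption).
    return theta

def nesterov_descent(theta, eta, beta, n_steps):
    m = 0.0
    path = [theta]
    for _ in range(n_steps):
        lookahead = theta - beta * m          # where the momentum alone would take us
        m = beta * m + eta * grad(lookahead)  # gradient measured at the look-ahead point
        theta = theta - m
        path.append(theta)
    return path

path = nesterov_descent(1.0, eta=0.01, beta=0.9, n_steps=100)
print('first step:', path[1])
print('distance from the minimum after 100 steps:', abs(path[-1]))
```

On the very first step $\mathbf{m}$ is zero, so the look-ahead point coincides with $\theta$ and the update is identical to plain momentum; the two methods only diverge once some momentum has built up.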

In the convex batch gradient case, Nesterov momentum brings the rate of convergence of the excess error from $O(1/k)$ (after $k$ steps) to $O(1/k^2)$. Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence.
![nesterov](images/momentum.png)

In [5]:
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9, use_nesterov=True)

## Algorithms with Adaptive Learning Rates
### AdaGrad
- Individually adapts the learning rates of all the model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient.
- The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. 
- The net effect is greater progress in the more gently sloped directions of parameter space.
- Empirically, for training deep neural network models, the accumulation of squared gradients *from the beginning of training* can result in a premature and excessive decrease in the effective learning rate.

\begin{eqnarray}
\mathbf{s} &\leftarrow & \mathbf{s}+\nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
\theta &\leftarrow &\theta - \eta\nabla_\theta J(\theta)\oslash\sqrt{\mathbf{s} + \epsilon}
\end{eqnarray}

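Although AdaGrad works well for convex problems, it is generally not recommended for deep learning, because the effective learning rate can shrink so much that training stalls before reaching the optimum.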
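
The per-parameter scaling can be sketched in a few lines of pure Python. The toy cost (100 times steeper along the first parameter than along the second), the value $\eta = 0.1$ and the helper names are assumptions chosen for illustration, not part of the original notebook.

```python
import math

def grads(theta):
    # Gradient of the toy cost J = 0.5 * (100 * theta[0]**2 + theta[1]**2):
    # steep along the first parameter, gentle along the second.
    return [100.0 * theta[0], theta[1]]

def adagrad_step(theta, s, g, eta=0.1, eps=1e-10):
    # s accumulates squared gradients; each parameter is scaled by 1 / sqrt(s + eps).
    s = [s_i + g_i * g_i for s_i, g_i in zip(s, g)]
    theta = [th - eta * g_i / math.sqrt(s_i + eps)
             for th, s_i, g_i in zip(theta, s, g)]
    return theta, s

def effective_rates(s, eta=0.1, eps=1e-10):
    # Learning rate actually applied to each parameter's gradient.
    return [eta / math.sqrt(s_i + eps) for s_i in s]

theta, s = [1.0, 1.0], [0.0, 0.0]
theta, s = adagrad_step(theta, s, grads(theta))
theta_after_1 = list(theta)
rates_after_1 = effective_rates(s)

for _ in range(19):
    theta, s = adagrad_step(theta, s, grads(theta))
rates_after_20 = effective_rates(s)

print('parameters after 1 step:', theta_after_1)
print('effective rates after 1 step:', rates_after_1)
print('effective rates after 20 steps:', rates_after_20)
```

After the first step both parameters have moved by the same amount even though their gradients differ by a factor of 100, and the effective rates can only shrink from then on because $\mathbf{s}$ never decreases.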

### RMSProp
- Modifies AdaGrad to perform better in the nonconvex setting by changing the gradient accumulation into an exponentially weighted moving average.
- The exponentially decaying average discards history from the extreme past, so the algorithm can converge rapidly after finding a convex bowl.


\begin{eqnarray}
\mathbf{s} &\leftarrow & \beta\mathbf{s} + (1-\beta)\nabla_\theta J(\theta)\otimes \nabla_\theta J(\theta)\\
\theta &\leftarrow & \theta - \eta\nabla_\theta J(\theta)\oslash \sqrt{\mathbf{s} + \epsilon}
\end{eqnarray}

The decay rate $\beta$ is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well.
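
The effect of replacing the running sum with a decaying average shows up even with a constant gradient. The sketch below is a pure-Python illustration under assumed values ($g = 1$, $\eta = 0.01$, 1000 steps); the function name `effective_rates` is hypothetical.

```python
import math

def effective_rates(n_steps, eta=0.01, beta=0.9, eps=1e-10, g=1.0):
    # Feed the same gradient g to both accumulators and compare eta / sqrt(s + eps).
    s_ada, s_rms = 0.0, 0.0
    for _ in range(n_steps):
        s_ada = s_ada + g * g                       # AdaGrad: sum over all history
        s_rms = beta * s_rms + (1 - beta) * g * g   # RMSProp: decaying average
    return eta / math.sqrt(s_ada + eps), eta / math.sqrt(s_rms + eps)

ada_rate, rms_rate = effective_rates(1000)
print('AdaGrad effective learning rate:', ada_rate)
print('RMSProp effective learning rate:', rms_rate)
```

AdaGrad's accumulator keeps growing, so its effective learning rate keeps falling, whereas the RMSProp accumulator settles near $g^2$ and the effective learning rate stays close to $\eta$.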

In [6]:
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate, momentum=0.9, decay=0.9, epsilon=1.e-10)

### Adam
Adam, which stands for *adaptive moment estimation*, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients.

\begin{eqnarray}
\mathbf{m} &\leftarrow & \beta_1\mathbf{m} + (1 - \beta_1)\nabla_\theta J(\theta)\\
\mathbf{s} &\leftarrow & \beta_2\mathbf{s} + (1-\beta_2)\nabla_\theta J(\theta)\otimes\nabla_\theta J(\theta)\\
\hat{\mathbf{m}} &\leftarrow & \frac{\mathbf{m}}{1-\beta_1^T} \\
\hat{\mathbf{s}} &\leftarrow & \frac{\mathbf{s}}{1-\beta_2^T} \\
\theta & \leftarrow & \theta - \eta\hat{\mathbf{m}}\oslash\sqrt{\hat{\mathbf{s}} + \epsilon}
\end{eqnarray}
* $T$ represents the iteration number (starting at 1)

Steps 3 and 4 are somewhat of a technical detail: since $\mathbf{m}$ and $\mathbf{s}$ are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost $\mathbf{m}$ and $\mathbf{s}$ at the beginning of training.

- The momentum decay hyperparameter $\beta_1$ is typically initialized to 0.9.
- The scaling decay hyperparameter $\beta_2$ is often initialized to 0.999.
- The smoothing term $\epsilon$ is usually initialized to a tiny number such as 10$^{-8}$. These are the default values for TensorFlow's `AdamOptimizer`.
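
A single-parameter sketch of one Adam step in pure Python shows what the bias correction does at the start of training. This is an illustration, not TensorFlow's implementation: the function name `adam_step` and the example gradient are assumptions, and the bias-corrected estimates are kept in separate variables (`m_hat`, `s_hat`) so that the running averages `m` and `s` are not overwritten.

```python
import math

def adam_step(theta, m, s, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
    s = beta2 * s + (1 - beta2) * g * g    # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - eta * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s, m_hat, s_hat

# First iteration with an example gradient of 0.5 (illustrative value).
theta, m, s, m_hat, s_hat = adam_step(theta=1.0, m=0.0, s=0.0, g=0.5, t=1)
print('raw averages:      ', m, s)
print('corrected averages:', m_hat, s_hat)
print('updated parameter: ', theta)
```

At the first iteration the raw averages are only a small fraction of the gradient and its square, while the corrected values recover them, so the first update has a size of roughly $\eta$.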

In [7]:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

In fact, since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter $\eta$. You can often use the default value $\eta = 0.001$.

## Learning Rate Scheduling
*Predetermined piecewise constant learning rate*
> For example, set the learning rate to $\eta_0=0.1$ at first, then to $\eta_1=0.001$ after 50 epochs. Requires fiddling around.

*Performance scheduling*
> Measure the validation error every $N$ steps (just like for early stopping) and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.

*Exponential scheduling*
> Set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_0 10^{-t/r}$. This works great, but it requires tuning $\eta_0$ and $r$. The learning rate will drop by a factor of 10 every $r$ steps.

*Power scheduling*
> Set the learning rate to $\eta(t) = \eta_0(1+t/r)^{-c}$. The hyperparameter $c$ is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.
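
The four schedules above can be sketched as plain Python functions. The default values ($\eta_0 = 0.1$, $r = 10000$, $\lambda = 2$) and the function names are assumptions for illustration; in particular, the performance schedule here reduces the rate as soon as a single validation measurement fails to improve on the best one so far, which is only one possible reading of "stops dropping".

```python
def piecewise_constant(epoch, eta0=0.1, eta1=0.001, boundary=50):
    # eta0 before the boundary epoch, eta1 from then on.
    return eta0 if epoch < boundary else eta1

def exponential_schedule(t, eta0=0.1, r=10000):
    # eta(t) = eta0 * 10^(-t / r)
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=10000, c=1):
    # eta(t) = eta0 * (1 + t / r)^(-c)
    return eta0 * (1 + t / r) ** (-c)

def performance_schedule(val_errors, eta0=0.1, lam=2.0):
    # Divide eta by lam whenever the validation error does not beat the best so far.
    eta, best, rates = eta0, float('inf'), []
    for err in val_errors:
        if err < best:
            best = err
        else:
            eta = eta / lam
        rates.append(eta)
    return rates

for t in (0, 10000, 20000):
    print(t, exponential_schedule(t), power_schedule(t))
print(performance_schedule([0.5, 0.4, 0.45, 0.3]))
```

Comparing the two columns printed by the loop makes the last remark above concrete: after $r$ steps the exponential schedule has divided the rate by 10, while the power schedule with $c = 1$ has only halved it.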

In [8]:
reset_graph()

n_inputs = 28* 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None,), name='y')  # (None,) is a 1-D shape; (None) is just None

with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name='hidden1')
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name='hidden2')
    logits = tf.layers.dense(hidden2, n_outputs, name='outputs')

with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

In [9]:
with tf.name_scope('train'):
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    global_step = tf.Variable(0, trainable=False, name='global_step')
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
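
With its default `staircase=False`, `tf.train.exponential_decay` computes `learning_rate * decay_rate ** (global_step / decay_steps)`, which is the exponential schedule described earlier with $r$ equal to `decay_steps` when `decay_rate` is 1/10. The pure-Python sketch below reproduces that formula to estimate the learning rate at the end of training. The step count is an assumption: it supposes that the training loop further down runs 5 epochs over 55,000 examples in batches of 50, and that `shuffle_batch` (from `utils`, not shown here) yields 1,100 batches per epoch.

```python
def decayed_learning_rate(global_step, initial_learning_rate=0.1,
                          decay_steps=10000, decay_rate=0.1):
    # Same formula as the non-staircase exponential decay used above.
    return initial_learning_rate * decay_rate ** (global_step / decay_steps)

steps_per_epoch = 55000 // 50        # assumed: one batch per 50 training examples
total_steps = 5 * steps_per_epoch    # assumed: n_epochs = 5

print('learning rate at step 0:', decayed_learning_rate(0))
print('learning rate after', total_steps, 'steps:', decayed_learning_rate(total_steps))
```

Under these assumptions training stops after 5,500 steps, well before the first full factor-of-10 drop at step 10,000, with the learning rate at roughly 0.028.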

In [10]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

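Load the MNIST data, scale the pixel values to the range 0 to 1, and hold out the first 5,000 training images as a validation set.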

In [11]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

In [12]:
n_epochs = 5
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, 'Validation accuracy:', accuracy_val)
    save_path = saver.save(sess, './my_model_final.ckpt')

0 Validation accuracy: 0.9612
1 Validation accuracy: 0.9744
2 Validation accuracy: 0.9746
3 Validation accuracy: 0.9822
4 Validation accuracy: 0.9826


In [13]:
show_graph(tf.get_default_graph())