# Chapter 8 Optimization for Training Deep Models

* 손고리즘 / 손고리즘 ML Part 2 - Deep Learning [1]
* 김무성

# Contents

* 8.1 Optimization for Model Training
* 8.2 Challenges in Optimization
* 8.3 Optimization Algorithms I: Basic Algorithms
* 8.4 Optimization Algorithms II: Adaptive Learning Rates
* 8.5 Optimization Algorithms III: Approximate Second-Order Methods
* 8.6 Optimization Algorithms IV: Natural Gradient Methods
* 8.7 Optimization Strategies and Meta-Algorithms
* 8.8 Hints, Global Optimization and Curriculum Learning

<img src="figures/cap11.1.png"  />

# 8.1 Optimization for Model Training

* 8.1.1 Empirical Risk Minimization
* 8.1.2 Surrogate Loss Functions
* 8.1.3 Batch and Minibatch Algorithms
* 8.1.4 Generalization and Early Stopping

## 8.1.1 Empirical Risk Minimization

<img src="figures/cap11.2.png"  />

## 8.1.2 Surrogate Loss Functions

## 8.1.3 Batch and Minibatch Algorithms

<img src="figures/cap11.3.png"  />

<img src="figures/cap11.4.png"  />

<img src="figures/cap11.5.png"  />

Minibatch sizes are generally driven by the following factors:

* Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
* Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
* If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
* Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
* Small batches can offer a regularizing effect. Generalization error is often best for a batch size of 1, though this might take a very long time to train and require a small learning rate to maintain stability.
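The "less than linear returns" in the first bullet can be illustrated with a small simulation. This is only a sketch under a loud assumption: per-example gradients are modeled as i.i.d. draws from a standard normal distribution, and the function name `gradient_estimate_spread` is made up for this illustration. Under that assumption, the spread of the minibatch-mean estimate shrinks roughly with the square root of the batch size, so 16 times more examples buys only about a 4 times more accurate estimate.

```python
import random
import statistics


def gradient_estimate_spread(batch_size, trials=2000, seed=0):
    """Standard deviation of the minibatch-mean estimate of a scalar 'gradient'.

    Assumption (illustration only): each per-example gradient is an
    independent draw from a normal distribution with mean 0 and std 1.
    """
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)


spread_16 = gradient_estimate_spread(batch_size=16)
spread_256 = gradient_estimate_spread(batch_size=256)

# 16x more examples per batch, but the estimate is only about 4x tighter.
improvement = spread_16 / spread_256
print(f"batch 16: {spread_16:.4f}, batch 256: {spread_256:.4f}, "
      f"improvement: {improvement:.2f}x")
```

The compute cost per minibatch grows linearly with the batch size, while the accuracy of the estimate grows only with its square root, which is one reason very large batches are rarely worth it.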

Different kinds of algorithms use different kinds of information in different ways, and some are more sensitive to sampling error than others.

Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes J(x) for one minibatch of examples x at the same time that we compute the update for several other minibatches. This is discussed further in Section 12.1.3.

## 8.1.4 Generalization and Early Stopping

In machine learning, we typically minimize an objective function defined as an expectation of some per-example loss across the training set:

<img src="figures/cap11.6.png"  />

However, we would usually prefer to minimize the corresponding objective function where the expectation is taken across the data generating distribution rather than just the finite training set:

<img src="figures/cap11.7.png"  />

In other words, we care about generalization error, not training error.

Usually, we use an optimization algorithm based on minibatch estimates of the gradient. During the first stages of learning, this is equivalent to minimizing the generalization error directly. After we have used up the training data and begin to repeat minibatches, the two criteria are different.

This is the main way in which optimization for machine learning is actually different from traditional optimization, rather than just a special case of optimization. Many neural network optimization algorithms are implicitly designed in ways that are intended to yield better results in terms of generalization error, even if they perform worse as an optimization algorithm (yield worse training error or minimize the training error more slowly).
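Early stopping, named in this section's heading, is one common example of trading training error for generalization error. The following is a minimal sketch, not the book's algorithm: it assumes we already have a list of per-epoch validation losses, and the function name `early_stopping_epoch` and the `patience` parameter are hypothetical names chosen for this illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose parameters would be kept.

    Training is treated as stopped once the validation loss has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember this epoch.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop looking further.
            break
    return best_epoch


# Made-up validation losses: they improve, then start to rise (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
best = early_stopping_epoch(val_losses, patience=3)
print(best)
```

Note that the last value (0.40) is never reached, because training has already stopped. The procedure deliberately accepts a worse training error in exchange for a lower held-out error.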

# 8.2 Challenges in Optimization

* 8.2.1 Local Minima
* 8.2.2 Ill-Conditioning
* 8.2.3 Plateaus, Saddle Points, and Other Flat Regions
* 8.2.4 Cliffs and Exploding Gradients
* 8.2.5 Vanishing and Exploding Gradients - An Introduction to the Issue of Learning Long-Term Dependencies
* 8.2.6 Inexact Gradients
* 8.2.7 Theoretical Limits of Optimization

## 8.2.1 Local Minima

<img src="figures/cap4.1.png" />
<img src="figures/cap4.2.png" />
<img src="figures/cap4.3.png" />

## 8.2.2 Ill-Conditioning

Conditioning refers to how rapidly a function changes with respect to small changes in its input.

## 8.2.3 Plateaus, Saddle Points, and Other Flat Regions

Theoretical work has shown that saddle points (and the flat regions surrounding them) are important barriers to training neural networks, and may be more important than local minima.

Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation, because rounding errors in the inputs can result in large changes in the output.
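A small numerical illustration of this sensitivity, as a sketch only: the 2x2 system below and the helper `solve_2x2` are made up for this example. The matrix is nearly singular (poorly conditioned), so changing one input by 0.0001 moves the solution from roughly (1, 1) to roughly (0, 2).

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b using Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    x1 = (b[0] * a22 - a12 * b[1]) / det
    x2 = (a11 * b[1] - b[0] * a21) / det
    return x1, x2


# The two rows are almost identical, so the matrix is nearly singular.
A = [[1.0, 1.0],
     [1.0, 1.0001]]

x = solve_2x2(A, [2.0, 2.0001])            # approximately (1, 1)
x_perturbed = solve_2x2(A, [2.0, 2.0002])  # approximately (0, 2)

print(x, x_perturbed)
```

A change in the fifth significant digit of the input changes the first digit of the output, which is exactly the situation in which rounding errors become dangerous.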

<img src="figures/cap4.4.png"  />

## 8.2.4 Cliffs and Exploding Gradients

Whereas the issues of ill-conditioning and saddle points discussed in the previous sections arise because of the second-order structure of the objective function (as a function of the parameters), neural networks involve stronger non-linearities which do not fit well with this picture.

Second-order methods and momentum or gradient-averaging methods introduced in Section 8.5 are able to reduce the difficulty due to ill-conditioning by increasing the size of the steps in the low-curvature directions (the “valley”, in Figure 8.1) and decreasing the size of the steps in the high-curvature directions (the steep sides of the valley, in the figure).
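The rescaling described above can be sketched on a toy problem. The assumptions here are mine, not the book's: the objective is a quadratic with a diagonal Hessian, 0.5 * (1 * x^2 + 100 * y^2), so x is the low-curvature "valley" direction and y is the steep direction, and a Newton step stands in for second-order methods in general. All function names are hypothetical.

```python
def gradient(point, curvatures):
    """Gradient of 0.5 * sum(c_i * p_i**2), a quadratic with diagonal Hessian."""
    return [c * p for c, p in zip(curvatures, point)]


def gradient_descent_step(point, curvatures, lr):
    """Plain gradient descent: the same learning rate in every direction."""
    grad = gradient(point, curvatures)
    return [p - lr * g for p, g in zip(point, grad)]


def newton_step(point, curvatures):
    """Newton step: each gradient component is divided by its curvature.

    For a diagonal Hessian the inverse is just 1 / curvature, so steps are
    enlarged along low-curvature directions and shrunk along steep ones.
    """
    grad = gradient(point, curvatures)
    return [p - g / c for p, g, c in zip(point, grad, curvatures)]


curvatures = [1.0, 100.0]   # low curvature along x, high curvature along y
start = [1.0, 1.0]

# The learning rate is limited by the steep direction (it must stay below 2/100),
# so progress along the flat direction is slow.
gd_point = start
for _ in range(100):
    gd_point = gradient_descent_step(gd_point, curvatures, lr=0.01)

newton_point = newton_step(start, curvatures)

print(gd_point, newton_point)
```

After 100 gradient descent steps the x coordinate has only shrunk to about 0.37, while a single Newton step lands on the minimum at the origin. This only holds so cleanly because the objective is exactly quadratic.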

<img src="figures/cap11.8.png"  />

However, although classical second-order methods can help, as shown in Figure 8.2, due to higher-order derivatives, the objective function may have a lot more non-linearity, which often does not have the nice symmetrical shapes that the second-order “valley” picture builds in our mind. Instead, there are cliffs where the gradient rises sharply.

When the parameters approach a cliff region, the gradient update step can move the learner towards a very bad configuration, ruining much of the progress made during recent training iterations.

<img src="figures/cap11.9.png"  />

As illustrated in Figure 8.3, the cliff can be dangerous whether we approach it from above or from below, but fortunately there are some fairly straightforward heuristics that allow one to avoid its most serious consequences.

The basic idea is to limit the size of the jumps that one would make. Indeed, one should keep in mind that when we use the gradient to make an update of the parameters, we are relying on the assumption of infinitesimal moves.

The only thing that is guaranteed is that a small enough step in that direction will be helpful.

The gradient clipping heuristics are described in more detail in Section 10.8.7. The basic idea is to bound the magnitude of the update step, i.e., not trust the gradient too much when it is very large in magnitude.
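One common variant of this heuristic is clipping by norm: if the gradient norm exceeds a threshold, the gradient is rescaled to have exactly that norm, which keeps its direction but bounds the step. The sketch below is an illustration under that assumption and is not necessarily the exact form given in Section 10.8.7; the function name `clip_gradient_by_norm` and the threshold `max_norm` are hypothetical.

```python
import math


def clip_gradient_by_norm(grad, max_norm):
    """Rescale `grad` so that its Euclidean norm is at most `max_norm`.

    The direction of the gradient is preserved; only its length is bounded.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        # Small gradients are trusted and returned unchanged.
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


# Near a cliff the gradient can be huge: norm 50 is scaled down to norm 5.
clipped = clip_gradient_by_norm([30.0, 40.0], max_norm=5.0)

# An ordinary gradient (norm 0.5) passes through untouched.
unchanged = clip_gradient_by_norm([0.3, 0.4], max_norm=5.0)

print(clipped, unchanged)
```

The parameter update is then computed from the clipped gradient, so a single step near a cliff can never move the parameters farther than the learning rate times `max_norm`.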

<img src="figures/cap11.10.png"  />

## 8.2.5 Vanishing and Exploding Gradients - An Introduction to the Issue of Learning Long-Term Dependencies

* Exploding or Vanishing Product of Jacobians
* Consequence for Recurrent Networks: Difficulty of Learning Long-Term Dependencies

### Exploding or Vanishing Product of Jacobians

<img src="figures/cap11.11.png"  />

<img src="figures/cap11.12.png"  />

<img src="figures/cap11.13.png"  />

### Consequence for Recurrent Networks: Difficulty of Learning Long-Term Dependencies

<img src="figures/cap11.14.png"  />

<img src="figures/cap11.15.png"  />

<img src="figures/cap11.16.png"  />

<img src="figures/cap11.17.png"  />

## 8.2.6 Inexact Gradients

## 8.2.7 Theoretical Limits of Optimization

# 8.3 Optimization Algorithms I: Basic Algorithms

* 8.3.1 Gradient Descent
* 8.3.2 Stochastic Gradient Descent
* 8.3.3 Online Gradient Descent Minimizes Generalization Error
* 8.3.4 Momentum
* 8.3.5 Nesterov Momentum

## 8.3.1 Gradient Descent

<img src="figures/cap11.18.png"  />

## 8.3.2 Stochastic Gradient Descent

<img src="figures/cap11.19.png"  />

<img src="figures/cap11.20.png"  />

<img src="figures/cap11.21.png"  />

## 8.3.3 Online Gradient Descent Minimizes Generalization Error

<img src="figures/cap11.22.png"  />

<img src="figures/cap11.23.png"  />

<img src="figures/cap11.24.png"  />

## 8.3.4 Momentum

<img src="figures/cap11.25.png"  />

<img src="figures/cap11.26.png"  />

<img src="figures/cap11.27.png"  />

<img src="figures/cap11.28.png"  />

## 8.3.5 Nesterov Momentum

<img src="figures/cap11.29.png"  />

<img src="figures/cap11.30.png"  />

<img src="figures/cap11.31.png"  />

# 8.4 Optimization Algorithms II: Adaptive Learning Rates

* 8.4.1 AdaGrad
* 8.4.2 RMSprop
* 8.4.3 Adam
* 8.4.4 AdaDelta
* 8.4.5 Choosing the Right Optimization Algorithm

## 8.4.1 AdaGrad

<img src="figures/cap11.32.png"  />

## 8.4.2 RMSprop

<img src="figures/cap11.33.png"  />

## 8.4.3 Adam

<img src="figures/cap11.34.png"  />

## 8.4.4 AdaDelta

<img src="figures/cap11.35.png"  />

<img src="figures/cap11.36.png"  />

## 8.4.5 Choosing the Right Optimization Algorithm

<img src="figures/cap11.37.png"  />

# 8.5 Optimization Algorithms III: Approximate Second-Order Methods

* 8.5.1 Newton’s Method
* 8.5.2 Conjugate Gradients
* 8.5.3 BFGS

## 8.5.1 Newton’s Method

<img src="figures/cap11.38.png"  />

<img src="figures/cap11.39.png"  />

<img src="figures/cap11.40.png"  />

## 8.5.2 Conjugate Gradients

<img src="figures/cap11.41.png"  />

## 8.5.3 BFGS

<img src="figures/cap11.42.png"  />

# 8.6 Optimization Algorithms IV: Natural Gradient Methods

# 8.7 Optimization Strategies and Meta-Algorithms

* 8.7.1 Batch Normalization
* 8.7.2 Coordinate Descent
* 8.7.3 Initialization Strategies
* 8.7.4 Greedy Supervised Pre-training
* 8.7.5 Designing Models to Aid Optimization

## 8.7.1 Batch Normalization

## 8.7.2 Coordinate Descent

## 8.7.3 Initialization Strategies

<img src="figures/cap11.43.png"  />

<img src="figures/cap11.44.png"  />

## 8.7.4 Greedy Supervised Pre-training

## 8.7.5 Designing Models to Aid Optimization

# 8.8 Hints, Global Optimization and Curriculum Learning

<img src="figures/cap11.45.png"  />

# References

* [1] Bengio's book - Chapter 8 Optimization for Training Deep Models - http://www.iro.umontreal.ca/~bengioy/dlbook/version-07-08-2015/optimization.html
* [2] Optimization, higher-level representations, image features - http://vision.stanford.edu/teaching/cs231n/slides/lecture4.pdf
* [3] Getting Neural Networks to work: cross-validation process, optimization, debugging - http://vision.stanford.edu/teaching/cs231n/slides/lecture6.pdf
* [4] Loss functions for classification - https://en.wikipedia.org/wiki/Loss_functions_for_classification
* [5] Surrogate Loss Functions in Machine Learning - http://fa.bianp.net/blog/2014/surrogate-loss-functions-in-machine-learning/
* [6] Condition number - https://en.wikipedia.org/wiki/Condition_number