<a id="regularize"></a>
# Regularization / overfitting 



"Empirical risk minimization" situation

"Holdout set", "test error", we would like to measure generalization.


<img src="http://drive.google.com/uc?export=view&id=1dgGMzI4cEwVeyQsSJKcIJSBcTrEimc9v"  width=900 heigth=900>


<a href="https://media.nature.com/m685/nature-assets/nmeth/journal/v13/n9/images/nmeth.3968-F1.jpg"><img src="https://drive.google.com/uc?export=view&id=1S1cL3LEsY9j8QA7iH1VkEZIjU8UpKJ_Z"></a>



## Neural networks should not work...

- Neural networks, even with one hidden layer, are **universal approximators** and **massively overparametrized**
- Their **degrees of freedom**, **capacity**, or ability to have an **absurd amount of "curvature"** as functions is huge
- This only gets worse when the number of layers is greater than 2.

<img src="https://cs231n.github.io/assets/nn1/layer_sizes.jpeg">


## ...And did in fact not work for a while

"... neural network should be able to overfit training data given sufficient training iterations and a legitimate learning algorithm, especially considering that Brady et al. (1989) showed that an inferior algorithm was able to overfit the data. Therefore, this phenomenon should have played a critical role
in the research of improving the optimization techniques. Recently, the studying of cost surfaces of neural networks have indicated the existence of saddle points (Choromanska et al., 2015; Dauphin et al., 2014; Pascanu et al., 2014), which may explain the findings of Brady et al back in the late 80s."

[Source](https://arxiv.org/pdf/1702.07800.pdf)


## But now it works nonetheless. WHY?

As mentioned many times: 

- "Some technical developments"
    - Part of this is the material for this class
- Hardware + amount of data...


## I. "Capacity" usage as the progressive growth of weights during training

Each training step changes the decision surface by moving the weights one "step" downhill, in the direction of the negative error gradient.

<img src="https://iamtrask.github.io/img/sgd_optimal.png">
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/23104835/Qc281.jpg">

- If data is not linearly separable, curvature of decision surface increases progressively. 
- With some simplification, we can assume that the weights will increase to produce a higher-order polynomial.

Training can be understood to **increase the absolute value of weights** to utilize more and more capacity.

Small SGD based NN training on MNIST: charted the sum of absolute values of weights during the epochs.

<img src="http://drive.google.com/uc?export=view&id=1mbWqsUFHX-suGkJ15ZxNQcZvchBJprPf">
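The same kind of chart can be reproduced on a much smaller problem. The following is a minimal sketch in plain Python, assuming a 1-D logistic regression on a hypothetical, linearly separable toy dataset instead of a neural network on MNIST; the function name and the learning rate are illustrative choices, not part of the original experiment.

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=50):
    """Full-batch gradient descent on 1-D logistic regression.

    Returns |w| + |b| recorded after every epoch.
    """
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x                      # d(cross-entropy)/dw
            grad_b += (p - y)                          # d(cross-entropy)/db
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
        history.append(abs(w) + abs(b))
    return history

# Hypothetical toy data: negatives on the left, positives on the right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
history = train_logistic(xs, ys)
print(history[0], history[-1])
```

On separable data the gradient never vanishes, so the sum of absolute weights keeps growing for as long as training continues, which is the behaviour the chart above illustrates.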



## Early stopping

Why do we check validation data in every epoch?

<a href="https://deeplearning4j.org/images/guide/earlystopping.png"><img src="https://drive.google.com/uc?export=view&id=1XaVTakoOBgvVnNutu0JQlHoKCggG-MiO" width=50%></a>

<img src="https://elitedatascience.com/wp-content/uploads/2017/09/early-stopping-graphic.jpg">

**We stop, where validation error starts to increase.**

### "Starts to"?

An example of a validation error curve might look like this:

<img src="http://drive.google.com/uc?export=view&id=1tfV_C9pS8QZUKqVbXxY47tXEa1hDOMzK" width="70%">

- If the validation error is a constantly updating, noisy series, it might not be the best idea to stop the process just because *one* epoch caused an increase in validation error. 
- We should "wait a bit" or do some smoothing. (A moving average, as in TensorBoard's default smoothing? A continuous increase over several epochs?...)
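One way to make the "wait a bit" advice concrete is to combine a moving average with a patience counter. This is a sketch under assumptions: the window size, the patience value, the function names and the validation curve are all hypothetical, and the rule is only one of many possible heuristics.

```python
def moving_average(values, window):
    """Trailing moving average; shorter windows are used at the start."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def early_stopping_epoch(val_errors, window=3, patience=2):
    """Return (stop_epoch, best_epoch) for a recorded validation curve.

    We stop once the smoothed validation error has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    smoothed = moving_average(val_errors, window)
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(smoothed):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(smoothed) - 1, best_epoch

# Hypothetical noisy curve: a one-epoch bump at epoch 2, real overfitting later.
val_errors = [1.00, 0.80, 0.85, 0.60, 0.50, 0.45, 0.47, 0.52, 0.60, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_errors)
print(stop_epoch, best_epoch)
```

The isolated bump at epoch 2 does not trigger a stop, while the sustained increase at the end does. Note that smoothing makes the detected minimum lag behind the raw curve, so in practice one would restore the checkpoint with the best raw validation error.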

#### Early stopping meta-algorithm

On the one hand we have to "get a feel" for it, on the other hand there are attempts to come up with heuristics.

<img src="http://drive.google.com/uc?export=view&id=1XhKuVB6mqj5ZMxsUBK8VgWF8uceP0ra-" width="70%">

([Source: Goodfellow, Bengio & Courville: Deep Learning (MIT, 2016)](http://www.deeplearningbook.org/contents/regularization.html))

There are several approaches for the systematic timing of early stopping.
**A good summary is well worth reading [here](http://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf).**
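As an illustration of such a systematic criterion, the sketch below implements the "generalization loss" rule described in the linked summary: stop as soon as the validation error exceeds the best value seen so far by more than a threshold (in percent). The function names, the threshold and the validation curve are hypothetical choices for this example.

```python
def generalization_loss(val_errors):
    """GL(t) = 100 * (E_va(t) / E_opt(t) - 1), E_opt(t) = lowest error so far."""
    losses, best = [], float("inf")
    for err in val_errors:
        best = min(best, err)
        losses.append(100.0 * (err / best - 1.0))
    return losses

def stop_epoch_gl(val_errors, alpha=5.0):
    """First epoch whose generalization loss exceeds alpha percent, else None."""
    for epoch, gl in enumerate(generalization_loss(val_errors)):
        if gl > alpha:
            return epoch
    return None

val_errors = [1.00, 0.80, 0.60, 0.50, 0.51, 0.52, 0.56]  # hypothetical curve
print(stop_epoch_gl(val_errors))
```

A larger threshold tolerates more noise but stops later; the sketch assumes strictly positive validation errors.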


## Classic regularization - weight norm

We would like to achieve the increase of performance with the smallest possible amount of weight increase during training. "Do your best with the least capacity".

### Added terms to the cost function

How can we achieve this?
We should add some regularizing terms to the cost. These are the "additional constraints" mentioned in the beginning, and, unsurprisingly, they can be the same as in the case of linear models.

<img src="https://image.slidesharecdn.com/10-170929054651/95/neural-network-part2-7-638.jpg?cb=1506665197">

"Eror" :-D

[Source](https://www.slideshare.net/21_venkat/neural-network-part2)



### L1-L2 penalty
No change from the linear models, same concept.

<img src="http://slideplayer.com/slide/2511721/9/images/3/Cost+function+Logistic+regression:+Neural+network:.jpg">

#### L1
**If only a few weights (coefficients) have a "non zero" value, the degrees of freedom of the function are constrained.**

<img src="https://jamesmccaffrey.files.wordpress.com/2017/06/l1_regularization_1.jpg?w=640&h=361">

[Source](https://jamesmccaffrey.wordpress.com/2017/06/27/implementing-neural-network-l1-regularization/)


#### L2
**If the size of the weights (coefficients) is small, the degrees of freedom of the function are constrained.**

<img src="https://jamesmccaffrey.files.wordpress.com/2017/06/l2_regularization_equations.jpg?w=640&h=379">

[Source](https://jamesmccaffrey.wordpress.com/2017/06/29/implementing-neural-network-l2-regularization/)
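To show how the two penalties enter a gradient step, here is a minimal sketch in plain Python. The scaling conventions ($\lambda \sum |w|$ for L1 and $\frac{\lambda}{2} \sum w^2$ for L2) and all function names are assumptions of this example; other sources use slightly different constants.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w|) and its (sub)gradient lam * sign(w)."""
    value = lam * sum(abs(w) for w in weights)
    grad = [lam * ((w > 0) - (w < 0)) for w in weights]
    return value, grad

def l2_penalty(weights, lam):
    """(lam / 2) * sum(w^2) and its gradient lam * w."""
    value = 0.5 * lam * sum(w * w for w in weights)
    grad = [lam * w for w in weights]
    return value, grad

def regularized_step(weights, data_grad, lr, lam, kind="l2"):
    """One gradient step on (data loss + penalty)."""
    _, pen_grad = (l1_penalty if kind == "l1" else l2_penalty)(weights, lam)
    return [w - lr * (g + pg) for w, g, pg in zip(weights, data_grad, pen_grad)]

weights = [0.5, -2.0, 0.0]
zero_grad = [0.0, 0.0, 0.0]   # pretend the data loss is flat at this point
step_l1 = regularized_step(weights, zero_grad, lr=0.1, lam=1.0, kind="l1")
step_l2 = regularized_step(weights, zero_grad, lr=0.1, lam=1.0, kind="l2")
print(step_l1, step_l2)
```

With a flat data loss, L1 moves every non-zero weight towards zero by the same amount (which is what drives small weights to exactly zero over time), while L2 shrinks each weight in proportion to its size.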

#### Result
<img src="https://cs231n.github.io/assets/nn1/reg_strengths.jpeg">



## IIa. Robustness

### Dropout

#### What is it?

The basic idea is to **randomly choose** some neurons (typically with 0.5 probability "coin flip") that we "switch off", we regard them as temporarily not being part of the network. We set their activation to 0.

Gradients are computed for this thinned network: the switched-off neurons contribute nothing to the error, so their weights receive no update in this step.

In the next forward pass, we "flip the coins" anew.

<a href="https://everglory99.github.io/Intro_DL_TCC/intro_dl_images/dropout1.png"><img src="https://drive.google.com/uc?export=view&id=13ONCe_Q0cbJvhbQcUhWj72YXEa-OvLVo"></a>

**When we finish the training, we use THE WHOLE NETWORK as one**, but we have to rescale the activations to compensate for the fact that all neurons are now active.

**[Original paper](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)**
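The training and test behaviour described above can be sketched in a few lines of plain Python. The function names are hypothetical, and the sketch follows the classic formulation in which the rescaling happens at test time; many frameworks instead rescale during training ("inverted dropout"), which is equivalent in expectation.

```python
import random

def dropout_train(activations, drop_prob, rng):
    """Training pass: zero each activation independently with prob. drop_prob."""
    mask = [0.0 if rng.random() < drop_prob else 1.0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask

def dropout_test(activations, drop_prob):
    """Test pass: keep every unit, scale by the keep probability (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [a * keep for a in activations]

rng = random.Random(0)                     # fresh "coin flips" on every call
acts = [1.0, 2.0, 3.0, 4.0]
dropped, mask = dropout_train(acts, drop_prob=0.5, rng=rng)
print(mask, dropped)
print(dropout_test(acts, drop_prob=0.5))   # [0.5, 1.0, 1.5, 2.0]
```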


#### WHYYYY?

1. First of all: we don't know exactly. There are some ideas, and an emerging consensus.
2. It is plausible that the "interdependency" ("coupling", "entanglement of hidden factors") between neurons also increases during training (less a matter of complexity than of dependency), which leads to more "brittle" or arbitrary decision boundaries (not large-margin ones). This is a nice metaphor, but the literature does not fully back it up.
3. It is in practice a form of model averaging, that is, a self contained "ensemble model" - see below.

More: [here](https://arxiv.org/abs/1512.05287)


#### Observations

1. Dropout makes training more difficult, **decreases training performance**
2. As a rule of thumb, using dropout **_approximately_ doubles required training time**, but epoch time usually decreases (remember, you are zeroing out half of the gradients...)
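3. With $h$ hidden neurons and dropout probability $p$, we implicitly train up to $2^h$ different "thinned" models with shared weights, and we do the prediction with their "ensemble" by using the full network with activations scaled down by the keep probability ($1-p$, which equals $p$ for the usual $p=0.5$)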
4. Dropout forces the network to learn *robust* features, which are useful for many other random neuron groups in their decisions
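The ensemble claim in point 3 can be checked exactly in the simplest possible case. The sketch below assumes a single linear output unit reading three hidden activations (all numbers are hypothetical): averaging the output over all $2^h$ dropout masks gives the same result as one pass through the full network with scaled activations. For nonlinear networks this equality only holds approximately.

```python
from itertools import product

def linear_output(hidden, weights, mask):
    """Output of a single linear unit reading the masked hidden layer."""
    return sum(h * w * m for h, w, m in zip(hidden, weights, mask))

hidden = [0.2, -1.0, 3.0]      # hypothetical hidden activations (h = 3)
weights = [1.5, 0.5, -2.0]     # hypothetical outgoing weights
keep = 0.5                     # keep probability (1 - drop probability)

# Enumerate all 2^h sub-networks; with keep = 0.5 every mask is equally likely.
masks = list(product([0, 1], repeat=len(hidden)))
ensemble_mean = sum(linear_output(hidden, weights, m) for m in masks) / len(masks)

# Single pass through the full network with scaled activations.
scaled = linear_output(hidden, weights, [keep] * len(hidden))

print(len(masks), ensemble_mean, scaled)
```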

#### Connection with ensemble methods

According to the most widespread opinion, during dropout, we create a weakly coupled group or *ensemble* of learners.

"Dropout training is **similar to bagging** (Breiman, 1994), where many **different models** are **trained on different subsets of the data**. Dropout training **differs** from bagging in that each **model is trained** for only **one step** and all of the **models share parameters**. For this training procedure (dropout) to behave as if it is training an ensemble rather than a single model, each **update must have a large effect**, so that it makes the **sub-model** induced by that µ fit the current input v well."

[Source](http://proceedings.mlr.press/v28/goodfellow13.pdf)

It is interesting to think about the connection of deep neural nets with *"boosted trees"* and "traditional" models like *RandomForest* and *XGBoost*.

<img src="https://i0.wp.com/dimensionless.in/wp-content/uploads/RandomForest_blog_files/figure-html/voting.png?w=1080&ssl=1">



#### "Dropout" wrapper

It became such a standard practice that TensorFlow (like all other frameworks) offers a "wrapper" function for it as a standard operation.

It is treated here as a separate layer that "wraps" the prior one and executes the "switch off" (technically by multiplying with zero, or by other tricks).

```python
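import tensorflow as tf  # TensorFlow 1.x API; in TF 2.x use tf.keras.layers instead

# Fully connected layer with 1024 units; `fc1` is the previous layer's output tensor
fc1 = tf.layers.dense(fc1, 1024)
# Apply Dropout (if is_training is False, dropout is not applied)
fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)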
```
        

Details in the [documentation](https://www.tensorflow.org/api_docs/python/tf/nn/dropout)



#### OK, but where to use it? 

A non-trivial question is where to place the dropout "layer".

<img src="http://drive.google.com/uc?export=view&id=1zvL2s2gXB4ymfcnOXXraR8QFSm3Gy54Z">

This does not yet have a settled universal answer, just partial aspects were investigated.

For example in case of handwriting recognition the result is [this](http://ieeexplore.ieee.org/document/7333848/?reload=true).

This strongly depends on the architecture and represents an even more complex problem in case of recurrent models. A three part summary of the problem can be found [here](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307).



#### Sidenote: Another reason why Dropout may be more important than we think

There is also work (e.g. by the group of [René Vidal](http://openaccess.thecvf.com/content_cvpr_2017/papers/Haeffele_Global_Optimality_in_CVPR_2017_paper.pdf)) which tries to characterize, under certain constraints, the supposedly non-convex optimization challenge of training a neural network as a convex problem. This can lead to a deeper understanding of how deep networks can reach their high performance in theory, not just in practice.

<img src="http://drive.google.com/uc?export=view&id=1lzDovsqObw6RizIEx5duSBX0G5Cx2EOf" width=600 heigth=600>

In this framework, they propose that dropout, together with structural properties of deep NNs and the optimization process, essentially causes the problem to become convex. This line of research is pretty novel and not yet well understood, but definitely worth following.


## IIb. Robustness

### "Information dropout"

#### What? 

Achille & Soatto (2016)

What if we don't "switch off" nodes, as in dropout, but we consciously **add noise from a known distribution to the inputs**?

**Traditional Dropout** is a form of **injecting noise** from a **binomial (Bernoulli) distribution**!

- **“Representation”**: any **function of the data** that is **useful for a task** 

- **Optimal representation**: **useful (sufficient), parsimonious (minimal)** and **minimally affected by nuisance factors (invariant)**. 

- **Injecting multiplicative noise** into the activations (a form of regularization) **improves** the **properties of a representation** and results in a better approximation of an optimal one

- The authors propose a multiplicative noise regularizer and show that in the **special case** of **Bernoulli noise**, the **regularization reduces to dropout** 

- **Connection** to **information theoretic** principles 

<img src="http://drive.google.com/uc?export=view&id=1b0gwNXYvJ7J2ZlF26dP4a4iM3iSI2U6D" width=500 heigth=500>

Original paper [here](https://arxiv.org/abs/1611.01353), implementation is not yet "standard", but can be found [here](https://github.com/ucla-vision/information-dropout)

**Lesson: Noise is _not_ necessarily your enemy in deep learning.**

**Main point of the paper: (certain types of) noise leads to an information bottleneck that achieves minimal and invariant representations in addition to the sufficient representations we already had** 

### Why does this work (a small detour)- Deep Learning and Information theory 

[source](https://arxiv.org/pdf/1503.02406.pdf).


"Abstract— **Deep Neural Networks** (DNNs) are **analyzed** via the theoretical framework of the **information bottleneck (IB)** principle. We first show that any DNN can be quantified by the **mutual information between the layers and** the **input and output variables**. Using this representation we can calculate the optimal information theoretic limits of the DNN and obtain finite sample generalization bounds. The advantage of getting closer to the theoretical limit is quantifiable both by the generalization bound and by the network’s simplicity. We argue that both the optimal architecture, number of layers and features/connections at each layer, are related to the bifurcation points of the information bottleneck tradeoff, namely, relevant compression of the input layer with respect to the output layer"


Main idea: over successive layers, **neural networks** function as an **information bottleneck**, **removing** as much **"unnecessary" information** from the input variables as possible **whilst learning** the **relevant input-output relationship**.

Noise is essential in this process, as it helps to push the representation towards being **useful (sufficient), parsimonious (minimal)** and **minimally affected by nuisance factors (invariant)**.

**Development of information throughout the training**

<img src="http://drive.google.com/uc?export=view&id=1ZNx3K_srdO-YjihyiQsvtF661GN63KGS">

From the left image to the right image - increasing the amount of data. Within each image from the left to the right - increasingly deeper layers.


**Theoretical bounds for the NN information bottleneck** 

<img src="http://drive.google.com/uc?export=view&id=1L-xBbKcZ3Vce7bBxuU_2F3b6t2YxzXnW">

Fig. 5. The black line is the optimal achievable information bottleneck (IB) limit. The red line corresponds to the upper bound on the out-of-sample IB distortion, when trained on a finite sample set. ΔC is the complexity gap and ΔG is the generalization gap


[see also here for an explanation](https://lilianweng.github.io/lil-log/2017/09/28/anatomize-deep-learning-with-information-theory.html)


#### Relationship to information dropout

The argument is that **information dropout** leads to a **"disentangled" representation** 


Given some input data $\mathbf{x}$, we want to compute some (possibly nondeterministic) function of $\mathbf{x}$, called a representation, that has some desirable properties in view of the task $\mathbf{y}$, for instance by being more convenient to work with, exposing relevant statistics, or being easier to store. Ideally, we want this **representation** to be **as good** as the **original data** for the task, and **not squander resources modeling** parts of the data that are **irrelevant to the task**. Formally, this means that we want to find a random variable $\mathbf{z}$ satisfying the following conditions:

(i) $\mathbf{z}$ is a representation of $\mathbf{x}$; that is, its distribution depends only on $\mathbf{x}$, as expressed by the following Markov chain:

<img src="http://drive.google.com/uc?export=view&id=1S3Z-6PwMSOqMYBmHu0CjdWoyFHWGXOFs">

(ii) $\mathbf{z}$ is sufficient for the task $\mathbf{y}$, that is $I(\mathbf{x} ; \mathbf{y})=I(\mathbf{z} ; \mathbf{y})$ expressed by the Markov chain:

<img src="http://drive.google.com/uc?export=view&id=1ytxD96t6fLp6tooJD59F5T8AssZbJ63M">

(iii) among all random variables satisfying these requirements, the mutual information $I(\mathbf{x} ; \mathbf{z})$ is minimal. This means that $\mathbf{z}$ discards all variability in the data that is not relevant to the task.


Using the identity $I(\mathbf{x} ; \mathbf{y})-I(\mathbf{z} ; \mathbf{y})=H(\mathbf{y} \mid \mathbf{z})-H(\mathbf{y} \mid \mathbf{x})$ where $H$ denotes the entropy and $I$ the mutual information, it is easy to see that the above conditions are equivalent to finding a distribution $p(\mathbf{z} \mid \mathbf{x})$ which solves the optimization problem
$$
\begin{aligned}
\operatorname{minimize} & I(\mathbf{x} ; \mathbf{z}) \\
\text { s.t. } & H(\mathbf{y} \mid \mathbf{z})=H(\mathbf{y} \mid \mathbf{x})
\end{aligned}
$$
The minimization above is difficult in general. For this reason, Tishby et al. have introduced a generalization known as the Information Bottleneck Principle and the associated Lagrangian to be minimized.
$$
\mathcal{L}=H(\mathbf{y} \mid \mathbf{z})+\beta I(\mathbf{x} ; \mathbf{z})
$$
where $\beta$ is a positive constant that manages the tradeoff between sufficiency (the performance on the task, as measured by the first term) and minimality (the complexity of the representation, measured by the second term). It is easy to see that, in the limit $\beta \rightarrow 0^{+}$, this is equivalent to the original problem, where $\mathbf{z}$ is a minimal sufficient statistic. When all random variables are discrete and $\mathbf{z}=T(\mathbf{x})$ is a deterministic function of $\mathbf{x}$, the algorithm proposed by [2] (reference numbering as in the original paper) can be used to minimize the IB Lagrangian efficiently. However, **no algorithm is known** to **minimize** the IB Lagrangian for **non-Gaussian, high-dimensional continuous random variables**.

One of our key results is that, when we **restrict to the family of distributions obtained by injecting noise to one** layer of a neural network, we can **efficiently approximate and minimize the IB Lagrangian**.

To set the stage, we rewrite the IB Lagrangian as a per-sample loss function. Let $p(\mathbf{x}, \mathbf{y})$ denote the true distribution of the data, from which the training set $\left\{\left(\mathbf{x}_{i}, \mathbf{y}_{i}\right)\right\}_{i=1, \ldots, N}$ is sampled, and let $p_{\theta}(\mathbf{z} \mid \mathbf{x})$ and $p_{\theta}(\mathbf{y} \mid \mathbf{z})$ denote the unknown distributions that we wish to estimate, parametrized by $\theta$. Then, we can write the two terms in the IB Lagrangian as
$$
\begin{aligned}
H(\mathbf{y} \mid \mathbf{z}) & \simeq \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p(\mathbf{x}, \mathbf{y})}\left[\mathbb{E}_{\mathbf{z} \sim p_{\theta}(\mathbf{z} \mid \mathbf{x})}\left[-\log p_{\theta}(\mathbf{y} \mid \mathbf{z})\right]\right] \\
I(\mathbf{x} ; \mathbf{z}) &=\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[\mathrm{KL}\left(p_{\theta}(\mathbf{z} \mid \mathbf{x}) \| p_{\theta}(\mathbf{z})\right)\right]
\end{aligned}
$$
where KL denotes the Kullback-Leibler divergence. We can therefore approximate the IB Lagrangian empirically as
$$
\mathcal{L}=\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\mathbf{z} \sim p_{\theta}\left(\mathbf{z} \mid \mathbf{x}_{i}\right)}\left[-\log p_{\theta}\left(\mathbf{y}_{i} \mid \mathbf{z}\right)\right]+\beta \operatorname{KL}\left(p_{\theta}\left(\mathbf{z} \mid \mathbf{x}_{i}\right) \| p_{\theta}(\mathbf{z})\right)
$$
Notice that the first term is simply the average cross-entropy, which is the most commonly used loss function in deep learning. The second term can then be seen as a regularization term. In fact, many classical regularizers, like the $L_{2}$ penalty, can be expressed in the form of the equation above. In this work, we interpret the KL term as a regularizer that penalizes the transfer of information from $\mathbf{x}$ to $\mathbf{z}$. In the next section, we discuss ways to control such information transfer through the injection of noise.
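For small discrete variables, the quantities in the passage above can be computed exactly, which makes the notions of "sufficient" and "minimal" tangible. The following sketch assumes a hypothetical toy task (not from the paper): $\mathbf{x}$ is uniform on $\{0,1,2,3\}$, the label $\mathbf{y}$ is its high bit, and we compare two deterministic representations.

```python
import math

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def joint_of(f, g, xs):
    """Joint distribution of (f(x), g(x)) for x uniform on xs."""
    out = {}
    for x in xs:
        key = (f(x), g(x))
        out[key] = out.get(key, 0.0) + 1.0 / len(xs)
    return out

xs = range(4)                       # x uniform on {0, 1, 2, 3}
label = lambda x: x // 2            # task y: the high bit of x
identity = lambda x: x              # representation z = x
high_bit = lambda x: x // 2         # representation z = high bit of x

for name, z in [("z = x", identity), ("z = high bit", high_bit)]:
    i_zy = mutual_information(joint_of(z, label, xs))      # sufficiency: I(z;y)
    i_xz = mutual_information(joint_of(identity, z, xs))   # minimality:  I(x;z)
    print(name, i_zy, i_xz)
```

Both representations keep the full 1 bit of information about the label, so both are sufficient, but the high-bit representation retains only 1 bit about the input instead of 2, so it is the more minimal of the two.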

#### Coming back to why information dropout leads to an information bottleneck (minimal and invariant in addition to sufficient representation)


Information dropout cost for ReLU: Let $z=\varepsilon \cdot f(x)$, where $\varepsilon \sim p_{\alpha}(\varepsilon)$, and assume $p(z)=q \delta_{0}(z)+c / z$. Then, assuming $f(x) \neq 0$, we have
$$
\operatorname{KL}\left(p_{\theta}(z \mid x) \| p(z)\right)=-H\left(p_{\alpha(x)}(\log \varepsilon)\right)+\log c
$$
In particular, if $p_{\alpha}(\varepsilon)$ is chosen to be the log-normal distribution $p_{\alpha}(\varepsilon)=\log \mathcal{N}\left(0, \alpha_{\theta}^{2}(x)\right)$, we have
$$
\mathrm{KL}\left(p_{\theta}(z \mid x) \| p(z)\right)=-\log \alpha_{\theta}(x)+\text { const }
$$
If instead $f(x)=0$, we have
$$
\operatorname{KL}\left(p_{\theta}(z \mid x) \| p(z)\right)=-\log q
$$



Substituting the log-normal expression for the KL divergence into the empirical IB Lagrangian above, dropping the additive constant and ignoring for simplicity the special case $f(x)=0$, we obtain the following loss function for ReLU activations
$$
\mathcal{L}=\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\mathbf{z} \sim p_{\theta}\left(\mathbf{z} \mid \mathbf{x}_{i}\right)}\left[-\log p\left(\mathbf{y}_{i} \mid \mathbf{z}\right)\right]-\beta \log \alpha_{\theta}\left(\mathbf{x}_{i}\right)
$$

where $\alpha_{\theta}\left(\mathbf{x}_{i}\right)$ is the standard deviation of $\log \varepsilon$ under the log-normal noise distribution:

$$
p_{\alpha(\mathbf{x})}(\varepsilon)=\log \mathcal{N}\left(0, \alpha_{\theta}^{2}(\mathbf{x})\right)
$$
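To make the noise model tangible, here is a minimal sketch of a single information-dropout step for ReLU units in plain Python. It is an illustration under assumptions, not the authors' implementation: in the real method $\alpha_\theta(\mathbf{x})$ is computed by the network, whereas here the `alphas` are simply passed in, and the function name is hypothetical.

```python
import math
import random

def information_dropout_relu(pre_activations, alphas, beta, rng):
    """Multiply ReLU outputs by log-normal noise and return the KL penalty.

    eps = exp(alpha * n) with n ~ N(0, 1), so log(eps) ~ N(0, alpha^2).
    The penalty is beta * sum(-log alpha), i.e. the KL term up to a constant
    (the special case f(x) = 0 is ignored, as in the text).
    """
    noisy = []
    for a, alpha in zip(pre_activations, alphas):
        f = max(0.0, a)                              # ReLU
        eps = math.exp(alpha * rng.gauss(0.0, 1.0))  # multiplicative log-normal noise
        noisy.append(eps * f)
    penalty = beta * sum(-math.log(alpha) for alpha in alphas)
    return noisy, penalty

rng = random.Random(0)
noisy, penalty = information_dropout_relu(
    pre_activations=[1.0, -2.0, 0.5], alphas=[0.5, 1.0, 0.1], beta=0.1, rng=rng)
print(noisy, penalty)
```

Units with a larger $\alpha$ are noisier and pay a smaller penalty, so the network is rewarded for transmitting less information wherever the task loss allows it.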

### Connections to other methods

#### Variational autoencoders

Even the original paper on information dropout makes an explicit reference to variational autoencoders, so we have to keep this in mind when we get back to those later on. 

#### Data augmentation

What if we have too little data, and "make some more" - by applying transformations to the original data, which are plausible in the given domain - like rotation and scaling in image domains?

<img src="http://drive.google.com/uc?export=view&id=17zfUo3UoD0AqwnEgCkROIQc_2X8cZx51">

Detailed analysis in case of images can be found [here](http://cs231n.stanford.edu/reports/2017/pdfs/300.pdf)

In addition, if we apply such "plausible" transformations, **we greatly enhance the generalization ability of our models**.

Remember: More data is better! (Mozer) 

Also, augmenting data is a form of injecting domain specific knowledge.
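A minimal sketch of such label-preserving transformations, assuming images are stored as plain lists of pixel rows (the function names and the choice of transformations are illustrative):

```python
def horizontal_flip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Translate an image to the right, padding the left edge with `fill`."""
    if pixels <= 0:
        return [row[:] for row in image]
    return [[fill] * min(pixels, len(row)) + row[:max(0, len(row) - pixels)]
            for row in image]

def augment(dataset):
    """Return the original (image, label) pairs plus flipped and shifted copies."""
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((horizontal_flip(image), label))
        out.append((shift_right(image, 1), label))
    return out

image = [[1, 2, 3],
         [4, 5, 6]]
print(augment([(image, "a")]))
```

Which transformations are "plausible" is exactly the domain knowledge being injected: a horizontal flip keeps a cat a cat, but it does not preserve the label of most handwritten digits.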

Source: Lecture series of Michael C. Mozer at DeepLearn2017 Bilbao

<img src="http://drive.google.com/uc?export=view&id=17IyTGGXib9u9cXDbADGXyDsLUwUIuwOc">

And again it is notable that even Hinton admitted that deep learning did not work earlier because of the lack of data.

##### Another connection:

This topic is strongly **connected** to **"data imputation"**, whereby we try to **mitigate** the effect of **missing data** by **"imputing" missing variables**. There are also additional approaches, which we cannot discuss in detail now, but which you can find [here](http://www.stat.columbia.edu/~gelman/arm/missing.pdf).

Even one step further is something like "Cluttered MNIST", where the database is saved in an altered form. Downloadable from [here](https://github.com/deepmind/mnist-cluttered).



#### Adversarial examples

What if we add _targeted_ noise to the dataset?

For the human eye the noise is absolutely imperceptible (in extreme cases it can be **one pixel**), yet we can force the network to classify the example into another class with high confidence.

<img src="https://cdn-images-1.medium.com/max/1400/1*Nj_toOwx_Hc5NLn97Jv-ww.png">

**This is one of the most serious challenges for neural models presently!!! Just think about the security implications!!!**

Naturally the experiments started to test out this attack vector in physical settings by adding minimal optical noise (in form of colorful stickers) to stop signs. Well, it succeeded.

<img src="http://bair.berkeley.edu/blog/assets/yolo/image1.png" width=600 heigth=600>

Original [here](http://bair.berkeley.edu/blog/2017/12/30/yolo-attack/)



**This shows that the decision boundary between the classes is absolutely _not robust_!**

On the other hand, nothing prevents us from **regarding adversarial examples as a form of data augmentation**, and adding them to the datasets.
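How little it takes to construct such an example can be shown on a toy model. The sketch below uses the fast gradient sign method, which is one well-known construction and an assumption of this example (the text above does not name a specific attack), applied to a hypothetical logistic model with made-up weights.

```python
import math

def predict(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(weights, bias, x, y, epsilon):
    """Move every input coordinate by +/- epsilon in the direction that raises the loss.

    For cross-entropy on a logistic model, d(loss)/dx_i = (p - y) * w_i.
    """
    p = predict(weights, bias, x)
    grad = [(p - y) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]

weights, bias = [2.0, -3.0, 1.0], 0.0   # hypothetical "trained" model
x, y = [0.2, -0.1, 0.1], 1              # correctly classified as class 1
x_adv = fgsm(weights, bias, x, y, epsilon=0.2)
print(predict(weights, bias, x), predict(weights, bias, x_adv))
```

A bounded change of at most `epsilon` per coordinate is enough to flip the prediction here. Pairs such as `(x_adv, y)` are what would be added to the training set when adversarial examples are used as data augmentation.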

Recently this method was taken to the extreme: Google researchers used it to "reprogram" a trained network to carry out a different task than the original one.

[Google Brain researchers demo method to hijack neural networks](https://arxiv.org/pdf/1806.11146.pdf)

<img src="https://venturebeat.com/wp-content/uploads/2018/07/Capture-boring.png?fit=578%2C451&strip=all" width=600 heigth=600>

**Requires access to full output of neural network (all classes)**

**Learning of linear mapping from output to desired task**

(This has some relevance for transfer learning also, since it can be considered as domain mapping - see later.)

#### Mixup - “Vicinal Risk Minimization” 

So, getting back to the question of "large margins": what if we attack the base paradigm of empirical risk minimization and move over to "vicinal risk minimization"?

This means that we do not just try to fit a model based on the error on the observed data ("empirical risk"), but use a method to force the model to learn something "between the classes", so in essence we try to constrain the behavior of the model in the space "in between".
(Just think about embedding and representation learning!)

<img src="http://drive.google.com/uc?export=view&id=1xXp9PUoKg5TpsEO5v_q62D32bTZaF00o" width=800 heigth=800>

<img src="https://www.inference.vc/content/images/2017/11/download-87.png" width=60%>

Original paper [here](https://arxiv.org/abs/1710.09412), a very good in-detail analysis [here](https://www.inference.vc/mixup-data-dependent-data-augmentation/).
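The core of mixup fits in a few lines: new training points are convex combinations of pairs of examples and of their one-hot labels, with the mixing coefficient drawn from a Beta distribution. The sketch below is a plain-Python illustration with hypothetical function names; drawing one coefficient per pair (rather than per batch) is a choice made for this example.

```python
import random

def mixup_pair(x1, y1, x2, y2, lam):
    """Convex combination of two inputs and of their one-hot labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

def mixup_batch(xs, ys, alpha, rng):
    """Mix every example with a randomly chosen partner from the same batch."""
    partners = list(range(len(xs)))
    rng.shuffle(partners)
    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)   # lam ~ Beta(alpha, alpha)
        mixed.append(mixup_pair(xs[i], ys[i], xs[j], ys[j], lam))
    return mixed

xs = [[0.0, 0.0], [2.0, 4.0]]
ys = [[1.0, 0.0], [0.0, 1.0]]              # one-hot labels
print(mixup_pair(xs[0], ys[0], xs[1], ys[1], lam=0.25))
print(mixup_batch(xs, ys, alpha=0.2, rng=random.Random(0)))
```

The soft labels are what constrains the model "in between": a point a quarter of the way from one example to another is asked to be a quarter of one class and three quarters of the other.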

This is still a "fringe" movement in deep learning, but it can have great potential!



## III. “Covariate shift”

### The problem

"Covariate shift" is a general and rather nasty problem, when the distribution of your input shifts in a subtle way, implying that the learned model is no longer appropriate.

<img src="http://slideplayer.com/slide/5237435/16/images/5/Covariate+Shift+Adaptation.jpg" width=700 heigth=700>

If we can realize this - eg. with testing in time, or with performance monitoring "a bit late" - we can try to re-train the model. (See eg. ["importance sampling"](https://en.wikipedia.org/wiki/Importance_sampling), or some kind of transfer learning, as elaborated later).

More about detecting covariate shift can be found [here](https://datascience.stackexchange.com/questions/8278/covariate-shift-detection).

For our purposes it is important that Ioffe and Szegedy pointed out that **covariate shift happens even during training, inside backprop**, if the network is deep enough: when we update the weights, we change the output distribution of a layer, and thus the input distribution of the next layer (obviously, since we want to learn something). With this we are effectively "shooting at a moving target" during training.

<img src="https://image.slidesharecdn.com/dlmmdcud1l06optimization-170427160940/95/optimizing-deep-networks-d1l6-insightdcu-machine-learning-workshop-2017-8-638.jpg?cb=1493309658" width=600 heigth=600>

### The solution: Batchnorm


"Batch normalization" means normalizing the activations of the nodes during training with the minibatch mean and variance.

This gives each scalar feature a mean of 0 and a variance of 1 within the minibatch.

<img src="http://drive.google.com/uc?export=view&id=1S55NF6zKwnOp8WEE6woWyDlw4LuF1ZaR" width=400 heigth=400>

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}, \beta^{(k)}$, which scale and shift the normalized value:
$$
y^{(k)}=\gamma^{(k)} \widehat{x}^{(k)}+\beta^{(k)}
$$
These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting $\gamma^{(k)}=\sqrt{\operatorname{Var}\left[x^{(k)}\right]}$ and $\beta^{(k)}=\mathrm{E}\left[x^{(k)}\right]$, we could recover the original activations if that were the optimal thing to do.
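The training-time forward pass described above can be sketched in plain Python as follows. The small constant `eps` for numerical stability and the function name are assumptions of this example; at inference time, running averages of the statistics would be used instead of the minibatch values, which the sketch does not cover.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: one learned scale and shift per feature.
    """
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for k in range(d):
        column = [example[k] for example in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n   # biased minibatch variance
        for i, v in enumerate(column):
            x_hat = (v - mean) / math.sqrt(var + eps)
            out[i][k] = gamma[k] * x_hat + beta[k]
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]   # 3 examples, 2 features
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized)
```

With `gamma = 1` and `beta = 0` every feature column ends up with (almost exactly) zero mean and unit variance, regardless of the very different scales of the two input features.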

A nice analysis of the original method can be found [here](https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b), original paper [here](https://arxiv.org/abs/1502.03167). 

### Why does it work?

The normalization inside a batch ensures that even with changing weights, which change the input distribution, its mean and variance remain the same. It cannot "creep away" in the numeric range.

Detailed explanation by the famous Andrew Ng [here](https://www.coursera.org/learn/deep-neural-network/lecture/81oTm/why-does-batch-norm-work).

Recently (as of June 2018) some doubt came up as to whether batchnorm counteracts covariate shift at all (see [this](https://arxiv.org/pdf/1805.11604.pdf) paper), though it is still considered beneficial (by smoothing the error surface).

**This implies that other effects that smooth the error surface might have a similar benefit, e.g. other normalization techniques or injecting noise** 


## Newer normalization techniques

Though the application of batchnorm layers proved to be effective in speeding up convergence, some important shortcomings gradually became visible:

1. In case of "online" (example by example) learning or small batch sizes it does not work since the mean and variance statistics just don't mean anything about the other training data.

2. In case of recurrent networks (see later) it would be very cumbersome and a huge waste of resources.

Reacting to the above problems, multiple new normalization techniques have been introduced, which - in the end - also try to normalize the output of a neuron, but without relying on the batch statistics.

### Weight normalization

This method - similarly to batchnorm - uses a simple reparametrization trick: it stabilizes the output by decomposing the $\mathbf w$ weight vector for the neuron into the product of a scalar length parameter $g$ and a direction parameter $\mathbf v$:

$$
\mathbf w = \frac{g}{||\mathbf v||} \mathbf v
$$

where $\mathbf{v}$ is a $k$-dimensional vector, $g$ is a scalar, and $\|\mathbf{v}\|$ denotes the Euclidean norm of $\mathbf{v}$. This reparameterization has the effect of fixing the Euclidean norm of the weight vector $\mathbf{w}$ : we now have $\|\mathbf{w}\|=g$, independent of the parameters $\mathbf{v}$. We therefore call this reparameterization weight normalization.

By decoupling the norm of the weight vector $(g)$ from the direction of the weight vector $(\mathbf{v} /\|\mathbf{v}\|)$, we speed up convergence of our stochastic gradient descent optimization.

With this trick we can "fix" the norm of the neuron's weight vector in $g$, thus mitigating the covariate shift phenomenon.
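The reparameterization is easy to verify numerically. In the sketch below (plain Python, hypothetical helper names) two direction vectors of very different length produce the same weight vector, whose norm is exactly $g$:

```python
import math

def norm(w):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in w))

def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| equals g whatever the length of v."""
    norm_v = norm(v)
    return [g * x / norm_v for x in v]

w_a = weight_norm([3.0, 4.0], g=2.0)      # direction (0.6, 0.8)
w_b = weight_norm([30.0, 40.0], g=2.0)    # same direction, 10x longer v
print(w_a, norm(w_a), norm(w_b))
```

During training, gradient descent updates $g$ and $\mathbf{v}$ instead of $\mathbf{w}$, so the scale of the neuron's weights is controlled by a single scalar.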



### Layer normalization

Similarly to batchnorm, we would like to fix the input distribution of the neuron, not in the "batch dimension", but in the dimensions of the representations of individual datapoints. We standardize every individual input vector so that the whole vector has 0 mean and unit variance; after this (very similarly to batchnorm) we scale and shift all values with learned parameters for every neuron. 

<a href="https://i1.wp.com/mlexplained.com/wp-content/uploads/2018/01/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-01-11-11.48.12.png?resize=768%2C448"><img src="https://drive.google.com/uc?export=view&id=1ks6yiXCPy0nrGYzqjPhCurmaN5S2XK71"></a>

Similarly to the other methods, here we also stabilize one of the components defining the output of the neuron, but unlike weight normalization, we don't manipulate the weights, but directly the distribution of the inputs.
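For comparison with batchnorm, here is a minimal plain-Python sketch of layer normalization for a single example. The names `gain` and `bias` for the learned parameters and the stabilizing constant `eps` are assumptions of this example.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize one example across its features, then scale and shift per feature."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

x = [1.0, 2.0, 3.0, 4.0]                      # one datapoint, 4 features
out = layer_norm(x, gain=[1.0] * 4, bias=[0.0] * 4)
out_scaled = layer_norm([10.0 * v for v in x], gain=[1.0] * 4, bias=[0.0] * 4)
print(out, out_scaled)
```

The statistics are computed from a single example, so the result does not depend on the batch size or on the other examples in the batch, which is why the method also works for online learning and recurrent networks.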

### Further reading

More information about the normalization methods:

[An Intuitive Explanation of Why Batch Normalization Really Works (Normalization in Deep Learning Part 1)](http://mlexplained.com/2018/01/10/an-intuitive-explanation-of-why-batch-normalization-really-works-normalization-in-deep-learning-part-1/)

[Weight Normalization and Layer Normalization Explained (Normalization in Deep Learning Part 2)](http://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/)

Original papers:

[Layer normalization](https://arxiv.org/pdf/1607.06450.pdf), [Weight normalization](https://arxiv.org/pdf/1602.07868.pdf)

### Maybe normalization is not necessary?

In some recent works, e.g. [FIXUP initialization: Residual learning without normalization](https://arxiv.org/pdf/1901.09321.pdf), the authors argue that with a more appropriate initialization scheme the need for batchnorm falls away, even for very deep residual convolutional networks (more on those later), so the networks become easily and efficiently trainable without it.

- Specific to ResNets (to be covered later), where the output grows exponentially with depth
- Introduces biases during setup and shifts layers to prevent this exponential output growth

<a href="http://drive.google.com/uc?export=view&id=1-ti0idMx9vFbyLjpqzEfrCBW1w_m8ule"><img src="https://drive.google.com/uc?export=view&id=1GhW0a6ML3uiOSA9TdwJRxifaK0Ye6y1H" width=75%></a>

These are quite new results, "handle with care"!



## A sidenote again: Synthetic gradients

Research on the **parallelizability** of neural network training led Google DeepMind to reflect on the fact that backprop on large enough networks is inherently limited, because the **gradients** of the layers have to be **calculated** in **sequential order**. It is telling that for the problems on DeepMind's horizon this is considered a real constraint, so they try to parallelize the calculation of the layer updates.

For this to be achievable, they have introduced the concept of *"synthetic gradients"*. (Details can be found [here](https://deepmind.com/blog/decoupled-neural-networks-using-synthetic-gradients/).)

<img src="https://storage.googleapis.com/deepmind-live-cms/images/3-3.width-1500_qGOdtUS.png" width=400 heigth=400>

"So, how can one decouple neural interfaces - that is **decouple** the **connections between network modules** - and **still** allow the modules **to learn to interact**? In this paper, we **remove** the **reliance on backpropagation** to get error gradients, and instead **learn a parametric model** which **predicts what the gradients will be** based upon only **local information**. We call these predicted gradients *synthetic gradients*."

<img src="https://storage.googleapis.com/deepmind-live-cms/images/3-5.width-1500_EZm1qhu.png" width=600 heigth=600>
"The synthetic gradient model takes in the **activations** from a module and produces what it **predicts** will be the **error gradients** - the gradient of the loss of the network with respect to the activations.

... and use the synthetic gradients (blue) to *update Layer 1 before the rest of the network has even been executed.*"

This is again a **paradigmatically interesting concept**, since it goes back to the roots of **Hebbian learning**, whereby **only local information** is **necessary** for the update of weights, but with a twist: sooner or later *some* global information is necessary, but much less. 

This view is shared by other scholars (e.g. Pierre Baldi in his talk ["Deep Learning: Theory, Algorithms and Applications to Natural Sciences"](https://drive.google.com/file/d/1lbqi2EB24dhm5lHGdmRhGWrwsyRvOcrf/view?usp=sharing) at the DeepLearn 2018 Summer School, Genova), who state that purely Hebbian learning is not possible and that at least some partial "distant" supervision is necessary.

<a id="gdparams"></a>
# GD training hyperparameters - Learning rate and co.
<img src="https://i1.wp.com/theaviationist.com/wp-content/uploads/2017/12/XB-70-cockpit.jpg" width="700px">


## What control points do we have over our model?

- Structure
    - Architecture (Fully connected layers - till this point, but we will see more)   
        - Selection from "general" or standard architectures 
        - Custom "wiring"
    - Layer number
    - Layer size
- Learning parameters
    - Optimization method (Adam vs SGD vs…)
    - Parameters of these  
        - Learning rate
        - In case of Adam: $\beta_1$, $\beta_2$
        - ...
- Epoch number 
    - With constant regard to “early stopping”
- Regularization parameters
    - Cost function regularization parameters
    - Dropout rate
    - Inf dropout noise level
- Standardization approach
    - Batch norm
   
If we consider all these "control points" and their combinations, we still have a *huge* space we can choose from. 

Since we do not have any analytic insights about the appropriate settings of these, we have to rely on "expert intuition" or "experience" (so much so that some researchers [criticize](https://medium.com/@Synced/lecun-vs-rahimi-has-machine-learning-become-alchemy-21cb1557920d) their own field as "alchemy"), or we can choose to explore the space of these settings, which can itself be cast as a search or optimization problem.

Researchers have attacked these optimization problems with well-known machine learning methods, specifically:

- Grid search (standard practice for other ML methods, implemented in [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), usable [for DL](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/))
- Random search: [Bergstra and Bengio 2012](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)
- [Bayesian optimization](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f)
- [Genetic algorithms](https://ai.googleblog.com/2018/03/using-evolutionary-automl-to-discover.html)
- [Reinforcement learning](https://arxiv.org/abs/1602.04062)
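To make the "search problem" framing concrete, here is a minimal sketch of random search in plain Python. Everything in it is an illustrative assumption: the search ranges, the parameter names, and especially `toy_validation_loss`, which merely stands in for "train a model with this configuration and return its validation loss".

```python
import math
import random

def toy_validation_loss(config):
    """Hypothetical stand-in for a real training run (lower is better)."""
    return (math.log10(config["lr"]) + 3.0) ** 2 + (config["dropout"] - 0.3) ** 2

def random_search(objective, n_trials=50, seed=0):
    """Sample configurations at random and keep the best one."""
    rng = random.Random(seed)
    best_score, best_config = None, None
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, -1),   # log-uniform learning rate
            "dropout": rng.uniform(0.0, 0.7),  # dropout rate
            "layers": rng.randint(1, 5),       # number of layers
        }
        score = objective(config)
        if best_score is None or score < best_score:
            best_score, best_config = score, config
    return best_score, best_config

best_score, best_config = random_search(toy_validation_loss)
print(best_score, best_config)
```

Grid search would replace the random sampling with nested loops over fixed value lists; the more elaborate methods in the list replace it with a model of which configurations look promising.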

The usage of these techniques is out of scope for this course, but it is worth mentioning that "AutoML" as a trend is the product of this approach. Its proponents argue that it is a perfect automatic solution for data science problems; its opponents tend to point out that it is exorbitantly costly to train and that it acts as a kind of "replacement" for well-understood solutions.


From the above parameter space, we restrict ourselves to discussing the learning rate, because of its peculiar properties.

## Learning rate and its "schedules"

Although, as we have already seen, there are GD variants that adaptively set the **learning rate** during training (Adadelta etc.), **non-adaptive GD variants**, e.g. vanilla SGD and Momentum are still frequently used for optimization and require **manual setting**.

For these variants, after setting an initial learning rate (which is a crucial hyperparameter), the rate is typically kept constant during an epoch, but is changed (typically decreased) after every epoch (or every $n$ epochs with a fixed $n$) according to a schedule. The most frequently used schedules are

+ **Step decay**: The learning rate is reduced by a factor of $\varrho$ after every epoch or every $n$ epochs, that is, an
$$\text{lr} = \varrho \cdot \text{lr}$$ 
update is performed, where $0<\varrho<1$. 

+ **Exponential decay**: The learning rate is set to be
$$\text{lr} = \text{lr}_0\cdot e^{-k t}
$$
where $k$ is a hyperparameter and $t$ is the number of epochs (or blocks of $n$ epochs) trained so far. It is easy to see that this is equivalent to a step decay with $\varrho = e^{-k}$.
+ **${1}/{t}$ decay**: The learning rate is set to
$$\text{lr} = \frac{\text{lr}_0 }{1 + k t}$$ where $k$ is a hyperparameter and $t$ is the number of epochs (or blocks of $n$ epochs) trained so far.
+ **Constant learning rate** across epochs.
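The schedules listed above translate almost directly into code. The sketch below is only illustrative; the function names and the example values are assumptions, and deep learning frameworks ship their own scheduler classes.

```python
import math

def step_decay(lr0, rho, t, n=1):
    """Multiply the learning rate by rho after every n epochs."""
    return lr0 * rho ** (t // n)

def exponential_decay(lr0, k, t):
    """lr = lr0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

def inverse_time_decay(lr0, k, t):
    """lr = lr0 / (1 + k * t), the 1/t decay."""
    return lr0 / (1.0 + k * t)

lr0, k = 0.1, 0.5
for t in range(5):
    print(
        t,
        round(step_decay(lr0, math.exp(-k), t), 5),   # same as exponential decay
        round(exponential_decay(lr0, k, t), 5),
        round(inverse_time_decay(lr0, k, t), 5),
    )
```

The first two printed columns coincide, which illustrates the equivalence of exponential decay and step decay with $\varrho = e^{-k}$.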

These schedules can be combined into more complex ones, e.g. a schedule may keep the learning rate constant for a number of epochs and then start a step decay. The switch between the simple schedules can happen simply at a predetermined epoch, but it is frequently connected to a validation metric, e.g. can be triggered when a metric has stopped improving.

<img src="https://cdn-images-1.medium.com/max/1600/1*VQkTnjr2VJOz0R2m4hDucQ.jpeg" width=600 heigth=600>
<img src="https://cdn-images-1.medium.com/max/1200/1*iSZv0xuVCsCCK7Z4UiXf2g.jpeg" width=600 heigth=600>

[source](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)

## Systematic thinking about LR

[Setting the learning rate of your neural network](https://www.jeremyjordan.me/nn-learning-rate/)

<a href="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png"><img src="https://drive.google.com/uc?export=view&id=1j8ZG4mABW8IyT36iN56ERkrq94PjcaDC" width=55%></a>
<a href="https://www.jeremyjordan.me/content/images/2018/02/lr_finder.png"><img src="https://drive.google.com/uc?export=view&id=1shs2UPZZjTDtgN3P8fM0Ay9vnwoRXuFb" width=55%></a>
    


## Connection of LR and batch size 

Furthermore, a recently published paper draws attention to the fact that decreasing the learning rate according to a schedule can be substituted by a **gradual increase of the batch size**.

[Don’t decay the learning rate, increase batch size](https://arxiv.org/abs/1711.00489)

<img src="http://drive.google.com/uc?export=view&id=1ka9PN4zecdNVT6uE05VXEvbY_8EygEAv" width=700 heigth=700>

A potential cause of this phenomenon is that when each gradient update is computed on more data (through the increased batch size), we get a more and more "regularized", "smooth" gradient.

If this is true, we can benefit, since the larger batch size speeds up training considerably.
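The "smoother gradient" intuition can be illustrated with a small simulation. This is only a toy sketch under strong assumptions: per-example gradients are modelled as independent Gaussian samples around a made-up true gradient of 1.0, and `minibatch_gradient_std` is a hypothetical helper, not a measurement on a real network.

```python
import random
import statistics

def minibatch_gradient_std(batch_size, n_trials=2000, seed=0):
    """Spread of the minibatch gradient estimate for a given batch size."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        # Average of batch_size noisy per-example "gradients".
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.pstdev(estimates)

for batch_size in (1, 4, 16, 64):
    print(batch_size, round(minibatch_gradient_std(batch_size), 3))
```

In this toy setting the noise of the estimate shrinks roughly with the square root of the batch size, so growing the batch reduces the randomness of the updates in a way that resembles shrinking the step size.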


## Caution with big batches!

Somewhat contrary to the statements above, it is worth mentioning that there is emerging evidence that **small batch** training has its own **distinctive advantages** that are worth capitalizing on.

In the paper [Revisiting Small Batch Training for Deep Neural Networks](https://arxiv.org/abs/1804.07612) the authors conduct large-scale training runs to determine the effect of batch size and find that large batch sizes actually decrease the generalization performance of models, and that there is a rather small optimal batch size for training. 

<img src="https://www.graphcore.ai/hs-fs/hubfs/ResNet32_CIFAR100_Aug_VB_val_dist.png?t=1536059639530&width=600&height=521&name=ResNet32_CIFAR100_Aug_VB_val_dist.png" width=400 heigth=400>

This directly counteracts the trend towards larger batches for parallelism, which gained traction with the advent of large-scale distributed training methods.

**We should note that this shows batch size to be crucial, even though historically it was often determined by technical limitations rather than by what is best for training (or by luck, as in the case of the original SGD).**

A possible cause is that, under a constant learning rate, the gradient of a larger batch is an average of the gradients of the smaller batches it contains, so a single large-batch step may prevent fine-grained convergence. (**The results of the research assume that training always runs for the same number of epochs**, so the higher number of small-batch updates is relevant.)

<img src="https://www.graphcore.ai/hs-fs/hubfs/SGD_fig.png?t=1536059639530&width=900&name=SGD_fig.png" width=400 heigh=400>

A detailed exploration of the paper can be found [here](https://www.graphcore.ai/posts/revisiting-small-batch-training-for-deep-neural-networks).

**Conclusion: Use small batch sizes, it might help, though training will be slower.**

## Loss topology and architecture

"The **loss landscape** of a neural network (visualized below) is a **function of the network's parameter values** quantifying the "error" associated with using a **specific configuration of parameter values** when performing **inference (prediction) on a given dataset**. This loss landscape can look quite different, even for very similar network architectures. The images below are from a paper, Visualizing the Loss **Landscape of Neural Nets**, which shows **how residual connections** in a network can yield a smoother loss topology."

<a href="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-26-at-10.50.53-PM.png"><img src="https://drive.google.com/uc?export=view&id=1_54gU3_A8z22Ncwdk3CBs4y2SmOePiDs" width=75%></a>

[Setting the learning rate of your neural network](https://www.jeremyjordan.me/nn-learning-rate/)

## Only decreasing?

**If it is true, that the loss topology is "bumpy", does it make sense to ONLY decrease the LR?**

What if, **when we are stuck, we increase a bit, and then decrease it again**?

What if we anticipate this in advance?

<a href="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-25-at-8.44.49-PM.png"><img src="https://drive.google.com/uc?export=view&id=1sIoyscsvAc8PVYJLu71pMiLaEniBkiCM" width=55%></a>

## Warm restarts

Original paper:
["Stochastic gradient descent with warm restarts"](https://arxiv.org/abs/1608.03983) 

"In this paper, we propose to periodically simulate warm restarts of SGD, where in each restart the
learning rate is initialized to some value and is scheduled to decrease. Four different instantiations
of this new learning rate schedule are visualized in Figure 1. Our empirical results suggest that SGD
with **warm restarts requires 2× to 4× fewer epochs** than the currently-used learning rate schedule
schemes to achieve comparable or even better results."

<img src="http://drive.google.com/uc?export=view&id=12a2yIch8Nnf27g8UEQ9ydJy9MhuAzysR" width=700 heigth=700>
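A schedule of this kind could be sketched as below. The cosine shape follows the annealing formula proposed in the paper; the function name `sgdr_lr` and the default values (`lr_max`, `lr_min`, first cycle length `t_0`, cycle length multiplier `t_mult`) are illustrative assumptions for this sketch.

```python
import math

def sgdr_lr(epoch, lr_max=0.1, lr_min=0.0, t_0=10, t_mult=2):
    """Cosine-annealed learning rate with warm restarts.

    The first cycle lasts t_0 epochs; every following cycle is
    t_mult times longer than the previous one.
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:        # find the cycle that contains this epoch
        t_cur -= t_i
        t_i *= t_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

for epoch in (0, 5, 9, 10, 20, 29, 30):
    print(epoch, round(sgdr_lr(epoch), 5))
```

With these settings the rate decays within epochs 0 to 9, jumps back to `lr_max` at epoch 10, decays over the next 20 epochs, and restarts again at epoch 30.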




## "Superconvergence"

The usage of cyclic LR also has some interesting side effects, which made it very popular. Enter **"superconvergence"**.

"we describe a phenomenon, which we named “super-convergence”, where neural networks can be trained an order of magnitude faster than with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with **one learning rate cycle and a large maximum learning rate**. A primary insight that allows super-convergence training is that **large learning rates regularize the training**, hence **requiring a reduction of all other forms of regularization** in order to preserve an optimal regularization balance."

<a href="https://cdn-images-1.medium.com/max/1200/1*cmYfSsmlXm8XdjNddsQqZw.jpeg"><img src="https://drive.google.com/uc?export=view&id=1-N-FgcHib5hK6V91s7Stq4aV14zYxhMl" width=65%></a>

Original source [here](https://arxiv.org/pdf/1708.07120.pdf)

More approachable summary [here](https://towardsdatascience.com/https-medium-com-super-convergence-very-fast-training-of-neural-networks-using-large-learning-rates-decb689b9eb0)

The "why" for cyclic LR:

"The motivation for the “One Cycle” policy was the following: The **learning rate** initially **starts small** to allow convergence to begin but as the network **traverses the flat valley**, the **learning rate becomes large**, allowing for faster progress through the valley. In the final stages of the training, when the training needs to settle into the local minimum, the learning rate is once again reduced to a small value."

<a href="https://cdn-images-1.medium.com/max/1200/1*Y3Xnw8qxOH6zdGmTlCMDbA.jpeg"><img src="https://drive.google.com/uc?export=view&id=1XEkoAKAFNA3riD5xStEqXH3CeIvISqjL" width=65%></a>
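A simplified version of such a policy could look like the sketch below: the rate rises linearly to a peak and then falls back linearly. This is a deliberately reduced illustration; the function name `one_cycle_lr`, the parameter values and the purely triangular shape are assumptions, and practical implementations often add a final phase with an even smaller rate.

```python
def one_cycle_lr(step, total_steps, lr_max=1.0, lr_min=0.1, peak_frac=0.5):
    """Triangular one-cycle schedule: small -> large -> small."""
    peak = peak_frac * total_steps
    if step <= peak:
        frac = step / peak                                 # warm-up part
    else:
        frac = (total_steps - step) / (total_steps - peak)  # cool-down part
    return lr_min + frac * (lr_max - lr_min)

total_steps = 100
for step in (0, 25, 50, 75, 100):
    print(step, round(one_cycle_lr(step, total_steps), 3))
```

The small initial rate lets convergence begin, the large rate in the middle speeds up the traversal of flat regions, and the small final rate lets training settle into a minimum.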

### LR as regularizer

Another interesting effect is that, as the original paper argues, **large learning rates act as regularizers**. This is consistent with the observations above about the connection with batch sizes: small batches and large learning rates both act as regularizers.


### Snapshot ensembles

Warm restarts originally represented a separate line of research, but got combined with ["Snapshot ensembles"](https://arxiv.org/abs/1704.00109). This technique is connected, even in name, to ensemble models. 

During this procedure, we **save all "good" model states** - typically before warm restarts - and finally do an **ensemble of them in the classical sense**.

Its biggest (proposed) advantage is that we hope to collect solutions from similarly powerful but distinct optima that were visited during training, so their ensemble will be more powerful than any of them alone.

<img src="http://ruder.io/content/images/2017/11/snapshot_ensembles.png" width=700 heigth=700>
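The final ensembling step can be sketched in a few lines. The example assumes that each saved snapshot has already produced class probabilities for one input; the numbers and the helper name `ensemble_predict` are made up for illustration, and simple averaging is only one of several possible ways to combine the snapshots.

```python
def ensemble_predict(snapshot_probs):
    """Average class probabilities over snapshots and pick the best class."""
    n_snapshots = len(snapshot_probs)
    n_classes = len(snapshot_probs[0])
    mean_probs = [
        sum(probs[c] for probs in snapshot_probs) / n_snapshots
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return mean_probs, predicted

# Hypothetical outputs of three snapshots for the same input (3 classes).
snapshot_probs = [
    [0.60, 0.30, 0.10],   # snapshot saved before the 1st restart
    [0.20, 0.50, 0.30],   # snapshot saved before the 2nd restart
    [0.30, 0.45, 0.25],   # final snapshot
]

mean_probs, predicted = ensemble_predict(snapshot_probs)
print(mean_probs, predicted)
```

Here the first snapshot alone would predict class 0, while the ensemble settles on class 1, which is the kind of disagreement between distinct optima that the ensemble is meant to exploit.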

This also gives an answer to the question, posed in the context of early stopping, of whether we should store and use the final model: no, not necessarily.

Moreover if we accept the fact that a **number of "global" optima are present** for a given model, we are inclined to **"sample" from more "basins"**, which even connects us to boosting methods.

For opinions on this [this](http://www.argmin.net/2018/02/05/linearization/) post is worth reading. It can point towards a more unified view.

Another very interesting summary of the developments in this regard is [this](http://ruder.io/deep-learning-optimization-2017/).


(Sebastian and the team at AYLIEN are very up to date in NLP, worth following [here](http://ruder.io/#open).)

## An unexpected side-effect: Better optimizers!

### RAdam

The "one cycle" policy, especially its first part, the **"warmup" phase**, where we **start with a small initial learning rate and gradually increase it**, can have unexpected side effects in the case of adaptive learning rate optimizers, especially **Adam**, since **warmup mitigates the generalization problem of adaptive LR methods**.

The crucial insight of the paper ["On the Variance of the Adaptive Learning Rate and Beyond"](https://arxiv.org/abs/1908.03265) is that much of the loss in generalization performance in the case of Adam comes from its naive over-reliance on the **initial variance in gradients**, so the **weight distribution gets quickly distorted** and never fully finds its way back to fruitful, more global optima.

<a href="http://drive.google.com/uc?export=view&id=1mYfms1HJeL7O_OuvPcDg3HZpdVpyh2ow"><img src="https://drive.google.com/uc?export=view&id=1lzYQSLn8AhJU1OjKjbyuvhic_-Qx1-fv" width=50%></a>

So the newest, state of the art optimization method seems to be a form of Adam, namely **rectified Adam (or RAdam)** that incorporates some learnings from the warm-up method and variable learning rates.

_It is basically trying to set up an **adaptive regularization** scheme with which it balances the amount of Adam's adaptive LR properties, thus in **extreme case it can act as SGD**, disregarding the aggregated variance data, and only if appropriate does it start to behave like Adam._

Or with the words of the authors:

"Comparing these two strategies (warmup and RAdam), RAdam **deactivates the adaptive learning rate** when its **variance is divergent**, thus avoiding undesired instability in the first few updates."
 
A nice introduction can be found [here](https://medium.com/@lessw/new-state-of-the-art-ai-optimizer-rectified-adam-radam-5d854730807b).

**Details on Algorithm**

<img src="https://drive.google.com/uc?export=view&id=1-Y139WeaCOLp2VMdBKAzVEqufrlyJLM5" width=75%>
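The switch between the "SGD-like" and the "Adam-like" regime can be made tangible by computing the rectification term alone. The sketch below follows the formulas of the paper's algorithm as best understood here; the function name `radam_rectification` and the convention of returning `None` while the adaptive learning rate is switched off are assumptions of this sketch, and some implementations use a slightly different threshold.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Rectification term r_t at step t, or None while the variance
    of the adaptive learning rate is not yet tractable."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None   # fall back to an SGD-with-momentum style update
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)

for t in (1, 3, 10, 100, 1000, 100000):
    print(t, radam_rectification(t))
```

In the first few steps the function returns `None`, so the adaptive part is deactivated; afterwards the term grows from a small value towards 1, which behaves like a built-in warmup.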


It promises to be fast, but more importantly **robust across a wide selection of learning rates**. 

<a href="https://miro.medium.com/max/700/1*BMwu8Km-CtPsvaH8OM5_-g.jpeg"><img src="https://drive.google.com/uc?export=view&id=1DCnEETLqnoQaiRkvc4wZzepCNBv9dGzT" width=65%></a>

Since the [paper](https://arxiv.org/abs/1908.03265) describing the method is still pretty fresh, the verdict is still out, but it looks very promising! (Implementations are not yet mainstream...)


### Lookahead

Although the inspiration is not that direct or obvious, the idea of storing some weights during training had some "spin-off" ideas in optimization. In their recent paper [LookAhead optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610), Zhang et al. proposed an optimization method where they **keep a copy of the weights and use two optimization regimes for the network, one "slow" and one "fast".**

<a href="http://drive.google.com/uc?export=view&id=1EhpRMvpgKowimKMnuVrkcXtd6hB79iIo"><img src="https://drive.google.com/uc?export=view&id=1ttOcZKzax1Zh5Ar5M39ooAwy7XrYkobg" width=75%></a>

**After a short period (some iterations, e.g. 5) they then "synchronize" the weights.**

Or with the words of the authors: 

"Lookahead maintains a set of **slow weights $φ$** and **fast weights $θ$**, which get **synced with the fast weights every $k$ updates**. The fast weights are **updated** through applying $A$, any standard optimization algorithm, to batches of training examples sampled from the dataset $D$. After $k$ inner optimizer updates using $A$, the **slow weights are updated towards the fast weights by linearly interpolating in weight space, $θ − φ$**. We denote the slow weights learning rate as $α$. After each slow weights update, the fast weights are reset to the current slow weights value."

Why is this good?

As this [excellent description](https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d) puts it:

"...in effect it allows a faster set of weights to ‘look ahead’ or explore while the slower weights stay behind to provide longer term stability. The result is reduced variance during training, and much less sensitivity to sub-optimal hyper-parameters and reduces the need for extensive hyper-parameter tuning... By way of simple analogy, LookAhead can be thought of as the following. Imagine you are at the top of a mountain range, with various dropoff’s all around. One of them leads to the bottom and success, but others are simply crevasses with no good ending. To explore by yourself would be hard because you’d have to drop down each one, and assuming it was a dead end, find your way back out. But, if you had a buddy who would stay at or near the top and help pull you back up if things didn’t look good, you’d probably make a lot more progress towards finding the best way down because exploring the full terrain would proceed much more quickly and with far less likelihood of being stuck in a bad crevasse."

<a href="http://drive.google.com/uc?export=view&id=1A43MSBp0s-zKO8H8EtUxY_B1rcTLJnRC"><img src="https://drive.google.com/uc?export=view&id=1Ugm-nexj9KsNuJZ5NLFEmPRVw5Uwy0TO" width=85%></a>

The thing seems to actually work!
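The slow/fast weight mechanics can be sketched on a toy problem. In the code below plain gradient descent plays the role of the inner optimizer $A$, and the objective $f(x, y) = x^2 + 10y^2$, the function name `lookahead_minimize` and all hyperparameter values are assumptions chosen purely for illustration.

```python
def lookahead_minimize(grad, phi, k=5, alpha=0.5, inner_lr=0.09, outer_steps=50):
    """Lookahead with plain gradient descent as the inner optimizer.

    phi are the slow weights; theta are the fast weights.
    """
    phi = list(phi)
    for _ in range(outer_steps):
        theta = list(phi)                  # fast weights start at the slow weights
        for _ in range(k):                 # k fast updates with the inner optimizer
            g = grad(theta)
            theta = [t - inner_lr * gi for t, gi in zip(theta, g)]
        # Slow weights move a fraction alpha towards the fast weights.
        phi = [p + alpha * (t - p) for p, t in zip(phi, theta)]
    return phi

def grad(w):
    """Gradient of the toy objective f(x, y) = x^2 + 10 * y^2."""
    return [2.0 * w[0], 20.0 * w[1]]

phi_final = lookahead_minimize(grad, phi=[1.0, 1.0])
print(phi_final)
```

With the chosen inner learning rate the fast weights overshoot and oscillate along the steep $y$ direction, while the interpolation step averages much of that oscillation out, a small-scale picture of the "buddy at the top" analogy.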

### Surprise: a combination, Ranger

Well, if both RAdam and Lookahead achieved impressive new state-of-the-art results, why not combine the two?

This is exactly what happened, thus the new optimization method [Ranger](https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d) was born.

**The results were impressive: the normalization effects of RAdam at the beginning and the overall stabilization of Lookahead combine seamlessly!**

As of September 2019, this was the state of the art, but the saga continues... 


# Epilogue

## Did we understand generalization well enough?

We are by now all too familiar with the drawbacks of empirical risk minimization, that is, we are very much afraid of learning the "quirks of the dataset" (Hinton) and overfitting. Furthermore, we wholeheartedly subscribed to the understanding of **statistical learning theory** about the **relationship between complexity and overfitting / generalization**, as follows:

<a href="http://drive.google.com/uc?export=view&id=1aPYe74krTI3RY1Nf2rYO5G06qZVbI3Sc"><img src="https://drive.google.com/uc?export=view&id=12cxR79w_Kv_OEPcbRzsrLY7VIPTcLGXi" width=35%></a>

But what if this is not the whole picture? What if the **connection between complexity and generalization is not that simple?**

Some recent observations (like those in the paper [Reconciling modern machine learning and the bias-variance trade-off](https://arxiv.org/abs/1812.11118)) point in the other direction. Maybe the effect of more capacity is detrimental **only in case of models smaller than the "memorization capacity"**, and maybe **we should have gone even bigger!**

<a href="http://drive.google.com/uc?export=view&id=1xvtWpiUxYkiYqPzrCx7BwFe1_YYh3eo5"><img src="https://drive.google.com/uc?export=view&id=1lFXEXRxZBU9uDFzUXyaBI0xlld1MFEJp" width=85%></a>

Much is still unknown.

There are some interesting results pointing in a bit of the opposite direction also, raising the question:


## Do we need training and big networks at all?

To say that the field of deep learning is in flux is a major understatement. We at least assumed it to hold true that we need large networks and have to train them for an extensive period of time with sophisticated methods.

Well, maybe not so, in two ways:

### The "Lottery Ticket Hypothesis"

"...after training a network, **set all weights smaller than some threshold to zero (prune them), rewind the rest of the weights to their initial configuration, and then retrain the network from this starting configuration keeping the pruned weights weights frozen (not trained).** Using this approach, they obtained two intriguing results.

First, they showed that the pruned networks performed well. **Aggressively pruned networks (with 99.5 percent to 95 percent of weights pruned) showed no drop in performance compared to the much larger, unpruned network. Moreover, networks only moderately pruned (with 50 percent to 90 percent of weights pruned) often outperformed their unpruned counterparts.**

Second, as compelling as these results were, the characteristics of the remaining network structure and weights were just as interesting. Normally, if you take a trained network, re-initialize it with random weights, and then re-train it, its performance will be about the same as before. But with the skeletal Lottery Ticket (LT) networks, this property does not hold. The network trains well only if it is rewound to its initial state, including the specific initial weights that were used. Reinitializing it with new weights causes it to train poorly. As pointed out in Frankle and Carbin’s study, it would appear that the **specific combination of pruning mask** (a per-weight binary value indicating whether or not to delete the weight) **and weights underlying the mask form a lucky sub-network** found within the larger network, or, as named by the original study, a winning “Lottery Ticket.”"

<a href="https://1fykyq3mdn5r21tpna3wkdyi-wpengine.netdna-ssl.com/wp-content/uploads/2019/05/blog_header_2-1068x458.png"><img src="https://drive.google.com/uc?export=view&id=1zVaznTkOUe6_iDpc5LT1xfNY2IZm9ESh" width=85%></a>

[Original paper](https://arxiv.org/pdf/1803.03635.pdf)

A [more thorough analysis](https://eng.uber.com/deconstructing-lottery-tickets/) 
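The prune-and-rewind procedure quoted above can be sketched on a flat list of weights. All numbers, the threshold and the helper names (`magnitude_mask`, `rewind`, `masked_update`) are hypothetical; a real experiment would apply the same idea to every layer of a trained network.

```python
def magnitude_mask(trained, threshold):
    """1 keeps a weight, 0 prunes it (trained magnitude below the threshold)."""
    return [1 if abs(w) >= threshold else 0 for w in trained]

def rewind(initial, mask):
    """Reset surviving weights to their initial values, zero out the rest."""
    return [w0 * m for w0, m in zip(initial, mask)]

def masked_update(weights, grads, mask, lr=0.1):
    """Gradient step that keeps pruned weights frozen at zero."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]

initial = [0.5, -0.1, 0.3, -0.7, 0.2]     # weights at initialization
trained = [0.9, -0.05, 0.02, -1.2, 0.4]   # weights after training

mask = magnitude_mask(trained, threshold=0.1)
ticket = rewind(initial, mask)            # candidate "lottery ticket"
after_step = masked_update(ticket, grads=[1.0] * 5, mask=mask)

print(mask, ticket, after_step)
```

The important detail is that the surviving weights go back to their original initial values rather than to fresh random ones, which is exactly the property the study found to matter.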

**Takeaways:**
- Much of the **capacity** of deep models and the associated training time is **"wasted"**
- **Initialization is a dominant factor**, maybe the size of networks only matters for giving **large enough room** to randomly come up with **"lottery tickets"**
- There is a very interesting **interplay** between **structure, learning and performance** in deep networks
    
### Weight agnostic networks

"...Schmidhuber et al. have shown that a randomly-initialized LSTM [13] with a learned linear output layer can predict time series... we aim to search for **weight agnostic neural networks**, architectures with strong inductive biases that **can already perform various tasks with random weights.**"

<a href="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/mnist_cover.png"><img src="https://drive.google.com/uc?export=view&id=1uDX-0HZ8iih7kHwGeWFbsz2bP9fBfCdc" width=85%></a>


[Original](https://weightagnostic.github.io)

**Takeaways:**
- Maybe we **do not need (that much) training at all**?
- **Structure** can be **key**, as well as the right **inductive bias**!

# Strategic advice: How to debug ML models?

With so many "knobs" one can get lost pretty easily, and the chance of making a mistake is high. The question is then: how to start?

Though there are multiple strategies, one of the interesting ones was presented by **Josh Tobin**, at the time a researcher at OpenAI:

[Troubleshooting Deep Neural Networks](http://josh-tobin.com/troubleshooting-deep-neural-networks.html)

<a href="http://josh-tobin.com/assets/debugging_overview.jpg"><img src="https://drive.google.com/uc?export=view&id=1mOFalWJOS4-_Zlo-xYGBpKQSHZEDfWLg" width=55%></a>

```python
from IPython.display import IFrame

# Embed the recording of the talk in the notebook
IFrame("https://www.youtube.com/embed/XtCNNwDi9xg", width="560", height="315")
```

A very detailed guide with a lot of good advice, worth reading and / or watching!

Another good source is **Andrej Karpathy's
[A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/)**

**Final remark:**
    
A recent work carried out a thorough investigation of hyperparameters, and came up with "sensible default values".

["Rethinking Default Values: a Low Cost and Efficient Strategy to Define Hyperparameters"](https://arxiv.org/abs/2008.00025) is definitely worth a read!