# Chapter 7 Regularization of Deep or Distributed Models

# Contents

* 7.1 Regularization from a Bayesian Perspective
* 7.2 Classical Regularization: Parameter Norm Penalty
* 7.3 Classical Regularization as Constrained Optimization
* 7.4 Regularization and Under-Constrained Problems
* 7.5 Dataset Augmentation
* 7.6 Classical Regularization as Noise Robustness
* 7.7 Early Stopping as a Form of Regularization
* 7.8 Parameter Tying and Parameter Sharing
* 7.9 Sparse Representations
* 7.10 Bagging and Other Ensemble Methods
* 7.11 Dropout
* 7.12 Multi-Task Learning
* 7.13 Adversarial Training

# 7.1 Regularization from a Bayesian Perspective

<img src="figures/cap7.1.png" width=600 />

# 7.2 Classical Regularization: Parameter Norm Penalty

* 7.2.1 L2 Parameter Regularization
* 7.2.2 L1 Regularization
* 7.2.3 Bayesian Interpretation of the Parameter Norm Penalty

<img src="figures/cap7.2.png" width=600 />

## 7.2.1 L2 Parameter Regularization

<img src="figures/cap7.3.png" width=600 />

<img src="figures/cap7.4.png" width=600 />

<img src="figures/cap7.5.png" width=600 />

<img src="figures/cap7.6.png" width=600 />

<img src="figures/cap7.7.png" width=600 />

<img src="figures/cap7.8.png" width=600 />

<img src="figures/cap7.9.png" width=600 />

<img src="figures/cap7.10.png" width=600 />

<img src="figures/cap7.11.png" width=600 />

<img src="figures/cap7.12.png"  />

<img src="figures/cap7.13.png" width=600 />

<img src="figures/cap7.14.png"  />

<img src="figures/cap7.15.png"  />

## 7.2.2 L1 Regularization

<img src="figures/cap7.16.png" width=600 />

<img src="figures/cap7.17.png" width=600 />

<img src="figures/cap7.18.png" width=600 />

<img src="figures/cap7.19.png" width=600 />

<img src="figures/cap7.20.png" width=600 />

<img src="figures/cap7.21.png" width=600 />

## 7.2.3 Bayesian Interpretation of the Parameter Norm Penalty

<img src="figures/cap7.22.png" width=600 />

<img src="figures/cap7.23.png" width=600 />

<img src="figures/cap7.24.png" width=600 />

# 7.3 Classical Regularization as Constrained Optimization

<img src="figures/cap7.25.png" width=600 />

<img src="figures/cap7.26.png" width=600 />

<img src="figures/cap7.27.png" width=600 />

<img src="figures/cap7.28.png" width=600 />

# 7.4 Regularization and Under-Constrained Problems

<img src="figures/cap7.29.png" width=600 />

# 7.5 Dataset Augmentation

# 7.6 Classical Regularization as Noise Robustness

* 7.6.1 Injecting Noise at the Input
* 7.6.2 Injecting Noise at the Weights

## 7.6.1 Injecting Noise at the Input

<img src="figures/cap7.30.png" width=600 />

<img src="figures/cap7.31.png" width=600 />

<img src="figures/cap7.32.png" width=600 />

<img src="figures/cap7.33.png" width=600 />

<img src="figures/cap7.34.png" width=600 />

<img src="figures/cap7.35.png" width=600 />

<img src="figures/cap7.36.png" width=600 />

## 7.6.2 Injecting Noise at the Weights

<img src="figures/cap7.37.png" width=600 />

<img src="figures/cap7.38.png" width=600 />

<img src="figures/cap7.39.png" width=600 />

<img src="figures/cap7.40.png" width=600 />

<img src="figures/cap7.41.png" width=600 />

<img src="figures/cap7.42.png" width=600 />

<img src="figures/cap7.43.png" width=600 />

# 7.7 Early Stopping as a Form of Regularization

<img src="figures/cap7.44.png" width=600 />

<img src="figures/cap7.45.png" width=600 />

<img src="figures/cap7.46.png" width=600 />

<img src="figures/cap7.47.png" width=600 />

<img src="figures/cap7.48.png" width=600 />

### Early stopping and the use of surrogate loss functions: 

### How early stopping acts as a regularizer:

<img src="figures/cap7.49.png" width=600 />

<img src="figures/cap7.50.png" width=600 />

<img src="figures/cap7.51.png" width=600 />

<img src="figures/cap7.52.png" width=600 />

<img src="figures/cap7.53.png" width=600 />

<img src="figures/cap7.54.png" width=600 />

<img src="figures/cap7.55.png" width=600 />

<img src="figures/cap7.56.png" width=600 />

# 7.8 Parameter Tying and Parameter Sharing

Thus far in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed region or point.

For example, L2 regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero. However, sometimes we may need other ways to express our prior knowledge about suitable values of the model parameters. Sometimes we might not know precisely what values the parameters should take, but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

One approach of this kind, regularizing the parameters of one model to be close to those of another, was proposed in Lasserre et al. (2006). They regularized the parameters of one model, trained as a classifier in a supervised paradigm, with the parameters of another model, this one trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures were constructed such that many of the parameters in the classifier model could be paired with corresponding parameters in the unsupervised model.

While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, where we interpret the various models or model components as sharing a unique set of parameters.

A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) needs to be stored in memory.
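
The contrast between the two options can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the penalty is taken to be the squared L2 distance between the two weight arrays, and the names `tying_penalty`, `w_a`, `w_b`, `w_shared`, `model_a` and `model_b` are hypothetical, not taken from the text.

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha):
    """Soft tying: alpha * ||w_a - w_b||_2^2 (assumed form of the penalty)."""
    return alpha * np.sum((w_a - w_b) ** 2)

def tying_penalty_grads(w_a, w_b, alpha):
    """Gradients of the penalty with respect to w_a and w_b."""
    diff = w_a - w_b
    return 2.0 * alpha * diff, -2.0 * alpha * diff

rng = np.random.default_rng(0)

# Soft tying: two separate weight arrays, pulled together by the penalty.
w_a = rng.normal(size=(4, 3))
w_b = rng.normal(size=(4, 3))
penalty = tying_penalty(w_a, w_b, alpha=0.1)
grad_a, grad_b = tying_penalty_grads(w_a, w_b, alpha=0.1)
stored_soft = w_a.size + w_b.size    # both arrays are kept in memory

# Hard sharing: both (hypothetical) models reference the same array.
w_shared = rng.normal(size=(4, 3))
model_a = {"W": w_shared}
model_b = {"W": w_shared}
stored_hard = w_shared.size          # only the unique set is stored

print(f"penalty = {penalty:.4f}")
print(f"stored parameters: soft tying = {stored_soft}, hard sharing = {stored_hard}")
```

With soft tying the two arrays stay distinct and the penalty gradient pushes them toward each other. With hard sharing there is a single array, so the equality constraint holds by construction and the memory footprint is halved in this toy case.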

### Convolutional Neural Networks

By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.

Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and has allowed them to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture.
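
To give a rough sense of the savings, the sketch below compares the number of unique weights in a single shared 3×3 kernel with a fully connected map between the same input and output sizes. The 6×6 image, the kernel values and the helper names (`conv2d_valid`, `dense_param_count`) are illustrative assumptions; biases and multiple channels are ignored.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1).

    Strictly speaking this is cross-correlation, which is what most
    deep learning libraries compute under the name "convolution".
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every location.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def dense_param_count(n_inputs, n_outputs):
    """Weights needed by a fully connected layer (biases ignored)."""
    return n_inputs * n_outputs

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # toy horizontal-difference filter

feature_map = conv2d_valid(image, kernel)
conv_params = kernel.size
dense_params = dense_param_count(image.size, feature_map.size)

print("feature map shape:", feature_map.shape)
print("unique weights with sharing:", conv_params)
print("weights without sharing:", dense_params)
```

In this toy setting the shared kernel uses 9 weights, while a dense layer mapping the 36 inputs to the 16 outputs would need 576. The gap widens quickly as the image grows, because the kernel size does not depend on the image size.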

# 7.9 Sparse Representations

In this section we describe a different kind of regularization strategy, where the effect on the model parameters is only indirect.

Specifically, we consider representational sparsity as a form of regularization.

We have already discussed (in Sec. 7.2.2) how L1 penalization induces a sparse parametrization, meaning that a significant number of the parameters are zero (or close to zero). Representational sparsity, on the other hand, describes a representation where a significant number of elements are zero (or close to zero).

Parameter sparsity: an example of a sparsely parametrized linear regression model.

<img src="figures/cap7.57.png" width=600 />

Representational sparsity: linear regression with a sparse representation h of the data x.

<img src="figures/cap7.58.png" width=600 />

<img src="figures/cap7.59.png" width=600 />
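
The two kinds of sparsity can be told apart with a small numerical sketch. All matrices and vectors below are made-up illustrative values, and the symbols `A_sparse`, `B_dense`, `h_sparse` and the helper names are assumptions chosen for this example. The L1 penalty is applied to the weights in one case and to the representation in the other.

```python
import numpy as np

def zero_fraction(v, tol=1e-8):
    """Fraction of entries whose magnitude is at most tol."""
    return float(np.mean(np.abs(v) <= tol))

def l1_penalty(v, alpha):
    """alpha * ||v||_1, the penalty that encourages zeros in v."""
    return alpha * np.sum(np.abs(v))

# Parameter sparsity: y = A x, where the weight matrix A has many zeros.
A_sparse = np.array([[4.0, 0.0,  0.0, -2.0],
                     [0.0, 0.0, -1.0,  0.0],
                     [0.0, 5.0,  0.0,  0.0]])
x = np.array([2.0, 3.0, -2.0, -5.0])
y_param = A_sparse @ x

# Representational sparsity: y = B h, where B is dense but the
# representation h of the data has many zeros.
B_dense = np.array([[ 3.0, -1.0,  2.0],
                    [ 4.0,  2.0, -3.0],
                    [-1.0,  5.0,  4.0]])
h_sparse = np.array([0.0, 2.0, 0.0])
y_repr = B_dense @ h_sparse

print("zero fraction of A:", zero_fraction(A_sparse))
print("zero fraction of h:", zero_fraction(h_sparse))
print("L1 penalty on weights:", l1_penalty(A_sparse, alpha=0.5))
print("L1 penalty on representation:", l1_penalty(h_sparse, alpha=0.5))
```

The only difference between the two regularizers here is the argument passed to `l1_penalty`: the weights in the first case, the activations in the second. That is why the effect of representational sparsity on the parameters is only indirect.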

# 7.10 Bagging and Other Ensemble Methods

Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994).

The idea is to train several different models separately, then have all of the models vote on the output for test examples.

This is an example of a general strategy in machine learning called model averaging. 

Techniques employing this strategy are known as ensemble methods.

<img src="figures/cap7.60.png" width=600 />

<img src="figures/cap7.61.png" width=600 />
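
A minimal sketch of the train-separately-then-vote idea is given below. It assumes a toy two-class dataset and a nearest-centroid classifier as the base model; both are stand-ins chosen for brevity, and the function names are hypothetical. Each model is fit on a bootstrap resample (sampling with replacement, same size as the original set), and predictions are combined by majority vote.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) examples with replacement from (X, y)."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def fit_centroids(X, y):
    """Base model (an assumption of this sketch): one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    """Assign each point to the class of its nearest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def bagging_predict(models, X):
    """Majority vote over the ensemble (labels must be non-negative ints)."""
    votes = np.stack([predict_centroids(m, X) for m in models])  # (k, n)
    return np.array([np.bincount(col).argmax() for col in votes.T])

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=2.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Train k = 5 models, each on its own bootstrap resample.
models = [fit_centroids(*bootstrap_sample(X, y, rng)) for _ in range(5)]

X_test = np.array([[-2.0, -2.0], [2.0, 2.0]])
predictions = bagging_predict(models, X_test)
print("ensemble predictions:", predictions)
```

For regression the vote would typically be replaced by an average of the model outputs. An odd number of models avoids ties in the binary case.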

# 7.11 Dropout

<img src="figures/cap7.62.png" width=600 />

<img src="figures/cap7.63.png" width=600 />

<img src="figures/cap7.64.png" width=600 />

<img src="figures/cap7.65.png" width=600 />

<img src="figures/cap7.66.png" width=600 />

<img src="figures/cap7.67.png" width=600 />

# 7.12 Multi-Task Learning

# 7.13 Adversarial Training

<img src="figures/cap7.68.png" width=600 />

<img src="figures/cap7.69.png" width=600 />

<img src="figures/cap7.70.png" width=600 />

# References

* [1] Bengio's DL book: Regularization - http://www.iro.umontreal.ca/~bengioy/dlbook/regularization.html
* [2] Linear Classification - http://vision.stanford.edu/teaching/cs231n/slides/lecture3.pdf
* [3] Linear Classification : Loss - http://cs231n.github.io/linear-classify/#loss    
* [4] Backprop and Intro to Neural Nets - http://vision.stanford.edu/teaching/cs231n/slides/lecture5.pdf
* [5] Training Neural Networks - http://vision.stanford.edu/teaching/cs231n/slides/lecture6.pdf
* [6] Machine Learning (Regularization) - http://enginius.tistory.com/476
* [7] Regularisation methods - http://www.cse.unsw.edu.au/~dimitris/Regularization.pdf
* [8] Nonlinear ridge regression Risk, regularization, and cross-validation - https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/lecture4.pdf
* [9] Large-Scale Elastic-Net Regularized Generalized Linear Models at Spark Summit 2015 (17-20 slide) - http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
* [10] Regularization in Neural Networks - http://www.cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf
* [11] Gradient, Jacobian matrix, Hessian matrix, Laplacian - http://darkpgmr.tistory.com/132
* [12] Regularization - http://www.cs.pomona.edu/~dkauchak/classes/f13/cs451-f13/lectures/lecture15-regularization.pptx
* [13] Convolutional Neural Networks: architectures, convolution / pooling layers - http://vision.stanford.edu/teaching/cs231n/slides/lecture7.pdf