# Chapter 9 Convolutional Networks

* 손고리즘 / 손고리즘 ML: Part 3 - DML [1]
* 김무성

# Contents

* 9.1 The Convolution Operation
* 9.2 Motivation
* 9.3 Pooling
* 9.4 Convolution and Pooling as an Infinitely Strong Prior
* 9.5 Variants of the Basic Convolution Function
* 9.6 Structured Outputs
* 9.7 Convolutional Modules
* 9.8 Data Types
* 9.9 Efficient Convolution Algorithms
* 9.10 Random or Unsupervised Features
* 9.11 The Neuroscientific Basis for Convolutional Networks
* 9.12 Convolutional Networks and the History of Deep Learning

Convolutional networks (also known as convolutional neural networks or CNNs) are a specialized kind of neural network for processing data that has a known, grid-like topology.

The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.

#### Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

# 9.1 The Convolution Operation

#### Convolution operator

* Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output x(t), the position of the spaceship at time t.
* Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average together several measurements.
* Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements.
* We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

<img src="figures/cap9.1.png" />

The convolution operation is typically denoted with an asterisk:

<img src="figures/cap9.2.png"  />

#### input & kernel & feature map

In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input and the second argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.

#### discrete convolution

<img src="figures/cap9.3.png" />
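As a minimal sketch of the discrete case, the snippet below computes s(t) = Σ_a x(a) w(t − a) for finite lists, treating values outside the lists as zero. The function name `conv1d` and the sample readings are assumptions made for illustration; they are not part of the original notes.

```python
def conv1d(x, w):
    """Full discrete convolution: s[t] = sum_a x[a] * w[t - a]."""
    n, k = len(x), len(w)
    s = [0.0] * (n + k - 1)
    for t in range(n + k - 1):
        for a in range(n):
            if 0 <= t - a < k:          # w is treated as zero outside its support
                s[t] += x[a] * w[t - a]
    return s


# Hypothetical noisy position readings x(t) and weights w(a) indexed by age a;
# w[0] gives the most weight to the newest measurement.
x = [1.0, 2.0, 4.0, 3.0]
w = [0.5, 0.25, 0.25]

s = conv1d(x, w)
print(s)  # [0.5, 1.25, 2.75, 3.0, 1.75, 0.75]
```

Swapping the arguments, `conv1d(w, x)`, returns the same list, which is the commutative property in the discrete setting.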

#### multidimensional case

In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of learnable parameters.

We will refer to these multidimensional arrays as tensors.

For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

<img src="figures/cap9.4.png" />

Note that convolution is commutative, meaning we can equivalently write:

<img src="figures/cap9.5.png" />

#### cross-correlation

While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

<img src="figures/cap9.6.png" />

* Many machine learning libraries implement cross-correlation but call it convolution. 

<img src="figures/cap9.7.png" width=600 />
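The relationship between the two operations can be checked on a tiny example. This is a sketch under stated assumptions: the kernel is only placed where it fits entirely inside the image, and the helper names `cross_correlate2d`, `flip2d` and `convolve2d` are made up for this illustration.

```python
def cross_correlate2d(image, kernel):
    """Valid cross-correlation: S[i][j] = sum_{m,n} I[i+m][j+n] * K[m][n]."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def flip2d(kernel):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """Valid convolution, computed as cross-correlation with the flipped kernel."""
    return cross_correlate2d(image, flip2d(kernel))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))  # [[-4, -4], [-4, -4]]
print(convolve2d(image, kernel))         # [[4, 4], [4, 4]]
```

Because a learned kernel can absorb the flip, a network trained with cross-correlation simply learns the flipped version of the kernel it would learn with true convolution.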

Discrete convolution can be viewed as multiplication by a matrix.

Viewing convolution as matrix multiplication usually does not help to implement convolution operations, but it is useful for understanding and designing neural networks.
* Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should work with convolution, without requiring any further changes to the neural network.
* Typical convolutional neural networks do make use of further specializations in order to deal with large inputs efficiently, but these are not strictly necessary from a theoretical perspective.
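To make the matrix view concrete, the sketch below builds the matrix that corresponds to a one-dimensional convolution. Each row equals the row above shifted by one element (a Toeplitz matrix), and most entries are zero. The names `toeplitz_matrix` and `matvec` and the sample values are assumptions for illustration only.

```python
def toeplitz_matrix(w, n):
    """Build the (n + k - 1) x n matrix T with T[t][a] = w[t - a], zero elsewhere."""
    k = len(w)
    return [[w[t - a] if 0 <= t - a < k else 0.0 for a in range(n)]
            for t in range(n + k - 1)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


x = [1.0, 2.0, 4.0, 3.0]     # hypothetical input
w = [0.5, 0.25, 0.25]        # hypothetical kernel

T = toeplitz_matrix(w, len(x))
s = matvec(T, x)
print(s)  # [0.5, 1.25, 2.75, 3.0, 1.75, 0.75]

# Only len(w) * len(x) of the matrix entries are nonzero, and they reuse
# the same len(w) kernel values along each diagonal.
nonzero = sum(1 for row in T for value in row if value != 0.0)
print(nonzero, "of", len(T) * len(T[0]))  # 12 of 24
```

The matrix has 24 entries but only 3 distinct parameters, which is why a direct implementation avoids materializing it.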

# 9.2 Motivation

Convolution leverages three important ideas that can help improve a machine learning system:
* sparse interactions, 
* parameter sharing, and 
* equivariant representations. 

<img src="figures/cap9.8.png" width=600 />

<img src="figures/cap9.9.png" width=600 />

<img src="figures/cap9.10.png" width=600 />

<img src="figures/cap9.11.png" width=600 />

<img src="figures/cap9.12.png" width=600 />

# 9.3 Pooling

A typical layer of a convolutional network consists of three stages (see Fig. 9.7).
* In the first stage, the layer performs several convolutions in parallel to produce a set of presynaptic activations.
* In the second stage, each presynaptic activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage.
* In the third stage, we use a pooling function to modify the output of the layer further.

A pooling function
* replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
* For example,
    - the max pooling operation
        - reports the maximum output within a rectangular neighborhood.
* Other popular pooling functions include
    - the average of a rectangular neighborhood,
    - the L2 norm of a rectangular neighborhood, or
    - a weighted average based on the distance from the central pixel.
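A one-dimensional max pooling sketch is given below. The function name `max_pool1d`, the detector values, and the zero used to fill the shifted input are all assumptions chosen for illustration.

```python
def max_pool1d(values, width, stride=1):
    """Report the maximum within each window of `width` neighboring outputs."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, stride)]


# Hypothetical detector-stage outputs, and the same outputs shifted right by one
# position (the vacated first slot is filled with 0.0 as an assumption).
detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.3]
shifted = [0.0] + detector[:-1]

pooled = max_pool1d(detector, width=3)
pooled_shifted = max_pool1d(shifted, width=3)
print(pooled)          # [1.0, 1.0, 0.2, 0.3]
print(pooled_shifted)  # [1.0, 1.0, 1.0, 0.2]

# Pooling with a stride larger than one also downsamples the output.
print(max_pool1d(detector, width=3, stride=2))  # [1.0, 0.2]
```

In this example every detector value changes after the shift, yet only two of the four pooled values change. A stride greater than one gives the next layer fewer inputs to process.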

<img src="figures/cap9.13.png" width=600 />

<img src="figures/cap9.14.png" width=600 />

#### translation invariant

In all cases, pooling helps to make the representation become invariant to small translations of the input.

KEY IDEA: Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.

<img src="figures/cap9.15.png" width=600 />

#### infinitely strong prior

The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.

#### transformation invariant

Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see Fig. 9.9).

<img src="figures/cap9.16.png" width=600 />

#### pooling with downsampling

<img src="figures/cap9.17.png" width=600 />

# 9.4 Convolution and Pooling as an Infinitely Strong Prior

* An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values.
* Of course, implementing a convolutional net as a fully connected net with an infinitely strong prior would be extremely computationally wasteful. But thinking of a convolutional net as a fully connected net with an infinitely strong prior can give us some insights into how convolutional nets work.
* One key insight is that convolution and pooling can cause underfitting.
    - If a task relies on preserving precise spatial information, then using pooling on all features can cause underfitting.
* Another key insight from this view is that we should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance.

# 9.5 Variants of the Basic Convolution Function

Additionally, the input is usually not just a grid of real values. Rather, it is a grid of vector-valued observations.

#### 3-D tensor

* For example, a color image has a <font color="red">red, green, and blue</font> intensity at each pixel.
* In a multilayer convolutional network, the input to the second layer is the output of the first layer, which usually has the output of many different convolutions at each position.
* When working with images, we usually think of the <font color="red">input and output of the convolution as being 3-D tensors</font>, with <font color="blue">one index into the different channels and two indices into the spatial coordinates</font> of each channel.

#### 4-D tensors

* Assume we have a 4-D kernel tensor
    - K with element K_i,j,k,l
        - giving the connection strength
        - between
            - a unit in channel i of the output and
            - a unit in channel j of the input,
        - with an offset of k rows and l columns
            - between the output unit and the input unit.
    - Assume our input consists of observed data V with element V_i,j,k giving the value of the input unit within channel i at row j and column k.
    - If Z is produced by convolving K across V without flipping K, then

<img src="figures/cap9.18.png" />

We may also want to skip over some positions of the kernel in order to reduce the computational cost (at the expense of not extracting our features as finely).

* We can think of this as downsampling the output of the full convolution function.
* If we want to sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that:

<img src="figures/cap9.19.png" />

* We refer to s as the stride of this downsampled convolution.
* It is also possible to define a separate stride for each direction of motion.
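The multi-channel operation and its strided variant can be sketched with nested lists. This is an illustrative sketch, not a reference implementation: indices are zero-based (so the offset is `j*s + m` rather than `(j−1)s + m`), no padding is applied, the same stride is used in both directions, and the name `strided_conv` and all sample values are assumptions.

```python
def strided_conv(K, V, s=1):
    """c(K, V, s): Z[i][j][k] = sum_{l,m,n} V[l][j*s + m][k*s + n] * K[i][l][m][n].

    K is indexed [output channel][input channel][row offset][column offset];
    V is indexed [channel][row][column]. The kernel is not flipped.
    """
    out_ch, in_ch = len(K), len(K[0])
    kh, kw = len(K[0][0]), len(K[0][0][0])
    out_rows = (len(V[0]) - kh) // s + 1
    out_cols = (len(V[0][0]) - kw) // s + 1
    return [[[sum(V[l][j * s + m][k * s + n] * K[i][l][m][n]
                  for l in range(in_ch)
                  for m in range(kh)
                  for n in range(kw))
              for k in range(out_cols)]
             for j in range(out_rows)]
            for i in range(out_ch)]


# Hypothetical input: channel 0 counts 0..15 row by row, channel 1 is all ones.
V = [[[r * 4 + c for c in range(4)] for r in range(4)],
     [[1] * 4 for _ in range(4)]]
# One output channel: pick the top-left pixel of channel 0, sum a 2x2 patch of channel 1.
K = [[[[1, 0], [0, 0]],
      [[1, 1], [1, 1]]]]

Z1 = strided_conv(K, V, s=1)
Z2 = strided_conv(K, V, s=2)
print(Z1)  # [[[4, 5, 6], [8, 9, 10], [12, 13, 14]]]
print(Z2)  # [[[4, 6], [12, 14]]]
```

The stride-2 result equals every second row and column of the stride-1 result, which matches the view of strided convolution as downsampling the output of the full convolution.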

#### zero-padding

One essential feature of any convolutional network implementation is the ability to implicitly zero-pad the input V in order to make it wider.

* Without this feature, the width of the representation shrinks by (kernel width − 1) at each layer.
* Zero padding the input allows us to control the kernel width and the size of the output independently.
* Without zero padding, we are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels, both scenarios that significantly limit the expressive power of the network.

<img src="figures/cap9.20.png" width=600 />

<img src="figures/cap9.21.png" width=600 />

#### Three special cases of the zero-padding setting

* valid convolution
    - One is the extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the kernel is contained entirely within the image.
    - In MATLAB terminology, this is called valid convolution.
    - In this case, all pixels in the output are a function of the same number of pixels in the input, so the behavior of an output pixel is somewhat more regular. However, the size of the output shrinks at each layer.
    - As layers are added, the spatial dimension of the network will eventually drop to 1 × 1, at which point additional layers cannot meaningfully be considered convolutional.
* same convolution
    - Another special case of the zero-padding setting is when just enough zero-padding is added to keep the size of the output equal to the size of the input.
    - MATLAB calls this same convolution.
    - In this case, the network can contain as many convolutional layers as the available hardware can support, since the operation of convolution does not modify the architectural possibilities available to the next layer.
    - However, the input pixels near the border influence fewer output pixels than the input pixels near the center. This can make the border pixels somewhat underrepresented in the model.
* full convolution
    - The other extreme case, which MATLAB refers to as full convolution, is the one in which enough zeros are added for every pixel to be visited k times in each direction, resulting in an output image of size (m + k − 1) × (m + k − 1).
    - In this case, the output pixels near the border are a function of fewer pixels than the output pixels near the center.
    - This can make it difficult to learn a single kernel that performs well at all positions in the convolutional feature map.

Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between “valid” and “same” convolution.
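The arithmetic behind the three settings is simple enough to write down. The sketch below assumes stride 1, an input of width m and a kernel of width k. The function names and the example sizes (m = 16, k = 6) are illustrative assumptions.

```python
def output_width(m, k, mode):
    """Output width for input width m and kernel width k at stride 1."""
    if mode == "valid":
        return m - k + 1      # no zero-padding
    if mode == "same":
        return m              # k - 1 zeros added in total
    if mode == "full":
        return m + k - 1      # k - 1 zeros added on each side
    raise ValueError(f"unknown mode: {mode}")


def layers_until_collapse(m, k):
    """Count how many valid-convolution layers fit before the width reaches 1."""
    if k < 2:
        raise ValueError("a kernel of width 1 never shrinks the representation")
    layers = 0
    while m > 1 and m - k + 1 >= 1:
        m = output_width(m, k, "valid")
        layers += 1
    return layers


for mode in ("valid", "same", "full"):
    print(mode, output_width(16, 6, mode))  # valid 11, same 16, full 21

print(layers_until_collapse(16, 6))  # 3  (widths 16 -> 11 -> 6 -> 1)
```

With valid convolution the width drops by k − 1 per layer, so the depth is bounded by the input size. With same convolution the width is unchanged and depth is limited only by the available hardware.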

#### In some cases, we do not actually want to use convolution, but rather locally connected layers.

* In this case, the adjacency matrix in the graph of our MLP is the same, but every connection has its own weight, specified by a 6-D tensor W.
* The indices into W are respectively:
    - i, the output channel,
    - j, the output row,
    - k, the output column,
    - l, the input channel,
    - m, the row offset within the input, and
    - n, the column offset within the input.
* The linear part of a locally connected layer is then given by

<img src="figures/cap9.22.png" />
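A locally connected layer can be sketched directly from the index list above. Indices are zero-based here, so the input position is `[j + m][k + n]`. The function name `locally_connected` and the example weights are assumptions for illustration.

```python
def locally_connected(W, V):
    """Z[i][j][k] = sum_{l,m,n} V[l][j + m][k + n] * W[i][j][k][l][m][n].

    W is indexed [output channel][output row][output column]
                 [input channel][row offset][column offset].
    """
    out_ch, out_rows, out_cols = len(W), len(W[0]), len(W[0][0])
    in_ch = len(W[0][0][0])
    kh, kw = len(W[0][0][0][0]), len(W[0][0][0][0][0])
    return [[[sum(V[l][j + m][k + n] * W[i][j][k][l][m][n]
                  for l in range(in_ch)
                  for m in range(kh)
                  for n in range(kw))
              for k in range(out_cols)]
             for j in range(out_rows)]
            for i in range(out_ch)]


# Hypothetical example: one channel, one row, four columns, patches of width 2.
V = [[[1, 2, 3, 4]]]
position_kernels = [[1, 0], [0, 1], [1, 1]]   # a different kernel per output column
W = [[[[[kernel]] for kernel in position_kernels]]]

Z = locally_connected(W, V)
print(Z)  # [[[1, 3, 7]]]
```

Each output column has its own weights, so this layer stores 3 × 2 = 6 weights where a convolution with the same patch width would store 2.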

#### Tiled convolution 

Tiled convolution (Gregor and LeCun, 2010; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer. Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we rotate through as we move through space.

This means that immediately neighboring locations will have different filters, like in a locally connected layer, but the memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels, rather than the size of the entire output feature map.

To define tiled convolution algebraically, let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to the output width, this is the same as a locally connected layer.

<img src="figures/cap9.23.png" />
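The cycling through kernels can be sketched with the modulo operator. This is an illustrative sketch with zero-based indices, so the last two kernel indices are `j % t` and `k % t`. The function name `tiled_conv`, the axis order chosen for K, and the example values are assumptions.

```python
def tiled_conv(K, V, t):
    """Z[i][j][k] = sum_{l,m,n} V[l][j + m][k + n] * K[i][l][m][n][j % t][k % t].

    K is indexed [output channel][input channel][row offset][column offset]
                 [tile row][tile column]; V is indexed [channel][row][column].
    """
    out_ch, in_ch = len(K), len(K[0])
    kh, kw = len(K[0][0]), len(K[0][0][0])
    out_rows = len(V[0]) - kh + 1
    out_cols = len(V[0][0]) - kw + 1
    return [[[sum(V[l][j + m][k + n] * K[i][l][m][n][j % t][k % t]
                  for l in range(in_ch)
                  for m in range(kh)
                  for n in range(kw))
              for k in range(out_cols)]
             for j in range(out_rows)]
            for i in range(out_ch)]


# Hypothetical example: two kernels of width 2 alternate along the columns.
t = 2
kernels = [[1, 0], [0, 1]]    # kernels[b] is used at output columns with k % t == b
K = [[[[[[kernels[b][n] for b in range(t)] for a in range(t)]
        for n in range(len(kernels[0]))]]]]
V = [[[1, 2, 3, 4, 5]]]

Z = tiled_conv(K, V, t)
print(Z)  # [[[1, 3, 3, 5]]]
```

Neighboring output columns use different kernels, but the number of stored kernels depends on t rather than on the output width. With t = 1 this reduces to ordinary convolution without kernel flipping.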

### training

To train the network, we need to compute the derivatives with respect to the weights in the kernel. To do so, we can use a function

<img src="figures/cap9.24.png" />

If this layer is not the bottom layer of the network, we’ll need to compute the gradient with respect to V in order to backpropagate the error farther down. To do so, we can use a function

<img src="figures/cap9.25.png" />

We could also use h to define the reconstruction of a convolutional autoencoder, or the probability distribution over visible given hidden units in a convolutional RBM or sparse coding model. Suppose we have hidden units H in the same format as Z and we define a reconstruction

<img src="figures/cap9.26.png" />

In order to train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s).

To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the backpropagation algorithm on any standard network architectures.
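The roles of c, g and h can be illustrated in a simplified setting: one dimension, one channel, zero-based indices. This is a sketch under those assumptions, not the general tensor form. The extra arguments `kernel_width` and `input_width` are introduced here only because the shapes cannot be inferred from the other arguments, and G stands for the gradient of the loss with respect to the layer output.

```python
def c(k, v, s):
    """Forward pass, strided cross-correlation: z[j] = sum_m v[j*s + m] * k[m]."""
    out = (len(v) - len(k)) // s + 1
    return [sum(v[j * s + m] * k[m] for m in range(len(k))) for j in range(out)]


def g(G, v, s, kernel_width):
    """Gradient w.r.t. the kernel: dJ/dk[m] = sum_j G[j] * v[j*s + m]."""
    return [sum(G[j] * v[j * s + m] for j in range(len(G)))
            for m in range(kernel_width)]


def h(k, G, s, input_width):
    """Gradient w.r.t. the input: dJ/dv[p] sums k[m] * G[j] over all j*s + m == p."""
    grad = [0] * input_width
    for j in range(len(G)):
        for m in range(len(k)):
            grad[j * s + m] += k[m] * G[j]
    return grad


# Hypothetical values for illustration.
v = [1, 2, 3, 4, 5]
k = [1, -1, 2]
s = 2
G = [1, 10]

z = c(k, v, s)
grad_k = g(G, v, s, len(k))
grad_v = h(k, G, s, len(v))
print(z)       # [5, 9]
print(grad_k)  # [31, 42, 53]
print(grad_v)  # [1, -1, 12, -10, 20]
```

For the linear loss J = Σ_j G[j] z[j], increasing one kernel weight by 1 changes J by exactly the matching entry of `grad_k`, which gives a quick way to check the functions. In h, the input position p = 2 is touched by both output units, so its gradient accumulates two terms.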

# 9.6 Structured Outputs

# 9.7 Convolutional Modules

# 9.8 Data Types

The data used with a convolutional network usually consists of several channels, each channel being the observation of a different quantity at some point in space or time. See Table 9.1 for examples of data types with different dimensionalities and number of channels.

So far we have discussed only the case where every example in the train and test data has the same spatial dimensions. One advantage to convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix multiplication-based neural networks.

<img src="figures/cap9.27.png" width=600 />
<img src="figures/cap9.28.png" width=600 />

Table 9.1: Examples of different formats of data that can be used with convolutional networks.

# 9.9 Efficient Convolution Algorithms

Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
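The equivalence can be checked numerically on a small example. The sketch below uses a naive O(n²) discrete Fourier transform written with the standard library, so it demonstrates the identity rather than the speed-up. Any gain in practice comes from substituting a fast Fourier transform. The function names and sample values are assumptions for illustration.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (a stand-in for a real FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [value / n for value in out] if inverse else out


def fourier_conv1d(x, w):
    """Linear convolution: zero-pad, transform, multiply point-wise, transform back."""
    n = len(x) + len(w) - 1                    # padding avoids circular wrap-around
    X = dft(list(x) + [0.0] * (n - len(x)))
    W = dft(list(w) + [0.0] * (n - len(w)))
    s = dft([a * b for a, b in zip(X, W)], inverse=True)
    return [value.real for value in s]


x = [1.0, 2.0, 4.0, 3.0]     # hypothetical input
w = [0.5, 0.25, 0.25]        # hypothetical kernel

result = fourier_conv1d(x, w)
print([round(value, 6) for value in result])
```

Up to floating-point rounding, the result matches the direct sum s(t) = Σ_a x(a) w(t − a), which for these values is 0.5, 1.25, 2.75, 3.0, 1.75, 0.75.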

#### separable

* When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient.
* It is equivalent to compose d one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product.
* The kernel also takes fewer parameters to represent as vectors.
* If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space.
* Of course, not every convolution can be represented in this way.
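For d = 2, separability can be checked by comparing one pass with the outer-product kernel against two one-dimensional passes. This is a sketch: the kernel is not flipped, no padding is used, and the helper names, the two factor vectors and the test image are made up for illustration.

```python
def cross_correlate2d(image, kernel):
    """Valid cross-correlation: S[i][j] = sum_{m,n} I[i+m][j+n] * K[m][n]."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def outer(u, v):
    """Outer product of two vectors as a 2-D kernel."""
    return [[a * b for b in v] for a in u]


u = [1, 2, 1]     # hypothetical vertical factor
v = [-1, 0, 1]    # hypothetical horizontal factor
image = [[(r * r + 3 * c) % 7 for c in range(5)] for r in range(5)]

# One pass with the full 3x3 kernel: w**d = 9 multiplications per output pixel.
direct = cross_correlate2d(image, outer(u, v))

# Two passes with 1-D kernels: a 1x3 row kernel, then a 3x1 column kernel,
# for w*d = 6 multiplications per output pixel.
rows_then_cols = cross_correlate2d(cross_correlate2d(image, [v]),
                                   [[a] for a in u])

print(direct == rows_then_cols)  # True
```

The separable form stores 6 numbers (u and v) where the dense kernel stores 9. The gap widens quickly as w or d grows.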

# 9.10 Random or Unsupervised Features

Typically, the most expensive part of convolutional network training is learning the features. The output layer is usually relatively inexpensive due to the small number of features provided as input to this layer after passing through several layers of pooling. When performing supervised training with gradient descent, every gradient step requires a complete run of forward propagation and backward propagation through the entire network.

#### unsupervised fashion

One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion.

There are two basic strategies for obtaining convolution kernels without supervised training. 
* One is to simply initialize them randomly. 
* The other is to learn them with an unsupervised criterion.
    - This approach allows the features to be determined separately from the classifier layer at the top of the architecture. One can then extract the features for the entire training set just once, essentially constructing a new training set for the last layer. Learning the last layer is then typically a convex optimization problem, assuming the last layer is something like logistic regression or an SVM.

Random filters often work surprisingly well in convolutional networks (Jarrett et al., 2009b; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al. (2011) showed that layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights. They argue that this provides an inexpensive way to choose the architecture of a convolutional network: first evaluate the performance of several convolutional network architectures by training only the last layer, then take the best of these architectures and train the entire architecture using a more expensive approach.

An intermediate approach is to learn the features, but using methods that do not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we can use greedy layer-wise unsupervised pretraining: train the first layer in isolation, extract all features from the first layer only once, then train the second layer in isolation given those features, and so on.

As with other approaches to unsupervised pretraining, it remains difficult to tease apart the cause of some of the benefits seen with this approach. Unsupervised pretraining may offer some regularization relative to supervised training, or it may simply allow us to train much larger architectures due to the reduced computational cost of the learning rule.

# 9.11 The Neuroscientific Basis for Convolutional Networks

<img src="figures/cap9.29.png" width=300 />

<img src="figures/cap9.30.png" width=600 />

<img src="figures/cap9.31.png" width=600 />

# 9.12 Convolutional Networks and the History of Deep Learning

<img src="figures/cap9.32.png" width=600 />

# References

* [1] Bengio's book - Chapter 9 Convolutional Networks - http://www.iro.umontreal.ca/~bengioy/dlbook/version-07-08-2015/convnets.html
* [2] Linear Systems and Convolution - http://www.slideshare.net/lineking/lecture4-26782530
* [3] Convolutional Neural Networks: architectures, convolution / pooling layers - http://vision.stanford.edu/teaching/cs231n/slides/lecture7.pdf