# Fundamentals of Deep Learning
## Contents
- Chapter 5. Convolutional Neural Networks

## Vanilla Deep Neural Networks Don’t Scale
In MNIST, our images were only 28 x 28 pixels and were black and white. As a result, a neuron in a fully connected hidden layer would have 784 incoming weights. This seems pretty tractable for the MNIST task, and our vanilla neural net performed quite well. This technique, however, does not scale well as our images grow larger. For example, for a full-color 200 x 200 pixel image, each neuron in the first fully connected hidden layer would have 200 x 200 x 3 = 120,000 incoming weights, and a layer contains many such neurons.

![5-3](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0503.png)

Figure 5-3. The density of connections between layers increases intractably as the size of the image increases
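The weight counts quoted above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `incoming_weights` is hypothetical, and it counts the weights feeding a single fully connected neuron while ignoring the bias.

```python
def incoming_weights(height, width, channels):
    """Number of weights feeding one fully connected neuron (bias excluded)."""
    return height * width * channels

mnist = incoming_weights(28, 28, 1)      # black-and-white MNIST image
color = incoming_weights(200, 200, 3)    # full-color 200 x 200 image

print(mnist)   # 784
print(color)   # 120000
```

A hidden layer with many neurons multiplies this count by the number of neurons, which is why the connection density in Figure 5-3 grows so quickly.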

As we’ll see, the neurons in a convolutional layer are only connected to a small, local region of the preceding layer. A convolutional layer’s function can be expressed simply: it processes a three-dimensional volume of information to produce a new three-dimensional volume of information.

![5-4](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0504.png)

Figure 5-4. Convolutional layers arrange neurons in three dimensions, so layers have width, height, and depth

## Filters and Feature Maps
A `filter` is essentially a feature detector.

![5-5](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0505.png)

Figure 5-5. We’ll analyze this simple black-and-white image as a toy example

Let’s say that we want to detect vertical and horizontal lines in the image. For example, to detect vertical lines, we would use the feature detector on the top, slide it across the entirety of the image, and at every step check if we have a match. This result is our `feature map`, and it indicates where we’ve found the feature we’re looking for in the original image. We can do the same for the horizontal line detector (bottom), resulting in the feature map in the bottom-right corner.

![5-6](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0506.png)

Figure 5-6. Applying filters that detect vertical and horizontal lines on our toy example

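This operation is called a convolution. We take a filter and slide it over the entire area of an input image; at each position, we multiply the filter’s weights elementwise with the underlying patch of the image and sum the results.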
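The sliding operation can be sketched in NumPy as follows. The toy image and the 3 x 3 vertical-line detector are made up for illustration and are not the ones from Figure 5-6; the function uses a stride of 1, no padding, and does not flip the filter (the convention most deep learning libraries follow).

```python
import numpy as np

def feature_map(image, filt):
    """Slide `filt` over `image` and sum the elementwise products at each position."""
    e_h, e_w = filt.shape
    out_h = image.shape[0] - e_h + 1
    out_w = image.shape[1] - e_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + e_h, j:j + e_w]
            out[i, j] = np.sum(patch * filt)
    return out

# Toy 5 x 5 image with a single vertical line in the middle column.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Hypothetical vertical-line detector: responds to a bright center column.
vertical = np.array([[0.0, 1.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 1.0, 0.0]])

fmap = feature_map(image, vertical)
print(fmap)
# [[0. 3. 0.]
#  [0. 3. 0.]
#  [0. 3. 0.]]
```

The large values in the middle column of the result mark the positions where the filter lines up with the vertical line in the image.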

Filters represent combinations of connections (one such combination is highlighted in Figure 5-7) that get replicated across the entirety of the input.

The output layer is the feature map generated by this filter. A neuron in the feature map is activated if the filter contributing to its activity detected an appropriate feature at the corresponding position in the previous layer.

![5-7](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0507.png)

Figure 5-7. Representing filters and feature maps as neurons in a convolutional layer

We can express the feature map mathematically as follows:

$$m_{ij}^k=f((W \cdot x)_{ij} + b^k)$$

where:

- $m^k$ denotes the $k^{th}$ feature map in layer $m$
- $W$ denotes the corresponding filter, given by the values of its weights
- $b^k$ is the bias of the neurons in the feature map (note that the bias is kept identical for all of the neurons in a feature map)
- $x$ is the input to the layer, and $f$ is the function applied to the logit of each neuron
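As a rough illustration of the feature map expression, the sketch below computes one feature map with a shared bias. The choice of ReLU for $f$, the function name `feature_map_k`, and the input and filter values are all assumptions made for the example; the source does not fix any of them.

```python
import numpy as np

def relu(z):
    """Assumed choice of f; any elementwise function could be substituted."""
    return np.maximum(z, 0.0)

def feature_map_k(x, W, b_k, f=relu):
    """Compute m^k_ij = f((W . x)_ij + b^k) with stride 1 and no padding."""
    e = W.shape[0]                      # square filter assumed
    rows = x.shape[0] - e + 1
    cols = x.shape[1] - e + 1
    m_k = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # The same bias b_k is added at every position of the feature map.
            logit = np.sum(W * x[i:i + e, j:j + e]) + b_k
            m_k[i, j] = f(logit)
    return m_k

x = np.arange(16, dtype=float).reshape(4, 4)   # made-up 4 x 4 input
W = np.array([[1.0, -1.0],
              [1.0, -1.0]])                    # made-up 2 x 2 filter
m_k = feature_map_k(x, W, b_k=3.0)
print(m_k.shape)   # (3, 3)
```

Because the bias is a single number per feature map, the filter contributes only `W.size + 1` learned parameters regardless of how large the input is.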

Suppose we want to detect a face at a particular layer of a convolutional network, and we have accumulated three feature maps, one for eyes, one for noses, and one for mouths. We know that a particular location contains a face if the corresponding locations in the primitive feature maps contain the appropriate features (two eyes, a nose, and a mouth). In other words, **to make decisions about the existence of a face, we must combine evidence over multiple feature maps.**

As a result, filters must be able to operate over volumes, not just areas. This is shown below in Figure 5-8. Each cell in the input volume is a neuron. A local portion is multiplied with a filter (corresponding to weights in the convolutional layer) to produce a neuron in a feature map in the following volumetric layer of neurons.

![5-8](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0508.png)

Figure 5-8. Representing a full-color RGB image as a volume and applying a volumetric convolutional filter

The depth of the output volume of a convolutional layer is equivalent to the number of filters in that layer, because each filter produces its own slice. We visualize these relationships in Figure 5-9.

![5-9](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0509.png)

Figure 5-9. A three-dimensional visualization of a convolutional layer, where each filter corresponds to a slice in the resulting output volume
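The relationship between the number of filters and the output depth can be sketched in NumPy. This is an illustrative implementation only (stride 1, no padding, no activation function); the name `conv_volume` and the random input values are assumptions made for the example.

```python
import numpy as np

def conv_volume(x, filters, biases):
    """Apply k volumetric filters to an input volume.

    x:       (h_in, w_in, d_in) input volume
    filters: (e, e, d_in, k) filter bank, each filter spans the full input depth
    biases:  (k,) one bias per filter
    returns: (h_in - e + 1, w_in - e + 1, k) output volume of logits
    """
    h_in, w_in, d_in = x.shape
    e, _, d_f, k = filters.shape
    if d_f != d_in:
        raise ValueError("each filter must span the full input depth")
    h_out, w_out = h_in - e + 1, w_in - e + 1
    out = np.zeros((h_out, w_out, k))
    for n in range(k):                  # each filter fills its own depth slice
        for i in range(h_out):
            for j in range(w_out):
                patch = x[i:i + e, j:j + e, :]
                out[i, j, n] = np.sum(patch * filters[:, :, :, n]) + biases[n]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 3))            # e.g. a small RGB patch
filters = rng.normal(size=(3, 3, 3, 4))   # k = 4 filters of spatial extent 3
out = conv_volume(x, filters, np.zeros(4))
print(out.shape)   # (4, 4, 4)
```

The last dimension of the result equals the number of filters, matching the one-slice-per-filter picture in Figure 5-9.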

## Full Description of the Convolutional Layer
The input volume of a convolutional layer has the following characteristics:

- Its width $w_{in}$
- Its height $h_{in}$
- Its depth $d_{in}$
- Its zero padding $p$

This volume is processed by a total of $k$ filters, which represent the weights and connections in the convolutional network. These filters have a number of hyperparameters, which are described as follows:

- Their spatial extent $e$, which is equal to the filter’s height and width.
- Their stride $s$, or the distance between consecutive applications of the filter on the input volume. If we use a stride of 1, we get the full convolution described in the previous section. We illustrate this in Figure 5-10.
- The bias $b$ (a parameter learned like the values in the filter), which is added to each component of the convolution.

![5-10](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0510.png)

Figure 5-10. An illustration of a filter’s stride hyperparameter

This results in an output volume with the following characteristics:

- Its function $f$, which is applied to the incoming logit of each neuron in the output volume to determine its final value
- Its width $w_{out}=\lfloor \frac{w_{in}-e+2p}{s} \rfloor + 1$
- Its height $h_{out}=\lfloor \frac{h_{in}-e+2p}{s} \rfloor + 1$
- Its depth $d_{out}=k$
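The output dimensions can be computed with a small helper. This is a sketch under stated assumptions: the name `conv_output_size` is hypothetical, and it uses floor division, the usual convention when every filter application must fit inside the padded input. In the configurations shown here the division is exact, so the rounding direction does not affect the result.

```python
def conv_output_size(size_in, e, p, s):
    """Output size along one spatial axis for extent e, zero padding p, stride s."""
    return (size_in - e + 2 * p) // s + 1

# Configuration described for Figure 5-11: width 5, zero padding 1,
# spatial extent 3, stride 2.
print(conv_output_size(5, e=3, p=1, s=2))    # 3

# A 28 x 28 input with a 5 x 5 filter, no padding, stride 1.
print(conv_output_size(28, e=5, p=0, s=1))   # 24
```

The same function applies to the height, and the output depth is simply the number of filters $k$.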

![5-11](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0511.png)

Figure 5-11. This is a convolutional layer with an input volume that has width 5, height 5, depth 3, and zero padding 1. There are 2 filters, with spatial extent 3 and applied with a stride of 2. It results in an output volume with width 3, height 3, and depth 2. We apply the first convolutional filter to the upper-leftmost 3 x 3 piece of the input volume to generate the upper-leftmost entry of the first depth slice.

![5-12](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0512.png)

Figure 5-12. Using the same setup as Figure 5-11, we generate the next value in the first depth slice of the output volume.  

TensorFlow provides us with a convenient operation to perform a convolution on a minibatch of input volumes (note that we must apply our choice of function $f$ ourselves; it is not performed by the operation itself):

```py
tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=True, name=None)
```

- `input`: a four-dimensional tensor of size $N \times h_{in} \times w_{in} \times d_{in}$, where $N$ is the number of examples in our minibatch.
- `filter`: also a four-dimensional tensor, representing all of the filters applied in the convolution. It is of size $e \times e \times d_{in} \times k$.
- The resulting tensor emitted by this operation has the same structure as `input`: it is four-dimensional, of size $N \times h_{out} \times w_{out} \times k$.
- Setting the `padding` argument to `"SAME"` selects the zero padding so that height and width are preserved by the convolutional layer when the stride is 1.
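The shape bookkeeping for this operation can be sketched in plain Python without running TensorFlow. The helper `conv2d_output_shape` below is hypothetical (it is not part of TensorFlow); it assumes the batch-height-width-depth layout described above and applies the output-size rules that the TensorFlow documentation gives for the `"SAME"` and `"VALID"` padding modes.

```python
import math

def conv2d_output_shape(input_shape, filter_shape, strides, padding):
    """Predict the output shape of a 2-D convolution.

    input_shape:  (N, h_in, w_in, d_in)
    filter_shape: (e_h, e_w, d_in, k)
    strides:      (1, s_h, s_w, 1)
    padding:      "SAME" or "VALID"
    """
    n, h_in, w_in, d_in = input_shape
    e_h, e_w, d_f, k = filter_shape
    if d_f != d_in:
        raise ValueError("filter depth must match input depth")
    _, s_h, s_w, _ = strides
    if padding == "SAME":
        # Zero padding is chosen so that only the stride shrinks the output.
        h_out = math.ceil(h_in / s_h)
        w_out = math.ceil(w_in / s_w)
    elif padding == "VALID":
        # No zero padding: the filter must fit entirely inside the input.
        h_out = math.ceil((h_in - e_h + 1) / s_h)
        w_out = math.ceil((w_in - e_w + 1) / s_w)
    else:
        raise ValueError("padding must be 'SAME' or 'VALID'")
    return (n, h_out, w_out, k)

same = conv2d_output_shape((32, 28, 28, 1), (5, 5, 1, 16), (1, 1, 1, 1), "SAME")
valid = conv2d_output_shape((32, 28, 28, 1), (5, 5, 1, 16), (1, 1, 1, 1), "VALID")
print(same)    # (32, 28, 28, 16)
print(valid)   # (32, 24, 24, 16)
```

With a stride of 1, `"SAME"` keeps the 28 x 28 spatial size, while `"VALID"` shrinks it by the filter extent minus one.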

## Max Pooling
The essential idea behind max pooling is to break up each feature map into equally sized tiles. Then we create a condensed feature map. Specifically, we create a cell for each tile, compute the maximum value in the tile, and propagate this maximum value into the corresponding cell of the condensed feature map. This process is illustrated in Figure 5-13.

![5-13](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0513.png)

Figure 5-13. An illustration of how max pooling significantly reduces parameters as we move up the network

We can describe a pooling layer with two parameters:

- Its spatial extent $e$
- Its stride $s$

It’s important to note that only two major variations of the pooling layer are used. The first is the nonoverlapping pooling layer with $e = 2$, $s = 2$. The second is the overlapping pooling layer with $e = 3$, $s = 2$. The resulting dimensions of each feature map are as follows:

- Its width $w_{out}=\lfloor \frac{w_{in}-e}{s} \rfloor + 1$
- Its height $h_{out}=\lfloor \frac{h_{in}-e}{s} \rfloor + 1$
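Max pooling over a single feature map can be sketched in NumPy as below. The function name `max_pool` and the sample values are made up for illustration; the sketch uses floor division for the output size, so any rows or columns that do not fill a complete tile are dropped.

```python
import numpy as np

def max_pool(fmap, e, s):
    """Max pooling over one 2-D feature map with spatial extent e and stride s."""
    h_out = (fmap.shape[0] - e) // s + 1
    w_out = (fmap.shape[1] - e) // s + 1
    pooled = np.empty((h_out, w_out), dtype=fmap.dtype)
    for i in range(h_out):
        for j in range(w_out):
            tile = fmap[i * s:i * s + e, j * s:j * s + e]
            pooled[i, j] = tile.max()   # keep only the strongest response
    return pooled

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 9, 5],
                 [7, 1, 3, 8]])

# Nonoverlapping variation: e = 2, s = 2.
print(max_pool(fmap, e=2, s=2))
# [[6 2]
#  [7 9]]

# Overlapping variation: e = 3, s = 2, on a 5 x 5 map.
print(max_pool(np.arange(25).reshape(5, 5), e=3, s=2).shape)   # (2, 2)
```

In the nonoverlapping case each input value belongs to exactly one tile, so a 4 x 4 map condenses to 2 x 2; in the overlapping case neighboring tiles share a row or column.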

## Full Architectural Description of Convolution Networks
![5-14](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0514.png)

Figure 5-14. Various convolutional network architectures of various complexities. The architecture of VGGNet, a deep convolutional network built for ImageNet, is shown in the rightmost network.