<< [Chapter 4: Fundamentals of machine learning](Chapter4_Fundamentals_of_machine_learning.ipynb) || [Table of contents](index.md) || [Chapter 6: Deep learning for text and sequences](Chapter6_Deep_learning_for_text_and_sequences.ipynb) >>

# Chapter 5: Deep learning for computer vision

In this chapter, you will learn about convolutional neural networks (or "convnets"), a type of deep learning model used almost universally in computer vision applications. You will learn to apply them to image classification problems, in particular those involving small training datasets, which is the most common situation if you do not work at a large tech company.

We will start with an introduction to the theory behind convnets, specifically:

- What are convolution and max-pooling?
- What are convnets?
- What do convnets learn?

Then we will cover image classification with small datasets:

- Training your own small convnets from scratch.
- Using data augmentation to mitigate overfitting.
- Using a pre-trained convnet to do feature extraction.
- Fine-tuning a pre-trained convnet.

Finally, we will cover a few techniques for visualizing what convnets learn and how they make classification decisions.

## 5.1 Introduction to convnets

We are about to dive into the theory of what convnets are and why they have been so successful at computer vision tasks. But first, let's take a practical look at a very simple convnet example. We will use our convnet to classify MNIST digits, a task that you have already been through in Chapter 2 using a densely-connected network (our test accuracy then was 97.8%). Even though our convnet will be very basic, its accuracy will still far exceed that of the densely-connected model from Chapter 2.

The six lines of model code below show you what a basic convnet looks like. It is a stack of `Conv2D` and `MaxPooling2D` layers. We will see in a minute what they do concretely. Importantly, a convnet takes as input tensors of shape `(image_height, image_width, image_channels)` (not including the batch dimension). In our case, we will configure our convnet to process inputs of size `(28, 28, 1)`, which is the format of MNIST images. We do this by passing the argument `input_shape=(28, 28, 1)` to our first layer.

```python
from tensorflow.keras import models, layers

# A stack of alternating Conv2D and MaxPooling2D layers.
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
```

Let's display the architecture of our convnet so far:

```python
model.summary()
```

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________
```


You can see above that the output of every `Conv2D` and `MaxPooling2D` layer is a 3D tensor of shape `(height, width, channels)`. The width and height dimensions tend to shrink as we go deeper in the network. The number of channels is controlled by the first argument passed to the `Conv2D` layers (e.g. 32 or 64).
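The shrinking shapes in the summary can be reproduced with simple arithmetic. The following is a minimal sketch, assuming convolutions without padding and with stride 1, and non-overlapping pooling that drops any leftover row or column. The helper names (`conv_output_size`, `pool_output_size`, `trace_shapes`) are hypothetical and are not part of Keras.

```python
def conv_output_size(size, window):
    """Spatial size after a convolution with no padding and stride 1."""
    return size - window + 1


def pool_output_size(size, pool):
    """Spatial size after non-overlapping pooling (any remainder is dropped)."""
    return size // pool


def trace_shapes(height, width, channels, layer_specs):
    """Return the (height, width, channels) shape after each layer."""
    shapes = []
    for kind, arg in layer_specs:
        if kind == "conv":
            filters, window = arg
            height = conv_output_size(height, window)
            width = conv_output_size(width, window)
            channels = filters  # output depth is the number of filters
        elif kind == "pool":
            height = pool_output_size(height, arg)
            width = pool_output_size(width, arg)
        shapes.append((height, width, channels))
    return shapes


# The same stack as the model above: conv(32), pool, conv(64), pool, conv(64).
specs = [("conv", (32, 3)), ("pool", 2), ("conv", (64, 3)), ("pool", 2), ("conv", (64, 3))]
shapes = trace_shapes(28, 28, 1, specs)
for shape in shapes:
    print(shape)
```

Each printed tuple should line up with the "Output Shape" column of the summary, ignoring the leading batch dimension.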

The next step is to feed our last output tensor (of shape `(3, 3, 64)`) into a densely-connected classifier network like those you are already familiar with: a stack of `Dense` layers. These classifiers process vectors, which are 1D, whereas our current output is a 3D tensor. So first we have to flatten our 3D outputs to 1D, and then add a few `Dense` layers on top:

```python
# Flatten the (3, 3, 64) feature map, then classify with Dense layers.
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
```

We are going to do 10-way classification, so we use a final layer with 10 outputs and a softmax activation. Here is what our network looks like now:

```python
model.summary()
```

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                36928     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________
```

As you can see, our `(3, 3, 64)` outputs were flattened into vectors of shape `(576,)` before going through two `Dense` layers.
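The parameter counts and the flattened size follow directly from the layer shapes. Here is a rough sketch of that arithmetic, assuming each layer has one bias per output unit (the Keras default for `Conv2D` and `Dense`). The helper names `conv2d_params` and `dense_params` are hypothetical.

```python
def conv2d_params(window_h, window_w, in_depth, filters):
    """Weights of one filter (window_h * window_w * in_depth) plus a bias, times filters."""
    return (window_h * window_w * in_depth + 1) * filters


def dense_params(inputs, units):
    """One weight per input plus a bias, for every unit."""
    return (inputs + 1) * units


conv_counts = [
    conv2d_params(3, 3, 1, 32),   # first Conv2D, input depth 1
    conv2d_params(3, 3, 32, 64),  # second Conv2D, input depth 32
    conv2d_params(3, 3, 64, 64),  # third Conv2D, input depth 64
]

flattened = 3 * 3 * 64  # size of the vector produced by Flatten
dense_counts = [dense_params(flattened, 64), dense_params(64, 10)]

print(conv_counts, sum(conv_counts))
print(flattened, dense_counts)
print(sum(conv_counts) + sum(dense_counts))
```

The three convolutional counts should agree with the `Param #` column reported for `conv2d`, `conv2d_1`, and `conv2d_2`.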

Now, let's train our convnet on the MNIST digits. We will reuse a lot of the code we have already covered in the MNIST example from Chapter 2.

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Add a channel axis and scale pixel values to the [0, 1] range.
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

# One-hot encode the digit labels.
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=20, batch_size=64)
```


Let's evaluate the model on the test data:

```python
test_loss, test_acc = model.evaluate(test_images, test_labels)
test_acc
```

```
0.9908
```

While our densely-connected network from Chapter 2 had a test accuracy of 97.8%, the basic convnet in the original text reaches a test accuracy of 99.3%, which decreases the error rate by 68% (relative). The run shown above reached 99.08%, which is still a relative decrease of about 58%. Not bad!
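The "relative" decrease compares error rates rather than accuracies. As a small worked sketch (the function name `relative_error_reduction` is a hypothetical helper), the figures can be checked as follows, both for the 99.3% accuracy quoted in the text and for the 0.9908 value returned by the evaluation cell:

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction by which the error rate shrinks when moving from baseline to new."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


quoted = relative_error_reduction(0.978, 0.993)     # accuracy quoted in the text
measured = relative_error_reduction(0.978, 0.9908)  # accuracy from the run above

print(f"quoted:   {quoted:.1%}")
print(f"measured: {measured:.1%}")
```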

But why does this simple convnet work so well compared to a densely-connected model? To answer this, let's dive into what these `Conv2D` and `MaxPooling2D` layers actually do.

### 5.1.1 The convolution operation

The fundamental difference between a densely-connected layer and a convolution layer is this: dense layers learn global patterns in their input feature space (e.g. for an MNIST digit, patterns involving all pixels), while convolution layers learn local patterns (see figure 5.1), i.e. in the case of images, patterns found in small 2D windows of the inputs. In our example above, these windows were all 3x3.

![windows](imgs/f5.1.jpg)

Figure 5.1: Images can be broken down into small windows, each containing local patterns such as edges and textures.

This key characteristic gives convnets two interesting properties:

- The patterns they learn are translation-invariant, i.e. after learning a certain pattern in the bottom right corner of a picture, a convnet is able to recognize it anywhere, e.g. in the top left corner. A densely-connected network would have to learn the pattern anew if it appeared at a new location. This makes convnets very data-efficient when processing images (since the visual world is fundamentally translation-invariant): they need fewer training samples to learn representations that have generalization power.
- They can learn spatial hierarchies of patterns (figure 5.2). A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layer, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts (since the visual world is fundamentally spatially hierarchical).
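The first property can be illustrated with a toy example. The sketch below is not how Keras implements convolution; it is a plain-Python sliding-window filter on a single-channel image, with a hand-written 3x3 "X" pattern standing in for a learned filter. All names (`correlate2d_valid`, `stamp`, `peak`) are hypothetical. The same filter is applied to two images that contain the pattern in different corners, and the strongest response is the same in both cases, only its location moves.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def stamp(height, width, pattern, top, left):
    """Return a blank image with the pattern copied in at (top, left)."""
    image = [[0] * width for _ in range(height)]
    for di, row in enumerate(pattern):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image


def peak(response):
    """Return (value, row, col) of the strongest response."""
    return max((value, i, j)
               for i, row in enumerate(response)
               for j, value in enumerate(row))


pattern = [[1, 0, 1],
           [0, 1, 0],
           [1, 0, 1]]
kernel = pattern  # a filter that responds most strongly to the pattern itself

bottom_right = stamp(8, 8, pattern, top=5, left=5)
top_left = stamp(8, 8, pattern, top=0, left=0)

peak_a = peak(correlate2d_valid(bottom_right, kernel))
peak_b = peak(correlate2d_valid(top_left, kernel))
print(peak_a, peak_b)
```

A dense layer has a separate weight for every pixel position, so it has no such guarantee: weights tuned for the bottom right corner say nothing about the top left corner.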

![spatial hierarchies patterns](imgs/f5.2.jpg)

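Figure 5.2: The visual world forms a spatial hierarchy of visual modules: hyperlocal edges combine into local objects such as eyes or ears, which combine into high-level concepts such as "cat".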

Convolutions operate over 3D tensors, called "feature maps", with two spatial axes ("height" and "width") as well as a "depth" axis (also called the "channels" axis). For an RGB image, the dimension of the depth axis is 3, since the image has 3 color channels: red, green, and blue. For a black and white picture, like our MNIST digits, the depth is just 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it still has a width and a height. Its depth can be arbitrary, since the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in an RGB input. Rather, they stand for what we call filters. Filters encode specific aspects of the input data: at a high level, a single filter could be encoding the concept "presence of a face in the input", for instance.

In our MNIST example, the very first convolution layer takes a feature map of size `(28, 28, 1)` and outputs a feature map of size `(26, 26, 32)`, i.e. it computes 32 filters over its input. Each of these 32 output channels contains a 26x26 grid of values, which is a "response map" of the filter over the input, indicating the response of that filter pattern at different locations in the input (figure 5.3). That is what the term "feature map" really means: every dimension in the depth axis is a feature (or filter), and the 2D tensor `output[:, :, n]` is the 2D spatial "map" of the response of this filter over the input.

![filters and response map](imgs/f5.3.jpg)

Figure 5.3: The concept of a response map: a 2D map of the presence of a pattern at different locations in an input.

Convolutions are defined by two key parameters:

- The size of the patches that are extracted from the inputs (typically 3x3 or 5x5). In our example it was always 3x3, which is a very common choice.
- The depth of the output feature map, i.e. the number of filters computed by the convolution. In our example, we started with a depth of 32 and ended with a depth of 64.

In Keras `Conv2D` layers, these parameters are the first two arguments passed to the layer: `Conv2D(output_depth, (window_height, window_width))`.

A convolution works by "sliding" these windows of size 3x3 or 5x5 over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features (of shape `(window_height, window_width, input_depth)`). Each such 3D patch is then transformed (via a tensor product with the same learned weight matrix, called the "convolution kernel") into a 1D vector of shape `(output_depth,)`. All these vectors are then spatially reassembled into a 3D output map of shape `(height, width, output_depth)`. Every spatial location in the output feature map corresponds to the same location in the input feature map (e.g. the bottom right corner of the output contains information about the bottom right corner of the input). For instance, with 3x3 windows, and taking `(i, j)` to be the center of the window in input coordinates, the vector `output[i, j, :]` comes from the 3D patch `input[i-1:i+2, j-1:j+2, :]` (Python slices exclude their end index, so `i-1:i+2` covers three rows). The full process is detailed in figure 5.4.
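The sliding-window procedure described above can be sketched in plain Python. This is an illustrative, unoptimized version and not the Keras implementation. It assumes no padding and a stride of 1, adds one bias per filter (the Keras `Conv2D` default), and indexes each output position by the top-left corner of its window rather than its center. The function name `conv2d_valid` and the test values are hypothetical.

```python
def conv2d_valid(feature_map, kernel, bias):
    """Naive convolution without padding or strides.

    feature_map: nested lists indexed [row][col][in_channel]
    kernel:      nested lists indexed [d_row][d_col][in_channel][filter]
    bias:        list with one value per filter
    """
    height, width = len(feature_map), len(feature_map[0])
    in_depth = len(feature_map[0][0])
    kh, kw = len(kernel), len(kernel[0])
    out_depth = len(bias)

    output = []
    for i in range(height - kh + 1):          # every possible window row
        out_row = []
        for j in range(width - kw + 1):       # every possible window column
            vector = list(bias)               # 1D vector of shape (out_depth,)
            for di in range(kh):
                for dj in range(kw):
                    for c in range(in_depth):
                        value = feature_map[i + di][j + dj][c]
                        for f in range(out_depth):
                            vector[f] += value * kernel[di][dj][c][f]
            out_row.append(vector)
        output.append(out_row)                # reassemble vectors spatially
    return output


# A 28x28x1 input of ones and 32 filters whose weights all equal (filter index + 1).
image = [[[1.0] for _ in range(28)] for _ in range(28)]
kernel = [[[[float(f + 1) for f in range(32)]] for _ in range(3)] for _ in range(3)]
bias = [0.0] * 32

response = conv2d_valid(image, kernel, bias)
shape = (len(response), len(response[0]), len(response[0][0]))
print(shape)
```

With this input, each value in channel `f` is the sum of nine weights equal to `f + 1`, which gives a quick sanity check on the loop structure.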

![convolution process](imgs/f5.4.jpg)

Figure 5.4: How convolution works.

Note that the output width and height may differ from the input width and height. They may differ for two reasons:

- Border effects, which can be countered by padding the input feature map.
- The use of "strides", which we will define in a second.

Let's take a deeper look at these notions.

#### Understanding border effects and padding

Consider a 5x5 feature map (25 tiles in total). There are only 9 different tiles around which you can center a 3x3 window (see figure 5.5 below), forming a 3x3 grid. Hence the output feature map will be 3x3: it gets shrunk a little bit, by exactly two tiles along each dimension in this case. You can see this "border effect" in action in our example above: we start with 28x28 inputs, which become 26x26 after the first convolution layer.

![border effect](imgs/f5.5.jpg)

Figure 5.5: Valid locations of 3x3 patches in a 5x5 input feature map.
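The count of valid window positions can be checked by enumeration. This is a small sketch with a hypothetical helper, `window_centers`, that assumes an odd window size so that the window has a well-defined center tile.

```python
def window_centers(height, width, window):
    """List every tile around which a window-by-window patch fits entirely inside the map."""
    margin = window // 2  # tiles needed on each side of the center
    return [
        (i, j)
        for i in range(margin, height - margin)
        for j in range(margin, width - margin)
    ]


centers_5x5 = window_centers(5, 5, 3)
centers_mnist = window_centers(28, 28, 3)

print(len(centers_5x5))    # number of valid centers in a 5x5 map
print(len(centers_mnist))  # number of valid centers in a 28x28 map
```

The valid centers of the 5x5 map form a 3x3 grid (rows and columns 1 to 3), and the 28x28 map yields a 26x26 grid.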

If you want to get an output feature map with the same spatial dimensions as the input, you can use padding. Padding consists of adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit centered convolution windows around every input tile. For a 3x3 window, one would add one column on the right, one column on the left, one row at the top, and one row at the bottom. For a 5x5 window, it would be two rows and two columns on each side (see figure 5.6).

![padding](imgs/f5.6.jpg)

Figure 5.6: Padding a 5x5 input in order to be able to extract 25 3x3 patches.
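Padding can be sketched in a few lines. The example below assumes the added rows and columns are filled with zeros, which is what the `padding="same"` option of Keras `Conv2D` does. The helpers `zero_pad` and `valid_output_size` are hypothetical names used only for illustration.

```python
def zero_pad(grid, pad):
    """Surround a 2D grid with `pad` rows and columns of zeros on every side."""
    width = len(grid[0])
    blank_row = [0] * (width + 2 * pad)
    padded = [list(blank_row) for _ in range(pad)]          # rows added at the top
    for row in grid:
        padded.append([0] * pad + list(row) + [0] * pad)    # columns added left and right
    padded.extend(list(blank_row) for _ in range(pad))      # rows added at the bottom
    return padded


def valid_output_size(size, window):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - window + 1


grid = [[r * 5 + c + 1 for c in range(5)] for r in range(5)]  # a 5x5 map with values 1..25

padded_3x3 = zero_pad(grid, 1)  # one extra row/column per side for a 3x3 window
padded_5x5 = zero_pad(grid, 2)  # two extra rows/columns per side for a 5x5 window

print(len(padded_3x3), len(padded_3x3[0]), valid_output_size(len(padded_3x3), 3))
print(len(padded_5x5), len(padded_5x5[0]), valid_output_size(len(padded_5x5), 5))
```

In both cases the padded map is large enough that the convolution output returns to the original 5x5 size.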
