<< [Chapter 3: Getting started with neural networks](Chapter3-Getting-started-with-neural-networks.ipynb) || [Contents](index.md) || [Chapter 5: Deep learning for computer vision](Chapter5-Deep-learning-for-computer-vision.ipynb) >>

# Chapter 4: Fundamentals of machine learning

> After three practical examples, you are starting to get familiar with how to approach
classification and regression problems using neural networks, and you have witnessed the
central problem of machine learning: overfitting. This chapter will formalize some of the
intuition you are starting to form into a solid conceptual framework for attacking and
solving deep learning problems.

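After three hands-on examples, you are getting comfortable with using neural networks for classification and regression, and you have seen the central problem of machine learning first-hand: overfitting. This chapter turns the intuition you have built so far into a solid conceptual framework for approaching and solving deep learning problems.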

> In this chapter, you will:

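> - Learn about more forms of machine learning, beyond classification and regression.
> - Learn about formal evaluation procedures for machine learning models, a simple version of which you have already seen in action a few times.
> - Learn how to prepare data for deep learning, and what is "feature engineering".
> - Learn ways to tackle the central problem of machine learning: overfitting, which we faced in all of our three previous examples.

In this chapter, we will:

- Look at forms of machine learning beyond classification and regression.
- Learn formal procedures for evaluating machine learning models, simple versions of which have already appeared several times.
- Learn how to prepare data for deep learning, and what "feature engineering" means.
- Learn ways to tackle the central problem of machine learning, overfitting, which showed up in all three earlier examples.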

> Finally, we will consolidate all these concepts—model evaluation, data preprocessing
and feature engineering, tackling overfitting—into a detailed 7-step workflow for
tackling any machine learning problem.

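Finally, we will consolidate all of these concepts (model evaluation, data preprocessing and feature engineering, and tackling overfitting) into a detailed 7-step workflow for approaching any machine learning problem.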

## 4.1 Four branches of machine learning

> Throughout our previous examples, you’ve become familiar with three specific types of
machine learning problems: binary classification, multi-class classification, and scalar
regression. All three are instances of "supervised learning", where the goal is to learn the
relationship between training inputs and training targets.

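In the previous examples you became familiar with three specific types of machine learning problems: binary classification, multi-class classification, and scalar regression. All three are instances of "supervised learning", where the goal is to learn the relationship between training inputs and training targets.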

> Supervised learning is just the tip of the iceberg. Machine learning is a vast field with
a complex subfield taxonomy. Machine learning algorithms generally fall into four broad
categories:

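Supervised learning is only the tip of the iceberg. Machine learning is a vast field with a complex taxonomy of subfields, and its algorithms generally fall into the following four categories: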

### 4.1.1 Supervised learning

> This is by far the most common case. It consists of learning to map input data to known
targets (also called annotations), given a set of examples (often annotated by humans).
All four examples you’ve encountered in this book so far were canonical examples of
supervised learning. Generally, almost all applications of deep learning that are getting
the spotlight these days belong in this category, such as optical character recognition,
speech recognition, image classification or language translation.

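This is by far the most common case. It consists of learning to map input data to known targets (also called annotations) from a set of examples, usually annotated by humans. All four examples presented so far in this book were supervised learning. More generally, almost all deep learning applications in the spotlight these days belong to this category, such as optical character recognition, speech recognition, image classification, and machine translation.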

### 4.1.2 Unsupervised learning

> This one consists of finding interesting transformations of the input data without the help
of any targets, for the purposes of data visualization, data compression, data denoising...
or simply to better understand the correlations present in the data at hand. Unsupervised
learning is the bread and butter of "data analytics", and is often a necessary step in better
understanding a dataset before attempting to solve a supervised learning problem.
"Dimensionality reduction" and "clustering" are well-known categories of unsupervised
learning.

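This category consists of finding interesting transformations of the input data without the help of any targets, for data visualization, data compression, data denoising, or simply to better understand the correlations present in the data at hand. Unsupervised learning is the bread and butter of "data analytics", and it is often a necessary step in understanding a dataset before attempting a supervised learning problem. "Dimensionality reduction" and "clustering" are well-known categories of unsupervised learning.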

### 4.1.3 Self-supervised learning

> This is actually a specific instance of supervised learning, but it is different enough that it
deserves its own category. Self-supervised learning is supervised learning without
human-annotated labels. There are still labels involved (since the learning has to be
supervised by something), but they are generated from the input data itself, typically
using a heuristic algorithm. You can think of it as supervised learning without any
humans in the loop. For instance, "autoencoders" are a well-known instance of
self-supervised learning, where the generated targets are... the inputs themselves,
unmodified. In the same way, trying to predict the next frame in a video given past
frames, or the next word in a text given previous words, would be another instance of
self-supervised learning (temporally supervised learning, in this case: supervision comes
from future input data). Note that the distinction between supervised, self-supervised and
unsupervised learning can be blurry sometimes; these categories are more of a continuum
without solid frontiers. Self-supervised learning can be reinterpreted as either supervised
or unsupervised learning depending on whether you pay attention to the learning
mechanism or to the context of its application.

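This is actually a specific instance of supervised learning, but it is distinctive enough to deserve its own category. Self-supervised learning is supervised learning without human-annotated labels. Labels are still involved (the learning has to be supervised by something), but they are generated from the input data itself, typically with a heuristic algorithm. You can think of it as supervised learning with no human in the loop. "Autoencoders", for example, are a form of self-supervised learning in which the generated targets are the inputs themselves, unmodified. Likewise, predicting the next frame of a video from past frames, or the next word of a text from the preceding words, is self-supervised learning (temporally supervised learning, in this case: the supervision comes from future input data). Note that the boundaries between supervised, self-supervised, and unsupervised learning can be blurry; the categories form a continuum rather than sharply separated classes. Self-supervised learning can be reinterpreted as either supervised or unsupervised learning, depending on whether you focus on the learning mechanism or on the context in which it is applied.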
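
To make "labels generated from the input data itself" concrete, here is a minimal sketch of next-word prediction targets built from raw text with no human annotation. The function name `next_word_pairs` and the `context_size` parameter are illustrative assumptions, not something defined in the book.

```python
def next_word_pairs(tokens, context_size):
    """Build (context, target) pairs where each target is simply the next token."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the preceding words (model input)
        target = tokens[i]                    # the following word (generated label)
        pairs.append((context, target))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = next_word_pairs(tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

Every target here comes from the text itself, which is what separates this setup from ordinary supervised learning with human-made annotations.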

### 4.1.4 Reinforcement learning

> Long overlooked, this branch of machine learning has recently started getting a lot of
attention, after Google DeepMind successfully applied it to learning to play Atari games
(and later, to learning to play Go at the highest level). In reinforcement learning, an
"agent" receives information about its environment and learns to pick actions that will
maximize some reward. For instance, a neural network that "looks" at a video game
screen and outputs game actions in order to maximize its score can be trained via
reinforcement learning. Currently, reinforcement learning is mostly a research area and
has not yet had significant practical successes beyond games. In time, however, I would
expect to see reinforcement learning take over an increasingly large range of real-world
applications—self-driving, robotics, resource management, education... It is an idea
whose time has come, or will come soon.

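Long overlooked, this branch of machine learning has recently attracted a lot of attention, especially since Google DeepMind successfully applied it to learning to play Atari games (and later to playing Go at the highest level). In reinforcement learning, an "agent" receives information about its environment and learns to choose actions that maximize some reward. For instance, a neural network that "looks" at a video game screen and outputs game actions in order to maximize its score can be trained with reinforcement learning. For now, reinforcement learning is mostly a research area and has not yet had significant practical successes beyond games. In time, however, the author expects it to take over a growing range of real-world applications: self-driving, robotics, resource management, education, and more. It is an idea whose time has come, or will come soon.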

> In this book, we will focus specifically on supervised learning, since it is by far the
dominant form of deep learning today, with a wide range of industry applications. We
will also take a briefer look at self-supervised learning in later chapters.

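This book focuses on supervised learning, since it is by far the dominant form of deep learning today, with a wide range of industry applications. Later chapters also take a briefer look at self-supervised learning.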

> Although supervised learning mostly consists of classification and regression, there
are more exotic variants as well:

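> - Sequence generation (e.g. given a picture, predict a caption describing it). Sequence generation can sometimes be reformulated as a series of classification problems (e.g. repeatedly predicting the word or token in a sequence).
> - Syntax tree prediction (e.g. given a sentence, predict its decomposition into a syntax tree).
> - Object detection: given a picture, draw a bounding box around certain objects inside the picture. This can also be expressed as a classification problem (given many candidate bounding boxes, classify the contents of each one) or as a joint classification and regression problem, where the bounding box coordinates are being predicted via vector regression.
> - Image segmentation: given a picture, draw a pixel-level mask on a specific object.
> - etc.

Although supervised learning mostly consists of classification and regression, it has more exotic variants as well:

- Sequence generation (e.g. given a picture, predict a caption describing it). Sequence generation can sometimes be reformulated as a series of classification problems (e.g. repeatedly predicting the next word or token in a sequence).
- Syntax tree prediction (e.g. given a sentence, predict its decomposition into a syntax tree).
- Object detection: given a picture, draw a bounding box around certain objects in it. This can also be expressed as a classification problem (given many candidate bounding boxes, classify the contents of each one) or as a joint classification and regression problem, in which the bounding box coordinates are predicted via vector regression.
- Image segmentation: given a picture, draw a pixel-level mask over a specific object.
- And so on.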

### 4.1.5 Classification and regression glossary

> Classification and regression involve many specialized terms. You have already come
across some of them in our first examples, and you will see more of them come up in the
following chapters. They have precise, machine-learning specific definitions, and you
should be familiar with them.

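Classification and regression involve many specialized terms. You have already come across some of them in the earlier examples, and more will appear in the following chapters. They have precise, machine-learning-specific definitions, and you should be familiar with them.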

> Sample, or input: one data point that goes into your model.

Sample, or input: a single data point that is fed into your model.

> Prediction, or output: what comes out of your model.

Prediction, or output: what your model produces.

> Target: the truth. What your model should ideally have predicted, according to an
external source of data.

Target: the truth. What your model should ideally have predicted, according to an external source of data.

> Prediction error, or loss value: a measure of the distance between your model’s
prediction and the target.

Prediction error, or loss value: a measure of the distance between your model's prediction and the target.

> Classes: set of possible labels to choose from in a classification problem, e.g. when classifying cat and dog pictures, "dog" and "cat" are the two classes.

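Classes: the set of possible labels to choose from in a classification problem. For example, when classifying cat and dog pictures, "cat" and "dog" are the two classes.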

> Label: specific instance of a class annotation in a classification problem. For instance,
if picture #1234 is annotated as containing the class "dog", then "dog" is a label of picture
#1234.

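Label: a specific instance of a class annotation in a classification problem. For instance, if picture #1234 is annotated as containing the class "dog", then "dog" is a label of picture #1234.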

> Ground-truth, or annotations: all targets for a dataset, typically collected by humans.

Ground truth, or annotations: all the targets for a dataset, typically collected by humans.

> Binary classification: classification task where each input sample should be
categorized into two exclusive categories.

Binary classification: a classification task in which each input sample is assigned to one of two mutually exclusive categories.

> Multi-class classification: classification task where each input sample should be
categorized into more than two categories: for instance, classifying handwritten digits is a
multi-class classification task.

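Multi-class classification: a classification task in which each input sample is assigned to one of more than two categories. Classifying handwritten digits, as done earlier, is a multi-class classification task.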

> Multi-label classification: classification task where each input sample can be assigned
multiple labels. For instance, a given image may contain both a cat and a dog, and should
be annotated both with the "cat" label and the "dog" label. The number of labels per
image is usually variable.

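Multi-label classification: a classification task in which each input sample can be assigned multiple labels. For instance, an image may contain both a cat and a dog, and should then be annotated with both the "cat" label and the "dog" label. The number of labels per image is usually variable.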

> Scalar regression: task where the target is a continuous scalar value. House price
prediction is a good example: the different target prices form a continuous space.

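Scalar regression: a task in which the target is a continuous scalar value. The earlier house price prediction is a good example: the different target prices form a continuous space.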

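> Vector regression: task where the target is a set of continuous values, e.g. a
continuous vector. If you are doing regression against multiple values (e.g. the
coordinates of a bounding box in an image) then you are doing vector regression.

Vector regression: a task in which the target is a set of continuous values, that is, a continuous vector. If you are doing regression against multiple values (for example the coordinates of a bounding box in an image), then you are doing vector regression.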

> Mini-batch or batch: a small set of samples that are being processed at once by the
model (typically between 8 and 128 samples). It is often a power of 2 in order to facilitate
memory allocation on GPU. When training, a mini-batch is used to compute a single
gradient descent update applied to the weights of the model.

Mini-batch, or batch: a small set of samples (typically between 8 and 128) that the model processes at once. The number of samples is often a power of 2, to facilitate memory allocation on the GPU. During training, each mini-batch is used to compute a single gradient descent update applied to the weights of the model.
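
As a rough illustration of the mini-batch definition, the sketch below slices a dataset into consecutive batches. The helper name `iterate_minibatches` and the toy array are assumptions made for this example; a real training loop would compute one gradient update per yielded batch.

```python
import numpy as np

def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of at most `batch_size` samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Toy dataset: 100 samples with 3 features each
samples = np.arange(300).reshape(100, 3)
batch_shapes = [batch.shape for batch in iterate_minibatches(samples, 32)]
print(batch_shapes)  # [(32, 3), (32, 3), (32, 3), (4, 3)]
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.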

## 4.2 Evaluating machine learning models

> In the three examples we covered in the previous chapters, we split our data into a
training set, a validation set, and a test set. The reason why we did not evaluate our
models on the same data as they were trained on quickly became evident: after just a few
epochs, all three models started to overfit , which is to say that their performance on
never-seen-before data started stalling (or even worsening) compared to their
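performance on the training data, which always goes up as training progresses.

In the three examples of the previous chapter, we split the data into a training set, a validation set, and a test set. The reason for not evaluating the models on the data they were trained on quickly became evident: after just a few epochs, all three models began to overfit. That is, their performance on never-before-seen data stalled (or even worsened), while their performance on the training data kept improving.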

> In machine learning, our goal is to achieve models that generalize , i.e. that perform
well on never-seen-before data, and overfitting is the central obstacle. We can only
control that which we can observe, so it is crucial to be able to reliably measure the
generalization power of our model. In the next sections, we will take a look at strategies
for mitigating overfitting and maximizing generalization. In the present section, we will
focus on how we can measure generalization, i.e. how to evaluate machine learning
models.

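In machine learning, the goal is to obtain models that generalize, that is, models that perform well on never-before-seen data, and overfitting is the central obstacle. We can only control what we can observe, so it is crucial to be able to measure a model's generalization power reliably. The following sections look at strategies for mitigating overfitting and maximizing generalization. This section focuses on how to measure generalization, in other words, how to evaluate machine learning models.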

### 4.2.1 Training, validation, and test sets

> Evaluating a model always boils down to splitting your available data into three sets:
training, validation, and test set. You train on the training data, and evaluate your model
on the validation data. Once your model is ready for prime time, you test it one final time
on the test data.

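Evaluating a model always boils down to splitting the available data into three sets: training, validation, and test. You train on the training data and evaluate on the validation data. Once the model is ready for prime time, you test it one final time on the test data.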

> You may ask, why not simply have two sets, a training set and a test set? We would
train on the training data, and evaluate on the test data. Much simpler!

You may ask: why not simply use two sets, a training set and a test set? Train on the training data, evaluate on the test data. Much simpler!

> The reason is that developing a model always involves tuning its configuration, e.g.
picking the number of layers or the size of the layers (what is called the
"hyperparameters" of the model, to distinguish them from the "parameters", which are
the network’s weights). You will do this tuning by using as feedback signal the
performance of the model on the validation data, so in essence this tuning is a form of
learning : a search for a good configuration in some parameter space. As a result, tuning
the configuration of the model based on its performance on the validation set can quickly
result in overfitting to the validation set , even though your model is never being directly
trained on it.

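The reason is that developing a model always involves tuning its configuration, for example the number of layers or the size of each layer (these are called the "hyperparameters" of the model, to distinguish them from its "parameters", the network's weights). You tune them using the model's performance on the validation data as the feedback signal, so in essence this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration based on validation performance can quickly lead to overfitting to the validation set, even though the model is never trained on it directly.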

> Central to this phenomenon is the notion of "information leak". Every time you are
tuning a hyperparameter of your model based on the model’s performance on the
validation set, some information about the validation data is leaking into your model. If
you only do this once, for one parameter, then very few bits of information would be
leaking and your validation set would remain a reliable way to evaluate your model. But
if you repeat this many times, running one experiment, evaluating on the validation set,
modifying your model as a result, then you are leaking an increasingly significant amount
of information about the validation set into your model.

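Central to this phenomenon is the notion of "information leak". Every time you tune a hyperparameter based on the model's performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for a single parameter, very few bits of information leak and the validation set remains a reliable way to evaluate the model. But if you repeat the cycle many times (run an experiment, evaluate on the validation set, modify the model accordingly), you leak an increasingly significant amount of information about the validation set into the model.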

> At the end of the day, you end up with a model that performs artificially well on the
validation data, because it is what you optimized it for. Since what you care about is
actually performance on completely new data, not the validation data, you need a
completely different, never-seen-before dataset to evaluate your model: the test dataset.
Your model shouldn’t have had access to any information about the test set, even
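completely indirectly. If anything about the model has been tuned based on test-set
performance, then your measure of generalization will be flawed.

In the end, you get a model that performs artificially well on the validation data, because that is what you optimized it for. Since what you actually care about is performance on completely new data, not on the validation data, you need a completely different, never-before-seen dataset to evaluate the model: the test set. The model must not have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test-set performance, your measure of generalization will be flawed.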
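
The leak described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an experiment from the book: the labels are pure noise, each "configuration" is a random linear classifier, and the only "tuning" is picking the candidate with the best validation accuracy. All names (`accuracy`, `candidates`, and so on) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Features and labels are independent, so no classifier can truly beat 50%.
x_val = rng.normal(size=(100, 20))
y_val = rng.integers(0, 2, size=100)
x_test = rng.normal(size=(10000, 20))
y_test = rng.integers(0, 2, size=10000)

def accuracy(weights, x, y):
    """Accuracy of the linear classifier sign(x @ weights) against labels y."""
    return float(np.mean((x @ weights > 0).astype(int) == y))

# "Tuning": try many configurations and keep the best one on the validation set.
candidates = rng.normal(size=(500, 20))
val_scores = [accuracy(w, x_val, y_val) for w in candidates]
best_weights = candidates[int(np.argmax(val_scores))]

best_val_score = max(val_scores)
test_score = accuracy(best_weights, x_test, y_test)
print("validation:", best_val_score, "test:", test_score)
```

The selected candidate looks clearly better than chance on the validation set purely because it was chosen for that, while its accuracy on the untouched test set stays close to 50%.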

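> Splitting your data into training, validation, and test sets may seem straightforward,
but there are a few advanced ways to do it which can come in handy when little data
is available. Let’s review three classic evaluation recipes.

Splitting data into training, validation, and test sets may seem straightforward, but there are a few more advanced ways to do it that come in handy when little data is available. Let's review three classic evaluation recipes.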

### 4.2.2 Simple hold-out validation

> Set apart some fraction of your data as your test set. Train on remaining data, evaluate on
the test set. As you saw in the previous sections, in order to prevent information leaks,
you should not tune your model based on the test set, and therefore you should also
reserve a validation set.

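Set apart some fraction of your data as the test set. Train on the remaining data and evaluate on the test set. As the previous section showed, to prevent information leaks you should not tune the model based on the test set, so you should also reserve a validation set.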

> Schematically, hold-out validation looks like this:

Schematically, hold-out validation looks like this:

![hold-out validation](imgs/f4.1.jpg)

Figure 4-1: Simple hold-out validation split

> Here’s a simple implementation:

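Here is a simple implementation:

Translator's note: the original listing is pseudocode; `data`, `test_data`, and `get_model` below are toy stand-ins added so that it runs.

```python
import numpy as np

# Hypothetical stand-ins so the sketch runs; swap in your own data and model.
rng = np.random.default_rng(0)
data = rng.normal(size=(50000, 1))
test_data = rng.normal(size=(10000, 1))

class MeanModel:
    """Toy model: predicts the training mean; the score is negative MSE."""
    def train(self, samples):
        self.mean = samples.mean(axis=0)

    def evaluate(self, samples):
        return -float(np.mean((samples - self.mean) ** 2))

def get_model():
    return MeanModel()

num_validation_samples = 10000

# Shuffling the data is usually appropriate
np.random.shuffle(data)

# Define the validation set
validation_data = data[:num_validation_samples]
data = data[num_validation_samples:]

# Define the training set
training_data = data[:]

# Train a model on the training data and evaluate it on the validation data
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune the model, retrain it, evaluate it, tune it again...

# Once the hyperparameters are tuned, train a final model from scratch
# on all non-test data, then evaluate it once on the test set
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)
```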

> This is the simplest evaluation protocol, and it suffers from one flaw: if little data is
available, then your validation and test sets may contain too few samples to be
statistically representative of the data at hand. This is easy to notice: if different random
shuffling rounds of the data before splitting end up yielding very different model
performance measures, then you are having this issue. K-fold validation and iterated
K-fold validation are two ways to address this.

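This is the simplest evaluation protocol, and it has one flaw: if little data is available, the validation and test sets may contain too few samples to be statistically representative of the data at hand. This is easy to spot: if different random shuffles of the data before splitting yield very different model performance measures, you have this problem. K-fold validation and iterated K-fold validation are two ways to address it.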

### 4.2.3 K-fold validation

> Split your data into K partitions of equal size. For each partition i, train a model on the
remaining K-1 partitions, and evaluate it on partition i. Your final score would then be
the average of the K scores obtained. This method is helpful when the performance of
your model shows significant variance based on your train-test split. Like hold-out
validation, this method doesn’t exempt you from using a distinct validation set for model
calibration.

Split the data into K partitions of equal size. For each partition i, train a model on the remaining K-1 partitions and evaluate it on partition i. The final score is the average of the K scores obtained. This method helps when the model's performance varies significantly with the train-test split. Like hold-out validation, it does not exempt you from using a distinct validation set for model calibration.

> Schematically, K-fold cross-validation looks like this:

Schematically, K-fold cross-validation looks like this:

![K-fold validation](imgs/f4.2.jpg)

Figure 4-2: K-fold validation

> Here’s a simple implementation:

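Here is a simple implementation:

Translator's note: the original listing is pseudocode; `data`, `test_data`, and `get_model` below are toy stand-ins added so that it runs.

```python
import numpy as np

# Hypothetical stand-ins so the sketch runs; swap in your own data and model.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
test_data = rng.normal(size=(200, 1))

class MeanModel:
    """Toy model: predicts the training mean; the score is negative MSE."""
    def train(self, samples):
        self.mean = samples.mean(axis=0)

    def evaluate(self, samples):
        return -float(np.mean((samples - self.mean) ** 2))

def get_model():
    return MeanModel()

k = 4
num_validation_samples = len(data) // k
np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    # Select the validation partition for this fold
    start = num_validation_samples * fold
    stop = num_validation_samples * (fold + 1)
    validation_data = data[start:stop]

    # Use the remainder of the data as training data.
    # np.concatenate joins the two slices; `+` would add arrays element-wise.
    training_data = np.concatenate([data[:start], data[stop:]])

    # Create a brand-new, untrained model for every fold
    model = get_model()
    model.train(training_data)
    validation_scores.append(model.evaluate(validation_data))

# The validation score is the average over the K folds
validation_score = np.average(validation_scores)

# Train the final model on all non-test data, then score it on the test set
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)
```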

### 4.2.4 Iterated K-fold validation with shuffling

> This one is for situations in which you have relatively little data available and you need
to evaluate your model as precisely as possible. I have found it to be extremely helpful in
Kaggle competitions. It consists of applying K-fold validation multiple times, shuffling
the data every time before splitting it K-ways. Your final score would be the average of
the scores obtained at each run of K-fold validation. Note that you end up training and
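evaluating P * K models (where P is the number of iterations you use), which can be very
expensive.

This method is for situations in which you have relatively little data and need to evaluate your model as precisely as possible. The author has found it extremely helpful in Kaggle competitions. It consists of applying K-fold validation multiple times, shuffling the data each time before splitting it K ways. The final score is the average of the scores obtained across all runs of K-fold validation. Note that you end up training and evaluating $P \times K$ models (where P is the number of iterations), which can be very expensive.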
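
The book gives no listing for this recipe, so the following is only a sketch of how it might look. The function `iterated_k_fold`, its `train_and_score` callback, and the toy scorer `mean_baseline_score` are assumptions introduced for illustration.

```python
import numpy as np

def iterated_k_fold(data, k, iterations, train_and_score, seed=0):
    """Run K-fold validation `iterations` times, reshuffling before each run."""
    rng = np.random.default_rng(seed)
    fold_size = len(data) // k
    scores = []
    for _ in range(iterations):
        shuffled = data[rng.permutation(len(data))]
        for fold in range(k):
            start, stop = fold * fold_size, (fold + 1) * fold_size
            validation = shuffled[start:stop]
            training = np.concatenate([shuffled[:start], shuffled[stop:]])
            scores.append(train_and_score(training, validation))
    # Final score: average over all P * K trained models
    return float(np.mean(scores)), len(scores)

def mean_baseline_score(training, validation):
    """Toy stand-in for a model: 'train' by taking the mean, score by negative MSE."""
    return -float(np.mean((validation - training.mean()) ** 2))

data = np.random.default_rng(1).normal(size=200)
score, n_models = iterated_k_fold(data, k=4, iterations=3,
                                  train_and_score=mean_baseline_score)
print(score, n_models)
```

With `k=4` and `iterations=3`, twelve models are trained, which shows directly where the $P \times K$ cost comes from. If the dataset size is not a multiple of `k`, the leftover samples in this sketch always end up in the training portion.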

### 4.2.5 Things to keep in mind

> There are a few things to keep an eye out for when picking an evaluation protocol:

> - Data representativeness. You want your training set and test set to be both representative of the data at hand; for instance if you are trying to classify images of digits, and you are starting from an array of samples where the samples are ordered by their class, taking the first 80% of the array as your training set and the remaining 20% as your test would result in your training set only having classes 0-7 while your test set would only have classes 8-9. This seems like a ridiculous mistake, but it’s surprisingly common. For this reason, you should most likely randomly shuffle your data before splitting it into a training and test set.
> - The arrow of time. If you are trying to predict the future given the past (e.g. the weather tomorrow, stock movements, and so on), you should not randomly shuffle your data before splitting it, because that would create a "temporal leak": your model would effectively be trained on data from the future. In such situations you should always make sure that all data in your test set is posterior to the data in the training set.
> - Redundancy in your data. If some data points in your data appear twice (fairly common with real-world data), then shuffling the data and splitting it into a training set and a test set will result in redundancy between the training and test set. In effect, you would be testing on part of your training data, which is the worst thing you could do! Make sure that your training sets and test sets are disjoint.

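There are a few things to watch for when choosing an evaluation protocol:

- Data representativeness. Both the training set and the test set should be representative of the data at hand. For instance, if you are classifying images of digits and your samples are ordered by class, taking the first 80% of the array as the training set and the remaining 20% as the test set leaves the training set with only classes 0-7 and the test set with only classes 8-9. This looks like a ridiculous mistake, but it is surprisingly common. For this reason, you should usually shuffle the data randomly before splitting it into training and test sets.
- The arrow of time. If you are trying to predict the future from the past (tomorrow's weather, stock movements, and so on), do not shuffle the data randomly before splitting it, because that creates a "temporal leak": the model would effectively be trained on data from the future. In such situations, make sure that all data in the test set is posterior to the data in the training set.
- Redundancy in the data. If some data points appear more than once (fairly common with real-world data), shuffling the data and splitting it into a training set and a test set can put copies of the same sample in both. In effect, you would be testing on part of your training data, which is the worst thing you could do. Make sure that the training set and the test set are disjoint.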
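
The last two points can be sketched in a few lines of NumPy. The helper `chronological_split`, the `test_fraction` parameter, and the toy arrays are assumptions for illustration; real data would need its own notion of timestamp and of what counts as a duplicate.

```python
import numpy as np

def chronological_split(timestamps, test_fraction=0.2):
    """Return (train_idx, test_idx) so that every test sample is later in time."""
    order = np.argsort(timestamps, kind="stable")
    n_test = int(len(order) * test_fraction)
    return order[:len(order) - n_test], order[len(order) - n_test:]

# Arrow of time: split by timestamp instead of shuffling
timestamps = np.array([5, 1, 4, 2, 3, 9, 7, 8, 6, 10])
train_idx, test_idx = chronological_split(timestamps)
print(timestamps[train_idx].max(), "<", timestamps[test_idx].min())

# Redundancy: drop exact duplicate rows before splitting,
# so the same sample cannot land in both sets
samples = np.array([[0, 1], [2, 3], [0, 1], [4, 5]])
unique_samples = np.unique(samples, axis=0)
print(len(samples), "->", len(unique_samples))
```

Note that `np.unique(..., axis=0)` also sorts the rows, which is harmless if the data is shuffled afterwards but matters if row order carries meaning.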

## 4.3 Data preprocessing, feature engineering, and feature learning

> Besides model evaluation, an important question we must tackle before we dive deeper
into model development is the following: how to prepare the input data and targets before
feeding them into a neural network? Many data preprocessing and feature engineering
techniques are domain-specific (e.g. specific to text data or image data), and we will
cover those in the next chapters as we encounter them in practical examples. For now, we
will review the basics, common to all data domains.

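Besides model evaluation, one more important question needs answering before we go deeper into model development: how should the input data and targets be prepared before they are fed into a neural network? Many data preprocessing and feature engineering techniques are domain-specific (specific to text data or image data, for example), and later chapters cover them as they come up in practical examples. For now, we review the basics that are common to all data domains.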

### 4.3.1 Data preprocessing for neural networks

#### Vectorization

> All inputs and targets in a neural network must be tensors of floating point data (or in
specific cases, tensors of integers). Whatever data you need to process—sound, images,
text—you must first turn it into tensors, a step called "data vectorization". For instance, in
our two previous text classification examples, we started from text represented as lists of
integers (standing for sequences of words), and we used "one-hot encoding" to turn them
into a tensor of float32 data. In the digits classification example and house price
prediction example, the data already came in vectorized form, so we could skip this step.

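All inputs and targets in a neural network must be tensors of floating point data (or, in specific cases, tensors of integers). Whatever data you need to process, be it sound, images, or text, you must first turn it into tensors, a step called "data vectorization". For instance, in the two earlier text classification examples, the text came as lists of integers (standing for sequences of words), and "one-hot encoding" was used to turn them into a tensor of float32 data. In the digit classification and house price prediction examples, the data already came in vectorized form, so this step could be skipped.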
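
As a reminder of what that vectorization step looks like, here is a minimal sketch that turns lists of word indices into a float32 tensor. The function name `vectorize_sequences` and the tiny `dimension` are chosen for illustration only.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Encode each list of integer indices as a 0/1 vector of length `dimension`."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # set the positions of the words that occur
    return results

sequences = [[3, 5], [0, 1, 5]]
x = vectorize_sequences(sequences, dimension=6)
print(x)
```

Sequences of different lengths all map to vectors of the same size, which is what lets them be stacked into a single tensor.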

#### Value normalization

> In our digits classification example, we started from image data encoded as integers in
the 0-255 range, encoding grayscale values. Before we fed this data into our network, we
had to cast it to float32 and divide by 255, so we would end up with floating point
values in the 0-1 range. Similarly, in our house price prediction example, we started from
features that took a variety of ranges—some features had small floating point values,
others had fairly large integer values. Before we fed this data into our network, we had to
normalize each feature independently so that each feature would have a standard
deviation of 1 and a mean of 0.

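In the digit classification example, the image data was encoded as integers in the 0-255 range, representing grayscale values. Before feeding this data into the network, we had to cast it to float32 and divide by 255, ending up with floating point values in the 0-1 range. Similarly, in the house price prediction example, the features took a variety of ranges: some had small floating point values, others fairly large integer values. Before feeding this data into the network, we had to normalize each feature independently so that it had a mean of 0 and a standard deviation of 1.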
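
The grayscale rescaling mentioned above amounts to a cast and a division. The array below is a made-up stand-in for real image data.

```python
import numpy as np

# Toy stand-in for grayscale image data stored as 8-bit integers
images = np.array([[0, 128, 255]], dtype="uint8")

# Cast to float32 first, then rescale from the 0-255 range to the 0-1 range
scaled = images.astype("float32") / 255
print(scaled.dtype, scaled.min(), scaled.max())
```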

> In general, it isn’t safe to feed into a neural network data that takes relatively "large"
values (e.g. multi-digit integers, which are much larger than the initial values taken by the
weights of a network), or data that is "heterogeneous", e.g. data where one feature would
be in the 0-1 range and another in the 100-200 range. It can trigger large gradient
updates which will prevent your network from converging. To make learning easier for
your network, your data should:

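> - Take "small" values: typically most values should be in the 0-1 range.
> - Be homogeneous, i.e. all features should take values roughly in the same range.

In general, it is not safe to feed a neural network data that takes relatively "large" values (for example multi-digit integers, which are much larger than the initial values of the network's weights) or data that is "heterogeneous", for example data in which one feature lies in the 0-1 range and another in the 100-200 range. Doing so can trigger large gradient updates that prevent the network from converging. To make learning easier for the network, your data should:

- Take "small" values: typically, most values should lie in the 0-1 range.
- Be homogeneous, that is, all features should take values in roughly the same range.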

> Additionally, the following stricter normalization practice is common and can
definitely help, although it isn’t always necessary (e.g. we did not do this in our digits
classification example) :

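> - Normalizing each feature independently to have a mean of 0.
> - Normalizing each feature independently to have a standard deviation of 1.

Additionally, the following stricter normalization practice is common and can definitely help, although it is not always necessary (it was not used in the digit classification example, for instance):

- Normalize each feature independently to have a mean of 0.
- Normalize each feature independently to have a standard deviation of 1.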

> This is easy to do with Numpy arrays:

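This is easy to do with NumPy arrays:

Translator's note: the original listing assumes an existing array `x`; a small toy array is defined here so that it runs.

```python
import numpy as np

# x is a 2D float array of shape (samples, features); toy values for illustration
x = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
x -= x.mean(axis=0)  # center each feature at 0
x /= x.std(axis=0)   # scale each feature to a standard deviation of 1
```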
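
One detail this section does not spell out: in common practice, the mean and standard deviation are computed on the training data only and then reused for the validation and test data, so that no test-set statistics leak into the workflow. The sketch below assumes that convention; the arrays are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
test_x = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Statistics come from the training data only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

# The same transformation is applied to both sets
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

print(train_x.mean(axis=0), train_x.std(axis=0))
```

The normalized test features will be close to, but not exactly, mean 0 and standard deviation 1, which is expected.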

#### Handling missing values

> You may sometimes have missing values in your data. For instance, in our house price
prediction example, the first feature (the column of index 0 in the data) was "per capita
crime rate". What if this feature was not available for all samples? We would then have
missing values in our training or test data.

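You may sometimes have missing values in your data. For instance, in the house price prediction example, the first feature (the column at index 0 in the data) was the "per capita crime rate". What if this feature were not available for all samples? There would then be missing values in the training or test data.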

> In general, with neural networks, it is safe to input missing values as 0 , under the
condition that 0 is not already a meaningful value. The network will learn from exposure
to the data that the value 0 simply means "missing data" and will start ignoring the value.
However, note that if you are expecting missing values in the test data but the network
was trained on data without any missing values, then the network will not have learned to
ignore missing values! In this situation, then you should artificially generate training
samples with missing entries: simply copy some training samples several times and drop
some of the features that you expect are susceptible to go missing in the test data.

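In general, with neural networks it is safe to input missing values as 0, provided that 0 is not already a meaningful value. From exposure to the data, the network learns that the value 0 simply means "missing data" and starts ignoring it. Note, however, that if you expect missing values in the test data but the network was trained on data without any, the network will not have learned to ignore missing values. In that situation you should artificially generate training samples with missing entries: copy some training samples several times and drop some of the features that you expect to be missing in the test data.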
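
Both steps can be sketched with NumPy. Everything here is illustrative: the data is synthetic, missing entries are assumed to arrive as NaN, and the 30% drop rate is an arbitrary choice rather than a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data without missing values; no feature is ever 0,
# so 0 is free to mean "missing"
train_x = rng.uniform(1.0, 2.0, size=(6, 3))

# Step 1: encode missing entries (NaN here) as 0
raw = np.array([[1.5, np.nan, 1.2]])
filled = np.nan_to_num(raw, nan=0.0)

# Step 2: copy the training samples and blank out random features,
# so the network sees "missing" entries during training
augmented = train_x.copy()
drop_mask = rng.random(augmented.shape) < 0.3
augmented[drop_mask] = 0.0
train_with_missing = np.concatenate([train_x, augmented])

print(filled)
print(train_with_missing.shape)
```

In practice you would restrict the mask to the features that are actually likely to be absent at test time.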

### 4.3.2 Feature engineering

> Feature engineering is the process of using your own knowledge about the data and about
the machine learning algorithm at hand (in our case a neural network) to make the
algorithm work better by applying hard-coded (non-learned) transformations to the data
before it goes into the model. In many cases, it isn’t reasonable to expect a machine
learning model to be able to learn from completely arbitrary data. The data needs to be
presented to the model in a way that will make the job of the model easier. One intuitive
example of this is the following: suppose that we are trying to develop a model that can
take as input an image of a clock, and can output the time of the day.

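Feature engineering is the process of using your own knowledge about the data and about the machine learning algorithm at hand (here, a neural network) to make the algorithm work better by applying hard-coded (non-learned) transformations to the data before it goes into the model. In many cases it is not reasonable to expect a machine learning model to learn from completely arbitrary data. The data needs to be presented to the model in a way that makes the model's job easier. Here is an intuitive example: suppose we want to develop a model that takes an image of a clock as input and outputs the time of day.

![feature engineering](imgs/f4.3.jpg)

Figure 4-3: Feature engineering for reading the time on a clock

Translator's note: in the figure above, the first row is the raw data (the pixels of the image); the second row is a better representation (the coordinates of the clock hands); the third row is an even better representation (the angles of the clock hands).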

> If you choose to use the raw pixels of the image as input data, then you have on your
hands a difficult machine learning problem. You will need a convolutional neural
network to solve it, and you will have to expend quite a bit of computational resources to
train it.

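If you choose to use the raw pixels of the image as input data, you have a difficult machine learning problem on your hands. You will need a convolutional neural network to solve it, and you will have to spend quite a bit of computational resources to train it.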

> However, if you already understand the problem at a high-level (you understand how
humans read time on a clock face), then you can come up with much better input features
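for a ML algorithm: for instance, it is easy to write a 5-line Python script to follow the
black pixels of the clock hands and output the (x, y) coordinates of the tip of each
hand. Then a very simple ML algorithm can learn to associate these coordinates with the
appropriate time of the day.

However, if you already understand the problem at a high level (you know how humans read the time on a clock face), you can come up with much better input features for a machine learning algorithm. For instance, it is easy to write a 5-line Python script that follows the black pixels of the clock hands and outputs the $(x, y)$ coordinates of the tip of each hand. A very simple machine learning algorithm can then learn to associate these coordinates with the appropriate time of day.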

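> You can go even further: you can do a coordinate change, and express the (x, y)
coordinates as polar coordinates with regard to the center of the image. Your input would
simply become the angle theta of each clock hand. At this point your features are
making the problem so easy that no machine learning is required anymore; a simple
rounding operation and dictionary lookup are enough to recover the approximate time of
day.

You can go even further: apply a coordinate change and express the $(x, y)$ coordinates as polar coordinates with respect to the center of the image. Your input then becomes simply the angle $\theta$ of each clock hand. At this point the features make the problem so easy that no machine learning is required at all; a simple rounding operation and a dictionary lookup are enough to recover the approximate time of day.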
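
The coordinate change can be sketched as follows. This is an illustration under assumptions of its own: the hand tips are taken as already extracted, image coordinates are assumed to have y pointing down, and the names `clockwise_angle` and `read_time` are invented here. Plain arithmetic stands in for the dictionary lookup mentioned in the text.

```python
import math

def clockwise_angle(tip, center):
    """Angle of a hand in degrees, measured clockwise from the 12 o'clock position."""
    dx = tip[0] - center[0]
    dy = center[1] - tip[1]  # image y axis points down, so flip it
    return math.degrees(math.atan2(dx, dy)) % 360

def read_time(hour_tip, minute_tip, center):
    """Recover (hour, minute) from the two hand tips."""
    minute = round(clockwise_angle(minute_tip, center) / 6) % 60  # 6 degrees per minute
    hour = int(clockwise_angle(hour_tip, center) // 30) % 12      # 30 degrees per hour
    return hour, minute

# Minute hand pointing straight down, hour hand a little past the 3
print(read_time(hour_tip=(80, 58), minute_tip=(50, 90), center=(50, 50)))  # (3, 30)
```

An hour of 0 corresponds to 12 on the dial. When the hour hand sits exactly on an hour mark, floating point error could push the floor division either way, so a sturdier version would cross-check the hour against the minute hand.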

> That’s the essence of feature engineering: making a problem easier by expressing it in
a simpler way. It usually requires understanding the problem in-depth.

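That is the essence of feature engineering: making a problem easier by expressing it in a simpler way. It usually requires understanding the problem in depth.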

> Before deep learning, feature engineering used to be critical, because classical
"shallow" algorithms did not have hypothesis spaces rich enough to learn useful features
by themselves. The way you would present the data to the algorithm would be essential
to its success. For instance, before convolutional neural networks started becoming
successful on the MNIST digits classification problem, solutions were typically based on
hard-coded features such as the number of loops in a digit image, the height of each digit
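in an image, a histogram of pixel values, and so on.

Before deep learning, feature engineering used to be critical, because classical "shallow" algorithms did not have hypothesis spaces rich enough to learn useful features by themselves. The way the data was presented to the algorithm was essential to its success. For instance, before convolutional neural networks became successful on the MNIST digit classification problem, solutions were typically based on hard-coded features such as the number of loops in a digit image, the height of each digit in an image, a histogram of pixel values, and so on.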

> Thankfully, modern deep learning removes the need for most feature engineering,
since neural networks are capable of automatically extracting useful features from raw
data. Does this mean you don’t have to care about feature engineering at all as long as
you are using deep neural networks? No, for two reasons:

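> - Good features can still allow you to solve problems more elegantly while using less resources. For instance, it would be ridiculous to solve our clock face reading problem using a convolutional neural network.
> - Good features can allow you to solve a problem with much less data. The ability of deep learning models to learn features on their own relies on having lots of training data available; if only few samples are available, then the informativeness of their features becomes critical.

Fortunately, modern deep learning removes the need for most feature engineering, since neural networks can automatically extract useful features from raw data. Does this mean you can stop caring about feature engineering altogether as long as you use deep neural networks? No, for two reasons:

- Good features still let you solve problems more elegantly while using fewer resources. For instance, it would be ridiculous to solve the clock-reading problem with a convolutional neural network.
- Good features let you solve a problem with much less data. The ability of deep learning models to learn features on their own relies on having lots of training data available; if only a few samples are available, the informativeness of their features becomes critical.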

## 4.4 Overfitting and underfitting

> In all the examples we saw in the previous chapter—movie review sentiment prediction,
topic classification, and house price regression—we could notice that the performance of
our model on the held-out validation data would always peak after a few epochs and
would then start degrading, i.e. our model would quickly start to overfit to the training
data. Overfitting happens in every single machine learning problem. Learning how to
deal with overfitting is essential to mastering machine learning.

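In all the examples of the previous chapter (movie review sentiment prediction, news topic classification, and house price regression), the model's performance on the held-out validation data always peaked after a few epochs and then began to degrade; that is, the model quickly started to overfit to the training data. Overfitting happens in every machine learning problem. Learning how to deal with it is essential to mastering machine learning.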

> The fundamental issue in machine learning is the tension between optimization and
generalization. "Optimization" refers to the process of adjusting a model to get the best
performance possible on the training data (the "learning" in "machine learning"), while
"generalization" refers to how well the trained model would perform on data it has never
seen before. The goal of the game is to get good generalization, of course, but you do not
control generalization; you can only adjust the model based on its training data.

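The fundamental issue in machine learning is the tension between optimization and generalization. "Optimization" refers to the process of adjusting a model to get the best possible performance on the training data (the "learning" in "machine learning"), while "generalization" refers to how well the trained model performs on data it has never seen before. The goal is of course good generalization, but you do not control generalization directly; you can only adjust the model based on its training data.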

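> At the beginning of training, optimization and generalization are correlated: the lower
your loss on training data, the lower your loss on test data. While this is happening, your
model is said to be under-fit: there is still progress to be made; the network hasn’t yet
modeled all relevant patterns in the training data. But after a certain number of iterations
on the training data, generalization stops improving, validation metrics stall then start
degrading: the model is then starting to over-fit, i.e. it is starting to learn patterns that are
specific to the training data but that are misleading or irrelevant when it comes to new
data.

At the beginning of training, optimization and generalization are correlated: the lower the loss on the training data, the lower the loss on the test data. While this is happening, the model is said to be underfit: there is still progress to be made, because the network has not yet modeled all the relevant patterns in the training data. But after a certain number of iterations on the training data, generalization stops improving and validation metrics stall and then start degrading: the model is beginning to overfit, that is, it is starting to learn patterns that are specific to the training data but misleading or irrelevant when it comes to new data.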
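
The turning point between the two regimes can be read off a pair of loss curves. The numbers below are hypothetical values invented for illustration, not results from any model in the book.

```python
import numpy as np

# Hypothetical per-epoch losses: training loss keeps falling,
# validation loss bottoms out and then rises again
train_loss = [0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13]
val_loss = [0.62, 0.50, 0.43, 0.41, 0.42, 0.45, 0.49]

# Epochs are numbered from 1; the minimum of the validation loss marks
# the point after which further training only fits the training data
best_epoch = int(np.argmin(val_loss)) + 1
print("validation loss is lowest at epoch", best_epoch)  # epoch 4
```

Before that epoch the model is underfit; after it, the widening gap between the two curves is the signature of overfitting.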

> To prevent a model from learning misleading or irrelevant patterns found in the
training data, the best solution is of course to get more training data . A model trained on
more data will naturally generalize better. When that is no longer possible, the next best
solution is to modulate the quantity of information that your model is allowed to store, or
to add constraints on what information it is allowed to store. If a network can only afford
to memorize a small number of patterns, the optimization process will force it to focus on
the most prominent patterns, which have a better chance of generalizing well.

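To prevent a model from learning misleading or irrelevant patterns found in the training data, the best solution is of course to get more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to modulate the quantity of information the model is allowed to store, or to add constraints on what information it is allowed to store. If a network can only afford to memorize a small number of patterns, the optimization process forces it to focus on the most prominent ones, which have a better chance of generalizing well.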

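> The process of fighting overfitting in this way is called regularization. Let’s
review some of the most common regularization techniques, and let’s apply them in
practice to improve our movie classification model from the previous chapter.

Fighting overfitting in this way is called regularization. Let's review some of the most common regularization techniques and apply them in practice to improve the movie review classification model from the previous chapter.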
