Fixed point quantization techniques #78

Merged: 2 commits, May 30, 2018
performance/quantization.md (204 changes: 60 additions, 144 deletions)
# Fixed Point Quantization

Quantization techniques store and calculate numbers in more compact formats.
[TensorFlow Lite](/mobile/tflite/) adds quantization that uses an 8-bit fixed
point representation.

Since a challenge for modern neural networks is optimizing for high accuracy, the
priority has been improving accuracy and speed during training. Using floating
point arithmetic is an easy way to preserve accuracy and GPUs are designed to
accelerate these calculations.


However, as more machine learning models are deployed to mobile devices,
inference efficiency has become a critical issue. Where the computational demand
for *training* grows with the number of models trained on different
architectures, the computational demand for *inference* grows in proportion to
the number of users.


## Quantization benefits


Using 8-bit calculations helps your models run faster and use less power. This is
especially important for mobile devices and embedded applications that can't run
floating point code efficiently, for example, Internet of Things (IoT) and
robotics devices. There are additional opportunities to extend this support to
more backends and research lower precision networks.

### Smaller file sizes {: .hide-from-toc}

Neural network models require a lot of space on disk. For example, the original
AlexNet requires over 200 MB in float format, almost all of it for the model's
millions of weights. Because the weights are slightly different floating point
numbers, simple compression formats (like zip) perform poorly.


Weights fall in large layers of numerical values. For each layer, weights tend to
be normally distributed within a range. Quantization can shrink file sizes by
storing the minimum and maximum weight for each layer, then compressing each
weight's float value to an 8-bit integer representing the closest real number in
a linear set of 256 values within that range.
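
As a concrete illustration, below is a minimal NumPy sketch of this per-layer
scheme. It is not the code TensorFlow uses; the layer shape and function names
are made up for the example:

```python
import numpy as np

def quantize_layer(weights, num_bits=8):
    """Map a layer's float weights onto 256 evenly spaced values in [min, max]."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (2 ** num_bits - 1)  # width of one quantization step
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    return q, w_min, w_max  # store one byte per weight plus two floats per layer

def dequantize_layer(q, w_min, w_max, num_bits=8):
    """Recover approximate float weights from the stored 8-bit values."""
    scale = (w_max - w_min) / (2 ** num_bits - 1)
    return w_min + q.astype(np.float32) * scale

weights = np.random.randn(256, 128).astype(np.float32)  # hypothetical layer
q, w_min, w_max = quantize_layer(weights)
restored = dequantize_layer(q, w_min, w_max)
print(np.abs(weights - restored).max())  # at most half a quantization step
```

Storing `q` plus the two range floats costs roughly a quarter of the float32
array's size, and because the 256 levels are evenly spaced, the reconstruction
error is bounded by half a quantization step.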


### Faster inference {: .hide-from-toc}

Since calculations are run entirely on 8-bit inputs and outputs, quantization
reduces the computational resources needed for inference calculations. This is
more involved, requiring changes to all floating point calculations, but results
in a large speed-up for inference time.

### Memory efficiency {: .hide-from-toc}

Since fetching 8-bit values only requires 25% of the memory bandwidth of floats,
more efficient caches avoid bottlenecks for RAM access. In many cases, the power
consumption for running a neural network is dominated by memory access. The
savings from using fixed-point 8-bit weights and activations are significant.
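
As a back-of-the-envelope sketch of that 25% figure (the weight count below is
hypothetical, not a measured model):

```python
num_weights = 4_000_000              # hypothetical network
bytes_float32 = num_weights * 4      # bytes fetched per pass over float32 weights
bytes_uint8 = num_weights * 1        # bytes fetched with 8-bit weights
print(bytes_uint8 / bytes_float32)   # 0.25, i.e. 25% of the float bandwidth
```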

Typically, SIMD operations are available that run more operations per clock
cycle. In some cases, a DSP chip is available that accelerates 8-bit calculations
resulting in a massive speedup.

## Fixed point quantization techniques

The goal is to use the same precision for weights and activations during both
training and inference. But an important difference is that training consists of
a forward pass and a backward pass, while inference only uses a forward pass.
When we train the model with quantization in the loop, we ensure that the forward
pass matches precision for both training and inference.


To minimize the loss in accuracy for fully fixed point models (weights and
activations), train the model with quantization in the loop. This simulates
quantization in the forward pass of a model so weights tend towards values that
perform better during quantized inference. The backward pass uses quantized
weights and activations and models quantization as a straight-through estimator.
(See Bengio et al., [2013](https://arxiv.org/abs/1308.3432).)
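
The sketch below illustrates the straight-through estimator idea in TensorFlow
1.x style. It is a conceptual illustration only, not the implementation behind
`tf.contrib.quantize`, and the function name is made up:

```python
import tensorflow as tf

def fake_quant(x, x_min, x_max, num_bits=8):
    """Simulate 8-bit quantization in the forward pass; pass gradients straight through."""
    scale = (x_max - x_min) / (2 ** num_bits - 1)
    clipped = tf.clip_by_value(x, x_min, x_max)
    # Round to the nearest of the 256 representable values in [x_min, x_max].
    quantized = tf.round((clipped - x_min) / scale) * scale + x_min
    # Straight-through estimator: the forward value is `quantized`, but the
    # rounding is hidden from autodiff, so gradients flow as if this were identity.
    return x + tf.stop_gradient(quantized - x)
```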

Additionally, the minimum and maximum values for activations are determined
during training. This allows a model trained with quantization in the loop to be
converted to a fixed point inference model with little effort, eliminating the
need for a separate calibration step.
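
One common way to pick those activation ranges is an exponential moving average
of the per-batch minima and maxima observed during training; a rough sketch under
that assumption (not necessarily the exact mechanism in `tf.contrib.quantize`):

```python
def update_range(batch_min, batch_max, ema_min, ema_max, decay=0.999):
    """Track activation ranges with an exponential moving average during training."""
    new_min = decay * ema_min + (1 - decay) * batch_min
    new_max = decay * ema_max + (1 - decay) * batch_max
    return new_min, new_max  # frozen at conversion time and reused for inference
```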

## Quantization training with TensorFlow

TensorFlow can train models with quantization in the loop. Because training
requires small gradient adjustments, floating point values are still used. To
keep models as floating point while adding the quantization error in the training
loop, @{$array_ops#Fake_quantization$fake quantization} nodes simulate the
effect of quantization in the forward and backward passes.

Since it's difficult to add these fake quantization operations to all the
required locations in the model, there's a function available that rewrites the
training graph. To create a fake quantized training graph:

```python
# Build forward pass of model.
loss = tf.losses.get_total_loss()

# Call the training rewrite which rewrites the graph in-place with
# FakeQuantization nodes and folds batchnorm for training. It is
# often needed to fine tune a floating point model for quantization
# with this training tool. When training from scratch, quant_delay
# can be used to activate quantization after training to converge
# with the float graph, effectively fine-tuning the model.
tf.contrib.quantize.create_training_graph(quant_delay=2000000)

# Call backward pass optimizer as usual.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
optimizer.minimize(loss)
```

The rewritten *eval graph* is non-trivially different from the *training graph*
since the quantization ops affect the batch normalization step. Because of this,
we've added a separate rewrite for the *eval graph*:

```python
# Build eval model.
logits = tf.nn.softmax_cross_entropy_with_logits(...)

# Call the eval rewrite which rewrites the graph in-place with
# FakeQuantization nodes and fold batchnorm for eval.
tf.contrib.quantize.create_eval_graph()

# Save the checkpoint and eval graph proto to disk for freezing
# and providing to TFLite.
with open(eval_graph_file, 'w') as f:
  f.write(str(g.as_graph_def()))
saver = tf.train.Saver()
saver.save(sess, checkpoint_name)
```

Methods to rewrite the training and eval graphs are an active area of research
and experimentation. Although rewrites and quantized training might not work or
improve performance for all models, we are working to generalize these
techniques.


## Generating fully quantized models

The previously demonstrated after-rewrite eval graph only *simulates*
quantization. To generate real fixed point computations from a trained
quantization model, convert it to a fixed point kernel. TensorFlow Lite supports
this conversion from the graph resulting from `create_eval_graph`.

First, create a frozen graph that will be the input for the TensorFlow Lite
toolchain:

```shell
bazel build tensorflow/python/tools:freeze_graph && \
bazel-bin/tensorflow/python/tools/freeze_graph \
--input_graph=eval_graph_def.pb \
--input_checkpoint=checkpoint \
--output_graph=frozen_eval_graph.pb --output_node_names=outputs
```

Provide this to the TensorFlow Lite Optimizing Converter (TOCO) to get a fully
quantized TensorFlow Lite model:

```shell
bazel build tensorflow/contrib/lite/toco:toco && \
./bazel-bin/third_party/tensorflow/contrib/lite/toco/toco \
--input_file=frozen_eval_graph.pb \
...
--std_value=127.5 --mean_value=127.5
```

See the documentation for @{tf.contrib.quantize} and [TensorFlow Lite](/mobile/tflite/).

## Quantized accuracy

Fixed point [MobileNet](https://arxiv.org/abs/1704.04861) models are released with
8-bit weights and activations. Using the rewriters, these models achieve the
Top-1 accuracies listed in Table 1. For comparison, the floating point accuracies
are listed for the same models. The code used to generate these models
[is available](https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md)
along with links to all of the pretrained mobilenet_v1 models.

<figure>
<table>
<tr>
<th>Image Size</th>
<th>Depth</th>
<th>Top-1 Accuracy:<br>Floating point</th>
<th>Top-1 Accuracy:<br>Fixed point: 8 bit weights and activations</th>
</tr>
<tr><td>128</td><td>0.25</td><td>0.415</td><td>0.399</td></tr>
<tr><td>128</td><td>0.5</td><td>0.563</td><td>0.549</td></tr>
<!-- … -->
<tr><td>224</td><td>1</td><td>0.709</td><td>0.697</td></tr>
</table>
<figcaption>
<b>Table 1</b>: MobileNet Top-1 accuracy on Imagenet Validation dataset.
</figcaption>
</figure>

## Representation for quantized tensors

TensorFlow approaches the conversion of floating-point arrays of numbers into
8-bit representations as a compression problem. The weights and activation
tensors in trained neural network models tend to have values that are distributed
across comparatively small ranges (for example, -15 to +15 for weights or -500 to
1000 for image model activations). And since neural nets tend to be robust to
noise, the error introduced by quantizing to a small set of values keeps the
precision of the overall results within an acceptable threshold. A chosen
representation must perform fast calculations, especially the large matrix
multiplications that comprise the bulk of the computations while running a model.

This is represented with two floats that store the overall minimum and maximum
values corresponding to the lowest and highest quantized value. Each entry in the
quantized array represents a float value in that range, distributed linearly
between the minimum and maximum. For example, with a minimum of -10.0, a maximum
of 30.0, and an 8-bit array, the quantized values represent the following:

<figure>
<table>
<tr><th>Quantized</th><th>Float</th></tr>
<tr><td>0</td><td>-10.0</td></tr>
<tr><td>255</td><td>30.0</td></tr>
<tr><td>128</td><td>10.0</td></tr>
</table>
<figcaption>
<b>Table 2</b>: Example quantized value range
</figcaption>
</figure>
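
The table's mapping can be reproduced with a couple of lines, using the -10.0 to
30.0 range from the example above (128 maps to roughly 10.08, which the table
rounds to 10.0):

```python
range_min, range_max = -10.0, 30.0
step = (range_max - range_min) / 255   # width of one 8-bit level, about 0.157
for q in (0, 128, 255):
    print(q, range_min + q * step)     # 0 -> -10.0, 128 -> ~10.08, 255 -> 30.0
```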

The advantages of this representation format are:

* It efficiently represents an arbitrary magnitude of ranges.
* The values don't have to be symmetrical.
* The format represents both signed and unsigned values.
* The linear spread makes multiplications straightforward (see the sketch after this list).
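
For instance, two quantized tensors can be multiplied using only integer
arithmetic, accumulating in 32 bits and applying a single float rescale at the
end. The sketch below is a simplified scheme for illustration, not the kernels
TensorFlow Lite actually ships, and the scales and zero points are made up:

```python
import numpy as np

def quantized_matmul(a_q, a_scale, a_zero, b_q, b_scale, b_zero):
    """Multiply two uint8-quantized matrices, accumulating in int32."""
    a = a_q.astype(np.int32) - a_zero   # shift to signed, zero-point form
    b = b_q.astype(np.int32) - b_zero
    acc = a @ b                         # integer-only matrix multiply
    return acc * (a_scale * b_scale)    # one float rescale recovers real values

# Hypothetical inputs already quantized with (scale, zero point) pairs.
a_q = np.random.randint(0, 256, size=(4, 8), dtype=np.uint8)
b_q = np.random.randint(0, 256, size=(8, 3), dtype=np.uint8)
result = quantized_matmul(a_q, 0.05, 128, b_q, 0.02, 128)
```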

Alternative techniques use lower bit depths by non-linearly distributing the
float values across the representation, but currently are more expensive in terms
of computation time. (See Han et al.,
[2016](https://arxiv.org/abs/1510.00149).)

The advantage of having a clear definition of the quantized format is that it's
always possible to convert back and forth from fixed-point to floating-point for
operations that aren't quantization-ready, or to inspect the tensors for
debugging.