Quantization in Glow
Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers). In this context, quantization is the process of converting the inference phase of neural network execution from floating-point arithmetic to integer arithmetic. Arithmetic on small integers is more efficient than full-width floating-point computation, and it also decreases memory usage.
For background, the TensorFlow Lite documentation explains how post-training quantization is done in TensorFlow Lite.
Glow is able to convert floating-point-based networks into signed 8-bit integer networks. The canonical quantization representation uses signed integers, though it is possible to support other quantization formats. Glow uses profile-guided quantization, observing execution during inference to estimate the possible numeric range for each stage of the neural network. Training-based quantization is considered future work.
In Glow, tensors are typed and can represent floats, quantized non-floating-point values such as the currently supported Int8 (8-bit signed integers), and index types. A quantized tensor's type is made up of the underlying element type (Int8), as well as the possible range of the values in the tensor, expressed using 'scale' and 'offset' fields. To convert from the 8-bit integer range of [-128..127] to the floating-point numbers that they represent, Glow uses the following conversion formula:
value = (input - offset) * scale
Activations, weights, and variables all use the same type-system and represent information in a uniform way.
Different parts of the network contain floating-point values in different ranges. In some parts, the typical range of the numbers is between zero and one, while in other parts of the network the possible range is in the hundreds. Choosing a single conversion scale for the whole network would not work, because a single scale value could be imprecise for small values and truncate large values.
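As a minimal sketch of the conversion formula and of why a single scale does not fit the whole network, the Python below quantizes and dequantizes individual values. The helper names and the two example scales are illustrative assumptions, not part of Glow's API.

```python
def quantize(value, scale, offset):
    """Invert value = (input - offset) * scale, saturating to int8."""
    q = round(value / scale) + offset
    return max(-128, min(127, q))


def dequantize(q, scale, offset):
    """Apply the conversion formula: value = (input - offset) * scale."""
    return (q - offset) * scale


# Hypothetical scales: one suited to values near [-1, 1], one to [-512, 512].
fine_scale = 1.0 / 128
coarse_scale = 4.0

# A small value survives the fine scale but collapses to zero under the coarse one.
print(dequantize(quantize(0.3, fine_scale, 0), fine_scale, 0))      # 0.296875
print(dequantize(quantize(0.3, coarse_scale, 0), coarse_scale, 0))  # 0.0

# A large value is truncated (saturated) under the fine scale.
print(dequantize(quantize(300.0, fine_scale, 0), fine_scale, 0))    # 0.9921875
```

The coarse scale loses the small value entirely, while the fine scale clips the large one, which is why each stage of the network gets its own scale and offset.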
We use profile-guided information to estimate the possible numeric range for each stage of the neural network. Our quantization conversion is a two-phase process. First, we statically instrument the network with special profiling nodes that record the ranges of activations that flow through the network, optimize the network including these profiling nodes, and then run inference. Then, we recompile the network using this profile information to convert the network into a quantized form, allowing for static optimization of the quantized graph. We convert portions of the network into islands of integer computation and aim to generate outputs in the range that the original floating-point network produces. During the conversion, for the following types of quantized nodes, we ignore the output's quantization parameters (if they are provided) and force the output to have the same quantization parameters as the input, for performance reasons:
- LocalResponseNormalizationNode
- SliceNode
- ReshapeNode
- TopKNode
- GatherNode
- MaxPoolNode
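The two-phase idea (record a range, then derive quantization parameters from it) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `profile_range` stands in for the profiling nodes, `choose_params` is a hypothetical helper, and the rule of widening the range to include 0.0 is an assumption of this sketch rather than a description of Glow's exact procedure.

```python
def profile_range(activations):
    """Phase 1 stand-in: record the min and max observed during inference."""
    return min(activations), max(activations)


def choose_params(lo, hi, qmin=-128, qmax=127):
    """Phase 2 stand-in: pick (scale, offset) so [lo, hi] maps onto [qmin, qmax].

    Consistent with value = (input - offset) * scale.
    """
    # Assumption: widen the range so 0.0 is always representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    if hi == lo:
        return 1.0, 0  # degenerate range: any scale works
    scale = (hi - lo) / (qmax - qmin)
    offset = round(qmin - lo / scale)
    return scale, offset


observed = [0.0, 0.5, 1.25, 2.55]
lo, hi = profile_range(observed)
scale, offset = choose_params(lo, hi)
print(lo, hi, scale, offset)
```

With these parameters the smallest observed value maps to -128 and the largest to 127, so the full int8 range is used for the profiled stage.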
The figure below shows a quantized subgraph from Resnet50.
How to perform NN conversion
The Glow loader tool provides options to execute both profiling and conversion of a NN graph.
The dump-profile=profile.yaml option dumps each node's output profile data to profile.yaml. This information can then be used during quantized conversion.
For example, you can run the following command to capture a profile for Resnet50:
./bin/image-classifier tests/images/imagenet/*.png -image-mode=0to1 -m=resnet50 -model-input-name=gpu_0/data -dump-profile="profile.yaml"
By default, everything will be lowered for profiling. This allows the lowered
components of nodes to be profiled, which gives good precision for complex
nodes. For example, the SigmoidCrossEntropyWithLogitsNode is lowered to many
internal nodes. Without profiling those internal nodes, there would be no
information on how best to quantize them.
Lowering all nodes may cause performance issues for some models, e.g. if a model
has group Convolutions, which explode the size of the graph when lowered and
lead to long compilation and run times during profiling. Thus, we allow lowering
to be disabled for certain NodeKinds during profiling. This means that during
quantization, these nodes should also not be lowered by the backend. This can be
done using the command line option -do-not-lower-nodes-for-profiling (note:
multiple NodeKinds can be listed, separated by commas). For example:
./bin/image-classifier tests/images/imagenet/*.png -image-mode=0to1 -m=shufflenet -model-input-name=gpu_0/data -dump-profile="shufflenet.yaml" -do-not-lower-nodes-for-profiling=Convolution
The loader supports the following modes (or schemas) of quantization:
- asymmetric: maps the floating-point data to quantized ranges not necessarily centered on 0. This is the default quantization schema.
- symmetric: maps the floating-point data to ranges centered on 0. In practice, this means the symmetric schema may extend the range it needs to capture to make sure 0.0 is at the center of that range. Therefore, this schema potentially wastes some encoding space to enforce the symmetric property, but it guarantees that the offset is always equal to zero.
- symmetric with uint8: produces ranges where the offset is always equal to zero but allows the quantized ranges to be either int8 [-128; 127] or uint8 [0; 255]. In practice, this schema represents uint8 ranges using int8 ranges with an offset of -128. Therefore, when using this schema, the produced profile will have two kinds of ranges: one with an offset of 0 and the other with an offset of -128.
- symmetric with power of 2 scale: produces quantized ranges centered on 0 (symmetric) but also restricts the scale parameter to be a power of 2. This restriction may use the quantized range poorly (lower accuracy) but has the potential to provide better performance.
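To make the differences between the schemas concrete, here is a rough Python sketch that derives a (scale, offset) pair from a profiled range under each schema. The function names are hypothetical, the rounding choices (for example, rounding the scale up to the next power of 2) are assumptions of this sketch, and it assumes a non-degenerate range; it is not Glow's implementation.

```python
import math

QMIN, QMAX = -128, 127


def asymmetric(lo, hi):
    """Range need not be centered on 0; the offset absorbs the shift."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # assumption: keep 0.0 representable
    scale = (hi - lo) / (QMAX - QMIN)
    return scale, round(QMIN - lo / scale)


def symmetric(lo, hi):
    """Extend the range to [-bound, bound]; the offset is always 0."""
    bound = max(abs(lo), abs(hi))
    return (2 * bound) / (QMAX - QMIN), 0


def symmetric_with_uint8(lo, hi):
    """Non-negative ranges are treated as uint8, stored as int8 with offset -128."""
    if lo >= 0.0:
        return hi / (QMAX - QMIN), QMIN
    return symmetric(lo, hi)


def symmetric_with_power2_scale(lo, hi):
    """Symmetric, with the scale rounded up to a power of 2 (assumed direction)."""
    scale, offset = symmetric(lo, hi)
    return 2.0 ** math.ceil(math.log2(scale)), offset


for schema in (asymmetric, symmetric, symmetric_with_uint8,
               symmetric_with_power2_scale):
    print(schema.__name__, schema(-1.0, 3.0))
```

For the range [-1.0, 3.0], the asymmetric schema uses all 256 codes for the observed range, while the symmetric variants spend codes on [-3.0, -1.0], which never occurs; that is the encoding space the symmetric schema gives up in exchange for a zero offset.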
The quantization-schema=<schema> option specifies the schema for the
quantization process, where <schema> names one of the schemas described above.
The load-profile=profile.yaml option quantizes the graph based on the profile
captured in the profile.yaml file. Important note: the graph structure must not
change between capturing the profile and quantizing the graph.
For example, you can run the following command to load the profile and quantize
the graph:
./bin/image-classifier tests/images/imagenet/*.png -image-mode=0to1 -m=resnet50 -model-input-name=gpu_0/data -load-profile="profile.yaml"
By default, all nodes that can be quantized will be quantized. However, we may
only want to quantize some parts of a model, e.g. if accuracy loss is too high
when all node kinds are quantized. The Glow loader currently allows for
disabling quantization of all nodes of a specific kind which are found in the
graph. For example, if the loaded model sees high accuracy loss when
element-wise Add is quantized, it can be left in floating point. This can be
done by passing the node kind on the command line via the option
-keep-original-precision-for-nodes. Multiple node kinds can be listed, separated
by commas. For example, to not quantize any Add or Div nodes when running
the quantized text translator:
./bin/text-translator -m en2gr -load-profile=en2gr.yaml -keep-original-precision-for-nodes=Add,Div
By default, the target quantization precision is int8. However, the precision
can be controlled via the command line parameter quantization-precision, which
supports two values.
Caffe2 Quantized Model Support
Glow supports the Caffe2 Resnet50 quantized model: https://github.com/caffe2/models/tree/master/resnet50_quantized
To support Caffe2 quantized models, Glow has:
- Supported additional quantized Caffe2 operators.
Int8Quantize, Int8Dequantize, Int8Conv, Int8ConvRelu, Int8MaxPool, Int8AveragePool, Int8FC, Int8SumRelu, Int8GivenIntTensorFill, Int8GivenTensorFill
- Supported int32 quantized bias.
In most of the cases, bias is quantized in int32 to improve precision
(the partial sum of the matrix-matrix multiplication is accumulated into int32,
so int32 bias can be added to the int32 partial sum for better accuracy).
Glow now supports int32 quantized bias in
- Supported the conversion from uint8 quantized activations to int8 quantized activations.
For the quantized Caffe2 ops, the activations are quantized to uint8. In Glow, the activations are quantized to int8. Therefore, we subtract 128 from the offset read from the quantized Caffe2 model (that is, we add INT8_MIN, which is -128) so that the activations become int8.
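The uint8-to-int8 conversion can be checked with a short Python sketch. It assumes the stored values are shifted by the same amount as the offset, which is what keeps the represented floating-point value unchanged; the helper names are hypothetical.

```python
INT8_MIN = -128


def dequantize(q, scale, offset):
    """value = (input - offset) * scale"""
    return (q - offset) * scale


def uint8_to_int8(values, scale, offset):
    """Shift both the uint8 data and its offset by INT8_MIN (i.e. subtract 128)."""
    shifted = [v + INT8_MIN for v in values]
    return shifted, scale, offset + INT8_MIN


# Hypothetical uint8 activations with scale 0.5 and offset 10.
u8_values, scale, u8_offset = [0, 130, 255], 0.5, 10
i8_values, _, i8_offset = uint8_to_int8(u8_values, scale, u8_offset)

print(i8_values, i8_offset)  # [-128, 2, 127] -118
# The represented floating-point values are identical before and after.
print([dequantize(v, scale, u8_offset) for v in u8_values])
print([dequantize(v, scale, i8_offset) for v in i8_values])
```

Because the shift cancels inside `(input - offset)`, the scale is untouched and no precision is lost by the conversion.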
Glow features a number of compiler optimizations that transform the compute graph and make it more efficient. There are a few classes of optimizations and parameters to optimize.
First, we attempt to minimize the number of conversions between floating-point tensors and integer tensors, in both directions. Some operations, such as 'transpose' and 'concat', operate on both types, and changing the representation can minimize conversions.
Second, the neural network contains 'rescale' nodes that change the range of the integers. These nodes are required to convert between numeric ranges that mimic the original floating-point network. However, in many cases, it is possible to fold the rescale operations into numeric-producing operations, and eliminate them.
Third, it's possible to rescale the values in the network in order to allow fast hardware implementations of the quantized operations. For example, consider the 'max' operation. By normalizing both sides of the 'max' to the same scale, we allow the hardware to perform a simple integer comparison.
For more specific graph optimizations, see the graph optimization documentation.
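The 'max' case can be illustrated with a small Python sketch. It assumes a hypothetical `rescale` helper that re-expresses an int8 value under a different (scale, offset) pair; once both operands share parameters, a plain integer comparison gives the right answer because the mapping is monotonic for a positive scale.

```python
def rescale(q, from_params, to_params):
    """Re-express an int8 value under a different (scale, offset) pair."""
    from_scale, from_offset = from_params
    to_scale, to_offset = to_params
    real = (q - from_offset) * from_scale
    rescaled = round(real / to_scale) + to_offset
    return max(-128, min(127, rescaled))  # saturate to int8


def quantized_max(a, a_params, b, b_params, out_params):
    """Bring both inputs to the output's parameters, then compare integers."""
    a_common = rescale(a, a_params, out_params)
    b_common = rescale(b, b_params, out_params)
    return max(a_common, b_common)


# a encodes 1.0 (scale 0.01) and b encodes 2.0 (scale 0.1).
a, a_params = 100, (0.01, 0)
b, b_params = 20, (0.1, 0)

# Comparing the raw integers would wrongly pick a (100 > 20).
result = quantized_max(a, a_params, b, b_params, out_params=(0.1, 0))
print(result)  # 20, which encodes 2.0 under scale 0.1
```

In a compiled graph the rescaling would be done once, ahead of the comparison, and ideally folded into the producing operation.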
Row-wise (or channel-wise) quantization is an important way to minimize accuracy drop. Glow supports row-wise quantized FullyConnected and SparseLengthsWeightedSum nodes; they are enabled by the model loader option "-enable-rowwise".
For the regular quantized FC, we quantize the whole weights tensor with the same
scale and offset, which are computed based on the max and min of the entire
tensor. For row-wise quantization, after getting min_i and max_i for each row i,
we compute the pair (scale_i, offset_i) used to quantize each element in row i.
The figure below shows the quantized FC node and the
RowwiseQuantizedFullyConnected node. Instead of using only one tensor to
represent the quantized weights, we need 2 extra vectors, Scales and Offsets, to
store the (scale, offset) for each row.
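A rough Python sketch of the row-wise scheme follows. The function name is hypothetical, and the per-row parameter rule (an asymmetric mapping of the row's range, widened to include 0.0) is an assumption made for illustration rather than Glow's exact computation.

```python
def rowwise_quantize(weights):
    """Quantize each row with its own (scale_i, offset_i).

    Returns the int8 rows plus the Scales and Offsets vectors.
    """
    quantized, scales, offsets = [], [], []
    for row in weights:
        # Per-row range; assumption: widen it so 0.0 is representable.
        lo, hi = min(min(row), 0.0), max(max(row), 0.0)
        scale = (hi - lo) / 255 if hi > lo else 1.0
        offset = round(-128 - lo / scale)
        quantized.append(
            [max(-128, min(127, round(v / scale) + offset)) for v in row]
        )
        scales.append(scale)
        offsets.append(offset)
    return quantized, scales, offsets


# Two rows with very different ranges.
weights = [[0.0, 0.2, 1.0],
           [-50.0, 0.0, 205.0]]
quantized, scales, offsets = rowwise_quantize(weights)
print(quantized)
print(scales, offsets)
```

With a single scale for the whole tensor, the range would be [-50, 205] and the value 0.2 in the first row would round to the same code as 0.0; the per-row scale keeps it distinct.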
Row-wise quantized SparseLengthsWeightedSum is also supported. Similar to the
above, we compute scales and offsets per row, to be used with the
RowwiseQuantizedSparseLengthsSumNode. Scales and Offsets are inputs to
the node. Output of this node is float, matching the Caffe2 implementation.
Fused Row-wise Quantization
For some backends it may be beneficial to keep each row's scales and offsets
fused inline with the data. Caffe2 implements nodes with fused storage, and Glow
supports such fused Nodes/Instructions. The ElemKind of fused tensors is either
UInt8FusedQTy or UInt8FusedFP16QTy. Tensors with these ElemKinds are
2-dimensional, and have extra columns for each row to store scales and offsets
for that row. UInt8FusedQTy stores scales and offsets as float (so there are 8
extra columns), while UInt8FusedFP16QTy stores them as float16_t (so there are 4
extra columns). Note that similar to normal row-wise quantized tensors, they use
a dummy scale and offset in the Type.
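The column counts can be checked with a small Python sketch that packs one row. The layout used here (scale followed by offset, little-endian, appended at the end of the row) is an assumption made for illustration, and `fuse_row` and `unfuse_row` are hypothetical helpers, not Glow functions.

```python
import struct


def fuse_row(quantized_row, scale, offset, use_fp16=False):
    """Append the row's scale and offset to its uint8 data.

    "<ff" packs two 32-bit floats (8 bytes); "<ee" packs two 16-bit floats (4 bytes).
    """
    fmt = "<ee" if use_fp16 else "<ff"
    return bytes(quantized_row) + struct.pack(fmt, scale, offset)


def unfuse_row(fused_row, use_fp16=False):
    """Split a fused row back into (data, scale, offset)."""
    fmt = "<ee" if use_fp16 else "<ff"
    extra = struct.calcsize(fmt)
    data, tail = fused_row[:-extra], fused_row[-extra:]
    scale, offset = struct.unpack(fmt, tail)
    return list(data), scale, offset


row = [1, 2, 3]
fused_fp32 = fuse_row(row, 0.5, -1.0)
fused_fp16 = fuse_row(row, 0.5, -1.0, use_fp16=True)

print(len(fused_fp32) - len(row))  # 8 extra columns
print(len(fused_fp16) - len(row))  # 4 extra columns
print(unfuse_row(fused_fp32))
```

Keeping the parameters next to the data means a backend can read a row and everything needed to dequantize it in one contiguous access.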
Conversion formula when using row-wise quantization
Some row-wise quantized operators prefer to use float offsets instead of int32 offsets. These operators use the following conversion formula:
value = (scale * input) + offset
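The two conversion formulas are algebraically related: expanding (input - offset) * scale gives scale * input - offset * scale, so a float offset equal to -offset * scale represents the same values. The Python sketch below checks this on one example; the function names are illustrative only.

```python
def dequantize_int_offset(q, scale, offset):
    """Standard formula: value = (input - offset) * scale"""
    return (q - offset) * scale


def dequantize_float_offset(q, scale, float_offset):
    """Alternate formula: value = (scale * input) + offset"""
    return scale * q + float_offset


def to_float_offset(scale, int_offset):
    """Float offset that makes the alternate formula match the standard one."""
    return -int_offset * scale


scale, int_offset = 0.5, 4
float_offset = to_float_offset(scale, int_offset)

print(dequantize_int_offset(10, scale, int_offset))      # 3.0
print(dequantize_float_offset(10, scale, float_offset))  # 3.0
```

A float offset is not restricted to integer multiples of the scale, so it can place the low end of the range exactly at the row's minimum.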
UInt8FusedQTy always use float offsets and this alternate
conversion formula. Nodes that use float offsets and this alternate conversion