# Quantized Neural Network 
## 03- Quantizing Tensorflow Graph

by [Soon Yau Cheong](http://www.linkedin.com/in/soonyau)

In the last tutorial, we learned about quantization and dequantization of tensors, and introduced the TF function tf.fake_quant_with_min_max_args that does both. Today, we will first look at how to quantize a pre-trained Tensorflow graph, then create a quantization-aware graph for training.

In [1]:
import os
import sys
import numpy as np
import tensorflow as tf
import time
import matplotlib.pyplot as plt

import utils

print("Tensorflow", tf.__version__)
print("Python", sys.version)

Tensorflow 1.10.0
Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609]


## Post Training Quantization

We can take a pre-trained Tensorflow graph and convert it into a quantized TFLite graph. To do that, we only need two things: the range (min and max values) of the weights and of the activations. The former is easy: the weights don't change after training, so we can work out their range from the frozen graph. To recall the basics of neural networks, an activation is the output of a layer, and its range depends on both the weights and the inputs. Thus, we can't get the activation range directly from the frozen graph, unless the range is fixed by design, e.g. the non-linearity tf.nn.relu6 caps the value between 0.0 and 6.0. This is also the reason why Google uses relu6 instead of relu in quantized Mobilenet, as the latter has no upper bound. However, even if the range is known, it is still less than ideal because we may lose some granularity. To give a concrete example, for a range of (0.0, 6.0), the granularity of 8-bit quantization is 6.0/255 = 0.023529412, meaning the number is discretized into multiples of 0.023529412. If the actual range with real data is within 0.0 and 1.0, then the granularity improves sixfold to 1.0/255 = 0.003921569.
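The granularity arithmetic above can be checked in a few lines of plain Python. This is only a sketch: the helper name `step_size` is made up for illustration, and the (0.0, 1.0) range stands in for whatever you would measure on real data.

```python
def step_size(range_min, range_max, num_bits=8):
    """Width of one quantization step for a uniform quantizer."""
    levels = 2 ** num_bits - 1   # 255 steps for 8 bits
    return (range_max - range_min) / levels

relu6_step = step_size(0.0, 6.0)   # range fixed by relu6
tight_step = step_size(0.0, 1.0)   # hypothetical range measured on real data

print(relu6_step)                  # roughly 0.0235
print(tight_step)                  # roughly 0.0039
print(relu6_step / tight_step)     # roughly 6
```

The tighter the range, the finer the steps, which is why measuring the real activation range is worth the extra effort.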

You can do post-training quantization using either the Python APIs or the command line, starting from different graph formats (frozen graph, saved model, session, etc.), which I find quite confusing. I'll go through the fundamentals using a mixture of these (Python, command line, different graph formats), and you can refer to the many online examples provided by Tensorflow for the one best suited to your project. Instead of treating them like a black box, we'll go through the examples bottom-up, starting by quantizing a convolution layer using the Toco converter (just another fancy acronym that Google engineers came up with).

In [2]:
tf.reset_default_graph()

# Create a simple network
def simple_network(input):

    x = tf.layers.conv2d(input, filters=32, kernel_size=3)
    x = tf.nn.relu(x)
    return x

input_dim = [1, 224, 224, 3]
input = tf.placeholder(tf.float32, input_dim)
output = simple_network(input)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # pass in the graph's input and output
    converter = tf.contrib.lite.TocoConverter.from_session(sess, [input], [output])
    # set inference type to uint8
    converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
    # set the activation range. Try commenting out this line and the conversion will fail
    converter.default_ranges_stats = (0., 6.)
    
    input_mean = 128
    input_stddev = 128
    input_arrays = converter.get_input_arrays()
    # the input mean and standard deviation is needed to work out the scale and offset
    # to de-quantize the input, since we assume input is quantized. Therefore, we can use
    # image's raw RGB uint8 as input directly.
    converter.quantized_input_stats = {input_arrays[0] : (input_mean, input_stddev)}  # mean, std_dev
    
    # now convert
    tflite_model = converter.convert()
    
    # now you can save the quantized model
    save_path = "models/practice"
    os.makedirs(save_path, exist_ok=True)

    # use a context manager so the file is flushed and closed
    with open(os.path.join(save_path, "simple_model.tflite"), "wb") as f:
        f.write(tflite_model)
    
    # you can start using it right now
    # load it into interpreter and you can use it like in Tutorial 1
    interpreter = tf.contrib.lite.Interpreter(model_content=tflite_model)

INFO:tensorflow:Froze 2 variables.
INFO:tensorflow:Converted 2 variables to const ops.
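
To see what the input mean and standard deviation do, here is a small NumPy sketch. It assumes the mapping described in the TFLite converter documentation, real_value = (quantized_value - mean) / std_dev; the helper `dequantize_input` is hypothetical and not part of Tensorflow.

```python
import numpy as np

def dequantize_input(q, mean, std_dev):
    """Map quantized uint8 inputs back to the real values the network expects."""
    return (q.astype(np.float32) - mean) / std_dev

# darkest, middle and brightest raw pixel values
pixels = np.array([0, 128, 255], dtype=np.uint8)
real = dequantize_input(pixels, mean=128, std_dev=128)
print(real)   # -1.0, 0.0 and 0.9921875
```

With mean = 128 and std_dev = 128, raw RGB bytes are interpreted as values in roughly [-1, 1), which is why the uint8 image can be fed in directly.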


## Quantization Aware Training

Converting a "normal" frozen graph will normally result in a loss of accuracy for two reasons:
1. Un-optimized activation range
2. Quantization errors

Among the two, the quantization error is the greater contributor to accuracy loss. During training, the neural network uses full precision and sees, for example, a value of 1.2345, but during inference it may see a dequantized value of, say, 1.1, which differs from what it saw in training. This error propagates and accumulates as it traverses the layers, which can result in quite some difference at the output.
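
To make the error concrete, here is a minimal NumPy sketch of a quantize-then-dequantize round trip. It is a simplified stand-in for the real fake-quantization op (which, among other things, also nudges the range so that zero is exactly representable); the function name `fake_quantize` and the (0.0, 6.0) range are assumptions for illustration.

```python
import numpy as np

def fake_quantize(x, range_min, range_max, num_bits=8):
    """Quantize to integer levels, then dequantize back to float."""
    levels = 2 ** num_bits - 1
    scale = (range_max - range_min) / levels
    x = np.clip(x, range_min, range_max)
    q = np.round((x - range_min) / scale)   # integer level in [0, levels]
    return q * scale + range_min            # value the next layer sees

x = np.array([1.2345])
x_hat = fake_quantize(x, 0.0, 6.0)
error = np.abs(x - x_hat)
print(x_hat, error)   # error is at most half a step, i.e. (6.0 / 255) / 2
```

Each layer adds an error of up to half a quantization step, and these errors compound from layer to layer.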

To tackle these 2 problems, we'll need to do quantization-aware training, which inserts additional operations to:
1. simulate the quantization effect, and
2. measure the activation range

Tensorflow has a built-in function to add those operations to your graph. You can read more about it [here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md).
```
tf.contrib.quantize.create_training_graph(input_graph=g,
                                          quant_delay=2000000)
```
That's it, just one instruction to prepare the graph for training. The quant_delay (in number of steps) delays the fake quantization until the network's activation range is more stable.

Hooray, job's done! Hm... yes, but if you are curious about what's happening inside the black box, let's dive into the details. We will build on the simple_network we've just created, and add a few lines to show what actually happens in the transformed graph.


## Digging Into Details

![sim](images/sim_quant_graph)
The graph above is taken from Google's paper ["Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"](https://arxiv.org/abs/1712.05877). It looks straightforward: we only need to implement two fake quantizations, one for the weights and one for the activations.

In simple_network, we used the high-level tf.layers API to define the convolution layer, which takes care of creating and initializing the weights and bias. Since we now want to quantize the weights and activations ourselves, we'll use the lower-level APIs: tf.get_variable to create the weights and tf.nn to perform the convolution.

We'll use tf.fake_quant_with_min_max_vars, which is almost identical to tf.fake_quant_with_min_max_args introduced earlier. The main difference I can see is how the range is passed: tf.fake_quant_with_min_max_args takes min and max as fixed Python floats with default values, whereas tf.fake_quant_with_min_max_vars requires them as tensor inputs, so they can be variables that change during training. In the converted graph and the source code I can see only tf.fake_quant_with_min_max_vars, therefore we'll use that from now on.

In [3]:
tf.reset_default_graph()

def less_simple_network(input):
    '''
        Create weights
    '''
    # [filter_height, filter_width, in_channels, out_channels]
    w_dim = [3, 3, 3, 32] 
    
    w = tf.get_variable("weight", 
                        shape=w_dim,
                        initializer=tf.contrib.layers.xavier_initializer())
        
    '''
        Fake quantizer weights
    '''
    w_min = tf.reduce_min(w)
    w_max = tf.reduce_max(w)
    
    w_fake_quant = tf.fake_quant_with_min_max_vars(w, 
                    min=w_min, 
                    max=w_max, 
                    narrow_range=True, # will be explained below
                    name="quant_weights")

    '''
        Create bias but don't quantize it for reasons to be explained later
    '''
    bias = tf.get_variable("bias", 
                        shape=[32],
                        initializer=tf.initializers.zeros)
    '''
        Perform convolution and relu
    '''
    strides = 1
    out = tf.nn.conv2d(input, w_fake_quant, [1, strides, strides, 1], padding='SAME')
    out = tf.nn.bias_add(out, bias, name='bias')    
    out = tf.nn.relu6(out)
    
    '''
        Fake quantize activation
    '''
    out_fake_quant = tf.fake_quant_with_min_max_vars(out, 
                    min=0.0, 
                    max=6.0, 
                    narrow_range=False,
                    name="act_weights")
    
    return out_fake_quant

input_dim = [1, 224, 224, 3]
input2 = tf.placeholder(tf.float32, input_dim)
output2 = less_simple_network(input2)


Alright, I admit the code now looks much longer because of the lower-level APIs, but the idea is no more complicated than the diagram above: just add two quantization nodes. Now, if we comment out the converter option that sets default_ranges_stats, the conversion no longer fails because the graph already contains all the range information.

In [4]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # pass in the graph's input and output
    converter = tf.contrib.lite.TocoConverter.from_session(sess, [input2], [output2])
    # set inference type to uint8
    converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
    # Conversion won't crash now because range is set inside fake_quant code
    #converter.default_ranges_stats = (0., 6.)
    
    input_mean = 128
    input_stddev = 128
    input_arrays = converter.get_input_arrays()
    # the input mean and standard deviation is needed to work out the scale and offset
    # to de-quantize the input, since we assume input is quantized. Therefore, we can use
    # image's raw RGB uint8 as input directly.
    converter.quantized_input_stats = {input_arrays[0] : (input_mean, input_stddev)}  # mean, std_dev
    
    # now convert
    tflite_model = converter.convert()
    
    # now you can save the quantized model
    save_path = "models/practice"
    os.makedirs(save_path, exist_ok=True)

    # use a context manager so the file is flushed and closed
    with open(os.path.join(save_path, "less_simple_model.tflite"), "wb") as f:
        f.write(tflite_model)
    
    # you can start using it right now
    # load it into interpreter and you can use it like in Tutorial 1
    interpreter = tf.contrib.lite.Interpreter(model_content=tflite_model)

INFO:tensorflow:Froze 2 variables.
INFO:tensorflow:Converted 2 variables to const ops.


### Quantization Range
#### Weights
As explained earlier, the real-number range of the weights is simply their min and max values. On the quantized side, however, the function has an option known as narrow_range. As mentioned in earlier tutorials, the number of quantization steps is $2^N-1$, which is 255 for 8 bits, so the quantized signed 8-bit values range from -128 to 127. When narrow_range is set, 254 is used instead, hence -127 to 127. I initially thought this was to make sure that the value 0 sits in the center with an equal number of positive and negative values. It turns out that Google wants to make use of an ARM processor instruction that requires the smaller number range to prevent overflow when calculating matrix multiplications. So, that's it. Activations, on the other hand, use the full 'wide range'.
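
The effect of narrow_range on the integer levels can be sketched in NumPy. This is an illustration only, not Tensorflow's implementation: `quantize_weights` is a hypothetical helper, and it relies on the documented behavior that the quantized range is [0, 2^N-1] normally and [1, 2^N-1] when narrow_range is set.

```python
import numpy as np

def quantize_weights(w, num_bits=8, narrow_range=False):
    """Quantize weights to unsigned integer levels over their min/max range."""
    q_min = 1 if narrow_range else 0
    q_max = 2 ** num_bits - 1
    scale = (w.max() - w.min()) / (q_max - q_min)
    q = np.round((w - w.min()) / scale) + q_min
    return q.astype(np.int32), scale

w = np.linspace(-0.5, 0.5, 11)   # made-up weights
q_full, _ = quantize_weights(w, narrow_range=False)
q_narrow, _ = quantize_weights(w, narrow_range=True)

# Shift by 128 to view the levels as signed 8-bit values.
print(q_full.min() - 128, q_full.max() - 128)       # -128 127
print(q_narrow.min() - 128, q_narrow.max() - 128)   # -127 127
```

The narrow range gives up one level (-128) in exchange for a range that is symmetric around zero.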

#### Activation
In the example above, I hard-coded min and max to 0.0 and 6.0. However, those numbers should be determined dynamically from statistics gathered during training. In practice, exponential moving averages of the min and max are used to prevent sudden changes from batch to batch.
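
Here is a rough NumPy sketch of tracking the activation range with an exponential moving average. The decay of 0.99, the helper `update_range` and the random stand-in activations are all assumptions for illustration; Tensorflow's own bookkeeping may differ in the details.

```python
import numpy as np

def update_range(ema_min, ema_max, batch, decay=0.99):
    """Blend the running range with the min/max of the current batch."""
    new_min = decay * ema_min + (1.0 - decay) * batch.min()
    new_max = decay * ema_max + (1.0 - decay) * batch.max()
    return new_min, new_max

rng = np.random.RandomState(0)
ema_min, ema_max = 0.0, 6.0   # start from the relu6 range
for step in range(1000):
    # stand-in for one batch of relu6 activations
    batch = np.clip(rng.normal(0.5, 0.3, size=1024), 0.0, 6.0)
    ema_min, ema_max = update_range(ema_min, ema_max, batch)

print(ema_min, ema_max)   # the max settles well below 6.0
```

A single unusual batch only moves the range by 1% here, so the quantization scale stays stable while still following the data.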

## Command Lines & Graph Visualization

Before the end of the tutorial, I thought I should introduce the command line to do quantization too. It has a powerful feature that the Python APIs don't have - creating graph visualization.

In [2]:
# I ran this on Linux; if it doesn't work on your platform,
# skip this step and look at directory using your usual ways
! ls models/mobilenet_v1/

converted_model.tflite
graphviz
mobilenet_v1_1.0_224_quant.ckpt.data-00000-of-00001
mobilenet_v1_1.0_224_quant.ckpt.index
mobilenet_v1_1.0_224_quant.ckpt.meta
mobilenet_v1_1.0_224_quant_eval.pbtxt
mobilenet_v1_1.0_224_quant_frozen.pb
mobilenet_v1_1.0_224_quant_info.txt
mobilenet_v1_1.0_224_quant.tflite
mobilenet_v1_1.0_224_quant.tgz


In [3]:
# Look at the info file to find out about the input and output nodes
! cat models/mobilenet_v1/mobilenet_v1_1.0_224_quant_info.txt

Model: mobilenet_v1_1.0_224_quant
Input: input
Output: MobilenetV1/Predictions/Reshape_1


In [5]:
# convert the graph
# if you get CUDA memory error, restart the jupyter kernel and skip the Python experiments above
! mkdir models/mobilenet_v1/graphviz
! tflite_convert \
--graph_def_file=models/mobilenet_v1/mobilenet_v1_1.0_224_quant_frozen.pb \
--output_file=models/mobilenet_v1/converted_model.tflite \
--input_arrays=input \
--output_arrays=MobilenetV1/Predictions/Reshape_1 \
--dump_graphviz_dir=models/mobilenet_v1/graphviz 
# comment out the above line and below if you don't want to look at the graph
# in Linux, install graphviz by doing "sudo apt install graphviz"

!dot -Tpdf -O models/mobilenet_v1/graphviz/*.dot 

mkdir: cannot create directory ‘models/mobilenet_v1/graphviz’: File exists
2018-11-25 00:15:55.098293: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-25 00:15:55.171286: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-25 00:15:55.171661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.12GiB
2018-11-25 00:15:55.171695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-11-25 00:15:55.373946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-25 00:15:

You'll find three PDFs in the folder mobilenet_v1/graphviz. One is the graph at import time, which is the full Tensorflow graph. It also gives you a glimpse of how complex a Tensorflow graph can be, showing every single operation as a node in the graph (e.g. add, read, assign).

![](images/import_graph.png)

During the conversion, the graph is transformed and simplified before quantization. Most notably, batch normalization is folded into the convolution weights and bias, so you won't see it in the quantized graph.
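
Folding batch normalization into the preceding layer is simple algebra, and the NumPy sketch below checks it numerically. It uses a matrix multiply as a stand-in for a 1x1 convolution, and all shapes and values are made up; it illustrates the idea rather than what the converter literally does.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=(4, 8))    # batch of inputs
w = rng.normal(size=(8, 16))   # weights (a 1x1 "convolution")
b = rng.normal(size=16)        # bias

# batch normalization parameters, one per output channel
gamma = rng.uniform(0.5, 1.5, size=16)
beta = rng.normal(size=16)
mean = rng.normal(size=16)
var = rng.uniform(0.5, 1.5, size=16)
eps = 1e-3

# Original: layer followed by batch normalization
y_ref = gamma * ((x @ w + b) - mean) / np.sqrt(var + eps) + beta

# Folded: scale the weights and adjust the bias once, offline
factor = gamma / np.sqrt(var + eps)
w_fold = w * factor            # broadcasts over output channels
b_fold = (b - mean) * factor + beta
y_fold = x @ w_fold + b_fold

print(np.allclose(y_ref, y_fold))   # True
```

Since the folded weights are the ones used at inference, they are also the ones whose range matters for quantization.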

![](images/transformed_graph.png)

There is also an intermediate graph with transient allocation information, which I don't have a single clue about.

## What's Next?

If you want to learn more about the graph conversion Python APIs and command lines, you can find them here: [Python APIs](https://www.tensorflow.org/lite/convert/python_api) and [command lines](https://www.tensorflow.org/lite/convert/cmdline_examples). I believe you have now mastered the fundamentals of quantizing neural networks and are ready to read the original academic paper ["Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"](https://arxiv.org/abs/1712.05877).