# Quantized Neural Network
## 06 Batchnorm Folding
by [Soon Yau Cheong](http://www.linkedin.com/in/soonyau)

Alright, we now have a trained graph with quantized weights, which we could export together with the quantization parameters, i.e. offset and scale. However, there is one more optimization step before we throw away the floating-point weights: batchnorm folding. Batchnorm folding merges the batchnorm operation into the convolutional layer's weights and biases, removing the extra batchnorm computation at inference time. We'll first go through batchnorm briefly, derive the equations, and cross-check the result against the TensorFlow Lite model.

### Batchnorm
One of the difficulties in training deep neural networks is that, after each weight update, the statistics of the activations change. The next layer then sees input activations with different statistics from those it saw in the previous training step, which can render the just-updated weights useless. This phenomenon propagates through the network and makes learning difficult. In 2015, Google researchers came up with an idea to tackle this problem: normalize the activations to have zero mean and unit variance, keeping the statistics almost the same and allowing more flexible weight initialization.

![alt text](images/batchnorm.png)

The figure above shows the algorithm from the [original paper](https://arxiv.org/abs/1502.03167). It is actually quite simple. During training, the mean and variance are computed over each mini-batch, and moving averages of them are kept for use at inference time. The activations are then normalized; note that epsilon is a small number added for numerical stability, to avoid dividing by zero when the variance is close to zero. In TensorFlow, 0.001 is used as the default epsilon. Then there are two learnable variables, $\gamma$ and $\beta$, for scale and offset. That is quite a lot of arithmetic, but we can remove all of it. How? Once the graph is frozen, all the variables (mean, variance, gamma and beta) become constants, and we can take advantage of that by merging them into the convolutional (or fully connected) layer that precedes the batchnorm layer.
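To make the inference-time arithmetic concrete, here is a minimal NumPy sketch of a frozen batchnorm layer. The function name `batchnorm_inference` and all the numbers are made up for illustration; only the formula and the epsilon of 0.001 come from the text above.

```python
import numpy as np

def batchnorm_inference(y, mean, variance, gamma, beta, eps=1e-3):
    """Apply a frozen batchnorm per output channel (last axis of y)."""
    # Normalize with the stored (moving average) statistics.
    y_hat = (y - mean) / np.sqrt(variance + eps)
    # Scale and shift with the learned gamma and beta.
    return gamma * y_hat + beta

# Hypothetical activations: a batch of 4 samples with 3 output channels.
rng = np.random.RandomState(0)
y = rng.randn(4, 3)
mean = np.array([0.1, -0.2, 0.3])
variance = np.array([1.0, 0.5, 2.0])
gamma = np.array([1.5, 0.8, 1.0])
beta = np.array([0.0, 0.1, -0.1])

y_bn = batchnorm_inference(y, mean, variance, gamma, beta)
print(y_bn.shape)
```

Every quantity other than `y` is a constant once the graph is frozen, which is what makes the folding in the next section possible.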

### Folding
Without loss of generality, we'll use a simplified dot product with a bias in place of the convolution operation to derive the equations. For every filter with weights $w$ and bias $b$, the activation $y$ is
\begin{equation}
y = (\sum_Nw_ix_i)+b
\end{equation}
Now we apply batchnorm to it, where $\hat{y}$ is the normalized activation
\begin{equation}
y_{bn} = \gamma \hat{y} + \beta \\
y_{bn} = \frac{\gamma}{\sqrt{\sigma^2+\epsilon}} (y-\mu) + \beta \end{equation}


\begin{equation}
y_{bn} =  \gamma'((\sum_Nw_ix_i)+b-\mu) + \beta \\
\end{equation}

where
\begin{equation}
\gamma' = \frac{\gamma}{\sqrt{\sigma^2+\epsilon}}
\end{equation}
Let's rearrange the variables
\begin{equation}
y_{bn} =  \sum_N (\gamma' w_i)x_i+ \gamma'(b-\mu) + \beta \\
\end{equation}

The new convolution equation then becomes
\begin{equation}
y_{bn} =  \sum_N \hat{w_i}x_i+ \hat{\beta}\\
\end{equation}
where
\begin{equation}
\hat{w_i} = (\gamma' w_i)
\end{equation}
\begin{equation}
\hat{\beta} = \gamma'(b-\mu) + \beta 
\end{equation}

Now the batchnorm parameters have all been folded into the convolutional layer's weights and biases. Because these are pre-calculated when the weights are exported, the batchnorm layer is removed at no computational cost at inference time.
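As a quick sanity check of the derivation, the sketch below compares "dot product followed by batchnorm" with the folded version on random data. All shapes and values are hypothetical; the variable names simply mirror the symbols in the equations above.

```python
import numpy as np

rng = np.random.RandomState(0)
n_inputs, n_filters = 27, 8           # e.g. a flattened 3x3x3 kernel, 8 filters
x = rng.randn(n_inputs)               # input activations
w = rng.randn(n_inputs, n_filters)    # one column of weights per filter
b = rng.randn(n_filters)              # bias per filter

# Frozen batchnorm parameters, one value per filter.
gamma = rng.rand(n_filters) + 0.5
beta = rng.randn(n_filters)
mu = rng.randn(n_filters)
var = rng.rand(n_filters) + 0.1
eps = 1e-3

# Reference: dot product plus bias, followed by batchnorm.
y = x @ w + b
y_bn = gamma * (y - mu) / np.sqrt(var + eps) + beta

# Folded: batchnorm merged into the weights and bias.
gamma_p = gamma / np.sqrt(var + eps)
w_hat = gamma_p * w                   # broadcasts over the filter (last) axis
beta_hat = gamma_p * (b - mu) + beta
y_folded = x @ w_hat + beta_hat

print(np.allclose(y_bn, y_folded))    # True
```

The multiplication `gamma_p * w` relies on the filter axis being the last axis of the weights, which is also how the TF checkpoint stores them.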

### Implementation
We can now load the weights from the full-precision graph, perform the batchnorm folding manually, and see if we get the same result as the TFLite model.

In [45]:
import os
import sys
import numpy as np
import tensorflow as tf
from tensorflow.python import pywrap_tensorflow

import utils

print("Tensorflow", tf.__version__)
print("Python", sys.version)

# We now load the weight and bias from TFLite model
model_path = 'models/mobilenet_v1/mobilenet_v1_1.0_224_quant.tflite'
interpreter = tf.contrib.lite.Interpreter(model_path=model_path)                                         
interpreter.allocate_tensors()

# We load weights and biases of conv2d_0
# where we get the index from Tutorial 1.
tensor_idx = 8
conv2d_0_w={}
conv2d_0_w['detail'] = interpreter._get_tensor_details(tensor_idx)
conv2d_0_w['tensor'] = interpreter.get_tensor(tensor_idx)

tensor_idx = 6
conv2d_0_b={}
conv2d_0_b['detail'] = interpreter._get_tensor_details(tensor_idx)
conv2d_0_b['tensor'] = interpreter.get_tensor(tensor_idx)


Tensorflow 1.10.0
Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609]


In [62]:

class TFGraphReader():
    # load full-precision graph 
    def __init__(self, path="models/mobilenet_v1/mobilenet_v1_1.0_224_quant.ckpt"):
        self.reader = pywrap_tensorflow.NewCheckpointReader(path)
        self.tensors = sorted(self.reader.get_variable_to_shape_map())
        self.eps = 1e-3
        
    def batchnorm_fold(self, layer_name="Conv2d_0"):
        # read tensors
        beta = self.reader.get_tensor("MobilenetV1/%s/BatchNorm/beta"%layer_name)
        gamma = self.reader.get_tensor("MobilenetV1/%s/BatchNorm/gamma"%layer_name)
        moving_mean = self.reader.get_tensor("MobilenetV1/%s/BatchNorm/moving_mean"%layer_name)
        moving_variance = self.reader.get_tensor("MobilenetV1/%s/BatchNorm/moving_variance"%layer_name)
        
        weights = self.reader.get_tensor("MobilenetV1/%s/weights"%layer_name)
                
        try:
            biases = self.reader.get_tensor("MobilenetV1/%s/biases"%layer_name)
        except Exception:
            # no bias tensor stored for this layer, so treat the bias as zero
            biases = 0
            
        # perform folding
        gamma_ = gamma/np.sqrt(moving_variance+self.eps)        
        weight_ = gamma_*weights
        beta_ = gamma_*(biases-moving_mean)+beta
        
        # If we were to calculate the weight quantization offset and scale
        '''
        w_min = self.reader.get_tensor("MobilenetV1/MobilenetV1/%s/weights_quant/min"%layer_name)
        w_max = self.reader.get_tensor("MobilenetV1/MobilenetV1/%s/weights_quant/max"%layer_name)       
        quant_steps = 254 if narrow_range else 255
        w_scale = (w_max - w_min)/quant_steps
        w_offset = round(255 - w_max/w_scale)
        '''
        
        return weight_, beta_


# apply batchnorm to Conv2d_0 to return new weights and biases with batchnorm folded into them
full_graph = TFGraphReader()
w, b = full_graph.batchnorm_fold("Conv2d_0")

Now we quantize the new weights and biases. The weights are quantized to 8 bits, while the biases are stored as 32-bit integers in the TFLite model. For weight quantization, we can calculate the scale and offset as shown in the commented-out code above. However, for reasons of computational efficiency that I'll describe in the next tutorial, the scale and offset of the biases are worked out from the scales of the input activations, weights and output activations. Therefore, here we'll just use the parameters from the TFLite model to perform the quantization and compare the result with the quantized TFLite model.
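For reference, the commented-out calculation can be turned into a small standalone sketch. The helper name `weight_quant_params` and the min/max values are hypothetical, chosen only to give round numbers; in the real model they would come from the `weights_quant/min` and `weights_quant/max` tensors in the checkpoint.

```python
import numpy as np

def weight_quant_params(w_min, w_max, narrow_range=False):
    """Scale and offset for 8-bit weights, following the commented-out code."""
    quant_steps = 254 if narrow_range else 255
    scale = (w_max - w_min) / quant_steps
    offset = int(round(255 - w_max / scale))
    return scale, offset

def quantize(x, scale, offset):
    return np.round(x / scale + offset).astype(np.int32)

# Hypothetical weight range, not taken from the MobileNet checkpoint.
w_min, w_max = -0.5, 2.05
scale, offset = weight_quant_params(w_min, w_max)

# The range endpoints map to the ends of the 8-bit range, and 0.0 maps to the offset.
q = quantize(np.array([w_min, 0.0, w_max]), scale, offset)
print(scale, offset, q)
```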

In [63]:
def quantize(x, scale, offset):
    return (np.round(x/scale + offset)).astype(np.int32)

quant = conv2d_0_w['detail']['quantization']
bn_w = quantize(w, quant[0], quant[1])

quant = conv2d_0_b['detail']['quantization']
print(quant)
bn_b = quantize(b, quant[0], quant[1])

(0.00017052092880476266, 0)


Now let's compare them. Remember that the tensor layouts differ: the output feature is the last dimension in TF but the first dimension in TFLite.
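If you want to compare whole tensors rather than one filter at a time, a transpose brings the TF layout into the TFLite one. This is a small sketch on random stand-in data (the array `w_tf` is hypothetical, not the real weights), assuming the layouts are [height, width, in, out] for TF and [out, height, width, in] for TFLite, as the indexing in the cells below suggests.

```python
import numpy as np

# Hypothetical quantized weights in TF layout [height, width, in, out].
rng = np.random.RandomState(0)
w_tf = rng.randint(0, 256, size=(3, 3, 3, 32))

# Move the output-feature axis to the front to get [out, height, width, in].
w_tflite_layout = np.transpose(w_tf, (3, 0, 1, 2))

print(w_tflite_layout.shape)                                  # (32, 3, 3, 3)
print(np.array_equal(w_tf[:, :, :, 0], w_tflite_layout[0]))   # True
```

With the real tensors, the same transpose followed by `np.array_equal` would check every filter at once instead of inspecting them by eye.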

In [64]:
# Folded TF weights
print(bn_w[:,:,:,0])

[[[151 151 151]
  [151 150 151]
  [151 150 150]]

 [[151 151 151]
  [151 151 151]
  [151 150 150]]

 [[151 151 151]
  [151 151 150]
  [151 151 151]]]


In [65]:
# and from TFlite.
print(conv2d_0_w['tensor'][0,:,:,:])

[[[151 151 151]
  [151 150 151]
  [151 150 150]]

 [[151 151 151]
  [151 151 151]
  [151 150 150]]

 [[151 151 151]
  [151 151 150]
  [151 151 151]]]


In [66]:
# do the same for biases
print(bn_b)
print(conv2d_0_b['tensor'])

[ -7254  13465  -1591  -2488   9901  13359   1947  16203  -2165  -7399
  -5250  13549  17637   9441   3877  17663  -7985  13809  12408  11717
  -3441   -104 -13034  17888 -12487  15548  26765  -2599  14359  10137
  -2149  20334]
[ -7254  13465  -1591  -2488   9901  13359   1947  16203  -2165  -7399
  -5250  13549  17637   9441   3877  17663  -7985  13809  12408  11717
  -3441   -104 -13034  17888 -12487  15548  26765  -2599  14359  10137
  -2149  20334]


## What's Next?

Today we learned about batchnorm folding, which removes the batchnorm layers at inference time. Next we'll look at how to further reduce the additional computation that arises from dequantization-quantization on a fixed-point integer processor.