<a href="https://colab.research.google.com/github/swha815/colab/blob/main/quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<H1>Neural Network Quantization</H1>

Quantization converts single-precision floating-point values to low-bit-width integers (16 bits or fewer).


### I. Characteristics

#### Advantages

- Smaller memory footprint
- Faster computation (roughly 5X)

#### Disadvantages

- Loss of critical information when not done appropriately
- Extra burden of value conversion
  - Weights
    - Inference: can be pre-processed and used without modification during run-time
    - Training: must be quantized with every update (not true for Kahan summation-based methods)
  - Activation: input to NN must be quantized and output de-quantized
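The activation conversion burden can be illustrated with a quantize/de-quantize round trip. This is a minimal sketch assuming symmetric signed 8-bit quantization with a single scale factor; `quantize` and `dequantize` are hypothetical helper names, not part of the notebook.

```python
import numpy as np

def quantize(x, scale, bits=8):
    """Map real values to signed integers: q = clip(round(x * scale))."""
    q_max = 2 ** (bits - 1) - 1
    q_min = -(2 ** (bits - 1))
    return np.clip(np.round(x * scale), q_min, q_max).astype(np.int32)

def dequantize(q, scale):
    """Map integers back to approximate real values: x ~= q / scale."""
    return q.astype(np.float32) / scale

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
scale = 127.0 / np.max(np.abs(x))   # integer steps per unit of real range
q = quantize(x, scale)              # what the integer HW would consume
x_hat = dequantize(q, scale)        # what the caller gets back
```

For values inside the representable range, the round-trip error is bounded by half a quantization step, i.e. `0.5 / scale`.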

### II. Typical Process of NN Quantization

#### Prerequisite

1. Review the target HW architecture
1. Identify where MSB/LSB truncation error (round-off and clamping) occurs
1. Discard negligible truncation error
1. Consider techniques to minimize truncation error (e.g. BatchNorm folding)
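BatchNorm folding, mentioned in the last step, can be sketched as below. The example assumes inference-mode BatchNorm and a 1x1 convolution written as a matrix multiply; `fold_batchnorm` is a hypothetical helper, and the weight layout (output channel on the last axis) follows the Keras `Conv2D` convention.

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-3):
    """Fold BatchNorm parameters into the preceding layer's weights and bias.

    w is expected to carry the output channel on its last axis.
    """
    s = gamma / np.sqrt(var + eps)      # per-output-channel multiplier
    return w * s, (b - mean) * s + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))             # 4 samples, 3 input channels
w = rng.normal(size=(3, 2))             # 1x1 convolution == matrix multiply
b = rng.normal(size=2)
gamma, beta = rng.normal(size=2), rng.normal(size=2)
mean, var = rng.normal(size=2), rng.uniform(0.5, 2.0, size=2)

# Reference: layer followed by BatchNorm (inference mode)
y_ref = gamma * ((x @ w + b) - mean) / np.sqrt(var + 1e-3) + beta

# Folded: a single layer with rescaled weights and bias
w_f, b_f = fold_batchnorm(w, b, gamma, beta, mean, var)
y_fold = x @ w_f + b_f
```

Quantizing the folded weights rather than the original ones means the quantized values are the ones the hardware actually multiplies with, which avoids a separate rounding stage for the normalization.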

#### Weights

1. **Profile** and collect statistics from kernel
1. **Analyze** gathered information
1. **Decide** quantization strategy
  - Granularity: layer-wise, output channel-wise, full channel-wise, etc.
  - Symmetry: symmetric or asymmetric
  - Step-Size: uniform or non-uniform
1. **Simulate** HW kernel store by replacing original weights with quantized weights
  - Inference-only: quantize during pre-process stage
  - Training: quantize after every update
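The effect of the granularity choice can be seen on a toy kernel. The sketch below assumes symmetric signed 8-bit quantization with the scale derived from the largest absolute value (a simplification of the scale-factor logic used later in this notebook); `fake_quantize` is a hypothetical helper.

```python
import numpy as np

def fake_quantize(w, scale, bits=8):
    """Quantize to the signed integer grid, then map back to real values."""
    q_max = 2 ** (bits - 1) - 1
    q_min = -(2 ** (bits - 1))
    return np.clip(np.round(w * scale), q_min, q_max) / scale

rng = np.random.default_rng(0)
# Toy kernel (kh, kw, in, out) whose two output channels have very different ranges
w = rng.normal(size=(3, 3, 4, 2)) * np.array([0.01, 1.0])

q_max = 127
sf_layer = q_max / np.max(np.abs(w))                     # one scale for the layer
sf_channel = q_max / np.max(np.abs(w), axis=(0, 1, 2))   # one scale per output channel

# Sum of squared quantization error for each strategy
err_layer = np.sum((w - fake_quantize(w, sf_layer)) ** 2)
err_channel = np.sum((w - fake_quantize(w, sf_channel)) ** 2)
```

With a layer-wise scale, the small-range channel is forced onto the coarse grid dictated by the large-range channel, so `err_layer` comes out larger than `err_channel` here. Finer granularity costs extra scale factors that the hardware must store and apply.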

#### Activations

1. **Profile** and collect statistics from where truncation error is likely to occur (e.g. output feature map)
1. **Analyze** gathered information and try to find/fit an appropriate distribution
1. **Decide** quantization strategy (dependent on HW architecture)
  - Granularity: layer-wise, channel-wise, etc.
  - Symmetry: symmetric or asymmetric
  - Step-Size: uniform (linear) or non-uniform (quadratic or LUT-based)
1. **Simulate** integer-based HW with _fake_ quantization layer
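A _fake_ quantization step keeps tensors in floating point but restricts their values to the grid an integer datapath could represent. The following is a minimal NumPy sketch assuming asymmetric, uniform, unsigned 8-bit quantization; `fake_quant_asymmetric` is a hypothetical function, and the range `[0, 6]` is an illustrative stand-in for a profiled activation range.

```python
import numpy as np

def fake_quant_asymmetric(x, real_min, real_max, bits=8):
    """Quantize to unsigned integers with a zero point, then de-quantize."""
    q_max = 2 ** bits - 1
    scale = q_max / (real_max - real_min)        # integer steps per real unit
    zero_point = np.round(-real_min * scale)     # integer code representing 0.0
    q = np.clip(np.round(x * scale) + zero_point, 0, q_max)
    return (q - zero_point) / scale              # still float: "fake" quantization

# Toy activations; in practice real_min/real_max come from the profiling step
act = np.array([-0.5, 0.0, 0.7, 2.0, 5.9, 7.5])
out = fake_quant_asymmetric(act, real_min=0.0, real_max=6.0)
```

Values outside the chosen range are clamped to its ends, and values inside it move by at most half a step. Inserting such a function after a layer's output lets the rest of the floating-point model see the same error an integer implementation would introduce.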

### III. Example

Let's classify the following image.

- Network: InceptionV3
- Platform: Keras on TensorFlow

![image](https://raw.githubusercontent.com/swha815/colab/main/ILSVRC2012_val_00000002.JPEG)


In [133]:
import tensorflow as tf
import tensorflow.keras.applications as keras_app
import tensorflow.keras.preprocessing as keras_prep
import urllib.request
import cv2
import numpy as np
import matplotlib.pyplot as plt

In [134]:
def print_score(prediction):
  print('Class Scores')
  print('=' * 50)

  for p in prediction[0]:
    print('{:20}: {:7.5f}'.format(p[1], p[2]))

  print('-' * 50)

In [135]:
# Prepare model for ImageNet classification
model = keras_app.InceptionV3(weights='imagenet')
prep_mod = keras_app.inception_v3
img_size = (299, 299)

# Load and pre-process an image
req = urllib.request.urlopen('https://raw.githubusercontent.com/swha815/colab/main/ILSVRC2012_val_00000002.JPEG')
arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
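# NOTE: cv2.imdecode returns BGR channel order, whereas preprocess_input
# assumes RGB; cv2.cvtColor(img, cv2.COLOR_BGR2RGB) would align the two.
img = org_img = cv2.imdecode(arr, -1)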
img = cv2.resize(img, img_size)
img = np.expand_dims(img, axis=0)
img = prep_mod.preprocess_input(img)

# Predict
pred = model.predict(img)
pred = keras_app.imagenet_utils.decode_predictions(pred, top=5)

# Score prediction
print_score(pred)

Class Scores
==================================================
ski                 : 0.80311
alp                 : 0.07034
ski_mask            : 0.00493
mountain_tent       : 0.00260
shovel              : 0.00147
--------------------------------------------------


### IV. Weight Quantization

#### Profile and Analyze

In [136]:
weight_prof_dict = dict()

for layer in model.layers:
  if not isinstance(layer, tf.keras.layers.Conv2D):
    continue

  w = layer.get_weights()
  w_max = np.amax(w[0], axis=(0, 1))
  w_min = np.amin(w[0], axis=(0, 1))

  weight_prof_dict[layer.name] = (w_max, w_min)
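The reduction axes determine the granularity of the collected statistics. The sketch below uses a randomly generated kernel with the Keras `Conv2D` layout `(kh, kw, in_channels, out_channels)`; the shape is an arbitrary assumption chosen only to show the resulting dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 16, 32))     # (kh, kw, in_channels, out_channels)

# Reducing over the spatial axes only, as in the profiling loop above,
# keeps one statistic per (input channel, output channel) pair.
full_max = np.amax(kernel, axis=(0, 1))      # shape (16, 32): full channel-wise

# Coarser alternatives for comparison
out_max = np.amax(kernel, axis=(0, 1, 2))    # shape (32,): output channel-wise
layer_max = np.amax(kernel)                  # scalar: layer-wise
```

The profile gathered above therefore corresponds to the full channel-wise option.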

#### Decide and Simulate

In [137]:
bits = 8
signed = True
verbose = False

In [138]:
def get_int_range(bits, signed):
  if not isinstance(bits, int) or bits <= 0:
    raise Exception('Invalid bits specification.')

  if not isinstance(signed, bool):
    raise Exception('Invalid signed specification.')

  if signed:
    int_max = 2 ** (bits - 1) - 1
    int_min = -(2 ** (bits - 1))
  else:
    int_max = 2 ** bits - 1
    int_min = 0

  return (int_max, int_min)


def get_sf(signed, int_max, int_min, real_max, real_min):
  if np.any(real_max < real_min):
    raise Exception('Max is smaller than min.')

  if len(real_max) != len(real_min):
    raise Exception('real_max and real_min must be of equal lengths.')

  if not isinstance(signed, bool):
    raise Exception('Invalid signed specification.')

  if signed:
    sf_max = np.divide(int_max, real_max,
        out=np.ones_like(real_max), where=(real_max != 0))
    sf_min = np.divide(int_min, real_min,
        out=np.ones_like(real_min), where=(real_min != 0))
    sf = np.minimum(np.abs(sf_min), np.abs(sf_max))
  else:
    sf = np.divide(int_max, real_max,
        out=np.ones_like(real_max), where=(real_max != 0))
    sf = np.abs(sf)

  return sf


def quantize_numpy(org_vals, scale_factor, int_max, int_min):
  qvals = np.multiply(org_vals, scale_factor)
  qvals = np.minimum(int_max, qvals)
  qvals = np.maximum(int_min, qvals)
  qvals = np.round(qvals)
  qvals = np.divide(qvals, scale_factor)

  return qvals


def compress_model_param(model, bits, signed):
  log = list()

  for layer in model.layers:
    if layer.name not in weight_prof_dict:
      continue

    w = layer.get_weights()
    w_max = weight_prof_dict[layer.name][0]
    w_min = weight_prof_dict[layer.name][1]

    # calculate scale factor
    int_max, int_min = get_int_range(bits, signed)
    sf = get_sf(signed, int_max, int_min, w_max, w_min)
    
    # quantize weights with given scale factor
    qvals = quantize_numpy(w[0], sf, int_max, int_min)
    quant_loss = np.sum((w[0] - qvals) ** 2)

    # store quantized weights
    w[0] = qvals
    layer.set_weights(w)

    log.append([layer.name, bits, signed, quant_loss])

  return (model, log)
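A small worked example may help in following the signed scale-factor logic above. It mirrors the same arithmetic on scalar values; the real range `[-0.5, 0.25]` is a hypothetical profile, not taken from the model.

```python
import numpy as np

int_max, int_min = 127, -128             # signed 8-bit range
real_max, real_min = 0.25, -0.5          # hypothetical profiled weight range

# Each end of the real range suggests its own scale; taking the smaller one
# guarantees that neither end overflows the integer range.
sf = min(abs(int_max / real_max), abs(int_min / real_min))   # min(508, 256)

w = np.array([-0.5, 0.1, 0.25])
q = np.round(np.clip(w * sf, int_min, int_max))   # integer codes
w_hat = q / sf                                    # values the simulated HW would store
```

Here the negative end dominates, so `sf` is 256 and the positive half of the integer range is only partially used. This is the cost of a symmetric scheme on a skewed distribution.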

In [141]:
# Quantize weights
model, qlog = compress_model_param(model, bits, signed)

if verbose:
  print('Layer Parameter Loss')

  for l in qlog:
    print('  {} [{}b-{}] loss: {:.3f}'.format(l[0], l[1], l[2], l[3]))

total_loss = np.sum(np.array(qlog)[:, 3].astype(float))
print('Compressed {} layers (total loss: {:.3f})\n'.format(len(qlog), total_loss))

# Predict
qpred = model.predict(img)
qpred = keras_app.imagenet_utils.decode_predictions(qpred, top=5)

# Score prediction
print_score(qpred)

Compressed 94 layers (total loss: 0.083)

Class Scores
==================================================
ski                 : 0.80389
alp                 : 0.07269
ski_mask            : 0.00460
mountain_tent       : 0.00270
shovel              : 0.00144
--------------------------------------------------


### V. Activation Quantization