Source: https://www.tensorflow.org/model_optimization/guide/quantization/post_training

Post-training quantization covers general techniques that reduce CPU and hardware accelerator latency, processing, power consumption, and model size with little degradation in model accuracy. They can be applied to an already-trained float TensorFlow model during TensorFlow Lite conversion, where they are enabled as options in the TensorFlow Lite converter.



## Quantizing weights

Weights can be converted to types with reduced precision, such as 16-bit floats or 8-bit integers. We generally recommend 16-bit floats for GPU acceleration and 8-bit integers for CPU execution.

For example, here is how to specify 8-bit integer weight quantization:



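```python
import tensorflow as tf

# saved_model_dir is a placeholder: the path to the trained float SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Optimize.DEFAULT turns on weight quantization during conversion.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
```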

## Full integer quantization of weights and activations
Quantizing both weights and activations improves latency, processing, and power usage, and gives access to integer-only hardware accelerators. This requires a small representative dataset.



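```python
import tensorflow as tf

# Placeholders to define before running: saved_model_dir,
# num_calibration_steps, and input_value.

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a NumPy array in a method of your choosing,
        # then yield it wrapped in a list.
        yield [input_value]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# The converter calls this generator to calibrate activation ranges.
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
```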
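
To see why a representative dataset is needed, note that activation ranges, unlike weights, are unknown until real inputs flow through the model. The following is a minimal NumPy sketch of that idea, not the converter's actual calibrator: it tracks the min/max of the yielded samples and derives an int8 scale and zero point using the affine mapping `real = scale * (q - zero_point)`. The helper names (`calibrate_int8`, `quantize`, `dequantize`), the input shape, and the random stand-in data are all assumptions made for illustration, and a real calibration would record ranges for every activation tensor rather than only the input.

```python
import numpy as np

def representative_dataset_gen(num_calibration_steps=100, seed=0):
    """Yield one list of input arrays per step (random stand-in data)."""
    rng = np.random.default_rng(seed)
    for _ in range(num_calibration_steps):
        yield [rng.uniform(-1.0, 3.0, size=(1, 8)).astype(np.float32)]

def calibrate_int8(dataset_gen):
    """Derive int8 affine parameters from the observed value range."""
    observed_min, observed_max = np.inf, -np.inf
    for (sample,) in dataset_gen():
        observed_min = min(observed_min, float(sample.min()))
        observed_max = max(observed_max, float(sample.max()))
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    observed_min, observed_max = min(observed_min, 0.0), max(observed_max, 0.0)
    scale = (observed_max - observed_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # Degenerate case: every observed value was zero.
    zero_point = int(round(-128 - observed_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    """Map float values to int8, clipping anything outside the calibrated range."""
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 values back to float32."""
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = calibrate_int8(representative_dataset_gen)
sample = np.array([[0.0, 1.5, -0.5]], dtype=np.float32)
restored = dequantize(quantize(sample, scale, zero_point), scale, zero_point)
```

Values inside the calibrated range round-trip with an error of at most half a quantization step, which is why the calibration samples should cover the inputs the model will see in practice.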

Source: https://www.tensorflow.org/lite/performance/post_training_quantization

Post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy. You can quantize an already-trained float TensorFlow model when you convert it to TensorFlow Lite format using the [TensorFlow Lite Converter](https://www.tensorflow.org/lite/models/convert/).



### Optimization methods
There are several post-training quantization options to choose from. Here is a summary table of the choices and the benefits they provide:

| Technique                  | Benefits                     | Hardware                        |
|----------------------------|------------------------------|---------------------------------|
| Dynamic range quantization | 4x smaller, 2x-3x speedup    | CPU                             |
| Full integer quantization  | 4x smaller, 3x+ speedup      | CPU, Edge TPU, Microcontrollers |
| Float16 quantization       | 2x smaller, GPU acceleration | CPU, GPU                        |
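
The size figures in the table follow from the storage width of each weight value: 4 bytes for float32, 2 bytes for float16, and 1 byte for int8. The NumPy sketch below illustrates this on a single hypothetical weight tensor. The symmetric per-tensor scaling used for the int8 case is an assumption chosen for simplicity, and a real model file also carries metadata and any unquantized tensors, so whole-model ratios are approximate.

```python
import numpy as np

# Hypothetical weight tensor standing in for one layer of a float32 model.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((256, 256)).astype(np.float32)

# Float16 quantization: a cast to half precision, 2 bytes per value.
weights_fp16 = weights_fp32.astype(np.float16)

# 8-bit quantization: symmetric per-tensor scaling into [-127, 127],
# 1 byte per value plus a single float scale for the whole tensor.
scale = float(np.abs(weights_fp32).max()) / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

size_ratio_fp16 = weights_fp32.nbytes / weights_fp16.nbytes
size_ratio_int8 = weights_fp32.nbytes / weights_int8.nbytes

# Round-trip error of the int8 weights, bounded by half a quantization step.
max_int8_error = float(
    np.abs(weights_int8.astype(np.float32) * scale - weights_fp32).max()
)
```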


The following decision tree can help determine which post-training quantization method is best for your use case:
![Decision tree for choosing a post-training quantization method](https://www.tensorflow.org/static/lite/performance/images/optimization.jpg)

### Dynamic range quantization

Dynamic range quantization is a recommended starting point because it reduces memory usage and speeds up computation without requiring a representative dataset for calibration. This type of quantization statically quantizes only the weights from floating point to integer at conversion time, which provides 8 bits of precision:
```python
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
```

To further reduce latency during inference, "dynamic-range" operators dynamically quantize activations to 8 bits based on their range and perform computations with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored in floating point, so the speedup from dynamic-range ops is smaller than that of a full fixed-point computation.
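
The behaviour described above can be sketched in NumPy for a single matrix multiply. This is an illustration under stated assumptions rather than the TensorFlow Lite kernel: it uses symmetric per-tensor scaling for both weights and activations, and the function names `quantize_symmetric` and `dynamic_range_matmul` are hypothetical. Weights are quantized once, ahead of time; activations are quantized on each call from the range of the incoming data; the integer product is rescaled and returned as float32.

```python
import numpy as np

def quantize_symmetric(x):
    """Map a float tensor to int8 using a scale derived from its own range."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dynamic_range_matmul(activations, q_weights, weight_scale):
    """Multiply float activations by pre-quantized int8 weights."""
    # Activations are quantized at inference time, from this batch's range.
    q_act, act_scale = quantize_symmetric(activations)
    # Multiply-accumulate in int32 so the int8 products cannot overflow.
    acc = q_act.astype(np.int32) @ q_weights.astype(np.int32)
    # Rescale the integer accumulator and return a float32 output.
    return acc.astype(np.float32) * np.float32(act_scale * weight_scale)

rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 4)).astype(np.float32)
activations = rng.standard_normal((2, 16)).astype(np.float32)

# Done once, at conversion time.
q_weights, weight_scale = quantize_symmetric(weights)

# Done on every inference call.
approx = dynamic_range_matmul(activations, q_weights, weight_scale)
exact = activations @ weights
max_abs_error = float(np.abs(approx - exact).max())
```

The extra work of computing the activation scale on every call, together with the conversion back to float at the output, is the overhead that a fully fixed-point pipeline avoids.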

