# Post-Training Quantization

Task as defined by Joosep:

"I would perhaps start with post-training quantization", either in TensorFlow:
https://www.tensorflow.org/model_optimization/guide/quantization/post_training

or on top of the exported ONNX model:
https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html

"You need to set up basic code that performs post-training quantization on a trained model and evaluates the speed and the loss of physics performance."
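
For the speed part of the task, a minimal timing harness could look like the sketch below. It assumes each model variant can be wrapped in a zero-argument Python callable. The names `median_latency`, `speedup`, and the two stub functions are hypothetical and stand in for real inference calls; they are not part of TensorFlow or ONNX Runtime.

```python
import statistics
import time


def median_latency(run_once, n_warmup=5, n_runs=30):
    """Return the median wall-clock time (seconds) of one call to run_once."""
    for _ in range(n_warmup):  # warm-up calls are executed but not timed
        run_once()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_seconds, candidate_seconds):
    """A ratio above 1 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical stand-ins for "run the float model once" and
# "run the quantized model once"; replace with real inference calls.
def float_model_stub():
    return sum(i * 0.5 for i in range(20000))


def quantized_model_stub():
    return sum(i * 0.5 for i in range(5000))


t_float = median_latency(float_model_stub)
t_quant = median_latency(quantized_model_stub)
print(f"float: {t_float:.6f} s, quantized: {t_quant:.6f} s, "
      f"speedup: {speedup(t_float, t_quant):.2f}x")
```

The median is used instead of the mean so that occasional slow runs (for example from other processes on a shared machine) have less influence on the result.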

## Post-Training Quantization in TensorFlow

Post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy.

### Optimization methods
There are several post-training quantization options to choose from. 

| Model Technique         | Benefits                            | Hardware                                    |
|-------------------------|------------------------------------|---------------------------------------------|
| Dynamic range quantization | 4x smaller, 2x-3x speedup       | CPU                                         |
| Full integer quantization  | 4x smaller, 3x+ speedup         | CPU, Edge TPU, Microcontrollers             |
| Float16 quantization      | 2x smaller, GPU acceleration   | CPU, GPU                                    |
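
The size factors in the table follow from the number of bits used to store each weight. A rough sketch of that arithmetic, assuming a hypothetical model with 10 million parameters and ignoring metadata such as scales and zero-points:

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8}


def weight_bytes(n_params, dtype):
    """Bytes needed to store n_params weights in the given format."""
    return n_params * BITS_PER_WEIGHT[dtype] // 8


n_params = 10_000_000  # hypothetical parameter count, for illustration only
baseline = weight_bytes(n_params, "float32")

for dtype in BITS_PER_WEIGHT:
    size = weight_bytes(n_params, dtype)
    print(f"{dtype}: {size / 1e6:.0f} MB, "
          f"{baseline / size:.0f}x smaller than float32")
```

The speedup figures in the table, unlike the size factors, depend on the hardware and the operators in the model, so they need to be measured rather than computed.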


The following decision tree can help determine which post-training quantization method is best for your use case:
![Decision tree for choosing a post-training quantization method](https://www.tensorflow.org/static/lite/performance/images/optimization.jpg)

We will start with dynamic range quantization because it reduces memory usage and speeds up computation without requiring a representative dataset for calibration. This type of quantization statically quantizes only the weights from floating point to integer at conversion time, which provides 8 bits of precision:
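```python
import tensorflow as tf

# Placeholder: directory containing the trained model in SavedModel format.
saved_model_dir = "path/to/saved_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Default optimizations without a representative dataset: dynamic range quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()  # serialized TFLite model (bytes)
```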
To further reduce latency during inference, "dynamic-range" operators dynamically quantize activations to 8 bits based on their range and perform computations with 8-bit weights and activations. This optimization provides latencies close to those of fully fixed-point inference. However, the outputs are still stored in floating point, so the speedup from dynamic-range ops is smaller than that of a full fixed-point computation.
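
The behaviour described above can be imitated in NumPy as a rough illustration. This is a simplified sketch, not the TensorFlow Lite implementation: it assumes symmetric per-tensor quantization with a single scale, and `quantize_symmetric` and `dynamic_range_dense` are hypothetical helper names.

```python
import numpy as np


def quantize_symmetric(x):
    """Map a float array to int8 using one scale; returns (q, scale)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 8)).astype(np.float32)

# "Conversion time": the weights are quantized once and stored as int8.
w_q, w_scale = quantize_symmetric(weights)


def dynamic_range_dense(x):
    """Dense layer with static int8 weights and dynamically quantized input."""
    # "Inference time": the activation scale comes from this input's range.
    x_q, x_scale = quantize_symmetric(x)
    # Integer matrix product, accumulated in int32 to avoid overflow.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    # The result is converted back to floating point.
    return acc.astype(np.float32) * (x_scale * w_scale)


x = rng.normal(size=(4, 16)).astype(np.float32)
y_float = x @ weights
y_quant = dynamic_range_dense(x)
rel_err = np.max(np.abs(y_float - y_quant)) / np.max(np.abs(y_float))
print(f"max error relative to largest output: {rel_err:.4f}")
```

The weight scale is fixed when the model is converted, while the activation scale is recomputed for every input, which is why no calibration dataset is needed.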


**Quantization involves reducing the precision of the weights and activations in a model, typically from 32-bit floating point values to 8-bit integers.**
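
As a small worked example of this precision reduction, the sketch below maps a few float32 values to int8 and back using a scale and a zero-point. It is a generic affine scheme written for illustration; `affine_quantize` and `dequantize` are hypothetical helpers, not TensorFlow functions.

```python
import numpy as np


def affine_quantize(x, qmin=-128, qmax=127):
    """Quantize float values to int8; returns (q, scale, zero_point)."""
    # Extend the range to include 0 so that 0.0 is represented exactly.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # fall back if range is 0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale


x = np.array([-1.0, -0.25, 0.0, 0.4, 2.0], dtype=np.float32)
q, scale, zero_point = affine_quantize(x)
x_hat = dequantize(q, scale, zero_point)
max_err = float(np.max(np.abs(x - x_hat)))

print("int8 values:   ", q)
print("reconstructed: ", x_hat)
print(f"scale = {scale:.6f}, zero_point = {zero_point}, max error = {max_err:.6f}")
```

Each value is stored in 8 bits instead of 32, and the reconstruction error per value stays on the order of one quantization step (`scale`).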

```python
import logging
logging.getLogger("tensorflow").setLevel(logging.DEBUG)

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pathlib

import h5py
import pickle
```

```python
# Reading the Weights files
# Open the HDF5 file
with h5py.File('/afs/cern.ch/user/s/sraj/sraj/Work_/CUA_20--/MLPF/mlpf/weights-96-5.346523.hdf5', 'r') as file:
    group = file['your_group_name']
```

```
OSError: Unable to open file (file signature not found)
```

```python
with open('/afs/cern.ch/user/s/sraj/sraj/Work_/CUA_20--/MLPF/mlpf/opt-96-5.346523.pkl', 'rb') as file:
    # Load the contents of the file
    data = pickle.load(file)
```

```
UnpicklingError: invalid load key, '<'.
```
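
Both errors are consistent with the files not being in the expected binary format. The `'<'` load key means the pickle file starts with the character `<`, which would fit a text file such as an HTML page, though that is only a guess. A quick way to check is to look at the first bytes of each file. The sketch below is a hypothetical helper (`sniff_file` is not part of `h5py` or `pickle`), only inspects offset 0, and is demonstrated on a temporary stand-in file rather than the real weights:

```python
import os
import tempfile

# An HDF5 file starts with this 8-byte signature.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def sniff_file(path, n_bytes=16):
    """Guess the file type from its first bytes (offset 0 only)."""
    with open(path, "rb") as handle:
        head = handle.read(n_bytes)
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"
    if head[:1] == b"\x80":  # pickle protocol 2 and newer start with 0x80
        return "pickle (protocol 2 or newer)"
    if head.lstrip()[:1] == b"<":
        return "text markup (HTML/XML?)"
    return "unknown"


# Demonstration on a stand-in file that contains HTML instead of HDF5 data.
with tempfile.TemporaryDirectory() as tmp:
    fake_path = os.path.join(tmp, "fake_weights.hdf5")
    with open(fake_path, "wb") as handle:
        handle.write(b"<!DOCTYPE html><html><body>not a model</body></html>")
    result = sniff_file(fake_path)

print(result)
```

If the real files turn out to be text, they probably need to be downloaded or copied again before the weights can be loaded.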