### Model Optimization
Running an ML model for inference on mobile or embedded devices is challenging because resources such as memory, power, and data storage are limited. It is therefore crucial to deploy only optimized and compressed ML models on such devices.

TensorFlow Lite (TFLite) provides the following strategies for optimizing/quantizing a model:

- Post-training quantization - quantization applied after the model is trained.
- Quantization-aware training - quantization applied during the model's training.

The following options are available for post-training quantization:

* No quantization - convert the TF model to TF Lite (.tflite) without modifying the weight and activation values. This does not change the model's behavior; it only lets you load and use the model (.tflite) on your mobile devices. Note: the file will be slightly smaller than the original h5 file because TFLite uses FlatBuffers.


* Quantized model (weights only) - also called the hybrid approach. Only the weights of the trained model are quantized, from 32-bit floating point to either 16-bit floating point or 8-bit integer. This compresses the model by roughly 2x or 4x, respectively. At inference time, however, these 16-bit or 8-bit values are cast back to 32 bits for computation, because the activation values are still computed in 32-bit floating point.


* Full integer quantization of weights and activations - in addition to the weights, the activation values are also quantized, to either 16FP or 8INT. For 8INT, for example, values are represented in the range -128 to 127, so a scaling that maps 32FP values to 8INT values has to be calculated. These scaling parameters are estimated from a representative dataset (e.g. the validation or test dataset).
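
The 2x and 4x figures quoted for weight-only quantization follow from the number of bytes needed per weight. Below is a minimal sketch of that arithmetic. It assumes a hypothetical model with 4 million weights (an illustrative number, not measured from the model in this notebook) and ignores file-format overhead; the helper names are made up for the example.

```python
# Bytes needed to store one weight in each representation.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(num_weights, dtype):
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_weights * BYTES_PER_WEIGHT[dtype] / 1e6


def compression_ratio(dtype, baseline="float32"):
    """How many times smaller `dtype` weights are than `baseline` weights."""
    return BYTES_PER_WEIGHT[baseline] / BYTES_PER_WEIGHT[dtype]


NUM_WEIGHTS = 4_000_000  # hypothetical parameter count, for illustration only

for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {weights_size_mb(NUM_WEIGHTS, dtype):.1f} MB, "
          f"ratio vs float32: {compression_ratio(dtype):.0f}x")
```

Real .tflite files also contain the graph structure and metadata, so measured ratios will usually be close to, but not exactly, these values.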


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.mobilenet import preprocess_input
from tensorflow.keras.preprocessing.image import load_img 
from tensorflow.keras.preprocessing.image import img_to_array
from vis.utils import utils
from vis.visualization import visualize_cam, overlay
import matplotlib.pyplot as plt 
import matplotlib.image as mpimg

%matplotlib inline

Using TensorFlow backend.


In [3]:
# configuration parameters 
TEST_DATA_DIR = '/Users/sanchit/Documents/Projects/Datasets/animals/test/'
MODEL_PATH = "./models/mobilenet.h5"
TFLITE_MODEL_DIR = "./models/tflite/"
TEST_SAMPLES = 450
NUM_CLASSES = 3
IMG_WIDTH, IMG_HEIGHT = 224, 224
BATCH_SIZE = 64
LABELS = ["cats", "dogs", "panda"]

### Load model

In [4]:
model = load_model(MODEL_PATH)

In [5]:
# create a directory to save the tflite models if it does not exist
os.makedirs(TFLITE_MODEL_DIR, exist_ok=True)

### 1. Simple Conversion to TFLITE without quantization

Reference - see the section on converting to a TFLite model: https://www.tensorflow.org/lite/performance/post_training_integer_quant

In [None]:
# convert a tf.Keras model to tflite model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# write the model to a tflite file as binary file
tflite_no_quant_file = TFLITE_MODEL_DIR + "mobilenet_no_quant.tflite"
with open(tflite_no_quant_file, "wb") as f:
    f.write(tflite_model)
    
# Note: there should be hardly any reduction in size

### 2. Conversion to TFLITE with quantization (FP16 and INT8) of weights only

#### 2.1 Conversion to INT8 
Reference - Check weight quantization section: https://www.tensorflow.org/lite/performance/post_training_quantization 

In [None]:
# convert a tf.Keras model to a tflite model with INT8 quantization of the weights
# Note: when an optimization is set and nothing else is specified, INT8 is the default
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Optimize.DEFAULT replaces the deprecated OPTIMIZE_FOR_SIZE, which behaves the same
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# write the model to a tflite file as binary file
tflite_int8_weights_file = TFLITE_MODEL_DIR + "mobilenet_weights_int8_quant.tflite"
with open(tflite_int8_weights_file, "wb") as f:
    f.write(tflite_model)

# Note: you should see roughly a 4x reduction in the model size

#### 2.2 Conversion to Float16
Reference - https://www.tensorflow.org/lite/performance/post_training_float16_quant

In [None]:
# convert a tf.Keras model to a tflite model with float16 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# set optimization to DEFAULT and set float16 as the supported type on the target platform
converter.optimizations = [tf.lite.Optimize.DEFAULT] 
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

# write the model to a tflite file as binary file
tflite_fp16_weights_file = TFLITE_MODEL_DIR + "mobilenet_weights_float16_quant.tflite"
with open(tflite_fp16_weights_file, "wb") as f:
    f.write(tflite_model)

# Note: you should see roughly a 2x reduction in the model size
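
To check the size notes in the cells above against the files actually written, a small helper along these lines could be used. This is only a sketch: `size_report` is a hypothetical helper, and the file paths are the ones used in this notebook, so adjust them if yours differ.

```python
import os


def size_report(paths):
    """Return {path: (size_in_mb, reduction_vs_baseline)} for the files that exist.

    The first existing path is treated as the baseline.
    """
    sizes = {p: os.path.getsize(p) for p in paths if os.path.exists(p)}
    if not sizes:
        return {}
    baseline = next(iter(sizes.values()))
    return {
        p: (s / 1e6, baseline / s if s else float("inf"))
        for p, s in sizes.items()
    }


# File names taken from the conversion cells above.
candidates = [
    "./models/tflite/mobilenet_no_quant.tflite",
    "./models/tflite/mobilenet_weights_int8_quant.tflite",
    "./models/tflite/mobilenet_weights_float16_quant.tflite",
]
for path, (mb, ratio) in size_report(candidates).items():
    print(f"{path}: {mb:.2f} MB ({ratio:.1f}x smaller than the baseline)")
```

Files that have not been written yet are skipped, so the loop prints nothing if no conversion cell has been run.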

### 3. Conversion to TFLITE with quantization (FP16 and INT8) of both weights and activations

Because a single byte can only represent integers from -128 to +127, converting activations from 32-bit floating point to 8-bit integers requires a calibration step to determine the scaling parameters. This is done by running several examples of the input data (e.g. the test data) through the floating-point model, which gives an estimate of the minimum and maximum values of the activations. These extremes are then mapped to -128 and +127, respectively, to calculate the scaling parameters.
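
The calibration idea can be sketched in plain Python. This is not TFLite's internal implementation, only an illustration of one common affine scheme in which `real ≈ scale * (q - zero_point)`. The function names and the sample activation values are made up for the example, and the sketch assumes the observed range is not empty (maximum greater than minimum).

```python
QMIN, QMAX = -128, 127  # representable range of a signed 8-bit integer


def calibrate(batches):
    """Track the running min/max over all batches and derive (scale, zero_point)."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo, hi = min(lo, min(batch)), max(hi, max(batch))
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (QMAX - QMIN)
    zero_point = round(QMIN - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to int8, clamping values outside the calibrated range."""
    q = round(x / scale) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return scale * (q - zero_point)


# hypothetical activation values observed on two calibration batches
observed = [[-1.0, 0.25, 0.5], [0.1, 1.55, -0.3]]
scale, zero_point = calibrate(observed)

for value in (-1.0, 0.0, 1.55, 10.0):
    q = quantize(value, scale, zero_point)
    print(f"{value} -> {q} -> {dequantize(q, scale, zero_point):.2f}")
```

The observed minimum and maximum land on -128 and +127, and a value outside the calibrated range (10.0 here) is clamped to +127, which is why the representative data should cover the inputs expected at inference time.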

References: 
- https://www.tensorflow.org/lite/performance/post_training_integer_quant 
- Full integer quantization of weights and activations: https://www.tensorflow.org/model_optimization/guide/quantization 
- How to enable post training quantization: https://blog.tensorflow.org/2019/06/tensorflow-integer-quantization.html 


In [67]:
# create a test image generator with a batch size of 1 
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
test_generator = test_datagen.flow_from_directory(
    TEST_DATA_DIR,
    target_size=(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
    batch_size=1,
    shuffle=False,
    class_mode='categorical')

Found 450 images belonging to 3 classes.


In [68]:
def representative_data_gen():
    """Yield the test images one by one, in the form the TFLite converter expects."""
    for _ in range(len(test_generator.filenames)):
        # the iterator returns an (images, labels) tuple
        images, _labels = next(test_generator)
        # print(f"image yielded with dim: {images.shape}")  # for debug only
        yield [np.array(images, dtype=np.float32, ndmin=2)]

# For debugging only (create the generator once, then pull items from it)
# num_imgs = 2
# gen = representative_data_gen()
# for _ in range(num_imgs):
#     img = next(gen)
#     print(img)
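
Calibration can often work with a subset of the data rather than every test image. Below is a minimal sketch of a generator factory that caps the number of samples. `make_representative_gen` and `max_samples` are hypothetical names, the default of 100 is an arbitrary illustrative choice, and the nested lists stand in for preprocessed image arrays.

```python
from itertools import islice


def make_representative_gen(samples, max_samples=100):
    """Build a zero-argument generator function yielding at most `max_samples` items.

    Each yielded item is a one-element list, matching the [input] structure
    yielded by representative_data_gen above.
    """
    def gen():
        for sample in islice(samples, max_samples):
            yield [sample]
    return gen


# stand-in data: five fake "images"; with real data you would pass an
# iterable of preprocessed float32 arrays instead
fake_images = [[float(i)] * 4 for i in range(5)]
representative_gen = make_representative_gen(fake_images, max_samples=3)
print(sum(1 for _ in representative_gen()))
```

Note that if `samples` is a one-shot iterator rather than a list, the returned generator function can only be consumed once.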

#### 3.1 Conversion to INT8 

In [66]:
# convert a tf.Keras model to tflite model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

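# assign the representative data generator to representative_dataset;
# the converter uses it to calibrate the activation ranges
# Note: with these settings the model input and output stay float32
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()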

# write the model to a tflite file as binary file
tflite_both_quant_file = TFLITE_MODEL_DIR + "mobilenet_both_int8_quant.tflite"
with open(tflite_both_quant_file, "wb") as f:
    f.write(tflite_model)

#### 3.2 Conversion to FP16

In [69]:
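# convert a tf.Keras model to a tflite model with float16 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store the weights in float16
# Note: the TFLite float16 guide does not use calibration data, so the
# representative dataset is likely not needed in this mode
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()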

# write the model to a tflite file as binary file
tflite_both_quant_file = TFLITE_MODEL_DIR + "mobilenet_both_fp16_quant.tflite"
with open(tflite_both_quant_file, "wb") as f:
    f.write(tflite_model)

### 5. GPU Delegates