### Model Optimization
Running an ML model and performing inference on mobile or embedded devices comes with challenges, because resources such as memory, power, and data storage are limited. It is therefore crucial to deploy only optimized and compressed ML models on such devices.

TensorFlow Lite (TFLite) provides the following strategies for optimizing/quantizing a model:

- Post-training quantization - quantization applied after the model is trained.
- Quantization-aware training - quantization applied during the model's training.

The following options are available for post-training quantization:

* No quantization - convert the TF model to TF Lite (.tflite) without any modification of the weight and activation values. This has no effect on the model itself; it only lets you load and use the model (.tflite) on your mobile devices. Note: the .tflite file will be slightly smaller than the original h5 file because it is stored as a FlatBuffer.


* Quantized model (weights only) - also called the hybrid approach; here only the weights of the trained model are quantized. You can quantize the weights from 32-bit floating point to either 16-bit floating point or 8-bit integer, which compresses the model by roughly 2x or 4x, respectively. However, at inference time these 16-bit or 8-bit values are cast back to 32 bits for computation, because the activation values are still computed in 32-bit floating point.


* Full integer quantization of weights and activations - in addition to the weights, the activation values are also quantized, to either FP16 or INT8. For INT8, for example, values are represented in the range -128 to 127, so a scaling must be computed that maps FP32 values to INT8 values. These scaling parameters are estimated from a representative dataset (e.g., the validation or test dataset).
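
To make the FP32-to-INT8 mapping described above concrete, here is a minimal pure-Python sketch of affine (scale and zero-point) quantization. It is an illustration only and not TFLite's actual implementation. The helper names `compute_qparams`, `quantize`, and `dequantize` and the sample values in `representative_activations` are hypothetical.

```python
def compute_qparams(values, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping the observed float range onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0  # degenerate case: all values are zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an integer in [qmin, qmax], clipping out-of-range values."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return scale * (q - zero_point)


# Hypothetical activation values observed on a representative dataset.
representative_activations = [-1.0, -0.25, 0.0, 0.5, 3.0]

scale, zero_point = compute_qparams(representative_activations)
quantized = [quantize(v, scale, zero_point) for v in representative_activations]
restored = [dequantize(q, scale, zero_point) for q in quantized]

for v, q, r in zip(representative_activations, quantized, restored):
    print(f"float {v:+.4f} -> int8 {q:+4d} -> float {r:+.4f}")
```

The restored values differ slightly from the originals; this rounding error is the accuracy cost of quantization, and it grows when the representative data does not cover the range seen at inference time.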


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.preprocessing.image import load_img 
from tensorflow.keras.preprocessing.image import img_to_array
from vis.utils import utils
from vis.visualization import visualize_cam, overlay
import matplotlib.pyplot as plt 
import matplotlib.image as mpimg

%matplotlib inline

Using TensorFlow backend.


In [3]:
# configuration parameters 
TEST_DATA_DIR = '/Users/sanchit/Documents/Projects/Datasets/fire_and_smoke_data/test/'
MODEL_PATH = "./models/mobilenetv2.h5"
TFLITE_MODEL_DIR = "./models/tflite/"
TEST_SAMPLES = 430
NUM_CLASSES = 2
IMG_WIDTH, IMG_HEIGHT = 224, 224
BATCH_SIZE = 64
LABELS = ["fire", "nofire"]

### Load model

In [4]:
model = load_model(MODEL_PATH)

In [5]:
# create a directory to save the tflite models if it does not exist
os.makedirs(TFLITE_MODEL_DIR, exist_ok=True)

### 1. Simple Conversion to TFLITE without quantization

Reference - see how to convert to a TFLite model: https://www.tensorflow.org/lite/performance/post_training_integer_quant

In [6]:
# convert a tf.Keras model to tflite model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# write the model to a .tflite file in binary mode
tflite_no_quant_file = os.path.join(TFLITE_MODEL_DIR, "mobilenet_no_quant.tflite")
with open(tflite_no_quant_file, "wb") as f:
    f.write(tflite_model)

# Note: expect hardly any reduction in size, since no quantization is applied

ConverterError: See console for info.
2020-02-21 14:59:37.709109: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Selu
2020-02-21 14:59:37.717640: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before Removing unused ops: 371 operators, 694 arrays (0 quantized)
2020-02-21 14:59:37.728331: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before general graph transformations: 371 operators, 694 arrays (0 quantized)
2020-02-21 14:59:37.748652: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] After general graph transformations pass 1: 72 operators, 187 arrays (0 quantized)
2020-02-21 14:59:37.749781: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before Group bidirectional sequence lstm/rnn: 72 operators, 187 arrays (0 quantized)
2020-02-21 14:59:37.750587: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before dequantization graph transformations: 72 operators, 187 arrays (0 quantized)
2020-02-21 14:59:37.751290: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before Identify nearest upsample.: 72 operators, 187 arrays (0 quantized)
2020-02-21 14:59:37.753117: I tensorflow/lite/toco/allocate_transient_arrays.cc:345] Total transient array allocated size: 10523008 bytes, theoretical optimal value: 9720192 bytes.
2020-02-21 14:59:37.753431: I tensorflow/lite/toco/toco_tooling.cc:471] Number of parameters: 2371084
2020-02-21 14:59:37.758308: E tensorflow/lite/toco/toco_tooling.cc:498] We are continually in the process of adding support to TensorFlow Lite for more ops. It would be helpful if you could inform us of how this conversion went by opening a github issue at https://github.com/tensorflow/tensorflow/issues/new?template=40-tflite-op-request.md
 and pasting the following:

Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If those are native TensorFlow operators, you might be able to use the extended runtime by passing --enable_select_tf_ops, or by setting target_ops=TFLITE_BUILTINS,SELECT_TF_OPS when calling tf.lite.TFLiteConverter(). Otherwise, if you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: ADD, CONV_2D, DEPTHWISE_CONV_2D, FULLY_CONNECTED, MEAN, PAD, SOFTMAX. Here is a list of operators for which you will need custom implementations: Selu.
Traceback (most recent call last):
  File "/anaconda3/envs/tf_venv_2.1/bin/toco_from_protos", line 8, in <module>
    sys.exit(main())
  File "/anaconda3/envs/tf_venv_2.1/lib/python3.7/site-packages/tensorflow_core/lite/toco/python/toco_from_protos.py", line 93, in main
    app.run(main=execute, argv=[sys.argv[0]] + unparsed)
  File "/anaconda3/envs/tf_venv_2.1/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/anaconda3/envs/tf_venv_2.1/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/anaconda3/envs/tf_venv_2.1/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/anaconda3/envs/tf_venv_2.1/lib/python3.7/site-packages/tensorflow_core/lite/toco/python/toco_from_protos.py", line 56, in execute
    enable_mlir_converter)
Exception: We are continually in the process of adding support to TensorFlow Lite for more ops. It would be helpful if you could inform us of how this conversion went by opening a github issue at https://github.com/tensorflow/tensorflow/issues/new?template=40-tflite-op-request.md
 and pasting the following:

Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If those are native TensorFlow operators, you might be able to use the extended runtime by passing --enable_select_tf_ops, or by setting target_ops=TFLITE_BUILTINS,SELECT_TF_OPS when calling tf.lite.TFLiteConverter(). Otherwise, if you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: ADD, CONV_2D, DEPTHWISE_CONV_2D, FULLY_CONNECTED, MEAN, PAD, SOFTMAX. Here is a list of operators for which you will need custom implementations: Selu.
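
The error message above suggests two ways forward for the unsupported `Selu` operator. The first can be sketched as follows, letting the converter fall back to TensorFlow kernels for ops that have no TFLite builtin. This is a minimal sketch and has not been run against this model; whether `Selu` is covered by the select TF ops depends on your TensorFlow version. The helper name `convert_with_select_tf_ops` is hypothetical.

```python
def convert_with_select_tf_ops(keras_model):
    """Convert a tf.keras model, allowing fallback to select TensorFlow ops."""
    import tensorflow as tf  # imported here so the helper can be defined before TF is loaded

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,  # use builtin TFLite kernels where available
        tf.lite.OpsSet.SELECT_TF_OPS,    # otherwise fall back to TensorFlow kernels
    ]
    return converter.convert()
```

It would be called as `tflite_model = convert_with_select_tf_ops(model)`. A model converted this way needs a TFLite runtime built with select TF ops support on the device, which increases the binary size. An alternative, if retraining is acceptable, is to replace the SELU activation with one that has a builtin TFLite kernel.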




### 2. Conversion to TFLITE with quantization (FP16 and INT8) of weights only

#### 2.1 Conversion to INT8
Reference - Check weight quantization section: https://www.tensorflow.org/lite/performance/post_training_quantization 

In [None]:
# convert a tf.keras model to a tflite model with INT8 quantization of the weights
# Note: with only `optimizations` set, the weights are quantized to INT8 by default
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Optimize.DEFAULT is used here; newer TensorFlow releases deprecate OPTIMIZE_FOR_SIZE in its favor
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# write the model to a .tflite file in binary mode
tflite_int8_weights_file = os.path.join(TFLITE_MODEL_DIR, "mobilenet_weights_int8_quant.tflite")
with open(tflite_int8_weights_file, "wb") as f:
    f.write(tflite_model)

# Note: you should see a roughly 4x reduction in the model size

#### 2.2 Conversion to Float16
Reference - https://www.tensorflow.org/lite/performance/post_training_float16_quant

In [None]:
# convert a tf.keras model to a tflite model with float16 quantization of the weights
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# set optimization to DEFAULT and set float16 as the supported type on the target platform
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

# write the model to a .tflite file in binary mode
tflite_fp16_weights_file = os.path.join(TFLITE_MODEL_DIR, "mobilenet_weights_float16_quant.tflite")
with open(tflite_fp16_weights_file, "wb") as f:
    f.write(tflite_model)

# Note: you should see a roughly 2x reduction in the model size
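
The expected 2x and 4x reductions follow from the number of bytes stored per weight. As a rough back-of-the-envelope sketch, the snippet below estimates the weight storage for each data type from the parameter count reported in the converter log earlier in this notebook. It is only an estimate: it assumes every parameter is stored in the given type and ignores FlatBuffer overhead and any tensors left unquantized. The names `BYTES_PER_VALUE` and `estimated_weight_bytes` are hypothetical.

```python
NUM_PARAMETERS = 2371084  # "Number of parameters" reported in the converter log

# Storage cost of a single value for each data type.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def estimated_weight_bytes(num_parameters, dtype):
    """Estimate the bytes needed to store all parameters in the given dtype."""
    return num_parameters * BYTES_PER_VALUE[dtype]


estimates = {
    dtype: estimated_weight_bytes(NUM_PARAMETERS, dtype) for dtype in BYTES_PER_VALUE
}

for dtype, size in estimates.items():
    ratio = estimates["float32"] / size
    print(f"{dtype}: {size / 1e6:.2f} MB (float32 size / {dtype} size = {ratio:.0f}x)")
```

To compare these estimates with the real files, `os.path.getsize` can be called on each `.tflite` file written above.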

### 3. Conversion to TFLITE with quantization (FP16 and INT8) of weights and activations 
#### 3.1 Conversion to INT8 