<a href="https://colab.research.google.com/github/steveseguin/facetracker/blob/master/optimize_keras_model_to_tflite_post_training_float16_quant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
! pip uninstall -y tensorflow
! pip install -U tf-nightly
#! pip install -U tensorflowjs

In [0]:
from keras.models import Model, load_model
#import tensorflowjs as tfjs
import tensorflow as tf

#tf.enable_eager_execution()

import numpy as np

# tf.logging is exposed under tf.compat.v1 in TensorFlow 2.x builds.
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.DEBUG)

In [0]:
! git clone --depth 1 https://github.com/steveseguin/facetracker.git

In [0]:
tf.float16  # the dtype that tf.lite.constants.FLOAT16 aliases


In [0]:
!ls -lh {"./facetracker/"}
tf.compat.v1.disable_eager_execution()
#model = load_model('./facetracker/full.h5', compile=False)
# from_keras_model_file is the TF1-style entry point, reachable through tf.compat.v1.
converter = tf.compat.v1.lite.TFLiteConverter.from_keras_model_file("./facetracker/full.h5")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Ask the converter to store weights as float16.
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
with open("./facetracker/face_model.tflite", "wb") as f:
  f.write(tflite_model)
!ls -lh {"./facetracker/"}
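
The two `ls -lh` listings in the cell above are the only size check in this notebook. As a rough sketch, the same comparison can be made in Python. The helper names `size_mb` and `report_sizes` are hypothetical; the paths in the usage comment are the ones used in the conversion cell.

```python
import os

def size_mb(path):
    """Return the size of the file at `path` in mebibytes."""
    return os.path.getsize(path) / (1024 * 1024)

def report_sizes(original_path, converted_path):
    """Print both file sizes and return converted/original as a ratio."""
    original = size_mb(original_path)
    converted = size_mb(converted_path)
    print("original:  %.2f MB" % original)
    print("converted: %.2f MB" % converted)
    return converted / original

# Usage (paths from the conversion cell above):
# ratio = report_sizes("./facetracker/full.h5", "./facetracker/face_model.tflite")
```

Float16 weights take half the bytes of float32 weights, so a ratio near 0.5 is plausible for a weight-dominated model, but the `.h5` and `.tflite` containers carry different overhead, so treat the number as indicative only.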

In [0]:
from google.colab import files
files.download("./facetracker/face_model.tflite") 

Run the TensorFlow Lite model using the Python TensorFlow Lite Interpreter. 
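
Before feeding real data, it can help to confirm that a converted model runs at all. The sketch below builds one random tensor with the shape and dtype the model declares for its first input, runs it, and returns the first output. The helper name `run_with_dummy_input` is hypothetical, and the random values are placeholders rather than meaningful inputs.

```python
import numpy as np

def run_with_dummy_input(interpreter, seed=0):
    """Feed one random tensor shaped like the model's first input; return the first output."""
    input_detail = interpreter.get_input_details()[0]
    output_detail = interpreter.get_output_details()[0]

    # Build a random tensor with the shape and dtype the model declares.
    shape = tuple(int(dim) for dim in input_detail["shape"])
    rng = np.random.default_rng(seed)
    dummy = rng.random(shape).astype(input_detail["dtype"])

    interpreter.set_tensor(input_detail["index"], dummy)
    interpreter.invoke()
    return interpreter.get_tensor(output_detail["index"])

# Usage, assuming TensorFlow is imported as tf and the converted file exists:
# interpreter = tf.lite.Interpreter(model_path="./facetracker/face_model.tflite")
# interpreter.allocate_tensors()
# print(run_with_dummy_input(interpreter).shape)
```

Reading the shape from `get_input_details()` avoids hard-coding an input size, which matters when the model's expected input differs from the dataset at hand.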

### Load the test data

First, let's load the MNIST test data to feed to the model:

In [0]:
_, mnist_test = tf.keras.datasets.mnist.load_data()
images, labels = tf.cast(mnist_test[0], tf.float32)/255.0, mnist_test[1]

mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)

### Load the models into the interpreters

In [0]:
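# tflite_model_file must point to the unquantized .tflite file; define it before running this cell.
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))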
interpreter.allocate_tensors()

In [0]:
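# tflite_model_fp16_file must point to the float16 .tflite file; define it before running this cell.
interpreter_fp16 = tf.lite.Interpreter(model_path=str(tflite_model_fp16_file))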
interpreter_fp16.allocate_tensors()

### Test the models on one image

In [0]:
for img, label in mnist_ds:
  break

interpreter.set_tensor(interpreter.get_input_details()[0]["index"], img)
interpreter.invoke()
predictions = interpreter.get_tensor(
    interpreter.get_output_details()[0]["index"])

In [0]:
import matplotlib.pyplot as plt

plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)

In [0]:
interpreter_fp16.set_tensor(
    interpreter_fp16.get_input_details()[0]["index"], img)
interpreter_fp16.invoke()
predictions = interpreter_fp16.get_tensor(
    interpreter_fp16.get_output_details()[0]["index"])

In [0]:
plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)

### Evaluate the models

In [0]:
def eval_model(interpreter, mnist_ds):
  total_seen = 0
  num_correct = 0

  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]
  for img, label in mnist_ds:
    total_seen += 1
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    if predictions == label.numpy():
      num_correct += 1

    if total_seen % 500 == 0:
      print("Accuracy after %i images: %f" %
            (total_seen, float(num_correct) / float(total_seen)))

  return float(num_correct) / float(total_seen)

In [0]:
# Create smaller dataset for demonstration purposes
mnist_ds_demo = mnist_ds.take(2000)

print(eval_model(interpreter, mnist_ds_demo))

Repeat the evaluation on the float16 quantized model:

In [0]:
# NOTE: Colab runs on server CPUs. At the time of writing this, TensorFlow Lite
# doesn't have super optimized server CPU kernels. For this reason this may be
# slower than the above float interpreter. But for mobile CPUs, considerable
# speedup can be observed.
print(eval_model(interpreter_fp16, mnist_ds_demo))
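
The comment in the cell above says the float16 interpreter may run slower than the float interpreter on Colab's server CPUs. One way to check this on your own machine is to time repeated calls. The sketch below is a generic wall-clock timer; the helper name `mean_latency_ms` and its default repeat counts are assumptions, not part of TensorFlow.

```python
import time

def mean_latency_ms(run_once, repeats=50, warmup=5):
    """Average wall-clock time of `run_once()` in milliseconds."""
    for _ in range(warmup):
        run_once()  # Warm-up calls are not timed.
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / repeats

# Usage with the interpreters above (each already has an input tensor set):
# print(mean_latency_ms(interpreter.invoke))
# print(mean_latency_ms(interpreter_fp16.invoke))
```

Timing only `invoke()` leaves out the cost of `set_tensor` and `get_tensor`, so the figures describe inference alone.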

In this example, you have quantized a model to float16 with no difference in accuracy on the 2,000-image evaluation subset.
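
To get a feel for why storing weights as float16 tends to cost little, the sketch below rounds some float32 values to float16 and measures what is lost. The weights are random stand-ins drawn from a normal distribution, not the weights of any model in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.normal(size=10_000).astype(np.float32)  # stand-in weights

# Store as float16, then read back as float32 for comparison.
weights_fp16 = weights_fp32.astype(np.float16)
restored = weights_fp16.astype(np.float32)

size_ratio = weights_fp16.nbytes / weights_fp32.nbytes
max_abs_error = float(np.max(np.abs(restored - weights_fp32)))

print("size ratio:", size_ratio)
print("max abs rounding error:", max_abs_error)
```

The size ratio is exactly 0.5 because each value shrinks from four bytes to two. The rounding error stays small relative to the weight magnitudes, since float16 keeps 11 significant bits; whether that is small enough for a given model still has to be confirmed by evaluation, as above.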

It's also possible to evaluate the fp16 quantized model on the GPU. To perform all arithmetic with the reduced-precision values, be sure to create the `TfLiteGpuDelegateOptions` struct in your app and set `precision_loss_allowed` to `1`, like this:

```
// Prepare GPU delegate.
const TfLiteGpuDelegateOptions options = {
  .metadata = NULL,
  .compile_options = {
    .precision_loss_allowed = 1,  // FP16
    .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
    .dynamic_batch_enabled = 0,   // Not fully functional yet
  },
};
```

Detailed documentation on the TFLite GPU delegate and how to use it in your application can be found [here](https://www.tensorflow.org/lite/performance/gpu_advanced).