


# Week 3 - Model Optimizations

## Introduction

In this appendix, you will learn how to apply different optimization types, such as quantization, clustering, and pruning.

This appendix uses the following development tools (sorted by relevance):

*   TensorFlow 2.x (latest)
*   TensorFlow Datasets (latest)
*   Matplotlib (latest)
*   Python 3.x (latest)

# Training the model


First of all, we need to install TensorFlow 2.x (the latest version available when you follow this tutorial) and TensorFlow Datasets. The pretrained model itself is downloaded later, when we create the classifier.

The installation step is implemented below:

In [None]:
# Installing TensorFlow 2.x
!pip install tensorflow
!pip install --upgrade tensorflow-datasets

## Setup input pipeline
First, we need to download the stanford_dogs dataset to train our model.

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt

# use tensorflow_datasets library to easily download the dataset StanfordDogs
dataset, ds_info = tfds.load("stanford_dogs", with_info=True)
training_data, test_data = dataset["train"], dataset["test"]

# plot some image samples from the training data
tfds.show_examples(training_data, ds_info)
plt.show()

## Preprocess the dataset

Once we have the dataset, we need to convert it to the input format our model supports.

In [None]:
IMAGE_SIZE = 224
BATCH_SIZE = 32

def preprocess(sample):
  """ Convert image from int to float, normalize it and resize it. """
  image, label = sample['image'], sample['label']
  image = tf.cast(image, tf.float32) / 255.
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

def prepare(dataset):
  """ Map, shuffle, batch and prefetch the dataset. """
  ds = dataset.map(preprocess, num_parallel_calls=4)
  ds = ds.shuffle(buffer_size=1000)
  ds = ds.batch(BATCH_SIZE)
  ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
  return ds

training_batches = prepare(training_data)
test_batches = prepare(test_data)



## Create the classification model

Now we can create the classification model using the MobileNetV2 network developed by Google and pretrained on ImageNet.

In [None]:


IMG_SHAPE = (IMAGE_SIZE, IMAGE_SIZE, 3)

# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                              include_top=False, 
                                              weights='imagenet')
print('Classification model finished')

Once we have our base model, we freeze its layers and add an MLP head on top of it. Only this head is trained; the frozen base keeps its ImageNet weights.

In [None]:
base_model.trainable = False
NUM_CLASSES = ds_info.features['label'].num_classes
model = tf.keras.Sequential([
  base_model,
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.BatchNormalization(),
  tf.keras.layers.Dense(512, activation='elu'),
  tf.keras.layers.Dense(256, activation='elu'),
  tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

Finally, we can compile our whole model.

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

## Train and test model
To train our model, we only need to rely on Keras' fit() method.



In [None]:
# This callback will stop the training when there is no improvement in 
# the validation accuracy for three consecutive epochs. 
callback = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)

# train model
history = model.fit(training_batches,
                    epochs=30,
                    validation_data=test_batches,
                    callbacks=[callback])

print('Model trained')

# plot accuracy during training
plt.plot(history.history['accuracy'], label='Training accuracy')
plt.plot(history.history['val_accuracy'], label='Test accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()
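
The stopping rule of the callback above can be illustrated without Keras. The following is a minimal sketch in plain Python, assuming that any strict increase of the monitored value counts as an improvement. The function name `stopping_epoch` and the accuracy values are made up for illustration.

```python
def stopping_epoch(val_accuracies, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("-inf")
    wait = 0  # number of epochs since the last improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, one value per epoch.
val_history = [0.60, 0.68, 0.71, 0.70, 0.71, 0.69, 0.72]
print(stopping_epoch(val_history))
```

In this made-up run, the last improvement happens at epoch 3, so training stops three epochs later and never reaches the better value at epoch 7. A larger `patience` trades training time for a chance to recover from such plateaus.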


Let's save the model

In [None]:
model.save("trained_model_dogs")

In [None]:
import shutil
# Zip the saved model directory so it can be downloaded and restored later
shutil.make_archive("trained_model_dogs", 'zip', "trained_model_dogs")

In [None]:
import zipfile
with zipfile.ZipFile("trained_model_dogs.zip", 'r') as zip_ref:
    zip_ref.extractall("model")
print('File has been extracted')

In [None]:
from tensorflow import keras
model = keras.models.load_model('model')
print('Model loaded')


# **Model Quantization**

---


* Post-training quantization is a conversion technique that can reduce model size;

* Furthermore, it's possible to improve CPU and hardware accelerator latency, with little degradation in model accuracy;

* Instead of representing the neural network parameters as 32-bit floats, we can choose another data type to represent these values, such as 8-bit integers or 16-bit floats.

* References: 

  [Model Optimization Overview](https://www.tensorflow.org/lite/performance/model_optimization)

  [Post-training Quantization](https://www.tensorflow.org/lite/performance/post_training_quantization)

  [Dynamic Range Quantization](https://www.tensorflow.org/lite/performance/post_training_quant)

  [Float 16 Quantization](https://www.tensorflow.org/lite/performance/post_training_float16_quant)
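
To make the idea of swapping data types concrete, here is a minimal sketch of an affine float-to-int8 mapping in plain Python. It is an illustration under simplifying assumptions (one scale and zero point for the whole list), not the converter's actual algorithm. The helper names and the weight values are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 using one scale and one zero point for the whole list."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # the range must contain 0
    scale = (hi - lo) / 255.0 or 1.0                        # avoid a zero scale for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the int8 codes."""
    return [scale * (x - zero_point) for x in q]

weights = [-1.0, -0.5, 0.0, 0.25, 1.55]  # hypothetical float32 weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each weight now needs one byte instead of four, and the reconstruction error stays below one quantization step (`scale`).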








## Dynamic range quantization

The simplest form of post-training quantization statically quantizes only the weights, converting them from floating point to integers with 8 bits of precision:

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quantized_model = converter.convert()
quantization_type = 'dynamic_range'


## Full integer quantization

*   You can improve latency and reduce the model size even further with full integer quantization. In this case, the activations are quantized as well, instead of just the weights.

*   This quantization type requires calibration data.





### Calibration Data

* For full integer quantization, you need to calibrate or estimate the range, i.e., (min, max), of all floating-point tensors in the model.

* Unlike constant tensors such as weights and biases, variable tensors such as model input, activations (outputs of intermediate layers) and model output cannot be calibrated unless we run a few inference cycles.

* As a result, the converter requires a representative dataset to calibrate them. This dataset can be a small subset (about 100-500 samples) of the training or validation data.


In [None]:
# In this case, we are using 100 batches of the training dataset
calibration_data = []

# take(100) stops after 100 batches instead of iterating over the whole dataset
for images, _ in training_batches.take(100):
  calibration_data.append(images)

def dataset_gen():
  for data in calibration_data:
    yield [data]
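
The range estimation that calibration performs can be sketched in plain Python. This is only an illustration of the (min, max) bookkeeping, assuming a single tensor and made-up activation values; the helper name `calibrate_range` is hypothetical.

```python
def calibrate_range(batches):
    """Track the smallest and largest value seen across all calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi

# Hypothetical activation values recorded while running three calibration batches.
activation_batches = [[0.0, 1.2, 3.4], [0.5, 5.1, 2.2], [0.1, 4.0, 0.9]]

few = calibrate_range(activation_batches[:1])   # range seen with one batch only
full = calibrate_range(activation_batches)      # range seen with all batches
print("one batch:", few, "all batches:", full)
```

With too few samples the estimated range is too narrow, so larger activations seen at inference time would be clipped. This is why the calibration data should be representative of real inputs.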

Creating the converter for our model:

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)

converter.experimental_new_converter = True
converter.allow_custom_ops = True

converter.optimizations = [tf.lite.Optimize.DEFAULT]

### Quantization int 8 bits

In this case, we change the 32-bit float parameters to 8-bit integer parameters. Run only the cell for the option you want to use.

Of the three options shown here, this one has the greatest accuracy loss.

In [None]:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
quantization_type = 'int8'

### Quantization int 8 and int 16
In this case, we change the 32-bit float parameters to 8-bit integers for the weights and 16-bit integers for the activations.

The accuracy loss is smaller than in the previous case.

In [None]:
converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8, tf.lite.OpsSet.SELECT_TF_OPS]
quantization_type = 'int8x16'
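
Why 16-bit activations lose less accuracy than 8-bit ones can be seen from the size of the rounding error. The sketch below quantizes one value on a uniform grid with 8 and with 16 bits. The function name, the value range and the activation value are assumptions chosen for illustration.

```python
def rounding_error(value, lo, hi, bits):
    """Quantize `value` on a uniform grid with 2**bits levels over [lo, hi]; return the absolute error."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    restored = lo + round((value - lo) / step) * step
    return abs(value - restored)

activation = 0.3337  # hypothetical activation value inside the range [-1, 1]
err8 = rounding_error(activation, -1.0, 1.0, bits=8)
err16 = rounding_error(activation, -1.0, 1.0, bits=16)
print("8-bit error:", err8, "16-bit error:", err16)
```

The 16-bit grid has 256 times more levels over the same range, so the worst-case rounding error shrinks by roughly the same factor.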

### Quantization float 16 bits

In this case, we change the parameters from 32-bit floats to 16-bit floats.

Of the three options, this one has the smallest accuracy loss.

In [None]:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
converter.target_spec.supported_types = [tf.float16]
quantization_type = 'float16'
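
The precision that remains after a float16 conversion can be checked with the standard library alone. This is a small sketch that rounds a few made-up weight values to half precision using the `struct` module's `e` format; the helper name `to_float16` is hypothetical.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

weights = [0.1, 0.333333, 1234.567]  # hypothetical float32 weights
for w in weights:
    h = to_float16(w)
    print(w, "->", h, "relative error:", abs(w - h) / abs(w))
```

For values in the normal float16 range, the relative error stays below about 0.05%, which is consistent with float16 being the mildest of the three options.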

## Converting the Quantized model to TF Lite

Here we use the calibration data before conversion:

In [None]:
converter.representative_dataset = dataset_gen

In [None]:
tflite_quantized_model = converter.convert()

## Testing the TF Lite model
Finally, we can run the converted TF Lite model. 

In [None]:
import numpy as np

# Creating TF Lite interpreter
interpreter = tf.lite.Interpreter(model_content=tflite_quantized_model)

# Creating random input data
input_details = interpreter.get_input_details()
input_shape = input_details[0]['shape']
input_data = tf.convert_to_tensor(np.array(np.random.random_sample(input_shape), dtype=np.float32))

# Performing inference
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Getting model's output
output_details = interpreter.get_output_details()
output = interpreter.get_tensor(output_details[0]['index'])
print("Input shape: ")
print(input_shape)
print("Model output shape:")
print(output.shape)

## Saving the TF Lite Model
Finally, the last step is to export the TF Lite model as a .tflite file, so it can be embedded on the edge device for inference. 

In [None]:
import os
if not os.path.exists("quantized_models"):
  os.makedirs("quantized_models")

In [None]:
with open(f'quantized_models/quantized_model_{quantization_type}.tflite', 'wb') as f:
  f.write(tflite_quantized_model)

# **Weight Clustering**

---

* Clustering, or weight sharing, reduces the number of unique weight values in a model;

* It first groups the weights of each layer into N clusters, then shares the cluster's centroid value for all the weights belonging to the cluster.

* References: 

  [Weight Clustering Overview](https://www.tensorflow.org/model_optimization/guide/clustering)

  [Weight Clustering Example](https://www.tensorflow.org/lite/performance/post_training_quantization)
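
The grouping described above can be sketched with a one-dimensional k-means in plain Python. This is a simple stand-in for illustration, not the library's implementation. The function name and the weight values are hypothetical, and `n_clusters` is assumed to be at least 2.

```python
def cluster_weights_1d(weights, n_clusters, iterations=10):
    """Replace each weight by the centroid of its cluster (1-D k-means, linear initialization)."""
    lo, hi = min(weights), max(weights)
    # Linear initialization: centroids evenly spaced between the smallest and largest weight.
    centroids = [lo + i * (hi - lo) / (n_clusters - 1) for i in range(n_clusters)]

    def nearest(w):
        return min(range(n_clusters), key=lambda c: abs(w - centroids[c]))

    for _ in range(iterations):
        assignment = [nearest(w) for w in weights]
        # Move each centroid to the mean of the weights assigned to it.
        for c in range(n_clusters):
            members = [w for w, a in zip(weights, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)

    return [centroids[nearest(w)] for w in weights], centroids

weights = [-0.92, -0.88, -0.31, -0.27, 0.02, 0.05, 0.48, 0.51, 0.95, 0.99]  # hypothetical
clustered, centroids = cluster_weights_1d(weights, n_clusters=4)
print(len(set(weights)), "unique values ->", len(set(clustered)), "unique values")
```

After clustering, the layer only needs to store the centroid values plus a small index per weight, which is what makes the result easier to compress.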



In [None]:
# New package required
!pip install tensorflow_model_optimization

First of all, let's evaluate our original model and save its accuracy so that we can compare it later with the accuracy of the clustered model.

In [None]:
_, baseline_model_accuracy = model.evaluate(
    test_batches, verbose=0)
print('Baseline test accuracy:', baseline_model_accuracy)

Now, let's start the clustering process.

In [None]:
import tensorflow_model_optimization as tfmot

Here we select the number of clusters that we want to use:

In [None]:
# You can use the elbow method to find out the optimal number of clusters
number_of_clusters = 16

Setting up the clustering parameters:

In [None]:
cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

clustering_params = {
    'number_of_clusters': number_of_clusters,
    'cluster_centroids_init': CentroidInitialization.LINEAR
}

Let's apply the clustering only to the Dense layers:

In [None]:
def apply_clustering_to_desired_layers(layer):
  if isinstance(layer, tf.keras.layers.Dense):
    return cluster_weights(layer, **clustering_params)
  return layer

Use `tf.keras.models.clone_model` to apply the clustering to previously chosen layers:

In [None]:
clustered_model = tf.keras.models.clone_model(
            model,
            clone_function=apply_clustering_to_desired_layers,
        )

Let's check the architecture of these models

In [None]:
model.summary()

In [None]:
clustered_model.summary()

Use a smaller learning rate for fine-tuning the clustered model:

In [None]:
opt = tf.keras.optimizers.Adam(learning_rate=1e-5)

Compile the model for training:

In [None]:
# The model ends in a softmax layer, so its outputs are probabilities, not logits.
clustered_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=opt,
    metrics=['accuracy'])

Fine-tuning the model:

In [None]:
clustered_model.fit(
    training_batches,
    validation_data=test_batches,
    epochs=1,
)

Let's evaluate the accuracy of the clustered model and compare it with the baseline value. If the accuracy drops too much, you can apply the clustering to fewer layers:

In [None]:
_, clustered_model_accuracy = clustered_model.evaluate(test_batches, verbose=0)

In [None]:
print('Baseline test accuracy:', baseline_model_accuracy)
print('Clustered test accuracy:', clustered_model_accuracy)

`strip_clustering` removes all variables that clustering needs only during training, such as the `tf.Variable` objects that store the cluster centroids and the indices:

In [None]:
final_model = tfmot.clustering.keras.strip_clustering(clustered_model)

Convert the clustered model to tflite:

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.experimental_new_converter = True
converter.allow_custom_ops = True
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]

Note that you can also apply quantization after clustering, which can reduce the model size further.

In [None]:
quantize = False

In [None]:
import os
if not os.path.exists("clustered_models"):
  os.makedirs("clustered_models")

In [None]:
# Optionally quantize the clustered model before saving it
if quantize:
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_clustered_and_quantized_model = converter.convert()

    with open('clustered_models/clustered_and_quantized_model.tflite', 'wb') as f:
        f.write(tflite_clustered_and_quantized_model)

else:
    tflite_clustered_model = converter.convert()

    with open('clustered_models/clustered_model.tflite', 'wb') as f:
        f.write(tflite_clustered_model)

print('Clustering finished')

# **Model Pruning**

---

*   The goal of pruning is to reduce the number of parameters and operations of the model;

*   Sparse models are easier to compress, and we can skip the zeroes during inference for latency improvements.


* References: 

  [Pruning Overview](https://www.tensorflow.org/model_optimization/guide/pruning)

  [Pruning With Keras](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras)
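
The core idea of magnitude-based pruning can be shown on a plain list of numbers. This is a minimal sketch that zeroes the weights with the smallest absolute values. It is not the library's implementation, and the function name and weight values are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from the smallest to the largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.6, -0.2, 0.07, -0.4, 0.15]  # hypothetical
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)
```

Half of the entries are now exactly zero, and the large-magnitude weights, which are assumed to matter most for the output, are untouched.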


## Packages

In [None]:
!pip install tensorflow_model_optimization

In [None]:
import tensorflow_model_optimization as tfmot
import numpy as np
import tempfile

## Pruning a whole model

`initial_sparsity`: sparsity (as a fraction between 0 and 1) at which pruning begins.

In [None]:
initial_sparsity = 0.50

`final_sparsity`: sparsity (as a fraction between 0 and 1) at which pruning ends.

In [None]:
final_sparsity = 0.80

begin_step: step at which to begin pruning.

In [None]:
begin_step = 0

end_step: step at which to end pruning.

In [None]:
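batch_size = 32
epochs = 2
# len(training_batches) would count batches, not images, so we read the
# number of training images from the dataset metadata instead.
number_of_images = ds_info.splits['train'].num_examples

print("Number of images: ", number_of_images)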

In [None]:
end_step = np.ceil(number_of_images / batch_size).astype(np.int32) * epochs
print("End step: ", end_step)
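
The `end_step` arithmetic can be checked on its own. The sketch below assumes a hypothetical training set of 12,000 images; substitute the real size of your training split. The function name `pruning_end_step` is made up for illustration.

```python
import math

def pruning_end_step(num_images, batch_size, epochs):
    """Total number of training steps taken over the whole fine-tuning run."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch * epochs

# Hypothetical dataset size, for illustration only.
print(pruning_end_step(num_images=12000, batch_size=32, epochs=2))
```

The ceiling accounts for the final, partially filled batch of each epoch, which still counts as one training step.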

In [None]:
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=initial_sparsity,
                                                               final_sparsity=final_sparsity,
                                                               begin_step=begin_step,
                                                               end_step=end_step)
}
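
How the target sparsity moves from its initial to its final value can be sketched as a plain function. This is an approximation for intuition only: the exponent `power=3` is assumed to match the library default, the update frequency is ignored, and the step is simply clamped outside the pruning window. The function name and the `end_step` value of 750 are hypothetical.

```python
def polynomial_sparsity(step, initial_sparsity, final_sparsity, begin_step, end_step, power=3):
    """Target sparsity at `step` under a polynomial decay schedule."""
    # Clamp the step so that the value stays constant outside [begin_step, end_step].
    step = min(max(step, begin_step), end_step)
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power

for step in (0, 250, 500, 750):
    print(step, round(polynomial_sparsity(step, 0.50, 0.80, 0, 750), 4))
```

The sparsity rises quickly at first and flattens out near the end, so most weights are removed early, while the network still has many training steps left to recover.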


Here we will use a magnitude-based method to remove the low saliency parameters of the neural network.

In [None]:
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

Here we define the model for pruning based on the original pretrained model:

In [None]:
model_for_pruning = prune_low_magnitude(model, **pruning_params)

It's necessary to recompile our model.

In [None]:
# `prune_low_magnitude` requires a recompile.
# The model ends in a softmax layer, so its outputs are probabilities, not logits.
model_for_pruning.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

Let's take a look at the layers of the original model and model for pruning.

In [None]:
model.summary()

In [None]:
model_for_pruning.summary()

Finally, let's train our model:

In [None]:
logdir = tempfile.mkdtemp()
  
callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep(),
  tfmot.sparsity.keras.PruningSummaries(log_dir=logdir),
]

# The dataset is already batched, so `batch_size` must not be passed to fit().
model_for_pruning.fit(training_batches,
                      epochs=epochs,
                      validation_data=test_batches,
                      callbacks=callbacks)

Pruned model accuracy:

In [None]:
_, original_model_accuracy = model.evaluate(test_batches, verbose=1)
_, pruned_model_accuracy = model_for_pruning.evaluate(test_batches, verbose=1)

In [None]:
print('Original model accuracy:', original_model_accuracy) 
print('Pruned model accuracy:', pruned_model_accuracy)

Let's remove the unnecessary variables used during the pruning process.

In [None]:
model_for_pruning.summary()

In [None]:
model_to_save = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

In [None]:
model_to_save.summary()

Let's convert and save the model

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model_to_save)
pruned_whole_model = converter.convert()


Creating a folder for the pruned models

In [None]:
import os
if not os.path.exists("pruned_models"):
  os.makedirs("pruned_models")

Saving

In [None]:
with open('pruned_models/pruned_whole_model.tflite', 'wb') as f:
  f.write(pruned_whole_model)

## Pruning just some layers

* Instead of pruning the whole model, we can prune only some layers to avoid a large accuracy drop.

Let's apply the pruning only to the Dense layers:

In [None]:
def apply_pruning_to_layers(layer):
  if isinstance(layer, tf.keras.layers.Dense):
    return tfmot.sparsity.keras.prune_low_magnitude(layer)
  return layer

Defining the model for pruning 

In [None]:
model_for_pruning = tf.keras.models.clone_model(
    model,
    clone_function=apply_pruning_to_layers,
)

Let's take a look at the layers of the model for pruning:

In [None]:
model.summary()

In [None]:
model_for_pruning.summary()

Next, we train the new model, whose Dense layers are wrapped with `prune_low_magnitude`.

It's necessary to recompile our model:

In [None]:
model_for_pruning.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Defining some parameters:

In [None]:
batch_size = 32
epochs = 2
validation_split = 0.1 

In [None]:
logdir = tempfile.mkdtemp()

callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep(),
  tfmot.sparsity.keras.PruningSummaries(log_dir=logdir),
]

# The dataset is already batched, so `batch_size` must not be passed to fit().
model_for_pruning.fit(training_batches,
                      epochs=epochs,
                      validation_data=test_batches,
                      callbacks=callbacks
                      )

Evaluating the models:

In [None]:
_, original_model_accuracy = model.evaluate(test_batches, verbose=1)
_, model_for_pruning_accuracy = model_for_pruning.evaluate(test_batches, verbose=1)

In [None]:
print('Original model accuracy:', original_model_accuracy) 
print('Pruned model accuracy:', model_for_pruning_accuracy)

Removing unnecessary variables:

In [None]:
model_to_save = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

Let's convert and save the model

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model_to_save)
# No quantization is applied here, so the variable is named accordingly.
pruned_some_layers_model = converter.convert()


Saving

In [None]:
with open('pruned_models/pruned_some_layers_model.tflite', 'wb') as f:
  f.write(pruned_some_layers_model)

## Create a smaller model combining pruning and quantization

Converting the model with quantization:

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model_to_save)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
pruned_and_quantized_model = converter.convert()


Saving the pruned and quantized model:

In [None]:
with open(f'pruned_models/pruned_and_quantized_model.tflite', 'wb') as f:
      f.write(pruned_and_quantized_model)

# TF Lite models

---
## Original model

> *original_model.tflite*  ➡ *11,858 KB*

---
## Optimized models

> *1. quantized_model_dynamic_range.tflite* ➡ *3,389 KB*

> *2. quantized_model_int8.tflite* ➡ *3,454 KB*

> *3. quantized_model_int8x16.tflite* ➡ *3,662 KB*

> *4. quantized_model_float16.tflite* ➡ *5,971 KB*

> *5. clustered_model.tflite* ➡ *11,858 KB*

> *6. clustered_and_quantized_model.tflite* ➡ *3,389 KB*

> *7. pruned_whole_model.tflite* ➡ *11,858 KB*

> *8. pruned_some_layers_model.tflite* ➡ *11,858 KB*

> *9. pruned_and_quantized_model.tflite* ➡ *3,389 KB*

---
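
The size reductions can be compared more easily as ratios. The short sketch below computes them from the sizes listed above, reading each figure as a whole number of kilobytes (for example, 11858 KB for the original model), which is an assumption about the notation.

```python
# File sizes in KB, copied from the list above.
sizes_kb = {
    "original_model": 11858,
    "quantized_model_dynamic_range": 3389,
    "quantized_model_int8": 3454,
    "quantized_model_int8x16": 3662,
    "quantized_model_float16": 5971,
    "clustered_model": 11858,
    "clustered_and_quantized_model": 3389,
    "pruned_whole_model": 11858,
    "pruned_some_layers_model": 11858,
    "pruned_and_quantized_model": 3389,
}

baseline = sizes_kb["original_model"]
for name, size in sizes_kb.items():
    print(f"{name:<32} {size:>6} KB  ratio {baseline / size:.2f}x")
```

In these figures, the 8-bit variants are roughly 3.5 times smaller and float16 about 2 times smaller. The clustered and pruned files are the same size as the original because the .tflite file still stores every weight; the benefit of those techniques on file size would only appear after the file is compressed.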



## How to evaluate the TF Lite model?

As noted earlier, optimizations can affect model accuracy, so you should always evaluate the optimized model.

You can measure the accuracy of your model on the test dataset.

Reference:

[Evaluating the TF Lite model](https://www.tensorflow.org/lite/performance/post_training_quant#evaluate_the_models)
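
The accuracy measurement itself is independent of the runtime. The following is a minimal sketch of a top-1 accuracy helper in plain Python. The `predict` argument is an assumption: for a TF Lite model it would be a small function that wraps the `set_tensor`, `invoke` and `get_tensor` calls from the "Testing the TF Lite model" section. The toy classifier and the samples are made up for illustration.

```python
def top1_accuracy(predict, samples):
    """Fraction of (input, label) pairs whose highest-scoring class equals the label."""
    correct = 0
    total = 0
    for x, label in samples:
        scores = list(predict(x))
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
        total += 1
    return correct / total if total else 0.0

# Stand-in classifier for demonstration only: prefers class 1 for positive inputs.
def toy_predict(x):
    return [0.2, 0.8] if x > 0 else [0.9, 0.1]

samples = [(1.5, 1), (-0.3, 0), (2.0, 0), (-1.0, 0)]  # hypothetical (input, label) pairs
print(top1_accuracy(toy_predict, samples))
```

Running the same helper on the original and on each optimized model gives directly comparable numbers, which makes the accuracy cost of each optimization visible.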


