# Introduction

In this notebook, we show how to quantize a model using AutoQKeras.

As usual, let's first make sure we are using Python 3.

In [1]:
import sys
print(sys.version)

3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0]


Now, let's load some packages we will need to run AutoQKeras.

In [2]:
import warnings
warnings.filterwarnings("ignore")

import json
import pprint
import numpy as np
import six
import tempfile
import tensorflow.compat.v2 as tf
# V2 Behavior is necessary to use TF2 APIs before TF2 is default TF version internally.
tf.enable_v2_behavior()
from tensorflow.keras.optimizers import *

from qkeras.autoqkeras import *
from qkeras import *
from qkeras.utils import model_quantize
from qkeras.qtools import run_qtools
from qkeras.qtools import settings as qtools_settings

from tensorflow.keras.utils import to_categorical
import tensorflow_datasets as tfds

print("using tensorflow", tf.__version__)

2022-07-11 11:06:26.303619: I tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-07-11 11:06:26.309809: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-11 11:06:26.309826: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


using tensorflow 2.10.0-dev20220407


Let's define `get_data` and `get_model`, as you may not have stand-alone access to the `examples` directory inside `autoqkeras`.

In [3]:
import tensorflow as tf
import keras
import numpy as np
import time
import random
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Activation, Flatten, LSTM, GRU, SimpleRNN, Conv2D, MaxPooling2D, Flatten, Dropout, Reshape
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau
from keras.regularizers import l2, l1, l1_l2
from collections import deque

from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error as mse
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow import keras
from matplotlib import pyplot as plt
from IPython.display import clear_output

import qkeras
from qkeras import *

import hls4ml
import pickle


name convert optimizers ['fuse_bias_add', 'remove_useless_transpose', 'output_rounding_saturation_mode', 'qkeras_factorize_alpha', 'extract_ternary_threshold', 'fuse_consecutive_batch_normalization'] File: flow.py Line: 23
name optimize optimizers ['eliminate_linear_activation', 'fuse_consecutive_batch_normalization', 'fuse_batch_normalization', 'replace_multidimensional_dense_with_conv'] File: flow.py Line: 23
vivado:merge_batch_norm_quantized_tanh Get_Optimizer, optimizer/optimizer.py ligne: 168
vivado:quantize_dense_output Get_Optimizer, optimizer/optimizer.py ligne: 168
vivado:batchnormalizationquantizedtanh_config_template Get_Optimizer, optimizer/optimizer.py ligne: 168
vivado:batchnormalizationquantizedtanh_function_template Get_Optimizer, optimizer/optimizer.py ligne: 168
vivado:clone_output Get_Optimizer, optimizer/optimizer.py ligne: 168
vivado:clone_function_template Get_Optimizer, optimizer/optimizer.py ligne: 168
vivado:optimize_pointwise_conv Get_Optimizer, optimizer/opti

In [4]:
from tensorflow.keras.initializers import *
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import *
boosted_model =  tf.keras.models.load_model('../pb_file')

boosted_model.summary()




2022-07-11 11:06:32.538147: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-07-11 11:06:32.538199: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-11 11:06:32.538237: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (martop): /proc/driver/nvidia/version does not exist
2022-07-11 11:06:32.538831: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 8)                 80        
                                                                 
 dense (Dense)               (None, 1)                 9         
                                                                 
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________


`AutoQKeras` has some examples on how to run with `mnist`, `fashion_mnist`, `cifar10` and `cifar100`.

In [5]:
from nnlar.datashaper import DataShaper
ds = DataShaper.from_h5("../data/rdgap_mu140.h5")

x_train, x_val, x_test, y_train, y_val, y_test = ds()

shapes (1999995, 5, 1) (1999995, 1)
shapes (899992, 5, 1) (99995, 5, 1) (999998, 5, 1)


Before we create the model, let's see if we can perform distributed training.

In [6]:
physical_devices = tf.config.list_physical_devices()
for d in physical_devices:
  print(d)

PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')


In [7]:
has_tpus = np.any([d.device_type == "TPU" for d in physical_devices])

if has_tpus:
  TPU_WORKER = 'local'

  resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
      tpu=TPU_WORKER, job_name='tpu_worker')
  if TPU_WORKER != 'local':
    tf.config.experimental_connect_to_cluster(resolver, protocol='grpc+loas')
  tf.tpu.experimental.initialize_tpu_system(resolver)
  strategy = tf.distribute.experimental.TPUStrategy(resolver)
  print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

  cur_strategy = strategy
else:
  cur_strategy = tf.distribute.get_strategy()

Now we can create the model with the distributed strategy in place if TPUs are available. We have some test models that we can use, or you can build your own models. 

In [8]:
def normal_model(units_parameter):
    # Reference (unquantized) network: one SimpleRNN layer followed by a single-unit Dense layer.
    r_model = Sequential()
    r_model.add(SimpleRNN(units_parameter, activation='relu', input_shape=(5, 1), return_sequences=False, name='SimpleRNN'))
    r_model.add(Dense(1, activation='relu', name='dense'))
    r_model.summary()
    return r_model

refmodels_path="/atlas/bonnet/Desktop/code/internship_CPPM/autoqkeras_test/tests/models/optimized_model.h5"
   

In [9]:
with cur_strategy.scope():
  model =  tf.keras.models.load_model(refmodels_path)
  custom_objects = {}

Let's see the accuracy of an unquantized model.

In [10]:
patience_es=7
patience_rlr=5
delta = 0.00000001
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                                    patience=patience_es, 
                                                    restore_best_weights=True, 
                                                    min_delta=delta,
                                                    mode='min')

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                patience= patience_rlr, min_lr=0.000001, min_delta=delta, verbose=1)

In [11]:
"""
with cur_strategy.scope():
    optimizer = Adam(0.01)
    model.compile(loss="mean_squared_error", optimizer=optimizer)
    model.set_weights(model.get_weights())
    hist = model.fit(x_train,y_train,validation_data=(x_val,y_val), epochs=1, batch_size=64, shuffle=True, callbacks=[early_stopping, reduce_lr])
    lr_change = []
    for i in range (len(hist.history['lr'])-1):
    
        if (hist.history['lr'][i]==hist.history['lr'][i+1]):
            lr_change.append(None)
        else: 
            lr_change.append(hist.history['val_loss'][i+1])
    plt.plot(lr_change, 'X')
    plt.plot(hist.history['loss'])
    plt.plot(hist.history['val_loss'])
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['lr_changed','train', 'test'])
    plt.show()

"""



For `mnist`, we should get 99% validation accuracy, and for `fashion_mnist`, we should get around 86% validation accuracy. Let's get a metric for a high-level estimate of the energy of this model.



In [12]:
reference_internal = "fp32"
reference_accumulator = "fp32"

q = run_qtools.QTools(
    model,
    # energy calculation using a given process
    # "horowitz" refers to 45nm process published at
    # M. Horowitz, "1.1 Computing's energy problem (and what we can do about
    # it)," 2014 IEEE International Solid-State Circuits Conference Digest of
    # Technical Papers (ISSCC), San Francisco, CA, 2014, pp. 10-14,
    # doi: 10.1109/ISSCC.2014.6757323.
    process="horowitz",
    # quantizers for model input
    source_quantizers=[quantized_bits(8, 0, 1)],
    is_inference=False,
    # absolute path (including filename) of the model weights
    # in the future, we will attempt to optimize the power model
    # by using weight information, although it can be used to further
    # optimize QBatchNormalization.
    weights_path=None,
    # keras_quantizer to quantize weight/bias in un-quantized keras layers
    keras_quantizer=reference_internal,
    # keras_quantizer to quantize MAC in un-quantized keras layers
    keras_accumulator=reference_accumulator,
    # whether to calculate baseline energy
    for_reference=True)

# calculate energy of the derived data type map.
energy_dict = q.pe(
    # whether to store parameters in dram, sram, or fixed
    weights_on_memory="sram",
    # store activations in dram or sram
    activations_on_memory="sram",
    # minimum sram size in number of bits. Let's assume a 16MB SRAM.
    min_sram_size=8*16*1024*1024,
    # whether to load data from dram to sram (consider sram as a cache
    # for dram). If false, we will assume data will be already in SRAM
    rd_wr_on_io=False)

# get stats of energy distribution in each layer
energy_profile = q.extract_energy_profile(
    qtools_settings.cfg.include_energy, energy_dict)
# extract sum of energy of each layer according to the rule specified in
# qtools_settings.cfg.include_energy
total_energy = q.extract_energy_sum(
    qtools_settings.cfg.include_energy, energy_dict)

pprint.pprint(energy_profile)
print()
print("Total energy: {:.2f} uJ".format(total_energy / 1000000.0))

{'SimpleRNN': {'energy': {'inputs': 3.8,
                          'op_cost': 0.0,
                          'outputs': 3.8,
                          'parameters': 0.0},
               'total': 3.8},
 'dense': {'energy': {'inputs': 3.8,
                      'op_cost': 36.8,
                      'outputs': 3.8,
                      'parameters': 19.02},
           'total': 59.62}}

Total energy: 0.00 uJ


During the computation, we obtained a dictionary that outlines the energy per layer (`energy_profile`) and the total energy (`total_energy`). Remember that `energy_profile` may need additional filtering, as implementations will fuse some layers. When we compute `total_energy`, we approximate the final energy number by assuming that some layers will be fused. For example, a convolution layer followed by an activation layer will be fused into a single layer, so that the output of the convolution layer is not used.

You have to remember that our high-level model for energy has several assumptions:

The energy of a layer is estimated as `energy(layer) = energy(input) + energy(parameters) + energy(MAC) + energy(output)`.

1) Reading inputs, parameters and outputs considers only _compulsory_ accesses, i.e. the first access to the data, which is independent of the hardware architecture. If you remember _The 3 C's of Caches_ (https://courses.cs.washington.edu/courses/cse410/99au/lectures/Lecture-10-18/tsld035.htm), other types of accesses will depend on the accelerator architecture.

2) For the multiply-and-add (MAC) energy estimation, we only consider the energy to compute the MAC, but not any other type of energy. For example, a real accelerator has registers, glue logic and pipeline logic that will affect the overall energy profile of the device.

Although this model is simple and provides an initial estimate of what to expect, it has high variance with respect to the actual energy numbers you will find in practice, especially across different architectural implementations.

We assume that the real energy `Energy(layer)` is a linear function of the high-level energy model, i.e. `Energy(layer) = k1 * energy(layer) + k2`, where `k1` and `k2` are constants that depend on the architecture of the accelerator. One can think of `k1` as the factor that accounts for the additional storage needed to keep the model running, and `k2` as the additional always-on logic that is required to perform the operations. If we compare the energy of two implementations with different quantizations of the same layer, say `layer1` and `layer2`, then `Energy(layer1) > Energy(layer2)` holds iff `energy(layer1) > energy(layer2)` for the same architecture (given `k1 > 0`); for different architectures, this will not be true in general.

Despite its limitations in predicting a single energy number, this model is quite good for comparing the energy of two different models, or different types of quantization, when we restrict it to a single architecture, and that's how we use it here.
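To make the comparison argument concrete, here is a minimal sketch of the high-level energy formula and the assumed linear relation to real energy. The constants `K1` and `K2` and all per-layer numbers are made up for illustration; they are not measurements and do not come from `QTools`.

```python
# Hypothetical architecture constants; real values depend on the accelerator.
K1, K2 = 3.0, 10.0


def high_level_energy(inputs, parameters, mac, outputs):
    """energy(layer) = energy(input) + energy(parameters) + energy(MAC) + energy(output)."""
    return inputs + parameters + mac + outputs


def real_energy(e, k1=K1, k2=K2):
    """Assumed linear relation: Energy(layer) = k1 * energy(layer) + k2."""
    return k1 * e + k2


# Two made-up quantizations of the same layer.
layer1 = high_level_energy(inputs=4.0, parameters=20.0, mac=40.0, outputs=4.0)
layer2 = high_level_energy(inputs=2.0, parameters=5.0, mac=10.0, outputs=2.0)

# Same architecture (same k1 > 0 and k2): the ordering is preserved.
same_order = (layer1 > layer2) == (real_energy(layer1) > real_energy(layer2))
print(same_order)

# Different architectures (different k1): the ordering can flip.
flipped = real_energy(layer1, k1=1.0, k2=0.0) < real_energy(layer2, k1=5.0, k2=0.0)
print(flipped)
```

This is why the estimate is used here only to rank candidates within one architecture, and not as an absolute energy prediction.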

# Quantizing a Model With `AutoQKeras`

To quantize this model with `AutoQKeras`, we need to define the quantization for kernels, biases and activations; forgiving factors and quantization strategy.

Below we define which quantizers are allowed for kernel, bias, activations and linear. Linear is a proxy that we use to capture `Activation("linear")` to apply quantization without applying a non-linear operation.  In some networks, we found that this trick may be necessary to better represent the quantization space.


In [13]:
    
quantization_config = {
        "kernel": {
                
                "quantized_bits(8,0,1,alpha=1.0)": 8,
                "quantized_bits(10,0,1,alpha=1.0)": 10,
                "quantized_bits(12,0,1,alpha=1.0)": 12,
                "quantized_po2(8,1)": 8
        },
        "recurrent_kernel":{
                "quantized_bits(2,1,1,alpha=1.0)": 2,
                "quantized_bits(4,0,1,alpha=1.0)": 4,
                "quantized_bits(6,0,1,alpha=1.0)": 6,
                "quantized_bits(8,0,1,alpha=1.0)": 8,
                "quantized_bits(10,0,1,alpha=1.0)": 10,
                "quantized_bits(12,0,1,alpha=1.0)": 12},
        "bias": {
                
                "quantized_bits(8,0,1,alpha=1.0)": 8,
                "quantized_bits(10,0,1,alpha=1.0)": 10,
                "quantized_bits(12,0,1,alpha=1.0)": 12,
                "quantized_po2(8,0)": 8
        },
        "activation": {
                "quantized_relu_po2(8,0)": 8,
                
                "quantized_relu(8,0)": 8,
                "quantized_relu(10,0)": 10,
                "quantized_relu(12,0)": 12,
        },
        "linear": {
                 
                "quantized_bits(8,0,1,alpha=1.0)": 8,
                "quantized_bits(10,0,1,alpha=1.0)": 10,
                "quantized_bits(12,0,1,alpha=1.0)": 12
        }
}

Now let's define how to apply quantization. In the simplest form, we specify how many bits for kernels, biases and activations by layer types. Note that the entry `BatchNormalization` needs to be specified here, as we only quantize layer types specified by these patterns.  For example, a `Flatten` layer is not quantized as it does not change the data type of its inputs.

In [14]:
limit = {
    "SimpleRNN": [12, 12, 12, 12],
    "Dense": [16, 16, 16],
    "BatchNormalization": [],
    "Activation": [12]
}

Here, we are specifying that we want to use at most 12 bits for each quantizer in `SimpleRNN` layers and for activations, and at most 16 bits for each quantizer in `Dense` layers.

Let's now define the forgiving factor. We will consider energy minimization as the goal, as follows. Here, we are saying that we allow an 8% reduction in accuracy for a 2x reduction in energy; that both the reference and the trials keep parameters and activations in SRAM; that neither the reference model nor the quantization trials read from or write to DRAM on I/O operations; and that both experiments use SRAMs with minimum tensor sizes (commonly called a distributed SRAM implementation).

We also need to specify the quantizers for the inputs. In this case, we want to use `int8` as source quantizers. Other possible types are `int16`, `int32`, `fp16` or `fp32`, besides `QKeras` quantizer types.

Finally, to be fair, we want to compare our quantization against fixed-point 8-bit inputs, outputs, activations, weights and biases, and 32-bit accumulators.

Remember that a `forgiving factor` forgives a drop in a metric such as `accuracy` if the gains of the model are much bigger than the drop. For example, it corresponds to the sentence *we allow $\tt{delta}\%$ reduction in accuracy if the quantized model has $\tt{rate} \times$ smaller energy than the original model*, being a multiplicative factor to the metric. It is computed by $1 + \tt{delta} \times  \log_{\tt{rate}}(\tt{stress} \times \tt{reference\_cost} / \tt{trial\_cost})$.
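The formula above can be sketched in a few lines of Python. This is only an illustration of the expression as written, not the `AutoQKeras` implementation: the function name `forgiving_factor` is hypothetical, it takes a single `delta` in percent (the `goal` dictionary below distinguishes `delta_p` and `delta_n`, which this sketch does not model), and the costs are made-up numbers.

```python
import math


def forgiving_factor(delta_pct, rate, stress, reference_cost, trial_cost):
    """1 + delta * log_rate(stress * reference_cost / trial_cost), with delta given in percent."""
    delta = delta_pct / 100.0
    return 1.0 + delta * math.log(stress * reference_cost / trial_cost, rate)


# Trial costs half as much as the reference: the metric is scaled up by 8%.
print(forgiving_factor(8.0, 2.0, 1.0, reference_cost=100.0, trial_cost=50.0))

# Trial costs the same as the reference: no adjustment.
print(forgiving_factor(8.0, 2.0, 1.0, reference_cost=100.0, trial_cost=100.0))

# Trial costs twice as much as the reference: the metric is scaled down by 8%.
print(forgiving_factor(8.0, 2.0, 1.0, reference_cost=100.0, trial_cost=200.0))
```

With `rate = 2`, each halving of the cost adds another `delta` to the factor, so a 4x cheaper trial would be forgiven roughly twice as much as a 2x cheaper one.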

In [15]:
goal = {
    "type": "energy",
    "params": {
        "delta_p": 8.0,
        "delta_n": 8.0,
        "rate": 2.0,
        "stress": 1.0,
        "process": "horowitz",
        "parameters_on_memory": ["sram", "sram"],
        "activations_on_memory": ["sram", "sram"],
        "rd_wr_on_io": [False, False],
        "min_sram_size": [0, 0],
        "source_quantizers": ["int8"],
        "reference_internal": "int8",
        "reference_accumulator": "int32"
        }
}

There are a few more things we need to define. Let's bundle them in a dictionary and pass them to `AutoQKeras`. We will try a maximum of 20 trials (`max_trials`) just to limit the time we will spend finding the best quantization here. Please note that this parameter is not valid if you are running in `hyperband` mode.

`output_dir` is the directory where we will store our results. Since we are running on a colab, we will let `tempfile` choose a directory for us.

`learning_rate_optimizer` allows `AutoQKeras` to change the optimization function and the `learning_rate` to try to improve the quantization results. Since it is still experimental, it may produce worse results in some cases.

Because we are tuning filters as well, we should set `transfer_weights` to `False` as the trainable parameters will have different shapes.

In `AutoQKeras` we have three modes of operation: `random`, `bayesian` and `hyperband`. We recommend that the user refer to `KerasTuner` (https://keras-team.github.io/keras-tuner/) for a complete description of them.

`tune_filters` can be set to `layer`, `block` or `none`. If `tune_filters` is `block`, we change the filters by the same amount for all layers being quantized in the trial. If `tune_filters` is `layer`, we will possibly change the number of filters for each layer independently. Finally, if `tune_filters` is `none`, we will not perform filter tuning.

Together with `tune_filters`, `tune_filters_exceptions` allows the user to specify, by a regular expression, which layers should be excluded from filter tuning, which is especially useful for the last layers of the network.

Filter tuning is a very important feature of `AutoQKeras`. When we deep quantize a model, we may need fewer or more filters for each layer (and, as you can guess, we do not know a priori how many filters we will need for each layer). The rationale behind this is as follows.

- **fewer filters**: let us assume we have two sets of filter coefficients we want to quantize: $[-0.3, 0.2, 0.5, 0.15]$ and $[-0.5, 0.4, 0.1, 0.65]$. If we apply a $\tt{binary}$ quantizer with $\tt{scale} = 2^{\big\lceil \log_2(\frac{\sum |w|}{N}) \big\rceil}$, where $w$ are the filter coefficients and $N$ is the number of coefficients, we will end up with the same filter $\tt{binary}([-0.3, 0.2, 0.5, 0.15]) = \tt{binary}([-0.5, 0.4, 0.1, 0.65]) = [-1,1,1,1] \times 0.5$. In this case we are assuming the $\tt{scale}$ is a power-of-2 number so that it can be efficiently implemented by a shift operation;

- **more filters**: it is clear that quantization will drop information (just look at the example above) and deep quantization will drop more information, so to recover some of the boundary regions in layers that perform feature extraction, we may need to add more filters to the layer when we quantize it.
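The collapse of two different filters into one can be checked numerically. The sketch below is a plain-Python illustration, not the QKeras `binary` quantizer: `binary_quantize` is a hypothetical helper that takes the sign of each coefficient and uses a power-of-2 scale, computed as 2 raised to the rounded-up $\log_2$ of the mean absolute coefficient (the reading that gives the 0.5 used in the example above).

```python
import math


def binary_quantize(weights):
    """Sign of each weight times a power-of-2 scale derived from the mean |w|."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    scale = 2.0 ** math.ceil(math.log2(mean_abs))
    return [scale * (1.0 if w >= 0 else -1.0) for w in weights]


a = binary_quantize([-0.3, 0.2, 0.5, 0.15])
b = binary_quantize([-0.5, 0.4, 0.1, 0.65])

print(a)       # → [-0.5, 0.5, 0.5, 0.5]
print(a == b)  # → True
```

Since both filters quantize to the same values, one of them is redundant after quantization, which is the intuition for why a deeply quantized layer may get away with fewer filters.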

We do not want to quantize the `softmax` layer, which is the last layer of the network. In `AutoQKeras`, you can specify which layers to quantize by giving their indexes in `Keras`, i.e. if you can get the layer as `model.layers[i]` in `Keras`, `i` is the index of the layer.

Finally, for data parallel distributed training, we should pass the strategy in `distribution_strategy` to `KerasTuner`.

In [16]:
run_config = {
  "output_dir": tempfile.mkdtemp(),
  "goal": goal,
  "quantization_config": quantization_config,
  "learning_rate_optimizer": False,
  "transfer_weights": False,
  "mode": "random",
  "seed": 42,
  "limit": limit,
  "tune_filters": "layer",
  "tune_filters_exceptions": "^dense",
  "distribution_strategy": cur_strategy,
  # first layer is input, last two layers are softmax and flatten
  "layer_indexes": range(1, len(model.layers) - 1),
  "max_trials": 20

}

print("quantizing layers:", [model.layers[i].name for i in run_config["layer_indexes"]])

quantizing layers: []
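The empty list printed above follows directly from the `range` arithmetic: with only two layers, `range(1, len(model.layers) - 1)` is `range(1, 1)`, which selects nothing. The sketch below reproduces this with the two layer names of this model standing in for `model.layers`. Using `range(len(model.layers))` to select every layer is an assumption about the intent here, not something the original configuration does.

```python
# Stand-in for [layer.name for layer in model.layers] on this two-layer model.
layer_names = ["SimpleRNN", "dense"]

# Index range as written in run_config: skips the first and the last layer.
tutorial_indexes = list(range(1, len(layer_names) - 1))

# Hypothetical alternative: select every layer.
all_indexes = list(range(len(layer_names)))

print([layer_names[i] for i in tutorial_indexes])  # → []
print([layer_names[i] for i in all_indexes])       # → ['SimpleRNN', 'dense']
```

With no layer selected, a search has nothing to quantize, so it is worth checking this printout before starting a long run.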


In [17]:
autoqk = AutoQKeras(model, metrics=["mse"], custom_objects=custom_objects, **run_config)
autoqk.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=64, epochs=100, callbacks=[early_stopping])

Limit configuration:{"SimpleRNN": [12, 12, 12, 12], "Dense": [16, 16, 16], "BatchNormalization": [], "Activation": [12]}
name SimpleRNN_kernel_quantizer
name dense_kernel_quantizer
abovetrial_size
target 4
self.trial_size 4
learning_rate: 7.812500371073838e-06
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 SimpleRNN (SimpleRNN)       (None, 8)                 80        
                                                                 
 dense (Dense)               (None, 1)                 9         
                                                                 
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________
self.trial_size 4
stats: delta_p=0.08 delta_n=0.08 rate=2.0 trial_size=4 reference_size=4
       delta=0.00%
Total Cost Reduction:
       4 vs 4 (0.00%)

Search space summary
Default search spa

In [None]:
import keras_tuner
keras_tuner.__version__


'1.0.3'

Now, let's see which model is the best model we got.


In [None]:
qmodel = autoqk.get_best_model()
qmodel.save_weights("qmodel.h5")

name SimpleRNN_kernel_quantizer
name dense_kernel_quantizer
abovetrial_size
target 4
self.trial_size 4
learning_rate: 7.812500371073838e-06
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 SimpleRNN (SimpleRNN)       (None, 8)                 80        
                                                                 
 dense (Dense)               (None, 1)                 9         
                                                                 
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________
self.trial_size 4
stats: delta_p=0.08 delta_n=0.08 rate=2.0 trial_size=4 reference_size=4
       delta=0.00%
Total Cost Reduction:
       4 vs 4 (0.00%)



Here we got a >90% reduction in energy when compared to 8-bit tensors and 32-bit accumulators. Remember that our original number was 3.3 uJ for fp32. The quantized model ends up at 11 nJ, as opposed to 204 nJ for the original 8-bit quantized model. As these energy numbers come from high-level energy models, you should consider the relations between them, and not the actual numbers.

Let's train this model to see how much accuracy we can get out of it.

In [None]:
qmodel.load_weights("qmodel.h5")
with cur_strategy.scope():
  optimizer = Adam(learning_rate=0.001)
  qmodel.compile(optimizer=optimizer, loss="mse", metrics=["mse"])
  qmodel.fit(x_train, y_train, epochs=100, batch_size=64, validation_data=(x_val, y_val),callbacks=[early_stopping])

Epoch 1/100

KeyboardInterrupt: 

One of the problems of trying to quantize the whole thing in one shot is that we may end up with too many choices to make, which will make the search space very large. In order to reduce the search space, `AutoQKeras` has two methods to enable users to cope with the explosion of choices.

## Grouping Layers to Use the Same Choice

In this case, we can provide regular expressions to `limit` to specify layer names that should be grouped together. In our example, suppose we want to group convolution layers (except the first one) and all activations except the last one to use the same quantization.

For the first convolution layer, we want to limit the quantization types to fewer choices, as the input is already an 8-bit number. The last activation will be fed to a feature classifier layer, so we may leave it with more bits. Because our `dense` is actually a `Conv2D` operation, we will enable 8 bits for the weights by layer name.

We first need to look at the names of the layers for this. 

In [None]:
pprint.pprint([layer.name for layer in model.layers])

['SimpleRNN', 'dense']


Convolution layers for `mnist` have names specified as `conv2d_[01234]`. Activation layers have names specified as `act_[01234]`. So, we can create the following regular expressions to reduce the search space in our model.

Please note that layer class names always select different quantizers, so users need to specify a pattern for layer names if they want to use the same quantization for a group of layers.

You can see here another feature of the limit. You can specify the maximum number of bits, or cherry-pick which quantizers you want to try for a specific layer if, instead of the maximum number of bits, you specify a list of quantizers from `quantization_config`.
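Before running a search, it can help to check which layer names each regular expression would capture. The sketch below uses only Python's `re` module on the `mnist`-style names mentioned above (`conv2d_[01234]`, `act_[01234]`, `dense`). How `AutoQKeras` resolves the patterns internally is not shown here, so treat this as a sanity check on the expressions themselves.

```python
import re

# Name patterns intended for use as keys in `limit`.
patterns = ["^conv2d_0$", "^conv2d_[1234]$", "^act_[0123]$", "^act_4$", "^dense$"]

# Layer names as described for the mnist example model.
layer_names = (
    [f"conv2d_{i}" for i in range(5)]
    + [f"act_{i}" for i in range(5)]
    + ["dense"]
)

# For each pattern, collect the layer names it matches.
groups = {p: [n for n in layer_names if re.match(p, n)] for p in patterns}

for pattern, names in groups.items():
    print(pattern, "->", names)
```

Each name should land in exactly one group; a name that appears in no group, or in several, usually points to a typo in a pattern.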

In [None]:
limit = {
    "Dense": [8, 8, 4],
    "Conv2D": [4, 8, 4],
    "DepthwiseConv2D": [4, 8, 4],
    "Activation": [4],
    "BatchNormalization": [],

    "^conv2d_0$": [
                   ["binary", "ternary", "quantized_bits(2,1,1,alpha=1.0)"],
                   8, 4
    ],
    "^conv2d_[1234]$": [4, 8, 4],
    "^act_[0123]$": [4],
    "^act_4$": [8],
    "^dense$": [8, 8, 4]
}

In [None]:
run_config = {
  "output_dir": tempfile.mkdtemp(),
  "goal": goal,
  "quantization_config": quantization_config,
  "learning_rate_optimizer": False,
  "transfer_weights": False,
  "mode": "random",
  "seed": 42,
  "limit": limit,
  "tune_filters": "layer",
  "tune_filters_exceptions": "^dense",
  "distribution_strategy": cur_strategy,
  "layer_indexes": range(1, len(model.layers) - 1),
  "max_trials": 40
}

In [None]:
autoqk = AutoQKeras(model, metrics=["acc"], custom_objects=custom_objects, **run_config)
autoqk.fit(x_train, y_train, epochs=100, batch_size=64, validation_data=(x_val, y_val),callbacks=[early_stopping])

Trial 2 Complete [00h 01m 06s]
val_score: 0.17637935280799866

Best val_score So Far: 0.17637935280799866
Total elapsed time: 00h 02m 12s
INFO:tensorflow:Oracle triggered exit



Let's see the reduction now.

In [None]:
qmodel = autoqk.get_best_model()
qmodel.save_weights("qmodel.h5")

name ^dense$_kernel_quantizer
abovetrial_size
target 4
self.trial_size 4
learning_rate: 7.812500371073838e-06
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 SimpleRNN (SimpleRNN)       (None, 8)                 80        
                                                                 
 dense (Dense)               (None, 1)                 9         
                                                                 
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________
self.trial_size 4
stats: delta_p=0.08 delta_n=0.08 rate=2.0 trial_size=4 reference_size=4
       delta=0.00%
Total Cost Reduction:
       4 vs 4 (0.00%)



Let's train this model for more time to see how much we can get in accuracy.

In [None]:
qmodel.load_weights("qmodel.h5")
with cur_strategy.scope():
  optimizer = Adam(learning_rate=0.02)
  qmodel.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["acc"])
  qmodel.fit(x_train, y_train, epochs=100, batch_size=64, validation_data=(x_val, y_val),callbacks=[early_stopping])

Epoch 1/200
...
Epoch 77/200
Epoch 78

## Quantization by Blocks

In the previous section, we enforced that all decisions were the same in order to reduce the number of options to quantize a model. 

Another approach is to still allow each block of layers to make its own choice, but to quantize the blocks sequentially, either from inputs to outputs, or by quantizing higher-energy blocks first.

The rationale for this method is that if we quantize the blocks one by one, with $B$ blocks and $N$ choices per block, we end up trying $N B$ options instead of $N^B$. The reader should note that this is an approximation, as there is no guarantee that we will obtain the best quantization possible.
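A quick calculation shows how large the gap between the two counts gets. The values of `n_choices` and `n_blocks` below are arbitrary illustrative numbers, not taken from this model.

```python
def search_sizes(n_choices, n_blocks):
    """Return (joint, sequential) option counts for N choices per block and B blocks."""
    joint = n_choices ** n_blocks      # all blocks decided at once: N^B
    sequential = n_choices * n_blocks  # blocks decided one by one: N * B
    return joint, sequential


joint, sequential = search_sizes(n_choices=4, n_blocks=5)
print(joint, sequential)  # → 1024 20
```

The sequential count grows linearly with the number of blocks, while the joint count grows exponentially, which is what makes block-by-block quantization tractable for deeper networks.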

Should you quantize sequentially from inputs to outputs, or start from the block that has the highest impact?

If you have a network like ResNet and you want to do filter tuning, you need to group the layers according to the ResNet definition of a block, i.e. including full identity or convolutional blocks, and quantize the model from inputs to outputs, so that you preserve the number of channels for the residual block at each stage.

In order to perform quantization by blocks, you need to specify two additional parameters in `run_config`. `blocks` is a list of regular expressions that define the groups of layers you want to quantize; a layer that does not match any block pattern will not be quantized. `schedule_block` specifies how block quantization is scheduled: it can be `sequential`, or `cost` if you want to schedule the blocks by decreasing cost (energy or bits).
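Before handing patterns to the scheduler, it can help to check by hand which layers each pattern would capture. The sketch below uses only Python's `re` module. The layer names are hypothetical, and the first-match semantics with `re.match` is an assumption that may differ from what `AutoQKerasScheduler` does internally.

```python
import re

# Block patterns written in the same style as the `blocks` entry of `run_config`.
blocks = [r"^.*_0$", r"^.*_1$", r"^dense"]

# Hypothetical layer names, used only for illustration.
layer_names = ["conv2d_0", "bn_0", "act_0",
               "conv2d_1", "bn_1", "act_1",
               "flatten", "dense", "softmax"]

def group_layers(layer_names, blocks):
    """Assign each layer to the first pattern it matches; collect the rest."""
    groups = {pattern: [] for pattern in blocks}
    unmatched = []
    for name in layer_names:
        for pattern in blocks:
            if re.match(pattern, name):
                groups[pattern].append(name)
                break
        else:
            # No pattern matched: this layer would be left unquantized.
            unmatched.append(name)
    return groups, unmatched

groups, unmatched = group_layers(layer_names, blocks)
for pattern, names in groups.items():
    print(pattern, "->", names)
print("no block:", unmatched)
```

A pattern that matches nothing, or a layer that unexpectedly lands in the "no block" list, is usually a sign that the regular expression needs adjusting.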

In this mode, there are a few optimizations that we perform automatically. First, we dynamically reduce the learning rate of the blocks that we have already quantized. Setting them to non-trainable does not seem to work, so we still allow them to train, but at a slower pace. In addition, we try to dynamically adjust the learning rate of the layer we are trying to quantize relative to the learning rate of the unquantized layers. Finally, we transfer the weights of the models we have already quantized whenever we can (that is, if the shapes remain the same).

Regardless of how we schedule the operations, we amortize the number of trials by the cost of the block (its energy or bits relative to the total energy or number of bits of the network).
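The scheduling and amortization ideas can be sketched in a few lines of plain Python. This is not the AutoQKeras implementation: the helper names, the per-block costs, and the `min_trials` floor are all hypothetical, chosen only to illustrate ordering blocks by cost and giving each block a share of `max_trials` proportional to its cost.

```python
def schedule_blocks(block_costs, mode="sequential"):
    """Return block names in the order they would be quantized."""
    names = list(block_costs)
    if mode == "cost":
        # Highest-cost blocks first; sorted() is stable, so ties keep input order.
        names = sorted(names, key=lambda name: block_costs[name], reverse=True)
    return names

def trials_for_block(block_cost, total_cost, max_trials, min_trials=10):
    """Scale the trial budget by the block's share of the total cost.

    min_trials is a hypothetical floor so that cheap blocks are still explored.
    """
    return max(min_trials, round(max_trials * block_cost / total_cost))

# Hypothetical per-block costs (energy or bits), listed from inputs to outputs.
block_costs = {"block_0": 50, "block_1": 300, "block_2": 150}
total_cost = sum(block_costs.values())

for name in schedule_blocks(block_costs, mode="cost"):
    budget = trials_for_block(block_costs[name], total_cost, max_trials=40)
    print(f"{name}: cost {block_costs[name]} / {total_cost}, trials {budget}")
```

With these made-up numbers, the most expensive block is searched first and receives most of the trial budget, while the cheapest block falls back to the floor.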

Instead of invoking `AutoQKeras`, we will now invoke `AutoQKerasScheduler`.

In [None]:
run_config = {
  "output_dir": tempfile.mkdtemp(),
  "goal": goal,
  "quantization_config": quantization_config,
  "learning_rate_optimizer": False,
  "transfer_weights": False,
  "mode": "random",
  "seed": 42,
  "limit": limit,
  "tune_filters": "layer",
  "tune_filters_exceptions": "^dense",
  "distribution_strategy": cur_strategy,
  "layer_indexes": range(1, len(model.layers) - 1),
  "max_trials": 40,

  "blocks": [
    "^.*_0$",
    "^.*_1$",
    "^.*_2$",
    "^.*_3$",
    "^.*_4$",
    "^dense"
  ],
  "schedule_block": "cost"
}

Because specifying regular expressions is error-prone, we recommend that you first run `AutoQKerasScheduler` in debug mode to print the blocks.

In [None]:
pprint.pprint([layer.name for layer in model.layers])
autoqk = AutoQKerasScheduler(model, metrics=["acc"], custom_objects=custom_objects, debug=True, **run_config)
autoqk.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=1024, epochs=20)

['SimpleRNN', 'dense']
... block cost: 0 / 4
... adjusting max_trials for this block to 10


AssertionError: 

All blocks seem to be fine. Let's find the best quantization now.

In [None]:
autoqk = AutoQKerasScheduler(model, metrics=["acc"], custom_objects=custom_objects, **run_config)
autoqk.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=1024, epochs=20)

Trial 10 Complete [00h 00m 50s]
val_score: 1.4008370637893677

Best val_score So Far: 1.416741967201233
Total elapsed time: 00h 07m 44s
INFO:tensorflow:Oracle triggered exit


Results summary
Results in /tmp/tmp7td4spyh_1/5
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x7f43efea96d0>
Trial summary
Hyperparameters:
^(dense)$_kernel_quantizer: quantized_bits(8,0,1,alpha=1.0)
^(dense)$_bias_quantizer: quantized_bits(8,3,1)
Score: 1.416741967201233
Trial summary
Hyperparameters:
^(dense)$_kernel_quantizer: quantized_bits(4,0,1,alpha=1.0)
^(dense)$_bias_quantizer: quantized_bits(4,0,1)
Score: 1.4149357080459595
Trial summary
Hyperparameters:
^(dense)$_kernel_quantizer: stochastic_binary
^(dense)$_bias_quantizer: quantized_bits(8,3,1)
Score: 1.414434552192688
Trial summary
Hyperparameters:
^(dense)$_kernel_quantizer: binary
^(dense)$_bias_quantizer: quantized_bits(4,0,1)
Score: 1.4135743379592896
Trial summary
Hyperparameters:
^(dense)$_kernel_quantizer: stochastic_binary
^(dense)$_bias_quantizer: quantized_bits(4,0,1)
Score: 1.4087834358215332
Trial summary
Hyperparameters:
^(dense)$_kernel_quantizer: ternary
^(dense)$_bias_quantizer: 



INFO:tensorflow:Assets written to: /tmp/tmp7td4spyh_1/model_block_5/assets


In [None]:
qmodel = autoqk.get_best_model()
qmodel.save_weights("qmodel.h5")

stats: delta_p=0.08 delta_n=0.08 rate=2.0 trial_size=11506 reference_size=574175
       delta=45.13%
Total Cost Reduction:
       11506 vs 574175 (-98.00%)
conv2d_0             f=8 binary(alpha='auto_po2') 
bn_0                 QBN, mean=[0. 0. 0. 0. 0. 0. 0. 0.]
act_0                quantized_relu(4,2)
conv2d_1             f=16 binary(alpha='auto_po2') 
bn_1                 QBN, mean=[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
act_1                quantized_relu(3,1)
conv2d_2             f=24 quantized_bits(4,0,1,alpha=1.0) 
bn_2                 QBN, mean=[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
act_2                quantized_relu_po2(4,4)
conv2d_3             f=32 quantized_bits(4,0,1,alpha=1.0) 
bn_3                 QBN, mean=[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
act_3                quantized_relu(3,1)
conv2d_4             f=64 ternary(alpha='auto_po2') 
bn_4                 QBN, mea

In [None]:
qmodel.load_weights("qmodel.h5")
with cur_strategy.scope():
  optimizer = Adam(learning_rate=0.02)
  qmodel.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["acc"])
  qmodel.fit(x_train, y_train, epochs=200, batch_size=4096, validation_data=(x_test, y_test))

Epoch 1/200
...
Epoch 77/200
Epoch 78

Perfect! You have learned how to perform automatic quantization using AutoQKeras with QKeras.