# Part 4: Quantization

## Load the CIFAR-10 dataset from tf.keras

In [None]:
import tensorflow as tf
import numpy as np
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# CIFAR-10 images already have shape (32, 32, 3), so no channel axis needs to be added
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

### Let's print some information about the dataset
Print the dataset shapes.

In [None]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

## Construct a model
This time we're going to use QKeras layers.
QKeras is "Quantized Keras" for deep heterogeneous quantization of ML models.

https://github.com/google/qkeras

It is maintained by Google, and support for QKeras models was recently added to hls4ml.

In [None]:
from qkeras.qlayers import QDense, QActivation
from qkeras.quantizers import quantized_bits, quantized_relu
from qkeras.qconvolutional import QConv2D
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout,Flatten,InputLayer, MaxPooling2D, Activation

model = Sequential()
input_shape = (32, 32, 3)
model.add(InputLayer(input_shape=input_shape))
model.add(QConv2D(16, kernel_size=(3, 3),kernel_quantizer=quantized_bits(6,0,alpha=1),  bias_quantizer=quantized_bits(6,0,alpha=1)))
model.add(QActivation(activation=quantized_relu(6)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(QConv2D(16, kernel_size=(3, 3),kernel_quantizer=quantized_bits(6,0,alpha=1),  bias_quantizer=quantized_bits(6,0,alpha=1)))
model.add(QActivation(activation=quantized_relu(6)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(QConv2D(16, kernel_size=(3, 3),kernel_quantizer=quantized_bits(6,0,alpha=1),  bias_quantizer=quantized_bits(6,0,alpha=1)))
model.add(QActivation(activation=quantized_relu(6)))
model.add(Flatten())
model.add(QDense(10,kernel_quantizer=quantized_bits(6,0,alpha=1),  bias_quantizer=quantized_bits(6,0,alpha=1)))
model.add(Activation(activation='softmax'))

model.build()
model.summary()
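
To make the `quantized_relu(6)` activations used above more concrete, here is a minimal NumPy sketch of rounding values onto an unsigned 6-bit grid. The helper `fake_quantized_relu` is hypothetical and is not QKeras' implementation; it only illustrates the forward rounding and clipping, and ignores how gradients are handled during training.

```python
import numpy as np

def fake_quantized_relu(x, bits=6, integer=0):
    """Clip to [0, 2**integer) and round onto a grid of 2**bits levels.

    Hypothetical helper for illustration; not QKeras' implementation.
    """
    step = 2.0 ** (integer - bits)      # spacing between representable values
    upper = 2.0 ** integer - step       # largest representable value
    return np.clip(np.round(x / step) * step, 0.0, upper)

activations = np.array([-0.3, 0.0, 0.01, 0.5, 0.7, 1.2])
quantized = fake_quantized_relu(activations)
print(quantized)

# Count the distinct values the quantizer can produce on a dense input sweep.
levels = np.unique(fake_quantized_relu(np.linspace(-1.0, 2.0, 30001)))
print(len(levels))
```

With 6 bits and no integer bits, the grid spacing is 2^-6, negative inputs are clipped to 0, and large inputs saturate just below 1.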


## Train sparse
Let's train with model sparsity again, since QKeras layers are prunable.

In [None]:
from tensorflow_model_optimization.python.core.sparsity.keras import prune, pruning_callbacks, pruning_schedule
from tensorflow_model_optimization.sparsity.keras import strip_pruning
pruning_params = {"pruning_schedule" : pruning_schedule.ConstantSparsity(0.8, begin_step=0, frequency=100)}
model = prune.prune_low_magnitude(model, **pruning_params)
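
As a rough picture of what low-magnitude pruning at a constant sparsity does to a single weight tensor, the following NumPy sketch zeroes the 80% of entries with the smallest absolute value. The function `prune_by_magnitude` is a hypothetical stand-in, not the TensorFlow Model Optimization code; the real wrapper keeps a mask per layer and updates it during training according to the pruning schedule.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude."""
    magnitudes = np.abs(weights).ravel()
    n_prune = int(round(sparsity * magnitudes.size))
    if n_prune == 0:
        return weights.copy()
    # The n_prune-th smallest magnitude is the cut-off; ties at the cut-off are pruned too.
    threshold = np.partition(magnitudes, n_prune - 1)[n_prune - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(10, 100))       # stand-in for a layer's kernel
pruned = prune_by_magnitude(kernel, sparsity=0.8)
print("fraction of zeros:", np.mean(pruned == 0.0))
```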


## Train the model
We'll use the same settings as for the model in part 1: the Adam optimizer with categorical crossentropy loss.
The callbacks will decay the learning rate and save the model into the directory 'model_cifar10_cnn4'.
The model isn't very complex, so this should take just a few minutes even on the CPU.
If you've restarted the notebook kernel after training once, set `train = False` to load the trained model rather than training again.
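
The `all_callbacks` helper itself is not shown in this notebook. Assuming its `lr_factor`, `lr_patience` and `lr_minimum` arguments behave like the corresponding options of a reduce-on-plateau scheduler, the learning-rate decay can be sketched in plain Python as follows. The function `simulate_lr_decay` is hypothetical, and it ignores the cooldown and epsilon settings.

```python
import math

def simulate_lr_decay(val_losses, lr=1e-4, factor=0.5, patience=10, min_lr=1e-7):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss          # improvement: reset the patience counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # decay, but never below the floor
                wait = 0
        history.append(lr)
    return history

# Two improving epochs followed by a 20-epoch plateau.
losses = [1.0, 0.9] + [0.9] * 20
rates = simulate_lr_decay(losses)
print(rates[0], rates[-1])
```

In this sketch the learning rate is halved each time the validation loss fails to improve for 10 consecutive epochs.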

In [None]:
from tensorflow.keras.optimizers import Adam
from callbacks import all_callbacks

train = True


if train:
    adam = Adam(learning_rate=0.0001)  # `lr` is a deprecated alias of `learning_rate`
    model.compile(optimizer=adam, loss=['categorical_crossentropy'], metrics=['accuracy'])
    callbacks = all_callbacks(stop_patience = 1000,
                              lr_factor = 0.5,
                              lr_patience = 10,
                              lr_epsilon = 0.000001,
                              lr_cooldown = 2,
                              lr_minimum = 0.0000001,
                              outputDir = 'model_cifar10_cnn4')
    callbacks.callbacks.append(pruning_callbacks.UpdatePruningStep())
    model.fit(x_train, y_train, batch_size=128,
              epochs=100, validation_split=0.2, shuffle=True,
              callbacks = callbacks.callbacks)
    model = strip_pruning(model)
    model.save('model_cifar10_cnn4/KERAS_check_best_model.h5')
else:
    from qkeras.utils import load_qmodel
    model = load_qmodel('model_cifar10_cnn4/KERAS_check_best_model.h5')
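
Once the pruning wrappers have been stripped, it can be worth checking that the kernels really are sparse. A minimal sketch, using stand-in arrays and a hypothetical helper `zero_fractions`:

```python
import numpy as np

def zero_fractions(kernels):
    """Map each name to the fraction of exactly-zero entries in its array."""
    return {name: float(np.mean(np.asarray(w) == 0.0)) for name, w in kernels.items()}

# Stand-in arrays; in the notebook these would come from the trained layers.
example = {
    "conv": np.array([[0.0, 0.5], [0.0, 0.0]]),
    "dense": np.array([0.25, 0.0, -0.125, 0.0]),
}
report = zero_fractions(example)
print(report)
```

With the notebook's `model`, something like `{layer.name: layer.get_weights()[0] for layer in model.layers if layer.get_weights()}` could collect the arrays to pass in, assuming the first weight array of each layer is its kernel.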

## Check performance
How does this model, trained with 6-bit quantization and 80% sparsity (the `ConstantSparsity(0.8, ...)` schedule above), compare against the original model? Let's report the accuracy and make a ROC curve. The unpruned baseline model from part 1 is shown with solid lines, and the quantized, pruned model is shown with dashed lines.


We should also check that hls4ml can respect the choice to use 6 bits throughout the model and match the accuracy. We'll generate a configuration from this quantized model and plot its performance as the dotted line.
The generated configuration is printed out. You'll notice that it uses 7 bits for the type, even though we specified 6. That's because QKeras doesn't count the sign bit when we specify the number of bits, so the type that actually gets used needs 1 more.
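
To see why one extra bit is needed, it helps to work out what a signed fixed-point type can represent. In `ap_fixed<W,I>`, `W` is the total width and `I` is the number of integer bits including the sign. The helper `ap_fixed_range` below is a hypothetical illustration, not part of hls4ml.

```python
def ap_fixed_range(width, integer):
    """Return (min, max, step) of a signed ap_fixed<width, integer> type."""
    step = 2.0 ** -(width - integer)
    lowest = -(2.0 ** (integer - 1))
    highest = 2.0 ** (integer - 1) - step
    return lowest, highest, step

for width in (6, 7):
    lo, hi, step = ap_fixed_range(width, 1)
    print(f"ap_fixed<{width},1>: [{lo}, {hi}] in steps of {step}")
```

An unsigned 6-bit activation with no integer bits takes values up to 1 - 2^-6 in steps of 2^-6. The signed 6-bit type has a coarser step of 2^-5, so the signed 7-bit type is the smallest one here that covers that grid.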

We also use the `OutputRoundingSaturationMode` optimizer pass of `hls4ml` to set the Activation layers to round, rather than truncate, the cast. This is important for getting good model accuracy when using small bit precision activations. And we'll set a different data type for the tables used in the Softmax, just for a bit of extra performance.
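
The effect of rounding versus truncating can be illustrated with a small NumPy sketch. Here `AP_TRN` is modelled as rounding towards minus infinity and `AP_RND` as rounding to the nearest value with ties going up. The function `to_fixed` is a hypothetical helper, and saturation is left out.

```python
import numpy as np

def to_fixed(x, frac_bits, rounding="AP_TRN"):
    """Cast x onto a grid with `frac_bits` fractional bits (no saturation)."""
    scaled = np.asarray(x, dtype=float) * 2.0 ** frac_bits
    if rounding == "AP_RND":
        scaled = np.floor(scaled + 0.5)   # round to nearest, ties towards plus infinity
    else:
        scaled = np.floor(scaled)         # truncate towards minus infinity
    return scaled / 2.0 ** frac_bits

x = np.linspace(0.0, 0.9, 1000)
err_trn = np.mean(to_fixed(x, 6, "AP_TRN") - x)
err_rnd = np.mean(to_fixed(x, 6, "AP_RND") - x)
print(f"mean error, truncation: {err_trn:+.5f}")
print(f"mean error, rounding:   {err_rnd:+.5f}")
```

Truncation always rounds down, so its error has a systematic negative bias of about half a grid step, while the rounding error averages out close to zero.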


**Make sure you've trained the model from part 1**

In [None]:
import hls4ml
from hls4ml.converters.keras_to_hls import keras_to_hls
import plotting
import yaml

hls4ml.model.optimizer.OutputRoundingSaturationMode.layers = ['Activation']
hls4ml.model.optimizer.OutputRoundingSaturationMode.rounding_mode = 'AP_RND'
hls4ml.model.optimizer.OutputRoundingSaturationMode.saturation_mode = 'AP_SAT'



config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Backend']='VivadoAccelerator'
config['OutputDir'] = 'cifar10-hls-test4'
config['ProjectName'] = 'myproject_cifar10_cnn4'
config['XilinxPart']= 'xczu7ev-ffvc1156-2-e'
config['Board'] = 'zcu104'
config['ClockPeriod'] = 5
config['IOType'] = 'io_stream'
config['HLSConfig']={}
config['HLSConfig']['Model']={}
config['HLSConfig']['Model']=config['Model']
config['HLSConfig']['LayerName']=config['LayerName']

del config['Model']
del config['LayerName']
config['AcceleratorConfig']={}
config['AcceleratorConfig']['Interface'] = 'axi_stream'
config['AcceleratorConfig']['Driver'] = 'python'
config['AcceleratorConfig']['Precision']={}
config['AcceleratorConfig']['Precision']['Input']= 'float'
config['AcceleratorConfig']['Precision']['Output']= 'float'
config['KerasModel'] = model
# Layer names carry a session-dependent counter (e.g. 'q_conv2d_29'), so look
# them up from the model instead of hard-coding them
for layer in model.layers:
    if layer.name.startswith(('q_conv2d', 'q_dense')):
        config['HLSConfig']['LayerName'][layer.name]['ReuseFactor'] = 8

print("-----------------------------------")
print("Configuration")
plotting.print_dict(config)
print("-----------------------------------")
hls_model = keras_to_hls(config)
hls_model.compile()
y_qkeras = model.predict(x_test)
x_test = np.ascontiguousarray(x_test)
y_hls = hls_model.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import load_model

model_ref = load_model('model_cifar10_cnn/KERAS_check_best_model.h5')
y_ref = model_ref.predict(x_test)

print("Accuracy baseline:  {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_ref, axis=1))))
print("Accuracy pruned, quantized: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_qkeras, axis=1))))
print("Accuracy hls4ml: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_hls, axis=1))))
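
Besides accuracy against the labels, it can be useful to check how often two models pick the same class, for example the QKeras model and its hls4ml counterpart. A minimal NumPy sketch with made-up predictions; the helpers `top1_accuracy` and `agreement` are hypothetical.

```python
import numpy as np

def top1_accuracy(y_true_onehot, y_pred):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(y_true_onehot, axis=1) == np.argmax(y_pred, axis=1)))

def agreement(y_a, y_b):
    """Fraction of samples on which two sets of predictions pick the same class."""
    return float(np.mean(np.argmax(y_a, axis=1) == np.argmax(y_b, axis=1)))

# Made-up data: 4 samples, 3 classes.
labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
pred_a = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.4, 0.3], [0.1, 0.6, 0.3]])
pred_b = np.array([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]])

print("accuracy a:", top1_accuracy(labels, pred_a))
print("accuracy b:", top1_accuracy(labels, pred_b))
print("agreement: ", agreement(pred_a, pred_b))
```

In the notebook, `agreement(y_qkeras, y_hls)` would give the fraction of test images on which the two models predict the same class.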


In [None]:
import matplotlib.pyplot as plt
cifar10_classes=['0','1','2','3','4','5','6','7','8','9']

fig, ax = plt.subplots(figsize=(9, 9))
_ = plotting.makeRoc(y_test, y_ref, cifar10_classes)
plt.gca().set_prop_cycle(None) # reset the colors
_ = plotting.makeRoc(y_test, y_qkeras, cifar10_classes, linestyle='--')
plt.gca().set_prop_cycle(None) # reset the colors
_ = plotting.makeRoc(y_test, y_hls,cifar10_classes, linestyle=':')

from matplotlib.lines import Line2D
lines = [Line2D([0], [0], ls='-'),
         Line2D([0], [0], ls='--'),
         Line2D([0], [0], ls=':')]
from matplotlib.legend import Legend
leg = Legend(ax, lines, labels=['baseline', 'pruned, quantized', 'hls4ml'],
            loc='lower left', frameon=False)
ax.add_artist(leg)

## Synthesize
Now let's synthesize this quantized, pruned model.

**The synthesis will take a while**

While the C synthesis is running, we can monitor the progress by looking at the log file. Open a terminal from the notebook home and execute:

`tail -f cifar10-hls-test4/vivado_hls.log`

In [None]:
import os
os.environ['PATH'] = '/workspace/home/Xilinx/Vivado/2019.2/bin:' + os.environ['PATH']
hls_model.build(csim=False,synth=True,export=True)

## Check the reports
Print out the reports generated by Vivado HLS. Pay attention to the 'Utilization Estimates' section in particular this time.

In [None]:
hls4ml.report.read_vivado_report(config['OutputDir'])

Print the report for the model trained in part 1. Compared to that model, this one has been trained with low-precision quantization and 80% pruning. You should be able to see that we have saved a lot of resources compared to where we started in part 1. At the same time, referring to the ROC curve above, the model performance is pretty much identical even with this drastic compression!

**Note you need to have trained and synthesized the model from part 1**

In [None]:
hls4ml.report.read_vivado_report('cifar10-hls-test')

Print the report for the model trained in part 3. Both of these models were trained with pruning, but the new model uses 6-bit precision as well. You can see how Vivado HLS has moved multiplication operations from DSPs into LUTs, reducing the "critical" resource usage.

**Note you need to have trained and synthesized the model from part 3**

In [None]:
hls4ml.report.read_vivado_report('cifar10-hls-test3')