# Intel® Neural Compressor Sample for TensorFlow*

## Agenda
- Train a convolutional neural network (CNN) model by using Keras
- Quantize the Keras model by using Intel® Neural Compressor
- Compare the quantized model with the original model

## Setup

Import the Python packages and verify that the correct versions are installed.

Required packages:
- TensorFlow 2.2 or later
- Intel® Neural Compressor 1.2.1 or later
- Matplotlib

**Note**: This code sample supports both the current package name for Intel® Neural Compressor (**neural_compressor**) and the old names (**lpot**, **ilit**).
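
One way to check the minimum versions listed above is to compare parsed version tuples. The helper below is a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical names that are not part of this sample, and the parsing deliberately ignores suffixes such as `rc1`.

```python
import re


def version_tuple(version):
    """Turn '2.11.0' into (2, 11, 0); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.2' and '2.2.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("2.11.0", "2.2"))    # TensorFlow requirement
print(meets_minimum("1.14.2", "1.2.1"))  # Intel Neural Compressor requirement
```

Comparing the raw strings would rank `2.11.0` below `2.2`, which is why the sketch compares integer tuples.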

In [1]:
import tensorflow as tf
print("Tensorflow version {}".format(tf.__version__))
tf.compat.v1.enable_eager_execution()

try:
    import neural_compressor as inc
    print("neural_compressor version {}".format(inc.__version__))
except ImportError:
    # Fall back to the older package names.
    try:
        import lpot as inc
        print("LPOT version {}".format(inc.__version__))
    except ImportError:
        import ilit as inc
        print("iLiT version {}".format(inc.__version__))

import matplotlib.pyplot as plt
import numpy as np

from IPython import display

2023-03-19 16:32:13.347566: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Tensorflow version 2.11.0
neural_compressor version 1.14.2


### Environment Setting

For Intel Optimized TensorFlow 2.5.0 and later, you must set the **TF_ENABLE_MKL_NATIVE_FORMAT=0** environment variable before running Intel® Neural Compressor to quantize the FP32 model or before deploying the quantized model.

In [2]:
%env TF_ENABLE_MKL_NATIVE_FORMAT=0

env: TF_ENABLE_MKL_NATIVE_FORMAT=0
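
The `%env` magic is only available in a notebook. In a plain Python script, the same setting could be applied through `os.environ`, as in the minimal sketch below. It assumes the variable must be set before TensorFlow is imported, in line with the requirement stated above.

```python
import os

# Set the variable first; the TensorFlow import is assumed to come afterwards.
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"

# import tensorflow as tf  # import only after the variable is set

print(os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"])
```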


## Train a CNN Model Based on Keras

The `alexnet.py` script provides the functions for training a CNN model.

### Dataset
Use the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset of handwritten digits. 
Load the dataset.

In [3]:
import alexnet
 
data = alexnet.read_data()
x_train, y_train, label_train, x_test, y_test, label_test = data
print('train', x_train.shape, y_train.shape, label_train.shape)
print('test', x_test.shape, y_test.shape, label_test.shape)


Loading data ...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Done
train (60000, 28, 28, 1) (60000, 10) (60000,)
test (10000, 28, 28, 1) (10000, 10) (10000,)
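
The shapes above suggest that `y_train` holds one-hot vectors with 10 entries and `label_train` holds the integer class IDs. That reading is an assumption, because `alexnet.read_data()` is not shown here. The sketch below illustrates the relationship between the two encodings with plain Python lists; `one_hot` and `to_label` are hypothetical helpers.

```python
def one_hot(label, classes=10):
    """Encode an integer class ID as a one-hot list of floats."""
    vec = [0.0] * classes
    vec[label] = 1.0
    return vec


def to_label(vec):
    """Recover the class ID as the index of the largest entry."""
    return max(range(len(vec)), key=vec.__getitem__)


encoded = one_hot(7)
print(encoded)
print(to_label(encoded))
```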


### Build Model

Build an AlexNet-like CNN model by using the Keras API in TensorFlow.
Use the Keras `summary()` method to print the model structure.

In [4]:
classes = 10
width = 28
channels = 1

model = alexnet.create_model(width, channels, classes)

model.summary()

2023-03-19 16:33:14.105699: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 28, 28, 96)        11712     
                                                                 
 activation (Activation)     (None, 28, 28, 96)        0         
                                                                 
 max_pooling2d (MaxPooling2D  (None, 14, 14, 96)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 14, 14, 256)       614656    
                                                                 
 activation_1 (Activation)   (None, 14, 14, 256)       0         
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 7, 7, 256)        0         
 2D)                                                    

### Train the Model with the Dataset

Set the **epochs** parameter to **1**.

In [5]:
epochs = 1

alexnet.train_mod(model, data, epochs)

Test score: 0.05416478216648102
Test accuracy: 0.9810000061988831


### Freeze and Save Model to Single PB

Set the input node name to **x**.

In [6]:
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

def save_frezon_pb(model, mod_path):
    # Convert Keras model to ConcreteFunction
    full_model = tf.function(lambda x: model(x))
    concrete_function = full_model.get_concrete_function(
        x=tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

    # Get frozen ConcreteFunction
    frozen_model = convert_variables_to_constants_v2(concrete_function)

    # Generate frozen pb
    tf.io.write_graph(graph_or_graph_def=frozen_model.graph,
                      logdir=".",
                      name=mod_path,
                      as_text=False)
fp32_frezon_pb_file = "fp32_frezon.pb"
save_frezon_pb(model, fp32_frezon_pb_file)

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


2023-03-19 16:43:32.488053: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2023-03-19 16:43:32.488177: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session


In [7]:
%ls -la fp32_frezon.pb

-rw------- 1 u181188 u181188 27760621 Mar 19 16:43 fp32_frezon.pb


## Quantize FP32 Model by Using Intel® Neural Compressor

Intel® Neural Compressor can quantize the model with a validation dataset for tuning.
As a result, it returns a frozen quantized INT8 model.
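
To make the idea of INT8 quantization concrete, the sketch below applies generic symmetric per-tensor quantization to a short list of FP32 values. It illustrates the arithmetic only. It is not Intel® Neural Compressor's implementation, which also calibrates and tunes against the dataset, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # illustrative values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision that INT8 inference trades for speed.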

The Python script `inc_quantize_model.py` calls Intel® Neural Compressor to perform the whole quantization job.
The following code explains each step.

### Define Dataloader

The **Dataloader** class provides an `__iter__` method that yields the images and labels in batches of `batch_size`.
It uses the validation data of the MNIST dataset.

In [8]:
import mnist_dataset
import math


class Dataloader(object):
    """Iterate over the MNIST test set in batches of (images, labels)."""

    def __init__(self, batch_size):
        self.batch_size = batch_size

    def __iter__(self):
        x_train, y_train, label_train, x_test, y_test, label_test = mnist_dataset.read_data()
        batch_nums = math.ceil(len(x_test) / self.batch_size)

        # Slicing past the end is safe, so the last batch may simply be shorter.
        for i in range(batch_nums):
            begin = i * self.batch_size
            end = begin + self.batch_size
            yield x_test[begin:end], label_test[begin:end]
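
The batch boundaries that such a dataloader produces can be checked without loading any data. The sketch below reproduces only the index arithmetic; `batch_bounds` is a hypothetical helper, not part of the sample.

```python
import math


def batch_bounds(total, batch_size):
    """Yield (begin, end) index pairs that cover `total` items."""
    for i in range(math.ceil(total / batch_size)):
        begin = i * batch_size
        # The final batch is shorter when total is not a multiple of batch_size.
        yield begin, min(begin + batch_size, total)


print(list(batch_bounds(10, 4)))
print(len(list(batch_bounds(10000, 200))))
```

With the 10,000 MNIST test images and a batch size of 200, the iteration yields 50 full batches.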

### Define Load FP32 Model
Load the FP32 model that we saved in the previous step. See the `alexnet.load_pb(in_model)` function.

In [9]:
display.Code('alexnet.py')

### Define YAML File

We created `alexnet.yaml` to save the necessary parameters for Intel® Neural Compressor.
In this case, we only need to change the input/output according to the FP32 model.

The input node name is **x**.

The output name is **Identity**.

In [10]:
display.Code('alexnet.yaml')

### Define Tuning Function
We follow the template to create the tuning function. The function will return a frozen quantized model (INT8 model).

In [11]:
def auto_tune(input_graph_path, yaml_config, batch_size):
    fp32_graph = alexnet.load_pb(input_graph_path)
    quan = inc.Quantization(yaml_config)
    dataloader = Dataloader(batch_size)
    assert dataloader
    q_model = quan(
        fp32_graph,
        q_dataloader=dataloader,
        eval_func=None,
        eval_dataloader=dataloader)
    return q_model


def save_int8_frezon_pb(q_model, path):
    from tensorflow.python.platform import gfile
    # Use a context manager so the file is flushed and closed after writing.
    with gfile.GFile(path, 'wb') as f:
        f.write(q_model.as_graph_def().SerializeToString())
    print("Save to {}".format(path))
    
yaml_file = "alexnet.yaml"
batch_size = 200
int8_pb_file = "alexnet_int8_model.pb"

### Call Function to Quantize the Model

Show the code in `inc_quantize_model.py`.

In [12]:
display.Code('inc_quantize_model.py')

Execute `inc_quantize_model.py` to run the whole process of quantizing a model.

In [13]:
!python inc_quantize_model.py

Traceback (most recent call last):
  File "inc_quantize_model.py", line 11, in <module>
    import ilit as inc
ImportError: No module named ilit


On success, the script creates the file `alexnet_int8_model.pb`, which contains the quantized model.

## Compare Quantized Model

The script `profiling_inc.py` tests the performance of a PB model.

Performance data measured inside the Jupyter notebook is not reliable, so we run the script as a separate process.
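
Outside a notebook, where the `!python` shell escape is not available, a script can be started as a separate process with the standard `subprocess` module. The following is a minimal sketch; `run_script` is a hypothetical helper, and the inline `-c` command stands in for a real script such as `profiling_inc.py`.

```python
import subprocess
import sys


def run_script(args):
    """Run the current Python interpreter with `args` and return its stdout."""
    result = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits with an error
    )
    return result.stdout


# Stand-in for a real call such as run_script(["profiling_inc.py", "--index=32"]).
print(run_script(["-c", "print('child process finished')"]))
```

Using `sys.executable` keeps the child process on the same interpreter as the caller, which may help when `python` on the PATH resolves to a different environment that lacks the required packages.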

Let's take a look at `profiling_inc.py`.

In [14]:
display.Code('profiling_inc.py')
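
The content of `profiling_inc.py` is not reproduced here. As a rough illustration of how throughput and latency are commonly derived from timed runs, the sketch below times a dummy workload. The `measure` function, its parameters, and the dummy workload are assumptions for illustration; only the `throughput` and `latency` key names come from the result files used later in this sample.

```python
import json
import time


def measure(run_batch, batch_size, iterations=20, warmup=5):
    """Time `run_batch` and derive per-batch latency and throughput."""
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        run_batch()
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch()
    elapsed = time.perf_counter() - start
    latency = elapsed / iterations      # seconds per batch
    throughput = batch_size / latency   # samples per second
    return {"latency": latency, "throughput": throughput}


def dummy_batch():
    """Stand-in for one inference call on a batch."""
    sum(i * i for i in range(10000))


res = measure(dummy_batch, batch_size=32)
print(json.dumps(res))
```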

Execute `profiling_inc.py` with the FP32 model file:

In [15]:
!python profiling_inc.py --input-graph=./fp32_frezon.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=32

Traceback (most recent call last):
  File "profiling_inc.py", line 2, in <module>
    import tensorflow as tf
ImportError: No module named tensorflow


Execute `profiling_inc.py` with the INT8 model file:

In [16]:
!python profiling_inc.py --input-graph=./alexnet_int8_model.pb --omp-num-threads=4 --num-inter-threads=1 --num-intra-threads=4 --index=8

Traceback (most recent call last):
  File "profiling_inc.py", line 2, in <module>
    import tensorflow as tf
ImportError: No module named tensorflow


In [17]:
display.Code('32.json')
!echo " "
display.Code('8.json')

 


Execute the functions to load and show the performance data from the `32.json` and `8.json` files.

In [18]:
import json

def autolabel(ax, rects):
    """
    Attach a text label above each bar displaying its height
    """
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%0.2f' % float(height),
                ha='center', va='bottom')

def draw_bar(x, t, y, subplot, color, x_lab, y_lab, width=0.2):
    plt.subplot(subplot)
    plt.xticks(x, t)
    ax1 = plt.gca()
    ax1.set_xlabel(x_lab)
    ax1.set_ylabel(y_lab, color=color)
    rects1 = ax1.bar(x, y, color=color, width=width)
    ax1.tick_params(axis='y', labelcolor=color)
    autolabel(ax1, rects1)

def load_res(json_file):
    with open(json_file) as f:
        data = json.load(f)
        return data

res_32 = load_res('32.json')
res_8 = load_res('8.json')
   
accuracys = [res_32['accuracy'], res_8['accuracy']]
throughputs = [res_32['throughput'], res_8['throughput']]             
latencys = [res_32['latency'], res_8['latency']]

print('throughputs', throughputs)
print('latencys', latencys)
print('accuracys', accuracys)

accuracys_perc = [accu*100 for accu in accuracys]

t = ['FP32', 'INT8']
x = [0, 1]
plt.figure(figsize=(16,6))
draw_bar(x, t, throughputs, 131, 'tab:green', 'Throughput (fps)', '', width=0.2)
draw_bar(x, t, latencys, 132, 'tab:blue', 'Latency (s)', '', width=0.2)
draw_bar(x, t, accuracys_perc, 133, '#28a99d', 'Accuracy (%)', '', width=0.2)
plt.show()

FileNotFoundError: [Errno 2] No such file or directory: '32.json'

### FP32 vs INT8

Compare the performance data of the INT8 model with that of the FP32 model.

In [None]:
throughputs_times = [1, throughputs[1]/throughputs[0]]
latencys_times = [1, latencys[1]/latencys[0]]
accuracys_times = [0, accuracys_perc[1] - accuracys_perc[0]]

print('throughputs_times', throughputs_times)
print('latencys_times', latencys_times)
print('accuracys_times', accuracys_times)

plt.figure(figsize=(16,6))
draw_bar(x, t, throughputs_times, 131, 'tab:green', 'Throughput Comparison (higher is better)', '', width=0.2)
draw_bar(x, t, latencys_times, 132, 'tab:blue', 'Latency Comparison (lower is better)', '', width=0.2)
draw_bar(x, t, accuracys_times, 133, '#28a99d', 'Accuracy Change (%)', '', width=0.2)
plt.show()
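
The same comparison can be expressed as a small function that works on two result dictionaries. The sketch below is illustrative only: `compare` is a hypothetical helper, and the numbers are made-up placeholders, not measurements from this sample.

```python
def compare(fp32, int8):
    """Return INT8 results relative to the FP32 baseline."""
    return {
        # A value above 1 means the INT8 model processes more samples per second.
        "throughput_ratio": int8["throughput"] / fp32["throughput"],
        # A value below 1 means the INT8 model answers faster.
        "latency_ratio": int8["latency"] / fp32["latency"],
        # A negative value means the INT8 model lost accuracy.
        "accuracy_delta_pct": (int8["accuracy"] - fp32["accuracy"]) * 100,
    }


# Made-up numbers for illustration only; they are not measured results.
fp32 = {"throughput": 1000.0, "latency": 0.004, "accuracy": 0.98}
int8 = {"throughput": 2500.0, "latency": 0.002, "accuracy": 0.975}

summary = compare(fp32, int8)
for key, value in summary.items():
    print(key, round(value, 3))
```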

## Conclusion

In [None]:
print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]")

In this code sample, we compared the performance of the FP32 and INT8 models and demonstrated that the INT8 model is faster.

Second Generation Intel® Xeon® Scalable processors provide Intel® Deep Learning Boost, which speeds up INT8 inference.