# TensorFlow-TensorRT INT8 inference example from a checkpoint
In this notebook, we demonstrate how to create a TF-TensorRT (TF-TRT) optimized model starting from a TensorFlow checkpoint, exporting each variant as a *SavedModel*.
This notebook has been tested in the NVIDIA NGC TensorFlow container `nvcr.io/nvidia/tensorflow:19.04-py3`, which can be downloaded from http://ngc.nvidia.com.

### Data
We use the ImageNet dataset stored in TFRecord format. Google provides an excellent all-in-one script for downloading and preparing the ImageNet dataset at https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh.
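A TFRecord file is a sequence of length-prefixed, checksummed records; in the ImageNet shards each record holds one serialized `tf.train.Example`. The pure-Python sketch below is not TensorFlow's implementation and is only meant to illustrate the framing that `tf.data.TFRecordDataset` reads: an 8-byte little-endian length, a masked CRC-32C of that length, the payload, then a masked CRC-32C of the payload. The helper names (`write_record`, `read_records`, `masked_crc`) are hypothetical, and the sketch does not decode the `Example` protobuf inside each record.

```python
import io
import struct


def _make_crc32c_table():
    """Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """TFRecord stores the CRC rotated right by 15 bits plus a constant, as 32 bits."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, data: bytes) -> None:
    length = struct.pack("<Q", len(data))  # uint64 little-endian payload length
    stream.write(length)
    stream.write(struct.pack("<I", masked_crc(length)))
    stream.write(data)
    stream.write(struct.pack("<I", masked_crc(data)))


def read_records(stream):
    """Yield the payload of each record, verifying both checksums."""
    while True:
        length = stream.read(8)
        if not length:
            return  # clean end of file
        (n,) = struct.unpack("<Q", length)
        (length_crc,) = struct.unpack("<I", stream.read(4))
        if length_crc != masked_crc(length):
            raise ValueError("corrupted record length")
        data = stream.read(n)
        (data_crc,) = struct.unpack("<I", stream.read(4))
        if data_crc != masked_crc(data):
            raise ValueError("corrupted record data")
        yield data


if __name__ == "__main__":
    buf = io.BytesIO()
    for payload in [b"first record", b"", b"third record"]:
        write_record(buf, payload)
    buf.seek(0)
    print(list(read_records(buf)))
```

On a real shard, the file would be opened in binary mode (`open(path, "rb")`) and passed to `read_records`; each payload would then still need to be parsed as a `tf.train.Example`.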

### Checkpoint
We run this demonstration with a ResNet checkpoint from the TensorFlow model zoo, https://github.com/tensorflow/models/tree/master/official/resnet.

To run this notebook, start the NGC TF container, providing the correct paths to the ImageNet validation data and to the directory that holds (or will hold) the TF checkpoint:

```bash
nvidia-docker run -it -p 8888:8888 -v /path/to/image_net/:/data  -v /path/to/saved_model:/saved_model --name TFTRT nvcr.io/nvidia/tensorflow:19.04-py3
```

This repository can then be cloned to `/workspace`:

```bash
git clone https://github.com/vinhngx/tftrt-examples
```

Then start Jupyter Notebook within the container:

```bash
cd tftrt-examples
jupyter notebook --ip 0.0.0.0 --port 8888  --allow-root
```

Connect to the Jupyter Notebook web interface from your local host at http://localhost:8888.

We first install some extra packages and external dependencies. 

In [None]:
%%bash
pushd /workspace/nvidia-examples/tensorrt/tftrt/examples/object_detection
bash install_dependencies.sh;
popd

In [33]:
import os
# Expose only GPU 1 to TensorFlow; set this before any TF session is created
os.environ['CUDA_VISIBLE_DEVICES']='1'

import logging
import time

import numpy as np
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

logging.getLogger("tensorflow").setLevel(logging.ERROR)

# Check the installed TensorRT version
!dpkg -l | grep nvinfer

ii  libnvinfer-dev                         5.0.2-1+cuda10.0                      amd64        TensorRT development libraries and headers
ii  libnvinfer5                            5.0.2-1+cuda10.0                      amd64        TensorRT runtime libraries


## Data
We verify that the correct data folder has been mounted.

In [2]:
VALIDATION_DATA_DIR = "/data"

def get_files(data_dir, filename_pattern):
    """Return all files in data_dir matching filename_pattern."""
    if data_dir is None:
        return []
    files = tf.gfile.Glob(os.path.join(data_dir, filename_pattern))
    if not files:
        raise ValueError('Cannot find any files in {} with '
                         'pattern "{}"'.format(data_dir, filename_pattern))
    return files

calibration_files = get_files(VALIDATION_DATA_DIR, 'validation*')
print('There are %d calibration files. \n%s\n%s\n...'%(len(calibration_files), calibration_files[0], calibration_files[-1]))

There are 128 calibration files. 
/data/validation-00114-of-00128
/data/validation-00094-of-00128
...
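`tf.gfile.Glob` returns the shards in no particular order, as the output above shows (shard 00114 is listed first). A small sanity check can confirm that every shard is present and put them in a deterministic order. The sketch below uses only the standard library; `check_shards` is a hypothetical helper, and it assumes the `<prefix>-NNNNN-of-NNNNN` naming seen in the output above.

```python
import os
import re

# Assumed naming scheme: <prefix>-<index>-of-<total>, both zero-padded to 5 digits
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})$")


def check_shards(paths):
    """Return shard paths sorted by index; raise ValueError if any shard is missing."""
    by_index = {}
    total = None
    for path in paths:
        match = SHARD_RE.match(os.path.basename(path))
        if match is None:
            raise ValueError("not a sharded file name: %s" % path)
        count = int(match.group("total"))
        if total is None:
            total = count
        elif count != total:
            raise ValueError("inconsistent shard counts in %s" % path)
        by_index[int(match.group("index"))] = path
    # Every index from 0 to total-1 must be present
    missing = sorted(set(range(total or 0)) - set(by_index))
    if missing:
        raise ValueError("missing shard indices: %s" % missing)
    return [by_index[i] for i in sorted(by_index)]
```

Calling `check_shards(calibration_files)` would either return the 128 validation shards in index order or report which indices are missing from the mounted folder.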


## TF model checkpoint
We work with a ResNet-50 v1 checkpoint, linked from https://github.com/tensorflow/models/tree/master/official/resnet, downloading it first if it is not already present.

In [6]:
%%bash
FILE=/saved_model/resnet_v1_50_2016_08_28.tar.gz
if [ -f "$FILE" ]; then
   echo "The file '$FILE' exists."
else
   echo "The file '$FILE' is not found. Downloading..."
   wget -P /saved_model/ http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz
fi


The file '/saved_model/resnet_v1_50_2016_08_28.tar.gz' exists.


In [5]:
!tar -xzvf /saved_model/resnet_v1_50_2016_08_28.tar.gz -C /saved_model 

resnet_v1_50.ckpt


In [3]:
#Define some global variables
BATCH_SIZE = 8
SAVED_MODEL_DIR = "/saved_model/"

## Helper functions
We define a few helper functions to read and preprocess ImageNet data from TFRecord files.

In [41]:
def deserialize_image_record(record):
    feature_map = {
        'image/encoded':          tf.FixedLenFeature([ ], tf.string, ''),
        'image/class/label':      tf.FixedLenFeature([1], tf.int64,  -1),
        'image/class/text':       tf.FixedLenFeature([ ], tf.string, ''),
        'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32)
    }
    with tf.name_scope('deserialize_image_record'):
        obj = tf.parse_single_example(record, feature_map)
        imgdata = obj['image/encoded']
        label   = tf.cast(obj['image/class/label'], tf.int32)
        bbox    = tf.stack([obj['image/object/bbox/%s'%x].values
                            for x in ['ymin', 'xmin', 'ymax', 'xmax']])
        bbox = tf.transpose(tf.expand_dims(bbox, 0), [0,2,1])
        text    = obj['image/class/text']
        return imgdata, label, bbox, text

from preprocessing import vgg_preprocessing  # TF-Slim preprocessing module
def preprocess(record):
    # Parse TFRecord
    imgdata, label, bbox, text = deserialize_image_record(record)
    label -= 1 # Change to 0-based (don't use background class)
    # The ImageNet preparation script stores every image as JPEG. A Python try/except
    # cannot provide a PNG fallback here: it only catches errors raised while the graph
    # is built, not decoding errors raised when the graph runs.
    image = tf.image.decode_jpeg(imgdata, channels=3, fancy_upscaling=False, dct_method='INTEGER_FAST')

    image = vgg_preprocessing.preprocess_image(image, 224, 224, is_training=False)
    return image, label

dataset = tf.data.TFRecordDataset(calibration_files)    
dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
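With `is_training=False`, TF-Slim's `vgg_preprocessing.preprocess_image` resizes the image so that its shorter side is 256 pixels (keeping the aspect ratio), takes a central 224x224 crop and subtracts per-channel RGB means. The NumPy sketch below mirrors those steps for illustration only: it uses nearest-neighbour resizing instead of Slim's bilinear resize, so its output will not match Slim's exactly, and the mean values are assumed to be Slim's `_R_MEAN`, `_G_MEAN`, `_B_MEAN` constants. It also shows the 1-based to 0-based label shift done in `preprocess`. Both function names are hypothetical.

```python
import numpy as np

# Per-channel RGB means, assumed to match TF-Slim's vgg_preprocessing constants
CHANNEL_MEANS = np.array([123.68, 116.78, 103.94], dtype=np.float32)


def eval_preprocess(image, out_size=224, resize_side=256):
    """Aspect-preserving resize, central crop and mean subtraction for an HxWx3 image."""
    h, w = image.shape[:2]
    scale = resize_side / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize (Slim uses bilinear) keeps this sketch NumPy-only
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    resized = image[rows][:, cols].astype(np.float32)
    # Central crop to out_size x out_size
    top = (new_h - out_size) // 2
    left = (new_w - out_size) // 2
    crop = resized[top:top + out_size, left:left + out_size]
    return crop - CHANNEL_MEANS


def to_zero_based(label):
    """ImageNet TFRecord labels run 1..1000 (0 is background); the model predicts 0..999."""
    return label - 1
```

For a 300x400 input, the shorter side becomes 256 (giving 256x341) before the 224x224 crop is taken from the centre.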

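Next are two functions that benchmark a model: one takes a frozen `GraphDef`, the other a SavedModel directory. `benchmark_frozen_graph` can also export the graph as a SavedModel after benchmarking.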

In [42]:
def benchmark_frozen_graph(frozen_graph, SAVED_MODEL_DIR=None, dataset=dataset, BATCH_SIZE=8):
    with tf.Session(graph=tf.Graph()) as sess:
        # Prepare the dataset iterator
        iterator = dataset.make_one_shot_iterator()
        next_element = iterator.get_next()

        # Import the frozen graph; it must contain the 'input' placeholder and 'classes' output
        tf.import_graph_def(
            frozen_graph,
            return_elements=['classes'],
            name="")

        num_hits = 0
        num_predict = 0
        print('Warming up for 10 batches...')
        for _ in range (10):
            image_data = sess.run(next_element)
            img = image_data[0]
            label = image_data[1].squeeze()
            output = sess.run(['classes:0'], feed_dict={"input:0": img})
            prediction = output[0]
            num_hits += np.sum(prediction == label)
            num_predict += len(prediction)

        # Warm-up batches count toward accuracy but not toward throughput
        num_timed = 0
        start_time = time.time()
        try:
            while True:
                image_data = sess.run(next_element)
                img = image_data[0]
                label = image_data[1].squeeze()
                output = sess.run(['classes:0'], feed_dict={"input:0": img})
                prediction = output[0]
                num_hits += np.sum(prediction == label)
                num_predict += len(prediction)
                num_timed += len(prediction)
        except tf.errors.OutOfRangeError:
            pass
        elapsed = time.time() - start_time

        print('Accuracy: %.2f%%'%(100*num_hits/num_predict))
        print('Inference speed: %.2f samples/s'%(num_timed/elapsed))

        # Save model for serving
        if SAVED_MODEL_DIR:
            print('Saving model to %s'%SAVED_MODEL_DIR)
            tf.saved_model.simple_save(
                session=sess,
                export_dir=SAVED_MODEL_DIR,
                inputs={"input":tf.get_default_graph().get_tensor_by_name("input:0")},
                outputs={"classes":tf.get_default_graph().get_tensor_by_name("classes:0")},
                legacy_init_op=None
             )

In [43]:
def benchmark_saved_model(SAVED_MODEL_DIR, BATCH_SIZE=8):
    with tf.Session(graph=tf.Graph()) as sess:
        # Build the input pipeline from all TFRecord paths inside this session's graph
        dataset = tf.data.TFRecordDataset(calibration_files)
        dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
        iterator = dataset.make_one_shot_iterator()
        next_element = iterator.get_next()

        # Load the 'serve' MetaGraph and its variables into this session
        tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.SERVING], SAVED_MODEL_DIR)

        print('Warming up for 10 batches...')
        for _ in range (10):
            image_data = sess.run(next_element)
            img = image_data[0]
            output = sess.run(['classes:0'], feed_dict={"input:0": img})

        print('Benchmarking inference engine...')
        num_hits = 0
        num_predict = 0
        start_time = time.time()
        try:
            while True:
                image_data = sess.run(next_element)
                img = image_data[0]
                label = image_data[1].squeeze()
                output = sess.run(['classes:0'], feed_dict={"input:0": img})
                prediction = output[0]
                num_hits += np.sum(prediction == label)
                num_predict += len(prediction)
        except tf.errors.OutOfRangeError:
            pass
        elapsed = time.time() - start_time

        print('Accuracy: %.2f%%'%(100*num_hits/num_predict))
        print('Inference speed: %.2f samples/s'%(num_predict/elapsed))
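Both functions follow the same measurement pattern: run a few warm-up batches so that one-time start-up costs (for example memory allocation and kernel selection) are not timed, then count predictions against wall-clock time until the input is exhausted. The framework-agnostic sketch below isolates that pattern; `benchmark` and its `predict` callback are hypothetical names, not part of TensorFlow. It counts warm-up predictions toward accuracy but divides only the timed samples by the elapsed time.

```python
import itertools
import time


def _count_hits(predictions, labels):
    return sum(int(p == l) for p, l in zip(predictions, labels))


def benchmark(predict, batches, warmup=10):
    """Return (top-1 accuracy, samples/s); warm-up batches count toward accuracy only."""
    batches = iter(batches)
    num_hits = num_predict = 0
    for images, labels in itertools.islice(batches, warmup):
        predictions = predict(images)
        num_hits += _count_hits(predictions, labels)
        num_predict += len(predictions)

    num_timed = 0
    start = time.perf_counter()
    for images, labels in batches:  # runs until the input is exhausted
        predictions = predict(images)
        num_hits += _count_hits(predictions, labels)
        num_predict += len(predictions)
        num_timed += len(predictions)
    elapsed = time.perf_counter() - start

    accuracy = num_hits / num_predict if num_predict else float("nan")
    speed = num_timed / elapsed if num_timed and elapsed > 0 else float("nan")
    return accuracy, speed
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which matters more for short runs than for a full pass over 50,000 images.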

## Benchmarking native TensorFlow model

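We first load and benchmark the ResNet-v1-50 model from TF-Slim. Note that the checkpoint downloaded from http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz does not include a meta graph (only the variable values), so we use the TF-Slim nets factory to rebuild the model definition before restoring the weights. The rebuilt graph names its input placeholder `input` and its predicted labels `classes`; the benchmark functions above rely on these names.

Checkpoints produced by newer versions of TensorFlow generally include a `.meta` file with enough graph information to load the network directly from the archive.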
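Whether a checkpoint archive carries its own graph can be checked before extracting it. The standard-library sketch below lists the members of a tarball and looks for a `*.meta` file; it assumes the graph, if present, was written as a `.meta` file by `tf.train.Saver`, and both helper names are hypothetical.

```python
import io
import tarfile


def list_archive(path=None, fileobj=None):
    """Return the member names of a (possibly compressed) tar archive."""
    with tarfile.open(name=path, fileobj=fileobj, mode="r:*") as tar:
        return tar.getnames()


def has_meta_graph(names):
    """True if any member looks like a MetaGraph file written next to a checkpoint."""
    return any(name.endswith(".meta") for name in names)


if __name__ == "__main__":
    # Build a small in-memory archive that mimics a checkpoint-only tarball
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        payload = b"\x00" * 16
        info = tarfile.TarInfo("resnet_v1_50.ckpt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    names = list_archive(fileobj=buf)
    print(names, has_meta_graph(names))
```

On the notebook's archive one would call `list_archive("/saved_model/resnet_v1_50_2016_08_28.tar.gz")`; judging from the extraction output above, it contains only `resnet_v1_50.ckpt`, so no meta graph is available and the model code has to come from TF-Slim.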

In [44]:
import nets.nets_factory

FP32_SAVED_MODEL_DIR = SAVED_MODEL_DIR+"/data/Resnet_FP32/1"
!rm -rf $FP32_SAVED_MODEL_DIR

graph = tf.Graph()
with graph.as_default():
    with tf.Session() as sess:
        tf_input = tf.placeholder(tf.float32, [None, 224, 224, 3], name='input')
        network_fn = nets.nets_factory.get_network_fn('resnet_v1_50', 1000,
                is_training=False)
        tf_net, tf_end_points = network_fn(tf_input)
                
        saver = tf.train.Saver()
        saver.restore(sess, SAVED_MODEL_DIR+"resnet_v1_50.ckpt")
        
        tf_output = tf.identity(tf_net, name='logits')
        tf_output_classes = tf.argmax(tf_output, axis=1)        
        tf_output_classes = tf.reshape(tf_output_classes, (BATCH_SIZE,), name='classes')
        
        # freeze graph
        frozen_graph = tf.graph_util.convert_variables_to_constants(
            sess,
            sess.graph_def,
            output_node_names=['logits', 'classes']
        ) 
    
        #Save model for serving
        print('Saving FP32 model to %s'%FP32_SAVED_MODEL_DIR)
        tf.saved_model.simple_save(
            session=sess,
            export_dir=FP32_SAVED_MODEL_DIR,
            inputs={"input":tf.get_default_graph().get_tensor_by_name("input:0")},
            outputs={"classes":tf.get_default_graph().get_tensor_by_name("classes:0")},
            legacy_init_op=None
         )

Saving FP32 model to /saved_model//data/Resnet_FP32/1


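The `classes` output is reshaped to a fixed `BATCH_SIZE`, so the exported signature reports its shape as `(8)` and the graph accepts only batches of exactly 8 images. The 50,000 ImageNet validation images divide evenly into such batches, so no partial final batch occurs.

In [48]: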
!saved_model_cli show --all --dir /saved_model/data/Resnet_FP32/1/


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (8)
        name: classes:0
  Method name is: tensorflow/serving/predict


In [47]:
benchmark_saved_model(FP32_SAVED_MODEL_DIR)

Warming up for 10 batches...
Benchmarking inference engine...
Accuracy: 75.18%
Inference speed: 731.36 samples/s


In [49]:
benchmark_frozen_graph(frozen_graph)

Warming up for 10 batches...
Accuracy: 75.18%
Inference speed: 741.03 samples/s


## Benchmarking TF-TRT FP32 inference engine

Next, we convert the native TF FP32 model to TF-TRT FP32 and test the speed.

In [50]:
FP32_SAVED_MODEL_DIR = SAVED_MODEL_DIR+"/data/Resnet_FP32/1"

TRT_FP32_SAVED_MODEL_DIR = SAVED_MODEL_DIR+"/data/Resnet_TRT_FP32/1"
!rm -rf $TRT_FP32_SAVED_MODEL_DIR

trt_fp32_graph = trt.create_inference_graph(
    input_graph_def=None,
    outputs=None,
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=FP32_SAVED_MODEL_DIR,
    output_saved_model_dir=TRT_FP32_SAVED_MODEL_DIR,
    precision_mode="FP32")
len(trt_fp32_graph.SerializeToString())

240600781

In [51]:
!saved_model_cli show --all --dir $TRT_FP32_SAVED_MODEL_DIR


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (8)
        name: classes:0
  Method name is: tensorflow/serving/predict


In [52]:
benchmark_saved_model(TRT_FP32_SAVED_MODEL_DIR)

Warming up for 10 batches...
Benchmarking inference engine...
Accuracy: 75.18%
Inference speed: 706.74 samples/s


In [54]:
benchmark_frozen_graph(trt_fp32_graph)

Warming up for 10 batches...
Accuracy: 75.18%
Inference speed: 707.13 samples/s


## Benchmarking TF-TRT FP16 inference engine

Next, we convert the native TF FP32 model to TF-TRT FP16 and test the speed.

In [12]:
TRT_FP16_SAVED_MODEL_DIR = SAVED_MODEL_DIR+"/data/Resnet_TRT_FP16/1"
!rm -rf $TRT_FP16_SAVED_MODEL_DIR

trt_FP16 = trt.create_inference_graph(
    input_graph_def=None,
    outputs=['classes'],
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=FP32_SAVED_MODEL_DIR,
    output_saved_model_dir=TRT_FP16_SAVED_MODEL_DIR,
    precision_mode="FP16")

INFO:tensorflow:Running against TensorRT version 5.0.2
INFO:tensorflow:Restoring parameters from /saved_model//data/Resnet_FP32/1/variables/variables
INFO:tensorflow:Froze 267 variables.
INFO:tensorflow:Converted 267 variables to const ops.
INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /saved_model//data/Resnet_TRT_FP16/1/saved_model.pb


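The same FP16 conversion can also be applied directly to the frozen `GraphDef`. The cells below do this and pass the result to `benchmark_frozen_graph`, which benchmarks it and exports it to `TRT_FP16_SAVED_MODEL_DIR`.

In [24]: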
trt_FP16 = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['classes'],
    max_batch_size=BATCH_SIZE,
    precision_mode="FP16")

INFO:tensorflow:Running against TensorRT version 5.0.2


In [55]:
!rm -rf $TRT_FP16_SAVED_MODEL_DIR
benchmark_frozen_graph(trt_FP16, TRT_FP16_SAVED_MODEL_DIR)

Warming up for 10 batches...
Accuracy: 75.20%
Inference speed: 1480.08 samples/s
Saving model to /saved_model//data/Resnet_TRT_FP16/1


In [34]:
benchmark_saved_model(TRT_FP16_SAVED_MODEL_DIR)

Warming up for 10 batches...
Benchmarking inference engine...
Accuracy: 75.20%
Inference speed: 1485.74 samples/s


## Creating TF-TRT INT8 inference model

Creating a TF-TRT INT8 inference model requires two steps:

- Step 1: create the calibration graph and run some representative data through it for INT8 calibration (here, 2 batches of validation images).

- Step 2: convert the calibration graph to the TF-TRT INT8 inference engine.
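Calibration is needed because INT8 inference requires, for each tensor, a scale that maps its floating-point range onto [-127, 127], and that range can only be estimated by running representative data. The NumPy sketch below uses the simplest possible rule, symmetric max-abs calibration, to show what a scale is and how quantize/dequantize round-trip. This is a swapped-in, simplified technique: TensorRT offers entropy-based calibrators that choose the range by minimizing information loss rather than taking the raw maximum, so this is not what TF-TRT computes internally. All function names are hypothetical.

```python
import numpy as np


def max_abs_scale(calibration_batches):
    """Symmetric INT8 scale: the largest |value| seen during calibration maps to 127."""
    amax = max(float(np.max(np.abs(batch))) for batch in calibration_batches)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(x, scale):
    """Round to the nearest INT8 step; values beyond the calibrated range saturate."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


def dequantize(q, scale):
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    calib = [rng.normal(size=(8, 64)) for _ in range(2)]  # e.g. 2 batches, as in Step 1
    scale = max_abs_scale(calib)
    x = rng.normal(size=(8, 64))
    error = np.abs(dequantize(quantize(x, scale), scale) - x)
    inside = np.abs(x) <= 127 * scale
    print("scale=%.4f  max error inside range=%.4f" % (scale, error[inside].max()))
```

Inside the calibrated range the round-trip error is at most half a quantization step (`scale / 2`); outside it, values saturate at plus or minus 127 steps, which is why calibration data should cover the activations seen at inference time.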

### Step 1

In [35]:
# Create the TF-TRT INT8 calibration graph
trt_int8_calib_graph = trt.create_inference_graph(
        input_graph_def=frozen_graph,
        outputs=['classes:0'],
        max_batch_size=BATCH_SIZE,
        max_workspace_size_bytes=1<<32,  # 4 GiB of GPU workspace for TensorRT
        precision_mode='INT8')

# Calibrate it with 2 batches of examples
N_runs=2
with tf.Session(graph=tf.Graph()) as sess:

    output_node = tf.import_graph_def(
        trt_int8_calib_graph,
        return_elements=['classes'],
        name='')

    # Initialize all tfrecord paths
    dataset = tf.data.TFRecordDataset(calibration_files)
    dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    print('Calibrate model on calibration data...')
    num_hits = 0
    num_predict = 0
    for _ in range(N_runs):
        image_data = sess.run(next_element)
        img = image_data[0]
        label = image_data[1].squeeze()
        prediction = sess.run(output_node[0].outputs[0], feed_dict={"input:0": img})
        num_hits += np.sum(prediction == label)
        num_predict += len(prediction)
    print('Calibration accuracy: %.2f%%'%(100*num_hits/num_predict))



Calibrate model on calibration data...
Calibration accuracy: 62.50%


### Step 2

Now we convert the INT8 calibration graph to the final TF-TRT INT8 inference engine and benchmark its performance. We also save this engine as a *SavedModel*, ready to be served elsewhere.

In [36]:
# Convert the calibrated graph into the final INT8 inference graph
trt_int8_calibrated_graph=trt.calib_graph_to_infer_graph(trt_int8_calib_graph)


In [38]:
INT8_SAVED_MODEL_DIR = SAVED_MODEL_DIR + '/data/Resnet_TRT_INT8/1'
!rm -rf $INT8_SAVED_MODEL_DIR

benchmark_frozen_graph(trt_int8_calibrated_graph, INT8_SAVED_MODEL_DIR)

Warming up for 10 batches...
Accuracy: 74.99%
Inference speed: 1493.75 samples/s
Saving model to /saved_model/INT8


In [39]:
!saved_model_cli show --all --dir $INT8_SAVED_MODEL_DIR


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: unknown_rank
        name: classes:0
  Method name is: tensorflow/serving/predict


## Benchmarking INT8 saved model

Finally we reload and verify the performance of the INT8 saved model.

In [40]:
benchmark_saved_model(INT8_SAVED_MODEL_DIR)

Warming up for 10 batches...
Benchmarking inference engine...
Accuracy: 74.99%
Inference speed: 1504.92 samples/s


## Benchmarking with synthetic data

When benchmarking with a real dataset, reading and preprocessing the data are part of the measured loop, so the GPU is not fully loaded all the time. In this section, we use synthetic data to measure the throughput limit of the GPU.
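A deliberately simplified model shows why data loading caps the measured throughput. In `benchmark_frozen_graph` and `benchmark_saved_model`, each iteration first fetches a batch (`sess.run(next_element)`) and only then runs inference, so to the extent that the input pipeline cannot prepare batches ahead of time, the two costs add up. The functions and timings below are hypothetical, not measured in this notebook; because `tf.data`'s parallel mapping overlaps some of the work, real throughput would typically lie somewhere between these two idealized cases.

```python
def serialized_throughput(batch_size, t_data, t_infer):
    """Samples/s when each batch is read and preprocessed, then inferred, strictly in turn."""
    return batch_size / (t_data + t_infer)


def overlapped_throughput(batch_size, t_data, t_infer):
    """Samples/s if preparing the next batch fully overlaps inference of the current one."""
    return batch_size / max(t_data, t_infer)


if __name__ == "__main__":
    # Hypothetical per-batch timings in seconds, chosen only for illustration
    t_data, t_infer = 0.002, 0.004
    print("serialized: %.0f samples/s" % serialized_throughput(8, t_data, t_infer))
    print("overlapped: %.0f samples/s" % overlapped_throughput(8, t_data, t_infer))
    print("data-free:  %.0f samples/s" % (8 / t_infer))
```

With synthetic data the data term is close to zero, so the measured figure approaches the "data-free" bound, which is what the benchmarks in this section aim to expose.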

### Native TF FP32

In [42]:
NUM_ITER = 1000
# One random input batch; tf.constant below casts it to float32
dummy_input = np.random.random_sample((BATCH_SIZE,224,224,3))

def benchmark_synthetic(frozen_graph):
    with tf.Session(graph=tf.Graph()) as sess:
        print('Preparing synthetic dataset...')
        inc=tf.constant(dummy_input, dtype=tf.float32)
        # Repeat the same batch indefinitely so no data reading or preprocessing is timed
        dataset=tf.data.Dataset.from_tensors(inc)
        dataset=dataset.repeat()
        iterator=dataset.make_one_shot_iterator()
        next_element=iterator.get_next()

        # Connect the iterator output directly to the 'input' placeholder (no feed_dict)
        output_node = tf.import_graph_def(
            frozen_graph,
            return_elements=['classes:0'],
            input_map={'input:0':next_element},
            name="")

        print('Warming up for 10 batches...')
        for _ in range (10):
            sess.run(output_node)

        start_time = time.time()
        for _ in range(NUM_ITER):
            sess.run(output_node)

        print('Inference speed: %.2f samples/s'%(NUM_ITER*BATCH_SIZE/(time.time()-start_time)))

benchmark_synthetic(frozen_graph)

Preparing synthetic dataset...
Warming up for 10 batches...
Inference speed: 715.82 samples/s


A second run of the native TF FP32 graph gives a similar figure:

In [26]:
benchmark_synthetic(frozen_graph)

Preparing synthetic dataset...
Warming up for 10 batches...
Inference speed: 747.97 samples/s


### TF-TRT FP32

In [27]:
benchmark_synthetic(trt_fp32_graph)

Preparing synthetic dataset...
Warming up for 10 batches...
Inference speed: 1068.86 samples/s


### TF-TRT FP16

In [28]:
benchmark_synthetic(trt_FP16)

Preparing synthetic dataset...
Warming up for 10 batches...
Inference speed: 2067.87 samples/s


### TF-TRT INT8

In [29]:
benchmark_synthetic(trt_int8_calibrated_graph)

Preparing synthetic dataset...
Warming up for 10 batches...
Inference speed: 2084.15 samples/s
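For a side-by-side view, the short script below tabulates the throughputs reported in this notebook and expresses each as a multiple of native TF FP32. The real-data figures are the SavedModel benchmarks; for native TF on synthetic data it takes the first of the two runs. These are single runs on one GPU, so small differences (for example TF-TRT FP32 versus native TF on real data) may be within run-to-run variation.

```python
# Throughputs in samples/s, copied from the outputs above (batch size 8, single runs)
REAL_DATA = {
    "TF FP32": 731.36,
    "TF-TRT FP32": 706.74,
    "TF-TRT FP16": 1485.74,
    "TF-TRT INT8": 1504.92,
}
SYNTHETIC = {
    "TF FP32": 715.82,  # first of the two native runs
    "TF-TRT FP32": 1068.86,
    "TF-TRT FP16": 2067.87,
    "TF-TRT INT8": 2084.15,
}


def speedups(results, baseline="TF FP32"):
    """Throughput of each variant relative to the baseline variant."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


def print_table(title, results):
    print(title)
    for name, ratio in speedups(results).items():
        print("  %-12s %8.2f samples/s  %5.2fx" % (name, results[name], ratio))


if __name__ == "__main__":
    print_table("ImageNet validation data (SavedModel benchmarks)", REAL_DATA)
    print_table("Synthetic data", SYNTHETIC)
```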
