# TensorFlow-TensorRT INT8 inference example from a saved model

In this notebook, we demonstrate how to create a TF-TensorRT (TF-TRT) optimized model from a TensorFlow *saved model*.
This notebook was tested in the NVIDIA NGC TensorFlow container `nvcr.io/nvidia/tensorflow:19.04-py3`, which can be downloaded from http://ngc.nvidia.com.

### Data
We use the ImageNet dataset stored in TFRecord format. Google provides an all-in-one script for downloading and preparing the ImageNet dataset at https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh.

### Saved model
We run this demonstration with a saved model from the TensorFlow ResNet model zoo at https://github.com/tensorflow/models/tree/master/official/resnet.

To run this notebook, start the NGC TF container, providing the correct paths to the ImageNet validation data and a TF saved model:

```bash
nvidia-docker run --rm -it -p 8888:8888 -v /path/to/image_net/:/data  -v /path/to/saved_model:/saved_model --name TFTRT nvcr.io/nvidia/tensorflow:19.04-py3
```




Now, you can log in to this container with:

```bash
docker exec -it TFTRT /bin/bash
```


Then start Jupyter Notebook within the container:

```bash
cd /workspace/nvidia-examples/tensorrt/tftrt/examples/object_detection
jupyter notebook --ip 0.0.0.0 --port 8888  --allow-root
```

Connect to the Jupyter Notebook web interface from your local host at http://localhost:8888. This notebook can then be uploaded to `/workspace/nvidia-examples/tensorrt/tftrt/examples/object_detection` inside the container.

We first install some extra packages and external dependencies. 

In [1]:
%%bash
bash install_dependencies.sh

Setup local variables...
Download protobuf...
/workspace/nvidia-examples/tensorrt/tftrt/examples/object_detection/protoc /workspace/nvidia-examples/tensorrt/tftrt/examples/object_detection
Archive:  protoc-3.5.1-linux-x86_64.zip
   creating: include/
   creating: include/google/
   creating: include/google/protobuf/
  inflating: include/google/protobuf/struct.proto  
  inflating: include/google/protobuf/type.proto  
  inflating: include/google/protobuf/descriptor.proto  
  inflating: include/google/protobuf/api.proto  
  inflating: include/google/protobuf/empty.proto  
   creating: include/google/protobuf/compiler/
  inflating: include/google/protobuf/compiler/plugin.proto  
  inflating: include/google/protobuf/any.proto  
  inflating: include/google/protobuf/field_mask.proto  
  inflating: include/google/protobuf/wrappers.proto  
  inflating: include/google/protobuf/timestamp.proto  
  inflating: include/google/protobuf/duration.proto  
  inflating: include/google/protobuf/source_cont


echo Download protobuf...
mkdir -p $PROTOC_DIR
pushd $PROTOC_DIR
ARCH=$(uname -m)
uname -m
if [ "$ARCH" == "aarch64" ] ; then
  filename="protoc-3.5.1-linux-aarch_64.zip"
elif [ "$ARCH" == "x86_64" ] ; then
  filename="protoc-3.5.1-linux-x86_64.zip"
else
  echo ERROR: $ARCH not supported.
  exit 1;
fi
wget --no-check-certificate ${PROTO_BASE_URL}${filename}
--2019-05-08 03:05:28--  https://github.com/google/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/protocolbuffers/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip [following]
--2019-05-08 03:05:30--  https://github.com/protocolbuffers/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Fo

In [3]:
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
import numpy as np
import matplotlib.pyplot as plt
import os
import time

# Expose only the GPU with index 1 to TensorFlow; adjust for your machine.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

## Data
We verify that the correct data folder has been mounted.

In [4]:
VALIDATION_DATA_DIR = "/data"

def get_files(data_dir, filename_pattern):
    if data_dir is None:
        return []
    files = tf.gfile.Glob(os.path.join(data_dir, filename_pattern))
    if not files:
        raise ValueError('Cannot find any files in {} with '
                         'pattern "{}"'.format(data_dir, filename_pattern))
    return files

calibration_files = get_files(VALIDATION_DATA_DIR, 'validation*')
print('There are %d calibration files. \n%s\n%s\n...'%(len(calibration_files), calibration_files[0], calibration_files[-1]))

There are 128 calibration files. 
/data/validation-00114-of-00128
/data/validation-00094-of-00128
...
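
The listing above comes back in arbitrary order, so a missing shard is easy to overlook. As a minimal sketch, the check below sorts the shard paths and confirms that none is missing. It assumes file names follow the `validation-XXXXX-of-YYYYY` pattern seen in the output, with 0-based shard indices; `check_shards` and the example paths are hypothetical and not part of the original notebook.

```python
import re

# Assumed naming pattern, e.g. "validation-00114-of-00128".
SHARD_RE = re.compile(r"validation-(\d{5})-of-(\d{5})$")

def check_shards(paths):
    """Return the paths sorted by shard index, raising if any shard is missing."""
    indexed = {}
    total = None
    for path in paths:
        match = SHARD_RE.search(path)
        if match is None:
            raise ValueError("Unexpected file name: {}".format(path))
        index, count = int(match.group(1)), int(match.group(2))
        if total is None:
            total = count
        elif count != total:
            raise ValueError("Inconsistent shard count in {}".format(path))
        indexed[index] = path
    # Shard indices are assumed to run from 0 to total - 1.
    missing = sorted(set(range(total or 0)) - set(indexed))
    if missing:
        raise ValueError("Missing shards: {}".format(missing))
    return [indexed[i] for i in sorted(indexed)]

# Usage with made-up paths in arbitrary order.
example = ["/data/validation-00002-of-00003",
           "/data/validation-00000-of-00003",
           "/data/validation-00001-of-00003"]
sorted_shards = check_shards(example)
print(sorted_shards[0], sorted_shards[-1])
```

In the notebook, the same check could be applied to `calibration_files` before building the dataset.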


## Helper functions
We define a few helper functions to read and preprocess ImageNet data from TFRecord files.

In [5]:
def deserialize_image_record(record):
    feature_map = {
        'image/encoded':          tf.FixedLenFeature([ ], tf.string, ''),
        'image/class/label':      tf.FixedLenFeature([1], tf.int64,  -1),
        'image/class/text':       tf.FixedLenFeature([ ], tf.string, ''),
        'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32)
    }
    with tf.name_scope('deserialize_image_record'):
        obj = tf.parse_single_example(record, feature_map)
        imgdata = obj['image/encoded']
        label   = tf.cast(obj['image/class/label'], tf.int32)
        bbox    = tf.stack([obj['image/object/bbox/%s'%x].values
                            for x in ['ymin', 'xmin', 'ymax', 'xmax']])
        bbox = tf.transpose(tf.expand_dims(bbox, 0), [0,2,1])
        text    = obj['image/class/text']
        return imgdata, label, bbox, text

In [6]:
from preprocessing import inception_preprocessing, vgg_preprocessing

def preprocess(record):
    # Parse the TFRecord
    imgdata, label, bbox, text = deserialize_image_record(record)
    label -= 1  # Change to 0-based (don't use background class)
    try:
        image = tf.image.decode_jpeg(imgdata, channels=3, fancy_upscaling=False, dct_method='INTEGER_FAST')
    except:
        image = tf.image.decode_png(imgdata, channels=3)

    # Resize and normalize to the 224x224 input expected by the model
    image = vgg_preprocessing.preprocess_image(image, 224, 224, is_training=False)
    return image, label

## Benchmarking the native TensorFlow model

In [7]:
# Define some global variables
BATCH_SIZE = 128
SAVED_MODEL_DIR = "/saved_model/resnet_v1_50_savedmodel/1"

In [8]:
# First we extract the FP32 GraphDef
tf.reset_default_graph()

fp32_graph_def = None
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], SAVED_MODEL_DIR)
    fp32_graph_def = sess.graph_def

Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:The specified SavedModel has no variables; no checkpoints were restored.


In [9]:
with tf.Session(graph=tf.Graph()) as sess:
    # Initialize all tfrecord paths
    dataset = tf.data.TFRecordDataset(calibration_files)    
    dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    output_node = tf.import_graph_def(
        fp32_graph_def,
        return_elements=['import/resnet_v1_50/predictions/Softmax'],
        name='')
    
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], SAVED_MODEL_DIR)

    print('Warming up for 10 batches...')
    for _ in range (10):
        image_data = sess.run(next_element)    
        img = image_data[0]
        output = sess.run(['import/resnet_v1_50/predictions/Softmax:0'], feed_dict={"import/input:0": img})
    
    num_hits = 0
    num_predict = 0
    start_time = time.time()
    try:
        while True:        
            image_data = sess.run(next_element)    
            img = image_data[0]
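            # Flatten labels to shape [batch]; a [batch, 1] array would
            # broadcast against `prediction` and overcount hits.
            label = image_data[1].reshape(-1)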
            output = sess.run(output_node[0].outputs[0], feed_dict={"import/input:0": img})
            prediction = np.argmax(output, axis=1)
            num_hits += np.sum(prediction == label)
            num_predict += len(prediction)
    except tf.errors.OutOfRangeError as e:
        pass
            
    print('Native TensorFlow Accuracy: %.2f%%'%(100*num_hits/num_predict)) 
    print('Native TensorFlow Inference speed: %.2f samples/s'%(num_predict/(time.time()-start_time)))


Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:The specified SavedModel has no variables; no checkpoints were restored.
Warming up for 10 batches...
Native TensorFlow Accuracy: 87.87%
Native TensorFlow Inference speed: 1216.27 samples/s
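
The benchmark loop above accumulates top-1 hits and the number of predictions, then derives accuracy and throughput from them. The bookkeeping can be sketched without TensorFlow as follows. `BenchmarkStats` and `argmax` are hypothetical helpers introduced here for illustration, and the scores and labels are made up.

```python
import time

def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)

class BenchmarkStats:
    """Accumulates top-1 hits and sample counts across batches."""

    def __init__(self):
        self.num_hits = 0
        self.num_predict = 0
        self.start = time.time()

    def update(self, scores, labels):
        # Compare each sample's predicted class with its own label only.
        for row, label in zip(scores, labels):
            self.num_hits += int(argmax(row) == label)
            self.num_predict += 1

    def accuracy_percent(self):
        return 100.0 * self.num_hits / self.num_predict

    def throughput(self, now=None):
        # `now` can be passed explicitly to make the result reproducible.
        elapsed = (time.time() if now is None else now) - self.start
        return self.num_predict / elapsed

# Usage with a made-up batch of three samples and two classes.
stats = BenchmarkStats()
stats.update([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0])
print('Accuracy: %.2f%%' % stats.accuracy_percent())
```

Pairing each prediction with exactly one label, as `zip` does here, keeps the hit count bounded by the number of samples.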


## Benchmarking TF-TRT FP16 inference engine

In [10]:
# Now we create the TF-TRT FP16 engine
trt_fp16_Graph = trt.create_inference_graph(
        input_graph_def=fp32_graph_def,
        outputs=['import/resnet_v1_50/predictions/Softmax'],
        max_batch_size=BATCH_SIZE,
        max_workspace_size_bytes=1<<32,
        precision_mode='FP16')
    
with tf.Session(graph=tf.Graph()) as sess:
    # Initialize all tfrecord paths
    dataset = tf.data.TFRecordDataset(calibration_files)    
    dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    output_node = tf.import_graph_def(
        trt_fp16_Graph,
        return_elements=['import/resnet_v1_50/predictions/Softmax'],
        name='')

    print('Warming up for 10 batches...')
    for _ in range (10):
        image_data = sess.run(next_element)    
        img = image_data[0]
        output = sess.run(['import/resnet_v1_50/predictions/Softmax:0'], feed_dict={"import/input:0": img})

    print('Benchmarking FP16...')
    num_hits = 0
    num_predict = 0
    start_time = time.time()
    try:
        while True:        
            image_data = sess.run(next_element)    
            img = image_data[0]
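            # Flatten labels to shape [batch]; a [batch, 1] array would
            # broadcast against `prediction` and overcount hits.
            label = image_data[1].reshape(-1)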
            output = sess.run(output_node[0].outputs[0], feed_dict={"import/input:0": img})
            prediction = np.argmax(output, axis=1)
            num_hits += np.sum(prediction == label)
            num_predict += len(prediction)
    except tf.errors.OutOfRangeError as e:
        pass
            
    print('FP16 TF-TRT Accuracy: %.2f%%'%(100*num_hits/num_predict))
    print('FP16 TF-TRT Inference speed: %.2f samples/s'%(num_predict/(time.time()-start_time)))


INFO:tensorflow:Running against TensorRT version 5.1.2
Warming up for 10 batches...
Benchmarking FP16...
FP16 TF-TRT Accuracy: 87.91%
FP16 TF-TRT Inference speed: 1534.23 samples/s
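
The FP16 engine reports nearly the same accuracy as the native run. One way to build intuition for this is to look at how little a single value changes when rounded to half precision. The sketch below round-trips a few arbitrary values through IEEE 754 half precision using Python's `struct` module. It only illustrates storage rounding; it does not model how TensorRT's FP16 kernels accumulate results, and `to_half` is a hypothetical helper.

```python
import struct

def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

samples = [0.1, 1.0 / 3.0, 0.7071, 3.14159, 100.7]  # arbitrary example values
rel_errors = [abs(to_half(x) - x) / abs(x) for x in samples]

# Half precision keeps 10 explicit mantissa bits, so for values in its normal
# range the relative rounding error is bounded by 2**-11 (about 0.05%).
for x, err in zip(samples, rel_errors):
    print('%10.5f -> %10.5f (relative error %.2e)' % (x, to_half(x), err))
```

Values outside the half-precision range behave differently: `struct.pack("e", ...)` raises `OverflowError` for magnitudes that are too large to represent.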


## Creating the TF-TRT INT8 inference model

Creating a TF-TRT INT8 inference model requires two steps:

- Step 1: create the calibration graph and run some representative data through it for INT8 calibration (this notebook uses ImageNet validation records).

- Step 2: convert the calibration graph to the TF-TRT INT8 inference engine.
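
To see why calibration data is needed at all, consider a deliberately simplified picture of INT8 quantization: the calibration batches reveal the range of values a tensor takes, and that range fixes the scale used to map floats to 8-bit integers. The sketch below uses a plain max-abs rule, which is an assumption made for illustration. TensorRT's own calibrators (for example its entropy-based calibrator) are more elaborate, and all function names and values here are hypothetical.

```python
def calibrate_scale(batches):
    """Pick a symmetric INT8 scale from the largest magnitude seen in calibration."""
    max_abs = max(abs(v) for batch in batches for v in batch)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Map floats to integer codes, clamping to the range [-127, 127]."""
    return [max(-127, min(127, int(round(v / scale)))) for v in values]

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

# Made-up activations standing in for two calibration batches.
calibration_batches = [[-0.5, 0.25, 1.0], [2.0, -1.5, 0.1]]
scale = calibrate_scale(calibration_batches)

# Values inside the calibrated range survive with small error;
# 5.0 lies outside the range and saturates at the largest code.
values = [2.0, -1.0, 0.03, 5.0]
codes = quantize(values, scale)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

The saturated last value shows why the calibration data should resemble the data seen at inference time: values outside the calibrated range are clipped.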

### Step 1

In [24]:
# Now we create the TF-TRT INT8 calibration graph
trt_int8_calib_graph = trt.create_inference_graph(
        input_graph_def=fp32_graph_def,
        outputs=['import/resnet_v1_50/predictions/Softmax'],
        max_batch_size=BATCH_SIZE,
        max_workspace_size_bytes=1<<32,
        precision_mode='INT8')

# Then calibrate it with 2 batches of examples
N_runs = 2
with tf.Session(graph=tf.Graph()) as sess:
    # Initialize all tfrecord paths
    dataset = tf.data.TFRecordDataset(calibration_files)
    dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    output_node = tf.import_graph_def(
        trt_int8_calib_graph,
        return_elements=['import/resnet_v1_50/predictions/Softmax'],
        name='')

    print('Calibrate model on calibration data...')
    num_hits = 0
    num_predict = 0
    for _ in range(N_runs):
            image_data = sess.run(next_element)    
            img = image_data[0]
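            # Flatten labels to shape [batch]; a [batch, 1] array would
            # broadcast against `prediction` and overcount hits.
            label = image_data[1].reshape(-1)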
            output = sess.run(output_node[0].outputs[0], feed_dict={"import/input:0": img})
            prediction = np.argmax(output, axis=1)
            num_hits += np.sum(prediction == label)
            num_predict += len(prediction)
    print('Calibration accuracy: %.2f%%'%(100*num_hits/num_predict)) 



INFO:tensorflow:Running against TensorRT version 5.1.2
Calibrate model on calibration data...
Calibration accuracy: 89.06%


### Step 2

Now we convert the INT8 calibration graph to the final TF-TRT INT8 inference engine and benchmark its performance. We also save this engine as a *saved model*, ready to be served elsewhere.

In [25]:
SAVED_INT8_MODEL_DIR =  "/saved_model/resnet50_v1_int8_savedmodel_2"
!rm -rf $SAVED_INT8_MODEL_DIR

In [26]:
# Create the INT8 inference graph from the calibration graph.
# It is imported into a fresh graph inside the session below and then saved.
trt_int8_calibrated_graph = trt.calib_graph_to_infer_graph(trt_int8_calib_graph)

with tf.Session(graph=tf.Graph()) as sess:
    # Initialize all tfrecord paths
    dataset = tf.data.TFRecordDataset(calibration_files)    
    dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    output_node = tf.import_graph_def(
        trt_int8_calibrated_graph,
        return_elements=['import/resnet_v1_50/predictions/Softmax'],
        name='')

    print('Warming up for 10 batches...')
    for _ in range (10):
        image_data = sess.run(next_element)    
        img = image_data[0]
        output = sess.run(['import/resnet_v1_50/predictions/Softmax:0'], feed_dict={"import/input:0": img})
        
    print('Benchmarking TF-TRT INT8 inference engine...')
    num_hits = 0
    num_predict = 0
    start_time = time.time()
    try:
        while True:        
            image_data = sess.run(next_element)    
            img = image_data[0]
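            # Flatten labels to shape [batch]; a [batch, 1] array would
            # broadcast against `prediction` and overcount hits.
            label = image_data[1].reshape(-1)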
            output = sess.run(['import/resnet_v1_50/predictions/Softmax:0'], feed_dict={"import/input:0": img})
            prediction = np.argmax(output[0], axis=1)
            num_hits += np.sum(prediction == label)
            num_predict += len(prediction)
    except tf.errors.OutOfRangeError as e:
        pass
            
    print('INT8 TF-TRT Accuracy: %.2f%%'%(100*num_hits/num_predict))
    print('INT8 TF-TRT Inference speed: %.2f samples/s'%(num_predict/(time.time()-start_time)))

    #Save model for serving
    tf.saved_model.simple_save(
        session=sess,
        export_dir=SAVED_INT8_MODEL_DIR,
        inputs={"input":tf.get_default_graph().get_tensor_by_name("import/input:0")},
        outputs={"softmax":output_node[0].outputs[0]},
        legacy_init_op=None
     )

      

Warming up for 10 batches...
Benchmarking TF-TRT INT8 inference engine...
INT8 TF-TRT Accuracy: 87.84%
INT8 TF-TRT Inference speed: 1523.85 samples/s
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /saved_model/resnet50_v1_int8_savedmodel_2/saved_model.pb


## Benchmarking INT8 saved model

Finally we reload and verify the performance of the INT8 saved model.

In [27]:
#benchmark int8 saved model
with tf.Session(graph=tf.Graph()) as sess:
    # Initialize all tfrecord paths
    dataset = tf.data.TFRecordDataset(calibration_files)    
    dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=8))
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], SAVED_INT8_MODEL_DIR)
    
    print('Warming up for 10 batches...')
    for _ in range (10):
        image_data = sess.run(next_element)    
        img = image_data[0]
        output = sess.run(['import/resnet_v1_50/predictions/Softmax:0'], feed_dict={"import/input:0": img})
    
    print('Benchmarking TF-TRT INT8 inference engine...')
    num_hits = 0
    num_predict = 0
    start_time = time.time()
    try:
        while True:        
            image_data = sess.run(next_element)    
            img = image_data[0]
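            # Flatten labels to shape [batch]; a [batch, 1] array would
            # broadcast against `prediction` and overcount hits.
            label = image_data[1].reshape(-1)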
            output = sess.run(['import/resnet_v1_50/predictions/Softmax:0'], feed_dict={"import/input:0": img})
            prediction = np.argmax(output[0], axis=1)
            num_hits += np.sum(prediction == label)
            num_predict += len(prediction)
    except tf.errors.OutOfRangeError as e:
        pass
            
    print('INT8 TF-TRT Accuracy: %.2f%%'%(100*num_hits/num_predict))
    print('INT8 TF-TRT Inference speed: %.2f samples/s'%(num_predict/(time.time()-start_time)))


INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:The specified SavedModel has no variables; no checkpoints were restored.
Warming up for 10 batches...
Benchmarking TF-TRT INT8 inference engine...
INT8 TF-TRT Accuracy: 87.84%
INT8 TF-TRT Inference speed: 1498.27 samples/s


## Benchmarking with synthetic data

When benchmarking with real datasets, data reading and preprocessing are part of the measured loop, so the GPU is not fully loaded all the time. In this section, we use synthetic data to measure the throughput limit of the GPU.
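
The cells in this section share one measurement pattern: run a few warm-up iterations, then time a fixed number of iterations and convert the elapsed time to samples per second. A minimal, framework-independent sketch of that pattern is shown below. `benchmark` is a hypothetical helper, and the dummy `step` stands in for a call such as `sess.run(out)`.

```python
import time

def benchmark(step, batch_size, warmup=10, iters=100):
    """Return throughput in samples/s for a callable that processes one batch."""
    # Warm-up iterations are excluded from timing so that one-off costs
    # (initialization, caching) do not distort the measurement.
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed

# Usage with a dummy workload in place of a real inference call.
calls = []

def step():
    calls.append(sum(range(1000)))

throughput = benchmark(step, batch_size=128, warmup=10, iters=100)
print('Dummy throughput: %.2f samples/s' % throughput)
```

Feeding the same in-memory batch on every iteration, as the synthetic-data cells do, removes the input pipeline from the measurement.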

In [49]:
NUM_ITER = 100
dummy_input = np.random.random_sample((BATCH_SIZE,224,224,3))

fp32_graph = tf.Graph()
with fp32_graph.as_default():
    inc=tf.constant(dummy_input, dtype=tf.float32)
    dataset=tf.data.Dataset.from_tensors(inc)
    dataset=dataset.repeat()
    iterator=dataset.make_one_shot_iterator()
    next_element=iterator.get_next()
    out = tf.import_graph_def(
      graph_def=fp32_graph_def,
      input_map={"import/input":next_element},
      return_elements=[ "import/resnet_v1_50/predictions/Softmax"]
    )

with tf.Session(graph=fp32_graph) as sess:
    print('Warming up for 10 batches...')
    for _ in range(10):
        sess.run(out)
    
    print('Benchmarking...')
    start_time = time.time()
    for _ in range(NUM_ITER):
        sess.run(out)
    print('Native FP32 Inference speed: %.2f samples/s'%(NUM_ITER*BATCH_SIZE/(time.time()-start_time)))
    

Warming up for 10 batches...
Benchmarking...
Native FP32 Inference speed: 1172.37 samples/s


In [39]:
fp16_graph = tf.Graph()
with fp16_graph.as_default():
    inc=tf.constant(dummy_input, dtype=tf.float32)
    dataset=tf.data.Dataset.from_tensors(inc)
    dataset=dataset.repeat()
    iterator=dataset.make_one_shot_iterator()
    next_element=iterator.get_next()
    out = tf.import_graph_def(
      graph_def=trt_fp16_Graph,
      input_map={"import/input":next_element},
      return_elements=[ "import/resnet_v1_50/predictions/Softmax"]
    )
    out = out[0].outputs[0]

with tf.Session(graph=fp16_graph) as sess:
    print('Warming up for 10 batches...')
    for _ in range(10):
        sess.run(out)
    
    print('Benchmarking...')
    start_time = time.time()
    for _ in range(NUM_ITER):
        sess.run(out)
    print('TF-TRT FP16 Inference speed: %.2f samples/s'%(NUM_ITER*BATCH_SIZE/(time.time()-start_time)))
    

Warming up for 10 batches...
Benchmarking...
TF-TRT FP16 Inference speed: 3997.21 samples/s


In [48]:
int8_graph = tf.Graph()
with int8_graph.as_default():
    inc=tf.constant(dummy_input, dtype=tf.float32)
    dataset=tf.data.Dataset.from_tensors(inc)
    dataset=dataset.repeat()
    iterator=dataset.make_one_shot_iterator()
    next_element=iterator.get_next()
    out = tf.import_graph_def(
      graph_def=trt_int8_calibrated_graph,
      input_map={"import/input":next_element},
      return_elements=[ "import/resnet_v1_50/predictions/Softmax"],
      name=''        
    )
        
with tf.Session(graph=int8_graph) as sess:

    print('Warming up for 10 batches...')
    for _ in range(10):
        sess.run(out)
    
    print('Benchmarking...')
    start_time = time.time()
    for _ in range(NUM_ITER):
        sess.run(out)
    print('TF-TRT INT8 Inference speed: %.2f samples/s'%(NUM_ITER*BATCH_SIZE/(time.time()-start_time)))

['Const', 'count', 'OneShotIterator', 'IteratorToStringHandle', 'IteratorGetNext', 'import/input', 'TRTEngineOp_0', 'import/resnet_v1_50/predictions/Softmax']
Warming up for 10 batches...
Benchmarking...
TF-TRT INT8 Inference speed: 3982.26 samples/s
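
To compare the runs, the throughput figures recorded in this notebook can be turned into speedups over the native baseline. The sketch below simply divides the numbers printed by the cells above (the synthetic FP16 and INT8 entries are the outputs of the cells that import `trt_fp16_Graph` and `trt_int8_calibrated_graph`). The dictionary layout and the `speedups` helper are hypothetical.

```python
# Throughput in samples/s, copied from the cell outputs in this notebook.
real_data = {"native": 1216.27, "fp16": 1534.23, "int8": 1523.85}
synthetic = {"native": 1172.37, "fp16": 3997.21, "int8": 3982.26}

def speedups(results, baseline="native"):
    """Return each entry's throughput relative to the baseline entry."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}

for label, results in (("real data", real_data), ("synthetic data", synthetic)):
    for name, ratio in sorted(speedups(results).items()):
        print("%-14s %-6s %.2fx" % (label, name, ratio))
```

The gap between the two sets of ratios reflects the input pipeline: with real data, reading and preprocessing limit how much of the engine's speed shows up in the end-to-end figure.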
