# TF-TRT Inference From Saved Model with TensorFlow <= 1.13

In this notebook, we demonstrate the process of creating a TF-TRT optimized model from a TensorFlow *saved model*. This notebook was designed to run with TensorFlow version <= 1.13, which is included in the NVIDIA NGC TensorFlow containers `nvcr.io/nvidia/tensorflow:19.03-py3` through `nvcr.io/nvidia/tensorflow:19.06-py3`, available from http://ngc.nvidia.com.

## Notebook Content
1. [Pre-requisite: data and model](#1)
1. [Verifying the original FP32 model](#2)
1. [Creating TF-TRT FP32 model](#3)
1. [Creating TF-TRT FP16 model](#4)
1. [Creating TF-TRT INT8 model](#5)
1. [Calibrating TF-TRT INT8 model with raw JPEG images](#6)
 
## Quick start
We will run this demonstration with a saved ResNet-50 v1 model stored at `/path/to/saved_model`.

The INT8 calibration process requires access to a small but representative sample of real training or validation data.

We will use the ImageNet dataset stored in TFRecord format. Google provides an excellent all-in-one script for downloading and preparing the ImageNet dataset at

https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh


To run this notebook, start the NGC TF container, providing the correct paths to the ImageNet validation data (`/path/to/image_net`) and to the folder containing the TF saved model (`/path/to/saved_model`):

```bash
nvidia-docker run --rm -it -p 8888:8888 -v /path/to/image_net:/data  -v /path/to/saved_model:/saved_model --name TFTRT nvcr.io/nvidia/tensorflow:19.04-py3
```

Within the container, we then start Jupyter notebook with:

```bash
jupyter notebook --ip 0.0.0.0 --port 8888  --allow-root
```

Connect to the Jupyter notebook web interface on your host at http://localhost:8888.


<a id="1"></a>
## 1. Pre-requisite: data and model

We first install some extra packages and external dependencies needed, e.g., for preprocessing ImageNet data.

In [None]:
%%bash
pushd /workspace/nvidia-examples/tensorrt/tftrt/examples/object_detection/ 
bash install_dependencies.sh;
popd

In [None]:
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
import numpy as np
import matplotlib.pyplot as plt
import os
import time

import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)

# Expose only the first GPU (numbered in PCI bus order) to TensorFlow
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

config = tf.ConfigProto()
config.gpu_options.allow_growth=True

#check TensorRT version
!dpkg -l | grep nvinfer

### Data
We verify that the correct ImageNet data folder has been mounted and that validation data files of the form `validation-00xxx-of-00128` are available.

In [None]:
def get_files(data_dir, filename_pattern):
    """Return the files in data_dir matching filename_pattern (empty list if data_dir is None)."""
    if data_dir is None:
        return []
    files = tf.gfile.Glob(os.path.join(data_dir, filename_pattern))
    if not files:
        raise ValueError('Cannot find any files in {} with '
                         'pattern "{}"'.format(data_dir, filename_pattern))
    return files

In [None]:
VALIDATION_DATA_DIR = "/data"
calibration_files = get_files(VALIDATION_DATA_DIR, 'validation*')
print('There are %d calibration files. \n%s\n%s\n...'%(len(calibration_files), calibration_files[0], calibration_files[-1]))
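Printing the first and last file names will not reveal a partially copied dataset. The following is a minimal plain-Python sketch, assuming the 128-shard `validation-XXXXX-of-00128` naming shown above, that lists any missing shard indices. `missing_shards` is a hypothetical helper, not part of the notebook or of TensorFlow; in the notebook it could be called as `missing_shards(calibration_files)`.

```python
import re

# Hypothetical helper: matches names such as ".../validation-00042-of-00128"
SHARD_RE = re.compile(r"validation-(\d{5})-of-(\d{5})$")

def missing_shards(paths, num_shards=128):
    """Return the sorted shard indices in [0, num_shards) that have no matching file."""
    found = set()
    for path in paths:
        match = SHARD_RE.search(path)
        # Only count shards that belong to a split of the expected size
        if match and int(match.group(2)) == num_shards:
            found.add(int(match.group(1)))
    return sorted(set(range(num_shards)) - found)

# Toy listing in which shard 00002 is absent
example = ["/data/validation-%05d-of-00128" % i for i in range(128) if i != 2]
print(missing_shards(example))  # [2]
```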


### TF saved model
If it has not already been downloaded, we download and extract a ResNet-50 v1 FP32 saved model from the official TensorFlow ResNet models at https://github.com/tensorflow/models/tree/master/official/resnet

In [None]:
%%bash
FILE=/saved_model/resnet_v1_fp32_savedmodel_NHWC.tar.gz
if [ -f "$FILE" ]; then
   echo "The file '$FILE' exists."
else
   echo "The file '$FILE' is not found. Downloading..."
   wget -P /saved_model/ http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v1_fp32_savedmodel_NHWC.tar.gz
fi

tar -xzvf "$FILE" -C /saved_model

### Helper functions
We define a few helper functions to read and preprocess ImageNet data from TFRecord files.

In [None]:
def deserialize_image_record(record):
    feature_map = {
        'image/encoded':          tf.FixedLenFeature([ ], tf.string, ''),
        'image/class/label':      tf.FixedLenFeature([1], tf.int64,  -1),
        'image/class/text':       tf.FixedLenFeature([ ], tf.string, ''),
        'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
        'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32)
    }
    with tf.name_scope('deserialize_image_record'):
        obj = tf.parse_single_example(record, feature_map)
        imgdata = obj['image/encoded']
        label   = tf.cast(obj['image/class/label'], tf.int32)
        bbox    = tf.stack([obj['image/object/bbox/%s'%x].values
                            for x in ['ymin', 'xmin', 'ymax', 'xmax']])
        bbox = tf.transpose(tf.expand_dims(bbox, 0), [0,2,1])
        text    = obj['image/class/text']
        return imgdata, label, bbox, text
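The last lines of `deserialize_image_record` reorder the box coordinates: `tf.stack` yields shape `[4, N]` (one row per coordinate), `tf.expand_dims` adds a leading axis (`[1, 4, N]`), and the transpose with `perm=[0, 2, 1]` gives `[1, N, 4]`, i.e. one `[ymin, xmin, ymax, xmax]` row per box. A plain-Python sketch of the same reshaping on lists follows; the helper name `stack_boxes` is hypothetical. Note that `preprocess` below discards `bbox`, so this only matters if you reuse the parser elsewhere.

```python
def stack_boxes(ymin, xmin, ymax, xmax):
    """Mimic the tf.stack / tf.expand_dims / tf.transpose sequence on plain lists."""
    stacked = [ymin, xmin, ymax, xmax]   # tf.stack: shape [4, N]
    expanded = [stacked]                 # tf.expand_dims(..., 0): shape [1, 4, N]
    # tf.transpose(..., [0, 2, 1]): swap the last two axes, giving shape [1, N, 4]
    return [[list(box) for box in zip(*plane)] for plane in expanded]

boxes = stack_boxes([0.1, 0.5], [0.2, 0.6], [0.3, 0.7], [0.4, 0.8])
print(boxes)  # [[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]]
```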

In [None]:
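from preprocessing import vgg_preprocessing

def preprocess(record):
    # Parse the serialized TFRecord example
    imgdata, label, bbox, text = deserialize_image_record(record)
    #label -= 1 # Change to 0-based if not using background class
    # Decoding happens when the graph runs, so a Python try/except around
    # decode_jpeg could never fall back to decode_png; the ImageNet TFRecords
    # store JPEG-encoded images, so we decode them directly.
    image = tf.image.decode_jpeg(imgdata, channels=3, fancy_upscaling=False, dct_method='INTEGER_FAST')

    # VGG-style evaluation preprocessing: resize, central crop to 224x224, mean subtraction
    image = vgg_preprocessing.preprocess_image(image, 224, 224, is_training=False)
    return image, label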

In [None]:
#Define some global variables
BATCH_SIZE = 64

dataset = tf.data.TFRecordDataset(calibration_files)    
dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=20))

<a id="2"></a>
## 2. Verifying the original FP32 model
We demonstrate the conversion process with a ResNet-50 v1 model. First, we inspect the original TensorFlow model.

In [None]:
SAVED_MODEL_DIR =  "/saved_model/resnet_v1_fp32_savedmodel_NHWC/1538686669/"
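The timestamped export directory `1538686669` is hard-coded above. If you download or export a different version, a sketch like the following could pick the highest-numbered export directory instead. `latest_export` is a hypothetical helper; it returns the path with a trailing separator so that the later `SAVED_MODEL_DIR + "_TFTRT_..."` concatenations behave exactly as they do with the hard-coded value.

```python
import os

def latest_export(base_dir):
    """Return the highest-numbered export directory under base_dir, with a trailing separator."""
    versions = [name for name in os.listdir(base_dir)
                if name.isdigit() and os.path.isdir(os.path.join(base_dir, name))]
    if not versions:
        raise ValueError("No numeric export directories found in %s" % base_dir)
    # Compare numerically; the trailing "" keeps a trailing separator on the result
    return os.path.join(base_dir, max(versions, key=int), "")
```

In the notebook this could be used as `SAVED_MODEL_DIR = latest_export("/saved_model/resnet_v1_fp32_savedmodel_NHWC")`.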

We employ `saved_model_cli` to inspect the inputs and outputs of the model.

In [None]:
!saved_model_cli show --all --dir $SAVED_MODEL_DIR

This gives us the names of the input and output tensors, `input_tensor:0` and `softmax_tensor:0` respectively. Also note that the number of output classes here is 1001 instead of the 1000 ImageNet classes, because the network was trained with an extra background class.
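The validation TFRecords produced by Google's preparation script store labels in the range 1-1000, with 0 reserved for background, so the 1001-way argmax can be compared with them directly; this is why the `label -= 1` line in `preprocess` is commented out. A small plain-Python sketch of that index mapping (both helper names are hypothetical):

```python
def argmax(scores):
    """Index of the largest score; ties go to the first index, as with np.argmax."""
    return max(range(len(scores)), key=scores.__getitem__)

def to_zero_based_imagenet(index_1001):
    """Map a 1001-way output index (0 = background) to a 0-based 1000-class index, or None."""
    return None if index_1001 == 0 else index_1001 - 1

scores = [0.0] * 1001
scores[1] = 0.9        # the network is most confident in output 1
tfrecord_label = 1     # 1-based label as stored in the TFRecord

print(argmax(scores) == tfrecord_label)        # True: compare directly, no offset
print(to_zero_based_imagenet(argmax(scores)))  # 0
```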

In [None]:
INPUT_TENSOR = 'input_tensor:0'
OUTPUT_TENSOR = 'softmax_tensor:0'

Next, we define a function that loads a saved model and measures its speed and accuracy on the validation data.

In [None]:
def benchmark_saved_model(SAVED_MODEL_DIR, dataset=dataset, BATCH_SIZE=64):
    # BATCH_SIZE is not used inside the function; the batch size is set by how `dataset` was batched.
    with tf.Session(graph=tf.Graph(), config=config) as sess:
        # prepare dataset iterator
        iterator = dataset.make_one_shot_iterator()
        next_element = iterator.get_next()

        # Load the saved model into this session's graph
        tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.SERVING], SAVED_MODEL_DIR)

        # Warm-up batches come from the same iterator and are not scored
        print('Warming up for 50 batches...')
        for _ in range(50):
            sess.run(OUTPUT_TENSOR, feed_dict={INPUT_TENSOR: sess.run(next_element)[0]})

        print('Benchmarking inference engine...')
        num_hits = 0
        num_predict = 0
        start_time = time.time()
        try:
            while True:
                image_data = sess.run(next_element)
                img = image_data[0]
                label = image_data[1].squeeze()
                output = sess.run([OUTPUT_TENSOR], feed_dict={INPUT_TENSOR: img})
                prediction = np.argmax(output[0], axis=1)
                num_hits += np.sum(prediction == label)
                num_predict += len(prediction)
        except tf.errors.OutOfRangeError:
            pass  # the validation set has been exhausted

        print('Accuracy: %.2f%%'%(100*num_hits/num_predict))
        print('Inference speed: %.2f samples/s'%(num_predict/(time.time()-start_time)))

In [None]:
benchmark_saved_model(SAVED_MODEL_DIR, dataset=dataset, BATCH_SIZE=BATCH_SIZE)
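Because the warm-up loop draws its 50 batches from the same one-shot iterator, those images are not included in the reported accuracy. A small arithmetic sketch (hypothetical helper `benchmark_split`), assuming the standard 50,000-image ImageNet validation set and that `map_and_batch` keeps the final partial batch (its `drop_remainder` default is `False`):

```python
def benchmark_split(num_images, batch_size, warmup_batches):
    """Return (warm-up images, scored images, scored batches, size of the last scored batch).

    Assumes every warm-up batch is full and the final partial batch is kept.
    """
    warmup_images = min(num_images, warmup_batches * batch_size)
    scored = num_images - warmup_images
    full_batches, remainder = divmod(scored, batch_size)
    scored_batches = full_batches + (1 if remainder else 0)
    if remainder:
        last_batch = remainder
    elif full_batches:
        last_batch = batch_size
    else:
        last_batch = 0
    return warmup_images, scored, scored_batches, last_batch

print(benchmark_split(50000, 64, 50))  # (3200, 46800, 732, 16)
```

With `BATCH_SIZE = 64`, the accuracy is therefore computed on 46,800 images in 732 batches, the last of which holds 16 images.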

<a id="3"></a>
## 3. Creating TF-TRT FP32 model

Next, we convert the native TF FP32 model to TF-TRT FP32, then verify the model's accuracy and inference speed.

In [None]:
FP32_SAVED_MODEL_DIR = SAVED_MODEL_DIR+"_TFTRT_FP32/1"
!rm -rf $FP32_SAVED_MODEL_DIR

#Now we create the TFTRT FP32 engine
trt.create_inference_graph(
    input_graph_def=None,
    outputs=None,
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=SAVED_MODEL_DIR,
    output_saved_model_dir=FP32_SAVED_MODEL_DIR,
    precision_mode="FP32")

benchmark_saved_model(FP32_SAVED_MODEL_DIR, dataset=dataset, BATCH_SIZE=BATCH_SIZE)
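Because `SAVED_MODEL_DIR` ends with a slash, the concatenation places the converted model inside the original export directory rather than next to it; the trailing `/1` is a numeric version subdirectory, the layout TensorFlow Serving expects. A plain-string sketch of both cases (the variable names are illustrative only):

```python
# The value used in the notebook, with its trailing slash
saved_model_dir = "/saved_model/resnet_v1_fp32_savedmodel_NHWC/1538686669/"

nested = saved_model_dir + "_TFTRT_FP32/1"               # what the notebook produces
sibling = saved_model_dir.rstrip("/") + "_TFTRT_FP32/1"  # what a slash-less value would produce

print(nested)   # /saved_model/resnet_v1_fp32_savedmodel_NHWC/1538686669/_TFTRT_FP32/1
print(sibling)  # /saved_model/resnet_v1_fp32_savedmodel_NHWC/1538686669_TFTRT_FP32/1
```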

<a id="4"></a>
## 4. Creating TF-TRT FP16 model

Next, we convert the native TF FP32 model to TF-TRT FP16, then verify the model's accuracy and inference speed.

In [None]:
FP16_SAVED_MODEL_DIR = SAVED_MODEL_DIR+"_TFTRT_FP16/1"
!rm -rf $FP16_SAVED_MODEL_DIR

#Now we create the TFTRT FP16 engine
trt.create_inference_graph(
    input_graph_def=None,
    outputs=None,
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=SAVED_MODEL_DIR,
    output_saved_model_dir=FP16_SAVED_MODEL_DIR,
    precision_mode="FP16")

benchmark_saved_model(FP16_SAVED_MODEL_DIR, dataset=dataset, BATCH_SIZE=BATCH_SIZE)
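FP16 stores 10 explicit mantissa bits (roughly three significant decimal digits) and has a largest finite value of 65504, so weights and activations are rounded; this is why accuracy is measured again after conversion. A minimal sketch using Python's `struct` half-precision format code `'e'` to show the rounding; it does not model how TensorRT's kernels compute or accumulate.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16(0.1))      # 0.0999755859375
print(to_fp16(1000.7))   # 1000.5: values in [512, 1024) are spaced 0.5 apart in FP16
print(to_fp16(65504.0))  # 65504.0, the largest finite FP16 value
```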

<a id="5"></a>
## 5. Creating TF-TRT INT8 model

Creating a TF-TRT INT8 inference model requires two steps:

- Step 1: create the calibration graph and run some representative data (here, ImageNet validation images) through it for INT8 calibration.

- Step 2: convert the calibration graph to the TF-TRT INT8 inference engine.
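The two steps can be pictured with a toy calibrator. This is only an illustration under simplifying assumptions: TensorRT chooses its INT8 ranges with its own, entropy-based calibrator, whereas this sketch swaps in the simpler max-abs rule and a symmetric `[-127, 127]` range. `MaxAbsCalibrator` is hypothetical.

```python
class MaxAbsCalibrator:
    """Toy symmetric INT8 calibrator that tracks max |activation| over calibration batches."""

    def __init__(self):
        self.max_abs = 0.0

    def observe(self, batch):
        # Step 1 analogue: run calibration data through and record the observed range
        self.max_abs = max(self.max_abs, max(abs(v) for v in batch))

    def scale(self):
        # Map [-max_abs, max_abs] onto the symmetric INT8 range [-127, 127]
        return self.max_abs / 127.0 if self.max_abs > 0 else 1.0

    def quantize(self, values):
        # Step 2 analogue: freeze the scale and convert values to clamped int8 codes
        # (Python's round() rounds halves to even)
        s = self.scale()
        return [max(-127, min(127, round(v / s))) for v in values]

c = MaxAbsCalibrator()
c.observe([12.5, -127.0])   # calibration batch 1
c.observe([64.0])           # calibration batch 2
print(c.scale())                        # 1.0
print(c.quantize([3.4, -200.0, 64.6]))  # [3, -127, 65]
```

Values outside the calibrated range saturate at the INT8 limits, which is why the calibration data should resemble what the model sees in deployment.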

### Step 1: Creating the calibration graph

In [None]:
#Now we create the TFTRT INT8 calibration graph
trt_int8_calib_graph = trt.create_inference_graph(
    input_graph_def=None,
    outputs=[OUTPUT_TENSOR],
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=SAVED_MODEL_DIR,    
    precision_mode="INT8")

#Then calibrate it with 10 batches of examples
N_runs=10
with tf.Session(graph=tf.Graph(), config=config) as sess:
    print('Preparing calibration data...')
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    print('Loading INT8 calibration graph...')
    output_node = tf.import_graph_def(
        trt_int8_calib_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')

    print('Calibrate model on calibration data...')    
    for _ in range(N_runs):
        sess.run(output_node, feed_dict={INPUT_TENSOR: sess.run(next_element)[0]})            

### Step 2: Converting the calibration graph to inference graph

Now we convert the INT8 calibration graph to the final TF-TRT INT8 inference engine, then save this engine to a *saved model*, ready to be served elsewhere.

In [None]:
# Convert the calibrated graph into the INT8 inference graph (it is saved to disk in the next cell)
print('Creating TF-TRT INT8 inference engine...')
trt_int8_calibrated_graph=trt.calib_graph_to_infer_graph(trt_int8_calib_graph)

In [None]:
#set a directory to write the saved model
INT8_SAVED_MODEL_DIR =  SAVED_MODEL_DIR + "_TFTRT_INT8/1"
!rm -rf $INT8_SAVED_MODEL_DIR

with tf.Session(graph=tf.Graph()) as sess:
    print('Loading TF-TRT INT8 inference engine...')
    output_node = tf.import_graph_def(
        trt_int8_calibrated_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')

    #Save model for serving
    print('Saving INT8 model to %s'%INT8_SAVED_MODEL_DIR)
    tf.saved_model.simple_save(
        session=sess,
        export_dir=INT8_SAVED_MODEL_DIR,
        inputs={"input":tf.get_default_graph().get_tensor_by_name(INPUT_TENSOR)},
        outputs={"softmax":tf.get_default_graph().get_tensor_by_name(OUTPUT_TENSOR),
                 "classes":tf.get_default_graph().get_tensor_by_name("ArgMax:0")},
        legacy_init_op=None
     )    
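`simple_save` exports the model under the default `serving_default` signature, with input key `input` and output keys `softmax` and `classes`, and the trailing `/1` is the numeric version directory that TensorFlow Serving looks for. A sketch of a REST `predict` request body for such a server, in the columnar (`"inputs"`) format; the host, port choice and model name `resnet_int8` are hypothetical, and the request is only built, not sent.

```python
import json

# Hypothetical server address and model name; 8501 is TensorFlow Serving's default REST port
PREDICT_URL = "http://localhost:8501/v1/models/resnet_int8:predict"

def make_predict_body(images, signature_name="serving_default"):
    """Build a columnar ("inputs") predict request for the signature exported by simple_save."""
    return json.dumps({"signature_name": signature_name,
                       "inputs": {"input": images}})

# A dummy batch of one 2x2 RGB "image"; real requests would carry 224x224x3 preprocessed floats
dummy = [[[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
          [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]]
body = make_predict_body(dummy)
print(body)
```

A columnar request returns a columnar response whose `"outputs"` object is keyed by the exported names, here `softmax` and `classes`.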

### Benchmarking INT8 saved model

Finally, we reload the INT8 saved model from disk and verify its accuracy and performance.

In [None]:
benchmark_saved_model(INT8_SAVED_MODEL_DIR)

In [None]:
!saved_model_cli show --all --dir $INT8_SAVED_MODEL_DIR

<a id="6"></a>
## 6. Calibrating TF-TRT INT8 model with raw JPEG images

As an alternative to data in TFRecord format, in this section we demonstrate the process of calibrating a TF-TRT INT8 model from a directory of raw JPEG images. We assume that the raw images have been mounted to the directory `/data/Calibration_data`.

As a rule of thumb, calibration data should be a small but representative set of images, similar to what is expected in deployment. Empirically, for common network architectures trained on ImageNet data, 500-1000 calibration images provide good accuracy. A good strategy for a dataset such as ImageNet is therefore to choose one sample from each class.
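A sketch of the one-sample-per-class strategy, assuming (hypothetically) that the images are organized as `/data/Calibration_data/<class>/<image>` so that the parent directory name identifies the class. It also computes how many batches the calibration loop needs to see every selected image: with `BATCH_SIZE = 64` and `N_runs = 10`, the calibration loop below consumes only 640 images, so covering 1,000 images would take 16 runs. Both helper names are hypothetical.

```python
import math
import os

def one_per_class(paths):
    """Keep the first file (in sorted order) from each parent directory, treated as the class."""
    chosen = {}
    for path in sorted(paths):
        label = os.path.basename(os.path.dirname(path))
        chosen.setdefault(label, path)
    return list(chosen.values())

def calibration_runs(num_images, batch_size):
    """Number of batches needed for the calibration loop to see every image at least once."""
    return math.ceil(num_images / batch_size)

print(calibration_runs(1000, 64))  # 16
```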

In [None]:
data_directory = "/data/Calibration_data"
calibration_files = [os.path.join(path, name) for path, _, files in os.walk(data_directory) for name in files]
print('There are %d calibration files. \n%s\n%s\n...'%(len(calibration_files), calibration_files[0], calibration_files[-1]))
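`os.walk` collects every file under the directory, so a stray non-image file (for example a label list or README) would only fail later, when `tf.image.decode_jpeg` runs inside the input pipeline. A minimal sketch of a filter that keeps only files starting with the JPEG signature bytes `FF D8 FF`; `is_jpeg` is a hypothetical helper.

```python
def is_jpeg(path):
    """True if the file starts with the JPEG start-of-image marker followed by another marker."""
    with open(path, "rb") as f:
        return f.read(3) == b"\xff\xd8\xff"
```

In the notebook it could be applied as `calibration_files = [p for p in calibration_files if is_jpeg(p)]`.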

We define a helper function to read and preprocess an image from a JPEG file.

In [None]:
def parse_file(filepath):
    image = tf.read_file(filepath)
    image = tf.image.decode_jpeg(image, channels=3)
    image = vgg_preprocessing.preprocess_image(image, 224, 224, is_training=False)
    return image

In [None]:
dataset = tf.data.Dataset.from_tensor_slices(calibration_files)
dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=parse_file, batch_size=BATCH_SIZE, num_parallel_calls=20))
dataset = dataset.repeat(None)

Next, we proceed with the two-step process of creating and calibrating the TF-TRT INT8 model.

### Step 1: Creating the calibration graph

In [None]:
#Now we create the TFTRT INT8 calibration graph
trt_int8_calib_graph = trt.create_inference_graph(
    input_graph_def=None,
    outputs=[OUTPUT_TENSOR],
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=SAVED_MODEL_DIR,    
    precision_mode="INT8")

#Then calibrate it with 10 batches of examples
N_runs=10
with tf.Session(graph=tf.Graph(), config=config) as sess:
    print('Preparing calibration data...')
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    print('Loading INT8 calibration graph...')
    output_node = tf.import_graph_def(
        trt_int8_calib_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')

    print('Calibrate model on calibration data...')    
    for _ in range(N_runs):
        sess.run(output_node, feed_dict={INPUT_TENSOR: sess.run(next_element)})            

### Step 2: Converting the calibration graph to inference graph

In [None]:
#Create Int8 inference model from the calibration graph and write to a saved session
print('Creating TF-TRT INT8 inference engine...')
trt_int8_calibrated_graph=trt.calib_graph_to_infer_graph(trt_int8_calib_graph)

#set a directory to write the saved model
INT8_SAVED_MODEL_DIR =  SAVED_MODEL_DIR + "_TFTRT_INT8_JPEG/1"
!rm -rf $INT8_SAVED_MODEL_DIR

with tf.Session(graph=tf.Graph()) as sess:
    print('Loading TF-TRT INT8 inference engine...')
    output_node = tf.import_graph_def(
        trt_int8_calibrated_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')

    #Save model for serving
    print('Saving INT8 model to %s'%INT8_SAVED_MODEL_DIR)
    tf.saved_model.simple_save(
        session=sess,
        export_dir=INT8_SAVED_MODEL_DIR,
        inputs={"input":tf.get_default_graph().get_tensor_by_name(INPUT_TENSOR)},
        outputs={"softmax":tf.get_default_graph().get_tensor_by_name(OUTPUT_TENSOR),
                 "classes":tf.get_default_graph().get_tensor_by_name("ArgMax:0")},
        legacy_init_op=None
     )    

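As before, we can benchmark the speed and accuracy of the resulting model. The benchmark runs on the ImageNet validation TFRecords: the `dataset` default argument of `benchmark_saved_model` was bound when the function was defined, so it is unaffected by reassigning `dataset` to the JPEG calibration pipeline above.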

In [None]:
benchmark_saved_model(INT8_SAVED_MODEL_DIR)

## Conclusion
In this notebook, we have demonstrated the process of creating TF-TRT inference models from an original TF FP32 *saved model*. In every case, we have also verified the accuracy and speed of the resulting model.
