# Deep Learning Network Deployment

By Jon Barker and Ryan Olson

## Introduction

Welcome to NVIDIA's deep learning network deployment lab. This lab uses DIGITS, Caffe and the GPU Inference Engine (GIE) to deploy deep neural networks trained in DIGITS. You will learn some of the factors that affect data throughput and latency during neural network inference. You will also see an example of how to use a neural network for efficient image classification within an easily deployable web service using GIE.

## Part 1: Inference using DIGITS

Deep-learning networks typically have two primary phases of development: training and inference.

### Neural network training and inference

Solving a supervised machine learning problem with deep neural networks involves a two-step process.

The first step is to train a deep neural network on massive amounts of labeled data using GPUs. During this step, the neural network learns millions of weights or parameters that enable it to map input data examples to correct responses. Training requires iterative forward and backward passes through the network as the objective function is minimized with respect to the network weights. Often several models are trained and accuracy is validated against data not seen during training in order to estimate real-world performance.

![](files/dnn.png)

The next step – inference – uses the trained model to make predictions from new data. During this step, the best trained model is used in an application running in a production environment such as a data center, an automobile, or an embedded platform. For some applications, such as autonomous driving, inference is done in real time and therefore high throughput is critical.  The typical training and inference cycle is depicted below.

<img src="files/digits_workflow.png" alt="Drawing" style="width: 800px;"/>
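
The two phases can be illustrated with a deliberately tiny example. The sketch below is not how DIGITS or Caffe train a network; it fits a single-weight model `y = w * x` with plain gradient descent on made-up data and then reuses the frozen weight for inference. The function names `train` and `infer`, the learning rate and the data are all illustrative assumptions.

```python
# Toy illustration of the two phases: training (iterative forward and
# backward passes) followed by inference (forward pass only).
# The model, data and hyperparameters are made up for illustration.

def train(samples, learning_rate=0.01, epochs=200):
    """Fit y = w * x by minimizing the mean squared error with gradient descent."""
    w = 0.0
    for _ in range(epochs):
        grad = 0.0
        for x, y in samples:
            prediction = w * x                   # forward pass
            grad += 2.0 * (prediction - y) * x   # backward pass: d(error^2)/dw
        w -= learning_rate * grad / len(samples)  # weight update
    return w


def infer(w, x):
    """Inference only needs the forward pass with the trained weight."""
    return w * x


# Labeled training data generated from y = 3x
training_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
weight = train(training_data)
print("learned weight: %.3f" % weight)
print("prediction for x=5: %.3f" % infer(weight, 5.0))
```

Training loops over the data many times and needs gradients; inference is a single cheap forward pass, which is why the two phases are optimized so differently in practice.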

Due to the size of the datasets and deep neural networks, training is typically very resource-intensive and can take weeks or months on traditional compute architectures. However, using GPUs greatly accelerates this process down to days or hours.

Due to the depth of deep neural networks, inference also requires significant compute resources to process imagery or other high-volume sensor data in real time. However, GPU acceleration can also be used during inference, whether it is in a data center processing incoming web queries or on the [Jetson TX1](http://www.nvidia.com/object/jetson-tx1-module.html)'s integrated NVIDIA GPU deployed onboard embedded platforms. This enables real-time inference in robotics applications like picking, autonomous navigation, agriculture, and industrial inspection.

Using DIGITS, anyone can easily get started and interactively train their networks with GPU acceleration.
DIGITS is an open-source project contributed by NVIDIA, located here: https://github.com/NVIDIA/DIGITS. DIGITS is also the starting point for deploying a trained neural network.

### Inference using DIGITS

Now click [here](/digits/) to open DIGITS in a separate tab. If at any time DIGITS asks you to log in, just make up a username and proceed.

The DIGITS server you will see running contains two neural networks listed under the `"Pretrained Models"` tab:

![](files/pretrained.png)

- *GoogleNet* is a well-known convolutional neural network (CNN) architecture that has been trained for image classification using the [ILSVRC12 ImageNet](http://www.image-net.org/challenges/LSVRC/2012/) dataset. This network can assign one of 1000 class labels to an entire image based on the dominant object present. The classes this model can recognize can be found [here](http://image-net.org/challenges/LSVRC/2012/browse-synsets).

- *pedestrian_detectNet* is another CNN architecture that is not only able to assign a global classification to an image but can go further and detect multiple objects within the image and draw bounding boxes around them.  The pre-trained model provided has been trained for the task of pedestrian detection using a large dataset of pedestrians in a variety of indoor and outdoor scenes.

Any pre-trained model in DIGITS can be immediately applied to a new test image through the same browser interface. Click on the name of one of the models to see the model summary view.  At the bottom of this view you will see the `"Inference Options"` section.  Here you can select to apply the model to a single test image, a whole LMDB database of test images or a list of files referenced in a text file.  

**Exercise:** Complete the following steps to test a single image that resides on the same server as the running DIGITS instance:

![](files/digits_inference.png)

If you successfully follow these steps you should see one of the inference outputs below:

![](files/digits_inference_outputs.png)

(**Optional Exercise**) Carry out the same inference task again using DetectNet, but this time select a version of the model from an earlier epoch in the training process using the `"Select Model"` dropdown. You should find that models from earlier epochs produce less accurate detection results.

## Part 2: Inference using pycaffe

In production deployment scenarios it is desirable to carry out inference programmatically through a deep learning framework API. The inference job that DIGITS just completed made use of the underlying Caffe deep learning framework through its Python API, pycaffe. We will now walk through a simple example Python application that uses pycaffe to detect pedestrians in a test image using the pretrained DetectNet model introduced in Part 1. Note that this example is representative of the way that inference is carried out in the majority of deep learning frameworks.

A model trained in DIGITS can be imported into pycaffe as an instance of the `caffe.Net` class. You can find the path of the directory containing the model definition files and trained model weights listed under **Job Directory** at the top of the model's page in DIGITS. For example:

<img src="files/digits_job.png" alt="Drawing" style="width: 600px;"/>

Similarly, if you look at the corresponding dataset in DIGITS you can find the **Job Directory** containing the dataset mean image.

**Exercise:** Find the job directory for the pedestrian_detectnet model and the pedestrian_dummy_dataset and use them as the `MODEL_JOB_DIR` and `DATA_JOB_DIR` parameters in the code cells below to apply DetectNet to a sequence of test frames from the video `melbourne.mp4` using pycaffe.

In [None]:
# Import required Python libraries
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 9)
import caffe
import numpy as np
import time
import os
import cv2
from IPython.display import clear_output

In [None]:
# Configure Caffe to use the GPU for inference
caffe.set_mode_gpu()

In [None]:
# Set the model job directory from DIGITS here
MODEL_JOB_DIR='/home/ubuntu/digits/digits/jobs/20160905-143028-2f08'
# Set the data job directory from DIGITS here
DATA_JOB_DIR='/home/ubuntu/digits/digits/jobs/20160905-135347-01d5'

In [None]:
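# We need to find the iteration number of the final model snapshot saved by DIGITS
iterations = []
for root, dirs, files in os.walk(MODEL_JOB_DIR):
    for f in files:
        if f.endswith('.solverstate'):
            iterations.append(int(f.split('_')[2].split('.')[0]))
# os.walk does not guarantee file order, so pick the largest iteration explicitly
last_iteration = str(max(iterations))
print 'Last snapshot was after iteration: ' + last_iteration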

In [None]:
# Load the dataset mean image file
mean = np.load(os.path.join(DATA_JOB_DIR,'train_db','mean.npy'))

In [None]:
# Instantiate a Caffe model in GPU memory
# The model architecture is defined in the deploy.prototxt file
# The pretrained model weights are contained in the snapshot_iter_<number>.caffemodel file
classifier = caffe.Net(os.path.join(MODEL_JOB_DIR,'deploy.prototxt'), 
                       os.path.join(MODEL_JOB_DIR,'snapshot_iter_' + last_iteration + '.caffemodel'),
                       caffe.TEST)

# Instantiate a Caffe Transformer object that will preprocess test images before inference
transformer = caffe.io.Transformer({'data': classifier.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1))
transformer.set_mean('data',mean.mean(1).mean(1)/255)
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2,1,0))

BATCH_SIZE, CHANNELS, HEIGHT, WIDTH = classifier.blobs['data'].data[...].shape

print 'The input size for the network is: (' + \
        str(BATCH_SIZE), str(CHANNELS), str(HEIGHT), str(WIDTH) + \
         ') (batch size, channels, height, width)'
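
To make the order of the `caffe.io.Transformer` steps concrete, here is a rough pure-Python sketch of what `preprocess` does to one image with settings like those above: transpose from height x width x channels to channels x height x width, swap the channel order, multiply by the raw scale, then subtract the per-channel mean. It is an illustration on nested lists rather than the pycaffe implementation; the function name `preprocess_sketch`, the tiny test image and the mean values are made up, and resizing is omitted.

```python
def preprocess_sketch(image_hwc, channel_swap=(2, 1, 0), raw_scale=255.0,
                      channel_mean=(0.0, 0.0, 0.0)):
    """Mimic the Transformer steps on a nested-list image indexed [row][col][channel]."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])

    # 1. Transpose (2, 0, 1): H x W x C  ->  C x H x W
    chw = [[[image_hwc[r][c][ch] for c in range(width)]
            for r in range(height)]
           for ch in range(channels)]

    # 2. Channel swap (2, 1, 0): reverse the channel order
    chw = [chw[ch] for ch in channel_swap]

    # 3. Raw scale, then 4. mean subtraction. The mean is applied after
    #    scaling and is indexed by the channel order that results from the swap.
    return [[[value * raw_scale - channel_mean[i] for value in row]
             for row in plane]
            for i, plane in enumerate(chw)]


# A 1 x 2 pixel image with three channels, values already scaled to [0, 1]
tiny = [[[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]]
out = preprocess_sketch(tiny, channel_mean=(10.0, 20.0, 30.0))
print(out)
```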

In [None]:
# Create opencv video object
vid = cv2.VideoCapture('/home/ubuntu/deployment_lab/melbourne.mp4')

# We will just use every n-th frame from the video
every_nth = 10
counter = 0

try:
    while(True):
        # Capture video frame-by-frame
        ret, frame = vid.read()
        counter += 1
        
        if not ret:
            
            # Release the Video Device if ret is false
            vid.release()
            # Message to be displayed after releasing the device
            print "Released Video Resource"
            break
        if counter%every_nth == 0:
            
            # Resize the captured frame to match the DetectNet model
            frame = cv2.resize(frame, (1024, 512), 0, 0)
            
            # Use the Caffe transformer to preprocess the frame
            data = transformer.preprocess('data', frame.astype('float16')/255)
            
            # Set the preprocessed frame to be the Caffe model's data layer
            classifier.blobs['data'].data[...] = data
            
            # Measure inference time for the feed-forward operation
            start = time.time()
            # The output of DetectNet is an array of bounding box predictions
            bounding_boxes = classifier.forward()['bbox-list'][0]
            end = (time.time() - start)*1000
            
            # Convert the image from OpenCV BGR format to matplotlib RGB format for display
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            
            # Create a copy of the image for drawing bounding boxes
            overlay = frame.copy()
            
            # Loop over the bounding box predictions and draw a rectangle for each bounding box
            for bbox in bounding_boxes:
                if  bbox.sum() > 0:
                    cv2.rectangle(overlay, (bbox[0],bbox[1]), (bbox[2],bbox[3]), (255, 0, 0), -1)
                    
            # Overlay the bounding box image on the original image
            frame = cv2.addWeighted(overlay, 0.5, frame, 0.5, 0, frame)
            
            # Display the inference time per frame
            cv2.putText(frame,"Inference time: %dms per frame" % end,
                        (10,500), cv2.FONT_HERSHEY_SIMPLEX,1,(255,255,255),2)

            # Display the frame
            imshow(frame)
            show()
            # Display the frame until new frame is available
            clear_output(wait=True)
            
# At any point you can stop the video playback and inference by  
# clicking on the stop (black square) icon at the top of the notebook
except KeyboardInterrupt:
    # Release the Video Device
    vid.release()
    # Message to be displayed after releasing the device
    print "Released Video Resource"            

As you will see, pycaffe can carry out DetectNet inference on a 1024x512 video frame in about 73ms. The video playback is clearly much slower than this because of the overhead of reading frames from the video and drawing the output frame with bounding box overlays. There is also a startup cost associated with initializing the model in memory. In a production system you would often run data ingest and output postprocessing in a separate thread, in parallel with the Caffe inference. You would also keep the model initialized in GPU memory so that it can accept new data at any time without incurring the startup overhead again.
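
One common way to separate data ingest from inference is a producer/consumer pair of threads joined by a bounded queue. The sketch below shows only that pattern, using the Python standard library; `read_frames` and `run_inference` are hypothetical placeholders that stand in for video decoding and `classifier.forward()`, and strings stand in for frames.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2


def read_frames(num_frames, frame_queue):
    """Producer: stands in for reading and resizing video frames."""
    for i in range(num_frames):
        frame_queue.put("frame-%d" % i)
    frame_queue.put(None)  # sentinel: no more frames


def run_inference(frame_queue, results):
    """Consumer: stands in for running an already loaded model on each frame."""
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        results.append("boxes for " + frame)  # placeholder for real inference


frame_queue = queue.Queue(maxsize=4)  # bounded so the reader cannot run far ahead
results = []

reader = threading.Thread(target=read_frames, args=(10, frame_queue))
worker = threading.Thread(target=run_inference, args=(frame_queue, results))
reader.start()
worker.start()
reader.join()
worker.join()

print("processed %d frames" % len(results))
```

The bounded queue keeps memory use predictable: if inference falls behind, the reader blocks instead of buffering an unlimited number of frames.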

In some applications we are less worried about the inference time for an individual frame (latency) and more concerned with how many frames we can process in a unit of time (throughput). In these situations we can carry out inference as a batch process, just like during training. In the case of video this would mean buffering frames until a batch is full and then carrying out inference.
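
Buffering frames until a batch is full can be expressed as a small generator. This is a minimal sketch, assuming frames arrive from any iterable; the name `batched` is our own, and a final partial batch is simply dropped, which is one possible policy among several.

```python
def batched(frames, batch_size):
    """Yield lists of `batch_size` frames; a final partial batch is dropped."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []


frames = ["f0", "f1", "f2", "f3", "f4"]
for group in batched(frames, 2):
    print(group)
```

Note the latency cost that comes with this: the first frame of each batch has to wait until the remaining frames of that batch have arrived before inference can start.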

**Exercise:** Modify the code cell below to carry out inference on the same video as before but with batch sizes 1, 2, 5 and 8.  Compare the per frame inference time and throughput.  

NOTE: If you try a batch size greater than 8 you will run out of GPU memory and see an error from the notebook.  This is fine to do, but you will have to re-execute the previous code cells to get back to this point.

In [None]:
NEW_BATCH_SIZE = 2

# Resize the input data layer for the new batch size
OLD_BATCH_SIZE, CHANNELS, HEIGHT, WIDTH = classifier.blobs['data'].data[...].shape
classifier.blobs['data'].reshape(NEW_BATCH_SIZE, CHANNELS, HEIGHT, WIDTH)
classifier.reshape()

# Create opencv video object
vid = cv2.VideoCapture('/home/ubuntu/deployment_lab/melbourne.mp4')

counter = 0

batch = np.zeros((NEW_BATCH_SIZE, HEIGHT, WIDTH, CHANNELS))

try:
    while(True):
        # Capture video frame-by-frame
        ret, frame = vid.read()
        
        if not ret:
            # Release the Video Device if ret is false
            vid.release()
            # Message to be displayed after releasing the device
            print "Released Video Resource"
            break
            
        # Resize the captured frame to match the DetectNet model
        frame = cv2.resize(frame, (WIDTH, HEIGHT), 0, 0)
        
        # Add frame to batch array
        batch[counter%NEW_BATCH_SIZE,:,:,:] = frame
        counter += 1
        
        if counter%NEW_BATCH_SIZE==0:
        
            # Use the Caffe transformer to preprocess every frame in the batch
            data = np.array([transformer.preprocess('data', img.astype('float16')/255)
                             for img in batch])
            
            # Set the preprocessed batch to be the Caffe model's data layer
            classifier.blobs['data'].data[...] = data
            
            # Measure inference time for the feed-forward operation
            start = time.time()
            # The output of DetectNet is now an array of bounding box predictions
            # for each image in the input batch
            bounding_boxes = classifier.forward()['bbox-list']
            end = (time.time() - start)*1000
            
            print 'Inference time: %dms per batch, %dms per frame, output size %s' % \
                    (end, end/NEW_BATCH_SIZE, bounding_boxes.shape)
            
# At any point you can stop the video playback and inference by  
# clicking on the stop (black square) icon at the top of the notebook
except KeyboardInterrupt:
    # Release the Video Device
    vid.release()
    # Message to be displayed after releasing the device
    print "Released Video Resource"      

You should have found that the per-frame inference time decreases as batch size increases (higher throughput), but the time to process a whole batch also increases (higher latency).
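
The relationship between the two numbers printed by the cell above is simple arithmetic, sketched here. The helper `batch_metrics` is our own name; the 73 ms figure is the approximate single-frame time quoted earlier, while the 200 ms batch time is a hypothetical value chosen only to show the calculation, not a measurement.

```python
def batch_metrics(batch_size, batch_time_ms):
    """Return (per-frame time in ms, throughput in frames per second)."""
    per_frame_ms = float(batch_time_ms) / batch_size
    frames_per_second = 1000.0 * batch_size / batch_time_ms
    return per_frame_ms, frames_per_second


# Batch size 1 at roughly 73 ms per forward pass, as quoted earlier
print("batch 1: %.1f ms per frame, %.1f frames/s" % batch_metrics(1, 73))

# HYPOTHETICAL timing for a batch of 4, purely to show the calculation
print("batch 4: %.1f ms per frame, %.1f frames/s" % batch_metrics(4, 200))
```

Substitute the batch times you measured in the exercise to compare throughput across batch sizes.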

## Part 3: NVIDIA GPU Inference Engine

NVIDIA [GPU Inference Engine](https://developer.nvidia.com/gpu-inference-engine) (GIE) is a high-performance deep learning inference solution for production environments. Power efficiency and speed of response are two key metrics for deployed deep learning applications, because they directly affect the user experience and the cost of the service provided. GIE automatically optimizes trained neural networks for run-time performance and delivers GPU-accelerated inference for web/mobile, embedded and automotive applications.

![GIE](GIE_Graphics_FINAL-1.png)

There are two phases in the use of GIE: build and deployment. In the build phase, GIE performs optimizations on the network configuration and generates an optimized plan for computing the forward pass through the deep neural network. The plan is optimized object code that can be serialized and stored in memory or on disk.

The GIE runtime needs three files to deploy a classification neural network:

1. a network architecture file (deploy.prototxt),
2. trained weights (net.caffemodel), and
3. a label file to provide a name for each output class.

These can all be obtained from a trained DIGITS model.  In addition, you must define the batch size and the output layer.
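
The role of the label file is easiest to see in code. The sketch below assumes the label file follows the style of Caffe's `synset_words.txt`, where each line is a synset ID followed by human-readable names, and that the network output is a list of class probabilities in the same order. The helper names `parse_labels` and `top_k` and the probability values are made up for illustration.

```python
def parse_labels(lines):
    """Each line is '<synset id> <human-readable names>'; keep only the names."""
    labels = []
    for line in lines:
        line = line.strip()
        if line:
            synset_id, _, name = line.partition(" ")
            labels.append(name)
    return labels


def top_k(probabilities, labels, k=1):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [(labels[i], probabilities[i]) for i in ranked[:k]]


# Three label lines in the synset_words.txt style (names possibly shortened)
# and a made-up softmax output for a three-class network
label_lines = [
    "n01440764 tench, Tinca tinca",
    "n01443537 goldfish, Carassius auratus",
    "n01484850 great white shark, white shark",
]
softmax_output = [0.1, 0.7, 0.2]

print(top_k(softmax_output, parse_labels(label_lines), k=2))
```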

GIE supports the following layer types:

- Convolution: 2D
- Activation: ReLU, tanh and sigmoid
- Pooling: max and average
- ElementWise: sum, product or max of two tensors
- LRN: cross-channel only
- Fully-connected: with or without bias
- SoftMax: cross-channel only
- Deconvolution

We are going to build a GIE runtime for the GoogleNet model trained in DIGITS introduced in Part 1.  

The `InferenceEngine.cpp` file in the editor below takes a Caffe model and generates a GIE object. In particular, the builder (lines 36-42) is responsible for reading the network information from the Caffe .prototxt and .caffemodel files. The builder can also be used to define network information directly if no .prototxt file is available.

Other significant lines to inspect in this file are:

- line 46 - the output layer from the network is chosen, we are using the "softmax" layer as we want to classify images.  If we wished to use GIE to extract features from an intermediate layer of the network we could specify that layer by name here.
- line 49 - we choose the maximum input batch size that will be used.  In this example we simply process one image at a time (batch size 1), but if we wished to maximize throughput at the expense of latency we could process larger batches.
- line 50 - GIE performs layer optimizations to reduce inference time.  While this is transparent to the user, analyzing the network layers requires memory, so you must specify the maximum workspace size.

<iframe id="task1" src="task1" width="100%" height="400px">
    <p>Your browser does not support iframes.</p>
</iframe>

Once you have finished inspecting the InferenceEngine.cpp code, execute the command below to build the simple create_plan application, which will call InferenceEngine and serialize the generated GIE object to disk.

In [None]:
!cd /home/ubuntu/GRE_Web_Demo/GIEEngine/src/ && make && \
echo "create_plan app created"

We can now use the create_plan application to generate a GIE object from our GoogleNet model trained in DIGITS.  Execute the cell below to do this - the final line specifies where the GIE object (sometimes called a plan) will be written.

In [None]:
!/home/ubuntu/GRE_Web_Demo/GIEEngine/src/create_plan \
~/digits/digits/jobs/20160908-203605-e46b/deploy.prototxt \
~/digits/digits/jobs/20160908-203605-e46b/snapshot_iter_32.caffemodel \
~/GRE_Web_Demo/GIEEngine/src/imagenet.plan && \
echo "Plan created"

GIE performs several important transformations and optimizations to the neural network graph. First, layers with unused output are eliminated to avoid unnecessary computation. Next, where possible convolution, bias, and ReLU layers are fused to form a single layer. Layer fusion improves the efficiency of running GIE-optimized networks on the GPU.
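
The idea behind fusing convolution, bias and ReLU can be shown with a conceptual sketch. This is not GIE's implementation: a single multiplication stands in for the convolution, and the function names are our own. The point is only that one pass over the data produces the same result as three separate passes, without the intermediate results.

```python
def conv(x, w):
    """Stand-in for a convolution: scale every element by one weight."""
    return [w * v for v in x]


def add_bias(x, b):
    return [v + b for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def fused_conv_bias_relu(x, w, b):
    """One pass over the data instead of three, with no intermediate lists."""
    return [max(0.0, w * v + b) for v in x]


x = [-2.0, -0.5, 0.0, 1.5]
unfused = relu(add_bias(conv(x, 2.0), 1.0))
fused = fused_conv_bias_relu(x, 2.0, 1.0)
print(unfused)
print(fused)
```

On a GPU the saving comes mainly from launching fewer kernels and avoiding reads and writes of the intermediate tensors.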

Another transformation is horizontal layer fusion, or layer aggregation, along with the required division of aggregated layers to their respective outputs. Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters, resulting in a single larger layer for higher computational efficiency.
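
Horizontal fusion can be sketched in the same conceptual way. Here two small dense layers that read the same input are combined by stacking their weights into one larger layer, and the combined output is then divided back into the two original outputs. Again, this illustrates the idea rather than GIE's code; `matvec` and the weights are made up.

```python
def matvec(weights, x):
    """Dense layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


x = [1.0, 2.0]

# Two separate layers that read the same input tensor
branch_a = [[1.0, 0.0], [0.0, 1.0]]   # 2 outputs
branch_b = [[2.0, 2.0]]               # 1 output
separate = (matvec(branch_a, x), matvec(branch_b, x))

# Horizontal fusion: stack the weights, run one larger layer, split the result
fused_weights = branch_a + branch_b
fused_out = matvec(fused_weights, x)
split = (fused_out[:len(branch_a)], fused_out[len(branch_a):])

print(separate)
print(split)
```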

For more information on these transformations see the GIE Parallel Forall blog post [here](https://devblogs.nvidia.com/parallelforall/production-deep-learning-nvidia-gpu-inference-engine/).

The full scope of batching and streaming data to and from the runtime inference engine is beyond the scope of this lab.  However, the `classification.cpp` file in the editor below contains the key steps required to use the inference engine to process a batch of input data and generate a result.

<iframe id="task2" src="task2" width="100%" height="400px">
    <p>Your browser does not support iframes.</p>
</iframe>

For the purposes of this lab, we have created a [Docker](https://www.docker.com/) container called gre_gie that contains an application called GIE. It uses `classification.cpp` to instantiate a GIE inference engine from the plan we created and to process images that are passed to it through a REST service.

Execute the cell below to use [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) to launch the gre_gie container.

In [None]:
!nvidia-docker run \
    -d --name gre_gie -p 8085:8000 \
    -v /home/ubuntu/GRE_Web_Demo/GIEEngine/src:/inference-engine:ro \
    -v /home/ubuntu/caffe/data/ilsvrc12:/imagenet:ro \
    gre_gie \
    /inference-engine/imagenet.plan \
    /imagenet/imagenet_mean.binaryproto \
    /imagenet/synset_words.txt && \
echo "GIE container running"

We have also created a second container called gre_caffegpu that carries out inference on images passed through a REST service but uses GPU-accelerated Caffe instead of GIE. The Caffe container uses the same native Caffe model for inference as we used to create the GIE plan above.

In [None]:
!nvidia-docker run \
    -d --name gre_caffegpu -p 8082:8000 \
    -v /home/ubuntu/digits/digits/jobs/20160908-203605-e46b:/model:ro \
    -v /home/ubuntu/caffe/data/ilsvrc12:/imagenet:ro \
    gre_caffegpu \
    /model/deploy.prototxt \
    /model/snapshot_iter_32.caffemodel \
    /imagenet/imagenet_mean.binaryproto \
    /imagenet/synset_words.txt && \
echo "Caffe GPU container running"

Finally, we created a Docker container called gre_frontend that serves a simple web interface. You can upload an image to it and have the image classified using either the GIE application in the gre_gie container or native GPU Caffe in the gre_caffegpu container.

Execute the cell below to start the web interface.

In [None]:
!nvidia-docker run --name gre_frontend -d --net=host gre_frontend FrontEnd \
    "0.0.0.0:8080" \
    "0.0.0.0:8081" \
    "0.0.0.0:8082" \
    "0.0.0.0:8083" \
    "0.0.0.0:8084" \
    "0.0.0.0:8085" \
    "0.0.0.0:8086" && \
echo "GIE web demo container running"

**Once you have received the "GIE container running", "Caffe GPU container running" and "GIE web demo container running" messages above**, click [here](/gre/) to connect to the GIE web interface in a separate tab.

You should see a screen like this:

![](files/GRE.png)

You can now classify an image provided through the web interface using either the running GIE or Caffe container. Either click [here](files/test_images/GN_test.png) to open our GoogleNet test image and save it to disk, or use your own test image in .png or .jpeg format:

1. Click on "Image Classification" in the top left corner
2. Click choose file and point to your test image
3. Click either the "Caffe" or "GIE" buttons to pass the image to the running GIE or Caffe container for classification
4. Click both the "profile" buttons to plot the time taken for making the web request, image preprocessing and inference using GIE and Caffe

You should see an output like this:

![](files/GREoutput.png)

Note that the inference time here includes image preprocessing and the overhead of making a web request, and this lab is running on an older K520 GPU, so it is not a true representation of the maximum possible inference speed of GIE or Caffe. However, we see that the GIE inference time (dark green bar) is lower than the Caffe time (light green bar).
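
If you want to see how much each stage contributes in your own application, you can time the stages separately. The sketch below uses `time.time()` as in the earlier cells; the helper `timed` and the placeholder `preprocess` and `infer` functions are hypothetical stand-ins for real image decoding and a real forward pass.

```python
import time


def timed(stage_times, name, func, *args):
    """Run func(*args), record its wall-clock time in milliseconds, return its result."""
    start = time.time()
    result = func(*args)
    stage_times[name] = (time.time() - start) * 1000
    return result


# Placeholder stages; a real service would decode the image and run the network
def preprocess(image):
    return [p / 255.0 for p in image]


def infer(data):
    return sum(data)


stage_times = {}
data = timed(stage_times, "preprocess", preprocess, [0, 128, 255])
score = timed(stage_times, "inference", infer, data)

for name in ("preprocess", "inference"):
    print("%s: %.3f ms" % (name, stage_times[name]))
```

Single measurements are noisy, so in practice you would repeat each stage many times and average the results.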

At the end of the day, the success of GIE comes down to the performance it provides for inference. To measure the performance benefits we compared the per-layer timings of the GoogLeNet network using Caffe and GIE on NVIDIA Tesla M4 GPUs with a batch size of 1 averaged over 1000 iterations with GPU clocks fixed in the P0 state.

![](files/GIE_GoogLeNet_top10kernels-1.png)

In [None]:
# stop and remove any existing docker containers
!sudo docker stop $(docker ps -a -q)
!sudo docker rm $(docker ps -a -q)