<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Computer Vision for Industrial Inspection #

## Part 3 - Model Deployment for Inference ##
In this notebook, we will take our previously trained classification model, export it as a TensorRT engine, and deploy it on Triton Inference Server. TensorRT is a highly optimized package that takes trained models and optimizes them for inference. We'll learn how to create the model directory structures and configuration files within Triton Inference Server and how to send inference requests to the models deployed within it.

**Table of Contents**
<br>
This notebook covers the below sections: 
1. [Set Up Environment](#s3-1)
    * [Set Up Environment Variables](#s3-1.1)
    * [TAO Toolkit Model Export](#s3-1.2)
    * [TensorRT - Programmable Inference Accelerator](#s3-1.3)
    * [Export the Trained Model](#s3-1.4)
2. [Introduction to Triton Inference Server](#s3-2)
    * [Server](#s3-2.1)
    * [Client](#s3-2.2)
    * [Model Repository](#s3-2.3)
    * [Exercise #1 - Model Configuration](#s3-e1)
3. [Run Inference on Triton Inference Server](#s3-3)
    * [Server Health Status](#s3-3.1)
    * [Prepare Data](#s3-3.2)
    * [Exercise #2 - Pre-process Inputs](#s3-e2)
    * [Send Inference Request to Server](#s3-3.3)
    * [Measure Performance](#s3-3.4)
4. [Run Batch Inference](#s3-4)
5. [Run FP16 Inference](#s3-5)
6. [Conclusion](#s3-6)

<a name='s3-1'></a>
## Set Up Environment ##

<a name='s3-1.1'></a>
### Set Up Environment Variables ###
We set a few environment variables to help mount local directories into the TAO container. Specifically, we set `$LOCAL_PROJECT_DIR`, `$LOCAL_SPECS_DIR`, and `$LOCAL_EXPERIMENT_DIR` on the host, along with their counterparts inside the TAO container (`$TAO_PROJECT_DIR`, `$TAO_SPECS_DIR`, and `$TAO_EXPERIMENT_DIR`). This ensures that the artifacts generated by the TAO experiment, such as checkpoints, model files (e.g. `.tlt` or `.etlt`), and logs, are written to `$LOCAL_PROJECT_DIR/classification`. 

_Note that users can define their own export encryption key when training from a general-purpose model. The key protects proprietary IP and is used to decrypt the `.etlt` model during deployment._

In [1]:
# DO NOT CHANGE THIS CELL
# set environment variables
import os
import pandas as pd
import time
import shutil
import json
import numpy as np
from PIL import Image
import warnings
warnings.filterwarnings("ignore")

%set_env KEY=my_model_key

%set_env LOCAL_PROJECT_DIR=/dli/task/tao_project
%set_env LOCAL_SPECS_DIR=/dli/task/tao_project/spec_files
os.environ["LOCAL_EXPERIMENT_DIR"]=os.path.join(os.getenv("LOCAL_PROJECT_DIR"), "classification")

%set_env TAO_PROJECT_DIR=/workspace/tao-experiments
%set_env TAO_SPECS_DIR=/workspace/tao-experiments/spec_files
os.environ['TAO_EXPERIMENT_DIR']=os.path.join(os.getenv("TAO_PROJECT_DIR"), "classification")

# # unzip
!unzip -qq data/viz_BYD_new.zip -d data

# # remove zip file
!rm data/viz_BYD_new.zip

env: KEY=my_model_key
env: LOCAL_PROJECT_DIR=/dli/task/tao_project
env: LOCAL_SPECS_DIR=/dli/task/tao_project/spec_files
env: TAO_PROJECT_DIR=/workspace/tao-experiments
env: TAO_SPECS_DIR=/workspace/tao-experiments/spec_files


The cell below maps the project directory on your local host to a workspace directory in the TAO Docker container, so that data and results are shared between the host and the container. This is done by creating a `.tao_mounts.json` file. For more information, please refer to the [launcher instance](https://docs.nvidia.com/tao/tao-toolkit/tao_launcher.html) section of the user guide. Setting `DockerOptions` ensures that you don't run into permission issues when writing data into folders created by the TAO Docker container.

In [2]:
# DO NOT CHANGE THIS CELL
# mapping up the local directories to the TAO docker
mounts_file = os.path.expanduser("~/.tao_mounts.json")

drive_map = {
    "Mounts": [
            # Mapping the data directory
            {
                "source": os.environ["LOCAL_PROJECT_DIR"],
                "destination": "/workspace/tao-experiments"
            },
        ],
    "DockerOptions": {
        "user": "{}:{}".format(os.getuid(), os.getgid())
    }
}

# writing the mounts file
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)
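Because of this mount, the same file has two paths: one on the host (under `$LOCAL_PROJECT_DIR`) and one inside the container (under `/workspace/tao-experiments`). This is why the `tao` commands in this notebook use the `$TAO_...` variables while plain shell commands use the `$LOCAL_...` ones. As a minimal sketch, the translation could be written as follows; `to_container_path` is a hypothetical helper for illustration and is not part of the TAO launcher.

```python
from pathlib import PurePosixPath

def to_container_path(local_path, mounts):
    """Translate a host path into its path inside the container.

    `mounts` is a list of {"source": ..., "destination": ...} dicts,
    in the same shape as the "Mounts" entry of ~/.tao_mounts.json.
    """
    local = PurePosixPath(local_path)
    for mount in mounts:
        source = PurePosixPath(mount["source"])
        try:
            relative = local.relative_to(source)
        except ValueError:
            continue  # local_path is not under this mount
        return str(PurePosixPath(mount["destination"]) / relative)
    raise ValueError(f"{local_path} is not under any mounted directory")

mounts = [{"source": "/dli/task/tao_project",
           "destination": "/workspace/tao-experiments"}]
print(to_container_path("/dli/task/tao_project/classification/export", mounts))
```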

<a name='s3-1.2'></a>
### TAO Toolkit Model Export ###
Once we are satisfied with our model, we can move to deployment. `classification` includes an `export` subtask to export and prepare a trained classification model for deployment. Exporting the model decouples the training process from deployment and allows conversion to TensorRT engines outside the TAO environment. The exported model, referred to as the `.etlt` (encrypted TAO) file, can be used universally across training and deployment hardware. TensorRT engines, referred to interchangeably as `.trt` or `.engine` files, are specific to each hardware configuration and should be generated for each unique inference environment. 

<a name='s3-1.3'></a>
### TensorRT - Programmable Inference Accelerator ###

NVIDIA [TensorRT](https://developer.nvidia.com/tensorrt) is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. 

With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded, or automotive product platforms.

TensorRT applies the following optimizations to the layer graph: 
1. Elimination of layers whose outputs are not used
2. Fusion of convolution, bias, and ReLU operations
3. Aggregation of operations with sufficiently similar parameters and the same source tensor 
    (for example, the 1x1 convolutions in GoogLeNet's inception module)
4. Merging of concatenation layers by directing layer outputs to the correct eventual destination

Here are some great resources to learn more about TensorRT:
 
* Main Page: https://developer.nvidia.com/tensorrt
* Blogs: https://devblogs.nvidia.com/speed-up-inference-tensorrt/
* Download: https://developer.nvidia.com/nvidia-tensorrt-download
* Documentation: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html
* Sample Code: https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html
* GitHub: https://github.com/NVIDIA/TensorRT
* NGC Container: https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt

<a name='s3-1.4'></a>
### Export the Trained Model ###

When using the `export` subtask, the `-m` argument indicates the path to the `.tlt` model file to be exported, the `-o` argument indicates the path where the exported `.etlt` model is saved, the `-e` argument indicates the path to the spec file, and the `-k` argument indicates the key to _load_ the model. There are two optional arguments, `--gen_ds_config` and `--engine_file`, that are useful for us. The `--gen_ds_config` argument indicates whether to generate a template inference configuration file and requires the `--classmap_json` argument if used. The `--engine_file` argument indicates the path to the serialized TensorRT engine file. 
<p><img src='images/important.png' width=720></p>

Note that a TensorRT engine file is hardware specific and cannot be generalized across GPUs: an engine can only be used for deployment if the deployment GPU is identical to the GPU on which the engine was built. This is true in our case, since the Triton Inference Server runs on the same hardware. 

In [3]:
# DO NOT CHANGE THIS CELL
# remove any previous exports if exists
!mkdir -p $LOCAL_EXPERIMENT_DIR/export
!rm -rf $LOCAL_EXPERIMENT_DIR/export/*

In [4]:
# DO NOT CHANGE THIS CELL
# show trained model
!ls -ltrh $LOCAL_EXPERIMENT_DIR/resnet50/weights

total 291M
-rw-rw-rw- 1 root root 291M Jan 27  2023 resnet_025.tlt


In [5]:
# DO NOT CHANGE THIS CELL
# export model and TensorRT engine
!tao classification export -m $TAO_EXPERIMENT_DIR/resnet50/weights/resnet_025.tlt \
                           -o $TAO_EXPERIMENT_DIR/export/resnet50_fp32.etlt \
                           -k $KEY \
                           --engine_file $TAO_EXPERIMENT_DIR/export/resnet50_fp32.engine \
                           --classmap_json $TAO_EXPERIMENT_DIR/resnet50/classmap.json \
                           --gen_ds_config

2024-05-21 06:23:18,037 [INFO] root: Registry: ['nvcr.io']
2024-05-21 06:23:18,208 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2024-05-21 06:23:18,209 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2024-05-21 06:23:18,210 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
...
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf
2024-05-21 06:26:04,350 [INFO] tlt.components.docker_handler.docker_handler: Container pull complete.
Using TensorFlow backend.
Using TensorFlow backend.
2024-05-21 06:26:13,389 [INFO] root: Building exporter object.
2024-05-21 06:26:22,991 [INFO] root: Exporting the model.
2024-05-21 06:26:22,991 [INFO] root: Using input nodes: ['input_1']
2024-

<p><img src='images/check.png' width=720></p>

Did you get the below error message? This is likely due to a bad NGC CLI configuration. Please check the NGC CLI and Docker Registry section of the [introduction notebook](00_introduction.ipynb).

In [6]:
# DO NOT CHANGE THIS CELL
# check that the TensorRT engine was successfully created. 
!ls -al $LOCAL_EXPERIMENT_DIR/export

total 377284
drwxr-xr-x 2 root root      4096 May 21 06:27 .
drwxrwxrwx 1 root root      4096 May 21 06:23 ..
-rw-r--r-- 1 root root        17 May 21 06:27 labels.txt
-rw-r--r-- 1 root root       243 May 21 06:27 nvinfer_config.txt
-rw-r--r-- 1 root root 233472772 May 21 06:27 resnet50_fp32.engine
-rw-r--r-- 1 root root 152843776 May 21 06:27 resnet50_fp32.etlt


<a name='s3-2'></a>
## Introduction to Triton Inference Server ##
NVIDIA [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) simplifies the deployment of AI models at scale in production. Triton is open-source inference-serving software that lets teams deploy trained AI models from any framework, from local storage or from Google Cloud Platform or Azure, on any GPU- or CPU-based infrastructure (cloud, data center, or edge).

The figure below shows the Triton Inference Server high-level architecture. The model repository is a _file-system based repository_ of the models that Triton will make available for inferencing. Inference requests arrive at the server via [HTTP/REST](https://en.wikipedia.org/wiki/Representational_state_transfer), [gRPC](https://en.wikipedia.org/wiki/GRPC), or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model's scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs, which are then returned to the client.
<p><img src='images/triton_server_architecture.png' width='720'/></p>

<a name='s3-2.1'></a>
### Server ###
Setting up the Triton Inference Server requires software for the server and the client. One can get started with Triton Inference Server by pulling the [container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver) from the NVIDIA NGC catalog. In this lab, we already have a Triton Inference Server instance running. The command to run a Triton Inference Server instance is shown below. More details can be found in the [QuickStart Documentation](https://github.com/triton-inference-server/server/blob/r20.12/docs/quickstart.md) and [Build Documentation](https://github.com/triton-inference-server/server/blob/r20.12/docs/build.md). 

```
docker run \
  --gpus=1 \
  --ipc=host --rm \
  --shm-size=1g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /models:/models \
  nvcr.io/nvidia/tritonserver:20.12-py3 \
  tritonserver \
  --model-repository=/models \
  --exit-on-error=false \
  --model-control-mode=poll \
  --repository-poll-secs 30
```

<a name='s3-2.2'></a>
### Client ###
We've also installed the Triton Inference Server Client libraries to provide APIs that make it easy to communicate with Triton from your C++ or Python application. Using these libraries, you can send either HTTP/REST or gRPC requests to Triton to access all its capabilities: inferencing, status and health, statistics and metrics, model repository management, etc. These libraries also support using system and CUDA shared memory for passing inputs to and receiving outputs from Triton. The easiest way to get the Python client library is to use `pip` to install the `tritonclient` module, as detailed below. For more details on how to download or build the Triton Inference Server Client libraries, you can find the documentation [here](https://github.com/triton-inference-server/server/blob/r20.12/docs/client_libraries.md), as well as examples that show the use of both the C++ and Python libraries.

```
pip install nvidia-pyindex
pip install tritonclient[all]
```
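For orientation, the HTTP client wraps a JSON-based inference protocol (the "v2" protocol visible in the `/v2/...` URLs used in this notebook). The sketch below, using only the standard library, shows roughly what a request body looks like. The tensor shape and values are placeholders, and `build_infer_request` is a hypothetical helper; the client library normally builds the request for you and can also send tensor data in a more compact binary form.

```python
import json

def build_infer_request(input_name, shape, datatype, data, output_name):
    """Build the JSON body of a v2 inference request (illustrative only)."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError(f"expected {expected} values, got {len(data)}")
    return {
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": datatype,
            "data": list(data),  # flattened, row-major order
        }],
        "outputs": [{"name": output_name}],
    }

# Placeholder tensor: 1 channel, 2x2 pixels
body = build_infer_request("input_1", (1, 2, 2), "FP32", [0.0, 0.5, 1.0, 1.5],
                           "predictions/Softmax")
print(json.dumps(body, indent=2))
# The body would be POSTed to: /v2/models/<model_name>/infer
```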

<a name='s3-2.3'></a>
### Model Repository ###
Triton Inference Server serves models within a model repository. When you first run Triton Inference Server, you'll specify the model repository where the models reside:

```
tritonserver --model-repository=/models
```

Each model resides in its own model subdirectory within the model repository - i.e., each directory within `/models` represents a unique model. For example, in this notebook we'll be deploying our defect classification model. All models typically follow a similar directory structure. Within each of these directories, we'll create a configuration file `config.pbtxt` that details information about the model - e.g. _batch size_, _input shapes_, _deployment backend_ (PyTorch, ONNX, TensorFlow, TensorRT, etc.) and more. Additionally, we can create one or more versions of our model. Each version lives in a subdirectory named after its version number, starting with `1`. Our model files reside within this subdirectory. 

```
root@server:/models$ tree
.
├── defect_classification_model
│   ├── 1
│   │   └── model.plan
│   └── config.pbtxt
│

```

We can also add a file representing the names of the outputs. We have omitted this step in this notebook for the sake of brevity. For more details on how to work with model repositories and model directory structures in Triton Inference Server, please see the [documentation](https://github.com/triton-inference-server/server/blob/r20.12/docs/model_repository.md). Below, we'll create the model directory structure for our defect classification model.

In [7]:
# DO NOT CHANGE THIS CELL
# create directory for model
!mkdir -p models/defect_classification_model_fp32/1

# copy resnet50_fp32.engine from model export to the model repository
!cp $LOCAL_EXPERIMENT_DIR/export/resnet50_fp32.engine models/defect_classification_model_fp32/1/model.plan
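The same layout can be produced from Python. The sketch below builds it in a temporary directory with a placeholder engine file so that it runs anywhere; `add_model_version` is a hypothetical helper, and in the notebook the source would be the exported `.engine` file and the destination the `models` directory.

```python
import shutil
import tempfile
from pathlib import Path

def add_model_version(repository, model_name, engine_file, version=1):
    """Copy an engine into <repository>/<model_name>/<version>/model.plan."""
    version_dir = Path(repository) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / "model.plan"
    shutil.copyfile(engine_file, target)
    return target

with tempfile.TemporaryDirectory() as tmp:
    engine = Path(tmp) / "resnet50_fp32.engine"
    engine.write_bytes(b"placeholder")  # stands in for a real TensorRT engine
    repo = Path(tmp) / "models"
    add_model_version(repo, "defect_classification_model_fp32", engine)
    # Collect the resulting tree relative to the repository root
    layout = sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*"))
    print("\n".join(layout))
```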

<a name='s3-e1'></a>
### Exercise #1 - Model Configuration ###
With our model directory set up, we now turn our attention to creating the configuration file for our model. A minimal model configuration must specify the name of the model, the `platform` and/or backend properties, the `max_batch_size` property, and the `input` and `output` tensors of the model (name, data type, and shape). We can get the `output` tensor name from the `nvinfer_config.txt` [file](tao_project/classification/export/nvinfer_config.txt) we generated before under `output-blob-names`. For more details on how to create model configuration files within Triton Inference Server, please see the [documentation](https://github.com/triton-inference-server/server/blob/r20.12/docs/model_configuration.md). 

**Instructions**:<br>
* Modify the `<FIXME>`s only and execute the cell to create the `config.pbtxt` file for the defect classification model. 

In [8]:
configuration = """
name: "defect_classification_model_fp32"
platform: "tensorrt_plan"
input: [
 {
    name: "input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output: {
    name: "predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 2, 1, 1 ]
  }
"""

with open('models/defect_classification_model_fp32/config.pbtxt', 'w') as file:
    file.write(configuration)
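Since the configuration files in this notebook differ only in a few fields, they can also be generated from a template. The sketch below is one possible way to do that; `render_config` is a hypothetical helper, and the field values mirror the ones used in this notebook. It writes `max_batch_size` explicitly, where 0 means that Triton does not batch requests for the model.

```python
CONFIG_TEMPLATE = """\
name: "{name}"
platform: "tensorrt_plan"
max_batch_size: {max_batch_size}
input: [
  {{
    name: "{input_name}"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ {input_dims} ]
  }}
]
output: [
  {{
    name: "{output_name}"
    data_type: TYPE_FP32
    dims: [ {output_dims} ]
  }}
]
"""

def render_config(name, input_dims, output_dims, max_batch_size=0,
                  input_name="input_1", output_name="predictions/Softmax"):
    """Fill the template with the dims as they should appear in config.pbtxt."""
    return CONFIG_TEMPLATE.format(
        name=name,
        max_batch_size=max_batch_size,
        input_name=input_name,
        output_name=output_name,
        input_dims=", ".join(str(d) for d in input_dims),
        output_dims=", ".join(str(d) for d in output_dims),
    )

config_text = render_config("defect_classification_model_fp32", (3, 224, 224), (2, 1, 1))
print(config_text)
```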


<a name='s3-3'></a>
## Run Inference on Triton Inference Server ##
With our model directory structures created, models defined and exported, and configuration files created, we will now wait for Triton Inference Server to load our models. We have set up this lab to use Triton Inference Server in **polling** mode. This means that Triton Inference Server will continuously poll for modifications to our models or for newly created models - once every 30 seconds. Please run the cell below to allow time for Triton Inference Server to poll for new models/modifications before proceeding. Due to the asynchronous nature of this step, we have added 15 seconds to be safe.

<a name='s3-3.1'></a>
### Server Health Status ###

In [9]:
# DO NOT CHANGE THIS CELL
!sleep 45

At this point, our models should be deployed and ready to use! To confirm Triton Inference Server is up and running, we can send a `curl` request to the below URL. The HTTP request returns status _200_ if Triton is ready and _non-200_ if it is not ready. We can also send a `curl` request to our model endpoints to confirm our models are deployed and ready to use. The response also includes information about our models, such as:
* The name of our model
* The versions available for our model
* The backend platform (e.g., tensorrt_plan, pytorch_libtorch, onnxruntime_onnx)
* The inputs and outputs, with their respective names, data types, and shapes

In [10]:
# DO NOT CHANGE THIS CELL
!curl -v triton:8000/v2/health/ready

*   Trying 172.18.0.3:8000...
* TCP_NODELAY set
* Connected to triton (172.18.0.3) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: triton:8000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host triton left intact


In [11]:
# DO NOT CHANGE THIS CELL
!curl -v triton:8000/v2/models/defect_classification_model_fp32

*   Trying 172.18.0.3:8000...
* TCP_NODELAY set
* Connected to triton (172.18.0.3) port 8000 (#0)
> GET /v2/models/defect_classification_model_fp32 HTTP/1.1
> Host: triton:8000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 232
< 
* Connection #0 to host triton left intact
{"name":"defect_classification_model_fp32","versions":["1"],"platform":"tensorrt_plan","inputs":[{"name":"input_1","datatype":"FP32","shape":[3,224,224]}],"outputs":[{"name":"predictions/Softmax","datatype":"FP32","shape":[2,1,1]}]}
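A fixed `sleep` works in this lab, but readiness can also be polled from Python. The sketch below is one possible approach using only the standard library; `http_ready` and `wait_until_ready` are hypothetical helpers, and the URL in the usage comment assumes the `triton:8000` host and model name used in the cells above.

```python
import time
import urllib.error
import urllib.request

def http_ready(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, timeout, or a non-2xx status
        return False

def wait_until_ready(probe, max_wait=60.0, interval=5.0, sleep=time.sleep):
    """Call `probe()` until it returns True or `max_wait` seconds have been slept."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval

# Usage in this lab (not executed here because it needs a running server):
# url = "http://triton:8000/v2/models/defect_classification_model_fp32/ready"
# wait_until_ready(lambda: http_ready(url))
```

Passing the probe as a callable keeps the waiting logic independent of the network call, which makes it easy to try out without a server.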

<a name='s3-3.2'></a>
### Prepare Data ###

In [12]:
# DO NOT CHANGE THIS CELL
capacitor_df=pd.read_csv('capacitor_df.csv', converters={'img_shape': pd.eval})
capacitor_df.head()

Unnamed: 0,true_defect,defect_img_path,date,board,comp_id,img_shape,defect_image_name,comp_type
0,notdefect,/dli/task/data/AOI_DL_data_0908/0423318026324/...,908,423318026324,C1090,"[54, 27, 3]",D0_C1090.jpg,C
1,notdefect,/dli/task/data/AOI_DL_data_0908/0423318026269/...,908,423318026269,C1090,"[54, 27, 3]",D1_C1090.jpg,C
2,notdefect,/dli/task/data/AOI_DL_data_0908/0423318026523/...,908,423318026523,C1090,"[54, 27, 3]",D1_C1090.jpg,C
3,notdefect,/dli/task/data/AOI_DL_data_0908/0423318026331/...,908,423318026331,C1090,"[54, 27, 3]",D1_C1090.jpg,C
4,notdefect,/dli/task/data/AOI_DL_data_0908/0423318026211/...,908,423318026211,C1090,"[53, 27, 3]",D1_C1090.jpg,C


<a name='s3-e2'></a>
### Exercise #2 - Pre-process Inputs ###
Triton itself does not do anything with your input tensors; it simply feeds them to the model, and the same is true for outputs. Ensuring that the pre-processing operations used for inference are identical to those used when the model was trained is key to achieving high accuracy. In our case, we need to perform mean subtraction and scaling to produce the final float planar data for the TensorRT engine. We can get the `offsets` and `net-scale-factor` from the `nvinfer_config.txt` [file](tao_project/classification/export/nvinfer_config.txt). The pre-processing function is:

<b>y = net-scale-factor * (x - mean)</b>

where: 
* **x** is the input pixel value. It is an unsigned 8-bit integer (uint8) with range [0,255]. 
* **mean** is the corresponding mean value, read either from the mean file or as offsets[c], where c is the channel to which the input pixel belongs, and offsets is the array specified in the configuration file. It is a float. 
* **net-scale-factor** is the pixel scaling factor specified in the configuration file. It is a float.
* **y** is the corresponding output pixel value. It is a float.

**Instructions**:<br>
* Execute the below cell to load one random **defect** sample. 
* Modify the `<FIXME>`s only and execute the cell below to pre-process the input image. 

In [13]:
# DO NOT CHANGE THIS CELL
sample_img_file=capacitor_df[capacitor_df['true_defect']=='defect'].sample(1)['defect_img_path'].values[0]

In [14]:
def preprocess_image(file_path): 
    image=Image.open(file_path).resize((224, 224))
    image_ary=np.asarray(image).astype(np.float32)

    image_ary[:, :, 0]=(image_ary[:, :, 0]-103.939)*1
    image_ary[:, :, 1]=(image_ary[:, :, 1]-116.779)*1
    image_ary[:, :, 2]=(image_ary[:, :, 2]-123.68)*1

    image_ary=np.transpose(image_ary, [2, 0, 1])
    return image_ary

sample_image_ary=preprocess_image(sample_img_file)
sample_image_ary.shape

(3, 224, 224)
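The per-channel assignments above can also be written with NumPy broadcasting, with the constants read from a configuration file rather than typed in by hand. The sketch below assumes an INI-style file with a `[property]` section; `SAMPLE_CONFIG` is a hypothetical excerpt whose values mirror the ones used in the cell above, and the file generated by the export step remains the source of truth. Whether the channels additionally need reordering (RGB vs. BGR) depends on how the model was trained; this sketch keeps the channel order unchanged, as the cell above does.

```python
import configparser
import numpy as np

# Hypothetical excerpt in the style of nvinfer_config.txt
SAMPLE_CONFIG = """
[property]
net-scale-factor=1.0
offsets=103.939;116.779;123.68
"""

def read_preprocessing(config_text):
    """Return (offsets, net_scale_factor) parsed from INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["property"]
    offsets = np.array([float(v) for v in section["offsets"].split(";")],
                       dtype=np.float32)
    return offsets, float(section["net-scale-factor"])

def normalize(image_hwc, offsets, net_scale_factor):
    """Apply y = net_scale_factor * (x - mean) per channel, then HWC -> CHW."""
    x = np.asarray(image_hwc, dtype=np.float32)
    y = net_scale_factor * (x - offsets)  # offsets broadcast over the last axis
    return np.transpose(y, (2, 0, 1))

offsets, scale = read_preprocessing(SAMPLE_CONFIG)
dummy = np.full((4, 4, 3), 128, dtype=np.uint8)  # stand-in for a resized image
out = normalize(dummy, offsets, scale)
print(out.shape, out[:, 0, 0])
```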


<a name='s3-3.3'></a>
### Send Inference Request to Server ###
With our models deployed, it is now time to send inference requests to our models. First, we'll load the `tritonclient.http` module. We will also define the input and output names of our model, the name of our model, the URL where our models are deployed with Triton Inference Server (in this case the host `triton:8000`), and our model version.

In [15]:
# DO NOT CHANGE THIS CELL
import tritonclient.http as tritonhttpclient

# set parameters
VERBOSE=False
input_name='input_1'
input_shape=(3, 224, 224)
input_dtype='FP32'
output_name='predictions/Softmax'
model_name='defect_classification_model_fp32'
url='triton:8000'
model_version='1'

In [16]:
# DO NOT CHANGE THIS CELL
# set output labels
with open(os.path.join(os.environ['LOCAL_EXPERIMENT_DIR'], 'export', 'labels.txt'), 'r') as f: 
    labels=f.readlines()
labels={v: k.strip() for v, k in enumerate(labels)}
labels

{0: 'defect', 1: 'notdefect'}

We'll instantiate our client `triton_client` using the `tritonhttpclient.InferenceServerClient` class, then access the model metadata with the `get_model_metadata()` method and the model configuration with the `get_model_config()` method.

In [17]:
# DO NOT CHANGE THIS CELL
triton_client=tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata=triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config=triton_client.get_model_config(model_name=model_name, model_version=model_version)

We'll instantiate a placeholder for our input data using the input name, shape, and data type expected. We'll set the data of the input to be the NumPy array representation of our image. We'll also instantiate a placeholder for our output data using just the output name. Lastly, we'll submit our input to the Triton Inference Server using the `triton_client.infer()` method, specifying our model name, model version, inputs, and outputs and convert our result to a NumPy array.

In [18]:
# DO NOT CHANGE THIS CELL
inference_input=tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
inference_input.set_data_from_numpy(sample_image_ary)

output=tritonhttpclient.InferRequestedOutput(output_name)
response=triton_client.infer(model_name, 
                             model_version=model_version, 
                             inputs=[inference_input], 
                             outputs=[output])
predictions=response.as_numpy(output_name)
predictions

array([[[0.91693795]],

       [[0.08306203]]], dtype=float32)
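The response has shape `(2, 1, 1)`, one softmax score per class. A minimal sketch of turning it into a label and a confidence value is shown below; `decode` is a hypothetical helper, and the label mapping and scores are copied from the outputs above.

```python
import numpy as np

labels = {0: "defect", 1: "notdefect"}

def decode(predictions, labels):
    """Return (label, confidence) for one softmax output of any shape."""
    scores = np.asarray(predictions, dtype=np.float32).reshape(-1)
    best = int(np.argmax(scores))
    return labels[best], float(scores[best])

# Values copied from the response shown above
predictions = np.array([[[0.91693795]], [[0.08306203]]], dtype=np.float32)
label, confidence = decode(predictions, labels)
print(label, round(confidence, 4))
```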

We can iterate through our manifest to see how quickly Triton is able to perform inference. 

In [19]:
# DO NOT CHANGE THIS CELL
time_list=[]

for idx, row in capacitor_df.iterrows(): 
    image_ary=preprocess_image(row['defect_img_path'])
    inference_input.set_data_from_numpy(image_ary)
    # time the process
    start=time.time()
    response=triton_client.infer(model_name, 
                                 model_version=model_version, 
                                 inputs=[inference_input], 
                                 outputs=[output])
    time_list.append(time.time()-start)
    predictions=response.as_numpy(output_name)
    capacitor_df.loc[idx, 'prediction']=labels[np.argmax(predictions)].strip()

print('It took {} seconds to infer {} images.'.format(round(sum(time_list), 2), len(capacitor_df)))

It took 13.33 seconds to infer 1903 images.
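The total above corresponds to roughly 7 ms per request (13.33 s / 1903). A total hides how the latency is distributed, so it can be useful to look at a few summary statistics as well. The sketch below is one way to do that; `summarize_latencies` is a hypothetical helper and the sample values are made up for illustration. In the notebook, `time_list` would be passed instead. Note that `time_list` covers only the `infer()` call, not the image pre-processing.

```python
import statistics

def summarize_latencies(latencies):
    """Summarize per-request latencies given in seconds."""
    ordered = sorted(latencies)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # nearest-rank 95th percentile: ceil(0.95 * n)
    return {
        "count": n,
        "mean_ms": 1000 * statistics.mean(ordered),
        "median_ms": 1000 * statistics.median(ordered),
        "p95_ms": 1000 * ordered[rank - 1],
        "throughput_per_s": n / sum(ordered),
    }

# Made-up latencies for illustration
example = [0.006, 0.007, 0.007, 0.008, 0.030]
summary = summarize_latencies(example)
print(summary)
```

In this made-up sample a single slow request pulls the mean well above the median, which is why both are reported.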


<a name='s3-3.4'></a>
### Measure Performance ###

In [20]:
# DO NOT CHANGE THIS CELL
confusion_df=pd.crosstab(capacitor_df['true_defect'], capacitor_df['prediction'])
confusion_df.head()

prediction,defect,notdefect
true_defect,Unnamed: 1_level_1,Unnamed: 2_level_1
defect,96,3
notdefect,12,1792
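The confusion matrix can be condensed into a few standard metrics. The sketch below treats `defect` as the positive class and uses the counts shown above; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics with 'defect' as the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # share of true defects that were caught
        "precision": tp / (tp + fp),     # share of 'defect' predictions that were right
        "false_positive_rate": fp / (fp + tn),
    }

# Counts read from the confusion matrix above (rows: truth, columns: prediction)
metrics = binary_metrics(tp=96, fn=3, fp=12, tn=1792)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For an inspection task, recall on the `defect` class is often the number to watch, since a missed defect is usually more costly than a false alarm.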


<a name='s3-4'></a>
## Run Batch Inference ##

In [21]:
# DO NOT CHANGE THIS CELL
# create directory for model
!mkdir -p models/defect_classification_batch_model/1

# copy resnet-50 engine to the model repository
!cp $LOCAL_EXPERIMENT_DIR/export/resnet50_fp32.engine models/defect_classification_batch_model/1/model.plan

In [22]:
# DO NOT CHANGE THIS CELL
configuration = """
name: "defect_classification_batch_model"
platform: "tensorrt_plan"
max_batch_size: 16
input: [
 {
    name: "input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output: {
    name: "predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 2, 1, 1 ]
  }
"""

with open('models/defect_classification_batch_model/config.pbtxt', 'w') as file:
    file.write(configuration)

In [23]:
# DO NOT CHANGE THIS CELL
!sleep 45

In [24]:
# DO NOT CHANGE THIS CELL
!curl -v triton:8000/v2/models/defect_classification_batch_model

*   Trying 172.18.0.3:8000...
* TCP_NODELAY set
* Connected to triton (172.18.0.3) port 8000 (#0)
> GET /v2/models/defect_classification_batch_model HTTP/1.1
> Host: triton:8000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 239
< 
* Connection #0 to host triton left intact
{"name":"defect_classification_batch_model","versions":["1"],"platform":"tensorrt_plan","inputs":[{"name":"input_1","datatype":"FP32","shape":[-1,3,224,224]}],"outputs":[{"name":"predictions/Softmax","datatype":"FP32","shape":[-1,2,1,1]}]}

In [25]:
# DO NOT CHANGE THIS CELL
# set parameters
VERBOSE=False
input_name='input_1'
input_shape=(16, 3, 224, 224)
input_dtype='FP32'
output_name='predictions/Softmax'
model_name='defect_classification_batch_model'
url='triton:8000'
model_version='1'

In [26]:
# DO NOT CHANGE THIS CELL
triton_client=tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata=triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config=triton_client.get_model_config(model_name=model_name, model_version=model_version)

We can iterate through our manifest to see how quickly Triton is able to perform inference. 

In [27]:
# DO NOT CHANGE THIS CELL
output=tritonhttpclient.InferRequestedOutput(output_name)

batch_size=16
batch_ary=np.empty((batch_size, 3, 224, 224), dtype=np.float32)
images_list=[]
time_list=[]

def infer_batch(batch, indices): 
    # build an input whose shape matches this (possibly partial) batch
    inference_input=tritonhttpclient.InferInput(input_name, batch.shape, input_dtype)
    inference_input.set_data_from_numpy(batch)
    # time only the request itself
    start=time.time()
    response=triton_client.infer(model_name, 
                                 model_version=model_version, 
                                 inputs=[inference_input], 
                                 outputs=[output])
    time_list.append(time.time()-start)
    predictions=response.as_numpy(output_name)
    capacitor_df.loc[indices, 'prediction']=[*map(labels.get, np.argmax(predictions, axis=1).flatten())]

for idx, row in capacitor_df.iterrows(): 
    batch_ary[len(images_list)]=preprocess_image(row['defect_img_path'])
    images_list.append(idx)
    if len(images_list)==batch_size: 
        infer_batch(batch_ary, images_list)
        images_list=[]

# send the final partial batch, if any (1903 is not a multiple of 16)
if images_list: 
    infer_batch(batch_ary[:len(images_list)], images_list)

print('It took {} seconds to infer {} images.'.format(round(sum(time_list), 2), len(capacitor_df)))
print('On average it took {} seconds per inference.'.format(round(sum(time_list)/len(capacitor_df), 4)))

It took 9.38 seconds to infer 1903 images.
On average it took 0.005 seconds per inference.
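Note that 1903 is not a multiple of 16, so a manifest of this size splits into 118 full batches and one final batch of 15 images. The final, shorter batch has to be sent as well for every row to receive a fresh prediction. A minimal sketch of the splitting logic, with a hypothetical `iter_batches` helper and a list of integers standing in for the manifest rows:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

indices = list(range(1903))  # stand-in for the rows of the manifest
batches = list(iter_batches(indices, 16))
print(len(batches), len(batches[-1]))
```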


In [28]:
# DO NOT CHANGE THIS CELL
confusion_df=pd.crosstab(capacitor_df['true_defect'], capacitor_df['prediction'])
confusion_df.head()

prediction,defect,notdefect
true_defect,Unnamed: 1_level_1,Unnamed: 2_level_1
defect,96,3
notdefect,12,1792


<a name='s3-5'></a>
## Run FP16 Inference ##

In [29]:
# DO NOT CHANGE THIS CELL
# show trained model
!ls -ltrh $LOCAL_EXPERIMENT_DIR/resnet50/weights

total 291M
-rw-rw-rw- 1 root root 291M Jan 27  2023 resnet_025.tlt


In [None]:
# DO NOT CHANGE THIS CELL
# export model and TensorRT engine
!tao classification export -m $TAO_EXPERIMENT_DIR/resnet50/weights/resnet_025.tlt \
                           -o $TAO_EXPERIMENT_DIR/export/resnet50_fp16.etlt \
                           -k $KEY \
                           --data_type fp16 \
                           --engine_file $TAO_EXPERIMENT_DIR/export/resnet50_fp16.engine \
                           --classmap_json $TAO_EXPERIMENT_DIR/resnet50/classmap.json \
                           --gen_ds_config

2024-05-21 06:31:13,213 [INFO] root: Registry: ['nvcr.io']
2024-05-21 06:31:13,390 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Using TensorFlow backend.
Using TensorFlow backend.
2024-05-21 06:31:20,682 [INFO] root: Building exporter object.
2024-05-21 06:31:28,071 [INFO] root: Exporting the model.
2024-05-21 06:31:28,071 [INFO] root: Using input nodes: ['input_1']
2024-05-21 06:31:28,071 [INFO] root: Using output nodes: ['predictions/Softmax']
2024-05-21 06:31:28,071 [INFO] iva.common.export.keras_exporter: Using input nodes: ['input_1']
2024-05-21 06:31:28,071 [INFO] iva.common.export.keras_exporter: Using output nodes: ['predictions/Softmax']
NOTE: UFF has been tested with TensorFlow 1.14.0.
DEBUG: convert reshape to flatten node
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['predictions/Softmax'] as outputs
2024-05-21 06:32:40,141 [INF

In [None]:
# DO NOT CHANGE THIS CELL
# create directory for model
!mkdir -p models/defect_classification_model_fp16/1

# copy resnet-50 engine to the model repository
!cp $LOCAL_EXPERIMENT_DIR/export/resnet50_fp16.engine models/defect_classification_model_fp16/1/model.plan

<p><img src='images/important.png' width=720></p>
We'll also create a configuration file for the TensorRT FP16 model. Note that our input and output data types remain FP32: the internal layers and activations of our neural network will use the FP16 data type, but our input and output data will still be in FP32.
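To get a feel for what FP16 means numerically, the sketch below rounds a few values to IEEE 754 half precision using only the standard library. It illustrates the number format, which keeps roughly three significant decimal digits, and is not a model of how TensorRT selects precision for individual layers. `to_fp16_and_back` is a hypothetical helper, and the sample values are taken from numbers that appear elsewhere in this notebook.

```python
import struct

def to_fp16_and_back(value):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

for value in (0.91693795, 0.08306203, 103.939):
    rounded = to_fp16_and_back(value)
    print(f"{value!r} -> {rounded!r} (abs. error {abs(value - rounded):.2e})")
```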

In [None]:
# DO NOT CHANGE THIS CELL
configuration = """
name: "defect_classification_model_fp16"
platform: "tensorrt_plan"
input: [
 {
    name: "input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output: {
    name: "predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 2, 1, 1 ]
  }
"""

with open('models/defect_classification_model_fp16/config.pbtxt', 'w') as file:
    file.write(configuration)

In [None]:
# DO NOT CHANGE THIS CELL
!sleep 45

In [None]:
# DO NOT CHANGE THIS CELL
!curl -v triton:8000/v2/models/defect_classification_model_fp16

In [None]:
# DO NOT CHANGE THIS CELL
# set parameters
VERBOSE=False
input_name='input_1'
input_shape=(3, 224, 224)
input_dtype='FP32'
output_name='predictions/Softmax'
model_name='defect_classification_model_fp16'
url='triton:8000'
model_version='1'

In [None]:
# DO NOT CHANGE THIS CELL
triton_client=tritonhttpclient.InferenceServerClient(url=url, verbose=VERBOSE)
model_metadata=triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
model_config=triton_client.get_model_config(model_name=model_name, model_version=model_version)

We can iterate through our manifest to see how quickly Triton is able to perform inference. 

In [None]:
# DO NOT CHANGE THIS CELL
inference_input=tritonhttpclient.InferInput(input_name, input_shape, input_dtype)
output=tritonhttpclient.InferRequestedOutput(output_name)

time_list=[]

for idx, row in capacitor_df.iterrows(): 
    image_ary=preprocess_image(row['defect_img_path'])
    inference_input.set_data_from_numpy(image_ary)
    # time the process
    start=time.time()
    response=triton_client.infer(model_name, 
                                 model_version=model_version, 
                                 inputs=[inference_input], 
                                 outputs=[output])
    time_list.append(time.time()-start)
    predictions=response.as_numpy(output_name)
    capacitor_df.loc[idx, 'prediction']=labels[np.argmax(predictions)].strip()

print('It took {} seconds to infer {} images.'.format(round(sum(time_list), 2), len(capacitor_df)))

In [None]:
# DO NOT CHANGE THIS CELL
confusion_df=pd.crosstab(capacitor_df['true_defect'], capacitor_df['prediction'])
confusion_df.head()

<a name='s3-6'></a>
## Conclusion ##
Automating the inspection process with highly accurate, fast, and easy-to-use systems helps save time, reduce costs, and improve yields. For manufacturing use cases, AI can deliver advantages for both vendors and users of automated optical inspection equipment: 
* Algorithm development is simplified using deep-learning based computer vision - unlike traditional rules-based algorithms that require defining every product and acceptance criteria, using deep-learning can reduce time-to-market for new equipment and ongoing software-support costs. 
* Better performance - AI enhanced automated optical inspection can deliver greater accuracy and reliability and a lower false-positive rate than traditional systems. 
* Greater flexibility - deep-learning algorithms can be quickly trained to perform new tasks. 

**Well Done!** When you're finished, please complete the assessment. 
