<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 3.0 Server Performance

In this notebook, you'll implement the optimization techniques you've learned, and profile the resulting model in a more formal way.

**[3.1 Assessing the impact of Optimizations](#3.1-Assessing-the-impact-of-Optimizations)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [3.1.1 Exercise: Profile the Model](#3.1.1-Exercise:-Profile-the-Model)<br>
**[3.2 Monitoring and Responding to Performance Fluctuations](#3.2-Monitoring-and-Responding-to-Performance-Fluctuations)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [3.2.1 Viewing Prometheus Metrics](#3.2.1-Viewing-Prometheus-Metrics)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [3.2.2 Interpreting the Metrics](#3.2.2-Interpreting-the-Metrics)<br>

We'll analyze the impact of our configuration changes and how the request pattern affects our inference capability. We will generate structured reports that compare a TorchScript-based model with no advanced Triton features enabled against a TensorRT ONNX model with the key features you've learned enabled.

We will not only focus on the basic metrics that we have analyzed in the previous parts of the class (throughput and latency), but also try to understand which factors affect the latency of our solution (e.g. network communication).

Finally, we will look at the tools that can be used to monitor and manage the performance of our solution in production, and look at how they can be used to implement more advanced functionality like auto-scaling.

# 3.1 Assessing the impact of Optimizations
The performance tool that we've been using has an additional feature: not only does it display the results on the screen, it also saves the data in a tabular format to the following location: 

<code>"./results/${MODEL_NAME}/results${RESULTS_ID}_${TIMESTAMP}.csv"</code>

To assess the impact of the various optimizations, let's take advantage of the previously generated log files.
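Because every run of the performance tool writes a new timestamped file, it can help to locate the most recent CSV for a model programmatically. The following is a minimal sketch, assuming the `./results/<model name>/` layout shown above; the helper name `latest_results_csv` is hypothetical and not part of the course utilities.

```python
from pathlib import Path
from typing import Optional


def latest_results_csv(model_name: str, results_root: str = "./results") -> Optional[Path]:
    """Return the most recently modified CSV for a model, or None if none exists."""
    model_dir = Path(results_root) / model_name
    if not model_dir.is_dir():
        return None
    csv_files = list(model_dir.glob("*.csv"))
    if not csv_files:
        return None
    # File names embed a timestamp, but modification time is used here so the
    # sketch does not depend on the exact timestamp format (an assumption).
    return max(csv_files, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    for name in ("bertQA-torchscript", "bertQA-onnx-trt-dynbatch"):
        print(name, "->", latest_results_csv(name))
```

If a model has not been profiled yet, the function returns `None` rather than raising an error.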

## 3.1.1 Exercise: Profile the Model
We executed <code>bertQA-torchscript</code> as well as <code>bertQA-onnx-trt-dynbatch</code> earlier, so the logs from those runs should already be saved. Let's look at the content of the corresponding log folders. If you have executed the performance tool more than once, you might see multiple log files with different timestamps.

In [None]:
!ls ./results/bertQA-torchscript/
!ls ./results/bertQA-onnx-trt-dynbatch

Please download both of the CSV files (browse in the left pane and right-click to find "download"). To generate the execution reports, follow the steps below to import the log file for the <code>bertQA-onnx-trt-dynbatch</code> model:

<!-- - Open [this spreadsheet](Triton%20Inference%20Server%20Performance%20Results.xlsx) -->
- Open <a href="https://docs.google.com/spreadsheets/d/1S8h0bWBBElHUoLd2SOvQPzZzRiQ55xjyqodm_9ireiw/edit#gid=1572240508">this spreadsheet</a>
- Make a copy from the File menu "Make a copy…"
- Open the copy
- Select the A1 cell on the "Raw Data" tab
- From the File menu select "Import…"
- Select "Upload" and upload the file
- Select "Replace data at selected cell" and then select the "Import data" button

Once you have completed the above steps you should be presented with the following plots in the "Components of Latency" tab and "Latency vs. Throughput" tab, respectively: <br/>
<img width=600 src="images/ComponentsOfLatency1.png"/> <img width=600 src="images/LatencyVsThrughput1.png"/> <br/>

Please repeat the above for the <code>bertQA-torchscript</code> model. (Remember that the TorchScript variant was executed at a batch size of 8.) <br>
How do those compare? Discuss with the instructor.

Images of the analysis for the `bertQA-torchscript` model can also be found <a href="images/torchscript_latency1.png">here</a> and <a href="images/torchscript_latency2.png">here</a>.
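If you prefer to inspect the components of latency directly in the notebook, the breakdown can also be computed from the CSV in Python. This is only a sketch: the column names below are assumptions about the CSV header written by the performance tool, so check them against the first row of your own file and adjust as needed. The helper names are hypothetical.

```python
import csv
from typing import Dict, List

# Assumed column names; verify them against the header row of your CSV.
COMPONENT_COLUMNS = [
    "Client Send",
    "Network+Server Send/Recv",
    "Server Queue",
    "Server Compute Input",
    "Server Compute Infer",
    "Server Compute Output",
    "Client Recv",
]


def load_results(csv_path: str) -> List[Dict[str, str]]:
    """Read a results CSV into a list of rows keyed by column name."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def latency_shares(row: Dict[str, str],
                   components: List[str] = COMPONENT_COLUMNS) -> Dict[str, float]:
    """Return each component's fraction of the summed component latency."""
    values = {name: float(row[name]) for name in components}
    total = sum(values.values())
    if total == 0:
        return {name: 0.0 for name in components}
    return {name: value / total for name, value in values.items()}
```

Calling `latency_shares` on each row returned by `load_results` shows, for every measured load level, which stage dominates the end-to-end latency, for example network communication versus server compute.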

# 3.2 Monitoring and Responding to Performance Fluctuations

Understanding the performance of your inference server is critical not only at the initial planning stage but throughout the lifetime of the application. The ability to capture metrics describing server performance is central to responding to issues, and it is also the foundation of more advanced features like automatic scaling. The diagram below shows a simplified view of the Triton deployment architecture. By combining Triton with technologies like [Kubernetes](https://kubernetes.io/docs/home/), you can, with relative ease, create a configuration that automatically scales with increased demand within your data center or, if necessary, bursts the excess workload to one or more clouds. <br/>

<img width=700 src="images/DeploymentArchitecture.png"/>

## 3.2.1 Viewing Prometheus Metrics
Triton exposes [Prometheus](https://prometheus.io/) performance metrics for monitoring on port 8002 by default. These include metrics on GPU power usage, GPU memory, request counts, and latency measures.  More documentation on individual metrics can be found in the <a href="https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/metrics.html">Triton Metrics documentation</a>. For now, let's query the metrics captured throughout our performance runs:

In [None]:
# Set the server hostname and check it - you should get a message that "Triton Server is ready!"
tritonServerHostName = "triton"
!./utilities/wait_for_triton_server.sh {tritonServerHostName}

In [None]:
# Use a curl command to request the metrics
prometheus_url = tritonServerHostName + ":8002/metrics"
!curl -v {prometheus_url}

## 3.2.2 Interpreting the Metrics
The Prometheus metrics output is a list of metrics, each provided in the following form:

```
# HELP <metric_name and description>
# TYPE <metric_name and type>
metric_name{gpu_uuid="GPU-xxxxxx",...} <data>
```

For example, if the inference server hosts two models, the list should include some metrics that are specific to each model and others that describe the GPU they share.<br>

#### Count Example
The following example indicates that the inference count for the `bertQA-onnx-trt-dynbatch` model is 10,105 so far, while the inference count for the `bertQA-torchscript` model is 717.<br>What do your results show?
```
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9",model="bertQA-onnx-trt-dynbatch",version="1"} 10105.000000
nv_inference_count{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9",model="bertQA-torchscript",version="1"} 717.000000
```

#### GPU Power Example
The following example indicates that current GPU power usage is about 40 watts.<br>What do your results show?
```
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9"} 39.958000
```
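
The same output can also be read programmatically. Below is a minimal sketch of a parser for the line format described above, applied to the two example snippets. It is a simplification written for illustration, not an official Prometheus or Triton client: it skips comment lines and does not handle every corner of the text format. The function name `parse_metrics` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric_name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

SAMPLE = '''\
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9",model="bertQA-onnx-trt-dynbatch",version="1"} 10105.000000
nv_inference_count{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9",model="bertQA-torchscript",version="1"} 717.000000
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9"} 39.958000
'''


def parse_metrics(text: str) -> List[Sample]:
    """Parse metrics text into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    counts = {
        labels["model"]: value
        for name, labels, value in parse_metrics(SAMPLE)
        if name == "nv_inference_count"
    }
    print(counts)
```

To run this against your own server instead of the sample text, you could pass in the body returned by the metrics endpoint queried earlier (for instance, fetched with `urllib.request.urlopen`), assuming the server is reachable from the notebook.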

#### What Do Your Results Indicate?

* Can you identify the current utilization rate? 
* Why is it zero? 
* How much memory are we using? 
* Why do you think we are using the GPU memory even though there are no requests executed against our server? 

Discuss with the instructor.

<h3 style="color:green;">Congratulations!</h3><br>
You've successfully configured optimizations and learned how to profile the model.<br>

Please move to the last part of the class to learn how to build custom applications that take advantage of Triton features:<br>
[4.0 Using the Model](040_UsingTheModel.ipynb)
