## Lab - TensorRT + Profiling - Profiling
## E6692 Spring 2022

In this part you will write two Python scripts: one that runs inference with the PyTorch YOLOv4-Tiny model (`pytorch_inference.py`) and one that runs inference with the TensorRT YOLOv4-Tiny model (`trt_inference.py`). Use the following guidelines when writing these scripts:

* Model weights and configuration file paths should be passed as command line arguments. Use `sys.argv` to manage the command line arguments.
* Use the OpenCV function `cv2.VideoCapture()` to read frames from the original video and `cv2.VideoWriter()` to write frames to the output file.
* Measure the inference speed of the model and the end-to-end speed of the script, including **reading/frame preprocess/inference/postprocess/frame write**, with the `time` module. You're welcome to do more in-depth timing, but only end-to-end and inference timing are required. Record the measurements by populating the table below.
* Generate a detected version of the 1st floor intersection video **test-lowres.mp4**. The output video names should be **test-lowres-pytorch-detected.mp4** and **test-lowres-tensorrt-detected.mp4**, respectively.
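
The `sys.argv` guideline above could be handled along these lines. This is only a sketch: the flag names `-w`, `-c`, and `-t` follow the commands listed later in this document, while the helper name `parse_flags` and the placeholder file names are hypothetical, and the sketch assumes every flag is followed by exactly one value.

```python
def parse_flags(argv, known_flags):
    """Return {flag: value} for `-x value` pairs in argv (hypothetical helper)."""
    options = {}
    i = 1  # argv[0] is the script name
    while i < len(argv):
        flag = argv[i]
        if flag not in known_flags:
            raise SystemExit(f"unknown argument: {flag}")
        if i + 1 >= len(argv):
            raise SystemExit(f"missing value for {flag}")
        options[flag] = argv[i + 1]
        i += 2
    return options


# In a real script this list would be sys.argv; the file names are placeholders.
example_argv = ["pytorch_inference.py", "-w", "model.weights", "-c", "model.cfg"]
options = parse_flags(example_argv, known_flags={"-w", "-c", "-t"})
print(options)
```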

| Model Type | Model Input Size | Inference Speed (FPS) | End-to-end speed (FPS) |
| --- | --- | --- | --- |
| PyTorch | (960,540,3) | 1.09 | 0.92 |
| TensorRT | (960,540,3) | 10.45 | 4.99 |
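
One way to collect both numbers in the table is to time each stage separately and the whole loop once. The sketch below uses the `time` module with stand-in functions in place of the real OpenCV and model calls; `StageTimer`, `fps`, and the stage functions are hypothetical names introduced only for illustration.

```python
import time


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.totals = {}

    def measure(self, stage, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        self.totals[stage] = self.totals.get(stage, 0.0) + elapsed
        return result


def fps(num_frames, seconds):
    """Frames per second; returns inf if no measurable time elapsed."""
    return num_frames / seconds if seconds > 0 else float("inf")


# Stand-ins for the real stages (frame read, preprocess, model call, postprocess + write).
def read_frame(i):
    return [i] * 4


def preprocess(frame):
    return [x / 255.0 for x in frame]


def infer(tensor):
    time.sleep(0.002)  # pretend the model takes about 2 ms
    return sum(tensor)


def postprocess_and_write(frame, output):
    return (frame, output)


timer = StageTimer()
num_frames = 5
loop_start = time.perf_counter()
for i in range(num_frames):
    frame = timer.measure("read", read_frame, i)
    tensor = timer.measure("preprocess", preprocess, frame)
    output = timer.measure("inference", infer, tensor)
    timer.measure("postprocess+write", postprocess_and_write, frame, output)
end_to_end_seconds = time.perf_counter() - loop_start

inference_fps = fps(num_frames, timer.totals["inference"])
end_to_end_fps = fps(num_frames, end_to_end_seconds)
print(f"inference: {inference_fps:.2f} FPS, end-to-end: {end_to_end_fps:.2f} FPS")
```

With PyTorch on the GPU, kernels are launched asynchronously, so calling `torch.cuda.synchronize()` before reading the clock is usually needed for the inference figure to reflect the actual GPU work.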

After you've written the video detection scripts and visually inspected the output for correctness, the next step is to perform CUDA profiling to give some insights into how each program is performing. For the lab we will use the `nvprof` command line profiling tool. Go through the [user guide](https://docs.nvidia.com/cuda/profiler-users-guide/index.html) to familiarize yourself with `nvprof`.

Profiling tools give insights into specific metrics pertaining to memory usage, computational bottlenecks, and power consumption. 

**TODO:** Enter the command `nvprof --query-metrics` to list metrics available for profiling. Choose three that you think could be useful for our use case and describe what they indicate about the program.

A useful feature for identifying where a program could be further optimized is the [dependency analysis](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#dependency-analysis) tool. Briefly explain what the dependency analysis tool does.


**TODO:** Your answer here. 

The dependency analysis examines the timeline of API calls and CUDA kernels on the CPU and GPU. It reports two key quantities: <strong>critical path</strong> and <strong>waiting time</strong>.

<strong>Critical path</strong> denotes the longest path through an event graph that does not contain wait states. Optimizing activities on this path can therefore directly improve the execution time.

<strong>Waiting time</strong> denotes the duration for which an activity is blocked waiting on an event in another thread or stream. Waiting time is an indicator of load imbalance between execution streams.

This shows whether the CPU or the GPU sits idle waiting for results while the other is busy, which adds waiting time. We should focus on the functions that account for a large share of the critical path or the waiting time, overlap work to hide the waiting time, and look for ways to shorten the critical path.
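
The idea of a critical path can be illustrated with a toy dependency graph. The sketch below is not how `nvprof` builds its event graph; the activity names and durations are invented, and it only shows that the longest dependency chain determines the total time, so shortening an activity off that chain changes nothing.

```python
from functools import lru_cache

# Hypothetical activities: name -> (duration in ms, activities it depends on)
activities = {
    "read_frame": (5.0, []),
    "preprocess": (3.0, ["read_frame"]),
    "memcpy_h2d": (1.0, ["preprocess"]),
    "kernel": (9.0, ["memcpy_h2d"]),
    "cpu_logging": (2.0, ["read_frame"]),
    "memcpy_d2h": (1.0, ["kernel"]),
    "write_frame": (4.0, ["memcpy_d2h", "cpu_logging"]),
}


@lru_cache(maxsize=None)
def longest_path_to(name):
    """Return (total duration, path) of the longest dependency chain ending at `name`."""
    duration, deps = activities[name]
    if not deps:
        return duration, (name,)
    best_time, best_path = max(longest_path_to(dep) for dep in deps)
    return best_time + duration, best_path + (name,)


critical_time, critical_path = max(longest_path_to(name) for name in activities)
print(f"critical path ({critical_time} ms): {' -> '.join(critical_path)}")
```

In this toy graph `cpu_logging` finishes long before `memcpy_d2h`, so it is not on the critical path, and speeding it up would not shorten the run.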

Next, you will profile your scripts `pytorch_inference.py` and `trt_inference.py`. To profile from the command line enter `nvprof <profiling_options> python3 <script_options>`. You should specify `--unified-memory-profiling off` to disable unified memory profiling (not supported for Jetson Nano) and `--dependency-analysis` to generate the dependency analysis report. Output the profiling results to text files `profiling_torch_log.txt` and `profiling_trt_log.txt` by including `--log-file <txt_file_path>` in the profiling options. 
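
The options described above can be assembled into a single invocation. The sketch below builds the argument list in Python and prints it; `build_nvprof_command` is a hypothetical helper, and the engine path is the one used in the commands listed later in this document.

```python
import shlex


def build_nvprof_command(log_file, script, script_args):
    """Assemble the nvprof invocation as an argument list (hypothetical helper)."""
    return [
        "nvprof",
        "--unified-memory-profiling", "off",
        "--dependency-analysis",
        "--log-file", log_file,
        "python3", script, *script_args,
    ]


cmd = build_nvprof_command(
    "profiling_trt_log.txt",
    "trt_inference.py",
    ["-t", "./engines/yolov4-tiny-person-vehicle.trt"],
)
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)
# To run it on the device, pass the list to subprocess.run(cmd, check=True).
```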

**TODO:** Profile `pytorch_inference.py` and `trt_inference.py` to the specifications outlined above.

## Discussion

### Provide commentary on the results of the inference speed and the end-to-end speed measurements for the two detection scripts.

**TODO:** Your answer here.

TensorRT gives a significant speedup: inference is almost 10x faster (10.45 vs. 1.09 FPS) and end-to-end processing is about 5x faster (4.99 vs. 0.92 FPS). For the TensorRT model, inference accounts for only about half of the per-frame time (roughly 96 ms of 200 ms), so reading, resizing, and writing frames have a significant effect on the end-to-end speed.
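
The speedups can be checked directly from the table. The sketch below assumes that the inference FPS was computed from inference time alone, so that inverting each FPS value gives a comparable per-frame time; the variable names are illustrative.

```python
# FPS values copied from the measurement table in this document.
fps_table = {
    "PyTorch": {"inference": 1.09, "end_to_end": 0.92},
    "TensorRT": {"inference": 10.45, "end_to_end": 4.99},
}

# Speedup of TensorRT over PyTorch for each measurement.
speedup = {
    stage: fps_table["TensorRT"][stage] / fps_table["PyTorch"][stage]
    for stage in ("inference", "end_to_end")
}
for stage, value in speedup.items():
    print(f"{stage} speedup: {value:.2f}x")

# Share of the per-frame time spent in inference for each model.
inference_share = {}
for model, row in fps_table.items():
    infer_ms = 1000.0 / row["inference"]
    total_ms = 1000.0 / row["end_to_end"]
    inference_share[model] = infer_ms / total_ms
    print(f"{model}: inference {infer_ms:.1f} ms of {total_ms:.1f} ms per frame "
          f"({100 * inference_share[model]:.0f}%)")
```

Under that assumption, inference is about 84% of the per-frame time for PyTorch but only about 48% for TensorRT, which is why frame I/O and resizing matter more for the TensorRT script.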

### Identify some differences between the TensorRT and the PyTorch script profile output.

**TODO:** Your answer here.

In addition to the Profiling result and Dependency Analysis sections, the TensorRT profile has an "NVTX result" section, which gives a detailed profile of the ranges recorded through the NVIDIA® Tools Extension SDK (NVTX). In this section, each TensorRT layer is listed under the TensorRT domain, so we can look at this domain directly to see how the library performs.

There is no standalone ReLU layer in the TensorRT model, which indicates that TensorRT has fused the convolution layers with the ReLU activation function. The fused layer also takes less time than the convolution layer in PyTorch.





### What, if anything, does the dependency analysis indicate can be optimized in each of the detection scripts?

**TODO:** Your answer here.

Picked functions:

<center><strong>TensorRT Model:</strong></center>

|Critical path(%)|  Critical path|  Waiting time|  Name|
| ---|---|---|---|
|37.41%|    197.397023s|           0ns|  cudaMalloc|
|33.19%|    175.129550s|           0ns|  cuCtxDetach|
|14.01%|     73.946650s|           0ns|  \<Other\>|
|7.82%|     41.253897s|           0ns|  cudaStreamCreateWithFlags_v5000|
|4.06%|     21.420142s|           0ns|  cudaFree|
|1.19%|      6.273914s|           0ns|  trt_maxwell_fp16x2_hcudnn_winograd_fp16x2_128x128_ldg1_ldg4_relu_tile148m_nt_v1|
|1.09%|      5.726921s|           0ns|  cudaMemGetInfo|
|0.01%|    48.155303ms|   16.082453ms|  cudaStreamSynchronize|
|0.00%|    29.532000us|     8.653162s|  cuStreamSynchronize|


<center><strong>PyTorch Model:</strong></center>

|Critical path(%)|  Critical path|  Waiting time|  Name|
| ---|---|---|---|
|41.29%|    223.927519s|           0ns|  cudaMalloc|
|17.94%|     97.284251s|           0ns|  \<Other\>|
|14.51%|     78.691248s|           0ns|  cudaLaunchKernel_v7000|
|11.59%|     62.867639s|           0ns|  cudaStreamCreateWithFlags_v5000|
|10.43%|     56.589771s|           0ns|  cuModuleUnload|
|2.19%|     11.884091s|           0ns|  maxwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt_v0|
|0.33%|      1.776917s|           0ns|  maxwell_scudnn_128x64_relu_medium_nn_v1|
|0.12%|   645.788588ms|   53.231000us|  cudaMemcpyAsync|
|0.01%|    48.755587ms|    12.524525s|  cudaStreamSynchronize|

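For both models, device memory allocation (`cudaMalloc`) takes the largest share of the critical path, and stream creation and context or module teardown (`cuCtxDetach`, `cuModuleUnload`) are also large. These costs are not part of the inference time; they are overhead paid each time a model is launched. The PyTorch profile additionally spends 14.51% of its critical path in `cudaLaunchKernel`.

Synchronization calls account for almost all of the waiting time (8.65 s in `cuStreamSynchronize` for TensorRT and 12.52 s in `cudaStreamSynchronize` for PyTorch), but they take a relatively small portion of the critical path.

The TensorRT kernel that fuses the convolution with ReLU takes less time on the critical path than its PyTorch counterpart (6.27 s vs. 11.88 s for the Winograd convolution kernels), which is a great improvement.

I am not sure what \<Other\> means; my guess is that it covers time not spent in NVIDIA functions. Everything is managed by either TensorRT or PyTorch, so without further investigation of the individual functions and memory allocations it is hard to improve the running time further.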
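
Because the tables above mix seconds, milliseconds, microseconds, and nanoseconds, it helps to normalize durations before comparing rows. The sketch below assumes the only unit suffixes are the four that appear in these tables; `to_seconds` is a hypothetical helper.

```python
import re

_UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def to_seconds(text):
    """Convert a duration such as '48.155303ms' or '8.653162s' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ns|us|ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_SECONDS[unit]


# (critical path, waiting time) pairs copied from the tables above.
sync_rows = {
    "TensorRT cuStreamSynchronize": ("29.532000us", "8.653162s"),
    "PyTorch cudaStreamSynchronize": ("48.755587ms", "12.524525s"),
}
for name, (critical, waiting) in sync_rows.items():
    print(f"{name}: critical path {to_seconds(critical):.6f} s, "
          f"waiting {to_seconds(waiting):.3f} s")
```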

<strong>Commands:</strong>

TensorRT:

`python3 trt_inference.py -t "./engines/yolov4-tiny-person-vehicle.trt"`

`nvprof --unified-memory-profiling off --dependency-analysis --log-file "profiling_trt_log.txt" python3 trt_inference.py -t "./engines/yolov4-tiny-person-vehicle.trt"`

`nvprof --metrics global_load_requests,flop_sp_efficiency,flop_dp_efficiency --unified-memory-profiling off --dependency-analysis --log-file "profiling_trt_log.txt" python3 trt_inference.py -t "./engines/yolov4-tiny-person-vehicle.trt"`


PyTorch:

`python3 pytorch_inference.py -w "./weights/yolov4-tiny-person-vehicle_best.weights" -c "./cfg/yolov4-tiny-person-vehicle.cfg"`

`nvprof --unified-memory-profiling off --dependency-analysis --log-file "profiling_torch_log.txt" python3 pytorch_inference.py -w "./weights/yolov4-tiny-person-vehicle_best.weights" -c "./cfg/yolov4-tiny-person-vehicle.cfg"`

`nvprof --metrics global_load_requests,flop_sp_efficiency,flop_dp_efficiency --unified-memory-profiling off --dependency-analysis --log-file "profiling_torch_log.txt" python3 pytorch_inference.py -w "./weights/yolov4-tiny-person-vehicle_best.weights" -c "./cfg/yolov4-tiny-person-vehicle.cfg"`


<strong>Metrics:</strong>

* `shared_load_transactions_per_request`: Average number of shared memory load transactions performed for each shared memory load.
* `warp_execution_efficiency`: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor.
* `double_precision_fu_utilization`: The utilization level of the multiprocessor function units that execute double-precision floating-point instructions, on a scale of 0 to 10.
* `single_precision_fu_utilization`: The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions, on a scale of 0 to 10.
* `shared_efficiency`: Ratio of requested shared memory throughput to required shared memory throughput, expressed as a percentage.
* `global_load_requests`: Total number of global load requests from the multiprocessor.
* `local_load_requests`: Total number of local load requests from the multiprocessor.
* `local_memory_overhead`: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches, expressed as a percentage.
* `flop_sp_efficiency`: Ratio of achieved to peak single-precision floating-point operations.
* `flop_dp_efficiency`: Ratio of achieved to peak double-precision floating-point operations.