## Step 4: Multi-Report Analysis

In this notebook, we will learn about Nsight Systems' multi-report analysis.

When profiling parallel applications, it might be sufficient to profile just one rank (or process tree) as representative of the work done on the remaining ranks.
However, to evaluate aspects such as load balancing across nodes and ranks, communication between the nodes and ranks, etc., we need to profile all the ranks of the application.
In such cases, you might end up with tens or hundreds of report files to analyze, which may not be practicable using the Nsight Systems timeline (even though multiple reports can be opened into a single timeline).

The Nsight Systems **Multi-Report Analysis System** allows you to do **statistical analysis across multiple result files**.
The workflow is illustrated in the following diagram.

<img src=images/step4/multi_node_analysis.jpg width=70%>

Nsight Systems provides a **library of Python scripts** which are referred to as **recipes**.
The recipe analysis can be run immediately after the collection or post-mortem.
The output is usually a Jupyter notebook that can be viewed within the Nsight Systems GUI or in a browser.
It is intended to visually **highlight outliers** and other issues that should be investigated further.

Nsight Systems ships with several built-in recipes.
A tutorial on how to write your own recipes is included in the [documentation](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#tutorial-create-a-user-defined-recipe). We encourage users to share any custom recipes that they believe would be beneficial for the [Nsight Systems community](https://devtalk.nvidia.com/default/board/308/nsight-systems/), so we can include them in future versions of the recipe library.

To see the list of available recipes that are included with Nsight Systems, execute the code cell below.

In [1]:
!nsys recipe --help


usage: nsys recipe [<args>] <recipe name> [<recipe args>]

	-h, --help

	    Print the command's help menu.

	-q, --quiet

           Only display errors.

The following built-in recipes are available:

  cuda_api_sum -- CUDA API Summary
  cuda_api_sync -- CUDA Synchronization APIs
  cuda_gpu_kern_hist -- CUDA GPU Kernel Duration Histogram
  cuda_gpu_kern_pace -- CUDA GPU Kernel Pacing
  cuda_gpu_kern_sum -- CUDA GPU Kernel Summary
  cuda_gpu_mem_size_sum -- CUDA GPU MemOps Summary (by Size)
  cuda_gpu_mem_time_sum -- CUDA GPU MemOps Summary (by Time)
  cuda_gpu_time_util_map -- CUDA GPU Time Utilization Heatmap
  cuda_memcpy_async -- CUDA Async Memcpy with Pageable Memory
  cuda_memcpy_sync -- CUDA Synchronous Memcpy
  cuda_memset_sync -- CUDA Synchronous Memset
  diff -- Statistics Diff
  dx12_mem_ops -- DX12 Memory Operations
  gpu_gaps -- GPU Gaps
  gpu_metric_util_map -- GPU Metric Utilization Heatmap
  gpu_metric_util_sum -- GPU Metrics Utilization Summary
  gpu_time_util -- GPU

### 4.1 NVTX GPU Projection Summary Recipe

We would like to tune the input parameters so that we get the fastest execution time.
We will focus on tuning the batch size, which reflects the number of frames that are processed as a single batch.
The NVTX annotations that have been added to the code will help us to easily identify the **fastest pipeline** and the **most time-consuming step** in each batch execution.

Execute the cell below to create profiles for the batch sizes 1, 2, 4, 8 and 16, which will result in five report files. Be patient: this will take a little while.

In [10]:
!mkdir -p reports/cvcuda_batchsize

!for bs in 1 2 4 8 16; do \
  printf -v bs_zfill "%02d" ${bs}; \
  nsys profile --trace cuda,nvtx \
    --cuda-event-trace=false \
    --output reports/cvcuda_batchsize/cvcuda_bs${bs_zfill} \
    --force-overwrite=true \
  python video_segmentation/main_nvtx-cvcuda-nvcodec.py ${bs} ; \
done

Try the 'nsys status --environment' command to learn more.

Invalid plugin configuration: Executable path does not exist: /opt/nvidia/nsight-systems/2025.1.3/target-linux-x64/plugins/mynvml/mynvml_plugin
Collecting data...
Using batch size 1
{<NV_ENC_CAPS.NUM_MAX_BFRAMES: 0>: 4, <NV_ENC_CAPS.SUPPORTED_RATECONTROL_MODES: 1>: 63, <NV_ENC_CAPS.SUPPORT_FIELD_ENCODING: 2>: 0, <NV_ENC_CAPS.SUPPORT_MONOCHROME: 3>: 0, <NV_ENC_CAPS.SUPPORT_FMO: 4>: 0, <NV_ENC_CAPS.SUPPORT_QPELMV: 5>: 1, <NV_ENC_CAPS.SUPPORT_BDIRECT_MODE: 6>: 1, <NV_ENC_CAPS.SUPPORT_CABAC: 7>: 1, <NV_ENC_CAPS.SUPPORT_ADAPTIVE_TRANSFORM: 8>: 1, <NV_ENC_CAPS.SUPPORT_STEREO_MVC: 9>: 1, <NV_ENC_CAPS.NUM_MAX_TEMPORAL_LAYERS: 10>: 4, <NV_ENC_CAPS.SUPPORT_HIERARCHICAL_PFRAMES: 11>: 1, <NV_ENC_CAPS.SUPPORT_HIERARCHICAL_BFRAMES: 12>: 1, <NV_ENC_CAPS.LEVEL_MAX: 13>: 62, <NV_ENC_CAPS.LEVEL_MIN: 14>: 10, <NV_ENC_CAPS.SEPARATE_COLOUR_PLANE: 15>: 1, <NV_ENC_CAPS.WIDTH_MAX: 16>: 4096,

The `printf` line in the code box above creates a two-digit, zero-padded representation of the batch size at the end of the report file name.
This makes the report files sort in ascending order of batch size, which reads better in the notebook output.
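
To see why the zero padding matters, here is a small Python sketch (not part of the tutorial code) that sorts the report base names with and without padding. The names reuse the `cvcuda_bs` prefix from the loop above; that the file list is ordered by plain string comparison is an assumption.

```python
# Batch sizes profiled in the loop above.
batch_sizes = [1, 2, 4, 8, 16]

# Without padding, lexicographic order puts "bs16" before "bs2".
unpadded = sorted(f"cvcuda_bs{bs}" for bs in batch_sizes)

# With two-digit zero padding (the Python analogue of printf "%02d"),
# lexicographic order matches numeric order.
padded = sorted(f"cvcuda_bs{bs:02d}" for bs in batch_sizes)

print(unpadded)
print(padded)
```

Two digits are enough here because the largest batch size is 16; a sweep that reaches three-digit batch sizes would need a wider format such as `%03d`.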

Let's run the **nvtx_gpu_proj_sum** recipe on the five reports in *reports/cvcuda_batchsize*.
It will help us identify the slowest step in the algorithm.

<img src=images/step4/topN_explanation.jpg width=70%>

Recall that there is only one **pipeline** NVTX range in the timeline and that it spans all the batches in the workload.
Examining the duration of this range across all report files will give us the execution time of the video segmentation pipeline.

The following mockup of a timeline illustrates how NVTX ranges are projected from the CPU onto the GPU.

<img src=images/step4/gpu_projection_mockup_v2.jpg width=75%>

Users add NVTX ranges on the CPU thread to annotate the various phases of their code’s algorithms. Nsight Systems automatically projects an NVTX range onto the GPU by analyzing any CUDA work launched from within that range on the same CPU thread. The projection refits the range's start and end times to tightly wrap the CUDA launches, memory copies and memset operations invoked within it. You will see the NVTX projection on the GPU under the CUDA HW timeline row, as highlighted in the screenshot below.

<img src=images/step4/nsys-timeline-nvtx-projection_arrows.png>
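
The projection rule described above can be sketched in a few lines of Python. This is a simplified model for illustration, not Nsight Systems' actual implementation: it assumes a single CPU thread, ignores streams and NVTX domains, and the `GpuOp` class, `project_range` function and all timestamps are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GpuOp:
    """One GPU operation (kernel, memcpy or memset); times in nanoseconds."""
    launch_time: int  # CPU timestamp of the CUDA API call that launched it
    gpu_start: int    # when the operation started executing on the GPU
    gpu_end: int      # when the operation finished on the GPU


def project_range(cpu_start: int, cpu_end: int,
                  ops: List[GpuOp]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the projected range, or None if the
    CPU range launched no GPU work."""
    # Keep only the operations launched from within the CPU range.
    launched = [op for op in ops if cpu_start <= op.launch_time <= cpu_end]
    if not launched:
        return None
    # The projected range tightly wraps the GPU execution of those operations.
    return (min(op.gpu_start for op in launched),
            max(op.gpu_end for op in launched))


# Hypothetical timestamps, for illustration only.
ops = [GpuOp(110, 150, 190), GpuOp(120, 200, 260), GpuOp(400, 410, 450)]
projected = project_range(100, 130, ops)
print(projected)  # (150, 260)
```

The CPU range covers 100 to 130, but the projected range covers 150 to 260: it starts later because GPU work is asynchronous, and it can be longer or shorter than the CPU range depending on how long the launched work runs.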

Execute the cell below to see the command line options for the **nvtx_gpu_proj_sum** recipe.

In [5]:
!nsys recipe nvtx_gpu_proj_sum --help

usage: nvtx_gpu_proj_sum [-h] [--output OUTPUT] [--force-overwrite] --input
                         INPUT [INPUT ...] [--csv]
                         [--filter-time [start_time]/[end_time] |
                         --filter-nvtx range[@domain][/index] |
                         --filter-projected-nvtx range[@domain][/index]]
                         [--mode {none,concurrent,dask-futures}]

This recipe provides a summary of NVTX time ranges projected from the CPU onto
the GPU, and their execution times.

options:
  -h, --help            Show this help message and exit.

Context:
  --mode {none,concurrent,dask-futures}
                        Mode to run tasks.

Recipe:
  --output OUTPUT       Output directory name.
                        Any %q{ENV_VAR} pattern in the filename will be
                        substituted with the value of the environment
                        variable.
                        Any %h pattern in the filename will be substituted
                      


Execute the code cell below to run the *NVTX GPU projection summary* recipe.

In [15]:
# !mkdir reports/cvcuda_batchsize

!nsys recipe nvtx_gpu_proj_sum \
--output reports/cvcuda_batchsize \
--force-overwrite \
--log-level=error \
--input reports/cvcuda_batchsize

# Copy the customized notebook that uses report names instead of ranks.
!cp nsys/recipes/nvtx_gpu_proj_sum/topN_files.ipynb reports/cvcuda_batchsize/results_nvtx_gpu_proj_sum

Generated:
    reports/cvcuda_batchsize


After the recipe finishes, open the notebook [reports/cvcuda_batchsize/results_nvtx_gpu_proj_sum/topN_files.ipynb](reports/cvcuda_batchsize/results_nvtx_gpu_proj_sum/topN_files.ipynb) and follow the explanation given in the notebook. You can **use the >> icon** at the top of the tab to execute all the code cells one after another. 

Below is a screenshot of the statistics table for the top NVTX ranges on the **CPU**.

<img src=images/step4/nvtx_recipe_table_cpu_pytorch_.png width=55%>

From the table, we can conclude the following:
1. The _pipeline_ is fastest when the batch size is 4.
2. Inference is the slowest step in the algorithm for batch sizes up to 4. From batch size 8, the encoder step seems to be the slowest step. *(This is the case on the CPU side.)*

Go back to the statistics table in [reports/cvcuda_batchsize/results_nvtx_gpu_proj_sum/topN_files.ipynb](reports/cvcuda_batchsize/results_nvtx_gpu_proj_sum/topN_files.ipynb) to see the top NVTX ranges when projected onto the GPU.

<img src=images/step4/nvtx_recipe_table_gpu_pytorch_.png width=55%>

This gives a clearer picture of the biggest contributor to the _pipeline_'s duration. If we can speed up the inference step's CUDA kernels, we can speed up the _pipeline_.

### 4.2 Speed Up Inference and Verify the Optimization

Inference for the segmentation step is done using PyTorch. Alternatively, we can use [TensorRT](https://developer.nvidia.com/tensorrt) for inference. TensorRT provides several optimizations that make AI workloads run faster on NVIDIA GPUs.

Execute the cell below to compare the [main_nvtx-cvcuda-nvcodec.py](video_segmentation/main_nvtx-cvcuda-nvcodec.py) and [main_nvtx-cvcuda-trt-nvcodec.py](video_segmentation/main_nvtx-cvcuda-trt-nvcodec.py) files.


In [16]:
!diff -U1 -d --color=always video_segmentation/main_nvtx-cvcuda-nvcodec.py video_segmentation/main_nvtx-cvcuda-trt-nvcodec.py

--- video_segmentation/main_nvtx-cvcuda-nvcodec.py	2025-05-28 18:13:08.227691625 +0100
+++ video_segmentation/main_nvtx-cvcuda-trt-nvcodec.py	2025-05-28 17:17:45.113438177 +0100
@@ -23,3 +23,2 @@
 from nvcodec_utils import BatchEncoder, BatchDecoder
-from multithreading_utils import DecodeThread, EncodeThread
 
@@ -29,3 +28,3 @@
 # Select inference backend -----------------------------
-from torch_utils import Segmentation
+from trt_utils import Segmentation
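
The diff shows that both backends expose a class named `Segmentation`, so switching backends is a one-line import change. As a hedged sketch of the same idea, the backend could also be chosen at run time. The module names `torch_utils` and `trt_utils` come from the diff; the keys `"torch"` and `"trt"` and both helper functions are hypothetical and not part of the tutorial code.

```python
import importlib

# Module names taken from the diff above; the mapping keys are hypothetical.
BACKEND_MODULES = {"torch": "torch_utils", "trt": "trt_utils"}


def backend_module_name(backend: str) -> str:
    """Map a backend key to the module that provides `Segmentation`."""
    try:
        return BACKEND_MODULES[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(BACKEND_MODULES)}"
        ) from None


def load_segmentation(backend: str):
    """Import and return the `Segmentation` class of the chosen backend.

    This only works where `torch_utils` / `trt_utils` are importable,
    that is, wherever the tutorial scripts themselves can import them.
    """
    module = importlib.import_module(backend_module_name(backend))
    return module.Segmentation
```

Keeping the class name and constructor identical across backends is what lets the rest of the pipeline stay unchanged between the two scripts.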
 


<div class="alert alert-block alert-info">
<b>Exercise:</b>
<p>Profile the TensorRT version of our sample application <i>video_segmentation/main_nvtx-cvcuda-trt-nvcodec.py</i> with batch size 4.
<br>What is the runtime and projected GPU runtime of pipeline, inference and encode now?</p>
<p> You can use the following empty code cell.</p>
</div>

In [17]:
!nsys profile \
--trace cuda,nvtx,nvvideo \
--cuda-event-trace=false \
--output reports/optimized_cvcuda_trt-nvcodec.nsys-rep \
--force-overwrite=true \
python video_segmentation/main_nvtx-cvcuda-trt-nvcodec.py 4

Try the 'nsys status --environment' command to learn more.

Invalid plugin configuration: Executable path does not exist: /opt/nvidia/nsight-systems/2025.1.3/target-linux-x64/plugins/mynvml/mynvml_plugin
Collecting data...
[05/28/2025-19:03:23] [TRT] [E] IRuntime::deserializeCudaEngine: Error Code 1: Internal Error (Failed due to an old deserialization call on a newer plan file. This might happen when the plan file was built from an older TensorRT version. You can use `trtexec --getPlanVersionOnly` to check the version of TensorRT that was used to create the plan file.)
Traceback (most recent call last):
  File "/home/sanjay42/sanjay/cuda/AcceleratedPythonProgramming/video_segmentation/main_nvtx-cvcuda-trt-nvcodec.py", line 72, in <module>
    inference = Segmentation("cat", batch_size, inference_size)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sanjay42/sanjay/cuda/AcceleratedPythonProgrammin

In [18]:
!nsys stats \
    --report nvtx_gpu_proj_sum \
    --force-export=true \
    reports/optimized_cvcuda_trt-nvcodec.nsys-rep

Generating SQLite file reports/optimized_cvcuda_trt-nvcodec.sqlite from reports/optimized_cvcuda_trt-nvcodec.nsys-rep
Processing [reports/optimized_cvcuda_trt-nvcodec.sqlite] with [/opt/nvidia/nsight-systems/2025.1.3/host-linux-x64/reports/nvtx_gpu_proj_sum.py]... 

 ** NVTX GPU Projection Summary (nvtx_gpu_proj_sum):

 Range    Style   Total Proj Time (ns)  Total Range Time (ns)  Range Instances  Proj Avg (ns)  Proj Med (ns)  Proj Min (ns)  Proj Max (ns)  Proj StdDev (ns)  Total GPU Ops  Avg GPU Ops  Avg Range Lvl  Avg Num Child
 ------  -------  --------------------  ---------------------  ---------------  -------------  -------------  -------------  -------------  ----------------  -------------  -----------  -------------  -------------
 :total  PushPop               492,510          1,820,007,294                1      492,510.0      492,510.0        492,510        492,510               0.0              2          2.0            0.0            0.0



In [19]:
!nsys recipe \
    nvtx_gpu_proj_sum \
    --input=reports/optimized_cvcuda_trt-nvcodec.nsys-rep \
    --output=reports/results-nvtx-trt \
    --force-overwrite

INFO: reports/optimized_cvcuda_trt-nvcodec.nsys-rep: Exporting ['StringIds', 'CUPTI_ACTIVITY_KIND_RUNTIME', 'CUDA_GRAPH_NODE_EVENTS', 'CUDA_GRAPH_EVENTS', 'CUPTI_ACTIVITY_KIND_KERNEL', 'CUPTI_ACTIVITY_KIND_MEMCPY', 'CUPTI_ACTIVITY_KIND_MEMSET', 'CUPTI_ACTIVITY_KIND_GRAPH_TRACE', 'NVTX_EVENTS'] to reports/optimized_cvcuda_trt-nvcodec_pqtdir...
Generated:
    reports/results-nvtx-trt


The two commands above can both be used to determine the NVTX GPU projection times.

Statistics for a report can be created and visualized via the `nsys stats` command on the command line or via the _Stats System View_ in the Nsight Systems GUI.
The following screenshot shows the NVTX GPU projection statistics view in the Nsight Systems GUI:
<img src=images/step4/nvtx_gpu_proj_stats_trt.png>
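
If you prefer to post-process the numbers rather than read them off, the table printed by `nsys stats` can be parsed with a few lines of Python. This is a minimal sketch, assuming the whitespace-separated layout shown in the output above; the data row is copied from that output, and `parse_row` is a hypothetical helper, not part of Nsight Systems.

```python
# Column order as printed by the nvtx_gpu_proj_sum report above.
COLUMNS = [
    "Range", "Style", "Total Proj Time (ns)", "Total Range Time (ns)",
    "Range Instances", "Proj Avg (ns)", "Proj Med (ns)", "Proj Min (ns)",
    "Proj Max (ns)", "Proj StdDev (ns)", "Total GPU Ops", "Avg GPU Ops",
    "Avg Range Lvl", "Avg Num Child",
]

# Data row copied from the output above.
ROW = (":total  PushPop  492,510  1,820,007,294  1  492,510.0  492,510.0  "
       "492,510  492,510  0.0  2  2.0  0.0  0.0")


def parse_row(line):
    """Split one table row into a dict; numeric fields become floats.

    Assumes that range names contain no whitespace.
    """
    fields = line.split()
    row = dict(zip(COLUMNS, fields))
    for key in COLUMNS[2:]:
        # Strip the thousands separators before converting.
        row[key] = float(row[key].replace(",", ""))
    return row


row = parse_row(ROW)
# Ratio of projected GPU time to the range's CPU-side duration.
gpu_fraction = row["Total Proj Time (ns)"] / row["Total Range Time (ns)"]
print(f"{row['Range']}: {gpu_fraction:.4%} of the range time is projected GPU time")
```

A very small ratio together with only two GPU operations, as in this row, suggests that the profiled run did little GPU work, which would be consistent with the traceback shown earlier. For anything beyond a quick check, `nsys stats` can also emit CSV (see `nsys stats --help`), which is more robust than splitting on whitespace.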

`nsys recipe` can also be used on a single report.
This is the output of the NVTX GPU projection recipe:
<table style="float: left">
<colgroup>
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 25%;">
    </colgroup>
<tr>
<td>
<img src=images/step4/nvtx_gpu_proj_trt_recipe_barchart.png>
</td>
<td>
<img src=images/step4/nvtx_gpu_proj_trt_recipe_table.png>
</td>
</tr>
</table>

With increasing batch sizes, the _encode_ step becomes the bottleneck. We will stop here, as we have already achieved a significant speedup.
The _pipeline_ stage is now down to ~2.7 s, which is a speedup of 2.1x over the previous optimization step and 27.8x over the baseline code.
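
As a quick sanity check on these figures, the speedups can be inverted to recover the durations they imply. The sketch below uses only the rounded numbers quoted above, so the results are approximations rather than measured values.

```python
pipeline_now_s = 2.7        # approximate pipeline duration quoted above
speedup_vs_previous = 2.1   # speedup over the previous optimization step
speedup_vs_baseline = 27.8  # speedup over the baseline code

# speedup = old_time / new_time, so old_time = speedup * new_time.
previous_s = speedup_vs_previous * pipeline_now_s
baseline_s = speedup_vs_baseline * pipeline_now_s

print(f"previous step: ~{previous_s:.1f} s, baseline: ~{baseline_s:.1f} s")
```

This puts the previous optimization step at roughly 5.7 s and the baseline at roughly 75 s.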

<div class="alert alert-block alert-success">
    <b>Summary</b>
    <p>
        We used the multi-report analysis feature of Nsight Systems to compare the performance of runs with different batch sizes.
    </p>
    <p>
        We learned how Nsight Systems projects NVTX ranges onto the GPU workload.
    </p>
    <p>
        We applied another optimization to speed up the CV-CUDA pipeline.
    </p>
</div>

Click [here](step5.ipynb) to move to the next step.