## Step 2: Data Transfers between Host and GPU

Communication between the host and GPU devices usually happens over a PCIe link, whose bandwidth is far lower than that of GPU memory, so it is important to minimize and optimize data transfers between the host and the GPU.


## 2.1 Analyze the Profile

Use the Nsight Systems GUI to analyze the profile that we generated at the end of the last notebook for the code [main_nvtx-cvcuda.py](video_segmentation/main_nvtx-cvcuda.py). If you closed the browser tab with the Nsight Streamer, go back to the [step 1](step1.ipynb) notebook and click the link generated in the last code cell of section 1.2.

The screenshot below shows the timeline view, zoomed into one batch in the pipeline.

<center><img src=images/step2/nsys_timeline_memcpy_merged.png></center>

Some observations:
1. If you select the *Memory* timeline row and right-click to select the *Show in Events View* option, you can see the list of memory operations in the *Events View*, sorted according to their start time. Select the _pipeline_ NVTX range and right-click to select the *Apply Filter* option. This will filter the *Events View* to show only those events that occurred within the _pipeline_ NVTX range. You will see a list of alternating *Memcpy HtoD* and *Memcpy DtoH* operations.
2. The data is being copied from host to device before any of the CUDA kernels for the batch are executed on the GPU. This is because the first step in the algorithm, decoding, is still being done on the CPU.
3. The data is being copied out from device to host after the CUDA kernels for a batch finish executing on the GPU. This is because the final encoding step for the batch is still being done on the CPU.
4. The `cudaMemcpyAsync` calls are actually blocking the CPU thread until the data has been transferred to or from the GPU. (For advanced CUDA users: this is because pageable host memory is being used. See the *CUDA Async Memcpy with Pageable Memory* rule in the *Expert Systems View* for an explanation and how to address it. This will not be covered by the instructor.)
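
The filtering described in observation 1 can be mimicked in a few lines of Python. This is only a minimal sketch of the idea, not how Nsight Systems implements it: the `Event` type, the `events_within` helper, and all timestamps are hypothetical and made up for illustration. Only the event names and the _pipeline_ range name come from the profile described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A timeline event with a name and a time span (hypothetical type)."""
    name: str
    start_ms: float
    end_ms: float


def events_within(events, range_start_ms, range_end_ms):
    """Return the events fully contained in the range, sorted by start time."""
    inside = [
        e for e in events
        if e.start_ms >= range_start_ms and e.end_ms <= range_end_ms
    ]
    return sorted(inside, key=lambda e: e.start_ms)


# Hypothetical NVTX range and memory operations (timestamps are invented).
pipeline_range = Event("pipeline", 100.0, 200.0)
memory_events = [
    Event("Memcpy DtoH", 140.0, 145.0),
    Event("Memcpy HtoD", 50.0, 55.0),    # before the range: filtered out
    Event("Memcpy HtoD", 105.0, 110.0),
    Event("Memcpy HtoD", 150.0, 155.0),
    Event("Memcpy DtoH", 185.0, 190.0),
    Event("Memcpy DtoH", 230.0, 235.0),  # after the range: filtered out
]

filtered = events_within(
    memory_events, pipeline_range.start_ms, pipeline_range.end_ms
)
for event in filtered:
    print(f"{event.start_ms:7.1f} ms  {event.name}")
```

With this invented data, the filtered list alternates between *Memcpy HtoD* and *Memcpy DtoH*, which is the pattern to look for in the *Events View*.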

## 2.2 Optimize Code to Address the Bottleneck
If we also move the decoding and encoding steps of the algorithm to the GPU and keep the data there until the full pipeline is complete, we can avoid these memory transfers.

NVIDIA GPUs contain one or more hardware-based decoders and encoders (separate from the CUDA cores) which provide fully accelerated hardware-based video decoding and encoding for several popular codecs. With decoding/encoding offloaded, the graphics engine and the CPU are free for other operations.

<center><img src=images/Nvenc_dec.JPG></center>

[NVIDIA's Video Codec SDK](https://developer.nvidia.com/video-codec-sdk) offers hardware-accelerated video encoding and decoding through highly optimized C/C++ APIs.
Video encoding and decoding is useful for a wide range of users, including computer vision experts, researchers and Deep Learning developers.
[PyNvVideoCodec](https://docs.nvidia.com/video-technologies/pynvvideocodec) provides Python bindings for harnessing such video encoding and decoding capabilities when working with videos in Python.

Execute the cell below to see the code changes in the main Python program needed for the optimization. It shows the diff between the [main_nvtx-cvcuda.py](video_segmentation/main_nvtx-cvcuda.py) and [main_nvtx-cvcuda-nvcodec.py](video_segmentation/main_nvtx-cvcuda-nvcodec.py) files.

In [1]:
!diff -U1 -d --color=always video_segmentation/main_nvtx-cvcuda.py video_segmentation/main_nvtx-cvcuda-nvcodec.py

--- video_segmentation/main_nvtx-cvcuda.py	2025-05-28 17:17:45.284432419 +0100
+++ video_segmentation/main_nvtx-cvcuda-nvcodec.py	2025-05-28 17:17:44.696452219 +0100
@@ -22,3 +22,3 @@
 # Select codec backend ---------------------------------
-from opencv_utils import BatchEncoder, BatchDecoder
+from nvcodec_utils import BatchEncoder, BatchDecoder
 


## 2.3 Profile to Verify the Optimization

So far we have used the CLI to profile the application. Another option is NVIDIA's [JupyterLab extension](https://pypi.org/project/jupyterlab-nvidia-nsight/), which enables profiling of code cells directly.
The extension is pre-installed in this JupyterLab environment. Let's use it to profile the optimized code.

<img src=images/step2/jupyterlab-nvidia-nsight-extension.png>

The following code cell has a simple Python command to run the optimized code. To profile it, use the following instructions:
- Click on the **NVIDIA Nsight** menu option
- Select the **Profiling with Nsight Systems...** option
- Set the _nsys launch_ command options to `--trace=cuda,nvtx,osrt,nvvideo`. These are the same options as used in the previous notebook with the addition of the *nvvideo* trace option which will make Nsight Systems trace the NVIDIA Video Codec API calls.

<img src=images/extension_defaults_change.jpg>

- Hit _Restart_ to restart the kernel
- Click on the code cell to profile and from the NVIDIA Nsight menu select the **Run and profile selected cells...** option (green arrow in the toolbar). You will see a popup to _Set nsys command options_, which you can leave blank to use the default and click _OK_.

In [7]:
!python video_segmentation/main_nvtx-cvcuda-nvcodec.py

{<NV_ENC_CAPS.NUM_MAX_BFRAMES: 0>: 4, <NV_ENC_CAPS.SUPPORTED_RATECONTROL_MODES: 1>: 63, <NV_ENC_CAPS.SUPPORT_FIELD_ENCODING: 2>: 0, <NV_ENC_CAPS.SUPPORT_MONOCHROME: 3>: 0, <NV_ENC_CAPS.SUPPORT_FMO: 4>: 0, <NV_ENC_CAPS.SUPPORT_QPELMV: 5>: 1, <NV_ENC_CAPS.SUPPORT_BDIRECT_MODE: 6>: 1, <NV_ENC_CAPS.SUPPORT_CABAC: 7>: 1, <NV_ENC_CAPS.SUPPORT_ADAPTIVE_TRANSFORM: 8>: 1, <NV_ENC_CAPS.SUPPORT_STEREO_MVC: 9>: 1, <NV_ENC_CAPS.NUM_MAX_TEMPORAL_LAYERS: 10>: 4, <NV_ENC_CAPS.SUPPORT_HIERARCHICAL_PFRAMES: 11>: 1, <NV_ENC_CAPS.SUPPORT_HIERARCHICAL_BFRAMES: 12>: 1, <NV_ENC_CAPS.LEVEL_MAX: 13>: 62, <NV_ENC_CAPS.LEVEL_MIN: 14>: 10, <NV_ENC_CAPS.SEPARATE_COLOUR_PLANE: 15>: 1, <NV_ENC_CAPS.WIDTH_MAX: 16>: 4096, <NV_ENC_CAPS.HEIGHT_MAX: 17>: 4096, <NV_ENC_CAPS.SUPPORT_TEMPORAL_SVC: 18>: 1, <NV_ENC_CAPS.SUPPORT_DYN_RES_CHANGE: 19>: 1, <NV_ENC_CAPS.SUPPORT_DYN_BITRATE_CHANGE: 20>: 1, <NV_ENC_CAPS.SUPPORT_DYN_FORCE_CONSTQP: 21>: 1, <NV_ENC_CAPS.SUPPORT_DYN_RCMODE_CHANGE: 22>: 0, <NV_ENC_CAPS.SUPPORT_SUBFRAME_RE


Once profiling is done, a popup notifies you that the report file is ready.

<div class="alert alert-block alert-info">
<b> Optionally enable Python sampling</b>

<p>Look up the relevant CLI flags in the <a href="https://docs.nvidia.com/nsight-systems/UserGuide/index.html#python-profiling">Nsight Systems User Guide</a> or use <code>nsys profile --help</code>.</p>

Use the JupyterLab Nsight extension or the <i>nsys profile</i> command to collect a profile for the <i>video_segmentation/main_nvtx-cvcuda-nvcodec.py</i> program with Python sampling enabled.<br>
Optionally set the Python sampling frequency to 400Hz.
</div>

In [8]:
!nsys profile \
--trace cuda,nvtx,osrt,nvvideo \
--output reports/optimized_cvcuda_nvcodec_pybt \
--force-overwrite=true \
--python-sampling=true --python-sampling-frequency=400 \
python video_segmentation/main_nvtx-cvcuda-nvcodec.py

         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.
Try the 'nsys status --environment' command to learn more.


Collecting data...
{<NV_ENC_CAPS.NUM_MAX_BFRAMES: 0>: 4, <NV_ENC_CAPS.SUPPORTED_RATECONTROL_MODES: 1>: 63, <NV_ENC_CAPS.SUPPORT_FIELD_ENCODING: 2>: 0, <NV_ENC_CAPS.SUPPORT_MONOCHROME: 3>: 0, <NV_ENC_CAPS.SUPPORT_FMO: 4>: 0, <NV_ENC_CAPS.SUPPORT_QPELMV: 5>: 1, <NV_ENC_CAPS.SUPPORT_BDIRECT_MODE: 6>: 1, <NV_ENC_CAPS.SUPPORT_CABAC: 7>: 1, <NV_ENC_CAPS.SUPPORT_ADAPTIVE_TRANSFORM: 8>: 1, <NV_ENC_CAPS.SUPPORT_STEREO_MVC: 9>: 1, <NV_ENC_CAPS.NUM_MAX_TEMPORAL_LAYERS: 10>: 4, <NV_ENC_CAPS.SUPPORT_HIERARCHICAL_PFRAMES: 11>: 1, <NV_ENC_CAPS.SUPPORT_HIERARCHICAL_BFRAMES: 12>: 1, <NV_ENC_CAPS.LEVEL_MAX: 13>: 62, <NV_ENC_CAPS.LEVEL_MIN: 14>: 10, <NV_ENC_CAPS.SEPARATE_COLOUR_PLA

Let's open the report file in the Nsight Systems GUI.

Zooming into a single batch of the pipeline confirms that the application is invoking the Video Encode and Video Decode APIs, and that there are no longer any HtoD memory transfers before the CUDA kernels for the batch execute, nor any DtoH transfers afterwards. Filtering the memory operations to just the _pipeline_ NVTX range as before shows no *Memcpy HtoD* or *Memcpy DtoH* operations. The pipeline stage is now down to ~5.7s, which is a speedup of 2.1x compared to the previous optimization step and 11.5x compared to the baseline code.
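
As a quick sanity check, the quoted speedups can be turned back into the pipeline times they imply. The sketch below uses only the three rounded numbers from the paragraph above, so the results are estimates rather than measured values. The `implied_time` helper is a hypothetical name introduced for this example.

```python
# Numbers quoted in the text (all rounded, so results are estimates only).
optimized_s = 5.7            # pipeline time after this optimization step
speedup_vs_previous = 2.1    # relative to the previous optimization step
speedup_vs_baseline = 11.5   # relative to the baseline code


def implied_time(current_s, speedup):
    """Time the slower version must have taken: speedup = slower / current."""
    return current_s * speedup


previous_s = implied_time(optimized_s, speedup_vs_previous)
baseline_s = implied_time(optimized_s, speedup_vs_baseline)

print(f"Implied previous-step pipeline time: ~{previous_s:.2f} s")
print(f"Implied baseline pipeline time:      ~{baseline_s:.2f} s")
```

This puts the previous step at roughly 12 s and the baseline at roughly 66 s for the pipeline stage, which you can compare against the _pipeline_ NVTX range durations in your own earlier reports.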

<center><img src=images/step2/nsys_timeline_nvcodec.png></center>

<div class="alert alert-block alert-success">
    <b>Summary</b>
    <p>
        We went through another iteration of the optimization workflow by avoiding the data movement between the CPU and GPU.
    </p>
    <p>
        The <i>Events View</i> feature is handy when searching for events of interest in a timeline row.
    </p>
    <p>
        The JupyterLab extension for NVIDIA Nsight tools enables you to directly profile Python code in a code cell.
    </p>
</div>

Please click [here](step3.ipynb) to move to the next step.