Why is model_run spending a lot of time after all operations have finished. #8035

KarelPeeters · 2021-06-11T16:13:07Z

I've been profiling a network that I'm trying to run at the highest inference throughput I can get. It's based on the AlphaZero network, which for inference means only Conv, Relu, Add, Flatten and Tanh nodes, so nothing exotic. I've attached the model file as small.zip, but I get the same behavior for even simpler sequential, linear models. The ONNX file was exported from PyTorch with opset_version=12.

I'm calling session.run from Python in a loop that runs for 100 iterations. The batch size is 1000, resulting in a total input shape (1000, 5, 9, 9) and output shapes (1000, 1) and (1000, 81).

The profile I get with sess_options.enable_profiling = True looks like this:

And zoomed in on the second model_run:

This mostly makes sense: there's some session initialization, then the first model_run is slow and all the others are a lot faster in comparison. I don't understand what is taking so much time in the zoomed-in profile of model_run, though: it seems that less than 10% is taken up by actual operations. What is taking so long afterwards?
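To put a number on the "less than 10%" impression, the trace can be summarized in a few lines. This is a minimal sketch, not part of the original post: it assumes the profile is a JSON list of Chrome-trace-style events with `cat`, `name`, `ts` and `dur` fields, that whole runs appear as `cat == "Session"` / `name == "model_run"`, and that per-operator events use `cat == "Node"`. Those field and category names are assumptions about the trace layout and should be checked against an actual file (in Python, `session.end_profiling()` returns its path). The example events are synthetic.

```python
import json


def load_profile(path):
    """Load a profile written by the profiler (assumed: a JSON list of events)."""
    with open(path) as f:
        return json.load(f)


def node_coverage(events):
    """Fraction of each model_run duration that is covered by Node events.

    Assumes Node events inside one run do not overlap each other
    (sequential execution); overlapping events would be double counted.
    """
    runs = [e for e in events
            if e.get("cat") == "Session" and e.get("name") == "model_run"]
    nodes = [e for e in events if e.get("cat") == "Node"]
    fractions = []
    for run in runs:
        start, end = run["ts"], run["ts"] + run["dur"]
        # Sum the durations of node events that lie entirely inside this run.
        busy = sum(n["dur"] for n in nodes
                   if start <= n["ts"] and n["ts"] + n["dur"] <= end)
        fractions.append(busy / run["dur"] if run["dur"] else 0.0)
    return fractions


# Synthetic events for illustration only (times in microseconds).
example_events = [
    {"cat": "Session", "name": "model_run", "ts": 1000, "dur": 10000},
    {"cat": "Node", "name": "Conv_0_kernel_time", "ts": 1100, "dur": 400},
    {"cat": "Node", "name": "Relu_1_kernel_time", "ts": 1600, "dur": 300},
]
fractions = node_coverage(example_events)
print(fractions)
```

On a real trace, a low fraction only says that the recorded host-side events are short relative to the whole run; it does not by itself say what the rest of the time is spent on.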

It can't be the memcpy back to CPU memory, since the output is 5x smaller than the input and the input doesn't seem to be taking a lot of time. Is this just a bug in the profiler, or is the empty space time that is actually spent doing nothing?
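The size comparison can be checked from the shapes given above. A small sketch, assuming float32 tensors (the element type is not stated in the post):

```python
import math

input_shape = (1000, 5, 9, 9)
output_shapes = [(1000, 1), (1000, 81)]
BYTES_PER_ELEMENT = 4  # assumption: float32 inputs and outputs

input_elems = math.prod(input_shape)                      # 1000 * 5 * 9 * 9
output_elems = sum(math.prod(s) for s in output_shapes)   # 1000 * (1 + 81)
ratio = input_elems / output_elems

input_bytes = input_elems * BYTES_PER_ELEMENT
output_bytes = output_elems * BYTES_PER_ELEMENT
print(input_elems, output_elems, round(ratio, 2), input_bytes, output_bytes)
```

That gives 405,000 input elements against 82,000 output elements, a ratio of roughly 4.9, which matches the "5x smaller" estimate.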

All the relevant version info I can think of:

Windows 10 Version Dev (OS Build 21930.1000)
NVIDIA GeForce GTX 1060 with Max-Q Design
onnxruntime-gpu: 1.8.0

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:41:42_Pacific_Daylight_Time_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

cudnn-11.3-windows-x64-v8.2.1.32

small.zip

Answered by wangyems

wangyems · 2021-08-20T00:43:04Z

The profiler currently only works reliably for CPU profiling. The apparently idle region in model_run is actually filled with CUDA kernel runs.

1 reply

KarelPeeters Sep 12, 2021
Author

Ah okay, so the small bit at the start is just submitting work to the GPU which is then executed later? That makes sense, thanks!
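The "submit now, execute later" behavior can be illustrated without a GPU. The following is only an analogy in plain Python, not ONNX Runtime or CUDA code: `ToyStream` is a hypothetical class in which a worker thread stands in for the device, and an artificial gate holds the worker back so that the host visibly runs ahead of it.

```python
import queue
import threading


class ToyStream:
    """Hypothetical stand-in for a device queue: submit() returns at once."""

    def __init__(self):
        self._work = queue.Queue()
        self._gate = threading.Event()
        self.completed = []
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # The "device" stays idle until the gate is opened; this stands in
        # for a device that is slower than the host submitting work to it.
        self._gate.wait()
        while True:
            name = self._work.get()
            self.completed.append(name)   # the operation actually "executes"
            self._work.task_done()

    def submit(self, name):
        self._work.put(name)              # enqueue only, no waiting

    def start_device(self):
        self._gate.set()

    def synchronize(self):
        self._work.join()                 # block until every queued op ran


stream = ToyStream()
for op in ["Conv", "Relu", "Add", "Flatten", "Tanh"]:
    stream.submit(op)

# All submissions have returned, yet nothing has executed so far.
submitted_before_any_ran = len(stream.completed)

stream.start_device()
stream.synchronize()
finished = list(stream.completed)
print(submitted_before_any_ran, finished)
```

A host-side profiler that only timed the `submit` calls would show five very short events followed by a long gap until `synchronize` returns, which is the shape described in the question.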

AndreyOrb · 2024-05-28T18:47:36Z

I found out that this is caused by the device-to-host (D2H) copy that happens if you call Run on the model without I/O bindings.
When the result is returned, the data is synced from GPU memory to CPU RAM, and that takes time.

What you can do is disable that sync by applying kOrtRunOptionsConfigDisableSynchronizeExecutionProviders.
Then model_run will exit immediately, but the data will not yet be synced to CPU RAM.

That can be handy in some situations, for example if you have a lock around "model_run" and want to maximize GPU utilization.
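For the Python API used in the question, setting this option might look like the sketch below. Two things here are assumptions rather than statements from the thread: that the string behind kOrtRunOptionsConfigDisableSynchronizeExecutionProviders is `"disable_synchronize_execution_providers"` (worth confirming in `onnxruntime_run_options_config_keys.h` for the installed version), and that the installed release exposes `RunOptions.add_run_config_entry`. The helper name `disable_provider_sync` is hypothetical.

```python
# Assumed string value of kOrtRunOptionsConfigDisableSynchronizeExecutionProviders.
DISABLE_SYNC_KEY = "disable_synchronize_execution_providers"


def disable_provider_sync(run_options):
    """Ask the session not to synchronize execution providers after Run.

    `run_options` is expected to be an onnxruntime.RunOptions instance.
    """
    run_options.add_run_config_entry(DISABLE_SYNC_KEY, "1")
    return run_options


# Usage sketch (requires onnxruntime-gpu and a loaded session):
#   import onnxruntime as ort
#   opts = disable_provider_sync(ort.RunOptions())
#   outputs = session.run(None, {"input": batch}, run_options=opts)
```

With the sync disabled, anything that reads the results has to make sure the GPU work has finished first.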

3 replies

KarelPeeters May 28, 2024
Author

You do need to be careful with changes like this though. Apparently the events reported in this profile are just the pushing of operations to the GPU execution queue, which might run ahead of the actual execution of those operations by a lot. See for example this Stack Overflow answer for what this means exactly.

Best case, skipping the synchronization just hides even more info from the profiling traces; worst case, it causes a bunch of synchronization issues because operators are executing at different times than expected.

AndreyOrb May 28, 2024

You are absolutely right. Such modifications require a deep understanding of the process and the potential consequences.
In the process I'm working with, I intentionally decouple the sync part from the processing part. Once the processing part is done, I release the lock to allow another thread to start GPU processing. At this point the GPU is still not synced for the first thread, so I explicitly sync it with
cudaStreamSynchronize(static_cast<cudaStream_t>(cudaStreamDefault));
But there's more. There is a point in between when GPU RAM is occupied by both the old outputs and the new inputs. If memory is limited, the batch size must be adjusted.

AndreyOrb Apr 9, 2025

I also found that setting "kOrtRunOptionsConfigDisableSynchronizeExecutionProviders" is not enough.
If you run the session without allocating and providing the output tensors in advance, onnxruntime will internally allocate CPU-based tensors in CUDA-pinned memory, so the final data placement will automatically sync (instead of returning immediately).
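In the Python API, one way to avoid host-side output tensors is I/O binding, which lets the outputs be bound to the device. The sketch below is an assumption-laden illustration rather than a confirmed fix for the behavior described here: the helper name `run_with_device_outputs` is hypothetical, it binds outputs to `"cuda"` and leaves allocation to onnxruntime instead of preallocating buffers, and whether this removes the implicit sync in a given version is something to measure.

```python
def run_with_device_outputs(session, input_name, input_array, output_names,
                            run_options=None):
    """Run `session` with outputs bound to the GPU instead of host memory.

    `session` is expected to be an onnxruntime.InferenceSession created with
    the CUDA execution provider. Returns the binding so the caller decides
    when to copy results back to the host.
    """
    binding = session.io_binding()
    binding.bind_cpu_input(input_name, input_array)   # input starts on the host
    for name in output_names:
        binding.bind_output(name, "cuda")             # output stays on the device
    session.run_with_iobinding(binding, run_options)
    return binding


# Usage sketch (input and output names are placeholders for the model's own):
#   binding = run_with_device_outputs(session, "input", batch, ["value", "policy"])
#   ...release the lock, let another thread submit work...
#   value, policy = binding.copy_outputs_to_cpu()   # device-to-host copy happens here
```

Deferring `copy_outputs_to_cpu()` keeps the device-to-host copy out of the locked region, at the cost of holding the output buffers in GPU memory for longer.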
