OpenCL Performance: way to extract parallel kernel execution with out-of-order command queue #67
I'm using ROCm 6.1 RC on Linux Ubuntu 22.04, kernel 6.8.1, with Radeon Pro VII and Radeon VII GPUs.
I think I found a clue: in clinfo for all my GPUs (Radeon VII and Radeon Pro VII) I see:
So it seems that the Host Command Queue does not implement out-of-order execution. OK. Why is that -- is it a limitation of the hardware (particular GPU models), of the ROCm version, something not yet implemented in software, or something else? Thanks anyway.
One more observation: the hardware is clearly capable of running multiple compute kernels in parallel, as it does so when the kernels are queued from multiple processes. So the missing "out-of-order" support can't be a limitation of the hardware; it's probably more a case of "not implemented". Are there plans in that direction?
Hi @preda. Apologies for the delayed response. An internal ticket has been created to assist with your issue. Thanks!
Hi @preda, I'm reaching out to the internal team for more details, but from what I can see we don't support out-of-order host-side queues by design, and there are no plans to add support for that.
I'm developing an OpenCL application, PRPLL/GpuOwl (https://github.com/preda/gpuowl/tree/prpll), for a prime-search project.
The app runs a long series of kernels serially, in a long loop; e.g. say this is the sequence of kernels submitted:
A, B, C, D, A, B, C, D, and so on. Since these kernels must run serially, it's natural to use an in-order queue.
So initially we had a single process, with a single in-order queue.
An observation was made that when running two such processes in parallel (independent processes, running on the same GPU), each process performs a bit better than "half" speed. I.e. the aggregate throughput was improved by running two processes in parallel on one GPU vs. running a single process on the GPU.
Taking this observation into account, I wanted to reproduce in a single process the behaviour observed with two processes, by running two "logical" streams of kernels within one process. The logic being: while each stream is serial, there is parallelism between the two streams that the GPU can exploit. E.g. if we run A1,B1,C1,D1 on stream1 and A2,B2,C2,D2 on stream2, then A1 can execute on the GPU in parallel with any kernel from stream2. (By "stream" I mean a logical sequence of kernels that must be executed serially/in-order.)
My first approach was to use two in-order command queues, allocating one queue to each logical "stream". But I hit this bug ROCm/ROCR-Runtime#186, which causes one hot thread (100% CPU) and performance degradation when using two queues.
As a consequence, I decided to use a single out-of-order command queue and model the serial dependence inside the logical streams with OpenCL event wait-lists. Unfortunately, after implementing this, I realized that no parallelism is exploited between the two "streams". It appears that no kernels are executed in parallel at all, even though some could and should be.
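As a rough sketch of this submission pattern (plain Python standing in for the OpenCL calls; the helper name and round-robin interleaving are assumptions, not the actual PRPLL code), each enqueue into the single out-of-order queue passes the previous event of its own stream as the wait list:

```python
# Hypothetical model of interleaving two serial streams into one
# out-of-order queue, mirroring OpenCL event wait-lists: every kernel
# waits only on the previous kernel of its own stream.

def enqueue_streams(streams):
    """Return (submission order, wait-list per kernel)."""
    order, waits, last = [], {}, {}
    # round-robin interleave the streams into one queue
    for i in range(max(len(s) for s in streams)):
        for sid, s in enumerate(streams):
            if i < len(s):
                k = s[i]
                order.append(k)
                waits[k] = [last[sid]] if sid in last else []
                last[sid] = k
    return order, waits

order, waits = enqueue_streams([["A1", "B1", "C1"], ["A2", "B2", "C2"]])
print(order)        # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
print(waits["B1"])  # ['A1']
print(waits["A2"])  # []
```

Since no kernel from stream1 ever appears in a wait list of stream2 (and vice versa), the runtime has, in principle, full freedom to overlap the two streams.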
Example: let's assume these are the kernels submitted to the out-of-order queue:
A1,C2,B1,D2,C1, with dependence modelled through events: A1<B1<C1 and C2<D2. Then A1 and C2 could be run in parallel on the GPU.
(Another scenario: A1,B1,A2 with the dependence A1<B1; here A1 and A2 are eligible to run in parallel, though this fact is less obvious. I would hope this parallelism opportunity can be exploited as well.)
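To make the expected schedule concrete, here is a small sketch (the helper is hypothetical, and unit-time kernels are assumed) that computes which kernels become eligible together, given only the event dependencies from the two scenarios above:

```python
# Sketch: group kernels into successive "waves" that are simultaneously
# eligible, assuming each kernel takes one unit of time. Kernel names
# follow the examples above.

def ready_waves(deps, kernels):
    """Yield sorted lists of kernels whose dependencies are all satisfied."""
    done, remaining = set(), list(kernels)
    while remaining:
        wave = sorted(k for k in remaining if deps.get(k, set()) <= done)
        done.update(wave)
        remaining = [k for k in remaining if k not in done]
        yield wave

# Scenario 1: A1<B1<C1 and C2<D2, submitted as A1,C2,B1,D2,C1.
deps1 = {"B1": {"A1"}, "C1": {"B1"}, "D2": {"C2"}}
print(list(ready_waves(deps1, ["A1", "C2", "B1", "D2", "C1"])))
# -> [['A1', 'C2'], ['B1', 'D2'], ['C1']]

# Scenario 2: A1,B1,A2 with A1<B1 -- A1 and A2 are eligible together.
deps2 = {"B1": {"A1"}}
print(list(ready_waves(deps2, ["A1", "B1", "A2"])))
# -> [['A1', 'A2'], ['B1']]
```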
But this is not what is observed: by timing the kernels, I obtain a profile that is consistent with all the kernels being run serially.
When the kernels are run by way of two processes, I see that the "running" time of each kernel grows (almost doubles) as a consequence of the two processes using the GPU in parallel. The kernels from the two processes are effectively executed in parallel, and this shows up both in the per-kernel running time and in the overall improved throughput.
But when the kernels are run through an "interleaved out-of-order queue", the running time of each kernel does not increase. That means each kernel is executed "standalone", and no parallelism is exploited. The aggregate throughput is consistent with running serially (lower than when running through two processes).
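The arithmetic behind these two observations can be sketched with hypothetical numbers (the ~1.8x per-kernel slowdown under sharing is an assumption for illustration, not a measurement from this report):

```python
# Illustrative model: when two kernels share the GPU, each one runs slower
# (here an assumed 1.8x), yet the aggregate throughput still improves,
# because the two streams finish in one overlapped makespan instead of
# running back to back.
t = 1.0          # per-kernel time when running alone (arbitrary units)
slowdown = 1.8   # assumed per-kernel slowdown when two kernels overlap
n = 1000         # kernels per stream

serial_time = 2 * n * t            # one queue: the streams run back to back
overlap_time = n * t * slowdown    # two processes: the streams overlap

print(serial_time, overlap_time)   # 2000.0 1800.0
gain = serial_time / overlap_time - 1
print(f"throughput gain: {gain:.0%}")   # throughput gain: 11%
```

This matches the pattern described: per-kernel time almost doubles under sharing, while aggregate throughput is still somewhat better than serial execution.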
Basically, I want to be able to obtain the same level of parallelism and performance by running a single process (either with multiple queues, or with a single out-of-order queue) as what is obtained by running two processes with a single in-order queue each.
The issue can be reproduced using this project (at the given commit, or generally the "prpll" branch):
https://github.com/preda/gpuowl/tree/7520fade45359f07f19151085d1dff5480ab29a9
compiling with make in the source folder, executing

```
echo PRP=118845473 > work-1.txt
```

and running with

```
./build-debug/prpll -d 0 -prp 118063003 -verbose
```

(basically the above runs two PRP tests for the two numbers mentioned, one in the work-1.txt file and one on the command line).