Rely on new IGC behavior to workaround issues with -O0 compilation in reduce-then-scan #2088

mmichel11 · 2025-02-27T17:55:55Z

I recently discovered that after our discussions with the IGC team, IGC reimplemented it's software workaround for the function calling bug that affects certain integrated graphics / DG2 kernels compiled with -O0 due to the large number of issues similar to what we have encountered and worked around in #2046.

The new device compiler behavior enables -O0 to work with SIMD32 kernels (sub-group sizes of 32) with a tradeoff of EU occupancy by disabling a hardware feature known as EU fusion. With this approach, my understanding is that EU occupancy only reaches 50% as opposed to our current SIMD16 workaround which has no such issue. The performance impact is acceptable from my perspective given that this only occurs when the kernels are compiled with -O0 which will likely be used for debugging only. This also avoids changing the kernel SIMD width between optimization levels which may impact debuggability from the user's side.

Relying on the new IGC behavior also resolves the recently documented scenario where kernel name mismatches occur when the host and device compiler use different optimization levels which is the default behavior with the intel/llvm open-source clang++ driver when no optimization flag is specified.

GPU Driver support this patch relies on is currently limited to the 2506.18 rolling driver: https://dgpu-docs.intel.com/releases/releases.html with no current LTS driver support. Given that onedpl 2022.8.0 has already dropped with our workaround, users on LTS drivers will not see a kernel compilation crash with this release. For onedpl 2022.9.0 and beyond, we can document the minimum LTS / rolling driver with the workaround along with the following device compiler warning message users will see with AOT compilation:

warning: EU fusion is disabled, it does not work on the current platform if SIMD32 mode specified by intel_reqd_sub_group_size(32)

I have tested through internal CI and tests behave as expected.

Rolling driver version 2506.18 from 02/07 introduces a new way to workaround the -O0 bug that prevents usage of SIMD32 kernels on certain iGPUs. The new workaround reenables SIMD32 kernels with -O0 compilation. Signed-off-by: Matthew Michel <matthew.michel@intel.com>

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h

akukanov · 2025-02-28T11:56:35Z

I am hesitant to revert the current solution at least until there is a long-term support driver with the new behavior. Ideally, there should be no LTS drivers without the new behavior, but maybe that is too much to ask for.

I do not see the SG size difference in debug and release as an issue. It's an implementation detail of oneDPL. which nobody should rely upon.

Relying on the new IGC behavior also resolves the recently documented scenario where kernel name mismatches occur when the host and device compiler use different optimization levels which is the default behavior with the intel/llvm open-source clang++ driver when no optimization flag is specified.

It seems I missed the details of this issue. Why the kernel name even depends on the subgroup size? Is it impacted by the reqd_sub_group_size attribute, or does it come from our name generation?

danhoeflinger · 2025-02-28T13:22:33Z

Relying on the new IGC behavior also resolves the recently documented scenario where kernel name mismatches occur when the host and device compiler use different optimization levels which is the default behavior with the intel/llvm open-source clang++ driver when no optimization flag is specified.

It seems I missed the details of this issue. Why the kernel name even depends on the subgroup size? Is it impacted by the reqd_sub_group_size attribute, or does it come from our name generation?

With different optimization levels between the "device pass" and "host pass" of the compiler, 2 different template instantiations of the submitter are used. At runtime, when we go to JIT compile and/or launch the kernel, we find that no matching kernel was set up by the "device pass" of the compiler and there is a missing kernel error, because the "device pass" used this separate template instantiation and set up a different kernel.

@mmichel11 and I discussed a number of possible but "sneaky" workarounds for this which could make the "device pass" of the compiler see all kernel options, but given this surprise fix it would be better to not need to pursue them.

mmichel11 · 2025-02-28T15:00:50Z

Duplicating a bit from what I have mentioned offline to make it public here:

I am hesitant to revert the current solution at least until there is a long-term support driver with the new behavior. Ideally, there should be no LTS drivers without the new behavior, but maybe that is too much to ask for.

I can ask the IGC team when the specific fix will make it into a LTS driver. LTS drivers are supported for ~3 years, so sometime in 2028 would be when all supported drivers will have this fix.

Why the kernel name even depends on the subgroup size? Is it impacted by the reqd_sub_group_size attribute, or does it come from our name generation?

The unnamed lambda naming procedure causes __parallel_reduce_then_scan_reduce_submitter and __parallel_reduce_then_scan_scan_submitter to be included in the kernel name which has a template argument for the __sub_group_size. It is a good point to think about and the alternative branch I have linked below does work around the immediate issue through removing this template argument.

@mmichel11 and I discussed a number of possible but "sneaky" workarounds for this which could make the "device pass" of the compiler see all kernel options, but given this surprise fix it would be better to not need to pursue them.

Another sneaky fix that works is here: https://github.com/uxlfoundation/oneDPL/tree/dev/mmichel11/remove_sg_sz_template_rts. Moving sub-group size to a constexpr static field of the submitters as opposed to a template argument causes it to no longer be reflected in the kernel name and fixes the mismatch. However, the host still thinks the sub-group size is 16 when it is really 32 and allocates more temporary storage than what is needed. This is a risk if the flipside occurs and the host allocates less temporary storage than what is required which will cause illegal memory accesses. I suppose we could always err on allocating more temporary storage than what is needed, but this mismatch between the host and device compilation feels dangerous to me and I'm concerned it could open us up to similar issues as here.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 · 2025-03-05T21:05:13Z

We have a few different options now each with their own pros and cons. I think it is easiest discuss this in an offline forum, so moving to draft for now.

mmichel11 · 2025-03-17T13:47:56Z

Closing as we have decided to use a different approach which will support older drivers as well.

mmichel11 added this to the 2022.9.0 milestone Feb 27, 2025

mmichel11 requested review from danhoeflinger, adamfidel, akukanov, dmitriy-sobolev and SergeyKopienko February 27, 2025 20:23

SergeyKopienko reviewed Feb 28, 2025

View reviewed changes

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Outdated Show resolved Hide resolved

Address review comment

9d9c963

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 marked this pull request as draft March 5, 2025 21:04

mmichel11 closed this Mar 17, 2025

mmichel11 deleted the dev/mmichel11/use_new_o0_workaround branch March 31, 2025 13:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rely on new IGC behavior to workaround issues with -O0 compilation in reduce-then-scan #2088

Rely on new IGC behavior to workaround issues with -O0 compilation in reduce-then-scan #2088

Uh oh!

mmichel11 commented Feb 27, 2025

Uh oh!

Uh oh!

akukanov commented Feb 28, 2025

Uh oh!

danhoeflinger commented Feb 28, 2025

Uh oh!

mmichel11 commented Feb 28, 2025

Uh oh!

mmichel11 commented Mar 5, 2025

Uh oh!

mmichel11 commented Mar 17, 2025

Uh oh!

Uh oh!

Rely on new IGC behavior to workaround issues with -O0 compilation in reduce-then-scan #2088

Rely on new IGC behavior to workaround issues with -O0 compilation in reduce-then-scan #2088

Uh oh!

Conversation

mmichel11 commented Feb 27, 2025

Uh oh!

Uh oh!

akukanov commented Feb 28, 2025

Uh oh!

danhoeflinger commented Feb 28, 2025

Uh oh!

mmichel11 commented Feb 28, 2025

Uh oh!

mmichel11 commented Mar 5, 2025

Uh oh!

mmichel11 commented Mar 17, 2025

Uh oh!

Uh oh!