Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rely on new IGC behavior to workaround issues with -O0 compilation in reduce-then-scan #2088

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mmichel11
Copy link
Contributor

I recently discovered that after our discussions with the IGC team, IGC reimplemented it's software workaround for the function calling bug that affects certain integrated graphics / DG2 kernels compiled with -O0 due to the large number of issues similar to what we have encountered and worked around in #2046.

The new device compiler behavior enables -O0 to work with SIMD32 kernels (sub-group sizes of 32) with a tradeoff of EU occupancy by disabling a hardware feature known as EU fusion. With this approach, my understanding is that EU occupancy only reaches 50% as opposed to our current SIMD16 workaround which has no such issue. The performance impact is acceptable from my perspective given that this only occurs when the kernels are compiled with -O0 which will likely be used for debugging only. This also avoids changing the kernel SIMD width between optimization levels which may impact debuggability from the user's side.

Relying on the new IGC behavior also resolves the recently documented scenario where kernel name mismatches occur when the host and device compiler use different optimization levels which is the default behavior with the intel/llvm open-source clang++ driver when no optimization flag is specified.

GPU Driver support this patch relies on is currently limited to the 2506.18 rolling driver: https://dgpu-docs.intel.com/releases/releases.html with no current LTS driver support. Given that onedpl 2022.8.0 has already dropped with our workaround, users on LTS drivers will not see a kernel compilation crash with this release. For onedpl 2022.9.0 and beyond, we can document the minimum LTS / rolling driver with the workaround along with the following device compiler warning message users will see with AOT compilation:

warning: EU fusion is disabled, it does not work on the current platform if SIMD32 mode specified by intel_reqd_sub_group_size(32)

I have tested through internal CI and tests behave as expected.

Rolling driver version 2506.18 from 02/07 introduces a new way to
workaround the -O0 bug that prevents usage of SIMD32 kernels on certain
iGPUs. The new workaround reenables SIMD32 kernels with -O0 compilation.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@akukanov
Copy link
Contributor

I am hesitant to revert the current solution at least until there is a long-term support driver with the new behavior. Ideally, there should be no LTS drivers without the new behavior, but maybe that is too much to ask for.

I do not see the SG size difference in debug and release as an issue. It's an implementation detail of oneDPL. which nobody should rely upon.

Relying on the new IGC behavior also resolves the recently documented scenario where kernel name mismatches occur when the host and device compiler use different optimization levels which is the default behavior with the intel/llvm open-source clang++ driver when no optimization flag is specified.

It seems I missed the details of this issue. Why the kernel name even depends on the subgroup size? Is it impacted by the reqd_sub_group_size attribute, or does it come from our name generation?

@danhoeflinger
Copy link
Contributor

Relying on the new IGC behavior also resolves the recently documented scenario where kernel name mismatches occur when the host and device compiler use different optimization levels which is the default behavior with the intel/llvm open-source clang++ driver when no optimization flag is specified.

It seems I missed the details of this issue. Why the kernel name even depends on the subgroup size? Is it impacted by the reqd_sub_group_size attribute, or does it come from our name generation?

With different optimization levels between the "device pass" and "host pass" of the compiler, 2 different template instantiations of the submitter are used. At runtime, when we go to JIT compile and/or launch the kernel, we find that no matching kernel was set up by the "device pass" of the compiler and there is a missing kernel error, because the "device pass" used this separate template instantiation and set up a different kernel.

@mmichel11 and I discussed a number of possible but "sneaky" workarounds for this which could make the "device pass" of the compiler see all kernel options, but given this surprise fix it would be better to not need to pursue them.

@mmichel11
Copy link
Contributor Author

Duplicating a bit from what I have mentioned offline to make it public here:

I am hesitant to revert the current solution at least until there is a long-term support driver with the new behavior. Ideally, there should be no LTS drivers without the new behavior, but maybe that is too much to ask for.

I can ask the IGC team when the specific fix will make it into a LTS driver. LTS drivers are supported for ~3 years, so sometime in 2028 would be when all supported drivers will have this fix.

Why the kernel name even depends on the subgroup size? Is it impacted by the reqd_sub_group_size attribute, or does it come from our name generation?

The unnamed lambda naming procedure causes __parallel_reduce_then_scan_reduce_submitter and __parallel_reduce_then_scan_scan_submitter to be included in the kernel name which has a template argument for the __sub_group_size. It is a good point to think about and the alternative branch I have linked below does work around the immediate issue through removing this template argument.

@mmichel11 and I discussed a number of possible but "sneaky" workarounds for this which could make the "device pass" of the compiler see all kernel options, but given this surprise fix it would be better to not need to pursue them.

Another sneaky fix that works is here: https://github.com/uxlfoundation/oneDPL/tree/dev/mmichel11/remove_sg_sz_template_rts. Moving sub-group size to a constexpr static field of the submitters as opposed to a template argument causes it to no longer be reflected in the kernel name and fixes the mismatch. However, the host still thinks the sub-group size is 16 when it is really 32 and allocates more temporary storage than what is needed. This is a risk if the flipside occurs and the host allocates less temporary storage than what is required which will cause illegal memory accesses. I suppose we could always err on allocating more temporary storage than what is needed, but this mismatch between the host and device compilation feels dangerous to me and I'm concerned it could open us up to similar issues as here.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@mmichel11 mmichel11 marked this pull request as draft March 5, 2025 21:04
@mmichel11
Copy link
Contributor Author

We have a few different options now each with their own pros and cons. I think it is easiest discuss this in an offline forum, so moving to draft for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants