Rely on new IGC behavior to workaround issues with -O0 compilation in reduce-then-scan #2088
base: main
Rolling driver version 2506.18 from 02/07 introduces a new way to work around the -O0 bug that prevents usage of SIMD32 kernels on certain iGPUs. The new workaround re-enables SIMD32 kernels with -O0 compilation. Signed-off-by: Matthew Michel <matthew.michel@intel.com>
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
I am hesitant to revert the current solution, at least until there is a long-term support driver with the new behavior. Ideally, there should be no LTS drivers without the new behavior, but maybe that is too much to ask for. I do not see the SG size difference between debug and release as an issue. It's an implementation detail of oneDPL, which nobody should rely upon.
It seems I missed the details of this issue. Why does the kernel name even depend on the subgroup size? Is it impacted by the |
With different optimization levels between the "device pass" and "host pass" of the compiler, two different template instantiations of the submitter are used. At runtime, when we go to JIT compile and/or launch the kernel, we find that no matching kernel was set up by the "device pass" of the compiler, and there is a missing kernel error because the "device pass" used a separate template instantiation and set up a different kernel. @mmichel11 and I discussed a number of possible but "sneaky" workarounds for this which could make the "device pass" of the compiler see all kernel options, but given this surprise fix it would be better not to need to pursue them.
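The mismatch described above can be modeled without SYCL. The following is only a minimal sketch: `hypothetical_submitter`, `pick_sub_group_size`, `name_seen_by_pass`, and `kernel_is_found` are made-up names, the string stands in for the compiler-generated kernel name, and the rule "`-O0` selects 16, otherwise 32" is an assumption for illustration rather than oneDPL's actual selection logic.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for a kernel submitter templated on sub-group size.
// In real SYCL code the kernel name is derived from the template arguments;
// here a plain string plays that role.
template <int SubGroupSize>
struct hypothetical_submitter
{
    static std::string kernel_name()
    {
        return "hypothetical_scan_kernel<sg=" + std::to_string(SubGroupSize) + ">";
    }
};

// Assumption for illustration: a pass compiling at -O0 picks SIMD16,
// any other optimization level picks SIMD32.
constexpr int pick_sub_group_size(bool compiled_at_O0)
{
    return compiled_at_O0 ? 16 : 32;
}

// Kernel name that a compiler pass instantiates for its own optimization level.
inline std::string name_seen_by_pass(bool compiled_at_O0)
{
    return compiled_at_O0 ? hypothetical_submitter<pick_sub_group_size(true)>::kernel_name()
                          : hypothetical_submitter<pick_sub_group_size(false)>::kernel_name();
}

// True when the kernel requested by the host pass was registered by the device pass.
inline bool kernel_is_found(bool device_at_O0, bool host_at_O0)
{
    const std::set<std::string> registered{name_seen_by_pass(device_at_O0)};
    return registered.count(name_seen_by_pass(host_at_O0)) == 1;
}
```

In this model, `kernel_is_found(true, true)` and `kernel_is_found(false, false)` succeed, while any combination where the two passes disagree on the optimization level looks up a name that was never registered, which corresponds to the missing kernel error.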
Duplicating a bit from what I have mentioned offline to make it public here:
I can ask the IGC team when the specific fix will make it into an LTS driver. LTS drivers are supported for ~3 years, so sometime in 2028 would be when all supported drivers will have this fix.
The unnamed lambda naming procedure causes
Another sneaky fix that works is here: https://github.com/uxlfoundation/oneDPL/tree/dev/mmichel11/remove_sg_sz_template_rts. Moving sub-group size to a |
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
We have a few different options now, each with its own pros and cons. I think it is easiest to discuss this in an offline forum, so moving to draft for now.
I recently discovered that, after our discussions with the IGC team, IGC reimplemented its software workaround for the function calling bug that affects certain integrated graphics / DG2 kernels compiled with `-O0`, due to the large number of issues similar to what we encountered and worked around in #2046.

The new device compiler behavior enables `-O0` to work with SIMD32 kernels (sub-group sizes of `32`) at the cost of EU occupancy, since it disables a hardware feature known as EU fusion. With this approach, my understanding is that EU occupancy only reaches 50%, whereas our current SIMD16 workaround has no such issue. The performance impact is acceptable from my perspective given that it only occurs when the kernels are compiled with `-O0`, which will likely be used for debugging only. This also avoids changing the kernel SIMD width between optimization levels, which may impact debuggability from the user's side.

Relying on the new IGC behavior also resolves the recently documented scenario where kernel name mismatches occur when the host and device compilers use different optimization levels, which is the default behavior with the intel/llvm open-source clang++ driver when no optimization flag is specified.

GPU driver support that this patch relies on is currently limited to the `2506.18` rolling driver (https://dgpu-docs.intel.com/releases/releases.html), with no current LTS driver support. Given that oneDPL 2022.8.0 has already been released with our workaround, users on LTS drivers will not see a kernel compilation crash with this release. For oneDPL 2022.9.0 and beyond, we can document the minimum LTS / rolling driver with the workaround along with the following device compiler warning message users will see with AOT compilation:

I have tested through internal CI and tests behave as expected.