Update GPU occupancy checking to utilize CUDA's occupancy calculator #21958
…functions
-Replace references to the UnqueryableDeviceParams struct with calls to CUDA's built-in occupancy calculation functions
-Update calls to the occupancy checking functions with the new changes
-Changes should provide more long-term reliability and will remove the need to manually update hardcoded data values for new GPU architectures
Thanks for the PR, this looks really good. My only concern is: Do you know when the new CUDA driver function you're calling was added?
Looks like the functions were implemented in CUDA 6.5 (mentioned here: https://devblogs.nvidia.com/10-ways-cuda-6-5-improves-performance-productivity/)
CompareOccupancy(&blocks_per_sm, device_description, regs_per_thread,
smem_per_block, thread_dims, cufunc);
if (suggested_threads != 0) {
VLOG(2) << "The cuda occupancy calculator reccommends using "
typo
Sorry I'm missing it, but which part here is the typo?
reccommends (sorry I was coy)
Ah my bad, thanks!
const DeviceDescription& device_description,
uint64 registers_per_thread,
uint64 shared_memory_per_block,
const ThreadDim& thread_dims, CUfunction func);
This file and these functions are platform-independent, but their implementations are platform-dependent. So if anyone calls CalculateOccupancy, they're going to get the wrong answer (or a crash) if they're not using CUDA. In fact, if one were to build without CUDA support, StreamExecutor wouldn't even link, as I read this.
This needs to be done somehow in a platform-independent way.
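One way to picture the separation being asked for here is the minimal sketch below. It is not the StreamExecutor code: the `cuda_backend` namespace, the `KernelHandle` alias (a stand-in for `CUfunction`), and the arithmetic inside `CalculateOccupancy` are all assumptions made so the example compiles without the CUDA toolkit. The point is only that the device description holds plain data, while anything that needs a CUDA handle lives in CUDA-only code.

```cpp
#include <cassert>
#include <cstdint>

// Platform-independent part: plain data only, no CUDA headers or types.
struct DeviceDescription {
  std::uint64_t threads_per_core_limit;
};

// Platform-specific part (hypothetical stand-in for a CUDA-only source file).
namespace cuda_backend {

// Stand-in for CUfunction so this sketch compiles without the CUDA toolkit.
using KernelHandle = const void*;

// Hypothetical helper: a real version would query the CUDA driver using
// `kernel` instead of dividing two numbers.
inline std::uint64_t CalculateOccupancy(const DeviceDescription& device,
                                        std::uint64_t threads_per_block,
                                        KernelHandle kernel) {
  (void)kernel;  // unused in this simplified stand-in
  assert(threads_per_block > 0);
  return device.threads_per_core_limit / threads_per_block;
}

}  // namespace cuda_backend
```

With this layout, code built without CUDA can still include and use `DeviceDescription`, and only the CUDA backend ever refers to `cuda_backend::CalculateOccupancy`.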
-Maintain functionality, just move CalculateOccupancy() and CompareOccupancy() methods from device_description to cuda_gpu_executor
-Remove CUDA requirement in general class device_description
Thanks for the feedback. This commit moves the occupancy functions over to cuda_gpu_executor, which was the only file they appeared to be called from. device_description should be fine without CUDA now.
lgtm, but let's see what the tests say.
Looks like the tests are clean enough other than the clang-format business. If you can make that change we should be able to merge this. Thank you again for the patch, this is a good change.
Thanks for your help, hopefully this fixes the issue.
Thanks @MattConley for the clang fix, but it seems there are still some left. Would you mind fixing them (again)?
Apologies, I had overlooked the placement of some references and dereferences; this should be all of the required formats within the changed code.
Nagging Assignee @aaroey: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
FYI I'm trying to merge this, running into some build problems internally, still not sure exactly what is going on.
Thanks for the update; I'm not getting the issues on my end, please let me know if I can do anything to help.
PiperOrigin-RevId: 215331087
I'm not sure why I didn't see this when I reviewed the patch earlier, but this is actually a functional change for XLA. Previously we'd set the block size to Now we set the block size to the max block size: Was this change intentional on your part? I don't think it was intentional on my part... Is there a CUDA API that gives us information that lets us recover the old behavior? (I'm not seeing it.)
@MattConley friendly ping. I don't want to revert this if I don't have to, but I'm not actually sure how to fix this.
Apologies for the delayed response; this change in behavior was my mistake. It looks like the desired functionality is available through the I'm looking at adding back the I should have a solution fairly quickly, and will submit a merge request as soon as it's ready.
Hey, friendly ping on this. (I can also pick it up if you're busy; like, there's no obligation, it's not like this is your job. :)
Just submitted PR #24944 with a potential fix. (Again, sorry for the delay :) The solution seems somewhat clunky, but does integrate the occupancy calculator without the need for hardcoded values; definitely open to suggestions.
This change will make supporting future architectures easier and removes dependence on hard-coded GPU data
-Replace references to the UnqueryableDeviceParams struct with calls to CUDA's built-in occupancy calculation functions
-Update calls to the occupancy checking functions with the new changes
-Changes should provide more long-term reliability and will remove the need to manually update hard-coded data values for new GPU architectures
@tfboyd @zheng-xq
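To make the motivation concrete, here is a rough sketch of the kind of hand-maintained occupancy arithmetic that relies on hard-coded per-architecture numbers. It is a simplified model, not the code removed by this PR and not what CUDA's occupancy calculator computes: the `SmLimits` struct, the `BlocksPerSm` function, and any limit values passed to it are hypothetical, and real hardware also applies register and shared-memory allocation granularities that are ignored here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-architecture limits. In a hand-maintained calculator these
// would have to be updated by hand for every new GPU generation.
struct SmLimits {
  std::uint64_t max_threads_per_sm;
  std::uint64_t max_blocks_per_sm;
  std::uint64_t registers_per_sm;
  std::uint64_t shared_memory_per_sm;  // bytes
};

// Simplified model: resident blocks per SM are bounded by the tightest of the
// thread, block-slot, register, and shared-memory limits.
std::uint64_t BlocksPerSm(const SmLimits& limits,
                          std::uint64_t threads_per_block,
                          std::uint64_t registers_per_thread,
                          std::uint64_t shared_memory_per_block) {
  assert(threads_per_block > 0);
  std::uint64_t blocks =
      std::min(limits.max_blocks_per_sm,
               limits.max_threads_per_sm / threads_per_block);
  if (registers_per_thread > 0) {
    blocks = std::min(blocks, limits.registers_per_sm /
                                  (registers_per_thread * threads_per_block));
  }
  if (shared_memory_per_block > 0) {
    blocks = std::min(blocks,
                      limits.shared_memory_per_sm / shared_memory_per_block);
  }
  return blocks;
}
```

Every field of `SmLimits` is a number someone has to look up and keep current per architecture, which is the maintenance burden that querying the driver's occupancy functions is meant to remove.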