
Update GPU occupancy checking to utilize CUDA's occupancy calculator #21958

Merged
merged 6 commits into tensorflow:master from CudaOccupancy on Oct 2, 2018

Conversation

MattConley
Contributor

This change makes supporting future architectures easier and removes the dependence on hard-coded GPU data.

- Replace references to the UnqueryableDeviceParams struct with calls to CUDA's built-in occupancy calculation functions
- Update calls to the occupancy checking functions with the new changes
- Changes should provide more long-term reliability and will remove the need to manually update hard-coded data values for new GPU architectures

@tfboyd @zheng-xq

@aaroey aaroey requested a review from jlebar August 31, 2018 05:20
@aaroey aaroey self-assigned this Aug 31, 2018
@jlebar
Contributor

jlebar commented Aug 31, 2018

Thanks for the PR, this looks really good.

My only concern is: Do you know when the new CUDA driver function you're calling was added?

@MattConley
Contributor Author

Looks like the functions were implemented in CUDA 6.5 (mentioned here: https://devblogs.nvidia.com/10-ways-cuda-6-5-improves-performance-productivity/)
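For reference, the two driver-API entry points added in CUDA 6.5 look roughly like this (a minimal untested sketch; assumes `func` is a valid kernel handle obtained from a loaded CUmodule, and ignores error handling):

```cuda
#include <cuda.h>

// Query occupancy for a kernel already loaded as a CUfunction.
void QueryOccupancy(CUfunction func) {
  int blocks_per_sm = 0;
  // Max resident blocks per multiprocessor for a 256-thread block
  // that uses no dynamic shared memory.
  cuOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, func, /*blockSize=*/256, /*dynamicSMemSize=*/0);

  int min_grid_size = 0, suggested_block_size = 0;
  // Block size that maximizes occupancy for this kernel; NULL means
  // dynamic shared memory does not vary with block size.
  cuOccupancyMaxPotentialBlockSize(
      &min_grid_size, &suggested_block_size, func,
      /*blockSizeToDynamicSMemSize=*/NULL,
      /*dynamicSMemSize=*/0, /*blockSizeLimit=*/0);
}
```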

CompareOccupancy(&blocks_per_sm, device_description, regs_per_thread,
smem_per_block, thread_dims, cufunc);
if (suggested_threads != 0) {
VLOG(2) << "The cuda occupancy calculator reccommends using "
Contributor
typo

Contributor Author
Sorry I'm missing it, but which part here is the typo?

Contributor
reccommends (sorry I was coy)

Contributor Author
Ah my bad, thanks!

const DeviceDescription& device_description,
uint64 registers_per_thread,
uint64 shared_memory_per_block,
const ThreadDim& thread_dims, CUfunction func);
Contributor
This file and these functions are platform-independent, but their implementations are platform-dependent. So if anyone calls CalculateOccupancy without using CUDA, they're going to get the wrong answer (or a crash). In fact, if one were to build without CUDA support, StreamExecutor won't even link, as I read this.

This needs to be done somehow in a platform-independent way.

- Maintain functionality; just move the CalculateOccupancy() and CompareOccupancy() methods from device_description to cuda_gpu_executor
- Remove the CUDA requirement from the general class device_description
@MattConley
Contributor Author

Thanks for the feedback; this commit moves the occupancy functions over to cuda_gpu_executor, which was the only file they appeared to be called from. device_description should be fine without CUDA now.

@jlebar jlebar added the kokoro:force-run Tests on submitted change label Sep 4, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 4, 2018
jlebar
jlebar previously approved these changes Sep 4, 2018
@jlebar
Contributor

jlebar commented Sep 4, 2018

lgtm, but let's see what the tests say.

@jlebar
Contributor

jlebar commented Sep 5, 2018

Looks like the tests are clean enough other than the clang-format business. If you can make that change we should be able to merge this.

Thank you again for the patch, this is a good change.

@aaroey aaroey added the kokoro:force-run Tests on submitted change label Sep 6, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 6, 2018
@MattConley
Contributor Author

Thanks for your help; hopefully this fixes the issue.

@aaroey aaroey added the kokoro:force-run Tests on submitted change label Sep 6, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 6, 2018
@aaroey
Member

aaroey commented Sep 6, 2018

Thanks @MattConley for the clang fix, but it seems there are still some issues left; would you mind fixing them (again)?
Thanks for your patience.

@MattConley
Contributor Author

Apologies; I had overlooked the placement of some reference and dereference operators. This should cover all of the required formatting within the changed code.

@aaroey aaroey added the kokoro:force-run Tests on submitted change label Sep 6, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 6, 2018
@tensorflowbutler
Member

Nagging Assignee @aaroey: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@aaroey aaroey added the ready to pull PR ready for merge process label Sep 21, 2018
@MattConley
Contributor Author

@aaroey @jlebar Thanks again for your help; is this ready to be merged in?

@jlebar
Contributor

jlebar commented Sep 30, 2018

FYI: I'm trying to merge this but am running into some build problems internally; still not sure exactly what is going on.

@MattConley
Contributor Author

Thanks for the update; I'm not seeing the issues on my end, but please let me know if I can do anything to help.

@tensorflow-copybara tensorflow-copybara merged commit 6a5090b into tensorflow:master Oct 2, 2018
tensorflow-copybara pushed a commit that referenced this pull request Oct 2, 2018
@MattConley MattConley deleted the CudaOccupancy branch October 3, 2018 21:24
@jlebar
Contributor

jlebar commented Dec 6, 2018

I'm not sure why I didn't see this when I reviewed the patch earlier, but this is actually a functional change for XLA.

Previously we'd set the block size to device_desc.threads_per_core_limit() / device_desc.blocks_per_core_limit(). This is the smallest block size such that we can fill a core with these blocks.

Now we set the block size to the max block size: device_desc.threads_per_block_limit();. That's the opposite of the old code.

Was this change intentional on your part? I don't think it was intentional on my part...

Is there a CUDA API that gives us information that lets us recover the old behavior? (I'm not seeing it.)
Or do we have to go back to hardcoding this info?

@jlebar
Contributor

jlebar commented Dec 12, 2018

@MattConley friendly ping. I don't want to revert this if I don't have to, but I'm not actually sure how to fix this.

@MattConley
Contributor Author

Apologies for the delayed response; this change in behavior was my mistake. It looks like the desired functionality is available through the cuOccupancyMaxActiveBlocksPerMultiprocessor driver function, which is accessible from CUDADriver's GetMaxOccupiedBlocksPerCore wrapper.

I'm looking at adding back the blocks_per_core_limit functionality to the device description, and then letting the cuda_gpu_executor fill in the field using the above call. I'm running into a slight issue of passing in a valid CUfunction, but aside from that it looks like this is the easiest way to achieve desired behavior while avoiding hard-coding the values.

I should have a solution fairly quickly, and will submit a merge request as soon as it's ready.

@jlebar
Contributor

jlebar commented Jan 14, 2019

Hey, friendly ping on this. (I can also pick it up if you're busy; like, there's no obligation, it's not like this is your job. :)

@MattConley
Contributor Author

Just submitted PR #24944 with a potential fix (again, sorry for the delay :). The solution seems somewhat clunky, but it does integrate the occupancy calculator without the need for hardcoded values; definitely open to suggestions.

Labels
cla: yes ready to pull PR ready for merge process
7 participants