
Update GPU occupancy checking to utilize CUDA's occupancy calculator #21958

Merged
merged 6 commits into tensorflow:master from CudaOccupancy on Oct 2, 2018

Conversation

MattConley
Contributor

This change makes supporting future architectures easier and removes the dependence on hard-coded GPU data.

- Replace references to the UnqueryableDeviceParams struct with calls to CUDA's built-in occupancy calculation functions
- Update calls to the occupancy checking functions with the new changes
- Changes should provide more long-term reliability and will remove the need to manually update hard-coded data values for new GPU architectures

@tfboyd @zheng-xq

@aaroey aaroey requested a review from jlebar August 31, 2018 05:20
@aaroey aaroey self-assigned this Aug 31, 2018
@jlebar
Contributor

jlebar commented Aug 31, 2018

Thanks for the PR, this looks really good.

My only concern is: Do you know when the new CUDA driver function you're calling was added?

@MattConley
Contributor Author

Looks like the functions were implemented in CUDA 6.5 (mentioned here: https://devblogs.nvidia.com/10-ways-cuda-6-5-improves-performance-productivity/)
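For reference, the two driver-API entry points added in CUDA 6.5 look roughly like this (a minimal untested sketch; assumes `func` is a valid kernel handle obtained from a loaded CUmodule, and ignores error handling):

```cuda
#include <cuda.h>

// Query occupancy for a kernel already loaded as a CUfunction.
void QueryOccupancy(CUfunction func) {
  int blocks_per_sm = 0;
  // Max resident blocks per multiprocessor for a 256-thread block
  // that uses no dynamic shared memory.
  cuOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, func, /*blockSize=*/256, /*dynamicSMemSize=*/0);

  int min_grid_size = 0, suggested_block_size = 0;
  // Block size that maximizes occupancy for this kernel; NULL means
  // dynamic shared memory does not vary with block size.
  cuOccupancyMaxPotentialBlockSize(
      &min_grid_size, &suggested_block_size, func,
      /*blockSizeToDynamicSMemSize=*/NULL,
      /*dynamicSMemSize=*/0, /*blockSizeLimit=*/0);
}
```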

CompareOccupancy(&blocks_per_sm, device_description, regs_per_thread,
smem_per_block, thread_dims, cufunc);
if (suggested_threads != 0) {
VLOG(2) << "The cuda occupancy calculator reccommends using "
Contributor
typo

Contributor Author
Sorry I'm missing it, but which part here is the typo?

Contributor
reccommends (sorry I was coy)

Contributor Author
Ah my bad, thanks!

const DeviceDescription& device_description,
uint64 registers_per_thread,
uint64 shared_memory_per_block,
const ThreadDim& thread_dims, CUfunction func);
Contributor
This file and these functions are platform-independent, but their implementations are platform-dependent. So if anyone calls CalculateOccupancy without using CUDA, they're going to get the wrong answer (or a crash). In fact, if one were to build without CUDA support, StreamExecutor won't even link, as I read this.

This needs to be done somehow in a platform-independent way.

- Maintain functionality; just move the CalculateOccupancy() and CompareOccupancy() methods from device_description to cuda_gpu_executor
- Remove the CUDA requirement from the general class device_description
@MattConley
Contributor Author

Thanks for the feedback; this commit moves the occupancy functions over to cuda_gpu_executor, which was the only file they appeared to be called from. device_description should be fine without CUDA now.

@jlebar jlebar added the kokoro:force-run Tests on submitted change label Sep 4, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 4, 2018
jlebar
jlebar previously approved these changes Sep 4, 2018
@jlebar
Contributor

jlebar commented Sep 4, 2018

lgtm, but let's see what the tests say.

@jlebar
Contributor

jlebar commented Sep 5, 2018

Looks like the tests are clean enough other than the clang-format business. If you can make that change we should be able to merge this.

Thank you again for the patch, this is a good change.

@aaroey aaroey added the kokoro:force-run Tests on submitted change label Sep 6, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 6, 2018
@MattConley
Contributor Author

Thanks for your help; hopefully this fixes the issue.

@aaroey aaroey added the kokoro:force-run Tests on submitted change label Sep 6, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 6, 2018
@aaroey
Member

aaroey commented Sep 6, 2018

Thanks @MattConley for the clang fix, but it seems there are still some issues left; would you mind fixing them (again)?
Thanks for your patience.

@MattConley
Contributor Author

Apologies; I had overlooked the placement of some reference and dereference operators. This should cover all of the required formatting within the changed code.

@aaroey aaroey added the kokoro:force-run Tests on submitted change label Sep 6, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 6, 2018
@tensorflowbutler
Member

Nagging Assignee @aaroey: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@aaroey aaroey added the ready to pull PR ready for merge process label Sep 21, 2018
@MattConley
Contributor Author

@aaroey @jlebar Thanks again for your help; is this ready to be merged in?

@jlebar
Contributor

jlebar commented Sep 30, 2018

FYI: I'm trying to merge this but am running into some build problems internally; still not sure exactly what is going on.

@MattConley
Contributor Author

Thanks for the update; I'm not seeing the issues on my end, but please let me know if I can do anything to help.

@tensorflow-copybara tensorflow-copybara merged commit 6a5090b into tensorflow:master Oct 2, 2018
tensorflow-copybara pushed a commit that referenced this pull request Oct 2, 2018
@MattConley MattConley deleted the CudaOccupancy branch October 3, 2018 21:24
@jlebar
Contributor

jlebar commented Dec 6, 2018

I'm not sure why I didn't see this when I reviewed the patch earlier, but this is actually a functional change for XLA.

Previously we'd set the block size to device_desc.threads_per_core_limit() / device_desc.blocks_per_core_limit(). This is the smallest block size such that we can fill a core with these blocks.

Now we set the block size to the max block size: device_desc.threads_per_block_limit();. That's the opposite of the old code.

Was this change intentional on your part? I don't think it was intentional on my part...

Is there a CUDA API that gives us information that lets us recover the old behavior? (I'm not seeing it.)
Or do we have to go back to hardcoding this info?

@jlebar
Contributor

jlebar commented Dec 12, 2018

@MattConley friendly ping. I don't want to revert this if I don't have to, but I'm not actually sure how to fix this.

@MattConley
Contributor Author

Apologies for the delayed response; this change in behavior was my mistake. It looks like the desired functionality is available through the cuOccupancyMaxActiveBlocksPerMultiprocessor driver function, which is accessible from CUDADriver's GetMaxOccupiedBlocksPerCore wrapper.

I'm looking at adding back the blocks_per_core_limit functionality to the device description, and then letting the cuda_gpu_executor fill in the field using the above call. I'm running into a slight issue of passing in a valid CUfunction, but aside from that it looks like this is the easiest way to achieve desired behavior while avoiding hard-coding the values.

I should have a solution fairly quickly, and will submit a merge request as soon as it's ready.

@jlebar
Contributor

jlebar commented Jan 14, 2019

Hey, friendly ping on this. (I can also pick it up if you're busy; like, there's no obligation, it's not like this is your job. :)

@MattConley
Contributor Author

Just submitted PR #24944 with a potential fix (again, sorry for the delay :). The solution seems somewhat clunky, but it does integrate the occupancy calculator without the need for hardcoded values; definitely open to suggestions.

Labels
cla: yes ready to pull PR ready for merge process
7 participants