Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new GPU kernel for [unsorted] segment reductions #51392

Conversation

benbarsdell
Copy link
Contributor

  • Optionally replaces the old atomics-based kernels with calls to SegmentReduceGPU (the same kernel already used for sparse segment reductions). This behavior is enabled by default, but the old kernels can be re-enabled by setting the environment variable TF_USE_ATOMIC_SEGMENT_REDUCTIONS=1. On Windows, the old kernels are always used due to a build issue with the new kernel.
  • This improves performance, and guarantees that these ops are deterministic. In future it is hoped that the old kernels can be removed completely.
  • Also adds a GPU kernel registration for SegmentMean, which didn't previously exist.

cc @nluehr @reedwm @duncanriach

@google-ml-butler google-ml-butler bot added the size:L CL Change Size: Large label Aug 9, 2021
@google-cla google-cla bot added the cla: yes label Aug 9, 2021
@gbaned gbaned self-assigned this Aug 9, 2021
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Aug 9, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Aug 9, 2021
@gbaned gbaned requested a review from reedwm August 9, 2021 14:08
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Aug 9, 2021
@benbarsdell benbarsdell force-pushed the gpu-SegmentReductions-new-rebased2 branch from 3d67ee5 to bb4bf3f Compare August 9, 2021 14:49
@reedwm reedwm requested review from sanjoy and removed request for reedwm August 9, 2021 22:30
@@ -20,6 +20,22 @@ limitations under the License.

namespace tensorflow {

bool UseAtomicSegmentReductions() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this UseNonDeterministicSegmentReductions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

functor::NonAtomicMaxOpGpu<T>, \
functor::AtomicMaxOpGpu<T>>;
#define DEFINE_SORTED_GPU_SPECS_INDEX(T, Index) \
template struct SegmentReductionFunctor<T, Index, functor::Zero<T>, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add /*EmptySegmentValueF=*/ and /*InitialValueF=*/ here to make it clear which arg is which? Same elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Aug 10, 2021
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Aug 12, 2021
@duncanriach
Copy link
Contributor

duncanriach commented Aug 12, 2021

First pass through this. Nice work, @benbarsdell. A couple of thoughts:

  1. The original code differentiated between atomic and nonatomic reductions (because there were sometimes both), but not for reasons of determinism. And, of course, atomics can be used in deterministic code. So, there's some lingering old symbology around atomic/nonatomic which has gotten tangled up with (or repurposed for) deterministic/nondeterministic operation. @sanjoy picked out one example of that above. There are other places in the code where perhaps "non-atomic" would ideally be replaced with "deterministic" and where "atomic" would ideally be replaced with "nondeterministic."
  2. I'm wondering if we should add tests to confirm determinism, or if we should rely on it being deterministic by design. There will, of course, be backup testing from @reedwm's determinism auto-checker.

- Optionally replaces the old atomics-based kernels with calls to
  SegmentReduceGPU (the same kernel already used for sparse segment
  reductions). This behavior is enabled by default, but the old kernels
  can be re-enabled by setting the environment variable
  TF_USE_ATOMIC_SEGMENT_REDUCTIONS=1. On Windows, the old kernels are
  always used due to a build issue with the new kernel.
- This improves performance, and guarantees that these ops are
  deterministic. In future it is hoped that the old kernels can be
  removed completely.
- Also adds a GPU kernel registration for SegmentMean, which didn't
  previously exist.
- Extends existing tests to cover several different inner dimension
  sizes (important for the GPU implementations).
@benbarsdell
Copy link
Contributor Author

Thanks for the suggestions Duncan. I'm facing some build issues right now but will push the additional atomic->deterministic rewordings tomorrow along with a rebase.

I haven't looked closely at how the determinism tests work but will take a look.

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Aug 25, 2021
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 25, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 25, 2021
@gbaned gbaned added kokoro:force-run Tests on submitted change and removed awaiting review Pull request awaiting review labels Aug 25, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 25, 2021
@copybara-service copybara-service bot merged commit 948248c into tensorflow:master Aug 26, 2021
PR Queue automation moved this from Approved by Reviewer to Merged Aug 26, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 26, 2021
@reedwm
Copy link
Member

reedwm commented Aug 26, 2021

This was rolled back in 44cdcda. Will investigate.

copybara-service bot pushed a commit that referenced this pull request Sep 2, 2021
PiperOrigin-RevId: 394502922
Change-Id: I4e183dc73bf209e4623682d8d622117e9f1de28a
@reedwm
Copy link
Member

reedwm commented Sep 2, 2021

Rolled forward in 9a1072e

copybara-service bot pushed a commit that referenced this pull request Sep 4, 2021
Causes an internal performance regression.

PiperOrigin-RevId: 394788152
Change-Id: I702fb4ec245823b96ce82f58b1b0d6c505b674c3
@reedwm
Copy link
Member

reedwm commented Sep 4, 2021

Unfortunately this was rolled back again in 51ee415, since it caused a performance regression in an internal model. Also, the test segment_reduction_ops_deterministic_test.py failed on Windows, since the deterministic algorithms were disabled on Windows.

I think the best approach here is to keep using the old nondeterministic kernels by default, and only use the new ones if either determinism is enabled or an environmental variable is set (say, TF_USE_DETERMINISTIC_SEGMENT_REDUCTIONS). Then the determinism API is unblocked and we can internally debug the performance issue, and potentially send you an example to reproduce.

@benbarsdell, do you want to create a new PR with the new kernels disabled by default, by this Wednesday? If not, I can rollforward, disabling the new kernels by default. I want to get this in as soon as possible since it is required for determinism.

benbarsdell added a commit to benbarsdell/tensorflow that referenced this pull request Sep 7, 2021
arovir01 pushed a commit to arovir01/tensorflow that referenced this pull request Sep 17, 2021
PiperOrigin-RevId: 394502922
Change-Id: I4e183dc73bf209e4623682d8d622117e9f1de28a
arovir01 pushed a commit to arovir01/tensorflow that referenced this pull request Sep 17, 2021
Causes an internal performance regression.

PiperOrigin-RevId: 394788152
Change-Id: I702fb4ec245823b96ce82f58b1b0d6c505b674c3
arovir01 pushed a commit to arovir01/tensorflow that referenced this pull request Sep 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes comp:core issues related to core part of tensorflow size:L CL Change Size: Large
Projects
PR Queue
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

7 participants