
[Intel MKL] Parallelize UnsortedSegmentSum on CPU device #49152

Conversation

Zantares
Contributor

UnsortedSegmentSum is a single-threaded op on CPU; this PR uses the Eigen device to parallelize it. It is an updated version of the old PR #26427.

I am resubmitting this PR because the original one caused an internal crash and was reverted. Since I couldn't get the failing case, I went through the old code and found a potential rounding issue here: https://github.com/Intel-tensorflow/tensorflow/blob/52a6cfddef9b51b608b4a554b77a10e1522d56ec/tensorflow/core/kernels/segment_reduction_ops.cc#L419-L425

This PR uses a simple and reasonable way to balance the workload, as TF usually does:

    // Each output row may receive a different amount of reduction work
    // from the inputs, so balance the workload across a task group; each
    // task handles a range of output row indices.
    const int64 kMaxTaskBlock =
        std::min(num_reductions, (int64)cpu_device.numThreads());
    const int64 kAverTaskSize = (N + kMaxTaskBlock - 1) / kMaxTaskBlock;

Other parts are the same as the original one.

Performance improvement:

  • Origin
    Benchmark                             Time(ns)  Iterations  Throughput
    BM_UnsortedSegmentSum_4096_128_1_       300315        2279   6983.2MB/s
    BM_UnsortedSegmentSum_4096_128_128_     306408        2268   6844.3MB/s
  • Optimized
    Benchmark                             Time(ns)  Iterations  Throughput
    BM_UnsortedSegmentSum_4096_128_1_       302769        2268   6926.6MB/s
    BM_UnsortedSegmentSum_4096_128_128_     116179        6112  18051.1MB/s

This patch parallelizes UnsortedSegmentSum across num_segments, so performance is unchanged when num_segments is 1; otherwise it shows a ~3x improvement in the benchmark.

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label May 13, 2021
@google-cla google-cla bot added the cla: yes label May 13, 2021
@Zantares
Contributor Author

Here's a brief introduction to the optimization:

  1. row_counter stores how many input rows will be reduced into each output row. num_reductions is the effective num_segments, excluding rows with a zero count; it also determines the degree of parallelism.
  2. Loop over row_counter to assign each task to block_range. block_range records the task indices for the shard function.
    (diagram attached to the original comment)

@Zantares
Contributor Author

@ezhulenev can you help review this again?

@gbaned gbaned self-assigned this May 13, 2021
@gbaned gbaned added the comp:core issues related to core part of tensorflow label May 13, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 13, 2021
@gbaned gbaned requested a review from ezhulenev May 13, 2021 07:56
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels May 13, 2021
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer May 13, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 13, 2021
@copybara-service copybara-service bot merged commit b801797 into tensorflow:master May 13, 2021
PR Queue automation moved this from Approved by Reviewer to Merged May 13, 2021
@ezhulenev
Member

It was rolled back because of internal failures; it looks like it produces NaNs. Any ideas? I will try to get better error messages tomorrow.

Error message:

tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [Condition x >= y did not hold element-wise:] [x (<...internal_node_name...>/Sum_2:0) = ] [-nan -nan -nan...]

@Zantares
Contributor Author

Thanks for the reminder. I found that it may cause an accuracy issue (including NaNs) when task_index is 0 but exits early here: https://github.com/tensorflow/tensorflow/pull/49152/files#diff-d29812caca45335066ddacf8fb06afc416f1f557bc446068ebaa163495e0b03bR388-R390. The older PR started from index 1, but the new PR starts from index 0 to simplify the code, and it seems that introduced the new issue.
I will put together a test case locally first, and update the PR later.

@Zantares Zantares deleted the tenglu/parallelize_unsorted branch May 8, 2023 10:14