
[Intel MKL] Parallelize UnsortedSegmentSum on CPU device #49152

Conversation

Zantares
Contributor

UnsortedSegmentSum is a single-threaded op on CPU; this PR uses the Eigen device to parallelize it. It is an updated version of the old PR #26427.

I am resubmitting this PR because the original one caused an internal crash and was reverted. Since I couldn't get the failing case, I went through the old code and found a potential rounding issue here: https://github.com/Intel-tensorflow/tensorflow/blob/52a6cfddef9b51b608b4a554b77a10e1522d56ec/tensorflow/core/kernels/segment_reduction_ops.cc#L419-L425

This PR uses a simple and reasonable way to balance the workload, as TF usually does:

    // Each output row may receive a different amount of reduction work
    // from the inputs, so balance the workload across a task group; each
    // task handles a range of output row indices.
    const int64 kMaxTaskBlock =
        std::min(num_reductions, (int64)cpu_device.numThreads());
    const int64 kAverTaskSize = (N + kMaxTaskBlock - 1) / kMaxTaskBlock;

Other parts are the same as the original one.

Performance improvement:

  • Origin
    Benchmark                             Time(ns)  Iterations  Throughput
    BM_UnsortedSegmentSum_4096_128_1_       300315        2279   6983.2MB/s
    BM_UnsortedSegmentSum_4096_128_128_     306408        2268   6844.3MB/s
  • Optimized
    Benchmark                             Time(ns)  Iterations  Throughput
    BM_UnsortedSegmentSum_4096_128_1_       302769        2268   6926.6MB/s
    BM_UnsortedSegmentSum_4096_128_128_     116179        6112  18051.1MB/s

This patch parallelizes UnsortedSegmentSum across num_segments, so performance is unchanged when num_segments is 1; otherwise it shows a ~3x improvement in the benchmark.

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label May 13, 2021
@google-cla google-cla bot added the cla: yes label May 13, 2021
@Zantares
Contributor Author

Here's a brief introduction to the optimization:

  1. row_counter stores how many input rows will be reduced into each output row. num_reductions is the effective num_segments, excluding rows with a zero count; it also determines the degree of parallelism.
  2. Loop over row_counter to assign each task to block_range. block_range records the task indices for the shard function.
    (diagram attached to the original comment)

@Zantares
Contributor Author

@ezhulenev can you help review this again?

@gbaned gbaned self-assigned this May 13, 2021
@gbaned gbaned added the comp:core issues related to core part of tensorflow label May 13, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 13, 2021
@gbaned gbaned requested a review from ezhulenev May 13, 2021 07:56
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels May 13, 2021
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer May 13, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 13, 2021
@copybara-service copybara-service bot merged commit b801797 into tensorflow:master May 13, 2021
PR Queue automation moved this from Approved by Reviewer to Merged May 13, 2021
@ezhulenev
Member

It was rolled back because of internal failures; it looks like it produces NaNs. Any ideas? I will try to get better error messages tomorrow.

Error message:

tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [Condition x >= y did not hold element-wise:] [x (<...internal_node_name...>/Sum_2:0) = ] [-nan -nan -nan...]

@Zantares
Contributor Author

Thanks for the reminder. I found that it may cause an accuracy issue (including NaNs) when task_index is 0 but exits early here: https://github.com/tensorflow/tensorflow/pull/49152/files#diff-d29812caca45335066ddacf8fb06afc416f1f557bc446068ebaa163495e0b03bR388-R390. The older PR started from index 1, but the new PR starts from index 0 to simplify the code, and it seems that introduced the new issue.
I will put together a test case locally first, and update the PR later.

@Zantares Zantares deleted the tenglu/parallelize_unsorted branch May 8, 2023 10:14