Implement horizontal input fusion. #43051
Conversation
@thomasjoerg could you help review this PR when you have a moment? Thanks!
name = "horizontal_input_fusion_test", | ||
srcs = ["horizontal_input_fusion_test.cc"], | ||
deps = [ | ||
":fusion_merger", |
I think you don't need this. Please try to keep the deps minimal.
My oversight. Cleaned them up.
if (!instr.IsMultiOutputFusion()) {
  return fused_expression_root;
}
// If possible, we want to pick a reduction-to-vector operand of the
Unrelated to your change, but since you are at it: "reduction-to-vector" is outdated terminology. Please replace it with "reduction from or to contiguous dimensions". Thanks!
if (IsReductionFromOrToContiguousDimensions(*inst)) {
  return inst;
}
const HloInstruction* GetMajorNodeForMultiOutputFusion(
Naming: What we are looking for is the HLO in the fusion that determines the emitter to be used. We casually refer to this HLO as "the real hero" sometimes. Not sure that's a great name, but "major" is very ambiguous. Therefore, I'd prefer GetRealHeroOfMultiOutputFusion. The code comment in the header explains what it means.
Revised according to your suggestion. Thanks!
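For context, a minimal sketch of what the renamed helper might look like, pieced together from the diff fragments above; the exact signature and fallback behavior in the PR may differ:

```cpp
// Sketch: the "real hero" is the HLO in the fusion that determines which
// emitter will be used. For a multi-output fusion, prefer a reduction from
// or to contiguous dimensions among the root tuple's operands.
const HloInstruction* GetRealHeroOfMultiOutputFusion(
    const HloInstruction& instr) {
  const HloInstruction* fused_expression_root = instr.fused_expression_root();
  if (!instr.IsMultiOutputFusion()) {
    return fused_expression_root;
  }
  for (const HloInstruction* inst : fused_expression_root->operands()) {
    if (IsReductionFromOrToContiguousDimensions(*inst)) {
      return inst;
    }
  }
  // Fall back to the first operand of the root tuple.
  return fused_expression_root->operands()[0];
}
```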
if (user->opcode() == HloOpcode::kGetTupleElement) {
  // Skip GTE.
  return IsConsumerTheOnlyNonRootUser(*user, consumer);
} else if (user == &consumer) {
You `return` in the line above. Please drop the `else`, just `if`, it's cleaner.
Revised.
return true;
} else if (user == user->parent()->root_instruction()) {
  // Consumed by ROOT is always fine, since it is impossible to create
  // cycles through ROOT.
Not clear why we care about cycles here. Please clarify the comment.
Revised. We simply should not have these comments here. They are confusing.
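To illustrate the two suggestions above (drop `else` after `return`, drop the cycle comments), here is a rough sketch of how the checks could read afterwards; the helper name and iteration come from the diff fragments and may not match the final PR code exactly:

```cpp
// Returns true if `consumer` is the only non-ROOT user of `instr`.
bool IsConsumerTheOnlyNonRootUser(const HloInstruction& instr,
                                  const HloInstruction& consumer) {
  return absl::c_all_of(instr.users(), [&](const HloInstruction* user) {
    if (user->opcode() == HloOpcode::kGetTupleElement) {
      // Skip GTE and check its users instead.
      return IsConsumerTheOnlyNonRootUser(*user, consumer);
    }
    if (user == &consumer) {
      return true;
    }
    if (user == user->parent()->root_instruction()) {
      // Being consumed by ROOT is fine.
      return true;
    }
    return false;
  });
}
```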
// Gets the representative input shape of the multi-output fusion.
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) {
  // Get the major node used in the emitter.
Get the HLO that determines what emitter will be used.
Done.
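A minimal sketch of how that function could use the real hero to pick the representative input shape; the fallback for a hero without operands is my assumption, not necessarily the PR's behavior:

```cpp
// Gets the representative input shape of the multi-output fusion.
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) {
  // Get the HLO that determines what emitter will be used.
  const HloInstruction* real_hero = GetRealHeroOfMultiOutputFusion(instr);
  if (real_hero->operands().empty()) {
    // Assumption: return an empty shape if the hero has no input.
    return Shape();
  }
  return real_hero->operand(0)->shape();
}
```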
// This optimization pass horizontally fuses kInput fusions to both reduce the
// kernel launch overhead and increase parallelism degree. See
// GpuHorizontalFusion for general description and motivation about horizontal
// fusion. GpuHorizontalFusion deals with kLoop fusions while this pass deals
Would it make sense to s/GpuHorizontalFusion/GpuHorizontalLoopFusion? Can be a separate PR though.
That makes sense. I will make another PR to do this.
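Just to make the relationship between the two passes concrete, a hypothetical sketch of how they could be scheduled relative to each other in a GPU pass pipeline (the pass names follow the rename discussed above; the actual pipeline placement in the codebase may differ):

```cpp
// Hypothetical helper showing the two horizontal-fusion passes side by side.
// GpuHorizontalLoopFusion fuses kLoop fusions; GpuHorizontalInputFusion
// (this PR) fuses kInput fusions such as fused reductions.
Status RunHorizontalFusionPasses(HloModule* module) {
  HloPassPipeline pipeline("horizontal-fusion");
  pipeline.AddPass<GpuHorizontalLoopFusion>();
  pipeline.AddPass<GpuHorizontalInputFusion>();
  pipeline.AddPass<HloDCE>();  // Clean up instructions orphaned by fusion.
  return pipeline.Run(module).status();
}
```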
if (ShapesCompatibleForMultiOutputFusion(*fusion_anchor, *fused) &&
    !FusionWouldBeTooLarge(*fusion_anchor, *fused)) {
  VLOG(3) << absl::StrCat("Fuse ", fused->ToString(), " into ",
                          fusion_anchor->ToString());
VLOG(3) << "Fuse " << fused->ToString() << " into " << fusion_anchor->ToString();
Same below.
Changed.
  return shape_a.rank() < shape_b.rank();
} else if (ShapeUtil::ElementsIn(shape_a) !=
           ShapeUtil::ElementsIn(shape_b)) {
  // Sort according to element size so that roughly the same input
This would order shapes [128,256] and [256,128] in an arbitrary order, right? If so, we may miss out on fusion opportunities because two equal shapes may not be sorted next to each other.
I revised the comparison function.
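For reference, a sketch of a comparator along the lines of the revision (sort by rank, then lexicographically by dimensions, so that [128,256] and [256,128] are distinguished and identical shapes land next to each other after sorting); this is my reconstruction, not necessarily the exact code that landed:

```cpp
bool CompareShapeDimsFromLeftToRight(const Shape& shape_a,
                                     const Shape& shape_b) {
  // Lower-rank shapes sort first.
  if (shape_a.rank() != shape_b.rank()) {
    return shape_a.rank() < shape_b.rank();
  }
  // Same rank: compare dimensions from left to right, so equal shapes end
  // up adjacent and [128,256] is ordered differently from [256,128].
  for (int64_t i = 0; i < shape_a.rank(); ++i) {
    if (shape_a.dimensions(i) != shape_b.dimensions(i)) {
      return shape_a.dimensions(i) < shape_b.dimensions(i);
    }
  }
  return false;  // Equal shapes: not "less than" (strict weak ordering).
}
```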
// Verify that horizontal fusion is kicked in. Check that there are multiple
// `reduce` instructions fused into the same fusion. 6 is just a randomly
// picked number as we don't exactly know how large the fusion will be
// created.
Why don't we know? Is it because the "fusion too large" heuristic will kick in at some time?
Right. It is because of the FusionWouldBeTooLarge constraint. I added the reason to the comment.
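A rough sketch of the kind of check being discussed, assuming `fusion` points at the horizontally fused kInput fusion found in the optimized module; the exact threshold is arbitrary because FusionWouldBeTooLarge may cap how large the fusion grows:

```cpp
// Count the `reduce` instructions that ended up inside one fusion and
// require "multiple" of them, without pinning down an exact number.
int reduce_count = 0;
for (const HloInstruction* inst :
     fusion->fused_instructions_computation()->instructions()) {
  if (inst->opcode() == HloOpcode::kReduce) {
    ++reduce_count;
  }
}
EXPECT_GT(reduce_count, 1);
```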
Extend horizontal fusion to support fusion of reduction instructions.
…sion. So that we can distinguish [128,256] and [256,128].
Thank you for the review, @thomasjoerg!
I believe I have addressed all the comments. Please take another look.
name = "horizontal_input_fusion_test", | ||
srcs = ["horizontal_input_fusion_test.cc"], | ||
deps = [ | ||
":fusion_merger", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My oversight. Cleaned them up.
if (IsReductionFromOrToContiguousDimensions(*inst)) { | ||
return inst; | ||
} | ||
const HloInstruction* GetMajorNodeForMultiOutputFusion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised according to your suggestion. Thanks!
if (user->opcode() == HloOpcode::kGetTupleElement) { | ||
// Skip GTE. | ||
return IsConsumerTheOnlyNonRootUser(*user, consumer); | ||
} else if (user == &consumer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised.
return true; | ||
} else if (user == user->parent()->root_instruction()) { | ||
// Consumed by ROOT is always fine, since it is impossible to create | ||
// cycles through ROOT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. We simply should not have these comments here. They are confusing.
|
||
// Gets the representative input shape of the multi-output fusion. | ||
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) { | ||
// Get the major node used in the emitter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return shape_a.rank() < shape_b.rank(); | ||
} else if (ShapeUtil::ElementsIn(shape_a) != | ||
ShapeUtil::ElementsIn(shape_b)) { | ||
// Sort according to element size so that roughly the same input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revised the comparison function.
if (ShapesCompatibleForMultiOutputFusion(*fusion_anchor, *fused) && | ||
!FusionWouldBeTooLarge(*fusion_anchor, *fused)) { | ||
VLOG(3) << absl::StrCat("Fuse ", fused->ToString(), " into ", | ||
fusion_anchor->ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
// This optimization pass horizontally fuses kInput fusions to both reduce the | ||
// kernel launch overhead and increase parallelism degree. See | ||
// GpuHorizontalFusion for general description and motivation about horizontal | ||
// fusion. GpuHorizontalFusion deals with kLoop fusions while this pass deals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I will make another PR to do this.
// Verify that horizontal fusion is kicked in. Check that there are multiple | ||
// `reduce` instructions fused into the same fusion. 6 is just a randomly | ||
// picked number as we don't exactly know how large the fusion will be | ||
// created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. It is because of the FusionWouldBeTooLarge constraint. I added the reason into the comments.
@thomasjoerg ping~
Extend horizontal fusion to support fusion of reduction instructions.
The PR for the code generation of parallel reduction is here. Potential performance gains can also be inferred from the numbers listed in the code-gen PR.