
Iterative horizontal fusion. #48706

Merged
merged 3 commits into tensorflow:master on Apr 23, 2021

Conversation

@trentlo (Contributor) commented Apr 22, 2021

  1. Extend horizontal fusion to support non-fusion instructions.
  2. Enable iterative optimization for horizontal fusion. After each iteration,
     new horizontal fusion opportunities are exposed because the producers of
     the previously generated horizontally fused instructions become fusion
     candidates. See IterativeHorizontalFusion in the unittest as an example;
     a rough sketch of the resulting driver follows below.
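
A rough sketch of the fixed-point driver this implies (hypothetical: the pass name and its wiring into the XLA pipeline are assumptions; the idea is simply to rerun the pass until it reports no change):

```cpp
// Hypothetical driver: rerun the pass until it stops changing the module,
// so fusions created in one iteration become candidates in the next.
StatusOr<bool> RunToFixedPoint(HloModule* module) {
  bool changed_any = false;
  while (true) {
    TF_ASSIGN_OR_RETURN(bool changed, GpuHorizontalLoopFusion().Run(module));
    if (!changed) break;  // no new horizontal fusion opportunities exposed
    changed_any = true;
  }
  return changed_any;
}
```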

@google-ml-butler google-ml-butler bot added the size:L CL Change Size: Large label Apr 22, 2021
@google-cla google-cla bot added the cla: yes label Apr 22, 2021
@trentlo (Contributor, Author) commented Apr 22, 2021

@cheshire, could you help review this PR? Thanks!

@cheshire cheshire requested review from cheshire and removed request for sanjoy and joker-eph April 22, 2021 18:25
@cheshire (Member) commented Apr 22, 2021
BTW I'm currently working on changing the calling convention to allow more than N arguments for calls. If successful, that should enable a lot more horizontal fusion opportunities (in many cases I have seen fusion hit the limit on the number of arguments to the kernel).

@@ -67,6 +67,25 @@ PrimitiveType GetUniqueOutputTypeOfFusion(const HloInstruction& instr) {
return first_output_type;
}

size_t GetInstrCountOfFusible(const HloInstruction& instr) {
@cheshire (Member):
Is it possible to reduce the duplication with the same function in the other file?

@trentlo (Contributor, Author):
Good catch. Will do.
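
For context, a minimal sketch of what such a shared helper plausibly looks like (assumed semantics, not the exact code from the PR): a plain instruction counts as one, while a fusion counts the instructions in its fused computation.

```cpp
// Sketch (assumed): instruction count of a "fusible", which may or may
// not already be a fusion instruction.
size_t GetInstrCountOfFusible(const HloInstruction& instr) {
  return instr.opcode() == HloOpcode::kFusion
             ? instr.fused_instruction_count()
             : 1;
}
```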


// Creates a kLoop fusion instruction and fuses `fused` into the created
// fusion instruction.
HloInstruction* MakeLoopFusionInstruction(HloInstruction* fused) {
@cheshire (Member):
Same note regarding reducing duplication. The fusion type could simply be passed as a parameter?

@trentlo (Contributor, Author):
Agreed.
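
A sketch of what the parameterized helper could look like (hypothetical shape; the name and signature are assumptions, not the code that actually landed):

```cpp
// Hypothetical shared helper: the fusion kind is a parameter instead of
// being hard-coded to kLoop, so both files can reuse it.
HloInstruction* MakeFusionInstruction(HloInstruction* fused,
                                      HloInstruction::FusionKind kind) {
  HloComputation* computation = fused->parent();
  // Callers would still need to replace uses of `fused` with the result.
  return computation->AddInstruction(
      HloInstruction::CreateFusion(fused->shape(), kind, fused));
}
```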

// Convert fusible into fusion_instrs to simplify the implementation of
// `Fuse()`.
std::vector<HloInstruction*> fusion_instrs;
for (auto instr : fusibles) {
@cheshire (Member):
Explicit type preferred

@trentlo (Contributor, Author):
Will do.
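
For illustration, the loop with the element type spelled out (a sketch; the body is assumed from the surrounding comment about wrapping non-fusion fusibles into kLoop fusions):

```cpp
// Convert each fusible into a fusion instruction so `Fuse()` only has to
// handle one case. Non-fusion instructions are wrapped into kLoop fusions.
std::vector<HloInstruction*> fusion_instrs;
for (HloInstruction* instr : fusibles) {
  if (instr->opcode() == HloOpcode::kFusion) {
    fusion_instrs.push_back(instr);
  } else {
    fusion_instrs.push_back(MakeLoopFusionInstruction(instr));
  }
}
```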

@@ -493,16 +524,26 @@ StatusOr<bool> HorizontalLoopFusionImpl::Run() {
auto consumer = def_to_use_order[i];
HorizontalLoopFusionImpl::FusionCandidates fusion_candidates(consumer);
while (true) {
@cheshire (Member):
If we are running this in a fixed point, could we remove this while/true loop then?

@trentlo (Contributor, Author):
Theoretically we could, but having it this way is better in practice.

Note that the granularity at which this while loop operates is very fine, i.e., just the instructions that share the same immediate consumer. We write this while loop here because we don't want to fuse all of these instructions into a single kernel (instead, we fuse them into multiple kernels), as a kernel that is too large can be problematic. Theoretically, we could rely on the fixed point to run this pass many times to get these instructions fused, but that is not efficient.

On the other hand, the fixed point is used to process the fusions newly generated by this pass. For example, in the unittest IterativeHorizontalFusion, the fusion created by fusing fusion.0 and fusion.1 won't be traversed until the next iteration.
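
Roughly, the loop being discussed has this shape (a sketch based on the description above; GetNextSpanOfFusions and the exact control flow are assumptions):

```cpp
// Each iteration takes the next span of candidates sharing the same
// consumer and fuses just that span into one kernel, bounding kernel
// size; the outer fixed point later revisits the newly created fusions.
while (true) {
  auto fusibles = fusion_candidates.GetNextSpanOfFusions();
  if (fusibles.empty()) {
    break;  // no more candidates under this consumer
  }
  TF_RETURN_IF_ERROR(Fuse(fusibles));
  changed = true;
}
```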

@trentlo (Contributor, Author) left a review comment:
Response to the comments. Will revise the code soon.


@trentlo (Contributor, Author) commented Apr 22, 2021

> BTW I'm currently working on changing the calling convention to allow more than N arguments for calls. If successful, that should enable a lot more horizontal fusion opportunities (in many cases I have seen fusion hit the limit on the number of arguments to the kernel).

Cool! Since this is a limitation of the CUDA kernel signature, are you going to pack the arguments into a struct and pass that struct?

@trentlo (Contributor, Author) commented Apr 22, 2021

I've addressed the review comments; please take another look. Thanks!

@trentlo trentlo requested a review from cheshire April 22, 2021 21:51
@cheshire (Member) commented:
> are you going to pack the arguments into a struct and pass that struct?

This would not work, as the limit is on the argument size and not the number of arguments. I'm experimenting with passing the buffer table directly instead.
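
To make the distinction concrete, a hypothetical illustration (not XLA's actual calling convention; CUDA traditionally caps total kernel parameter size at 4 KB):

```cpp
// Packing pointers into a by-value struct does not help: the struct is
// itself a kernel parameter, so it counts against the same size cap.
struct PackedArgs {
  float* bufs[512];  // 4 KB of pointers already exhausts the cap
};
__global__ void PackedKernel(PackedArgs args);

// Passing the buffer table indirectly avoids the cap: only one pointer
// is a kernel parameter; the table itself lives in device memory.
__global__ void TableKernel(float* const* buffer_table);
```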

absl::InlinedVector<HloInstruction*, 2> GetOutputsOfFusible(
const HloInstruction& instr) {
if (instr.opcode() != HloOpcode::kFusion) {
return {const_cast<HloInstruction*>(&instr)};
@cheshire (Member):
Must we const-cast? Why not return a vector of const pointers?

@trentlo (Contributor, Author) commented Apr 23, 2021:

The return type has to align with the return type of HloInstruction::operands(), which is absl::InlinedVector<HloInstruction*, 2>. So returning absl::InlinedVector<const HloInstruction*, 2> would require allocating a new vector. const_cast may be a bit less evil.

Let me know if you have any other ideas.

@cheshire (Member):
const_cast seems to be technically UB if modifying methods are then called, right? And allocating two pointers on the stack (a new vector) should be essentially free, right?

@trentlo (Contributor, Author):
Just did some searching. It is undefined behavior only if the object itself (not merely the pointer) is actually const; modifying such an object through const_cast is undefined behavior.

My previous statement was wrong, though: we allocate a new vector (on the stack) anyway. So let's remove the const_cast. Please take a look at the latest commit.
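
One const_cast-free shape of the helper (a sketch under the assumption that callers can hand in a non-const instruction; the signature that actually landed may differ):

```cpp
// Sketch: take the instruction by non-const reference so the returned
// pointers are legitimately mutable, and build a fresh vector either way.
absl::InlinedVector<HloInstruction*, 2> GetOutputsOfFusible(
    HloInstruction& instr) {
  if (instr.opcode() != HloOpcode::kFusion) {
    return {&instr};
  }
  HloInstruction* root = instr.fused_expression_root();
  if (root->opcode() != HloOpcode::kTuple) {
    return {root};
  }
  // Multi-output fusion: the outputs are the operands of the root tuple.
  const auto& tuple_operands = root->operands();
  return absl::InlinedVector<HloInstruction*, 2>(tuple_operands.begin(),
                                                 tuple_operands.end());
}
```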

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Apr 23, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Apr 23, 2021
@gbaned gbaned self-assigned this Apr 23, 2021
@gbaned gbaned added the comp:xla XLA label Apr 23, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Apr 23, 2021
@copybara-service copybara-service bot merged commit d4f6c19 into tensorflow:master Apr 23, 2021
PR Queue automation moved this from Assigned Reviewer to Merged Apr 23, 2021