Implement horizontal input fusion. #43051
Conversation
@thomasjoerg could you help review this PR when you have a moment? Thanks!
name = "horizontal_input_fusion_test", | ||
srcs = ["horizontal_input_fusion_test.cc"], | ||
deps = [ | ||
":fusion_merger", |
I think you don't need this. Please try to keep the deps minimal.
My oversight. Cleaned them up.
if (!instr.IsMultiOutputFusion()) {
  return fused_expression_root;
}
// If possible, we want to pick a reduction-to-vector operand of the
Unrelated to your change, but since you are at it: "reduction-to-vector" is outdated terminology. Please replace it with "reduction from or to contiguous dimensions". Thanks!
if (IsReductionFromOrToContiguousDimensions(*inst)) {
  return inst;
}
const HloInstruction* GetMajorNodeForMultiOutputFusion(
Naming: What we are looking for is the HLO in the fusion that determines the emitter to be used. We casually refer to this HLO as "the real hero" sometimes. Not sure that's a great name, but "major" is very ambiguous. Therefore, I'd prefer GetRealHeroOfMultiOutputFusion. The code comment in the header explains what it means.
Revised according to your suggestion. Thanks!
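For context, a minimal sketch of what the renamed helper might look like, pieced together from the diff fragments above; the exact signature and fallback behavior in the PR may differ:

```cpp
// Sketch: the "real hero" is the HLO in the fusion that determines which
// emitter will be used. For a multi-output fusion, prefer a reduction from
// or to contiguous dimensions among the root tuple's operands.
const HloInstruction* GetRealHeroOfMultiOutputFusion(
    const HloInstruction& instr) {
  const HloInstruction* fused_expression_root = instr.fused_expression_root();
  if (!instr.IsMultiOutputFusion()) {
    return fused_expression_root;
  }
  for (const HloInstruction* inst : fused_expression_root->operands()) {
    if (IsReductionFromOrToContiguousDimensions(*inst)) {
      return inst;
    }
  }
  // Fall back to the first operand of the root tuple.
  return fused_expression_root->operands()[0];
}
```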
if (user->opcode() == HloOpcode::kGetTupleElement) {
  // Skip GTE.
  return IsConsumerTheOnlyNonRootUser(*user, consumer);
} else if (user == &consumer) {
You `return` in the line above. Please drop the `else`, just `if`, it's cleaner.
Revised.
return true;
} else if (user == user->parent()->root_instruction()) {
  // Consumed by ROOT is always fine, since it is impossible to create
  // cycles through ROOT.
Not clear why we care about cycles here. Please clarify the comment.
Revised. We simply should not have these comments here. They are confusing.
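To illustrate the two suggestions above (drop `else` after `return`, drop the cycle comments), here is a rough sketch of how the checks could read afterwards; the helper name and iteration come from the diff fragments and may not match the final PR code exactly:

```cpp
// Returns true if `consumer` is the only non-ROOT user of `instr`.
bool IsConsumerTheOnlyNonRootUser(const HloInstruction& instr,
                                  const HloInstruction& consumer) {
  return absl::c_all_of(instr.users(), [&](const HloInstruction* user) {
    if (user->opcode() == HloOpcode::kGetTupleElement) {
      // Skip GTE and check its users instead.
      return IsConsumerTheOnlyNonRootUser(*user, consumer);
    }
    if (user == &consumer) {
      return true;
    }
    if (user == user->parent()->root_instruction()) {
      // Being consumed by ROOT is fine.
      return true;
    }
    return false;
  });
}
```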
// Gets the representative input shape of the multi-output fusion.
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) {
  // Get the major node used in the emitter.
Get the HLO that determines what emitter will be used.
Done.
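A minimal sketch of how that function could use the real hero to pick the representative input shape; the fallback for a hero without operands is my assumption, not necessarily the PR's behavior:

```cpp
// Gets the representative input shape of the multi-output fusion.
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) {
  // Get the HLO that determines what emitter will be used.
  const HloInstruction* real_hero = GetRealHeroOfMultiOutputFusion(instr);
  if (real_hero->operands().empty()) {
    // Assumption: return an empty shape if the hero has no input.
    return Shape();
  }
  return real_hero->operand(0)->shape();
}
```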
// This optimization pass horizontally fuses kInput fusions to both reduce the
// kernel launch overhead and increase parallelism degree. See
// GpuHorizontalFusion for general description and motivation about horizontal
// fusion. GpuHorizontalFusion deals with kLoop fusions while this pass deals
Would it make sense to s/GpuHorizontalFusion/GpuHorizontalLoopFusion? Can be a separate PR though.
That makes sense. I will make another PR to do this.
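Just to make the relationship between the two passes concrete, a hypothetical sketch of how they could be scheduled relative to each other in a GPU pass pipeline (the pass names follow the rename discussed above; the actual pipeline placement in the codebase may differ):

```cpp
// Hypothetical helper showing the two horizontal-fusion passes side by side.
// GpuHorizontalLoopFusion fuses kLoop fusions; GpuHorizontalInputFusion
// (this PR) fuses kInput fusions such as fused reductions.
Status RunHorizontalFusionPasses(HloModule* module) {
  HloPassPipeline pipeline("horizontal-fusion");
  pipeline.AddPass<GpuHorizontalLoopFusion>();
  pipeline.AddPass<GpuHorizontalInputFusion>();
  pipeline.AddPass<HloDCE>();  // Clean up instructions orphaned by fusion.
  return pipeline.Run(module).status();
}
```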
if (ShapesCompatibleForMultiOutputFusion(*fusion_anchor, *fused) &&
    !FusionWouldBeTooLarge(*fusion_anchor, *fused)) {
  VLOG(3) << absl::StrCat("Fuse ", fused->ToString(), " into ",
                          fusion_anchor->ToString());
VLOG(3) << "Fuse " << fused->ToString() << " into " << fusion_anchor->ToString();
Same below.
Changed.
  return shape_a.rank() < shape_b.rank();
} else if (ShapeUtil::ElementsIn(shape_a) !=
           ShapeUtil::ElementsIn(shape_b)) {
  // Sort according to element size so that roughly the same input
This would order shapes [128,256] and [256,128] in an arbitrary order, right? If so, we may miss out on fusion opportunities because two equal shapes may not be sorted next to each other.
I revised the comparison function.
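For reference, a sketch of a comparator along the lines of the revision (sort by rank, then lexicographically by dimensions, so that [128,256] and [256,128] are distinguished and identical shapes land next to each other after sorting); this is my reconstruction, not necessarily the exact code that landed:

```cpp
bool CompareShapeDimsFromLeftToRight(const Shape& shape_a,
                                     const Shape& shape_b) {
  // Lower-rank shapes sort first.
  if (shape_a.rank() != shape_b.rank()) {
    return shape_a.rank() < shape_b.rank();
  }
  // Same rank: compare dimensions from left to right, so equal shapes end
  // up adjacent and [128,256] is ordered differently from [256,128].
  for (int64_t i = 0; i < shape_a.rank(); ++i) {
    if (shape_a.dimensions(i) != shape_b.dimensions(i)) {
      return shape_a.dimensions(i) < shape_b.dimensions(i);
    }
  }
  return false;  // Equal shapes: not "less than" (strict weak ordering).
}
```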
// Verify that horizontal fusion is kicked in. Check that there are multiple
// `reduce` instructions fused into the same fusion. 6 is just a randomly
// picked number as we don't exactly know how large the fusion will be
// created.
Why don't we know? Is it because the "fusion too large" heuristic will kick in at some time?
Right. It is because of the FusionWouldBeTooLarge constraint. I added the reason to the comment.
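A rough sketch of the kind of check being discussed, assuming `fusion` points at the horizontally fused kInput fusion found in the optimized module; the exact threshold is arbitrary because FusionWouldBeTooLarge may cap how large the fusion grows:

```cpp
// Count the `reduce` instructions that ended up inside one fusion and
// require "multiple" of them, without pinning down an exact number.
int reduce_count = 0;
for (const HloInstruction* inst :
     fusion->fused_instructions_computation()->instructions()) {
  if (inst->opcode() == HloOpcode::kReduce) {
    ++reduce_count;
  }
}
EXPECT_GT(reduce_count, 1);
```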
Extend horizontal fusion to support fusion of reduction instructions.
…sion. So that we can distinguish [128,256] and [256,128].
Thank you for the review, @thomasjoerg!
I believe I have addressed all the comments. Please take another look.
name = "horizontal_input_fusion_test", | ||
srcs = ["horizontal_input_fusion_test.cc"], | ||
deps = [ | ||
":fusion_merger", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My oversight. Cleaned them up.
if (IsReductionFromOrToContiguousDimensions(*inst)) { | ||
return inst; | ||
} | ||
const HloInstruction* GetMajorNodeForMultiOutputFusion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised according to your suggestion. Thanks!
if (user->opcode() == HloOpcode::kGetTupleElement) { | ||
// Skip GTE. | ||
return IsConsumerTheOnlyNonRootUser(*user, consumer); | ||
} else if (user == &consumer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised.
return true; | ||
} else if (user == user->parent()->root_instruction()) { | ||
// Consumed by ROOT is always fine, since it is impossible to create | ||
// cycles through ROOT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. We simply should not have these comments here. They are confusing.
|
||
// Gets the representative input shape of the multi-output fusion. | ||
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) { | ||
// Get the major node used in the emitter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return shape_a.rank() < shape_b.rank(); | ||
} else if (ShapeUtil::ElementsIn(shape_a) != | ||
ShapeUtil::ElementsIn(shape_b)) { | ||
// Sort according to element size so that roughly the same input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revised the comparison function.
if (ShapesCompatibleForMultiOutputFusion(*fusion_anchor, *fused) && | ||
!FusionWouldBeTooLarge(*fusion_anchor, *fused)) { | ||
VLOG(3) << absl::StrCat("Fuse ", fused->ToString(), " into ", | ||
fusion_anchor->ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
// This optimization pass horizontally fuses kInput fusions to both reduce the | ||
// kernel launch overhead and increase parallelism degree. See | ||
// GpuHorizontalFusion for general description and motivation about horizontal | ||
// fusion. GpuHorizontalFusion deals with kLoop fusions while this pass deals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I will make another PR to do this.
// Verify that horizontal fusion is kicked in. Check that there are multiple | ||
// `reduce` instructions fused into the same fusion. 6 is just a randomly | ||
// picked number as we don't exactly know how large the fusion will be | ||
// created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. It is because of the FusionWouldBeTooLarge constraint. I added the reason into the comments.
@thomasjoerg ping~
Extend horizontal fusion to support fusion of reduction instructions.
The PR for the code generation of parallel reduction is here. Potential performance gains can also be inferred from the numbers listed in the code-gen PR.