
Implement horizontal input fusion. #43051

Merged

Conversation

trentlo (Contributor) commented Sep 8, 2020

Extend horizontal fusion to support fusion of reduction instructions.

The PR for the code generation of parallel reduction is here. The potential performance gains can be inferred from the numbers listed in that code-gen PR.

google-ml-butler bot added the size:L (CL Change Size: Large) label Sep 8, 2020
trentlo (Contributor Author) commented Sep 8, 2020

@thomasjoerg could you review this PR when you have a moment? Thanks!

gbaned self-assigned this Sep 9, 2020
gbaned added the comp:xla (XLA) label Sep 9, 2020
gbaned added this to Assigned Reviewer in PR Queue via automation Sep 9, 2020
thomasjoerg previously approved these changes Sep 9, 2020
name = "horizontal_input_fusion_test",
srcs = ["horizontal_input_fusion_test.cc"],
deps = [
":fusion_merger",
thomasjoerg (Member):
I think you don't need this. Please try to keep the deps minimal.

trentlo (Contributor Author):
My oversight. Cleaned them up.

if (!instr.IsMultiOutputFusion()) {
return fused_expression_root;
}
// If possible, we want to pick a reduction-to-vector operand of the
thomasjoerg (Member):
Unrelated to your change, but since you are at it: "reduction-to-vector" is outdated terminology. Please replace it with "reduction from or to contiguous dimensions". Thanks!

if (IsReductionFromOrToContiguousDimensions(*inst)) {
return inst;
}
const HloInstruction* GetMajorNodeForMultiOutputFusion(
thomasjoerg (Member):
Naming: What we are looking for is the HLO in the fusion that determines the emitter to be used. We casually refer to this HLO as "the real hero" sometimes. Not sure that's a great name, but "major" is very ambiguous. Therefore, I'd prefer GetRealHeroOfMultiOutputFusion. The code comment in the header explains what it means.

trentlo (Contributor Author):
Revised according to your suggestion. Thanks!
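For reference, a minimal sketch of the renamed helper, stitched together from the snippets above and assuming the standard XLA accessors (fused_expression_root(), IsMultiOutputFusion(), IsReductionFromOrToContiguousDimensions()); the operand walk is illustrative rather than the PR's verbatim code:

const HloInstruction* GetRealHeroOfMultiOutputFusion(const HloInstruction& instr) {
  CHECK_EQ(instr.opcode(), HloOpcode::kFusion);
  const HloInstruction* fused_expression_root = instr.fused_expression_root();
  if (!instr.IsMultiOutputFusion()) {
    return fused_expression_root;
  }
  // If possible, pick a reduction from or to contiguous dimensions among the
  // root tuple's operands, since that operand determines which emitter is used.
  for (const HloInstruction* inst : fused_expression_root->operands()) {
    if (IsReductionFromOrToContiguousDimensions(*inst)) {
      return inst;
    }
  }
  return fused_expression_root->operands()[0];
}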

if (user->opcode() == HloOpcode::kGetTupleElement) {
// Skip GTE.
return IsConsumerTheOnlyNonRootUser(*user, consumer);
} else if (user == &consumer) {
thomasjoerg (Member):
You return in the line above. Please drop the else, just if, it's cleaner.

trentlo (Contributor Author):
Revised.
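For illustration, the cleaned-up early-return chain would read along these lines (a sketch based on the snippets shown in this thread, not the verbatim diff):

if (user->opcode() == HloOpcode::kGetTupleElement) {
  // Skip GTE.
  return IsConsumerTheOnlyNonRootUser(*user, consumer);
}
if (user == &consumer) {
  // `user` is the consumer itself.
  return true;
}
if (user == user->parent()->root_instruction()) {
  // Consumed by ROOT.
  return true;
}
return false;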

return true;
} else if (user == user->parent()->root_instruction()) {
// Consumed by ROOT is always fine, since it is impossible to create
// cycles through ROOT.
thomasjoerg (Member):
Not clear why we care about cycles here. Please clarify the comment.

trentlo (Contributor Author):
Revised. We simply should not have these comments here. They are confusing.


// Gets the representative input shape of the multi-output fusion.
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) {
// Get the major node used in the emitter.
thomasjoerg (Member):
Get the HLO that determines what emitter will be used.

trentlo (Contributor Author):
Done.
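A sketch of how the representative input shape might be derived from the real hero; the choice of operand 0 is an assumption for illustration, not the exact PR code:

// Gets the representative input shape of the multi-output fusion.
Shape GetInputShapeForMultiOutputFusion(const HloInstruction& instr) {
  // Get the HLO that determines what emitter will be used.
  const HloInstruction* real_hero = GetRealHeroOfMultiOutputFusion(instr);
  if (real_hero->operands().empty()) {
    return Shape();
  }
  return real_hero->operand(0)->shape();
}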

// This optimization pass horizontally fuses kInput fusions to both reduce the
// kernel launch overhead and increase parallelism degree. See
// GpuHorizontalFusion for general description and motivation about horizontal
// fusion. GpuHorizontalFusion deals with kLoop fusions while this pass deals
thomasjoerg (Member):
Would it make sense to s/GpuHorizontalFusion/GpuHorizontalLoopFusion? Can be a separate PR though.

trentlo (Contributor Author):
That makes sense. I will make another PR to do this.

if (ShapesCompatibleForMultiOutputFusion(*fusion_anchor, *fused) &&
!FusionWouldBeTooLarge(*fusion_anchor, *fused)) {
VLOG(3) << absl::StrCat("Fuse ", fused->ToString(), " into ",
fusion_anchor->ToString());
thomasjoerg (Member):
VLOG(3) << "Fuse " << fused->ToString() << " into " << fusion_anchor->ToString();
Same below.

trentlo (Contributor Author):
Changed.

return shape_a.rank() < shape_b.rank();
} else if (ShapeUtil::ElementsIn(shape_a) !=
ShapeUtil::ElementsIn(shape_b)) {
// Sort according to element size so that roughly the same input
thomasjoerg (Member):
This would order shapes [128,256] and [256,128] in an arbitrary order, right? If so, we may miss out on fusion opportunities because two equal shapes may not be sorted next to each other.

trentlo (Contributor Author):
I revised the comparison function.
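Presumably the revision breaks ties on the actual dimensions so that equal shapes always sort next to each other; a sketch of such a comparator (hypothetical name and body):

// Compare by rank first, then dimension-by-dimension from left to right, so
// that [128,256] and [256,128] order deterministically and identical shapes
// end up adjacent after sorting.
bool ShapeDimsLessThan(const Shape& shape_a, const Shape& shape_b) {
  if (shape_a.rank() != shape_b.rank()) {
    return shape_a.rank() < shape_b.rank();
  }
  for (int64 i = 0; i < shape_a.rank(); ++i) {
    if (shape_a.dimensions(i) != shape_b.dimensions(i)) {
      return shape_a.dimensions(i) < shape_b.dimensions(i);
    }
  }
  return false;
}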

// Verify that horizontal fusion is kicked in. Check that there are multiple
// `reduce` instructions fused into the same fusion. 6 is just a randomly
// picked number, as we don't know exactly how large the created fusion
// will be.
thomasjoerg (Member):
Why don't we know? Is it because the "fusion too large" heuristic will kick in at some time?

trentlo (Contributor Author):
Right. It is because of the FusionWouldBeTooLarge constraint. I added the reason into the comments.
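As a sketch, such a check might count the reduces fused into the fusion and assert a conservative lower bound (names illustrative; assumes absl::c_count_if and HloInstruction::fused_instructions()):

// Require at least 6 fused reduces; the exact count depends on where the
// FusionWouldBeTooLarge heuristic cuts the fusion off.
int64 num_reduces = absl::c_count_if(
    fusion->fused_instructions(), [](const HloInstruction* inst) {
      return inst->opcode() == HloOpcode::kReduce;
    });
EXPECT_GE(num_reduces, 6);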

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Sep 9, 2020
google-ml-butler bot added kokoro:force-run (Tests on submitted change) and ready to pull (PR ready for merge process) labels Sep 9, 2020
kokoro-team removed the kokoro:force-run (Tests on submitted change) label Sep 9, 2020
Commits:
Extend horizontal fusion to support fusion of reduction instructions.
…sion.
So that we can distinguish [128,256] and [256,128].
PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Sep 9, 2020
google-ml-butler bot removed the ready to pull (PR ready for merge process) label Sep 9, 2020
trentlo (Contributor Author) left a comment:
Thank you for the review! @thomasjoerg

I believe I've addressed all the comments. Please take another look.

trentlo (Contributor Author) commented Sep 14, 2020

@thomasjoerg ping~

gbaned added the awaiting review (Pull request awaiting review) label Sep 15, 2020
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Sep 15, 2020
google-ml-butler bot added kokoro:force-run (Tests on submitted change) and ready to pull (PR ready for merge process) labels Sep 15, 2020
kokoro-team removed the kokoro:force-run (Tests on submitted change) label Sep 15, 2020
gbaned removed the awaiting review (Pull request awaiting review) label Sep 15, 2020
gbaned added and removed the ready to pull (PR ready for merge process) label Sep 16, 2020
tensorflow-copybara merged commit 091d39a into tensorflow:master Sep 16, 2020
PR Queue automation moved this from Approved by Reviewer to Merged Sep 16, 2020
Labels: cla: yes · comp:xla (XLA) · ready to pull (PR ready for merge process) · size:L (CL Change Size: Large)
6 participants