Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[XLA/GPU] Add CustomCallSchedule to give schedule hints to custom-calls. #48806

Merged

Conversation

trentlo
Copy link
Contributor

@trentlo trentlo commented Apr 28, 2021

Add schedule hints EARLY_AS_POSSIBLE and LATE_AS_POSSIBLE to custom-calls.
This supports a custom-call case, where a logical operation can be lowered into
two HLOs (e.g., PerformX and PerformXDone). We can utilize this mechanism to
either hide host latencies between the pair of the custom-calls or the two calls
can more accurately identify the def-use relationship (typically PerformX is
scheduled right after all of its producers have been scheduled and PerformXDone
is scheduled right before its first consumer.)

I need this change for implementing XLA Horovod ops. I have a working prototype internally within NVIDIA and it works well with this change for our tracked DL models, i.e., host overhead is well hidden and I do see some overlapping/parallelism between communication and computation.

@google-ml-butler google-ml-butler bot added the size:L CL Change Size: Large label Apr 28, 2021
@google-cla google-cla bot added the cla: yes label Apr 28, 2021
@trentlo trentlo force-pushed the xla_custom_call_for_hvd_042721 branch from f4f3a55 to 7de009c Compare April 28, 2021 18:12
Add schedule hints EARLY_AS_POSSIBLE and LATE_AS_POSSIBLE to custom-calls.
This supports a custom-call case, where a logical operation can be lowered into
two HLOs (e.g., PerformX and PerformXDone). We can utilize this mechanism to
either hide host latencies between the pair of the custom-calls or the two calls
can more accurately identify the def-use relationship (typically PerformX is
scheduled right after all of its producers have been scheduled and PerformXDone
is scheduled right before its first consumer.)
@trentlo
Copy link
Contributor Author

trentlo commented Apr 28, 2021

@timshen91 could you help to review this PR? Thanks!

@cheshire @sanjoy FYI.

@gbaned gbaned self-assigned this Apr 29, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Apr 29, 2021
// Postprocessor of the HloInstructionSequence. This is an opt-in postprocessing
// function to MemorySchedulerAlgorithm to enforce certain hlo schedule
// constraints desired for custom-calls.
typedef std::function<HloInstructionSequence(const HloInstructionSequence&)>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it more composable if we do postprocessing as another pass after memory scheduling?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheduler tries to compare the mem usage of each schedule and pick the lowest one, so ideally postprocessing would happen in this file after the scheduling algorithm, but before the mem usage calculation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the main motivation to implement in the current way (vs. a separate pass) is to account for potential memory size changes caused by the postprocessors. In this way, the change is accounted into the memory usage calculation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yunxing @dimvar Further thoughts?

if (CustomCallWithSchedule(user,
CustomCallSchedule::EARLY_AS_POSSIBLE) &&
absl::c_all_of(user->operands(), [&](const HloInstruction* opnd) {
return scheduled.contains(opnd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should control-dependencies also be checked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I was also wondering if checking control dependencies is necessary in this particular use case of custom-calls. I think it might not be necessary as in the current code base, i.e., I don't find any codes adding control dependencies to custom-calls. However, let's be conservative and add the check for control dependencies for future proof.

if (CustomCallWithSchedule(opnd,
CustomCallSchedule::LATE_AS_POSSIBLE) &&
absl::c_all_of(opnd->users(), [&](const HloInstruction* u) {
return scheduled.contains(u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same about control-dependencies. I think you might need to check it.

// Postprocessor of the HloInstructionSequence. This is an opt-in postprocessing
// function to MemorySchedulerAlgorithm to enforce certain hlo schedule
// constraints desired for custom-calls.
typedef std::function<HloInstructionSequence(const HloInstructionSequence&)>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


ENTRY %CustomCall () -> f32[1,2,3] {
%constant = f32[1]{0} constant({12345})
ROOT %custom-call = f32[1,2,3]{0,2,1} custom-call(f32[1]{0} %constant), custom_call_target="foo\"bar", schedule=LATE_AS_POSSIBLE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the tests also for EARLY ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@Lee165
Copy link

Lee165 commented Apr 29, 2021 via email

Copy link
Contributor Author

@trentlo trentlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. Some response inlined. Will update the codes soon.

if (CustomCallWithSchedule(user,
CustomCallSchedule::EARLY_AS_POSSIBLE) &&
absl::c_all_of(user->operands(), [&](const HloInstruction* opnd) {
return scheduled.contains(opnd);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I was also wondering if checking control dependencies is necessary in this particular use case of custom-calls. I think it might not be necessary as in the current code base, i.e., I don't find any codes adding control dependencies to custom-calls. However, let's be conservative and add the check for control dependencies for future proof.

// Postprocessor of the HloInstructionSequence. This is an opt-in postprocessing
// function to MemorySchedulerAlgorithm to enforce certain hlo schedule
// constraints desired for custom-calls.
typedef std::function<HloInstructionSequence(const HloInstructionSequence&)>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the main motivation to implement in the current way (vs. a separate pass) is to account for potential memory size changes caused by the postprocessors. In this way, the change is accounted into the memory usage calculation.


ENTRY %CustomCall () -> f32[1,2,3] {
%constant = f32[1]{0} constant({12345})
ROOT %custom-call = f32[1,2,3]{0,2,1} custom-call(f32[1]{0} %constant), custom_call_target="foo\"bar", schedule=LATE_AS_POSSIBLE
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@trentlo
Copy link
Contributor Author

trentlo commented Apr 29, 2021

Codes updated. Please help to take another look. Thanks!

@trentlo trentlo requested review from yunxing and Kariddi April 29, 2021 19:36
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Apr 30, 2021
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Apr 30, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Apr 30, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label May 3, 2021
@trentlo
Copy link
Contributor Author

trentlo commented May 3, 2021

@Kariddi A test failure is because I missed the initialization of custom_call_schedule_ in one of the HloCustomCallInstruction constructors. (EDIT: just checked. It was missing because that constructor is new and not in NV's forked branch.) I push a commit to fix it. I don't think other test/build failures in the CI are related to this PR.

Could you please approve the PR again? Thanks!

@trentlo trentlo requested a review from Kariddi May 3, 2021 20:26
@trentlo
Copy link
Contributor Author

trentlo commented May 4, 2021

@Kariddi A test failure is because I missed the initialization of custom_call_schedule_ in one of the HloCustomCallInstruction constructors. (EDIT: just checked. It was missing because that constructor is new and not in NV's forked branch.) I push a commit to fix it. I don't think other test/build failures in the CI are related to this PR.

Could you please approve the PR again? Thanks!

ping~

Copy link
Contributor

@sanjoy sanjoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of nitpicky comments.


ENTRY %CustomCall () -> f32[1,2,3] {
%constant = f32[1]{0} constant({12345})
%custom-call.0 = f32[1,2,3]{0,2,1} custom-call(f32[1]{0} %constant), custom_call_target="foo", schedule=EARLY_AS_POSSIBLE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about s/EARLY_AS_POSSIBLE/EARLIEST and s/LATE_AS_POSSIBLE/LATEST?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Updated.

@@ -1562,6 +1568,8 @@ class HloCustomCallInstruction : public HloInstruction {
std::vector<std::pair<ShapeIndex, std::pair<int64, ShapeIndex>>>
output_to_operand_aliasing_;
absl::optional<Literal> literal_;
// A custom-call schedule hint.
CustomCallSchedule custom_call_schedule_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just add a default initializer (to CustomCallSchedule::NONE) here instead of doing it in every constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. I just wanted to conform to the original coding style in the file. Let me know if you still want me to switch to use default initializers.

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes May 5, 2021
Copy link
Contributor Author

@trentlo trentlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Please help to take another look. Thanks!


ENTRY %CustomCall () -> f32[1,2,3] {
%constant = f32[1]{0} constant({12345})
%custom-call.0 = f32[1,2,3]{0,2,1} custom-call(f32[1]{0} %constant), custom_call_target="foo", schedule=EARLY_AS_POSSIBLE
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Updated.

@@ -1562,6 +1568,8 @@ class HloCustomCallInstruction : public HloInstruction {
std::vector<std::pair<ShapeIndex, std::pair<int64, ShapeIndex>>>
output_to_operand_aliasing_;
absl::optional<Literal> literal_;
// A custom-call schedule hint.
CustomCallSchedule custom_call_schedule_;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. I just wanted to conform to the original coding style in the file. Let me know if you still want me to switch to use default initializers.

EARLY_AS_POSSIBLE -> EARLIEST.
LATE_AS_POSSIBLE -> LATEST.
@trentlo trentlo requested a review from sanjoy May 6, 2021 18:33
@trentlo
Copy link
Contributor Author

trentlo commented May 7, 2021

@sanjoy are you still on this?

// relationship of the two calls (typically PerformX is scheduled right after
// all of its producers have been scheduled and PerformXDone is scheduled right
// before its first consumer.)
HloInstructionSequence postprocessor_to_custom_schedule(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually this is a bit dodgy: we are exposing the schedule to all clients, but only the GPU backend uses it.

Should we explicitly error out on other backends? Or ignore it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior is that if some clients add the schedule attributes to backends other than GPU, the attributes are simply ignored as the their postprocessors are empty.

An alternative is to make a default postprocessor to give warnings. E.g., "the schedule is set but ignored on the backend." How do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the current implementation in effect ignores the schedule attributes if they are used in other backends. This is a safe and correct behavior by all means.

After some thought, it is good for me to leave this in the current way (i.e., no postprocessors in other backends and effectively ignore the schedule attributes). Let me know if you have other thoughts or I don't address your concern.

};
const std::vector<HloInstruction*>& instrs = input.instructions();
for (HloInstruction* instr : instrs) {
if (scheduled.contains(instr)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't we just defined the lambda checking precisely this? Or more concretely: do we need the lambda at all then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lambda is needed because the absl::c_all_of() below requires a function form.

absl::c_all_of(user->operands(), is_scheduled) && absl::c_all_of(user->control_predecessors(), is_scheduled)) {

It is cleaner to also use the lambda here. Will update it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

return scheduled.contains(instr);
};
const std::vector<HloInstruction*>& instrs = input.instructions();
for (HloInstruction* instr : instrs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nitpick: maybe just use input.instructions() inline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

// relationship of the two calls (typically PerformX is scheduled right after
// all of its producers have been scheduled and PerformXDone is scheduled right
// before its first consumer.)
HloInstructionSequence postprocessor_to_custom_schedule(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style guide says functions are UpperCase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I don't know why I made it lower case. Will revise it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -237,6 +243,9 @@ message HloInstructionProto {
repeated xla.CustomCallOutputOperandAliasing
custom_call_output_operand_aliasing = 74;

// Specifies the desired schedule for the custom-call.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specify that the field is only present for custom calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add comments to make it clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -3823,6 +3828,24 @@ StatusOr<PrecisionConfig::Precision> StringToPrecision(const string& name) {
return found->second;
}

StatusOr<CustomCallSchedule> StringToCustomCallSchedule(const string& name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: absl::string_view for param

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -3823,6 +3828,24 @@ StatusOr<PrecisionConfig::Precision> StringToPrecision(const string& name) {
return found->second;
}

StatusOr<CustomCallSchedule> StringToCustomCallSchedule(const string& name) {
static std::unordered_map<string, CustomCallSchedule>* map = [] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use absl::flat_hash_map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated.

// Postprocessor of the HloInstructionSequence. This is an opt-in postprocessing
// function to MemorySchedulerAlgorithm to enforce certain hlo schedule
// constraints desired for custom-calls.
typedef std::function<HloInstructionSequence(const HloInstructionSequence&)>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "using" is preferred to typedef.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

// Postprocessor of the HloInstructionSequence. This is an opt-in postprocessing
// function to MemorySchedulerAlgorithm to enforce certain hlo schedule
// constraints desired for custom-calls.
typedef std::function<HloInstructionSequence(const HloInstructionSequence&)>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yunxing @dimvar Further thoughts?

Copy link
Contributor Author

@trentlo trentlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks George for taking a look. I will update the codes accordingly soon.

// relationship of the two calls (typically PerformX is scheduled right after
// all of its producers have been scheduled and PerformXDone is scheduled right
// before its first consumer.)
HloInstructionSequence postprocessor_to_custom_schedule(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior is that if some clients add the schedule attributes to backends other than GPU, the attributes are simply ignored as the their postprocessors are empty.

An alternative is to make a default postprocessor to give warnings. E.g., "the schedule is set but ignored on the backend." How do you think?

// relationship of the two calls (typically PerformX is scheduled right after
// all of its producers have been scheduled and PerformXDone is scheduled right
// before its first consumer.)
HloInstructionSequence postprocessor_to_custom_schedule(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I don't know why I made it lower case. Will revise it.

};
const std::vector<HloInstruction*>& instrs = input.instructions();
for (HloInstruction* instr : instrs) {
if (scheduled.contains(instr)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lambda is needed because the absl::c_all_of() below requires a function form.

absl::c_all_of(user->operands(), is_scheduled) && absl::c_all_of(user->control_predecessors(), is_scheduled)) {

It is cleaner to also use the lambda here. Will update it.

@@ -237,6 +243,9 @@ message HloInstructionProto {
repeated xla.CustomCallOutputOperandAliasing
custom_call_output_operand_aliasing = 74;

// Specifies the desired schedule for the custom-call.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add comments to make it clear.

Copy link
Contributor Author

@trentlo trentlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codes updated. Please help to take a look again and let me know if all the previous comments are addressed. Thanks!

// relationship of the two calls (typically PerformX is scheduled right after
// all of its producers have been scheduled and PerformXDone is scheduled right
// before its first consumer.)
HloInstructionSequence postprocessor_to_custom_schedule(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return scheduled.contains(instr);
};
const std::vector<HloInstruction*>& instrs = input.instructions();
for (HloInstruction* instr : instrs) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

};
const std::vector<HloInstruction*>& instrs = input.instructions();
for (HloInstruction* instr : instrs) {
if (scheduled.contains(instr)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -237,6 +243,9 @@ message HloInstructionProto {
repeated xla.CustomCallOutputOperandAliasing
custom_call_output_operand_aliasing = 74;

// Specifies the desired schedule for the custom-call.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -3823,6 +3828,24 @@ StatusOr<PrecisionConfig::Precision> StringToPrecision(const string& name) {
return found->second;
}

StatusOr<CustomCallSchedule> StringToCustomCallSchedule(const string& name) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -3823,6 +3828,24 @@ StatusOr<PrecisionConfig::Precision> StringToPrecision(const string& name) {
return found->second;
}

StatusOr<CustomCallSchedule> StringToCustomCallSchedule(const string& name) {
static std::unordered_map<string, CustomCallSchedule>* map = [] {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated.

// Postprocessor of the HloInstructionSequence. This is an opt-in postprocessing
// function to MemorySchedulerAlgorithm to enforce certain hlo schedule
// constraints desired for custom-calls.
typedef std::function<HloInstructionSequence(const HloInstructionSequence&)>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@trentlo trentlo requested a review from cheshire May 14, 2021 00:09
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels May 14, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 14, 2021
@sanjoy sanjoy removed their request for review May 14, 2021 06:59
@cheshire
Copy link
Member

Warning: Unhelpful next ID: saw 76 but field "custom_call_schedule" already uses it. Use 77 instead? [misleading_next_id]

@trentlo
Copy link
Contributor Author

trentlo commented May 15, 2021 via email

@gbaned gbaned removed the ready to pull PR ready for merge process label May 17, 2021
@trentlo
Copy link
Contributor Author

trentlo commented May 17, 2021

Warning: Unhelpful next ID: saw 76 but field "custom_call_schedule" already uses it. Use 77 instead? [misleading_next_id]

Updated. Please help to take a look again.

@trentlo trentlo requested a review from cheshire May 17, 2021 22:37
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels May 17, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 17, 2021
@akuegel akuegel added the kokoro:force-run Tests on submitted change label May 19, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 19, 2021
@copybara-service copybara-service bot merged commit ef862e4 into tensorflow:master May 19, 2021
PR Queue automation moved this from Reviewer Requested Changes to Merged May 19, 2021
copybara-service bot pushed a commit that referenced this pull request May 26, 2021
PiperOrigin-RevId: 375939575
Change-Id: If25885fd0f0e48e961a5596ff76a83fdf61c36b3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes ready to pull PR ready for merge process size:L CL Change Size: Large
Projects
PR Queue
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

10 participants