
Align input memory format and grad memory format for GroupNorm backward #92668

Closed
wants to merge 1 commit

Conversation

@CaoE (Collaborator) commented Jan 20, 2023

Fixes the skipped part of the test on #92671. Align the input memory format and the grad memory format for GroupNorm backward.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
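
For illustration (not part of the PR), a minimal libtorch C++ sketch of the kind of layout mismatch being aligned here, roughly the "case 2" scenario discussed further down: the forward input is channels last while the incoming gradient is allocated channels first. The module size, shapes, and printed check are arbitrary choices of this sketch.

#include <torch/torch.h>
#include <iostream>

int main() {
  auto gn = torch::nn::GroupNorm(torch::nn::GroupNormOptions(2, 8));
  // Channels-last input for the forward pass.
  auto x = torch::randn({4, 8, 16, 16})
               .contiguous(at::MemoryFormat::ChannelsLast)
               .requires_grad_(true);
  auto y = gn->forward(x);
  // Incoming gradient deliberately left in the default contiguous layout,
  // so backward receives a channels-last input but a channels-first dY.
  auto grad = torch::ones_like(y).contiguous();
  y.backward(grad);
  std::cout << "dX channels last: "
            << x.grad().is_contiguous(at::MemoryFormat::ChannelsLast) << "\n";
  return 0;
}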

@pytorch-bot (bot) commented Jan 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92668

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b3cd8ff:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Jan 20, 2023
@CaoE CaoE changed the title add device check for groupnorm backward memory format Align input memory format and grad memory format Jan 21, 2023
@CaoE CaoE changed the title Align input memory format and grad memory format Align input memory format and grad memory format for GroupNorm backward Jan 21, 2023
@CaoE CaoE requested review from jgong5 and removed request for jgong5 January 21, 2023 12:09
@jgong5 (Collaborator) left a comment

Fine with the change, but check my comment on whether we can simplify the expression in derivatives.yaml.

@@ -1171,7 +1171,7 @@
rstd: not_implemented("native_layer_norm_backward rstd")

- name: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor)
input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? grads[0].suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
Collaborator

Curious why we don't do the contiguous call on grads[0] (dY) inside the kernel of each device? That way, we wouldn't have to put such a complicated expression (very hard to read, indeed) in derivatives.yaml. Also, a change for a particular device backend wouldn't impact the others.

@CaoE (Collaborator, Author) commented Jan 23, 2023

Yes, we should do the contiguous call inside the kernels instead of in derivatives.yaml, and will do so in a future PR.
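
As a rough illustration of that follow-up (an assumption, not code from this or any later PR), the layout handling currently expressed in derivatives.yaml could live in a small kernel-side helper; the helper name is hypothetical and it assumes dY and X sit on the same device:

#include <ATen/ATen.h>
#include <utility>

// Hypothetical helper: align the incoming gradient with the input's layout before
// running the per-device backward. CPU follows the input's suggested format,
// other devices fall back to plain contiguous, and XPU keeps its old pass-through.
static std::pair<at::Tensor, at::Tensor> align_group_norm_backward_inputs(
    const at::Tensor& dY, const at::Tensor& X) {
  if (X.device().is_xpu()) {
    return {dY, X};  // preserve the pre-existing XPU behavior: no layout changes
  }
  const auto fmt = X.device().is_cpu() ? X.suggest_memory_format()
                                       : at::MemoryFormat::Contiguous;
  return {dY.contiguous(fmt), X.contiguous(fmt)};
}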

Contributor

Plus one on making this more readable in this or in follow-up PRs.

Contributor

Can you please explain why it's OK to use input.suggest_memory_format() for grads?
And wouldn't it essentially undo all the perf work you did in the previous PR, as we would always modify the grads' memory format for the backward pass (if we did not allocate it with channels last)?

@CaoE CaoE force-pushed the ecao/fix_gn branch 4 times, most recently from 53acfe9 to 43cb756 on January 23, 2023 02:51
@CaoE CaoE marked this pull request as ready for review January 23, 2023 10:32
@CaoE CaoE requested a review from malfet January 23, 2023 10:33
@CaoE (Collaborator, Author) commented Jan 24, 2023

@malfet Could you please review this PR? Thank you!

@malfet (Contributor) left a comment

IMO it would be good to move the complex logic that currently exists in derivatives.yaml into the source code and add some embedded write-up on how it is supposed to behave.

I.e. I think the problem with the original PR (and maybe with the entire series of PRs) was that the inputs were in channels last format, but the layer properties were still contiguous yet treated as channels last. (I might be wrong; I have not looked into the code enough.)

@@ -1171,7 +1171,7 @@
rstd: not_implemented("native_layer_norm_backward rstd")

- name: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor)
input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? grads[0].suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
Contributor

Plus one on making this more readable in this or in follow-up PRs.

@@ -120,10 +120,13 @@ std::tuple<Tensor, Tensor, Tensor> native_group_norm_backward(
at::borrow_from_optional_tensor(gamma_opt);
const Tensor& gamma = *gamma_maybe_owned;

- bool mixed_type = is_mixed_type(X, mean, rstd);
+ bool mixed_type = is_mixed_type(X, mean, rstd) || is_mixed_type(dY, mean, rstd);
Contributor

Hmm, how is this change related to the memory format problem?

Collaborator Author

It is not related to memory format; it just makes sure that X and dY are in bfloat16 and the parameters are in float when we hit the mixed data type path.
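
A conceptual sketch (my naming, not the actual is_mixed_type/check_mixed_data_type helpers) of the dtype property the extra dY check enforces:

#include <ATen/ATen.h>

// Hypothetical predicate: in the mixed-precision GroupNorm path, the data tensor
// (X or dY) is BFloat16 while the saved statistics stay in Float.
static bool looks_like_group_norm_mixed_type(const at::Tensor& data,
                                             const at::Tensor& mean,
                                             const at::Tensor& rstd) {
  return data.scalar_type() == at::kBFloat16 &&
         mean.scalar_type() == at::kFloat &&
         rstd.scalar_type() == at::kFloat;
}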

Contributor

Can you please move those to a separate PR and I'll approve it right away?

Collaborator Author

Moved the dtype check

if (mixed_type) {
  check_mixed_data_type(X, mean, rstd);
  check_mixed_data_type(dY, mean, rstd);
}
Contributor

Same as the above

Collaborator Author

Same as the above

auto memory_format = X.device().is_cpu() ?
    X.suggest_memory_format() : at::MemoryFormat::Contiguous;
Contributor

Hmm, why just CPU? XPU should be the same, shouldn't it?

Collaborator Author

This is because we have only added channels last support on CPU so far.

Contributor

Please correct me if I'm wrong, but as I read the code, this change affects behavior for XPU: it used to just pass the tensor through in whatever format it was allocated, for both CPU and XPU, but with your changes the memory layout is converted to contiguous for XPU.

@CaoE (Collaborator, Author) commented Jan 29, 2023

native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(), input.device().is_xpu() ? input : input.contiguous(), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())

dX = at::native::empty_like( X, c10::nullopt /* dtype */, c10::nullopt /* layout */, c10::nullopt /* device */, c10::nullopt /* pin_memory */, at::MemoryFormat::Contiguous);

The original version of GroupNorm backward shown above passes grads[0] and input through in whatever format they were allocated for XPU, and always generates a contiguous dX.
I want to keep that behavior for XPU in this PR and also make sure that input, grads[0], and dX have aligned memory layouts and support both channels last and channels first on CPU.
We will also check XPU and submit another PR if needed.
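
For context, a hedged sketch (simplified, not a verbatim kernel excerpt) of how the aligned layout carries through to the allocation of dX: instead of the old unconditional at::MemoryFormat::Contiguous shown above, dX follows the memory format derived from the input on CPU. The helper name is hypothetical.

#include <ATen/ATen.h>

// Hypothetical helper: allocate dX in the input's suggested memory format on CPU
// (channels last or channels first), and plain contiguous on other devices.
static at::Tensor allocate_dX_like(const at::Tensor& X) {
  const auto memory_format = X.device().is_cpu()
      ? X.suggest_memory_format()
      : at::MemoryFormat::Contiguous;
  return at::empty_like(X, X.options(), memory_format);
}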

Collaborator

I want to keep the behavior for XPU in this PR

But your changes force the memory format to contiguous for XPU, rather than keeping the original behavior where memory formats are not modified, right?

Collaborator Author

The original version, i.e., before #89485 was merged, forced the memory format to be contiguous for dX: dX = at::native::empty_like( X, c10::nullopt /* dtype */, c10::nullopt /* layout */, c10::nullopt /* device */, c10::nullopt /* pin_memory */, at::MemoryFormat::Contiguous);

@@ -1171,7 +1171,7 @@
rstd: not_implemented("native_layer_norm_backward rstd")

- name: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor)
input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? grads[0].suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
Contributor

Can you please explain why it's OK to use input.suggest_memory_format() for grads?
And wouldn't it essentially undo all the perf work you did in the previous PR, as we would always modify the grads' memory format for the backward pass (if we did not allocate it with channels last)?

@CaoE (Collaborator, Author) commented Jan 28, 2023

Can you please explain why it's OK to use input.suggest_memory_format() for grads?
And wouldn't it essentially undo all the perf work you did in the previous PR, as we would always modify the grads' memory format for the backward pass (if we did not allocate it with channels last)?

@malfet PR #89485 adds channels last support for GroupNorm backward, which requires both the incoming gradient (grads[0]) and the input (X) to be channels last.

Case 1: GroupNorm gets a channels last input and generates a channels last output in forward, and GroupNorm backward gets a channels last incoming gradient and generates a channels last grad output. PR #89485 works well here.

Case 2: the propagation of channels last is broken (mentioned in #92166), e.g. GroupNorm gets a channels first input but GroupNorm backward gets a channels last incoming gradient. In that case we need to align the memory format to the input first, which does not undo the perf work for case 1.

@malfet (Contributor) commented Jan 28, 2023

@CaoE, there are more than 2 cases. One should not make any assumptions about tensor layout for either the forward or the backward pass. I.e. sometimes the same operator can get its inputs as channels first during the first call but as channels last during the second, and it should cast them to the most appropriate format. I might be wrong, but it feels like the current CPU implementation treats the mean and rstd tensor layouts as being the same as the input's, so if the same op is used with both channels first and channels last layouts it might produce incorrect results.
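
A minimal libtorch sketch (mine, not a test from the PR) of the kind of consistency check this concern suggests: the same module run with channels-first and then channels-last input should produce matching gradients. Module size and shapes are arbitrary.

#include <torch/torch.h>
#include <iostream>

int main() {
  auto gn = torch::nn::GroupNorm(torch::nn::GroupNormOptions(2, 8));
  auto base = torch::randn({4, 8, 16, 16});

  // Run forward/backward once per layout on the same data and compare the
  // gradients in a common (contiguous) layout.
  auto run = [&](at::MemoryFormat fmt) {
    auto x = base.clone().contiguous(fmt).requires_grad_(true);
    gn->forward(x).sum().backward();
    return x.grad().contiguous();
  };

  auto grad_channels_first = run(at::MemoryFormat::Contiguous);
  auto grad_channels_last = run(at::MemoryFormat::ChannelsLast);
  std::cout << "gradients match: "
            << torch::allclose(grad_channels_first, grad_channels_last) << "\n";
  return 0;
}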

@CaoE (Collaborator, Author) commented Jan 29, 2023

…as the input's, so if the same op is used with both channels first and channels last layouts it might produce incorrect results.

Yes, if the layouts of the input and the grad input are not the same, it will produce incorrect results. So we should align the layouts of the input and the grad input; this will not undo the perf work when both are already channels last.

@CaoE (Collaborator, Author) commented Jan 31, 2023

@malfet Could you please review this again? Thank you.

@CaoE CaoE added the intel priority (matters to intel architecture from performance wise) and intel (This tag is for PR from Intel) labels Jan 31, 2023
@CaoE (Collaborator, Author) commented Feb 6, 2023

@malfet Could you please review this PR?

@CaoE CaoE requested a review from kit1980 February 7, 2023 01:15
@CaoE (Collaborator, Author) commented Feb 9, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 9, 2023
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-12-py3-arm64 / test (default, 1, 2, macos-m1-12)

Details for Dev Infra team: raised by workflow job.

@CaoE (Collaborator, Author) commented Feb 9, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk, intel priority, intel, Merged, open source, release notes: nn
Projects: Status: Done
5 participants