Align input memory format and grad memory format for GroupNorm backward #92668
Conversation
🔗 Helpful links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/92668. Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit b3cd8ff. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Fine with the change, but check my comment on whether we can simplify the expression in derivatives.yaml.
tools/autograd/derivatives.yaml

```diff
@@ -1171,7 +1171,7 @@
   rstd: not_implemented("native_layer_norm_backward rstd")

 - name: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor)
-  input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? grads[0].suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
+  input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
```
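The one-liner above is dense, so here is an illustrative, runnable Python mirror of the layout-alignment rule it applies to grads[0] and input. `align_like_input` is our name for illustration, not a PyTorch API, and the memory-format detection is a simplification of c10's `suggest_memory_format`:

```python
import torch

def align_like_input(t: torch.Tensor, input: torch.Tensor) -> torch.Tensor:
    # XPU: pass t through in whatever layout it was allocated
    # (the original XPU behavior is preserved).
    if t.device.type == "xpu":
        return t
    if t.device.type == "cpu":
        # CPU: follow the *input's* suggested memory format, so that
        # grads[0], input, and dX all end up in the same layout. Python
        # exposes no suggest_memory_format, so this check approximates it.
        if (input.dim() == 4
                and input.is_contiguous(memory_format=torch.channels_last)
                and not input.is_contiguous()):
            return t.contiguous(memory_format=torch.channels_last)
        return t.contiguous()
    # Other devices: plain contiguous, as before.
    return t.contiguous()
```

The rest of the formula is unchanged: the infinitely differentiable path is taken when GradMode is enabled or grads[1]/grads[2] are defined; otherwise native_group_norm_backward_symint receives the aligned grads[0] and input.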
Curious: why don't we do the contiguous call on grads[0] (dY) inside the kernel of each device? That way we wouldn't have to put such a complicated expression (very hard to read indeed) in derivatives.yaml, and a change for one particular device backend wouldn't impact another.
Yes, we should do the contiguous call inside the kernels instead of in derivatives.yaml; we will do it in a future PR.
Plus one on making this more readable, in this or in follow-up PRs.
Can you please explain why it's OK to use input.suggest_memory_format() for grads?
And wouldn't it essentially undo all the perf work you did in the previous PR, as we would always modify the grads' memory format for the backward pass (if we did not allocate it with channels last)?
@malfet Could you please review this PR? Thank you!
IMO it would be good to move the complex logic that currently exists in derivatives.yaml into the source code and add some embedded write-up on how it is supposed to behave.
I.e., I think the problem with the original PR (and maybe with the entire series of PRs) was that inputs were in channels-last format, but layer properties were still contiguous, yet treated as channels last. (I might be wrong; I have not looked into the code enough.)
aten/src/ATen/native/group_norm.cpp (Outdated)

```diff
@@ -120,10 +120,13 @@ std::tuple<Tensor, Tensor, Tensor> native_group_norm_backward(
       at::borrow_from_optional_tensor(gamma_opt);
   const Tensor& gamma = *gamma_maybe_owned;

-  bool mixed_type = is_mixed_type(X, mean, rstd);
+  bool mixed_type = is_mixed_type(X, mean, rstd) || is_mixed_type(dY, mean, rstd);
```
Hmm, how is this change related to the memory format problem?
It is not related to memory format; it just makes sure that X and dY are in bfloat16 and the parameters are in float when we hit the mixed-data-type path.
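For illustration, a minimal sketch of the mixed-data-type case being guarded here, assuming a CPU build where GroupNorm accepts bfloat16 activations with float32 parameters (the combination this PR series enables):

```python
import torch
import torch.nn.functional as F

# Mixed data type on CPU: bfloat16 activations, float32 affine parameters;
# mean and rstd are kept in float32 internally.
x = torch.randn(2, 8, 4, 4).bfloat16().requires_grad_()
weight = torch.randn(8, dtype=torch.float32, requires_grad=True)
bias = torch.randn(8, dtype=torch.float32, requires_grad=True)

y = F.group_norm(x, 2, weight, bias)  # X is bfloat16
y.sum().backward()                    # dY is bfloat16 too, hence the dY check
```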
Can you please move those to a separate PR and I'll approve it right away?
Moved the dtype check
aten/src/ATen/native/group_norm.cpp (Outdated)

```cpp
  if (mixed_type) {
    check_mixed_data_type(X, mean, rstd);
    check_mixed_data_type(dY, mean, rstd);
```
Same as the above
Same as the above
```cpp
  auto memory_format = X.device().is_cpu() ?
      X.suggest_memory_format() : at::MemoryFormat::Contiguous;
```
Hmm, why just CPU? xpu should be the same, shouldn't it?
This is because we have only added channels-last support on CPU so far.
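Illustratively, with the CPU channels-last support from #89485 in place, one would expect the layout to be preserved through forward and backward; a rough check (ours, not taken from the PR's tests):

```python
import torch

net = torch.nn.GroupNorm(2, 8)
x = (torch.randn(2, 8, 4, 4)
     .to(memory_format=torch.channels_last)
     .requires_grad_())

y = net(x)
print(y.is_contiguous(memory_format=torch.channels_last))       # expected: True
y.backward(torch.randn_like(y))
print(x.grad.is_contiguous(memory_format=torch.channels_last))  # expected: True
```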
Please correct me if I'm wrong, but as I read the code, this change affects behavior for XPU: it used to just pass the tensor through in whatever format it was allocated, for both CPU and XPU, but with your changes the memory layout is converted to contiguous for XPU.
The original version of GroupNorm backward, shown below, passes grads[0] and input through in whatever format they were allocated for XPU, and always generates a contiguous dX:

```cpp
native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(), input.device().is_xpu() ? input : input.contiguous(), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())

dX = at::native::empty_like( X, c10::nullopt /* dtype */, c10::nullopt /* layout */, c10::nullopt /* device */, c10::nullopt /* pin_memory */, at::MemoryFormat::Contiguous);
```

I want to keep that behavior for XPU in this PR, and also make sure that input, grads[0], and dX are aligned in memory layout and that both channels last and channels first are supported on CPU.
We will also check XPU and submit another PR if needed.
> I want to keep the behavior for XPU in this PR

But your changes force the memory format to contiguous for XPU, rather than keeping the original behavior where memory formats are not modified, right?
The original version, i.e., before #89485 was merged, forced the memory format of dX to be contiguous:

```cpp
dX = at::native::empty_like( X, c10::nullopt /* dtype */, c10::nullopt /* layout */, c10::nullopt /* device */, c10::nullopt /* pin_memory */, at::MemoryFormat::Contiguous);
```
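A rough Python equivalent of that allocation, for illustration: empty_like with an explicit memory format ignores X's channels-last layout.

```python
import torch

X = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
# Equivalent of the quoted empty_like call: dX is allocated plain
# contiguous regardless of X's layout.
dX = torch.empty_like(X, memory_format=torch.contiguous_format)
print(dX.is_contiguous())  # True, even though X is channels last
```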
> Can you please explain why it's OK to use input.suggest_memory_format() for grads? And wouldn't it essentially undo all the perf work you did in the previous PR?
@malfet The PR #89485 added channels-last support for GroupNorm backward, which needs both the grad input (grads[0]) and the input (X) to be channels last. Case 1: GroupNorm gets a channels-last input and generates a channels-last output in forward, and GroupNorm backward gets a channels-last grad input and generates a channels-last grad output; #89485 works well here. Case 2: the propagation of channels last is broken (mentioned in #92166), e.g. GroupNorm gets a channels-first input but GroupNorm backward gets a channels-last grad input; here we need to align the memory format with the input first, which will not undo the perf work for case 1.
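A minimal sketch of case 2, with made-up shapes: the forward input is channels first, but the incoming gradient is channels last, so backward must align the two layouts before calling the kernel.

```python
import torch

net = torch.nn.GroupNorm(2, 8)
x = torch.randn(2, 8, 4, 4, requires_grad=True)  # channels-first input

y = net(x)
# Case 2: the incoming gradient is channels last (e.g. produced by a later
# channels-last op), while the forward input was channels first.
g = torch.randn_like(y).to(memory_format=torch.channels_last)
y.backward(g)  # the backward formula aligns g with x's layout first
```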
@CaoE, there are more than two cases. One should not make any assumptions about tensor layout in either the forward or the backward pass; i.e., sometimes the same operator can get inputs as channels first during the first call but as channels last during the second, and it should cast them to the most appropriate format. I might be wrong, but it feels like the current CPU implementation treats …
Yes, if the layouts of the input and the grad input are not the same, it will produce incorrect results. So we should align the layouts of the input and the grad input; this will not undo the perf work when both are already channels last.
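The no-copy claim can be checked directly: .contiguous(memory_format=...) is expected to return the tensor itself when the layout already matches, so aligning already-channels-last grads should cost nothing.

```python
import torch

g = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
aligned = g.contiguous(memory_format=torch.channels_last)
print(aligned is g)  # expected: True -- no copy when the layout matches

g_cf = torch.randn(2, 8, 4, 4)  # channels first
realigned = g_cf.contiguous(memory_format=torch.channels_last)
print(realigned is g_cf)  # expected: False -- this case pays for a copy
```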
@malfet Could you please review this again? Thank you.
@malfet Could you please review this PR?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job failed: trunk / macos-12-py3-arm64 / test (default, 1, 2, macos-m1-12). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes the skipped part of the test in #92671. Aligns the input memory format and the grad memory format for GroupNorm backward.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10