Align input memory format and grad memory format for GroupNorm backward #92668
Conversation
🔗 Helpful links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/92668. Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit b3cd8ff. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Fine with the change, but check my comment on whether we can simplify the expression in derivatives.yaml.
tools/autograd/derivatives.yaml

```diff
@@ -1171,7 +1171,7 @@
   rstd: not_implemented("native_layer_norm_backward rstd")

 - name: native_group_norm(Tensor input, Tensor? weight, Tensor? bias, SymInt N, SymInt C, SymInt HxW, int group, float eps) -> (Tensor, Tensor, Tensor)
-  input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? grads[0].suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
+  input, weight, bias: "GradMode::is_enabled() || grads[1].defined() || grads[2].defined() ? infinitely_differentiable_native_group_norm_backward(grads[0], grads[1], grads[2], input, result1, result2, weight, N, C, HxW, group, eps, grad_input_mask) : (grads[0].defined() ? native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(grads[0].device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), input.device().is_xpu() ? input : input.contiguous(input.device().is_cpu() ? input.suggest_memory_format() : c10::MemoryFormat::Contiguous), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())"
```
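The one-liner above is dense, so here is an illustrative, runnable Python mirror of the layout-alignment rule it applies to grads[0] and input. `align_like_input` is our name for illustration, not a PyTorch API, and the memory-format detection is a simplification of c10's `suggest_memory_format`:

```python
import torch

def align_like_input(t: torch.Tensor, input: torch.Tensor) -> torch.Tensor:
    # XPU: pass t through in whatever layout it was allocated
    # (the original XPU behavior is preserved).
    if t.device.type == "xpu":
        return t
    if t.device.type == "cpu":
        # CPU: follow the *input's* suggested memory format, so that
        # grads[0], input, and dX all end up in the same layout. Python
        # exposes no suggest_memory_format, so this check approximates it.
        if (input.dim() == 4
                and input.is_contiguous(memory_format=torch.channels_last)
                and not input.is_contiguous()):
            return t.contiguous(memory_format=torch.channels_last)
        return t.contiguous()
    # Other devices: plain contiguous, as before.
    return t.contiguous()
```

The rest of the formula is unchanged: the infinitely differentiable path is taken when GradMode is enabled or grads[1]/grads[2] are defined; otherwise native_group_norm_backward_symint receives the aligned grads[0] and input.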
Curious: why don't we do the contiguous call on grads[0] (dY) inside the kernel of each device? That way we wouldn't have to put such a complicated expression (very hard to read indeed) in derivatives.yaml, and a change for one particular device backend wouldn't impact another.
Yes, we should do the contiguous call inside the kernels instead of in derivatives.yaml; we will do it in a future PR.
Plus one on making this more readable, in this or in follow-up PRs.
Can you please explain why it's OK to use input.suggest_memory_format() for grads?
And wouldn't it essentially undo all the perf work you did in the previous PR, as we would always modify the grads' memory format for the backward pass (if we did not allocate it with channels last)?
@malfet Could you please review this PR? Thank you!
IMO it would be good to move the complex logic that currently exists in derivatives.yaml into the source code and add some embedded write-up on how it is supposed to behave.
I.e., I think the problem with the original PR (and maybe with the entire series of PRs) was that inputs were in channels-last format, but layer properties were still contiguous, yet treated as channels last. (I might be wrong; I have not looked into the code enough.)
aten/src/ATen/native/group_norm.cpp (Outdated)

```diff
@@ -120,10 +120,13 @@ std::tuple<Tensor, Tensor, Tensor> native_group_norm_backward(
       at::borrow_from_optional_tensor(gamma_opt);
   const Tensor& gamma = *gamma_maybe_owned;

-  bool mixed_type = is_mixed_type(X, mean, rstd);
+  bool mixed_type = is_mixed_type(X, mean, rstd) || is_mixed_type(dY, mean, rstd);
```
Hmm, how is this change related to the memory format problem?
It is not related to memory format; it just makes sure that X and dY are in bfloat16 and the parameters are in float when we hit the mixed-data-type path.
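For illustration, a minimal sketch of the mixed-data-type case being guarded here, assuming a CPU build where GroupNorm accepts bfloat16 activations with float32 parameters (the combination this PR series enables):

```python
import torch
import torch.nn.functional as F

# Mixed data type on CPU: bfloat16 activations, float32 affine parameters;
# mean and rstd are kept in float32 internally.
x = torch.randn(2, 8, 4, 4).bfloat16().requires_grad_()
weight = torch.randn(8, dtype=torch.float32, requires_grad=True)
bias = torch.randn(8, dtype=torch.float32, requires_grad=True)

y = F.group_norm(x, 2, weight, bias)  # X is bfloat16
y.sum().backward()                    # dY is bfloat16 too, hence the dY check
```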
Can you please move those to a separate PR and I'll approve it right away?
Moved the dtype check
aten/src/ATen/native/group_norm.cpp (Outdated)

```cpp
  if (mixed_type) {
    check_mixed_data_type(X, mean, rstd);
    check_mixed_data_type(dY, mean, rstd);
```
Same as the above
Same as the above
```cpp
  auto memory_format = X.device().is_cpu() ?
      X.suggest_memory_format() : at::MemoryFormat::Contiguous;
```
Hmm, why just CPU? xpu should be the same, shouldn't it?
This is because we have only added channels-last support on CPU so far.
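Illustratively, with the CPU channels-last support from #89485 in place, one would expect the layout to be preserved through forward and backward; a rough check (ours, not taken from the PR's tests):

```python
import torch

net = torch.nn.GroupNorm(2, 8)
x = (torch.randn(2, 8, 4, 4)
     .to(memory_format=torch.channels_last)
     .requires_grad_())

y = net(x)
print(y.is_contiguous(memory_format=torch.channels_last))       # expected: True
y.backward(torch.randn_like(y))
print(x.grad.is_contiguous(memory_format=torch.channels_last))  # expected: True
```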
Please correct me if I'm wrong, but as I read the code, this change affects behavior for XPU: it used to just pass the tensor through in whatever format it was allocated, for both CPU and XPU, but with your changes the memory layout is converted to contiguous for XPU.
The original version of GroupNorm backward, shown below, passes grads[0] and input through in whatever format they were allocated for XPU, and always generates a contiguous dX:

```cpp
native_group_norm_backward_symint(grads[0].device().is_xpu() ? grads[0] : grads[0].contiguous(), input.device().is_xpu() ? input : input.contiguous(), result1, result2, weight, N, C, HxW, group, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>())

dX = at::native::empty_like( X, c10::nullopt /* dtype */, c10::nullopt /* layout */, c10::nullopt /* device */, c10::nullopt /* pin_memory */, at::MemoryFormat::Contiguous);
```

I want to keep that behavior for XPU in this PR, and also make sure that input, grads[0], and dX are aligned in memory layout and that both channels last and channels first are supported on CPU.
We will also check XPU and submit another PR if needed.
> I want to keep the behavior for XPU in this PR

But your changes force the memory format to contiguous for XPU, rather than keeping the original behavior where memory formats are not modified, right?
The original version, i.e., before #89485 was merged, forced the memory format of dX to be contiguous:

```cpp
dX = at::native::empty_like( X, c10::nullopt /* dtype */, c10::nullopt /* layout */, c10::nullopt /* device */, c10::nullopt /* pin_memory */, at::MemoryFormat::Contiguous);
```
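A rough Python equivalent of that allocation, for illustration: empty_like with an explicit memory format ignores X's channels-last layout.

```python
import torch

X = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
# Equivalent of the quoted empty_like call: dX is allocated plain
# contiguous regardless of X's layout.
dX = torch.empty_like(X, memory_format=torch.contiguous_format)
print(dX.is_contiguous())  # True, even though X is channels last
```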
> Can you please explain why it's OK to use input.suggest_memory_format() for grads? And wouldn't it essentially undo all the perf work you did in the previous PR?
@malfet The PR #89485 added channels-last support for GroupNorm backward, which needs both the grad input (grads[0]) and the input (X) to be channels last. Case 1: GroupNorm gets a channels-last input and generates a channels-last output in forward, and GroupNorm backward gets a channels-last grad input and generates a channels-last grad output; #89485 works well here. Case 2: the propagation of channels last is broken (mentioned in #92166), e.g. GroupNorm gets a channels-first input but GroupNorm backward gets a channels-last grad input; here we need to align the memory format with the input first, which will not undo the perf work for case 1.
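A minimal sketch of case 2, with made-up shapes: the forward input is channels first, but the incoming gradient is channels last, so backward must align the two layouts before calling the kernel.

```python
import torch

net = torch.nn.GroupNorm(2, 8)
x = torch.randn(2, 8, 4, 4, requires_grad=True)  # channels-first input

y = net(x)
# Case 2: the incoming gradient is channels last (e.g. produced by a later
# channels-last op), while the forward input was channels first.
g = torch.randn_like(y).to(memory_format=torch.channels_last)
y.backward(g)  # the backward formula aligns g with x's layout first
```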
@CaoE, there are more than two cases. One should not make any assumptions about tensor layout in either the forward or the backward pass; i.e., sometimes the same operator can get inputs as channels first during the first call but as channels last during the second, and it should cast them to the most appropriate format. I might be wrong, but it feels like the current CPU implementation treats …
Yes, if the layouts of the input and the grad input are not the same, it will produce incorrect results. So we should align the layouts of the input and the grad input; this will not undo the perf work when both are already channels last.
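The no-copy claim can be checked directly: .contiguous(memory_format=...) is expected to return the tensor itself when the layout already matches, so aligning already-channels-last grads should cost nothing.

```python
import torch

g = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
aligned = g.contiguous(memory_format=torch.channels_last)
print(aligned is g)  # expected: True -- no copy when the layout matches

g_cf = torch.randn(2, 8, 4, 4)  # channels first
realigned = g_cf.contiguous(memory_format=torch.channels_last)
print(realigned is g_cf)  # expected: False -- this case pays for a copy
```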
@malfet Could you please review this again? Thank you.
@malfet Could you please review this PR?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job failed: trunk / macos-12-py3-arm64 / test (default, 1, 2, macos-m1-12). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes the skipped part of the test in #92671. Aligns the input memory format and the grad memory format for GroupNorm backward.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10