Use to() instead of contiguous() to generate channels last tensor for Intel XPU #161041

NeoZhangJianyu · 2025-08-20T06:06:29Z

Fixes #95693 for the Intel XPU case.

In some cases of conv backward, the output grad tensor has an abnormal stride.
For example, with output_grad.shape = torch.Size([1, 1, 2, 2]):
Normal stride: [4, 4, 2, 1]
Abnormal stride: [4, 1, 2, 1]

This leads to a wrong conv backward result on Intel XPU.
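
To make the two stride patterns concrete, here is a minimal pure-Python sketch that computes the element strides of a 4-D NCHW shape under the row-major (contiguous) layout and under the channels-last layout. The helper names `contiguous_strides` and `channels_last_strides` are hypothetical and exist only for this illustration; they are not part of the PR. For the shape above, the "abnormal" stride happens to be the one the channels-last formula produces.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for the given shape."""
    strides = [1] * len(shape)
    # Walk from the second-to-last dimension back to the first.
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * max(shape[i + 1], 1)
    return strides


def channels_last_strides(shape):
    """Strides (in elements) of an NCHW-shaped tensor stored as NHWC."""
    n, c, h, w = shape
    return [h * w * c, 1, w * c, c]


shape = [1, 1, 2, 2]
normal = contiguous_strides(shape)       # the "normal" stride from the description
abnormal = channels_last_strides(shape)  # matches the "abnormal" stride
print(normal, abnormal)
```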

Solution:

Following the approach of "Use .to instead of contiguous to generate channels last tensor" (#96791), use to() to correct the stride in the channels-last case.
Note: to() can't replace contiguous() for making the weight tensor contiguous. See https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cudnn/ConvShared.cpp#L772
Add a unit test case.

This PR is rebased on old PR #160606 with more review comments.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @gujinghui @EikanWang @fengyuan14 @guangyey

pytorch-bot · 2025-08-20T06:06:33Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161041


❌ 27 New Failures

As of commit 986b8c1 with merge base 2b62ef7:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner-clang / linux-job (gh)
>>> Lint for aten/src/ATen/native/mkldnn/xpu/Conv.cpp:
pull / linux-docs / build-docs-cpp-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-docs / build-docs-functorch-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-docs / build-docs-python-false (gh)
pull / linux-jammy-cuda12.8-cudnn9-py3.10-clang12 / build (gh)
pull / linux-jammy-py3.10-clang12 / test (crossref, 1, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (crossref, 2, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 1, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 2, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 3, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 4, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (default, 5, 5, lf.linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (dynamo_wrapped, 1, 3, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (dynamo_wrapped, 2, 3, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (dynamo_wrapped, 3, 3, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang12 / test (einops, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (backwards_compat, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 1, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 2, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 3, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 4, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (default, 5, 5, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (distributed, 1, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (distributed, 2, 2, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (docs_test, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (jit_legacy, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-gcc11 / test (numpy_2_x, 1, 1, lf.linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

aten/src/ATen/native/mkldnn/xpu/Conv.cpp

EikanWang · 2025-08-22T20:59:57Z

@NeoZhangJianyu , you can invoke as_strided to avoid a memory copy when the tensor shape is [1, 1, 2, 2] and the stride is [4, 1, 2, 1], because it is safe to change the stride to [4, 4, 2, 1].
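
As a small illustration of why the two strides are interchangeable for this particular shape, the sketch below enumerates every valid index and computes its element offset under both strides. The helper `element_offsets` is a hypothetical name used only here. Because dimensions of size 1 are only ever indexed at 0, their strides never contribute to an offset.

```python
from itertools import product


def element_offsets(shape, strides):
    """Element offset of every index, visited in row-major index order."""
    offsets = []
    for idx in product(*(range(d) for d in shape)):
        offsets.append(sum(i * s for i, s in zip(idx, strides)))
    return offsets


shape = [1, 1, 2, 2]
# Both strides address the same four elements in the same order,
# so re-striding this shape does not require moving any data.
same = element_offsets(shape, [4, 4, 2, 1]) == element_offsets(shape, [4, 1, 2, 1])
print(same)
```

This only holds when the dimensions whose strides differ have size 1; for a general shape the two layouts place elements at different offsets.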

NeoZhangJianyu · 2025-08-25T05:52:39Z

> @NeoZhangJianyu , you can invoke as_strided to avoid a memory copy when the tensor shape is [1, 1, 2, 2] and the stride is [4, 1, 2, 1], because it is safe to change the stride to [4, 4, 2, 1].

as_strided() sets a new stride on the tensor,
so it needs the new stride values as input.

However, there is currently no single function in PyTorch that computes the correct stride for a tensor.
It's hard to write a new function that computes the stride correctly.

So it's very hard to call as_strided() to correct the stride.
Using to() is a simple way to correct the stride in this case. It is also the solution used on the CUDA side.

Thank you!

NeoZhangJianyu · 2025-09-03T07:46:00Z

Updated the solution description:

Following the solution in #96791, use to() to correct the stride in the channels-last case.

Note: to() can't replace contiguous() for making the weight tensor contiguous. See https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cudnn/ConvShared.cpp#L772

EikanWang · 2025-09-10T05:45:55Z

@NeoZhangJianyu , could you help rebase this PR and fix the linter issue?

Copilot

Pull Request Overview

Fixes an issue where convolution backward operations on Intel XPU fail with tensors having abnormal stride patterns. The solution addresses cases where output gradient tensors have incorrect stride orders that lead to wrong results.

Adds stride validation to detect non-decreasing stride patterns
Uses to() method to correct stride for channel-last format before falling back to contiguous()
Includes comprehensive test case to verify the fix
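
The body of `is_stride_decrease_order` is not shown in this thread, so the following is only a sketch of what such a stride-order check might look like, written in Python for brevity. Both function names are hypothetical, and the real C++ helper may differ. The second function ignores dimensions of size 1, whose strides do not affect addressing.

```python
def strides_non_increasing(strides):
    """True if each stride is >= the stride of the next dimension."""
    return all(a >= b for a, b in zip(strides, strides[1:]))


def strides_non_increasing_ignoring_size1(shape, strides):
    """Same check, but skipping dimensions of size 1."""
    kept = [s for d, s in zip(shape, strides) if d != 1]
    return strides_non_increasing(kept)


shape = [1, 1, 2, 2]
print(strides_non_increasing([4, 4, 2, 1]))  # normal stride passes
print(strides_non_increasing([4, 1, 2, 1]))  # abnormal stride is flagged
print(strides_non_increasing_ignoring_size1(shape, [4, 1, 2, 1]))
```

Under the assumption that the check resembles the first function, the abnormal stride from the PR description is flagged and would be routed through to(), while the second function shows that the same tensor is ordered once size-1 dimensions are ignored.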

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
aten/src/ATen/native/mkldnn/xpu/Conv.cpp	Implements stride validation function and applies stride correction logic in convolution backward
test/xpu/test_conv.py	Adds test case that creates abnormal stride patterns and verifies backward pass correctness


Copilot · 2025-09-10T06:06:12Z

aten/src/ATen/native/mkldnn/xpu/Conv.cpp

+
+  if (!is_stride_decrease_order(grad_output_))
+    grad_output_ = grad_output_.to(mfmt);
+


[nitpick] There are unnecessary blank lines (618 and 621) that break the logical flow of the conditional checks. Remove these extra blank lines to improve code readability.

Suggested change

  if (!is_stride_decrease_order(grad_output_))
    grad_output_ = grad_output_.to(mfmt);

github-actions · 2025-11-09T06:40:37Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

use to() to correct the abnormal stride

29eca7f

NeoZhangJianyu requested review from EikanWang and gujinghui as code owners August 20, 2025 06:06

pytorch-bot bot added the module: cpu (CPU specific problem, e.g., perf, algorithm) label Aug 20, 2025

NeoZhangJianyu marked this pull request as draft August 20, 2025 06:06

pytorchbot added the open source label Aug 20, 2025

NeoZhangJianyu closed this Aug 20, 2025

NeoZhangJianyu reopened this Aug 21, 2025

ZhiweiYan-96 added the topic: not user facing (topic category), module: xpu (Intel XPU related issues), and ciflow/xpu (Run XPU CI tasks) labels Aug 21, 2025

Skylion007 reviewed Aug 21, 2025


check abnormal stride before execute to()

986b8c1

pytorch-bot bot removed the ciflow/xpu (Run XPU CI tasks) label Aug 29, 2025

EikanWang requested a review from Copilot September 10, 2025 06:05

Copilot AI reviewed Sep 10, 2025

github-actions bot added the Stale label Nov 9, 2025

5 participants