Skip to content

Conversation

@NeoZhangJianyu
Copy link

@NeoZhangJianyu NeoZhangJianyu commented Aug 20, 2025

Fixes #95693 for Intel XPU case.

In same cases of Conv Backward, the output grad tensor's stride will be abnormal format.
Like output_grad.shape = torch.Size([1, 1, 2, 2])
Normal stride: [4, 4, 2, 1]
Abnormal stride: [4, 1, 2, 1]

That will lead to the wrong result of conv backward for Intel XPU.

Solution:

This PR is rebased on old PR #160606 with more review comments.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @gujinghui @EikanWang @fengyuan14 @guangyey

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161041

Note: Links to docs will display an error until the docs builds have been completed.

❌ 27 New Failures

As of commit 986b8c1 with merge base 2b62ef7 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Aug 20, 2025
@NeoZhangJianyu NeoZhangJianyu marked this pull request as draft August 20, 2025 06:06
@ZhiweiYan-96 ZhiweiYan-96 added topic: not user facing topic category module: xpu Intel XPU related issues ciflow/xpu Run XPU CI tasks labels Aug 21, 2025
@EikanWang
Copy link
Collaborator

@NeoZhangJianyu , you can invoke as_stride to avoid doing memory copy if the tensor shape is [1, 1, 2, 2] while the stride is [4, 1, 2, 1]. Because it is safe to change the stride to [4, 4, 2, 1].

@NeoZhangJianyu
Copy link
Author

NeoZhangJianyu commented Aug 25, 2025

@NeoZhangJianyu , you can invoke as_stride to avoid doing memory copy if the tensor shape is [1, 1, 2, 2] while the stride is [4, 1, 2, 1]. Because it is safe to change the stride to [4, 4, 2, 1].

as_strided() is used to set the new stride to tensor.
It needs to input a new stride value.

But there is no existed single function to create the correct stride of a tensor in current pytorch code.
It's hard to define new function to create the stride correctly.

So, it's very hard to call as_strided() to correct the stride.
Use to() is a simple method to correct the stride in this case. It's also solution in CUDA part.

Thank you!

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Aug 29, 2025
@NeoZhangJianyu
Copy link
Author

Updated the solution description:

Refer to the solution of #96791, use to() to correct the stride for channel last case.

Note: to() can't replace the contiguous() to make the tensor weights are contiguous. Refer to https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cudnn/ConvShared.cpp#L772

@EikanWang
Copy link
Collaborator

@NeoZhangJianyu , could you help rebase this PR and fix the linter issue?

@EikanWang EikanWang requested a review from Copilot September 10, 2025 06:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Fixes an issue where convolution backward operations on Intel XPU fail with tensors having abnormal stride patterns. The solution addresses cases where output gradient tensors have incorrect stride orders that lead to wrong results.

  • Adds stride validation to detect non-decreasing stride patterns
  • Uses to() method to correct stride for channel-last format before falling back to contiguous()
  • Includes comprehensive test case to verify the fix

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
aten/src/ATen/native/mkldnn/xpu/Conv.cpp Implements stride validation function and applies stride correction logic in convolution backward
test/xpu/test_conv.py Adds test case that creates abnormal stride patterns and verifies backward pass correctness

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +618 to +621

if (!is_stride_decrease_order(grad_output_))
grad_output_ = grad_output_.to(mfmt);

Copy link

Copilot AI Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] There are unnecessary blank lines (618 and 621) that break the logical flow of the conditional checks. Remove these extra blank lines to improve code readability.

Suggested change
if (!is_stride_decrease_order(grad_output_))
grad_output_ = grad_output_.to(mfmt);
if (!is_stride_decrease_order(grad_output_))
grad_output_ = grad_output_.to(mfmt);

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Contributor

github-actions bot commented Nov 9, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Nov 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module: cpu CPU specific problem (e.g., perf, algorithm) module: xpu Intel XPU related issues open source Stale topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AssertionError: expected size 64==64, stride 16384==1 at dim=1

5 participants