
Conversation

@chunyuan-w (Collaborator) commented on Sep 3, 2024


Description

Fixes the FP32 accuracy failure of `resmlp_12_224` and the BF16 accuracy failure of `volo_d1_224` in timm.

In this PR, we check whether the input is contiguous as follows:
If it has a `FixedLayout`, we know the exact strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can use the buffer's fill order to decide whether it's contiguous. In all other cases, we don't use the GEMM template, since we can't infer whether the input is contiguous.
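To make this concrete, below is a minimal sketch of such a check (an illustration, not the exact code added in this PR). The function name `is_contiguous_input`, the `x.data` access, and the assumption that the first entry of `get_fill_order()` names the dimension that ends up with stride 1 are all illustrative assumptions:

```python
from torch._inductor import ir

def is_contiguous_input(x):
    # Sketch of the decision described above, not the PR's actual code.
    layout = x.get_layout()
    if isinstance(layout, ir.FixedLayout):
        # Strides are already decided, so we can check them directly.
        return layout.stride[-1] == 1
    if isinstance(layout, ir.FlexibleLayout) and isinstance(x.data, ir.ComputedBuffer):
        # Strides are not decided yet; fall back to the buffer's fill order.
        # Assumption: the first entry of the fill order is the dimension that
        # will get stride 1, so a contiguous buffer fills its last dim first.
        fill_order = x.data.get_fill_order()
        return bool(fill_order) and fill_order[0] == len(layout.size) - 1
    # Otherwise we can't infer contiguity, so skip the GEMM template.
    return False
```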

Additional context

The current GEMM template only supports the case where `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we reach this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input, which is a `View` IR, the `convert_to_reinterpret_view` call fails:

pytorch/torch/_inductor/ir.py, lines 4712 to 4715 in d14fe3f:

    try:
        return cls.convert_to_reinterpret_view(x)
    except NotImplementedError:
        pass

It finally falls through to this `copy_input` call and returns a `FlexibleLayout`:

    return cls.copy_input(x)

When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1`, but it is later decided to be a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing the accuracy issue in this model.
The `FlexibleLayout` is converted to a `FixedLayout` during `CppPackedGemmTemplate.add_choices`, which calls `slice_nd` when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, `as_storage_and_layout` invokes `decide_layout` and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.
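As a standalone illustration of why that layout is rejected (this snippet is not from the PR; it just reproduces the same shape/stride combination with a plain transposed tensor):

```python
import torch

# A (3072, 196) tensor with stride (1, 3072) is just the transpose of a
# contiguous (196, 3072) tensor: its last dimension steps by 3072 elements,
# so the GEMM template's `input.get_stride()[-1] == 1` assumption fails.
x = torch.empty(196, 3072).t()
print(x.shape, x.stride())   # torch.Size([3072, 196]) (1, 3072)
print(x.stride(-1) == 1)     # False -> not supported by the GEMM template
```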

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Sep 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134982

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8f2b3f8 with merge base 217ba7b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

chunyuan-w added commits that referenced this pull request (Sep 4 to Sep 6, 2024)
@chunyuan-w marked this pull request as draft on September 4, 2024 14:02
@chunyuan-w changed the title from "[inductor] [cpp] input should be FixedLayout when checking its stride" to "[inductor] [cpp] always converts input to contiguous for max-autotune" on Sep 6, 2024
@chunyuan-w changed the title from "[inductor] [cpp] always converts input to contiguous for max-autotune" to "[inductor] [cpp] fix the input contiguous check in max-autotune" on Sep 7, 2024
@chunyuan-w (Collaborator, Author) commented:

@pytorchbot merge

@pytorchmergebot commented:

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

    )

    def is_last_dim_stride1(x):
        if isinstance(x.layout, ir.FixedLayout):
@jansel (Contributor) commented on Sep 8, 2024:

If this is failing, should we be calling freeze_layout? Flexible layouts are allowed to change, so any check you do could become false later on. If you have a FlexibleLayout you can force the last dim to be stride=1 without a copy.
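A rough sketch of this suggestion, for illustration only (it assumes the buffer exposes a `freeze_layout()` that defaults to contiguous strides; the real Inductor API and the PR's final code may differ):

```python
from torch._inductor import ir

def ensure_last_dim_stride1(x):
    # Reviewer's idea: instead of checking a FlexibleLayout (which may still
    # change), freeze it now so the last dim is guaranteed to have stride 1.
    layout = x.get_layout()
    if isinstance(layout, ir.FlexibleLayout):
        # Assumed behavior: freezing picks contiguous strides, so the layout
        # can no longer be decided as e.g. (1, 3072) later on.
        x.freeze_layout()
        layout = x.get_layout()
    return isinstance(layout, ir.FixedLayout) and layout.stride[-1] == 1
```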

@chunyuan-w (Collaborator, Author) replied:

Thanks for the suggestion. I've updated the code. Could you help take another look?

chunyuan-w added a commit that referenced this pull request Sep 9, 2024
@chunyuan-w requested a review from jansel on September 9, 2024 06:17
@chunyuan-w (Collaborator, Author) commented:

@pytorchbot merge

@pytorchmergebot commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

chunyuan-w added a commit to chunyuan-w/pytorch that referenced this pull request Sep 10, 2024
chunyuan-w added a commit to chunyuan-w/pytorch that referenced this pull request Sep 11, 2024
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
kit1980 pushed a commit that referenced this pull request Sep 20, 2024

[inductor] [cpp] fix the input contiguous check in max-autotune (#134982)


Pull Request resolved: #134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
github-actions bot deleted the gh/chunyuan-w/26/head branch on October 12, 2024 02:05