
Conversation

Valentine233
Collaborator

@Valentine233 Valentine233 commented Mar 25, 2024


pytorch-bot bot commented Mar 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122599

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 13c5c18 with merge base 03a05e7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Valentine233 Valentine233 added the topic: not user facing label Mar 25, 2024
@Valentine233 Valentine233 marked this pull request as draft March 25, 2024 11:09
@Valentine233 Valentine233 marked this pull request as ready for review March 29, 2024 05:58
Comment on lines 95 to 109
if (
    isinstance(t.data, ir.View)
    and isinstance(t.data.data, ir.PermuteView)
    and t.data.data.dims == [0, 3, 1, 2]
):
    t = ir.Pointwise.create(
        device=t.get_device(),
        dtype=t.get_dtype(),
        inner_fn=t.make_loader(),
        ranges=t.get_size(),
        origin_node=t.get_origin_node(),
        traceback=t.get_traceback(),
    )
    t.realize()
    t.freeze_layout()
Collaborator

Several issues with this change:

  1. It only handles a specific case where the tensor is 4D and permuted in one particular order. Can we make it general? Basically, what we want is a particular order of the last two dims, right?
  2. Related to 1, bmm can actually handle non-contiguous inputs, including inputs transposed in the last two dims: it only requires one of those dims to be contiguous, while the other can have a stride larger than the size of the contiguous one. Is it too strict to always force a contiguous layout here?
  3. Perhaps we can call require_stride_order instead of re-implementing copy_input here? (See the sketch below.)
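For illustration, a minimal sketch of that suggestion, assuming the ExternKernel.require_stride_order / require_contiguous helpers in torch/_inductor/ir.py realize and copy only when the requested stride order is not already satisfied; the wrapper name is hypothetical and this is not the PR's code:

```python
from torch._inductor import ir


def require_bmm_friendly_layout(t):
    # Enforce a C-contiguous stride order on the buffer; inductor inserts a
    # copy only if the existing (or still-flexible) layout cannot satisfy it.
    # ir.ExternKernel.require_contiguous(t) is the same thing spelled directly.
    contiguous_order = list(reversed(range(len(t.get_size()))))  # e.g. [3, 2, 1, 0] for 4D
    return ir.ExternKernel.require_stride_order(t, contiguous_order)
```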

Collaborator Author

Thanks for your suggestions!

A new function, is_pointwise_contiguous_or_transposed_after_perm, is added to deduce the layout of a Pointwise node from its readers, because the node has not been realized yet.

@janeyx99 janeyx99 added the triaged label Apr 1, 2024
@Valentine233 Valentine233 requested a review from jgong5 April 2, 2024 10:29
@Valentine233
Collaborator Author

@leslie-fang-intel @jgong5 Hi, I made some code changes. Please review again :)

@Valentine233 Valentine233 added the ciflow/trunk label Apr 10, 2024
Comment on lines 5044 to 5045
self.assertEqual(out_expected, out_actual)
self.assertEqual(out_expected.stride(), out_actual.stride())
Collaborator

For this particular case, does the test pass with or without the PR change?

Collaborator Author

Yes. This is just to check the accuracy with the PR; I will remove it.
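For readers without the full diff, a stride-comparison test of this shape might look roughly like the sketch below; the function, shapes, and names are illustrative, not the PR's actual test:

```python
import torch


def check_bmm_output_strides():
    # Compare eager vs. compiled results, including output strides, for a bmm
    # whose first input is a permuted (non-contiguous) view.
    def fn(a, b):
        return torch.bmm(a.permute(0, 2, 1), b)

    a = torch.randn(8, 16, 32)
    b = torch.randn(8, 16, 64)
    out_expected = fn(a, b)
    out_actual = torch.compile(fn)(a, b)
    torch.testing.assert_close(out_expected, out_actual)
    assert out_expected.stride() == out_actual.stride()
```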

return t

if all(x.get_device().type == "cpu" for x in [mat1, mat2]):
    if not ir.is_storage_and_layout(mat1):
Collaborator

What if mat1 or mat2 has a flexible layout? Shall we also apply similar logic there?

Collaborator Author

Thanks and modified!
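A rough sketch of "apply the same logic to both inputs" (illustrative only: maybe_fix_bmm_inputs, meta_mat1, and meta_mat2 are assumed names, and may_require_contiguous is the helper discussed further down in this conversation):

```python
from torch._inductor import ir


def maybe_fix_bmm_inputs(mat1, mat2, meta_mat1, meta_mat2):
    # Only touch buffers whose layout is still undecided: unrealized nodes and
    # realized buffers carrying a FlexibleLayout. Fixed layouts are left alone.
    def has_flexible_layout(t):
        if not ir.is_storage_and_layout(t):
            return True
        _, layout = ir.as_storage_and_layout(t, freeze=False)
        return isinstance(layout, ir.FlexibleLayout)

    if has_flexible_layout(mat1):
        mat1 = may_require_contiguous(mat1, meta_mat1)  # helper sketched later
    if has_flexible_layout(mat2):
        mat2 = may_require_contiguous(mat2, meta_mat2)
    return mat1, mat2
```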

Comment on lines 92 to 94
# Make the inputs of bmm contiguous
# because bmm cpu implementation does contiguous() if not
# this is to avoid additional copies in bmm
Collaborator

Is this comment accurate? If the input is transposed in the last two dims, would bmm still make the inputs contiguous?
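As a quick eager-mode illustration of the point under discussion: torch.bmm accepts a CPU input that is transposed in its last two dims without the caller making it contiguous; whether the kernel then copies internally is exactly the question raised above.

```python
import torch

a = torch.randn(4, 8, 16)
b = torch.randn(4, 8, 32)
a_t = a.transpose(1, 2)  # shape (4, 16, 8), non-contiguous
out = torch.bmm(a_t, b)  # no explicit .contiguous() needed by the caller
print(a_t.is_contiguous(), out.shape)  # False torch.Size([4, 16, 32])
```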

Collaborator Author

Thanks and modified!

)


def is_contiguous_or_transposed(sizes, strides):
Collaborator

Transposition can happen in any pair of dims, not necessarily the last two. Also, you are checking stride >= size, not stride == size. The function name doesn't seem to match the implementation here. Perhaps it would be clearer to just inline the implementation in the tuned_bmm code instead of factoring out a util function. (See the sketch below.)
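To make that concrete, a check restricted to the last two dims with exact stride comparisons could look like the following sketch (illustrative only, not the merged code):

```python
def is_last_two_dims_contiguous_or_transposed(sizes, strides):
    # Contiguous: the last dim has stride 1 and the second-to-last dim strides
    # over exactly one row. Transposed: the same condition with the two dims swapped.
    contiguous = strides[-1] == 1 and strides[-2] == sizes[-1]
    transposed = strides[-2] == 1 and strides[-1] == sizes[-2]
    return contiguous or transposed
```

For example, it returns True both for a contiguous (B, M, K) tensor and for the transpose(1, 2) view of a contiguous (B, K, M) tensor.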

Collaborator Author

Thanks and modified!

# Make the inputs of bmm contiguous
# because bmm cpu implementation does contiguous() if not
# this is to avoid additional copies in bmm
def do_bmm_input_contiguous(t, meta_t):
Collaborator

How about may_require_contiguous?
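For illustration, the renamed helper might take roughly this shape (a hedged sketch: deciding from the meta tensor's sizes/strides and reusing the last-two-dims check sketched earlier are assumptions, not necessarily the merged implementation):

```python
def may_require_contiguous(t, meta_t):
    # Force a contiguous layout only when the (meta) input is neither contiguous
    # nor transposed in its last two dims; otherwise bmm's CPU kernel can consume
    # it as-is and no extra copy is needed.
    sizes, strides = meta_t.size(), meta_t.stride()
    if not is_last_two_dims_contiguous_or_transposed(sizes, strides):
        t = ir.ExternKernel.require_contiguous(t)
    return t
```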

Collaborator Author

Thanks and applied!

@Valentine233 Valentine233 requested a review from jgong5 April 12, 2024 02:10
if not ir.is_storage_and_layout(t):
    return True
_, layout = ir.as_storage_and_layout(t, freeze=False)
return not isinstance(layout, ir.FixedLayout)
Collaborator

Why not check ir.FlexibleLayout directly?

Collaborator Author

Modified to ir.FlexibleLayout.

@jgong5 jgong5 requested a review from eellison April 15, 2024 03:11
@Valentine233
Collaborator Author

@eellison Hi, please help review the PR. Thanks!

@Valentine233
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Fixes pytorch#117743.

Add contiguous layout optimization for `bmm` input, to avoid additional copies.

Pull Request resolved: pytorch#122599
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
Fixes #117743.

Add contiguous layout optimization for `bmm` input, to avoid additional copies.

Pull Request resolved: #122599
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
@github-actions github-actions bot deleted the contiguous_node branch May 31, 2024 01:55

Labels

ciflow/inductor, ciflow/trunk, Merged, module: inductor, open source, topic: not user facing, triaged


Development

Successfully merging this pull request may close these issues.

[inductor][cpu]cait_m36_384 fp32 static shape default wrapper single thread performance regression

7 participants