
[einsum] Call view instead of sum to remediate MPS regression #87135

Closed
wants to merge 3 commits

Conversation

@janeyx99 janeyx99 (Contributor) commented Oct 17, 2022

Fixes #87010.

It turns out that squeeze is much faster than sum, and view is faster than squeeze, so we should default to view whenever possible.

Benchmarking results show that, on MPS, the following code now takes 29.89ms instead of the current 1466ms, almost a 50x speedup:

```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k).max().item()
```

And a regular einsum will now take 0.506ms instead of 2.76ms:

```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k)
```

Special thanks to @soulitzer for helping me experiment + figure out how to squash the remaining 5x regression due to squeeze being slower than view!!
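For readers curious why the swap is safe: by the time einsum finishes its pairwise contractions, every dimension that was summed out is left with size 1, so dropping it with `view` gives identical results to `sum` without dispatching a reduction kernel. A minimal sketch (my illustration, not part of the diff):

```
import torch

x = torch.rand(16, 4096, 4096, 1)  # trailing size-1 dim, as left behind after contraction

summed = x.sum(dim=-1)           # launches a reduction kernel
viewed = x.view(16, 4096, 4096)  # only reinterprets metadata, no kernel launched

assert torch.equal(summed, viewed)
```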

@pytorch-bot pytorch-bot bot commented Oct 17, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87135

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 2 Failures

As of commit 5292c2c:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla linux-foundation-easycla bot commented Oct 17, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: janeyx99 / name: Jane (Yuan) Xu (a5f333d)

@janeyx99 (Contributor Author)

/easycla

2 similar comments
@janeyx99 (Contributor Author)

/easycla

@janeyx99 (Contributor Author)

/easycla

@janeyx99 (Contributor Author)

FYI @Birch-san

@Birch-san

brilliant!! thanks @janeyx99 for pursuing this.
if this is further patched with a mitigation for the change to path kwargs: would that bring us back to 1.12.1 perf, or would it be faster? I assume 1.12.1 was made without knowledge of this squeeze optimization?

@janeyx99 (Contributor Author)

@Birch-san Unfortunately this does not get us back to 1.12.1 perf just yet (see disclaimer); there is still a ~4-5x regression, since this code was introduced to allow for flexible ordering of contractions.
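(A hedged aside, not from the thread: the flexible contraction ordering referred to here is presumably the opt_einsum integration shipping in 1.13, which is exposed through `torch.backends.opt_einsum`. A sketch of those knobs, assuming the opt_einsum package is installed:)

```
import torch

print(torch.backends.opt_einsum.is_available())  # whether opt_einsum can be used
print(torch.backends.opt_einsum.enabled)         # True by default
torch.backends.opt_einsum.strategy = "greedy"    # contraction-path strategy: "auto", "greedy", or "optimal"
```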

@Birch-san

yes, understood 🙂 and good progress nonetheless.
okay, so whatever happens from here has to work within the constraints of keeping support for flexible ordering? so a code path to use the old algorithm isn't possible, but something else might be possible?

@janeyx99 (Contributor Author)

I am attempting to implement such a code path, but doing it elegantly is the challenge 😛

@Birch-san

🙇

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 18, 2022
@janeyx99 (Contributor Author)

@pytorchbot merge -r

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased replace-sum-with-squeeze onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout replace-sum-with-squeeze && git pull --rebase)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 9 additional jobs have failed; the first few are: trunk, trunk / android-emulator-build-test / build-and-test, trunk / ios-12-5-1-x86-64 / build, trunk / macos-12-py3-x86-64 / build, trunk / macos-12-py3-x86-64-lite-interpreter / build

Details for Dev Infra team. Raised by workflow job.

@janeyx99 (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@janeyx99 janeyx99 changed the title [einsum] Call squeeze instead of sum to remediate MPS regression [einsum] Call view instead of sum to remediate MPS regression Oct 18, 2022
@janeyx99 (Contributor Author)

@Birch-san this PR should now get us back to 1.12.1 perf! Feel free to verify! Once this lands, I will be trying to get it into our release candidate.

@malfet malfet added this to the 1.13.0 milestone Oct 18, 2022
@malfet malfet (Contributor) commented Oct 18, 2022

squeeze is indeed much faster than sum, as it is essentially a no-op (it just creates a view of the tensor), but it is not equivalent to sum for any dimension whose size is not 1
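A quick illustration of that caveat (mine, not from the thread):

```
import torch

x = torch.rand(3, 1, 5)
# When the reduced dim has size 1, sum and squeeze agree:
assert torch.equal(x.sum(dim=1), x.squeeze(1))

y = torch.rand(3, 2, 5)
# For any other size, squeeze is a no-op while sum actually reduces:
print(y.sum(dim=1).shape)    # torch.Size([3, 5])
print(y.squeeze(1).shape)    # torch.Size([3, 2, 5])
```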

@Birch-san

outstanding! thanks so much.

by 1.12.1 perf, which side of #85297 (comment) are we talking? a .clone() was introduced post-1.12.1 for correctness in some situations, which regressed einsum perf by 54% (overall 5~6% slowdown in image generation). presumably that's still there?

either way, this certainly sounds like it fixes the path kwargs regression.

and regarding the release candidate: any idea whether the einsum correctness fix for #85224 will make it in? from @pcuenca's testing it sounds like it's not included? huggingface/diffusers#372 (comment)

From aten/src/ATen/native/Linear.cpp:

```
std::vector<int64_t> sum_dims(perm_index - out_num_dim);
std::iota(sum_dims.begin(), sum_dims.end(), out_num_dim);
ops[0] = ops[0].sum(sum_dims);
if (num_ops > 1) {
```
Collaborator

This `if` is for the special case where there is a single input: all the code above was a no-op, so we need to actually do the reduction here.
For all the other cases (num_ops > 1), we are guaranteed from the code above that all reduced dimensions are now of size 1 and can thus just be viewed the way we want?
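(A rough illustration of the two cases, mine rather than the reviewer's:)

```
import torch

x = torch.rand(5, 7)
y = torch.rand(3, 7)

# Single operand: nothing earlier has contracted anything, so the reduction
# over j really has to be a sum here.
assert torch.allclose(torch.einsum('ij->i', x), x.sum(dim=1))

# More than one operand: j is already reduced inside the matmul, so the only
# cleanup left is dropping size-1 dims, which a view can do.
assert torch.allclose(torch.einsum('ij,kj->ik', x, y), x @ y.T)
```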

Contributor Author

Precisely

Collaborator
A comment would be great for future readers.

Contributor Author
oh shoot i just saw this, will add in another pr

@janeyx99 (Contributor Author)

@Birch-san I wasn't aware of the other perf regression, but it would be interesting to do that benchmark.

With regards to the correctness, I believe #85689 is already in https://hud.pytorch.org/hud/pytorch/pytorch/release%2F1.13/8?per_page=50

@albanD albanD (Collaborator) left a comment

SGTM

@janeyx99 (Contributor Author)

Ah, the test_proxy_tensor failure is a real one. @albanD, might you have a workaround idea?

```
======================================================================
ERROR: test_make_fx_symbolic_exhaustive_einsum_cpu_float32 (__main__.TestProxyTensorOpInfoCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/janeyx/pytorch/torch/testing/_internal/common_device_type.py", line 391, in instantiated_test
    raise rte
  File "/Users/janeyx/pytorch/torch/testing/_internal/common_device_type.py", line 378, in instantiated_test
    result = test(self, **param_kwargs)
  File "/Users/janeyx/pytorch/torch/testing/_internal/common_device_type.py", line 824, in test_wrapper
    return test(*args, **kwargs)
  File "/Users/janeyx/pytorch/test/test_proxy_tensor.py", line 1361, in test_make_fx_symbolic_exhaustive
    _test_make_fx_helper(self, device, dtype, op, "symbolic")
  File "/Users/janeyx/pytorch/test/test_proxy_tensor.py", line 1332, in _test_make_fx_helper
    new_f = make_fx(f, tracing_mode=tracing_mode)(args, kwargs)
  File "/Users/janeyx/pytorch/torch/fx/experimental/proxy_tensor.py", line 663, in wrapped
    t = dispatch_trace(wrap_key(func, args, fx_tracer), tracer=fx_tracer, concrete_args=tuple(phs))
  File "/Users/janeyx/pytorch/torch/fx/experimental/proxy_tensor.py", line 413, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "/Users/janeyx/pytorch/torch/fx/_symbolic_trace.py", line 739, in trace
    (self.create_arg(fn(*args)),),
  File "/Users/janeyx/pytorch/torch/fx/_symbolic_trace.py", line 614, in flatten_fn
    tree_out = root_fn(*tree_args)
  File "/Users/janeyx/pytorch/torch/fx/experimental/proxy_tensor.py", line 427, in wrapped
    out = f(*tensors)
  File "/Users/janeyx/pytorch/test/test_proxy_tensor.py", line 1322, in f
    return op.op(*args, **kwargs)
  File "/Users/janeyx/pytorch/torch/testing/_internal/common_methods_invocations.py", line 13219, in <lambda>
    op=lambda tensors, equation: torch.einsum(equation, tensors),
  File "/Users/janeyx/pytorch/torch/functional.py", line 373, in einsum
    return einsum(equation, *_operands)
  File "/Users/janeyx/pytorch/torch/functional.py", line 378, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides
```
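(A minimal repro sketch of what this test exercises, adapted by me from the traceback above; the toy shapes and function are my own, not from the test suite. Under symbolic tracing, shapes flow through the graph as SymInts, and any code path that asks for concrete sizes() raises the RuntimeError shown.)

```
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(q, k):
    return torch.einsum('b i d, b j d -> b i j', q, k)

# tracing_mode="symbolic" is the mode used by test_make_fx_symbolic_exhaustive_einsum
traced = make_fx(f, tracing_mode="symbolic")(torch.rand(2, 3, 4), torch.rand(2, 5, 4))
print(traced.graph)
```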

@albanD albanD (Collaborator) left a comment

Comments for the symint support.

Review comments on aten/src/ATen/native/Linear.cpp (resolved; some outdated).
@pcuenca pcuenca commented Oct 18, 2022

This looks absolutely amazing @janeyx99, thanks a lot for the effort! Regarding non-determinism, I'll run some tests to try to understand the scope better.

@janeyx99 (Contributor Author)

@pytorchbot merge -f "preexisting failures"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions

Hey @janeyx99.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

janeyx99 added a commit that referenced this pull request Oct 19, 2022
Fixes #87010.

Pull Request resolved: #87135
Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD
atalman pushed a commit that referenced this pull request Oct 19, 2022
…path is None (#87261)

* [einsum] keep the promise that we contract left to right (#87199)

We promise that if path is not defined, we will go left to right. The previous code did not keep that promise, as we pushed combined ops to the back of the list. For most use cases this is fine (einsum with 3 or fewer inputs), but we should do what we say.

Test plan:
Added a print statement to print the sizes of ops we're contracting to see if the order is fixed. Code run:
```
import torch
a = torch.rand(1)
b = torch.rand(2)
c = torch.rand(3)
d = torch.rand(4)
torch.einsum('a,b,c,d->abcd', a,b,c,d)
```

BEFORE: it does a+b, then c+d, then (a+b)+(c+d), which is right, but it's not the order specified by the user.
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
```

WITH THIS CHANGE: it actually goes left to right: a+b, then a+b+c, then a+b+c+d
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
```
Pull Request resolved: #87199
Approved by: https://github.com/soulitzer

* [einsum] Call view instead of sum to remediate MPS regression (#87135)

Fixes #87010.

Pull Request resolved: #87135
Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD
pytorchmergebot pushed a commit that referenced this pull request Oct 24, 2022
Tiny followup from #87135 (comment)

and another typo i noticed while doing the autograd lab
Pull Request resolved: #87264
Approved by: https://github.com/soulitzer
sgrigory pushed a commit to sgrigory/pytorch that referenced this pull request Oct 28, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
@github-actions github-actions bot deleted the replace-sum-with-squeeze branch April 19, 2024 01:51
Labels
ciflow/trunk (Trigger trunk jobs on your pull request), Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MPS] einsum 42x slower since 1.13.0.dev20220925 — on (16, 4096, 40)*(16, 40, 4096) matmul
7 participants