[einsum] Call view instead of sum to remediate MPS regression #87135
Conversation
FYI @Birch-san
brilliant!! thanks @janeyx99 for pursuing this.
@Birch-san Unfortunately this does not get us back to 1.12.1 perf just yet (see disclaimer)--there is still a ~4-5x regression since this code was introduced to allow for flexible ordering of contractions.
yes, understood 🙂 and good progress nonetheless.
I am attempting to implement such a code path, but doing it elegantly is the challenge 😛
🙇
@pytorchbot merge -r
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased: a5f333d to 1d8caa6
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 9 additional jobs have failed; the first few are: trunk, trunk / android-emulator-build-test / build-and-test, trunk / ios-12-5-1-x86-64 / build, trunk / macos-12-py3-x86-64 / build, trunk / macos-12-py3-x86-64-lite-interpreter / build.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
The merge job was canceled. If you believe this is a mistake, you can re-trigger it through pytorch-bot.
@Birch-san this PR should now get us back to 1.12.1 perf! Feel free to verify! Once this lands, I will be trying to get it into our release candidate.
outstanding! thanks so much. by 1.12.1 perf, which side of #85297 (comment) are we talking? either way: this certainly sounds like it fixes the path kwargs regression. and regarding the release candidate: any idea whether the einsum correctness fix for #85224 will make it in? from @pcuenca's testing it sounds like it's not included? huggingface/diffusers#372 (comment)
```
std::vector<int64_t> sum_dims(perm_index - out_num_dim);
std::iota(sum_dims.begin(), sum_dims.end(), out_num_dim);
ops[0] = ops[0].sum(sum_dims);
if (num_ops > 1) {
```
This `if` is for the special case where there is a single input: all the code above was a no-op, so we need to actually do the reduction here.
For all the other cases (num_ops > 1), we are guaranteed from the code above that all reduced dimensions are now of size 1 and can thus just be viewed the way we want? (See the sketch at the end of this thread.)
Precisely
A comment would be great for future readers.
oh shoot i just saw this, will add in another pr
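To make the invariant in this thread concrete, here is a minimal sketch (my own illustration; the tensor and shapes are made up, not taken from the PR). Once the pairwise contractions have shrunk every reduced dimension to size 1, summing over those dimensions is numerically a no-op, so a metadata-only view can replace the sum kernel:
```
import torch

# Pretend dims 2 and 3 are the reduced dimensions; with more than one
# operand, the contraction code above guarantees they are already size 1.
t = torch.rand(2, 3, 1, 1)

summed = t.sum(dim=(2, 3))  # launches a reduction kernel, numerically a no-op
viewed = t.view(2, 3)       # only rewrites shape metadata, no kernel at all

assert torch.equal(summed, viewed)
```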
@Birch-san I wasn't aware of the other perf regression, but it would be interesting to do that benchmark. With regards to the correctness--I believe #85689 is already in https://hud.pytorch.org/hud/pytorch/pytorch/release%2F1.13/8?per_page=50
SGTM
Ah, the test_proxy_tensor failure is a real one. @albanD might you have a workaround idea?
Comments for the symint support.
This looks absolutely amazing @janeyx99, thanks a lot for the effort! Regarding non-determinism, I'll run some tests to try to understand the scope better.
@pytorchbot merge -f "preexisting failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Fixes #87010.

It turns out that squeeze is much faster than sum, and view is faster than squeeze, so we should default to view whenever possible.

Benchmarking results show that, on MPS, the following code now takes **29.89ms instead of the previous 1466ms, almost a 50x speedup**:
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k).max().item()
```
And a regular einsum will now take **0.506ms instead of 2.76ms**:
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k)
```
Special thanks to @soulitzer for helping me experiment + figure out how to squash the remaining 5x regression due to squeeze being slower than view!!

Pull Request resolved: #87135
Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD
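For anyone wanting to reproduce those numbers, here is a rough sketch of one way to time the first snippet (the `torch.utils.benchmark` harness is my own addition, not part of the PR, and it assumes an MPS-capable machine):
```
import torch
import torch.utils.benchmark as benchmark

q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)

# The trailing .max().item() forces a sync with the MPS device, so we
# measure the actual kernel work rather than just the async dispatch.
timer = benchmark.Timer(
    stmt="torch.einsum('b i d, b j d -> b i j', q, k).max().item()",
    globals={'torch': torch, 'q': q, 'k': k},
)
print(timer.timeit(10))
```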
…path is None (#87261)

* [einsum] keep the promise that we contract left to right (#87199)

We promise that if path is not defined, we would go left to right. The previous code did not keep that promise, as we pushed combined ops to the back of the list. For most use cases this is fine (einsum with 3 or fewer inputs), but we should do what we say.

Test plan: added a print statement to print the sizes of the ops we're contracting, to see whether the order is fixed. Code run:
```
import torch
a = torch.rand(1)
b = torch.rand(2)
c = torch.rand(3)
d = torch.rand(4)
torch.einsum('a,b,c,d->abcd', a,b,c,d)
```
BEFORE--it does a+b, then c+d, then a+b+c+d, which...is right, but it's not the order specified by the user:
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
```
WITH THIS CHANGE--it actually goes left to right: a+b, then a+b+c, then a+b+c+d:
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
```
Pull Request resolved: #87199
Approved by: https://github.com/soulitzer

* [einsum] Call view instead of sum to remediate MPS regression (#87135)

Fixes #87010. It turns out that squeeze is much faster than sum, and view is faster than squeeze, so we should default to view whenever possible.

Benchmarking results show that, on MPS, the following code now takes **29.89ms instead of the previous 1466ms, almost a 50x speedup**:
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k).max().item()
```
And a regular einsum will now take **0.506ms instead of 2.76ms**:
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k)
```
Special thanks to @soulitzer for helping me experiment + figure out how to squash the remaining 5x regression due to squeeze being slower than view!!

Pull Request resolved: #87135
Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD
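To spell out what "left to right" means for the four-operand example in that commit message, a small sketch (my own illustration of the promised ordering, not code from the PR):
```
import torch

a, b, c, d = (torch.rand(n) for n in (1, 2, 3, 4))

# Left-to-right pairwise contraction: fold in one operand at a time,
# i.e. ((a · b) · c) · d, matching the order #87199 now guarantees.
ab = torch.einsum('a,b->ab', a, b)          # contract a with b first
abc = torch.einsum('ab,c->abc', ab, c)      # then fold in c
abcd = torch.einsum('abc,d->abcd', abc, d)  # then d

assert torch.allclose(abcd, torch.einsum('a,b,c,d->abcd', a, b, c, d))
```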
Tiny followup from #87135 (comment) and another typo I noticed while doing the autograd lab.

Pull Request resolved: #87264
Approved by: https://github.com/soulitzer