Don't create large intermediary tensors in the backward of matmul #95261
Conversation
Currently, if we multiply a transposed batch of matrices with shape [b, m, n] and a matrix with shape [n, k], when computing the gradient of the matrix we instantiate an intermediary tensor of shape [b, n, k]. If the matrix is large, creating this unnecessary batch of matrices may be time- and memory-consuming. In this case, we instead fold the batch of matrices into a single matrix, which avoids creating any large intermediary tensor. Note that multiplying a batch of matrices and a matrix occurs naturally within an attention module, so this case surely happens in the wild.
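A minimal sketch of the idea (hypothetical shapes and names, not the actual code in this PR): rather than letting the backward materialize a broadcast [b, n, k] gradient for the matrix and then reduce it, the batch can be folded so the gradient is computed with a single 2-D matmul.

```python
import torch

b, m, n, k = 8, 128, 64, 32
batch = torch.randn(b, m, n, dtype=torch.float64)     # batch of matrices
mat = torch.randn(n, k, dtype=torch.float64)          # single shared matrix
grad_out = torch.randn(b, m, k, dtype=torch.float64)  # gradient of batch @ mat

# Naive backward: builds a [b, n, k] intermediary, then reduces over the batch.
grad_mat_naive = (batch.transpose(-2, -1) @ grad_out).sum(dim=0)

# Folded backward: one 2-D matmul over a [b*m, n] view, no [b, n, k] tensor.
grad_mat_folded = batch.reshape(-1, n).t() @ grad_out.reshape(-1, k)

torch.testing.assert_close(grad_mat_naive, grad_mat_folded)
```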
I'm willing to approve this to unblock, but to me all this really says is that we should just write the g-dang backwards formula for matmul by hand and call it a day.
Agree about the backward matmul. Note that the backward for matmul is simple.
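Roughly, and only as a sketch that assumes `self` and `other` have already been broadcast to the same batch shape (the real formula also has to handle the implicit reductions discussed below):

```python
def matmul_backward_sketch(grad, self, other):
    # grad has the shape of self @ other; broadcasting reductions are not handled here
    grad_self = grad @ other.transpose(-2, -1)
    grad_other = self.transpose(-2, -1) @ grad
    return grad_self, grad_other
```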
… matmul" Currently, if we multiply a transposed batch of matrices with shape [b, m, n] and a matrix with shape [n, k], when computing the gradient of the matrix, we instantiate a matrix of shape [b, n, k]. This may be a very large matrix. Instead, we fold the batch of matrices into a matrix, which avoids creating any large intermediary tensor. Note that multiplying a batch of matrices and a matrix naturally occurs within an attention module, so this case surely happens in the wild. In particular, this issue was found while investigating the OOMs caused by the improved folding algorithm in the next PR of this stack. See #76828 (comment) This PR fixes those OOMs and decreases the memory footprint of the backward of matmul. I understand this is a tricky one, so I put it on its own PR to discuss it. [ghstack-poisoned]
What we can do is try to land this one, and then I'll append a PR onto this stack making this function explicitly differentiable. Regardless, could any of you @ezyang @ngimel import this one? I reckon it may need some internal tweaking to be landed. On a different note, I believe this was the bug we were hitting that didn't allow us to land #75195. I'll revisit this as well on this same stack.
… matmul" Currently, if we multiply a transposed batch of matrices with shape [b, m, n] and a matrix with shape [n, k], when computing the gradient of the matrix, we instantiate a matrix of shape [b, n, k]. This may be a very large matrix. Instead, we fold the batch of matrices into a matrix, which avoids creating any large intermediary tensor. Note that multiplying a batch of matrices and a matrix naturally occurs within an attention module, so this case surely happens in the wild. In particular, this issue was found while investigating the OOMs caused by the improved folding algorithm in the next PR of this stack. See #76828 (comment) This PR fixes those OOMs and decreases the memory footprint of the backward of matmul. I understand this is a tricky one, so I put it on its own PR to discuss it. [ghstack-poisoned]
… matmul" Currently, if we multiply a transposed batch of matrices with shape [b, m, n] and a matrix with shape [n, k], when computing the gradient of the matrix, we instantiate a matrix of shape [b, n, k]. This may be a very large matrix. Instead, we fold the batch of matrices into a matrix, which avoids creating any large intermediary tensor. Note that multiplying a batch of matrices and a matrix naturally occurs within an attention module, so this case surely happens in the wild. In particular, this issue was found while investigating the OOMs caused by the improved folding algorithm in the next PR of this stack. See #76828 (comment) This PR fixes those OOMs and decreases the memory footprint of the backward of matmul. I understand this is a tricky one, so I put it on its own PR to discuss it. [ghstack-poisoned]
Note that we already have all the scaffolding for matmul to be explicitly differentiable; we even have a matmul_backward native function. It's just that only Nested uses it today. What I remember is that backends didn't want to see matmuls indeed. I don't know what the real BC guarantee here is, tbh, but maybe we should just make CPU/CUDA follow Nested and leave everything else as CompositeImplicit?
At least in PT2, backends that don't want to see matmul can always use the old decomp, no big deal.
Wouldn't the backward for matmul as written above still result in materialization of the big tensor if the inputs weren't flattened in the forward? Even worse, even if the inputs were flattened in the forward and matmul went through …
ugh, you are right. Let me think a bit about it.
… matmul" Currently, if we multiply a transposed batch of matrices with shape [b, m, n] and a matrix with shape [n, k], when computing the gradient of the matrix, we instantiate a matrix of shape [b, n, k]. This may be a very large matrix. Instead, we fold the batch of matrices into a matrix, which avoids creating any large intermediary tensor. Note that multiplying a batch of matrices and a matrix naturally occurs within an attention module, so this case surely happens in the wild. In particular, this issue was found while investigating the OOMs caused by the improved folding algorithm in the next PR of this stack. See #76828 (comment) This PR fixes those OOMs and decreases the memory footprint of the backward of matmul. I understand this is a tricky one, so I put it on its own PR to discuss it. [ghstack-poisoned]
I implemented another optimisation that @ngimel suggested and added a test for it. This optimisation could be marginally more general (removing several leading 1's from the shape), but I didn't bother implementing that, as I don't think it's that common in practice. I think this PR is ready for review as-is. And again, I think this PR should be landed via Phabricator. For the reasons given by @ngimel in #95261 (comment), let's shelve the "make matmul autograd-explicit" idea for now. There are still optimisations that may be implemented here and there, as pytorch/functorch#989 (comment) suggests. I will try to explore those in the future.
I don't think that is a problem. The AutogradEngine will handle the broadcasting for you if you didn't. But that doesn't mean you're not allowed to do it!
Yeah, the only thing is that one would need to be a bit more careful about how to implement it. If you think it's still worth having a look at, I'm happy to do so, but I think, for now, this and the next PR do a pretty good job of not creating large tensors / making unnecessary copies.
Right, but the backward formula as written won't do the implicit reduction (by substituting addmm instead of bmm); it will rely on the AutogradEngine to do that. This is not to say an explicit backward for matmul cannot be implemented, it can be, but unfortunately it's more involved than the neat formula above.
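For instance (a hypothetical helper, not part of this PR), an explicit backward would have to perform the kind of reduction that the AutogradEngine otherwise does implicitly, summing a broadcast gradient back down to the original input shape:

```python
import torch

def reduce_grad_to_shape(grad, shape):
    # Sum away the extra leading dimensions introduced by broadcasting...
    extra = grad.dim() - len(shape)
    if extra > 0:
        grad = grad.sum(dim=tuple(range(extra)))
    # ...and sum (keeping dims) over dimensions that were 1 in the original shape.
    dims = tuple(i for i, s in enumerate(shape) if s == 1 and grad.shape[i] != 1)
    if dims:
        grad = grad.sum(dim=dims, keepdim=True)
    return grad

# e.g. the [b, n, k] gradient produced by the neat formula for an [n, k] matrix:
g = torch.randn(4, 8, 16)
assert reduce_grad_to_shape(g, (8, 16)).shape == (8, 16)
```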
Oh yes, it would be quite a subtle formula, I agree.
Also, I just remembered the reason I didn't go for implementing the backward formula manually. The thing is that, in some cases, you may reshape and copy one matrix in the forward and then use this reshaped matrix in the backward. In fact, I believe that if you want to reshape it in the forward, you want to reshape it in the backward as well. If you want to implement the backward by hand, you should create a helper function that returns this intermediary tensor and then reuse it in the backward, to avoid making the same copy in both the forward and the backward. And this is just a bit cumbersome (although we do it for some functions). I get the feeling that if the forward is implemented carefully, we should be able to avoid the pain that writing the backward may be in the end.
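As a rough illustration of that pattern (a hypothetical sketch using a custom autograd.Function, not what this PR implements), the reshaped/folded tensor built in the forward can be saved and reused in the backward so the copy is only made once:

```python
import torch

class FoldedMatmul(torch.autograd.Function):
    # Hypothetical: batch [b, m, n] @ matrix [n, k], folding the batch only once.

    @staticmethod
    def forward(ctx, batch, mat):
        folded = batch.reshape(-1, batch.shape[-1])  # [b*m, n]; may be a copy
        ctx.save_for_backward(folded, mat)
        ctx.batch_shape = batch.shape
        return (folded @ mat).reshape(*batch.shape[:-1], mat.shape[-1])

    @staticmethod
    def backward(ctx, grad):
        folded, mat = ctx.saved_tensors              # reuse the folded copy
        grad2d = grad.reshape(-1, grad.shape[-1])    # [b*m, k]
        grad_batch = (grad2d @ mat.t()).reshape(ctx.batch_shape)
        grad_mat = folded.t() @ grad2d               # [n, k]; no [b, n, k] tensor
        return grad_batch, grad_mat
```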
@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Don't create large intermediary tensors in the backward of matmul (#95261) Currently, if we multiply a transposed batch of matrices with shape [b, m, n] and a matrix with shape [n, k], when computing the gradient of the matrix, we instantiate a matrix of shape [b, n, k]. This may be a very large matrix. Instead, we fold the batch of matrices into a matrix, which avoids creating any large intermediary tensor. Note that multiplying a batch of matrices and a matrix naturally occurs within an attention module, so this case surely happens in the wild. In particular, this issue was found while investigating the OOMs caused by the improved folding algorithm in the next PR of this stack. See pytorch/pytorch#76828 (comment) This PR fixes those OOMs and decreases the memory footprint of the backward of matmul. I understand this is a tricky one, so I put it on its own PR to discuss it. Differential Revision: [D43541495](https://our.internmc.facebook.com/intern/diff/D43541495) Pull Request resolved: pytorch/pytorch#95261 Approved by: https://github.com/ezyang
…tmul (pytorch#95261)" This reverts commit 03cc0f5.
The decomposition was not updated after #95261.
The decomposition was not updated after #95261. Pull Request resolved: #105850 Approved by: https://github.com/Chillee
Stack from ghstack (oldest at bottom):
Currently, if we multiply a transposed batch of matrices with shape
[b, m, n] and a matrix with shape [n, k], when computing the gradient
of the matrix, we instantiate a matrix of shape [b, n, k]. This may be
a very large matrix. Instead, we fold the batch of matrices into a
matrix, which avoids creating any large intermediary tensor.
Note that multiplying a batch of matrices and a matrix naturally occurs
within an attention module, so this case surely happens in the wild.
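To put rough numbers on it (hypothetical, attention-like sizes, not measurements from this PR): with b = 64, n = k = 4096 and float32, the [b, n, k] intermediary alone is 4 GiB, while the folded computation never allocates it.

```python
b, n, k = 64, 4096, 4096        # hypothetical batch*heads and feature sizes
bytes_needed = b * n * k * 4    # float32 elements are 4 bytes
print(bytes_needed / 2**30)     # 4.0 GiB for the [b, n, k] intermediary alone
```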
In particular, this issue was found while investigating the OOMs caused by the
improved folding algorithm in the next PR of this stack. See #76828 (comment)
This PR fixes those OOMs and decreases the memory footprint of the
backward of matmul.
I understand this is a tricky one, so I put it on its own PR to discuss it.
Differential Revision: D43541495