
Conversation

@ezyang ezyang commented Sep 8, 2025

Stack from ghstack (oldest at bottom):

Signed-off-by: Edward Z. Yang ezyang@meta.com

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]

pytorch-bot bot commented Sep 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162411

Note: Links to docs will display an error until the docs builds have been completed.

❌ 16 New Failures

As of commit 25a4078 with merge base 8171d60:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang added a commit that referenced this pull request Sep 8, 2025
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 2eeedb0
Pull Request resolved: #162411

github-actions bot commented Sep 8, 2025

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@ezyang ezyang added the keep-going ("Don't stop on first failure, keep running tests until the end") label Sep 8, 2025
@ezyang ezyang requested a review from tianyu-l September 8, 2025 20:48

ezyang commented Sep 8, 2025

There are still test failures though.


github-actions bot commented Sep 8, 2025

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

// dL/dW = sum_batch ( (dL/dy)ᵀ @ x )
// Use einsum to contract over all leading dims without reshaping:
if (output_mask[1]) {
  grad_weight = at::einsum("...o,...i->oi", {grad_output, self}); // [out, in]
Contributor

wouldn't this go into decomposition, since linear_backward is CompositeImplicitAutograd now?

Contributor Author

Yes. So another PR we have to do is make einsum not decompose, but THAT is likely to be a lot more controversial. Another reason why making views work is "better" (if you can swing it).
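
(Editorial note: a quick way to sanity-check the einsum weight-gradient formula quoted above against an explicit flatten-and-matmul reference. This is an illustrative Python sketch with made-up shapes, not code from the PR.)

```python
import torch

# Two leading batch dims, as linear permits for the input.
x  = torch.randn(2, 3, 8)   # input:       [..., in]
go = torch.randn(2, 3, 5)   # grad_output: [..., out]

# The formula from the diff: contract over all leading dims at once.
grad_w_einsum = torch.einsum("...o,...i->oi", go, x)     # [out, in]

# Reference: flatten the leading dims and do a single matmul.
grad_w_ref = go.reshape(-1, 5).t() @ x.reshape(-1, 8)    # [out, in]

assert torch.allclose(grad_w_einsum, grad_w_ref, atol=1e-5)
```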


- name: linear(Tensor input, Tensor weight, Tensor? bias=None) -> Tensor
  input, weight, bias: "grad.defined() ? linear_backward(input, grad, weight, grad_input_mask) : std::tuple<Tensor, Tensor, Tensor>()"
  result: auto_linear
Contributor

Curious, what does this line do?

Contributor Author

It's for forward mode AD; it says that this function is linear and thus its forward AD formula is trivial (apply the same function to the tangents).
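
(Editorial note: the claim above can be checked with the public forward-AD API. A minimal sketch, assuming only the input carries a tangent, in which case the JVP of linear is just linear applied to that tangent.)

```python
import torch
import torch.autograd.forward_ad as fwAD
import torch.nn.functional as F

x = torch.randn(3, 4)
w = torch.randn(5, 4)
tx = torch.randn_like(x)  # tangent for x

with fwAD.dual_level():
    dual_x = fwAD.make_dual(x, tx)
    out = F.linear(dual_x, w)
    _, jvp = fwAD.unpack_dual(out)

# linear is linear in its input, so the forward derivative is the same op on the tangent.
assert torch.allclose(jvp, F.linear(tx, w))
```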

Collaborator

@albanD albanD left a comment

Sounds OK as long as there is no perf hit; this is a pretty hot path.

auto self_ = moveBatchDimToFront(self, self_bdim);
auto weight_ = moveBatchDimToFront(weight, weight_bdim);
auto bias_ = bias.has_value() ? std::make_optional<Tensor>(moveBatchDimToFront(*bias, bias_bdim)) : std::nullopt;
return std::make_tuple( at::linear(self_, weight_, bias_), 0 );
Collaborator

Oh, linear supports arbitrary batch dimensions on the weights?
Your backward formula seems to say no? :D

Contributor Author

whoops :)
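
(Editorial note: the case being discussed is roughly vmapping linear over a batch of weights. An illustrative sketch using torch.func.vmap with made-up shapes, not code from the PR.)

```python
import torch
import torch.nn.functional as F
from torch.func import vmap

x = torch.randn(3, 4)     # shared input: [batch, in]
w = torch.randn(7, 5, 4)  # a batch of 7 weights, each [out, in]

# Map over the weight's leading dim only; the input is used as-is.
out = vmap(F.linear, in_dims=(None, 0))(x, w)
print(out.shape)  # torch.Size([7, 3, 5])
```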

// dL/dW = sum_batch ( (dL/dy)ᵀ @ x )
// Use einsum to contract over all leading dims without reshaping:
if (output_mask[1]) {
  grad_weight = at::einsum("...o,...i->oi", {grad_output, self}); // [out, in]
Collaborator

What is the perf hit of this for a regular nn.Linear() layer?

Contributor Author

This is potentially actually pretty bad lol. And it doesn't even do what I want, because I want the einsum to also show up as its own operator LOL.
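
(Editorial note: to make the perf question concrete, this is the kind of micro-benchmark one would run for the plain 2-D nn.Linear case, comparing the einsum contraction against a straightforward matmul. Shapes are made up and results will vary by backend.)

```python
import torch
from torch.utils.benchmark import Timer

go = torch.randn(4096, 1024)  # grad_output: [batch, out]
x  = torch.randn(4096, 2048)  # input:       [batch, in]

t_mm = Timer("go.t() @ x", globals={"go": go, "x": x}).blocked_autorange()
t_es = Timer('torch.einsum("...o,...i->oi", go, x)',
             globals={"torch": torch, "go": go, "x": x}).blocked_autorange()

print(t_mm)
print(t_es)
```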


albanD commented Sep 12, 2025

Also I expect a lot more changes in the PT2 compilation stack to handle the new op and remove special casing for linear decomp.

