Multi-output derivative formulas can save unnecessary tensors #97575

Open
soulitzer opened this issue Mar 24, 2023 · 8 comments
Labels: actionable, module: autograd, module: nestedtensor, triaged

Comments

soulitzer (Contributor) commented Mar 24, 2023

Usually when you have a derivative formula, e.g. mm:

- name: mm(Tensor self, Tensor mat2) -> Tensor
  self: mm_mat1_backward(grad, mat2, self.sym_sizes(), self.sym_strides(), self.layout(), 1)
  mat2: mm_mat2_backward(grad, self, mat2.sym_sizes(), mat2.sym_strides(), mat2.layout(), 1)
  result: at::mm(self_t, mat2_p) + at::mm(self_p, mat2_t)

The structure tells autograd codegen which tensors need to be saved to compute which gradients, i.e. if only self requires grad, I only need to save mat2, and if only mat2 requires grad, I only need to save self.

In VariableType, the following logic is generated for mm:

if (_any_requires_grad) {
    grad_fn = std::shared_ptr<MmBackward0>(new MmBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, mat2 ));
    if (grad_fn->should_compute_output(1)) {
      grad_fn->self_ = SavedVariable(self, false);
    }
  ...
    if (grad_fn->should_compute_output(0)) {
      grad_fn->mat2_ = SavedVariable(mat2, false);
    }
}
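
(As an editorial aside, here is a minimal Python sketch, not from the issue, that makes this conditional saving observable via torch.autograd.graph.saved_tensors_hooks; the expected count assumes the MmBackward0 logic above and may differ across PyTorch versions.)

import torch

# Record every tensor autograd packs for backward during the forward pass.
saved = []

def pack(t):
    saved.append(t)
    return t

def unpack(t):
    return t

a = torch.randn(4, 4, requires_grad=True)
b = torch.randn(4, 4)  # does not require grad

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    out = torch.mm(a, b)

# Only mat2 (b) should have been saved, since only `a` requires grad.
print(len(saved))  # expected: 1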

However, when you have a single derivative formula that produces multiple outputs, autograd codegen no longer has visibility into what tensors need to be saved in order to compute which gradients.

- name: matmul(Tensor self, Tensor other) -> Tensor
  self, other: matmul_backward(grad, self, other, grad_input_mask)

This yields the following unoptimized logic in the autograd kernel:

 if (_any_requires_grad) {
    grad_fn = std::shared_ptr<MatmulBackward0>(new MatmulBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, other ));
    grad_fn->self_ = SavedVariable(self, false);
    grad_fn->other_ = SavedVariable(other, false);
  }

One common case where this can matter is if I'm doing some fine-tuning, e.g. I want to set requires_grad=False for the parameters of a bunch of layers in the middle of a network.
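
(For concreteness, a minimal sketch of that fine-tuning setup, not from the issue; the nn.Sequential model and the frozen layer indices are made up for illustration.)

import torch.nn as nn

# Freeze the parameters of a couple of layers in the middle of a network.
model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(6)])
for layer in model[2:4]:
    for p in layer.parameters():
        p.requires_grad_(False)

# With per-output saving (as in the mm codegen above), the frozen layers'
# weight gradients are never needed, so their inputs would not have to be
# saved for backward; a fused multi-output backward formula could end up
# saving them anyway.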

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @lezcano @Varal7 @cpuhrsch @jbschlosser @bhosmer @drisspg @gchanan

soulitzer added the module: autograd label on Mar 24, 2023
dagitses added the high priority and triaged labels on Mar 27, 2023
zou3519 (Contributor) commented Mar 27, 2023

How would you fix this? By splitting single derivative formulas into multiple?

albanD (Collaborator) commented Mar 27, 2023

I don't think there is a way for us to do anything here: if you can split up the formula, you can just do so.
But if computing each gradient separately is more expensive, then we could imagine a custom rule where you can provide both separate and merged formulas. That would be quite a bit of work, though, and I don't think we'd see a significant benefit from it?

Also, in the great new world of torch.compile, the unused computations will be DCE'd away :D

soulitzer added the module: nestedtensor label on Mar 27, 2023
soulitzer (Contributor, Author) commented
I haven't checked too many other ones, but the explicit matmul autograd kernel (specific to nested tensors) seems splittable.

lezcano (Collaborator) commented Mar 27, 2023

Sometimes the issue with splitting them is that, if you have to implement the second derivatives, you need to implement the cross derivatives twice (and these will be recomputed if the user asks for them), which is a bit annoying.
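
(A rough editorial sketch of this point, not from the thread: take an op out = f(a, b) with incoming gradient g, so the split backward formulas are grad_a = g \cdot \partial_a f and grad_b = g \cdot \partial_b f. Differentiating either one for double backward needs the mixed term:

\partial_b\,\mathrm{grad\_a} = g \cdot \partial_a \partial_b f, \qquad \partial_a\,\mathrm{grad\_b} = g \cdot \partial_b \partial_a f

With two separate formulas, that mixed second derivative has to be written, and at runtime recomputed, once per formula; a merged formula could compute it once and reuse it.)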

lezcano (Collaborator) commented Mar 27, 2023

Of course, this is not the case for matmul.

ezyang (Contributor) commented Mar 27, 2023

Another crappy fix is to do something like:

- name: matmul(Tensor self, Tensor other) -> Tensor
  self, other: matmul_backward(grad, self, other, grad_input_mask)
  self: blah blah
  other: blah blah

albanD (Collaborator) commented Mar 27, 2023

Yeah, that's what I hinted at above, but we should only do that if we have cases with a clear benefit. And given the new compile world, I don't think it will happen.

soulitzer (Contributor, Author) commented
@nikitaved has a fix for this now in #103750, which introduces a macro that hints autograd codegen about what to save.
What's left to do is to update the existing formulas to use it.
