Multi-output derivative formulas can save unnecessary tensors #97575
Comments
How would you fix this? By splitting single derivative formulas into multiple?
I don't think there is anything for us to do here: if you can split up the formula, you can do so. Also, in the great new world of torch.compile, the unused computations will be DCEd away :D
I haven't checked too many other ones, but the explicit matmul autograd kernel (specific to nested tensors) seems splittable.
Sometimes the issue with splitting them is that, if you have to implement second derivatives, you need to implement the cross derivatives twice (and these will be recomputed if the user asks for them), which is a bit annoying.
Of course, this is not the case for matmul.
Another crappy fix is to do something like:
Yeah that's what I hinted at above, but we should only do that if we have cases with clear benefit. And given the new compile world, I don't think it will happen.
@nikitaved has a fix for this now in #103750 which introduces a macro to hint autograd codegen at what to save. |
Usually when you have a derivative formula, e.g. `mm`:
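A simplified sketch of what the `derivatives.yaml` entry looks like (the actual entry calls dedicated backward helpers, but the dependency structure is what matters here):

```yaml
# Simplified sketch of the mm entry in tools/autograd/derivatives.yaml.
# Note that the formula for self only references mat2, and vice versa.
- name: mm(Tensor self, Tensor mat2) -> Tensor
  self: grad.mm(mat2.t())
  mat2: self.t().mm(grad)
```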
The structure informs autograd codegen what tensors need to be saved to compute which gradients, i.e. if only `self` requires grad, I only need to save `mat2`, and if only `mat2` requires grad, I only need to save `self`.
In VariableType, the following logic is generated for `mm`:
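Roughly, the generated code follows the conditional-save pattern sketched below; the exact names and bookkeeping differ from the real generated file, but each input is only saved when the other input's gradient will actually be computed:

```cpp
// Sketch (approximate) of the generated VariableType logic for mm.
auto _any_requires_grad = compute_requires_grad(self, mat2);
std::shared_ptr<MmBackward0> grad_fn;
if (_any_requires_grad) {
  grad_fn = std::shared_ptr<MmBackward0>(new MmBackward0(), deleteNode);
  grad_fn->set_next_edges(collect_next_edges(self, mat2));
  if (grad_fn->should_compute_output(1)) {  // grad for mat2 will be computed
    grad_fn->self_ = SavedVariable(self, /*is_output=*/false);
  }
  if (grad_fn->should_compute_output(0)) {  // grad for self will be computed
    grad_fn->mat2_ = SavedVariable(mat2, /*is_output=*/false);
  }
}
```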
However, when you have a single derivative formula that produces multiple outputs, autograd codegen no longer has visibility into what tensors need to be saved in order to compute which gradients.
This yields the following unoptimized logic in the autograd kernel:
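Sketched below for a hypothetical fused op whose single formula produces gradients for both `self` and `mat2` (`FusedMmBackward0` is an illustrative name, not an actual node): since codegen cannot tell which saved tensor feeds which gradient, everything referenced by the formula is saved whenever any input requires grad.

```cpp
// Sketch of the unoptimized pattern for a single multi-output formula.
auto _any_requires_grad = compute_requires_grad(self, mat2);
std::shared_ptr<FusedMmBackward0> grad_fn;  // hypothetical node name
if (_any_requires_grad) {
  grad_fn = std::shared_ptr<FusedMmBackward0>(new FusedMmBackward0(), deleteNode);
  grad_fn->set_next_edges(collect_next_edges(self, mat2));
  // Both inputs are saved unconditionally, even if only one of the
  // gradients will ever be computed.
  grad_fn->self_ = SavedVariable(self, /*is_output=*/false);
  grad_fn->mat2_ = SavedVariable(mat2, /*is_output=*/false);
}
```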
One common case where this can matter is fine-tuning, e.g. I want to set `requires_grad=False` on the parameters of a bunch of layers in the middle of a network.
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @lezcano @Varal7 @cpuhrsch @jbschlosser @bhosmer @drisspg @gchanan