Multi-output derivative formulas can save unnecessary tensors #97575
Comments
How would you fix this? By splitting single derivative formulas into multiple?
I don't think there is anything for us to do here: if you can split up the formula, you can do so. Also, in the great new world of torch.compile, the unused computations will be DCEd away :D
I haven't checked too many other ones, but the explicit matmul autograd kernel (specific to nested tensors) seems splittable.
Sometimes the issue with splitting them is that, if you have to implement second derivatives, you need to implement the cross derivatives twice (and these will be recomputed if the user asks for them), which is a bit annoying.
Of course, this is not the case for matmul.
Another crappy fix is to do something like:
Yeah that's what I hinted at above, but we should only do that if we have cases with clear benefit. And given the new compile world, I don't think it will happen.
@nikitaved has a fix for this now in #103750 which introduces a macro to hint autograd codegen at what to save. |
Usually when you have a derivative formula, e.g. `mm`:
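A simplified sketch of what the `derivatives.yaml` entry looks like (the actual entry calls dedicated backward helpers, but the dependency structure is what matters here):

```yaml
# Simplified sketch of the mm entry in tools/autograd/derivatives.yaml.
# Note that the formula for self only references mat2, and vice versa.
- name: mm(Tensor self, Tensor mat2) -> Tensor
  self: grad.mm(mat2.t())
  mat2: self.t().mm(grad)
```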
The structure informs autograd codegen what tensors need to be saved to compute which gradients, i.e. if only `self` requires grad, I only need to save `mat2`, and if only `mat2` requires grad, I only need to save `self`.
In VariableType, the following logic is generated for `mm`:
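Roughly, the generated code follows the conditional-save pattern sketched below; the exact names and bookkeeping differ from the real generated file, but each input is only saved when the other input's gradient will actually be computed:

```cpp
// Sketch (approximate) of the generated VariableType logic for mm.
auto _any_requires_grad = compute_requires_grad(self, mat2);
std::shared_ptr<MmBackward0> grad_fn;
if (_any_requires_grad) {
  grad_fn = std::shared_ptr<MmBackward0>(new MmBackward0(), deleteNode);
  grad_fn->set_next_edges(collect_next_edges(self, mat2));
  if (grad_fn->should_compute_output(1)) {  // grad for mat2 will be computed
    grad_fn->self_ = SavedVariable(self, /*is_output=*/false);
  }
  if (grad_fn->should_compute_output(0)) {  // grad for self will be computed
    grad_fn->mat2_ = SavedVariable(mat2, /*is_output=*/false);
  }
}
```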
However, when you have a single derivative formula that produces multiple outputs, autograd codegen no longer has visibility into what tensors need to be saved in order to compute which gradients.
This yields the following unoptimized logic in the autograd kernel:
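Sketched below for a hypothetical fused op whose single formula produces gradients for both `self` and `mat2` (`FusedMmBackward0` is an illustrative name, not an actual node): since codegen cannot tell which saved tensor feeds which gradient, everything referenced by the formula is saved whenever any input requires grad.

```cpp
// Sketch of the unoptimized pattern for a single multi-output formula.
auto _any_requires_grad = compute_requires_grad(self, mat2);
std::shared_ptr<FusedMmBackward0> grad_fn;  // hypothetical node name
if (_any_requires_grad) {
  grad_fn = std::shared_ptr<FusedMmBackward0>(new FusedMmBackward0(), deleteNode);
  grad_fn->set_next_edges(collect_next_edges(self, mat2));
  // Both inputs are saved unconditionally, even if only one of the
  // gradients will ever be computed.
  grad_fn->self_ = SavedVariable(self, /*is_output=*/false);
  grad_fn->mat2_ = SavedVariable(mat2, /*is_output=*/false);
}
```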
One common case where this can matter is fine-tuning, e.g. I want to set `requires_grad=False` on the parameters of a bunch of layers in the middle of a network.
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @lezcano @Varal7 @cpuhrsch @jbschlosser @bhosmer @drisspg @gchanan