[mta] Backward of unary foreach functions #89591
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89591
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending as of commit 01d6f9e. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Regarding in-place ops, I locally confirmed that the version counter seems to be bumped appropriately.
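For reference, a minimal sketch of what such a local check might look like (assuming the private `_version` counter and the in-place foreach API):

```python
import torch

# Check that an in-place unary foreach op bumps each tensor's version counter.
ts = [torch.rand(4) for _ in range(3)]   # CPU tensors, so this takes the slow path
before = [t._version for t in ts]
torch._foreach_exp_(ts)                  # in-place unary foreach op
assert all(t._version == v + 1 for t, v in zip(ts, before))
```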
So the backward of foreach will always fall back to the slow version, and codegen will make sure that we properly error out if versions don't match? That looks fine, but I'll let @albanD take a look.
I think this is only because you're on CPU: the slow code calls the single-Tensor op, which properly bumps the version counter.
This is a lot of code duplication.
Should we just have the codegen take care of it, the same way we re-use the `foo` formula for `foo_` when no explicit formula for `foo_` is provided? We could codegen a slow `_foreach_foo` formula when it is not provided. WDYT?
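A toy sketch of that lookup fallback (illustrative only, not the actual torchgen logic):

```python
def lookup_derivative_formula(name: str, formulas: dict) -> str:
    """Return the derivative formula for `name`, falling back to the base
    single-Tensor op for `_foreach_foo` the way `foo_` falls back to `foo`."""
    if name in formulas:
        return formulas[name]
    if name.startswith("_foreach_"):
        base = name[len("_foreach_"):]
        if base in formulas:
            # Reuse the single-Tensor formula, applied per element of the
            # TensorList (the "slow" backward path).
            return formulas[base]
    raise KeyError(f"no derivative formula for {name}")

# e.g. with only the single-Tensor entry present:
formulas = {"exp": "grad * result"}  # the derivatives.yaml entry for exp
print(lookup_derivative_formula("_foreach_exp", formulas))  # -> "grad * result"
```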
torch/_torch_docs.py (Outdated)
```
Only CPU and CUDA are supported. Forward-mode AD is not supported.

Args:
    self (list of Tensors): Input list of Tensors. Each Tensor can have an arbitrary shape, dtype, device, and strides.
```
It works for different dtypes and devices? Why do we collect grads to zero out by device/dtype here, then:

pytorch/torch/optim/optimizer.py, line 394 in 274d3b2:

```python
per_device_and_dtype_grads = defaultdict(lambda: defaultdict(list))
```
To reduce the number of CUDA kernel launches as much as possible, I guess.
I'm not sure I follow. How do the two have a different number of kernel launches?
If we pass a list of tensors of different dtypes and/or devices to a foreach function, it'll call the corresponding ATen native function once per tensor, i.e. `len(self)` CUDA kernel launches. With the grouping, it could reduce the number to `len(per_device_and_dtype_grads)`, assuming all the tensors are CUDA tensors.
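A minimal sketch of that grouping idea (not the optimizer's actual code; `zero_grads_grouped` is an illustrative name):

```python
from collections import defaultdict

import torch

def zero_grads_grouped(params):
    # Bucket gradients by (device, dtype) so each bucket is homogeneous and
    # eligible for a single fused foreach kernel launch, instead of one
    # launch per tensor on the slow path.
    per_device_and_dtype = defaultdict(list)
    for p in params:
        if p.grad is not None:
            per_device_and_dtype[(p.grad.device, p.grad.dtype)].append(p.grad)
    for grads in per_device_and_dtype.values():
        torch._foreach_zero_(grads)  # one fused launch per homogeneous bucket
```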
@albanD `_foreach` ops support args on different devices and of different dtypes, but they'll fall back to the slow implementation. If, instead of just calling the foreach op, we pre-sort the args, we can call it with a subset of tensors that will go to the fast path. Really, this sorting should be done in the foreach op itself, though, and not in the optimizer.
OK!
Should we really advertise this, then? It will not do what you would expect these functions to do.
Force-pushed from 63ee0dc to 4c132ba (compare)
```
~~~~~~~~~~~~~~~~~~

.. warning::
    This API is in beta and subject to future changes.
```
nit: beta or prototype?
Force-pushed from 246cad6 to a0df3ef (compare)
Added a codegen and a version bump.
Thanks for the update!
The codegen part all looks good! My only questions are about the InplaceOrView key side.
```diff
@@ -73,6 +73,7 @@ template <typename scalar_t, template<class> class Op> void foreach_unary_op_(Te
               /* r_args_depth */ 1,
               /* res_arg_index */ 0>(),
           Op<opmath_t>());
+  maybe_increment_version(tensors);
```
Do I misremember our discussion where you showed me an example where the version was bumped properly? Why does this need to be added?
That happened when the input tensors are CPU tensors, not when the fast path is chosen.
Even the functions registered to native_functions.yaml with the CUDA key can go into the slow path, which just calls the ATen native function per tensor. Therefore I decided to put this manual version bump here.
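For illustration, one way the slow path can be hit even on CUDA (a sketch; assumes a CUDA build and that mixed-dtype lists are accepted, per the discussion above):

```python
import torch

# Mixed dtypes across the list disqualify the fused fast path, so this
# dispatches to the slow path: one aten::abs_ call (and kernel) per tensor.
xs = [torch.randn(3, device="cuda"),
      torch.randn(3, device="cuda", dtype=torch.float64)]
torch._foreach_abs_(xs)
```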
Oh, that's right!
I'm still not sure why the InplaceOrView kernel doesn't get generated automatically to do that already, but we can look into that later.
I thought, with

Lines 1388 to 1405 in b33d9e2:

```python
if self.name.name.inplace:
    self_a = self.arguments.self_arg
    assert (
        self_a
        and self_a.argument.annotation
        and self_a.argument.annotation.is_write
    )
    if self_a.argument.type == BaseType(BaseTy.Tensor):
        # All inplace ops with an ordinary `Tensor self` argument should return self,
        # to allow for method chaining.
        assert (
            len(self.returns) == 1
            and self.returns[0].annotation == self_a.argument.annotation
        )
    else:
        # You can't method chain on non-tensor self arguments though (like a List[Tensor])
        # so in all other cases we expect the return type to be none.
        assert len(self.returns) == 0
```

and

pytorch/tools/autograd/gen_inplace_or_view_type.py, lines 481 to 482 in 9cf8434:

```python
for r in cpp.return_names(f):
    inplace_view_body.append(f"increment_version({r});")
```

that since in-place foreach functions return nothing, no `increment_version` call gets emitted for them.
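The distinction those asserts encode is visible from Python (a small illustration):

```python
import torch

x = torch.randn(3)
assert x.abs_() is x  # Tensor-self in-place ops return self for method chaining

xs = [torch.randn(3) for _ in range(2)]
assert torch._foreach_abs_(xs) is None  # List[Tensor] in-place ops return nothing
```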
following pytorch#89591 (comment) Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Looks good!
Sorry for all the back and forth, this is definitely a challenging one...
@pytorchbot merge (knowing this won’t work this time but want to trigger more jobs)
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 2 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Force-pushed from 2f20cef to 0aac180 (compare)
following pytorch#89591 (comment) Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
following pytorch#89591 (comment) Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Force-pushed from b0c89f2 to 42ee873 (compare)
following pytorch#89591 (comment) Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
with a new CodeTemplate of DERIVATIVE_SINGLE_FOREACH Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Pushing the heavy lifting to torchgen Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
following pytorch#89591 (comment) Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Force-pushed from 42ee873 to 01d6f9e (compare)
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours).
…possible) (#93901)

## summary
- increment tensor versions in inplace foreach functions
- add a logic to take care of `ArrayRef<Scalar>`

rel: #58833, #89591

Pull Request resolved: #93901
Approved by: https://github.com/albanD
As per the title, this PR defines the backward of unary foreach functions.
This doesn't implement forward-mode automatic differentiation, as the current codegen doesn't seem to handle `ArrayRef<Tensor>`.

Rel:

cc @mcarilli @ngimel
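A minimal usage sketch of what this enables (assuming a build that contains this PR):

```python
import torch

xs = [torch.randn(3, requires_grad=True) for _ in range(2)]
ys = torch._foreach_exp(xs)  # out-of-place unary foreach op

# Backward through the foreach op populates each input's .grad;
# for exp, the gradient of the sum is exp(x) itself.
torch.stack([y.sum() for y in ys]).sum().backward()
assert all(torch.allclose(x.grad, x.detach().exp()) for x in xs)
```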