Register DefaultBackend implementations for functional/inplace structured operators #53037
Conversation
…ured operators

As remarked in #52277 it is easy to give an (inefficient, due to extra redispatches) DefaultBackend implementation of foo and foo_ in terms of foo_out. This patch enables code generation for DefaultBackend in these cases by default for all structured kernels. You can see the payoff in the MSNPU extension: it only has to register a kernel for add.out, and it gets add and add_ kernels automatically.

The actual code changes are very modest:

- When DefaultBackend, call the dispatched (not direct native::) functions to allocate tensors, change device guard, etc.
- Don't call impl() for DefaultBackend (as it doesn't exist); instead, directly generate a call to at::foo_out to do the actual work.
- Do NOT generate a DefaultBackend implementation for foo_out. Actually, there is a case to be made for this being a good idea with more infra; see comments inside.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
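To make the shape of the generated kernels concrete, here is a rough Python model of the scheme the description outlines. The real output of the codegen is C++, and every name here is illustrative only: the functional and inplace variants are thin wrappers that allocate (or reuse) an output and redispatch to the out kernel, which is the only kernel a backend must register.

```python
# Rough Python model of the generated DefaultBackend wrappers; the real
# codegen emits C++, and all names here are illustrative, not actual symbols.

class Tensor:
    def __init__(self, data):
        self.data = list(data)

def add_out(self, other, out):
    # The only kernel a backend has to register for a structured op.
    out.data = [a + b for a, b in zip(self.data, other.data)]
    return out

def add(self, other):
    # DefaultBackend functional variant: allocate an output, then
    # redispatch to the out kernel (the source of the extra redispatch).
    out = Tensor([0] * len(self.data))
    return add_out(self, other, out)

def add_(self, other):
    # DefaultBackend inplace variant: reuse self as the output buffer.
    return add_out(self, other, self)
```

A backend that registers a faster specialized `add` can still do so; this wrapper is only the fallback.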
```python
if self.dispatch_key == DispatchKey.DefaultBackend and f.func.kind() is SchemaKind.out:
    # Never generate a default implementation for out, that's what you
    # have to define as a backend implementor
    return None
```
Maybe I'm missing something: doesn't the previous assert prevent you from hitting this case? You have an assert near the top of the file to check that an `out` variant native_function that's marked `structured` doesn't have a `DefaultBackend` key registered.
They're asserting different things.
```python
assert self.dispatch_key not in g.out.dispatch, \
    "Do not explicitly specify DefaultBackend dispatch key on structured " \
    "functions, they will be automatically generated for you"
```

This asserts that in the dispatch table (in `native_functions.yaml`) you're not allowed to have a DefaultBackend entry. But even if you DON'T have a DefaultBackend entry, we will still call into codegen (something similar happens for Meta, where you never write Meta in the dispatch table, but Meta always gets generated). We don't iterate over the dispatch dictionary to figure out what to codegen!
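The point can be modeled in a few lines of Python. This is a hypothetical sketch, not the actual codegen (which lives in PyTorch's `tools/codegen` and is more involved): for structured ops, the set of dispatch keys to generate is not just whatever appears in the dispatch dict; alias keys like DefaultBackend and Meta are added unconditionally.

```python
# Hypothetical sketch of the key-selection behavior described above;
# function name and shape are illustrative, not the real codegen API.

def keys_to_generate(dispatch, structured):
    keys = set(dispatch)
    if structured:
        # Generated even though these never appear in native_functions.yaml.
        keys |= {"Meta", "DefaultBackend"}
    return keys

# add.out's dispatch dict has no DefaultBackend entry...
add_dispatch = {"CPU": "add_out", "CUDA": "add_out"}
# ...but for a structured op, codegen emits Meta and DefaultBackend anyway.
generated = keys_to_generate(add_dispatch, structured=True)
```

An unstructured op, by contrast, only gets kernels for the keys it explicitly lists.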
ohhh yep, that's right. Thanks!
```python
# I got in trouble because that meant I needed a DispatchKeySet
# in the wrapper function, which meant I needed a DispatchKeySet
# in the DispatchKeyFunctions declarations, but the defined API
# there does NOT permit a dispatch key set.  I think you can
```
Hmm, I might take a quick look after the PR lands to see what it would take to get the wrapper to take in a DispatchKeySet. If we can get these functions to take that argument, the dispatcher should just plumb the keyset through for free.
I briefly attempted it and I mostly concluded that I really wanted DispatchKeySet to be in the argument list, so that translate could handle it automatically.
```diff
@@ -257,6 +261,9 @@ def gen_class_set_output_body(self, k: SchemaKind) -> str:
         elif self.dispatch_key == DispatchKey.CUDA:
             empty_impl = "at::native::empty_cuda"
             empty_strided_impl = "at::native::empty_strided_cuda"
+        elif self.dispatch_key == DispatchKey.DefaultBackend:
+            empty_impl = "at::empty"
```
Totally agreed that this makes the most sense to unblock native functions with `DefaultBackend` functional/inplace variants for structured kernels.

I was wondering if we have a more long-term view on what backend guarantees we want to make for functional/inplace ops. It seems a little arbitrary that some unary ops have `DefaultBackend` functional/inplace variants, while most other ops give the same backend guarantees to all of their variants (either they're all Math, or they all have backend-specific implementations).

So right now, a custom backend would need to implement their own version of `add`/`add_`/`add_out` to get full `add` functionality, but only `sin_out` to get everything for `sin`. That also has the minor downside of making those variants just a bit slower by requiring an extra trip through the dispatcher.

It doesn't seem like one is obviously better than the other, but do you think we'd ever want to introduce some amount of uniformity in what external backends can expect across ops? The current way of doing that is that backends need to look at each entry of `native_functions.yaml` individually and implement the variants not listed as `Math`/`DefaultBackend`. That's also a perfectly fine state to be in, so I don't think this is super important; mostly just curious.
> Totally agreed that this makes the most sense to unblock native functions with DefaultBackend functional/inplace variants for structured kernels.

This is a misunderstanding of what the patch does. Although there is a test for the DefaultBackend dispatch key here, this code is exercised even if a structured kernel doesn't have any DefaultBackend entry in the dispatch dictionary in native_functions.yaml. An easy way to see this is that add, which is exercised by the tests in this PR, doesn't have DefaultBackend in its dispatch dictionary:
```yaml
- func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: add_out
    SparseCPU: add_out_sparse_cpu
    SparseCUDA: add_out_sparse_cuda
    MkldnnCPU: mkldnn_add_out
```
> So right now, a custom backend would need to implement their own version of add/add_/add_out, to get full add functionality, but only sin_out to get everything for sin.

This is factually incorrect. Both add and sin (after @hameerabbasi's PR) are structured, and in fact in both cases you only need to implement `add_out` and `sin_out` to get full functionality. If you register a specialized implementation you can get it a little faster (from skipped redispatches), but it is not necessary.
It is true that whether or not something is structured is a bit arbitrary from the perspective of what you need to implement. The fix for that is to port more stuff to structured.

Let me know if there is something else you were getting at here that I missed. I think there is generally a question about "what should we expect backends to implement", and in some cases (e.g., XLA) I don't think it is an improvement to only have to implement the out version; they'd rather implement the functional version. But we can only pick one direction to do the defaulting the way the dispatcher is currently set up; we'd need some more fancy gear otherwise.
Thanks for writing that out! Staring at the codegen output from this PR also helped a little. Just to lay it out, what I had in my head was that we had to choose between two options (taking `sin` as an example):

- Register a specialized version of `sin` that skips redispatches, but requires separate implementations for other backends
- Register a backend-agnostic version of `sin`. Backends don't need to implement their own version, but it requires extra dispatches

The structured kernel codegen now just picks both choices, and uses the fact that `DefaultBackend` is an alias key to ensure that "the right thing" happens. If another backend registers its own specialized implementation, the dispatcher will give that implementation precedence. Otherwise, we default to the `DefaultBackend` implementation.
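That precedence rule can be sketched as a toy lookup. This is an assumption-laden model, not the real dispatcher (which resolves keys in C++ with a fixed priority order); the kernel strings are placeholders.

```python
# Toy model of alias-key precedence: a backend-specific registration
# wins over the DefaultBackend (alias) entry; otherwise we fall back.

def resolve_kernel(table, backend):
    # Hypothetical resolution helper; real dispatch is far more involved.
    return table.get(backend, table.get("DefaultBackend"))

sin_table = {"DefaultBackend": "generated sin -> calls sin_out"}
sin_table["XLA"] = "hand-written XLA sin"  # specialized registration

specialized = resolve_kernel(sin_table, "XLA")    # the XLA kernel wins
fallback = resolve_kernel(sin_table, "MSNPU")     # falls back to the alias
```

The point is just that registering both a specialized and a backend-agnostic kernel is safe: the dispatcher picks the more specific one when it exists.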
You probably already had this in mind with this issue, but we can probably do something similar with overloads, potentially also for overloads that don't necessarily call each other for performance reasons. For example, I started a port of `pow`, and it has an overload `pow.Tensor_Scalar(Tensor self, Scalar exponent) -> Tensor` that's implemented directly as a unary op, even though it could technically be written a little less efficiently to call out to the more general Tensor overload. I could imagine doing something similar to here, where we keep the specialized implementation of the overload for perf reasons, but also codegen a slower, backend-agnostic variant of it that calls the more general Tensor overload.
@ezyang @bdhirsh just to check my understanding, I think this PR narrows the BC-breakage window substantially but doesn't entirely close it. (Not that it should, just want to make sure I'm thinking about all the moving parts properly.)
E.g. if we have

1. an unstructured op with dispatch configured a la `sin` in `native_functions.yaml`:

   ```yaml
   - func: foo ...
     dispatch:
       DefaultBackend: foo
   - func: foo_ ...
     dispatch:
       DefaultBackend: foo_
   - func: foo.out ...
     dispatch:
       CPU, CUDA: foo_out
   ```

2. where the `foo` and `foo_` kernels really do work for all backends,
3. and an out-of-tree backend provider relies on this support for `foo` and `foo_`,
4. and the provider doesn't care about, and has not registered a kernel for, `foo.out`,

then before this PR, moving `foo`/`foo_`/`foo.out` to structured kernels would suddenly oblige the backend provider to register kernels for `foo` and `foo_` to keep the same level of support it had previously gotten for free.

After this PR, the generated DefaultBackend support means this is no longer necessary (though still allowed), but IIUC the backend provider will still need to have registered a kernel for `foo.out` for the op to actually run, correct?
@bhosmer I don't know how to square (4) with (3).
> and the provider doesn't care about, and has not registered a kernel for, foo.out

OK, so the provider hasn't registered foo.out. So... how exactly were the DefaultBackend `foo` and `foo_` implemented? It can't be by calling `foo.out`, because if they actually did that then it would be broken when `foo.out` isn't registered. So you're saying that `foo` is implemented in some other way that doesn't use out... to which I say, "Don't do that, and structured kernels will make you not do that."
Right, I'm just confirming that this PR assumes that dependency between unstructured variants, and that it's possible for an unstructured op to break the assumption while still coloring inside the lines, registration-wise. I don't think we should try to handle this case - I have zero ideas for real-world ops that would actually use it - but wanted to confirm my understanding of exactly which configurations this PR guarantees BC for.
Just left a few comments but none that should require code changes. LGTM!
…ured operators (pytorch#53037)

Summary: Pull Request resolved: pytorch#53037

As remarked in pytorch#52277 it is easy to give an (inefficient, due to extra redispatches) DefaultBackend implementation of foo and foo_ in terms of foo_out. This patch enables code generation for DefaultBackend in these cases by default for all structured kernels. You can see the payoff in the MSNPU extension: it only has to register a kernel for add.out, and it gets add and add_ kernels automatically.

The actual code changes are very modest:

- When DefaultBackend, call the dispatched (not direct native::) functions to allocate tensors, change device guard, etc.
- Don't call impl() for DefaultBackend (as it doesn't exist); instead, directly generate a call to at::foo_out to do the actual work.
- Do NOT generate a DefaultBackend implementation for foo_out. Actually, there is a case to be made for this being a good idea with more infra; see comments inside.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D26731225
Pulled By: ezyang
fbshipit-source-id: 939da7cb69f694722ec293e5e42e74a755dd0985
Stack from ghstack:
As remarked in #52277 it is easy to give an (inefficient, due to extra
redispatches) DefaultBackend implementation of foo and foo_ in terms of
foo_out. This patch enables code generation for DefaultBackend in these
cases by default for all structured kernels. You can see the payoff
in MSNPU extension: it only has to register a kernel for add.out, and it
gets add and add_ kernels automatically.
The actual code changes are very modest:

- When DefaultBackend, call the dispatched (not direct native::) functions to allocate tensors, change device guard, etc.
- Don't call impl() for DefaultBackend (as it doesn't exist); instead, directly generate a call to at::foo_out to do the actual work.
- Do NOT generate a DefaultBackend implementation for foo_out. Actually, there is a case to be made for this being a good idea with more infra; see comments inside.
Signed-off-by: Edward Z. Yang ezyang@fb.com
Differential Revision: D26731225