Register DefaultBackend implementations for functional/inplace structured operators #53037

Closed
wants to merge 2 commits

Conversation

@ezyang ezyang (Contributor) commented Mar 1, 2021

Stack from ghstack:

As remarked in #52277, it is easy to give an (inefficient, due to extra
redispatches) DefaultBackend implementation of foo and foo_ in terms of
foo_out. This patch enables code generation for DefaultBackend in these
cases by default for all structured kernels. You can see the payoff
in the MSNPU extension: it only has to register a kernel for add.out, and it
gets the add and add_ kernels automatically (a rough sketch of the generated
wrappers follows below).

The actual code changes are very modest:

  • When generating for DefaultBackend, call the dispatched functions (not
    direct native:: ones) to allocate tensors, set the device guard, etc.
  • Don't call impl() for DefaultBackend (it doesn't exist); instead,
    directly generate a call to at::foo_out to do the actual work.
  • Do NOT generate a DefaultBackend implementation for foo_out. Actually,
    there is a case to be made for this being a good idea with more infra;
    see comments inside.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D26731225
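For concreteness, here is a rough sketch of what the generated DefaultBackend wrappers for a structured op like add might look like. The function names and the allocation logic are illustrative stand-ins, not the literal codegen output (the real generated code allocates via the structured kernel's set_output machinery):

    #include <ATen/ATen.h>

    // Illustrative only: simplified stand-in for the generated code.
    at::Tensor wrapper_add(const at::Tensor& self, const at::Tensor& other,
                           const at::Scalar& alpha) {
      // Allocate through the *dispatched* at::empty, not a direct
      // at::native::empty_cpu call, so the right backend allocator runs.
      at::Tensor out = at::empty(self.sizes(), self.options());
      // Redispatch to whatever add.out kernel the backend registered.
      at::add_out(out, self, other, alpha);
      return out;
    }

    at::Tensor& wrapper_add_(at::Tensor& self, const at::Tensor& other,
                             const at::Scalar& alpha) {
      // The inplace variant reuses self as the out argument.
      at::add_out(self, self, other, alpha);
      return self;
    }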

@facebook-github-bot commented Mar 1, 2021

💊 CI failures summary and remediations

As of commit e86c164 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

ezyang added a commit that referenced this pull request Mar 1, 2021
Register DefaultBackend implementations for functional/inplace structured operators

ghstack-source-id: e45111af6bf9c908c894c0576739a9770e1123b2
Pull Request resolved: #53037
    if self.dispatch_key == DispatchKey.DefaultBackend and f.func.kind() is SchemaKind.out:
        # Never generate a default implementation for out, that's what you
        # have to define as a backend implementor
        return None
Contributor:

Maybe I'm missing something: doesn't the previous assert prevent you from hitting this case? You have an assert near the top of the file checking that an out-variant native_function that's marked structured doesn't have a DefaultBackend key registered.

ezyang (Contributor, Author):

They're asserting different things.

            assert self.dispatch_key not in g.out.dispatch, \
                "Do not explicitly specify DefaultBackend dispatch key on structured " \
                "functions, they will be automatically generated for you"

asserts that you're not allowed to have a DefaultBackend entry in the dispatch table (in native_functions.yaml). But even if you DON'T have a DefaultBackend entry, we will still call into codegen. (Something similar happens for Meta: you never write Meta in the dispatch table, but Meta always gets generated. We don't iterate over the dispatch dictionary to figure out what to codegen!)

Contributor:

ohhh yep, that's right. Thanks!

    # I got in trouble because that meant I needed a DispatchKeySet
    # in the wrapper function, which meant I needed a DispatchKeySet
    # in the DispatchKeyFunctions declarations, but the defined API
    # there does NOT permit a dispatch key set. I think you can
Contributor:

Hmm, I might take a quick look after the PR lands to see what it would take to get the wrapper to take in a DispatchKeySet. If we can get these functions to take in that argument, the dispatcher should just plumb the keyset through for free.

ezyang (Contributor, Author):

I briefly attempted it and I mostly concluded that I really wanted DispatchKeySet to be in the argument list, so that translate could handle it automatically.
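To make the shape of that change concrete, here is a sketch of the signature difference under discussion; the wrapper name is hypothetical, and the plumbing details are exactly the open question in this thread:

    #include <ATen/ATen.h>
    #include <c10/core/DispatchKeySet.h>

    // Today's generated wrapper (illustrative name):
    at::Tensor wrapper_add(const at::Tensor& self, const at::Tensor& other,
                           const at::Scalar& alpha);

    // If the dispatcher plumbed its keyset through, the wrapper could
    // redispatch to add.out while masking out already-handled keys,
    // and translate could pass the argument along automatically:
    at::Tensor wrapper_add(c10::DispatchKeySet ks, const at::Tensor& self,
                           const at::Tensor& other, const at::Scalar& alpha);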

    @@ -257,6 +261,9 @@ def gen_class_set_output_body(self, k: SchemaKind) -> str:
        elif self.dispatch_key == DispatchKey.CUDA:
            empty_impl = "at::native::empty_cuda"
            empty_strided_impl = "at::native::empty_strided_cuda"
        elif self.dispatch_key == DispatchKey.DefaultBackend:
            empty_impl = "at::empty"
Contributor:

Totally agreed that this makes the most sense to unblock native functions with DefaultBackend functional/inplace variants for structured kernels.

I was wondering if we have a longer-term view on what backend guarantees we want to make for functional/inplace ops. It seems a little arbitrary that some unary ops have DefaultBackend functional/inplace variants, while most other ops give the same backend guarantees to all of their variants (either they're all Math, or they all have backend-specific implementations).

So right now, a custom backend would need to implement its own version of add/add_/add_out to get full add functionality, but only sin_out to get everything for sin. That also has the minor downside of making those variants just a bit slower by requiring an extra trip through the dispatcher.

It doesn't seem like one is obviously better than the other, but do you think we'd ever want to introduce some amount of uniformity in what external backends can expect across ops? The current way of doing that is that backends need to look at each entry of native_functions.yaml individually and implement the variants not listed as Math/DefaultBackend. That's also a perfectly fine state to be in, so I don't think this is super important; mostly just curious.

ezyang (Contributor, Author):

> Totally agreed that this makes the most sense to unblock native functions with DefaultBackend functional/inplace variants for structured kernels.

This is a misunderstanding of what the patch does. Although there is a test for the DefaultBackend dispatch key here, this code is exercised even if a structured kernel doesn't have any DefaultBackend entry in its dispatch dictionary in native_functions.yaml. An easy way to see this: add, which is exercised by the tests in this PR, doesn't have DefaultBackend in its dispatch dictionary:

- func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: add_out
    SparseCPU: add_out_sparse_cpu
    SparseCUDA: add_out_sparse_cuda
    MkldnnCPU: mkldnn_add_out

> So right now, a custom backend would need to implement its own version of add/add_/add_out to get full add functionality, but only sin_out to get everything for sin.

This is factually incorrect. Both add and sin (after @hameerabbasi's PR) are structured, and in fact in both cases you only need to implement add_out and sin_out to get full functionality. If you register a specialized implementation you can make it a little faster (from skipped redispatches), but it is not necessary.

It is true that whether something is structured or not is a bit arbitrary as a way of saying what you need to implement. The fix for that is to port more stuff to structured.

Let me know if there is something else you were getting at here that I missed. There is generally a question of "what should we expect backends to implement", and in some cases (e.g., XLA) I don't think it is an improvement to only have to implement the out version; they'd rather implement the functional version. But the way the dispatcher is currently set up, we can only pick one direction for the defaulting; we'd need some fancier gear otherwise.
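As a concrete (hypothetical) illustration of what this means for an external backend: registering just the out kernel is enough, because the generated DefaultBackend entries route the functional and inplace variants through it. The PrivateUse1 key and the kernel below are placeholders, not code from this PR:

    #include <torch/library.h>
    #include <ATen/ATen.h>

    // Hypothetical backend kernel; the signature follows the add.out schema,
    // with the out tensor last in the dispatcher's argument order.
    at::Tensor& my_backend_add_out(const at::Tensor& self, const at::Tensor& other,
                                   const at::Scalar& alpha, at::Tensor& out) {
      // ... backend-specific implementation goes here ...
      return out;
    }

    TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
      // Register only add.out: the codegen'd DefaultBackend kernels for
      // add and add_ redispatch here, so those variants come for free.
      m.impl("add.out", TORCH_FN(my_backend_add_out));
    }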

Contributor:

Thanks for writing that out! Staring at the codegen output from this PR also helped a little. Just to lay it out, what I had in my head was that we had to choose between two options (taking sin as an example):

  • Register a specialized version of sin that skips redispatches, but requires separate implementations for other backends
  • Register a backend-agnostic version of sin. Backends don't need to implement their own version, but it requires extra dispatches

The structured kernel codegen now just picks both choices, and uses the fact that DefaultBackend is an alias key to ensure that "the right thing" happens. If another backend registers their own specialized implementation, the dispatcher will give that implementation precedence. Otherwise, we default to the DefaultBackend implementation.

You probably already had this in mind with this issue, but we can probably do something similar with overloads, potentially even for overloads that deliberately don't call each other for performance reasons. For example, I started a port of pow, and it has an overload pow.Tensor_Scalar(Tensor self, Scalar exponent) -> Tensor that's implemented directly as a unary op, even though it could technically be written a little less efficiently to call out to the more general Tensor overload. I could imagine doing something similar to here, where we keep the specialized implementation of the overload for perf reasons, but also codegen a slower, backend-agnostic variant of it that calls the more general Tensor overload (sketched below).
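A minimal sketch of that kind of backend-agnostic overload fallback, assuming we were willing to pay the extra dispatch; the function name is made up and this is not what the codegen emits today:

    #include <ATen/ATen.h>

    // Hypothetical slow-but-generic kernel for pow.Tensor_Scalar, written
    // in terms of the more general Tensor-Tensor overload.
    at::Tensor pow_tensor_scalar_fallback(const at::Tensor& self,
                                          const at::Scalar& exponent) {
      // Materialize the scalar as a 0-dim tensor and redispatch; any
      // backend implementing the Tensor-Tensor overload gets this free.
      return at::pow(self, at::full({}, exponent, self.options()));
    }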

bhosmer (Contributor):

@ezyang @bdhirsh just to check my understanding, I think this PR narrows the BC-breakage window substantially but doesn't entirely close it. (Not that it should, just want to make sure I'm thinking about all the moving parts properly.)

E.g. if we have

  1. an unstructured op with dispatch configured a la sin in native_functions.yaml:
- func: foo ...
  dispatch:
    DefaultBackend: foo

- func: foo_ ...
  dispatch:
    DefaultBackend: foo_

- func: foo.out ...
  dispatch:
    CPU, CUDA: foo_out
  2. where foo and foo_ kernels really do work for all backends

  3. and an out-of-tree backend provider relies on this support for foo and foo_

  4. and the provider doesn't care about, and has not registered a kernel for, foo.out

Then before this PR, moving foo/foo_/foo.out to structured kernels would suddenly oblige the backend provider to register kernels for foo and foo_ to keep the same level of support it had previously gotten for free.

After this PR, the generated DefaultBackend support means this is no longer necessary (though still allowed), but IIUC the backend provider will still need to have registered a kernel for foo.out for the op to actually run, correct?

ezyang (Contributor, Author):

@bhosmer I don't know how to square (4) with (3).

> and the provider doesn't care about, and has not registered a kernel for, foo.out

OK, so the provider hasn't registered foo.out. So... how exactly were the DefaultBackend foo and foo_ implemented? It can't be by calling foo.out, because if they actually did that, it would be broken with foo.out unregistered. So you're saying that foo is implemented in some other way that doesn't use out... to which I say, "Don't do that, and structured kernels will make you not do that."

bhosmer (Contributor):

Right, I'm just confirming that this PR assumes that dependency between unstructured variants, and that it's possible for an unstructured op to break the assumption while still coloring inside the lines, registration-wise. I don't think we should try to handle this case - I have zero ideas for real-world ops that would actually use it - but wanted to confirm my understanding of exactly which configurations this PR guarantees BC for.
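For concreteness, a hypothetical kernel that colors inside the lines while breaking the assumed dependency: an unstructured foo whose DefaultBackend functional kernel never touches foo.out because it is composed from other dispatched ops (structured kernels rule this shape of implementation out):

    #include <ATen/ATen.h>

    // Hypothetical DefaultBackend kernel for an unstructured foo that works
    // on every backend without a foo.out registration.
    at::Tensor foo(const at::Tensor& self) {
      return at::mul(self, 2);  // composed from other dispatched ops
    }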

@bdhirsh bdhirsh (Contributor) left a comment:

Just left a few comments but none that should require code changes. LGTM!

Update on "Register DefaultBackend implementations for functional/inplace structured operators"

Differential Revision: [D26731225](https://our.internmc.facebook.com/intern/diff/D26731225)

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Mar 2, 2021
Register DefaultBackend implementations for functional/inplace structured operators

ghstack-source-id: c3ead7bd9eda790e22ce5776cc3168c63d63a00b
Pull Request resolved: #53037
@facebook-github-bot:

@ezyang merged this pull request in 37bf6c1.

@facebook-github-bot facebook-github-bot deleted the gh/ezyang/917/head branch March 6, 2021 15:17
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Register DefaultBackend implementations for functional/inplace structured operators (pytorch#53037)

Summary: Pull Request resolved: pytorch#53037

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26731225

Pulled By: ezyang

fbshipit-source-id: 939da7cb69f694722ec293e5e42e74a755dd0985
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Register DefaultBackend implementations for functional/inplace structured operators (pytorch#53037)