RFC-0005: Structured kernel definitions RFC #9
base: master
Conversation
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
See pytorch/rfcs#9 This mostly follows the same structure as the proposal, though some shortcuts have been taken. It doesn't currently build because I haven't actually implemented the meta function for the function I marked as structured. Still needs a lot of work; the prototype is here just to show feasibility. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555) [ghstack-poisoned]
Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework. There is a new meta API which is the calling convention for TensorMeta calculation functions. Most of the new codegen lives in structured_func; check out the RFC for an explanation of what the code looks like.

Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

There are some hacks which I can work harder to unwind:
- I need to get upsample_nearest1d to be registered as abstract: True in Declarations.yaml even though it has no dispatch table (as it is implicitly filled by upsample_nearest1d.out). I ended up hacking this up by just adding a new field 'abstract: True' that lets you manually override the abstractness. Better would be to just teach the codegen to fill this correctly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555) [ghstack-poisoned]
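To make the meta API concrete, here is a sketch of what such a TensorMeta calculation function might look like (the exact TensorMeta constructor and helper names are assumptions for illustration, not the PR's actual code):

```cpp
#include <ATen/ATen.h>
using namespace at;

// Hypothetical meta function for upsample_nearest1d under the new calling
// convention: it performs the shape checks and computes the output's
// sizes/options without touching any data.
TensorMeta upsample_nearest1d_meta(const Tensor& self, IntArrayRef output_size,
                                   c10::optional<double> scales) {
  TORCH_CHECK(output_size.size() == 1, "output_size must have exactly one element");
  TORCH_CHECK(self.dim() == 3, "expected a 3-D input of shape (N, C, W)");
  // Output keeps the batch and channel dims; the last dim is the requested size.
  return TensorMeta({self.size(0), self.size(1), output_size[0]}, self.options());
}
```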
void upsample_nearest1d_structured_cpu(
    const TensorMeta& out_meta, const Tensor& out, const Tensor& self, IntArrayRef output_size, optional<double> scales);
void upsample_nearest1d_structured_cuda(
    const TensorMeta& out_meta, const Tensor& out, const Tensor& self, IntArrayRef output_size, optional<double> scales);
Maybe worth noting somewhere in here that you're limiting the example to CPU and CUDA for expository purposes? The use of explicit dispatch keys above should probably imply to the careful reader that this setup is parameterized over all backend dispatch keys in the usual way (IIUC), but if so it's probably useful to make it explicit.
* Performance reasons. Introducing the common key would induce an
  extra redispatch, which at time of writing would give up quite a
  bit of performance due to dispatch overhead, for no particularly
  good reason.
I think it's worth thinking carefully about whether this is the best way to make structured kernels available out of tree, given the modality it introduces, the risk of drift, the perf handicap (albeit diminishing over time) it saddles out-of-tree backends with, etc.
E.g. once the rest of the codegen has been ported, it's not absurd to imagine fitting it with a frontend that can take inputs other than the in-tree yaml...
The common dispatch key is primarily oriented towards backend implementers.
Let's suppose for a moment that an overriding design goal is safety/correctness first, with the ability to opt into performance. Then it feels like I am forced to introduce the Common dispatch key, because without it, an average backend overrider has to faithfully replicate all of the functionality that we have otherwise fused into CPU/CUDA operators (shape checking, device guards, and, with some refactors coming soon, version counter bumps). It's unrealistic to expect a backend implementer to actually manage all of this without code generation.
That being said, there is a certain optionality to the Common dispatch key. We don't have to implement it (and indeed, in the current posted PR, it is not implemented), and if it is not implemented, the burden is simply on backend implementers to implement all of the necessary scaffolding (which is the de facto situation today). If, for example, we pivoted to publishing code generation for backend implementers, that would alleviate or perhaps eliminate the need for a common dispatch key.
The common dispatch key is mostly irrelevant for custom operators, since custom operators are typically registered via catch-all registration, and there will not be any common dispatch key in that case.
Yeah, agree on all the motivations. My pitch would be to (wait and) audition a lightly parameterized codegen pipeline as an alternative, once the rewrite is complete. (Waiting would serve not just to let the codegen settle down, but also give us more time to see how quickly dispatcher overhead was coming down.)
I think some of the weirdness, like making native_functions.yaml entries depend on each other or having two ways of doing things (dispatch key vs. fusing into kernels), is because you designed under the constraint of not changing the fact that the three operator overloads (regular, inplace, out) are actually registered as separate operators with the dispatcher.
Have we considered an alternative solution where the dispatcher only knows about the out overload and the regular and inplace variants are generated in the frontend before the call to the dispatcher? That would also make it extensible for out-of-tree ops and take the burden away from backend implementors.
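For concreteness, a minimal sketch of that alternative, assuming hypothetical wrapper names (this illustrates the reviewer's suggestion, not what the RFC proposes): only the out overload is registered with the dispatcher, and the functional variant is a thin frontend wrapper generated around it.

```cpp
#include <ATen/ATen.h>
using namespace at;

// Hypothetical frontend-generated functional wrapper: allocate an output
// (shape computation elided here) and forward to the single dispatched
// out overload, so the dispatcher never sees the functional variant at all.
Tensor upsample_nearest1d_frontend(const Tensor& self, IntArrayRef output_size,
                                   c10::optional<double> scales) {
  Tensor out = at::empty({0}, self.options());  // resized by the out kernel
  at::upsample_nearest1d_out(out, self, output_size, scales);
  return out;
}
```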
  provide public API for running shape computations without any
  kernel.

* Generated code is augmented to do version counter bumps and view
This means we are using this mechanism for all ops, even ops that don't have out or inplace variants, right? Otherwise those ops wouldn't get device guards or version counter bumps.
Yeah, this is an orthogonal change that we can do with or without structured kernels.
  in to higher performance. There is always an escape hatch to be high
  performance if absolutely necessary.

* **No codegen**: As long as it is possible to implement things out of
As long as we use this mechanism internally for all ops, we have meta functions for them defined and backends adding new kernels don't need to care about it. Only backends that add new operators would have to use it. Did I understand this correctly?
That's right. So if you don't care about performance, you don't even have to lift a finger.
Wait, I think I misunderstood this. To get device guards, you'll still have to go through the dispatch key for those backends, but not when calling into internal backends like CPU. How do you selectively make that dispatch key a fallthrough for some backends but not for others?
I claimed this would work in our meeting, but rereading your comment, I take it back, you really do need a per backend key here :(
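(For reference, the mechanics of opting a single backend out would presumably be a fallthrough registration on that backend's per-backend key, roughly like the sketch below; `CommonCPU` is a made-up key name for illustration, and only the registration API itself is real.)

```cpp
#include <torch/library.h>

// Hypothetical: a backend whose kernels already fuse shape checks and device
// guards registers a fallthrough on its "common" key, so calls skip the
// shared wrapper and go straight to the backend kernel.
TORCH_LIBRARY_IMPL(aten, CommonCPU, m) {  // "CommonCPU" is an assumed key name
  m.fallback(torch::CppFunction::makeFallthrough());
}
```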
* Extensions

* Add a new dispatch key (name tbd) which contains shape checking,
This dual approach makes the system a bit complex; it can be hard to figure out where version counter bumps, device guards, or shape checking happen for a given op, especially when you're not familiar with this system (yet).
new *structured* format for writing kernels. We’ll do this by marking
the out version of this operator as structured and deleting dispatch
entries from the functional version (the functional operator is
*implicitly* associated with the out-of-place version in the same way
This implicit connection makes reasoning about entries in native_functions.yaml harder, since they're not independent entries anymore.
```
- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor
  # [NEW] dispatches for this function are omitted
```
Hm, maybe we should make this explicit by also adding `structured: True` here, to make the information more local and easier to read. Otherwise you have to look at other entries in native_functions.yaml to know what this entry is actually doing. Omitting dispatch keys is already a valid kernel definition even if it's not structured, so by just looking at this one it would be ambiguous.
```
namespace native {

Tensor upsample_nearest1d_cuda(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
```
These should be easy to generate with templates instead, and it should be possible to do that in a readable way.
The current expressivity problems:

- Guard logic is weird and special-casey in codegen right now. We probably should be able to simplify it a bit, but right now it would be quite difficult to faithfully replicate the logic in a template.
- The upcoming version counter bumps will be difficult to do without out-of-band information about what arguments are mutable or not (right now you can check this using `Tensor&`, but when we fix everything to uniformly be `const Tensor&` you'll lose this type info).
// functionality here is common to all backends. This is an alias key
// that resolves CommonXLA/CommonMSNPU/... in the same way as Autograd.

Tensor upsample_nearest1d_common(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
We could offer a simple metaprogram for them so they don't have to manually write this for each op. But if I understand correctly, they only have to write this for ops they introduce right? Ops from native_functions.yaml that the backend only extends will already have shape checking etc. through our codegen right?
Yes
* An earlier version of this proposal had the boilerplate
  generated using C++ templates rather than codegen. However, we
  think the formulation in this proposal is superior under the
  constraint that mobile selective build must keep working, as we
I don't understand the connection to mobile selective build yet
We can talk about this in the meeting. It's constraint solving from the problem "mobile requires registrations to be a separate compilation unit from kernels"
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This is very exciting! Looking forward to having this structure. Do you guys plan to support all standard GPU kernels using this new framework (so they have shape inference and _out versions)?
Yeah. The aspiration is that every kernel in PyTorch is in this framework. That's gonna be a lot of work, but hopefully we can lay the groundwork and then roll it out as we go.
This is not entirely settled yet, but the intention for the section at #9 (comment) was to make this possible. So if you are doing a custom op, you now define (e.g.) three parts: a CPU part, a CUDA part, and the static shape checking part, and the framework would put it all together for you. One extra thing I'd add, though: when I've been chatting with other static runtime people at FB, they seem to want strange things like being able to run the static inference really fast at the beginning of each run to work out the preallocation. At least in the first iteration, the API for actually running these shape computations won't be particularly fast, and will be mostly useful for offline use cases.
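As a rough sketch of what those three parts might look like for a hypothetical custom op (`my_gather2d` and the exact registration surface are made up for illustration; the per-backend signature mirrors the structured calling convention quoted earlier in this thread):

```cpp
#include <ATen/ATen.h>
using namespace at;

// Part 1 (backend-agnostic): shape checking / output metadata, no data access.
TensorMeta my_gather2d_meta(const Tensor& self, const Tensor& index) {
  TORCH_CHECK(self.dim() == 2 && index.dim() == 2, "expected 2-D inputs");
  return TensorMeta(index.sizes(), self.options());  // output matches index shape
}

// Part 2: CPU data computation only; shapes were already validated above.
void my_gather2d_cpu(const TensorMeta& out_meta, const Tensor& out,
                     const Tensor& self, const Tensor& index) {
  // ... CPU kernel body ...
}

// Part 3: CUDA data computation only.
void my_gather2d_cuda(const TensorMeta& out_meta, const Tensor& out,
                      const Tensor& self, const Tensor& index) {
  // ... CUDA kernel body ...
}
```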
Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework. The general structure of this diff:

- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model.
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model.
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they're a bit different, as they call your meta functions and make reference to the actual implementations in out.
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.

Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555) [ghstack-poisoned]
Summary: Pull Request resolved: #45277

Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework. The general structure of this diff:

- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model.
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model.
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they're a bit different, as they call your meta functions and make reference to the actual implementations in out.
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.

Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

This PR improves instruction counts on `upsample_nearest1d` because it eliminates an extra redispatch. Testing `at::upsample_nearest1d(x, {10});`
* Functional: before 1314105, after 1150705
* Out: before 915705, after 838405

These numbers may be jittered up to +-16400 (which is the difference when I tested against an unaffected operator, `at::upsample_linear1d`), though that may also be because unrelated changes affected all operators globally.

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D24253555 Test Plan: Imported from OSS Reviewed By: smessmer Pulled By: ezyang fbshipit-source-id: 4ef58dd911991060f13576864c8171f9cc614456
…tion of add to framework" This PR rewrites structured kernels to use the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files):

* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is: someone calls the subclassed set_output, which allocates the output, and then we call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - Now generates all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way, which is that you can easily skip shape checking (omit the meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove the special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
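Very roughly, the class-based formulation reads like the sketch below (the class names, the MetaBase spelling, and the set_output signature are simplified assumptions; the generated code in MetaFunctions.h/NativeFunctions.h is the source of truth):

```cpp
// Hypothetical shape of a class-based structured kernel: meta() validates
// inputs and calls set_output() to allocate/resize the result; a backend's
// impl() then fills in the data for the already-allocated output.
struct structured_upsample_nearest1d : public MetaBase {
  void meta(const Tensor& self, IntArrayRef output_size, c10::optional<double> scales) {
    TORCH_CHECK(output_size.size() == 1, "output_size must have exactly one element");
    set_output({self.size(0), self.size(1), output_size[0]}, self.options());
  }
};

struct structured_upsample_nearest1d_cpu : public structured_upsample_nearest1d {
  void impl(const Tensor& self, IntArrayRef output_size,
            c10::optional<double> scales, const Tensor& out) {
    // CPU-only data computation; `out` was allocated by meta()'s set_output().
  }
};
```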
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…48718)

Summary: Pull Request resolved: #48718

This PR rewrites structured kernels to use the class-based mechanism (instead of defining a meta and an impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high-level description of what's going on here.

High-level structure of this PR (the order in which you should review the files):

* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow is that someone calls the subclassed set_output, which allocates the output, and then we call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip the parts of TensorIterator that are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - Now generates all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old-style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc. won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:

* Work out an appropriate entry point for static runtime, since native:: function stubs are no longer generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix the perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why the header changes resulted in an undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278031

Pulled By: ezyang

fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
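For readers skimming this thread, a rough sketch of the class shape the codegen described above produces may help. The names, namespaces, and exact signatures below are illustrative assumptions for an add-like TensorIterator kernel, not the literal generated code:

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorIterator.h>

// Illustrative sketch only (names/namespaces are assumptions, not the real
// generated headers). The codegen emits a "meta" class whose parent is
// controlled by structured_inherits in native_functions.yaml; kernel authors
// subclass it and provide impl(). set_output() is instantiated per backend,
// performs the actual allocation, and records the output in TensorIteratorBase.
namespace at { namespace meta {

struct structured_add_Tensor : public TensorIteratorBase {
  void meta(const Tensor& self, const Tensor& other, const Scalar& alpha);
};

}} // namespace at::meta

namespace at { namespace native {

struct structured_add_out : public at::meta::structured_add_Tensor {
  void impl(const Tensor& self, const Tensor& other, const Scalar& alpha,
            const Tensor& out);
};

}} // namespace at::native
```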
Fair enough. We can use your naming!
template metaprogramming machinery (to detect if arguments are out
tensors or not); however, because the implementations of structured
kernels are a layer below the operator registration layer, the
const modifier can be eliminated from the `TORCH_IMPL_FUNC` API
I think you mean the const modifier can be added, not eliminated, right? Like, out arguments can now be typed as `const Tensor&` in `TORCH_IMPL_FUNC` declarations, as in

    TORCH_IMPL_FUNC(upsample_nearest1d_out_cpu) (
        const Tensor& input,
        IntArrayRef output_size,
        c10::optional<double> scales,
        const Tensor& output // currently just Tensor& output
    ) {
      upsample_nearest1d_kernel(kCPU, output, input, scales);
    }

Aside: it happens that `upsample_nearest1d_kernel` itself takes `Tensor& output`, which might hint at additional porting work for at least some kernels to adopt this change.
your Tensor to a CPUTensor and then utilize the regular API.) One
possible argument for retaining the `at::cpu::` namespace is that these
functions are guaranteed to bypass dispatching, whereas other functions
may implicitly downcast to `Tensor` and do an optimized call.
upcast
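To make the bypass-dispatch point concrete, here is a minimal sketch. `at::cpu::add` is the fast-path spelling proposed in the RFC text quoted above, so treat it as an assumption rather than a settled API:

```cpp
#include <ATen/ATen.h>

// Sketch only: contrasts a dispatched call with the proposed direct-call
// namespace. at::add goes through the dispatcher and works for any backend;
// at::cpu::add (the RFC's proposed spelling) would call the CPU kernel
// directly and is only correct if every argument is already a CPU tensor.
void dispatch_vs_direct() {
  at::Tensor a = at::randn({2, 3});
  at::Tensor b = at::randn({2, 3});

  at::Tensor c = at::add(a, b);       // dispatched; safe for CPU, CUDA, ...
  at::Tensor d = at::cpu::add(a, b);  // direct CPU kernel, no dispatch
}
```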
a reference count bump).

One question is whether or not the existence of CPUTensor means we should
eliminate the `at::cpu::` namespace (as they serve near equivalent purposes;
`at::cpu` seems much less risky around the edges than CPUTensor [1] - if they serve near equivalent purposes, what's an example that motivates adding CPUTensor either instead of or in addition to `at::cpu`?

[1] stemming from CPUTensor not reeeally being a subtype of Tensor. A stray call to something like `set_storage_and_dtype` would damage a CPUTensor that's been upcast to a Tensor (or provoke a runtime error, if we've put safety measures in place).
It is an ergonomics thing (but important ergonomics): with `at::cpu`, you have to remember to use the CPU-only function (and you have to make sure not to accidentally call it with a non-CPU tensor). CPUTensor ensures that the CPU tensor is used, and you only need to prove you have a CPU tensor when you initially construct the CPUTensor.
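A minimal sketch of that ergonomics argument, assuming a hypothetical CPUTensor wrapper (no such class exists in ATen today): the device proof is paid once at construction rather than at every call site.

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Hypothetical sketch only: CPUTensor is not an existing ATen class. The idea
// is that the device check happens once, in the constructor, so every later
// use of the wrapper is statically known to hold a CPU tensor.
class CPUTensor {
 public:
  explicit CPUTensor(at::Tensor t) : tensor_(std::move(t)) {
    TORCH_CHECK(tensor_.device().is_cpu(), "CPUTensor requires a CPU tensor");
  }
  // "Upcasting" back to Tensor is always safe; the reverse requires the check above.
  const at::Tensor& unwrap() const { return tensor_; }

 private:
  at::Tensor tensor_;
};
```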
never become structured; to make an analogy, sometimes you have to
write assembly, and as long as it is not too frequent, there is not
too much to be gained from trying to extend the functionality of your
system to expunge these entirely.
A counterargument to consider here, I think, is that making structured kernels ubiquitous would allow other system capabilities to be defined in terms of them - e.g., shape analysis, memory planning.
The answer wouldn't be to torture everything into having in/func/out variants, but probably to add a handful of other little composite kernel system variations, each with its own codegen and triggering annotation in native_functions.yaml. Factory functions are an obvious fit for this kind of approach, but maybe they're the only good fit? Haven't thought it through any further.
Well, this certainly would work. What is less certain is whether the number of operators that would be covered by the system variation warrants the variations. Well, at least we can delay designing this until later...
Summary: See the structured kernel definition [RFC](pytorch/rfcs#9) for context.

Pull Request resolved: #50189

Reviewed By: mrshenli

Differential Revision: D25903846

Pulled By: soulitzer

fbshipit-source-id: 0059fda9b7d86f596ca35d830562dd4b859293a0
Fix a typo and add syntax highlighting to structured kernel rfc
This is a proposal for a new code generation facility for writing kernels in PyTorch, where we will automatically generate easy-to-get-wrong boilerplate for functional (add), inplace (add_) and out (add_out) variants of functions, as well as common code (device guards, version counter tracking). The net result is you only need to write a shape checking function and an out-kernel when writing a function.
Rendered
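To illustrate what "you only need to write a shape checking function and an out-kernel" might look like in practice, here is a hedged sketch. It reuses the `TORCH_IMPL_FUNC` spelling from the review thread above together with an assumed companion meta macro; the exact macro names, the shape check, and the computed output size are illustrative assumptions, not the final generated interface.

```cpp
#include <ATen/ATen.h>
#include <ATen/NativeFunctions.h>

// Assumed "meta" function: checks shapes and declares the output via
// set_output; the framework allocates/resizes the output before impl runs.
TORCH_META_FUNC(upsample_nearest1d) (
    const at::Tensor& input,
    at::IntArrayRef output_size,
    c10::optional<double> scales) {
  TORCH_CHECK(input.dim() == 3, "expected a 3-D (N, C, W) input");
  TORCH_CHECK(output_size.size() == 1, "expected a single output length");
  set_output({input.size(0), input.size(1), output_size[0]}, input.options());
}

// Out-kernel: the only device-specific code the author writes. Functional and
// inplace variants are generated from this plus the meta function above.
TORCH_IMPL_FUNC(upsample_nearest1d_out_cpu) (
    const at::Tensor& input,
    at::IntArrayRef output_size,
    c10::optional<double> scales,
    const at::Tensor& output) {
  // Actual kernel work goes here; output has already been allocated/resized
  // by the framework, and the shape checks in the meta function have run.
}
```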