RFC-0005: Structured kernel definitions RFC #9
base: master
Conversation
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
See pytorch/rfcs#9 This mostly follows the same structure as the proposal, though some shortcuts have been taken. It doesn't currently build because I haven't actually implemented the meta function for the function I marked as structured. Still needs a lot of work; the prototype is here just to show feasibility. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555) [ghstack-poisoned]
Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework. There is a new meta API which is the calling convention for TensorMeta calculation functions. Most of the new codegen lives in structured_func; check out the RFC for an explanation of what the code looks like.

Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

There are some hacks which I can work harder to unwind:
- I need to get upsample_nearest1d to be registered as abstract: True in Declarations.yaml even though it has no dispatch table (as it is implicitly filled by upsample_nearest1d.out). I ended up hacking this up by just adding a new field 'abstract: True' that lets you manually override the abstractness. Better would be to just teach the codegen to fill this correctly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555) [ghstack-poisoned]
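To make the meta API concrete, here is a sketch of what such a TensorMeta calculation function might look like (the exact TensorMeta constructor and helper names are assumptions for illustration, not the PR's actual code):

```cpp
#include <ATen/ATen.h>
using namespace at;

// Hypothetical meta function for upsample_nearest1d under the new calling
// convention: it performs the shape checks and computes the output's
// sizes/options without touching any data.
TensorMeta upsample_nearest1d_meta(const Tensor& self, IntArrayRef output_size,
                                   c10::optional<double> scales) {
  TORCH_CHECK(output_size.size() == 1, "output_size must have exactly one element");
  TORCH_CHECK(self.dim() == 3, "expected a 3-D input of shape (N, C, W)");
  // Output keeps the batch and channel dims; the last dim is the requested size.
  return TensorMeta({self.size(0), self.size(1), output_size[0]}, self.options());
}
```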
void upsample_nearest1d_structured_cpu(
    const TensorMeta& out_meta, const Tensor& out, const Tensor& self, IntArrayRef output_size, optional<double> scales);
void upsample_nearest1d_structured_cuda(
    const TensorMeta& out_meta, const Tensor& out, const Tensor& self, IntArrayRef output_size, optional<double> scales);
Maybe worth noting somewhere in here that you're limiting the example to CPU and CUDA for expository purposes? The use of explicit dispatch keys above should probably imply to the careful reader that this setup is parameterized over all backend dispatch keys in the usual way (IIUC), but if so it's probably useful to make it explicit.
* Performance reasons. Introducing the common key would induce an
  extra redispatch, which at time of writing would give up quite a
  bit of performance due to dispatch overhead, for no particularly
  good reason.
I think it's worth thinking carefully about whether this is the best way to make structured kernels available out of tree, given the modality it introduces, the risk of drift, the perf handicap (albeit diminishing over time) it saddles out-of-tree backends with, etc.
E.g. once the rest of the codegen has been ported, it's not absurd to imagine fitting it with a frontend that can take inputs other than the in-tree yaml...
The common dispatch key is primarily oriented towards backend implementers.
Let's suppose for a moment that an overriding design goal is safety/correctness first, with the ability to opt into performance. Then it feels like I am forced to introduce the Common dispatch key, because without it, an average backend overrider has to faithfully replicate all of the functionality that we have otherwise fused into CPU/CUDA operators (shape checking, device guards, and, with some refactors coming soon, version counter bumps). It's unrealistic to expect a backend implementer to actually manage all of this without code generation.
That being said, there is a certain optionality to the Common dispatch key. We don't have to implement it (and indeed, in the current posted PR, it is not implemented), and if it is not implemented, the burden is simply on backend implementers to implement all of the necessary scaffolding (which is the de facto situation today). If, for example, we pivoted to publishing code generation for backend implementers, that would alleviate or perhaps eliminate the need for a common dispatch key.
The common dispatch key is mostly irrelevant for custom operators, since custom operators are typically registered via catch-all registration, and there will not be any common dispatch key in that case.
Yeah, agree on all the motivations. My pitch would be to (wait and) audition a lightly parameterized codegen pipeline as an alternative, once the rewrite is complete. (Waiting would serve not just to let the codegen settle down, but also give us more time to see how quickly dispatcher overhead was coming down.)
I think some of the weirdness, like making native_functions.yaml entries depend on each other or having two ways of doing things (dispatch key vs. fusing into kernels), is because you designed under the constraint of not changing the fact that the three operator overloads (regular, inplace, out) are actually registered as separate operators with the dispatcher.
Have we considered an alternative solution where the dispatcher only knows about the out overload and the regular and inplace variants are generated in the frontend before the call to the dispatcher? That would also make it extensible for out-of-tree ops and take the burden away from backend implementors.
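For concreteness, a minimal sketch of that alternative, assuming hypothetical wrapper names (this illustrates the reviewer's suggestion, not what the RFC proposes): only the out overload is registered with the dispatcher, and the functional variant is a thin frontend wrapper generated around it.

```cpp
#include <ATen/ATen.h>
using namespace at;

// Hypothetical frontend-generated functional wrapper: allocate an output
// (shape computation elided here) and forward to the single dispatched
// out overload, so the dispatcher never sees the functional variant at all.
Tensor upsample_nearest1d_frontend(const Tensor& self, IntArrayRef output_size,
                                   c10::optional<double> scales) {
  Tensor out = at::empty({0}, self.options());  // resized by the out kernel
  at::upsample_nearest1d_out(out, self, output_size, scales);
  return out;
}
```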
  provide public API for running shape computations without any
  kernel.

* Generated code is augmented to do version counter bumps and view
This means we are using this mechanism for all ops, even ops that don't have out or inplace variants, right? Otherwise those ops wouldn't get device guards or version counter bumps.
Yeah, this is an orthogonal change that we can do with or without structured kernels.
  in to higher performance. There is always an escape hatch to be high
  performance if absolutely necessary.

* **No codegen**: As long as it is possible to implement things out of
As long as we use this mechanism internally for all ops, we have meta functions for them defined and backends adding new kernels don't need to care about it. Only backends that add new operators would have to use it. Did I understand this correctly?
That's right. So if you don't care about performance, you don't even have to lift a finger.
Wait, I think I misunderstood this. To get device guards, you'll still have to go through the dispatch key for those backends, but not when calling into internal backends like CPU. How do you selectively make that dispatch key a fallthrough for some backends but not for others?
I claimed this would work in our meeting, but rereading your comment, I take it back, you really do need a per backend key here :(
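(For reference, the mechanics of opting a single backend out would presumably be a fallthrough registration on that backend's per-backend key, roughly like the sketch below; `CommonCPU` is a made-up key name for illustration, and only the registration API itself is real.)

```cpp
#include <torch/library.h>

// Hypothetical: a backend whose kernels already fuse shape checks and device
// guards registers a fallthrough on its "common" key, so calls skip the
// shared wrapper and go straight to the backend kernel.
TORCH_LIBRARY_IMPL(aten, CommonCPU, m) {  // "CommonCPU" is an assumed key name
  m.fallback(torch::CppFunction::makeFallthrough());
}
```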
* Extensions

* Add a new dispatch key (name tbd) which contains shape checking,
This dual approach makes the system a bit complex; it can be hard to figure out where version counter bumps, device guards, or shape checking happen for a given op, especially when you're not familiar with this system (yet).
new *structured* format for writing kernels. We’ll do this by marking
the out version of this operator as structured and deleting dispatch
entries from the functional version (the functional operator is
*implicitly* associated with the out-of-place version in the same way
This implicit connection makes reasoning about entries in native_functions.yaml harder, since they're not independent entries anymore.
```
- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor
  # [NEW] dispatches for this function are omitted
```
Hm, maybe we should make this explicit by also adding `structured: True` here, to make the information more local and easier to read. Otherwise you have to look at other entries in native_functions.yaml to know what this entry is actually doing. Omitting dispatch keys is already a valid kernel definition even if it's not structured, so by just looking at this one it would be ambiguous.
```
namespace native {

Tensor upsample_nearest1d_cuda(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
```
These should be easy to generate with templates instead, and it should be possible to do that in a readable way.
The current expressivity problems:

- Guard logic is weird and special-casey in codegen right now. We probably should be able to simplify it a bit, but right now it would be quite difficult to faithfully replicate the logic in a template.
- The upcoming version counter bumps will be difficult to do without out-of-band information about what arguments are mutable or not (right now you can check this using `Tensor&`, but when we fix everything to uniformly be `const Tensor&` you'll lose this type info).
// functionality here is common to all backends. This is an alias key
// that resolves CommonXLA/CommonMSNPU/... in the same way as Autograd.

Tensor upsample_nearest1d_common(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
We could offer a simple metaprogram for them so they don't have to manually write this for each op. But if I understand correctly, they only have to write this for ops they introduce right? Ops from native_functions.yaml that the backend only extends will already have shape checking etc. through our codegen right?
Yes
* An earlier version of this proposal had the boilerplate
  generated using C++ templates rather than codegen. However, we
  think the formulation in this proposal is superior under the
  constraint that mobile selective build must keep working, as we
I don't understand the connection to mobile selective build yet
We can talk about this in the meeting. It's constraint solving from the problem "mobile requires registrations to be a separate compilation unit from kernels"
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This is very exciting! Looking forward to having this structure. Do you guys plan to support all standard GPU kernels using this new framework (so they have shape inference and _out versions)?
Yeah. The aspiration is that every kernel in PyTorch is in this framework. That's gonna be a lot of work, but hopefully we can lay the groundwork and then roll it out as we go.
This is not entirely settled yet, but the intention for the section at #9 (comment) was to make this possible. So if you are doing a custom op, you now define (e.g.) three parts: a CPU part, a CUDA part, and the static shape checking part, and the framework would put it all together for you. One extra thing I'd add, though: when I've been chatting with other static runtime people at FB, they seem to want strange things like being able to run the static inference really fast at the beginning of each run to work out the preallocation. At least in the first iteration, the API for actually running these shape computations won't be particularly fast, and will be mostly useful for offline use cases.
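As a rough sketch of what those three parts might look like for a hypothetical custom op (`my_gather2d` and the exact registration surface are made up for illustration; the per-backend signature mirrors the structured calling convention quoted earlier in this thread):

```cpp
#include <ATen/ATen.h>
using namespace at;

// Part 1 (backend-agnostic): shape checking / output metadata, no data access.
TensorMeta my_gather2d_meta(const Tensor& self, const Tensor& index) {
  TORCH_CHECK(self.dim() == 2 && index.dim() == 2, "expected 2-D inputs");
  return TensorMeta(index.sizes(), self.options());  // output matches index shape
}

// Part 2: CPU data computation only; shapes were already validated above.
void my_gather2d_cpu(const TensorMeta& out_meta, const Tensor& out,
                     const Tensor& self, const Tensor& index) {
  // ... CPU kernel body ...
}

// Part 3: CUDA data computation only.
void my_gather2d_cuda(const TensorMeta& out_meta, const Tensor& out,
                      const Tensor& self, const Tensor& index) {
  // ... CUDA kernel body ...
}
```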
Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework. The general structure of this diff:

- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model.
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model.
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they're a bit different, as they call your meta functions and make reference to the actual implementations in out.
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.

Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555) [ghstack-poisoned]
Summary: Pull Request resolved: #45277

Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework. The general structure of this diff:

- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model.
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model.
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they're a bit different, as they call your meta functions and make reference to the actual implementations in out.
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.

Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

This PR improves instruction counts on `upsample_nearest1d` because it eliminates an extra redispatch. Testing `at::upsample_nearest1d(x, {10});`
* Functional: before 1314105, after 1150705
* Out: before 915705, after 838405

These numbers may be jittered up to +-16400 (which is the difference when I tested against an unaffected operator, `at::upsample_linear1d`), though that may also be because unrelated changes affected all operators globally.

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D24253555 Test Plan: Imported from OSS Reviewed By: smessmer Pulled By: ezyang fbshipit-source-id: 4ef58dd911991060f13576864c8171f9cc614456
…tion of add to framework" This PR rewrites structured kernels to use the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files):

* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is: someone calls the subclassed set_output, which allocates the output, and then we call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - Now generates all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way, which is that you can easily skip shape checking (omit the meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove the special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
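Very roughly, the class-based formulation reads like the sketch below (the class names, the MetaBase spelling, and the set_output signature are simplified assumptions; the generated code in MetaFunctions.h/NativeFunctions.h is the source of truth):

```cpp
// Hypothetical shape of a class-based structured kernel: meta() validates
// inputs and calls set_output() to allocate/resize the result; a backend's
// impl() then fills in the data for the already-allocated output.
struct structured_upsample_nearest1d : public MetaBase {
  void meta(const Tensor& self, IntArrayRef output_size, c10::optional<double> scales) {
    TORCH_CHECK(output_size.size() == 1, "output_size must have exactly one element");
    set_output({self.size(0), self.size(1), output_size[0]}, self.options());
  }
};

struct structured_upsample_nearest1d_cpu : public structured_upsample_nearest1d {
  void impl(const Tensor& self, IntArrayRef output_size,
            c10::optional<double> scales, const Tensor& out) {
    // CPU-only data computation; `out` was allocated by meta()'s set_output().
  }
};
```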
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…tion of add to framework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…ramework" This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here. High level structure of this PR (the order you should review files): * TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator. * TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available. * tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels * tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device. * aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest. * aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured. * To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime. TODO: * Make Tensor-Scalar addition structured to fix perf regression * Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031) [ghstack-poisoned]
…48718)

Summary: Pull Request resolved: #48718

This PR rewrites structured kernels to use the class-based mechanism (instead of defining a meta and an impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high-level description of what's going on here.

High-level structure of this PR (the order in which you should review the files):

* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow is that someone calls the subclassed set_output, which allocates the output, and then we call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip the parts of TensorIterator that are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - Now generates all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old-style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc. won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:

* Work out an appropriate entry point for static runtime, since native:: function stubs are no longer generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix the perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why the header changes resulted in an undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278031

Pulled By: ezyang

fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
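For readers skimming this thread, a rough sketch of the class shape the codegen described above produces may help. The names, namespaces, and exact signatures below are illustrative assumptions for an add-like TensorIterator kernel, not the literal generated code:

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorIterator.h>

// Illustrative sketch only (names/namespaces are assumptions, not the real
// generated headers). The codegen emits a "meta" class whose parent is
// controlled by structured_inherits in native_functions.yaml; kernel authors
// subclass it and provide impl(). set_output() is instantiated per backend,
// performs the actual allocation, and records the output in TensorIteratorBase.
namespace at { namespace meta {

struct structured_add_Tensor : public TensorIteratorBase {
  void meta(const Tensor& self, const Tensor& other, const Scalar& alpha);
};

}} // namespace at::meta

namespace at { namespace native {

struct structured_add_out : public at::meta::structured_add_Tensor {
  void impl(const Tensor& self, const Tensor& other, const Scalar& alpha,
            const Tensor& out);
};

}} // namespace at::native
```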
Fair enough. We can use your naming!
template metaprogramming machinery (to detect if arguments are out
tensors or not); however, because the implementations of structured
kernels are a layer below the operator registration layer, the
const modifier can be eliminated from the `TORCH_IMPL_FUNC` API
I think you mean the const modifier can be added, not eliminated, right? Like, out arguments can now be typed as `const Tensor&` in `TORCH_IMPL_FUNC` declarations, as in

    TORCH_IMPL_FUNC(upsample_nearest1d_out_cpu) (
        const Tensor& input,
        IntArrayRef output_size,
        c10::optional<double> scales,
        const Tensor& output // currently just Tensor& output
    ) {
      upsample_nearest1d_kernel(kCPU, output, input, scales);
    }

Aside: it happens that `upsample_nearest1d_kernel` itself takes `Tensor& output`, which might hint at additional porting work for at least some kernels to adopt this change.
your Tensor to a CPUTensor and then utilize the regular API.) One
possible argument for retaining the `at::cpu::` namespace is that these
functions are guaranteed to bypass dispatching, whereas other functions
may implicitly downcast to `Tensor` and do an optimized call.
upcast
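To make the bypass-dispatch point concrete, here is a minimal sketch. `at::cpu::add` is the fast-path spelling proposed in the RFC text quoted above, so treat it as an assumption rather than a settled API:

```cpp
#include <ATen/ATen.h>

// Sketch only: contrasts a dispatched call with the proposed direct-call
// namespace. at::add goes through the dispatcher and works for any backend;
// at::cpu::add (the RFC's proposed spelling) would call the CPU kernel
// directly and is only correct if every argument is already a CPU tensor.
void dispatch_vs_direct() {
  at::Tensor a = at::randn({2, 3});
  at::Tensor b = at::randn({2, 3});

  at::Tensor c = at::add(a, b);       // dispatched; safe for CPU, CUDA, ...
  at::Tensor d = at::cpu::add(a, b);  // direct CPU kernel, no dispatch
}
```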
a reference count bump).

One question is whether or not the existence of CPUTensor means we should
eliminate the `at::cpu::` namespace (as they serve near equivalent purposes;
`at::cpu` seems much less risky around the edges than CPUTensor [1] - if they serve near equivalent purposes, what's an example that motivates adding CPUTensor either instead of or in addition to `at::cpu`?

[1] stemming from CPUTensor not reeeally being a subtype of Tensor. A stray call to something like `set_storage_and_dtype` would damage a CPUTensor that's been upcast to a Tensor (or provoke a runtime error, if we've put safety measures in place).
It is an ergonomics thing (but important ergonomics): with `at::cpu`, you have to remember to use the CPU-only function (and you have to make sure not to accidentally call it with a non-CPU tensor). CPUTensor ensures that the CPU tensor is used, and you only need to prove you have a CPU tensor when you initially construct the CPUTensor.
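A minimal sketch of that ergonomics argument, assuming a hypothetical CPUTensor wrapper (no such class exists in ATen today): the device proof is paid once at construction rather than at every call site.

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Hypothetical sketch only: CPUTensor is not an existing ATen class. The idea
// is that the device check happens once, in the constructor, so every later
// use of the wrapper is statically known to hold a CPU tensor.
class CPUTensor {
 public:
  explicit CPUTensor(at::Tensor t) : tensor_(std::move(t)) {
    TORCH_CHECK(tensor_.device().is_cpu(), "CPUTensor requires a CPU tensor");
  }
  // "Upcasting" back to Tensor is always safe; the reverse requires the check above.
  const at::Tensor& unwrap() const { return tensor_; }

 private:
  at::Tensor tensor_;
};
```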
never become structured; to make an analogy, sometimes you have to
write assembly, and as long as it is not too frequent, there is not
too much to be gained from trying to extend the functionality of your
system to expunge these entirely.
A counterargument to consider here, I think, is that making structured kernels ubiquitous would allow other system capabilities to be defined in terms of them - e.g., shape analysis, memory planning.
The answer wouldn't be to torture everything into having in/func/out variants, but probably to add a handful of other little composite kernel system variations, each with its own codegen and triggering annotation in native_functions.yaml. Factory functions are an obvious fit for this kind of approach, but maybe they're the only good fit? Haven't thought it through any further.
Well, this certainly would work. What is less certain is whether the number of operators that would be covered by the system variation warrants the variations. Well, at least we can delay designing this until later...
Summary: See the structured kernel definition [RFC](pytorch/rfcs#9) for context.

Pull Request resolved: #50189

Reviewed By: mrshenli

Differential Revision: D25903846

Pulled By: soulitzer

fbshipit-source-id: 0059fda9b7d86f596ca35d830562dd4b859293a0
Fix a typo and add syntax highlighting to structured kernel rfc
This is a proposal for a new code generation facility for writing kernels in PyTorch, where we will automatically generate easy-to-get-wrong boilerplate for functional (add), inplace (add_) and out (add_out) variants of functions, as well as common code (device guards, version counter tracking). The net result is you only need to write a shape checking function and an out-kernel when writing a function.
Rendered
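To illustrate what "you only need to write a shape checking function and an out-kernel" might look like in practice, here is a hedged sketch. It reuses the `TORCH_IMPL_FUNC` spelling from the review thread above together with an assumed companion meta macro; the exact macro names, the shape check, and the computed output size are illustrative assumptions, not the final generated interface.

```cpp
#include <ATen/ATen.h>
#include <ATen/NativeFunctions.h>

// Assumed "meta" function: checks shapes and declares the output via
// set_output; the framework allocates/resizes the output before impl runs.
TORCH_META_FUNC(upsample_nearest1d) (
    const at::Tensor& input,
    at::IntArrayRef output_size,
    c10::optional<double> scales) {
  TORCH_CHECK(input.dim() == 3, "expected a 3-D (N, C, W) input");
  TORCH_CHECK(output_size.size() == 1, "expected a single output length");
  set_output({input.size(0), input.size(1), output_size[0]}, input.options());
}

// Out-kernel: the only device-specific code the author writes. Functional and
// inplace variants are generated from this plus the meta function above.
TORCH_IMPL_FUNC(upsample_nearest1d_out_cpu) (
    const at::Tensor& input,
    at::IntArrayRef output_size,
    c10::optional<double> scales,
    const at::Tensor& output) {
  // Actual kernel work goes here; output has already been allocated/resized
  // by the framework, and the shape checks in the meta function have run.
}
```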