RFC-0005: Structured kernel definitions RFC #9

Open · wants to merge 14 commits into master

Conversation

@ezyang (Contributor) commented Oct 13, 2020

This is a proposal for a new code generation facility for writing kernels in PyTorch, which automatically generates the easy-to-get-wrong boilerplate for the functional (add), inplace (add_), and out (add_out) variants of a function, as well as common code (device guards, version counter tracking). The net result is that to write a new function you only need to write a shape-checking function and an out-kernel.
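
For a flavor of the division of labor, here is a minimal sketch of the two pieces an author would write under this proposal, following the prototype's TensorMeta convention; the names and exact signatures are illustrative, not the final generated API:

```
// (1) A device-independent shape-checking "meta" function: validates
// inputs and computes output metadata. No data is touched, so it can be
// shared by every backend and reused for static shape inference.
TensorMeta upsample_nearest1d_meta(
    const Tensor& self, IntArrayRef output_size, optional<double> scales) {
  TORCH_CHECK(self.dim() == 3, "expected 3D input, got ", self.dim(), "D");
  TORCH_CHECK(output_size.size() == 1, "expected a single output size");
  // Hypothetical constructor: records the output's sizes and options.
  return TensorMeta({self.size(0), self.size(1), output_size[0]},
                    self.options());
}

// (2) An out-kernel: receives a correctly sized, pre-allocated `out` and
// only does the computation. The functional and out variants, device
// guards, version counter tracking, etc. are generated from these pieces.
void upsample_nearest1d_impl_cpu(
    const Tensor& out, const Tensor& self, IntArrayRef output_size,
    optional<double> scales) {
  // ... actual CPU computation writing into `out` ...
}
```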

Rendered

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@ezyang changed the title from "Structured kernel definitions RFC" to "RFC-0005: Structured kernel definitions RFC" on Oct 14, 2020
ezyang added a commit to pytorch/pytorch that referenced this pull request Oct 14, 2020
See pytorch/rfcs#9

This mostly follows the same structure as the proposal, though some shortcuts have been taken. It doesn't currently build because I haven't actually implemented the meta function for the function I marked as structured. It still needs a lot of work; the prototype is here just to show feasibility.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555)

[ghstack-poisoned]
ezyang added a commit to pytorch/pytorch that referenced this pull request Oct 22, 2020
Implements structured kernels as per pytorch/rfcs#9
and ports upsample_nearest1d to use the framework.

There is a new meta api which is the calling convention for TensorMeta
calculation functions.  Most of the new codegen lives in
structured_func; check out the RFC for an explanation of what the code
looks like.

Missing pieces:

- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

There are some hacks that I can work harder to unwind:

- I need to get upsample_nearest1d to be registered as abstract: True
  in Declarations.yaml even though it has no dispatch table (as it
  is implicitly filled by upsample_nearest1d.out).  I ended up
  hacking this by just adding a new field 'abstract: True' that lets
  you manually override the abstractness.  Better would be to teach
  the codegen to fill this in correctly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555)

[ghstack-poisoned]
```
void upsample_nearest1d_structured_cpu(
    const TensorMeta& out_meta, const Tensor& out, const Tensor& self,
    IntArrayRef output_size, optional<double> scales);
void upsample_nearest1d_structured_cuda(
    const TensorMeta& out_meta, const Tensor& out, const Tensor& self,
    IntArrayRef output_size, optional<double> scales);
```

maybe worth noting somewhere in here that you're limiting the example to CPU and CUDA for expository purposes? the use of explicit dispatch keys above should probably imply to the careful reader that this setup is parameterized over all backend dispatch keys in the usual way (IIUC), but if so probably useful to make it explicit

* Performance reasons. Introducing the common key would induce an
extra redispatch, which at time of writing would give up quite a
bit of performance due to dispatch overhead, for no particularly
good reason.

I think it's worth thinking carefully about whether this is the best way to make structured kernels available out of tree, given the modality it introduces, risk of drift, perf handicap (albeit diminishing over time) it saddles out of tree backends with, etc.

E.g. once the rest of the codegen has been ported, it's not absurd to imagine fitting it with a frontend that can take inputs other than the in-tree yaml...

@ezyang (Contributor, Author):

The common dispatch key is primarily oriented towards backend implementers.

Let's suppose, for a moment, that an overriding design goal is safety/correctness first, with the ability to opt into performance. Then it feels like I am forced to introduce the Common dispatch key, because without it an average backend overrider has to faithfully replicate all of the functionality that we have otherwise fused into the CPU/CUDA operators (shape checking, device guards, and, with some refactors coming soon, version counter bumps). It's unrealistic to expect a backend implementer to actually manage all of this without code generation.

That being said, there is a certain optionality to the Common dispatch key. We don't have to implement it (and indeed, in the currently posted PR, it is not implemented); if it is not implemented, the burden is simply on backend implementers to write all of the necessary scaffolding themselves (which is the de facto situation today). If, for example, we pivoted to publishing code generation for backend implementers, that would alleviate or perhaps eliminate the need for a common dispatch key.

The common dispatch key is mostly irrelevant for custom operators, since custom operators are typically registered via catch-all registration, in which case there is no common dispatch key at all.
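
For readers less familiar with the two registration styles being contrasted, a rough sketch using the c10 registration API; `myops`, `myop`, and `my_xla_kernel` are made-up names:

```
#include <ATen/ATen.h>
#include <torch/library.h>

at::Tensor myop(const at::Tensor& self);           // hypothetical custom op
at::Tensor my_xla_kernel(const at::Tensor& self);  // hypothetical backend kernel

// Catch-all registration, the typical pattern for custom operators: one
// kernel serves every backend, so a Common key would never be consulted.
TORCH_LIBRARY(myops, m) {
  m.def("myop(Tensor self) -> Tensor", myop);
}

// Per-backend registration, the pattern for backend extenders: this is
// the audience a Common key (carrying generated shape checks, device
// guards, version counter bumps) is meant to serve.
TORCH_LIBRARY_IMPL(myops, XLA, m) {
  m.impl("myop", my_xla_kernel);
}
```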

@bhosmer commented Oct 28, 2020:

Yeah, agree on all the motivations. My pitch would be to (wait and) audition a lightly parameterized codegen pipeline as an alternative, once the rewrite is complete. (Waiting would serve not just to let the codegen settle down, but also give us more time to see how quickly dispatcher overhead was coming down.)

@smessmer left a comment:

I think some of the weirdness, like making native_functions.yaml entries depend on each other, or having two ways of doing things (dispatch key vs. fusing into kernels), comes from designing under the constraint of not changing the fact that the three operator overloads (regular, inplace, out) are actually registered as separate operators with the dispatcher.

Have we considered an alternative solution where the dispatcher only knows about the out overload and the regular and inplace variants are generated in the frontend before the call to the dispatcher? That would also make it extensible for out-of-tree ops and take the burden away from backend implementors.
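
A hypothetical sketch of that alternative (illustrative only, not what the RFC proposes): only add.out would be a dispatcher entry, and the other variants would be plain C++ generated in the frontend:

```
#include <ATen/ATen.h>

// Functional variant: allocate using the (backend-independent) meta
// function, then make the single dispatcher call to the out variant.
at::Tensor add(const at::Tensor& self, const at::Tensor& other,
               at::Scalar alpha) {
  // Placeholder allocation: a real implementation would take the
  // broadcasted shape and promoted dtype from the meta function.
  at::Tensor out = at::empty(self.sizes(), self.options());
  return at::add_out(out, self, other, alpha);
}

// Inplace variant: reuse self as the output.
at::Tensor& add_(at::Tensor& self, const at::Tensor& other,
                 at::Scalar alpha) {
  return at::add_out(self, self, other, alpha);
}
```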

provide public API for running shape computations without any
kernel.

* Generated code is augmented to do version counter bumps and view


This means we are using this mechanism for all ops, even ops that don't have out or inplace variants, right? Otherwise those ops wouldn't get device guards or version counter bumps.

@ezyang (Contributor, Author):

Yeah, this is an orthogonal change that we can do with or without structured kernels.

in to higher performance. There is always an escape hatch to be high
performance if absolutely necessary

* **No codegen**: As long as it is possible to implement things out of


As long as we use this mechanism internally for all ops, we have meta functions defined for them, and backends adding new kernels don't need to care about it. Only backends that add new operators would have to use it. Did I understand this correctly?

@ezyang (Contributor, Author):

That's right. So if you don't care about performance, you don't even have to lift a finger.


Wait, I think I misunderstood this. To get device guards, you'll still have to go through the dispatch key for those backends, but not when calling into internal backends like CPU. How do you selectively make that dispatch key a fallthrough for some backends but not for others?

@ezyang (Contributor, Author):

I claimed this would work in our meeting, but rereading your comment, I take it back: you really do need a per-backend key here :(
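
For background on why: a fallthrough is registered against an entire dispatch key, as in the (real) pattern below, so a single shared key cannot be transparent for CPU/CUDA tensors while remaining active for XLA ones.

```
#include <torch/library.h>

// A fallthrough makes a dispatch key transparent for everything that
// hits it; there is no per-backend conditional fallthrough on one key.
TORCH_LIBRARY_IMPL(_, AutogradOther, m) {
  m.fallback(torch::CppFunction::makeFallthrough());
}
```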


* Extensions

* Add a new dispatch key (name tbd) which contains shape checking,


This dual approach makes the system a bit complex; it can be hard to figure out where version counter bumps, device guards, or shape checking happen for a given op, especially when you're not (yet) familiar with this system.

new *structured* format for writing kernels. We’ll do this by marking
the out version of this operator as structured and deleting dispatch
entries from the functional version (the functional operator is
*implicitly* associated with the out-of-place version in the same way


This implicit connection makes reasoning about entries in native_functions.yaml harder since they're not independent entries anymore.


```
- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor
# [NEW] dispatches for this function are omitted
```


Hm, maybe we should make this explicit by also adding structured: True here, to make the information more local and easier to read. Otherwise you have to look at other entries in native_functions.yaml to know what this entry is actually doing. Omitting dispatch keys is already a valid kernel definition even when it's not structured, so just looking at this one entry would be ambiguous.

```
namespace native {

Tensor upsample_nearest1d_cuda(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
  // ...
```


these should be easy to generate with templates instead and it should be possible to do that in a readable way

@ezyang (Contributor, Author):

The current expressivity problems:

  1. Guard logic is weird and special-casey in codegen right now. We probably should be able to simplify it a bit, but right now it would be quite difficult to faithfully replicate the logic in a template.
  2. The upcoming version counter bumps will be difficult to do without out-of-band information about which arguments are mutable (right now you can check this using Tensor&, but when we fix everything to uniformly be const Tensor& you'll lose this type information; see the sketch below).
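
To illustrate the second point (hypothetical declarations):

```
// With today's signatures, a template can tell which argument is
// mutable from the types alone:
void add_out_today(Tensor& out, const Tensor& self);
// Once everything is uniformly const Tensor&, that type-level signal is
// gone; which argument is the output must come from out-of-band
// metadata (native_functions.yaml), which codegen can see but a C++
// template cannot:
void add_out_later(const Tensor& out, const Tensor& self);
```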

```
// functionality here is common to all backends. This is an alias key
// that resolves CommonXLA/CommonMSNPU/... in the same way as Autograd.

Tensor upsample_nearest1d_common(const Tensor& self, IntArrayRef output_size, optional<double> scales) {
  // ...
```


We could offer a simple metaprogram for them so they don't have to manually write this for each op. But if I understand correctly, they only have to write this for ops they introduce, right? Ops from native_functions.yaml that the backend merely extends will already have shape checking etc. through our codegen, right?

@ezyang (Contributor, Author):

Yes

* An earlier version of this proposal had the boilerplate
generated using C++ templates rather than codegen. However, we
think the formulation in this proposal is superior under the
constraint that mobile selective build must keep working, as we


I don't understand the connection to mobile selective build yet

@ezyang (Contributor, Author):

We can talk about this in the meeting. It's constraint solving from the problem "mobile requires registrations to be a separate compilation unit from kernels"

@salexspb commented Nov 5, 2020

This is very exciting! Looking forward to having this structure.
The ability to infer shapes statically and pre-allocate all the memory to be passed on later to the _out kernel versions is going to bring the PyTorch runtime to the next level, I think.

Do you guys plan to support all standard GPU kernels using this new framework (so they have shape inference and _out versions)?
And what about custom ops/classes, do you plan to support them in the kernel definition framework as well? Ideally I would like to be able to have 100% static models via a mix of built-in _out op versions and custom ops one may implement (also accepting external memory and providing shape checking/inference functionality).

@ezyang (Contributor, Author) commented Nov 5, 2020

Do you guys plan to support all GPU standard kernels using this new framework? (so they have shape inference and _out versions).

Yeah. The aspiration is every kernel in PyTorch is in this framework. That's gonna be a lot of work, but hopefully we can lay the groundwork and then roll it out as we go.

And what about custom ops / classes, do you plan to support them in the kernel definition framework as well?

This is not entirely settled yet, but the intention for the section at #9 (comment) was to make this possible. So if you are doing a custom op, you now define (e.g.) three parts: a CPU part, a CUDA part, and the static shape checking part, and the framework would put it all together for you.

One extra thing I'd add, though: when I've been chatting with other static runtime people at FB, they seem to want strange things like being able to run the static inference really fast at the beginning of each run to work out the preallocation. At least in the first iteration, the API for actually running these shape computations won't be particularly fast, and will be mostly useful for offline use cases.
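
For a flavor of the offline use case: once an op has a meta function, its shape computation can run on "meta" tensors, which carry sizes and dtype but no storage. A sketch, assuming the meta-tensor support that the later commits on this thread start to add:

```
#include <ATen/ATen.h>

int main() {
  // A meta tensor has metadata only: no storage, no real device.
  auto x = at::empty({1, 3, 8}, at::TensorOptions().device(at::kMeta));
  // Dispatching on the meta key runs only the shape-checking logic.
  auto y = at::upsample_nearest1d(x, {10});
  TORCH_CHECK(y.size(2) == 10);  // shape known without any real compute
  return 0;
}
```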

ezyang added a commit to pytorch/pytorch that referenced this pull request Nov 12, 2020
Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework.

The general structure of this diff:

- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by the actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they're a bit different, as they call your meta functions and reference the actual implementations in out.
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.

Missing pieces:

- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D24253555](https://our.internmc.facebook.com/intern/diff/D24253555)

[ghstack-poisoned]
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Nov 17, 2020
Summary:
Pull Request resolved: #45277

Implements structured kernels as per pytorch/rfcs#9 and ports upsample_nearest1d to use the framework.

This PR improves instruction counts on `upsample_nearest1d` because it eliminates an extra redispatch. Testing `at::upsample_nearest1d(x, {10});`

* Functional: before 1314105, after 1150705
* Out: before 915705, after 838405

These numbers may be jittered by up to ±16400 (which is the difference when I tested against an unaffected operator, `at::upsample_linear1d`), though that may also be because unrelated changes affected all operators globally.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D24253555

Test Plan: Imported from OSS

Reviewed By: smessmer

Pulled By: ezyang

fbshipit-source-id: 4ef58dd911991060f13576864c8171f9cc614456
ezyang added a commit to pytorch/pytorch that referenced this pull request Dec 8, 2020
…tion of add to framework"


This PR rewrites structured kernels to use a class-based mechanism (instead of defining a meta and an impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high-level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is: someone calls the subclassed set_output, which allocates the output, and then calls the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip the parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old-style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but one happy side effect of doing it this way is that you can easily skip shape checking (omit the meta call; not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
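
A minimal sketch of the class-based protocol this commit message describes; the signatures are simplified and hypothetical (for instance, the real set_output also carries strides and dimension names):

```
#include <ATen/ATen.h>

// Simplified sketch of the class-based structured-kernel protocol.
struct MetaBase {
  // Meta functions declare each output through set_output; subclasses
  // decide what that means: allocate (the usual case), resize/validate a
  // preexisting out= tensor, or merely record metadata (TensorIterator).
  virtual void set_output(int64_t output_idx, at::IntArrayRef sizes,
                          at::TensorOptions options) = 0;
  // Lets a meta function inspect the (possibly undefined) preexisting
  // output, e.g. for type promotion decisions.
  virtual const at::Tensor& maybe_get_output(int64_t output_idx) = 0;
  virtual ~MetaBase() = default;
};

// A structured op is then written as two methods on generated
// subclasses: meta(...) performs shape checks and calls set_output, and
// impl(..., const Tensor& out) does the computation with out already
// allocated.
```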
ezyang added a commit to pytorch/pytorch that referenced this pull request Dec 8, 2020
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
ezyang added a commit to pytorch/pytorch that referenced this pull request Dec 8, 2020
…tion of add to framework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
ezyang added a commit to pytorch/pytorch that referenced this pull request Dec 8, 2020
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Dec 9, 2020
…48718)

Summary:
Pull Request resolved: #48718

This PR rewrites structured kernels to use a class-based mechanism (instead of defining free-standing meta and impl functions, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high-level description of what's going on here.

High-level structure of this PR (the order in which you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to implement special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow is: someone calls the subclassed set_output, which allocates the output and then calls the parent class (TensorIteratorBase) to populate the fields in TensorIterator, so that the other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip the parts of TensorIterator that are unnecessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - Now generates all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to keep generating old-style wrapper functions even when an operator is structured, because SparseCPU/SparseCUDA/etc. won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs are no longer generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix the perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278031

Pulled By: ezyang

fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@ezyang
Contributor Author

ezyang commented Jan 7, 2021

The `*this` is deeply weird until you know the context this code is being defined in... it makes me think we might want macro names that avoid overselling that it's a freestanding function. It wouldn't need to literally say "you're looking at a method", but something like TORCH_STRUCTURED_IMPL.

Fair enough. We can use your naming!

template metaprogramming machinery (to detect if arguments are out
tensors or not); however, because the implementations of structured
kernels are a layer below the operator registration layer, the
const modifier can be eliminated from the `TORCH_IMPL_FUNC` API

I think you mean the const modifier can be added, not eliminated, right? Like, out arguments can now be typed as const Tensor& in TORCH_IMPL_FUNC declarations, as in

TORCH_IMPL_FUNC(upsample_nearest1d_out_cpu) (
    const Tensor& input,
    IntArrayRef output_size,
    c10::optional<double> scales,
    const Tensor& output             // currently just Tensor& output
) {
  upsample_nearest1d_kernel(kCPU, output, input, scales);
}

Aside: it happens that upsample_nearest1d_kernel itself takes Tensor& output, which might hint at additional porting work for at least some kernels to adopt this change.
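
To make the aside concrete, here is a hypothetical before/after of such a kernel signature (abridged; not the actual PyTorch declaration). Note that a const Tensor& out argument still permits writing the tensor's data, because the constness applies to the handle, not to the storage it points at:

#include <ATen/ATen.h>

// Before (hypothetical): the kernel writes through a mutable reference.
//   void upsample_nearest1d_kernel_impl(at::Tensor& output, ...);

// After: the out argument is const-qualified to match TORCH_IMPL_FUNC.
void upsample_nearest1d_kernel_impl(const at::Tensor& output,
                                    const at::Tensor& input,
                                    c10::optional<double> scales) {
  // Constness is shallow: data_ptr() on a const Tensor& still returns a
  // mutable pointer, so the kernel body itself does not need to change.
  float* out_data = output.data_ptr<float>();
  (void)out_data; (void)input; (void)scales;  // computation elided
}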

your Tensor to a CPUTensor and then utilize the regular API.) One
possible argument for retaining the `at::cpu::` namespace is that these
functions are guaranteed to bypass dispatching, whereas other functions
may implicitly downcast to `Tensor` and do an optimized call.

upcast

a reference count bump).

One question is whether or not the existence of CPUTensor means we should
eliminate the `at::cpu::` namespace (as they serve near equivalent purposes;

at::cpu seems much less risky around the edges than CPUTensor[1] - if they serve near equivalent purposes, what's an example that motivates adding CPUTensor either instead of or in addition to at::cpu?

[1] stemming from CPUTensor not reeeally being a subtype of Tensor. A stray call to something like set_storage_and_dtype would damage a CPUTensor that's been upcast to a Tensor (or provoke a runtime error, if we've put safety measures in place).

ezyang (Contributor, Author) replied:

It is an ergonomics thing (but important ergonomics): with at::cpu, you have to remember to use the CPU-only function (and you have to make sure not to accidentally call it with a non-CPU tensor). CPUTensor ensures that the CPU variant is used, and you only need to prove you have a CPU tensor when you initially construct the CPUTensor.
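
A minimal sketch of that ergonomics argument (CPUTensor does not exist yet, so the wrapper below is hypothetical; at::cpu::add stands in for the generated dispatch-bypassing entry points):

#include <ATen/ATen.h>
#include <ATen/CPUFunctions.h>  // declares the generated at::cpu:: functions

// Hypothetical wrapper: the device check happens once, at construction,
// rather than being the caller's responsibility at every call site.
class CPUTensor {
 public:
  explicit CPUTensor(at::Tensor t) : t_(std::move(t)) {
    TORCH_CHECK(t_.device().is_cpu(), "CPUTensor requires a CPU tensor");
  }
  const at::Tensor& unwrap() const { return t_; }  // "upcast" back to Tensor
 private:
  at::Tensor t_;
};

CPUTensor cpu_add(const CPUTensor& a, const CPUTensor& b) {
  // Skipping dispatch is safe here: the CPUTensor invariant already
  // proves both arguments live on CPU.
  return CPUTensor(at::cpu::add(a.unwrap(), b.unwrap()));
}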

never become structured; to make an analogy, sometimes you have to
write assembly, and as long as it is not too frequent, there is not
too much to be gained from trying to extend the functionality of your
system to expunge these entirely.

A counterargument to consider here I think is that making structured kernels ubiquitous would allow other system capabilities to be defined in terms of them - e.g., shape analysis, memory planning.

The answer wouldn't be to torture everything into having in/func/out variants, but probably to add a handful of other little composite kernel system variations, each with its own codegen and triggering annotation in native_functions.yaml. Factory functions are an obvious fit for this kind of approach, but maybe they're the only good fit? Haven't thought it through any further.

ezyang (Contributor, Author) replied:

Well, this certainly would work. What is less certain is whether the number of operators covered by each system variation would warrant the variation. At least we can defer this design until later...
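
For what it's worth, a small sketch of the payoff this counterargument has in mind: once an op is structured, running it on meta tensors executes only its meta() shape computation, so a shape-analysis pass gets size propagation essentially for free (assuming add is structured, as in this PR):

#include <ATen/ATen.h>

int main() {
  // Meta tensors carry sizes/dtypes/strides but no data.
  auto opts = at::TensorOptions().device(at::kMeta).dtype(at::kFloat);
  at::Tensor a = at::empty({2, 1, 4}, opts);
  at::Tensor b = at::empty({1, 3, 4}, opts);
  at::Tensor c = at::add(a, b);  // no kernel runs; only broadcast/shape logic
  TORCH_CHECK(c.sizes() == at::IntArrayRef({2, 3, 4}));
  TORCH_CHECK(c.is_meta());
  return 0;
}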

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Jan 14, 2021
Summary:
See the structured kernel definition [RFC](pytorch/rfcs#9) for context.

Pull Request resolved: #50189

Reviewed By: mrshenli

Differential Revision: D25903846

Pulled By: soulitzer

fbshipit-source-id: 0059fda9b7d86f596ca35d830562dd4b859293a0
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang and others added 4 commits April 14, 2021 11:42
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Fix a typo and add syntax highlighting to structured kernel rfc