Dispatch-less structured wrapper / composite / alias kernels #50953

Open
ezyang opened this issue Jan 22, 2021 · 22 comments
Labels
module: internals (Related to internal abstractions in c10 and ATen), module: structured kernels (Related to new structured kernels functionality), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Jan 22, 2021

A common pattern in PyTorch is to have two implementations of a function which have different signatures:

- func: upsample_nearest1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor
- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor

Typically, one of these functions is implemented in terms of the other:

Tensor upsample_nearest1d(
    const Tensor& input,
    c10::optional<IntArrayRef> output_size,
    c10::optional<ArrayRef<double>> scale_factors) {
  auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
  auto scale_w = get_scale_value(scale_factors, 0);
  return at::upsample_nearest1d(input, osize, scale_w);
}

Now, there is a very irritating problem with upsample_nearest1d as it is written here, which is that it necessitates two dispatches: once to the wrapper function (shown), and then once again when we call at::upsample_nearest1d. Alternatively, we could write multiple copies of the wrapper function and bypass the second dispatch (using #49505):

Tensor upsample_nearest1d_cpu(
    const Tensor& input,
    c10::optional<IntArrayRef> output_size,
    c10::optional<ArrayRef<double>> scale_factors) {
  auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
  auto scale_w = get_scale_value(scale_factors, 0);
  return at::cpu::upsample_nearest1d(input, osize, scale_w);
}

But this is irritating, and in the worst case scenario needs to be done per backend (CPU, CUDA) and per variant (out, functional, inplace). Oof!

What you would like to do, instead, is describe how to transform the (functional) input arguments from the wrapper function to the real function, and then automatically generate all of the variants.

I'm a little uncertain about what the parameters of this transformation should be. The easiest way to implement the transformation is to insert C++ code directly into native_functions.yaml, and then generate the multiple copies directly based on this code.

- func: upsample_nearest1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor
  structured_wrapper: |
    auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
    auto scale_w = get_scale_value(scale_factors, 0);
    return upsample_nearest1d(input, osize, scale_w);

This would set a new precedent that it is OK to put C++ code inside native_functions.yaml. Maybe you do not like it, and would like the conversion code to live in C++. Unfortunately, I'm not too sure how to do this: recall that the class hierarchy looks like:

upsample_nearest1d
  +- structured_upsample_nearest1d_cpu
       +- structured_upsample_nearest1d_cpu_out
       +- structured_upsample_nearest1d_cpu_inplace
       +- structured_upsample_nearest1d_cpu_functional
  +- structured_upsample_nearest1d_cuda
       +- structured_upsample_nearest1d_cuda_out
       +- structured_upsample_nearest1d_cuda_inplace
       +- structured_upsample_nearest1d_cuda_functional

There is no logical place to interpose an adapter in the class hierarchy here.

cc @ezyang @bhosmer @smessmer @ljk53 @bdhirsh @ailzhang

@smessmer
Contributor

Getting it into C++ would be easy, but I'm not sure that's a better solution. You'd have to specify the interface they'd have to use. One option would be to ask them to write a struct that holds references to all the arguments and has a method for each target argument:

struct UpsampleNearest1d {
  UpsampleNearest1d(
      const Tensor& input,
      const c10::optional<IntArrayRef>& output_size,
      c10::optional<ArrayRef<double>> scale_factors)
    : input_(input), output_size_(output_size), scale_factors_(scale_factors) {}

  std::vector<int64_t> output_size() const {
    return compute_output_size(input_.sizes(), output_size_, scale_factors_);
  }
  c10::optional<double> scales() const {
    return get_scale_value(scale_factors_, 0);
  }

 private:
  const Tensor& input_;
  const c10::optional<IntArrayRef>& output_size_; // reference makes sure we don't copy it. Likely unnecessary for ArrayRef, but nice for other types
  c10::optional<ArrayRef<double>> scale_factors_;
};
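
A generated per-backend wrapper could then consume the adapter along these lines (just a sketch; the wrapper name is made up):

Tensor upsample_nearest1d_vec_cpu(
    const Tensor& input,
    c10::optional<IntArrayRef> output_size,
    c10::optional<ArrayRef<double>> scale_factors) {
  // Build the adapter once, then forward its transformed arguments to the
  // direct CPU entry point, skipping the dispatcher.
  UpsampleNearest1d args(input, output_size, scale_factors);
  return at::cpu::upsample_nearest1d(input, args.output_size(), args.scales());
}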

@ezyang
Contributor Author

ezyang commented Jan 22, 2021

@smessmer yeah, I'm trying hard not to put the arguments into the struct itself; we know from experience that the compiler is bad at optimizing this case (you're forcing it to actually allocate and construct memory for the arguments)

@glaringlee added the module: internals, module: structured kernels, and triaged labels Jan 25, 2021
@ezyang ezyang changed the title Dispatch-less structured wrapper kernels Dispatch-less structured wrapper / alias kernels Mar 31, 2021
@ezyang
Contributor Author

ezyang commented Mar 31, 2021

A simpler case of this is alias kernels, which don't require any C++ code at all.

@ezyang
Contributor Author

ezyang commented Apr 2, 2021

Another interesting situation is pow_Scalar (maybe not truly related to this issue though):

  if (base.isComplex() && base.toComplexDouble() == 1.0) {
    out.fill_(1);
  } else if (!base.isComplex() && base.toDouble() == 1.0) {
    out.fill_(1);
  } else {
    at::pow_out(const_cast<Tensor&>(out), c10::scalar_to_tensor(base, exp.device()), exp); // redispatch!
  }

@ezyang ezyang changed the title Dispatch-less structured wrapper / alias kernels Dispatch-less structured wrapper / composite / alias kernels May 18, 2021
This was referenced May 18, 2021
@ysiraichi
Collaborator

I was thinking about this issue and wondered: why not turn the device namespaces into structs? Then we could easily replace dispatches with templates and automatic code generation. Take the upsample_nearest1d wrapper above as an example.

Correct me if I'm missing anything, but the problem is the redispatch in its last line: at::upsample_nearest1d(input, osize, scale_w). Now, suppose we had structs, instead of namespaces, for both cpu and cuda (and meta), like so:

struct cpu {
    static Tensor upsample_nearest1d(const Tensor& self, c10::optional<IntArrayRef> output_size, c10::optional<ArrayRef<double>> scale_factors);
};

struct cuda {
    static Tensor upsample_nearest1d(const Tensor& self, c10::optional<IntArrayRef> output_size, c10::optional<ArrayRef<double>> scale_factors);
};

Then, templates could help us solve this problem by making upsample_nearest1d a template function:

template <class DEV>
Tensor upsample_nearest1d(const Tensor& self, c10::optional<IntArrayRef> output_size, c10::optional<ArrayRef<double>> scale_factors) {
  auto osize = compute_output_size(self.sizes(), output_size, scale_factors);
  auto scale_w = get_scale_value(scale_factors, 0);
  return DEV::upsample_nearest1d(self, osize, scale_w);
}

As a plus, this preserves the ability to explicitly call kernels such as at::cuda::upsample_nearest1d. That said, I understand that it would increase compilation time (not sure to what extent, though).

What do you think?

@ezyang
Contributor Author

ezyang commented Jul 19, 2021

This might be a case of "perfect being the enemy of good", but I'm kind of not that keen on a template based approach because (1) templates suck (e.g., you get no typechecking until the template gets instantiated) and (2) there are a bunch of auxiliary issues (such as overloading the meaning of integers, running this code in the Python interpreter, handling out and functional simultaneously) that can't be easily solved with templates. I agree that a template style approach would be relatively easy to implement and would solve the immediate problem.

@ezyang
Contributor Author

ezyang commented Jul 21, 2021

It's worth elaborating on the auxiliary issues. A good start is looking at this post https://dev-discuss.pytorch.org/t/where-we-are-headed-and-why-it-looks-a-lot-like-julia-but-not-exactly-like-julia/276 which lays out at a high level what some of the challenges we're facing are.

If my constraint was ONLY that I wanted to avoid extra dispatch, I think templates would probably be the right way to go (indeed, it's basically the only way to do it). But I also (eventually) want to be able to trace through the code in this setting symbolically, overloading the meaning of all types (not just Tensor, which is the only type we can do in C++). If I write int64_t x in C++, I cannot replace x with a symbolic variable and do abstract interpretation on it. I could template over all of the internal types, but you can see the user experience rapidly getting worse and worse.

What I kind of want, but haven't convinced myself is the right thing to do yet, is build some sort of mini-DSL (Python-like, ofc) for writing composite kernels which can compile to C++, but can also be directly run by the Python interpreter. I don't exactly know how it should work in the terminal state, but I know that at least for small examples it should be feasible to do.

@ezyang
Contributor Author

ezyang commented Feb 24, 2022

Based on discussions with @mruberry, we're nuking the Python mini-DSL, so I think templates are the way to go now.

@ezyang
Contributor Author

ezyang commented Feb 25, 2022

OK, so here's a proposal for how to do this. It's actually a pair of proposals: one that is simple but a bit boilerplatey, and another that takes more advantage of structured kernels.

Non-structured composites. We'll start off with @ysiraichi's proposed template syntax for writing these composites. These will go in headers like aten/src/ATen/native/composite/sub.h. Because these are "just" composites, you'll have to write a separate implementation per overload; e.g., one per sub/sub_out/sub_. We'll just look at sub functional for this example:

// I renamed DEV to OPS in analogy to torch.ops
template <class OPS>
Tensor sub(const Tensor& self, const Tensor& other, Scalar alpha) {
  return OPS::add(self, other, -alpha);
}

Yukio originally suggested that we put the operator namespaces into structs, so we can pass them directly. I don't want to do this, because @peterbell10 has been doing a lot of good work on reducing the amount of recompilation we have to do when unrelated function signatures change, and passing a record with ALL of the operators would go against this goal. But remember, we're in codegen world: so we can just generate a struct on the fly with exactly the functions we need, and trust the inliner to remove the indirections.

This implies we must explicitly list what operators the composite depends on in native_functions.yaml:

# native_functions.yaml syntax

- func: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  composite: sub (add.Tensor)

This will give us the following structs (in separate files) which we can now instantiate the template with:

struct sub_ops_cpu {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::cpu::add(self, other, alpha); }
};

struct sub_ops_cuda {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::cuda::add(self, other, alpha); }
};

struct sub_ops_meta {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::meta::add(self, other, alpha); }
};

struct sub_ops_generic {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::add(self, other, alpha); }
};

Each instantiation gets registered to the CPU/CUDA/Meta/CompositeImplicitAutograd keys, respectively. The operators are generated using CppSignatureGroup so they exactly match the corresponding at::cpu/at::cuda/at:: APIs. That's it.
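
For illustration, the generated registrations might look something like this (a sketch only; the TORCH_LIBRARY_IMPL blocks, the at::native placement of the template, and the file layout are my assumptions about what the codegen would emit):

// One registration per generated instantiation of the composite template.
TORCH_LIBRARY_IMPL(aten, CPU, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_cpu>));
}
TORCH_LIBRARY_IMPL(aten, CUDA, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_cuda>));
}
TORCH_LIBRARY_IMPL(aten, Meta, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_meta>));
}
TORCH_LIBRARY_IMPL(aten, CompositeImplicitAutograd, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_generic>));
}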

@ysiraichi are you interested in implementing any of this?

(structured in next comment)

@ezyang
Contributor Author

ezyang commented Feb 25, 2022

Structured composites. A big downside of the formulation above is you have to write functional/inplace/out composites. So I was wondering what it might look like to mash up this feature with structured kernels.

Here are two approaches that aren't quite right: define only the functional composite, or only the out composite, and try to derive the others from it.

  • The out composite does not work as it will internally call an out function, which will break autograd support if you need CompositeImplicitAutograd to work (this is less of a problem if you have an explicit derivatives.yaml entry).
  • The functional composite is suboptimal: a typical conversion to the out form would be to run the functional version, and then copy_ the result to the final result tensor. This will use one more tensor than an optimal out implementation (which would use out in the tail position).

One idea is to pass in separate structs for the non-tail and tail positions. Sub would look something like this:

template <class OPS, class OUT> // OPS is unused as we don't have any non-tail calls
void sub(OUT& out, const Tensor& self, const Tensor& other, Scalar alpha) {
  out.add(self, other, -alpha);
}

OUT is no longer a struct of static methods; it is an actual object which we will use to pass out the return result if we are functional. Now we have two variants of OUT for functional and out:

struct sub_functional {
  Tensor result_;
  sub_functional() : result_() {}
  void add(const Tensor& self, const Tensor& other, Scalar alpha) {
    result_ = at::add(self, other, alpha);
  }
};

struct sub_inplace {
  sub_inplace() {}
  void add(const Tensor& self, const Tensor& other, Scalar alpha) {
    const_cast<Tensor&>(self).add_(other, alpha);
  }
};

struct sub_out {
  Tensor& out_;
  sub_out(Tensor& out) : out_(out) {}
  void add(const Tensor& self, const Tensor& other, Scalar alpha) {
    at::add_out(out_, self, other, alpha);
  }
};
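
To sketch how the codegen might stitch these together (the wrapper names are made up, and I'm reusing sub_ops_generic from the previous comment as the OPS argument):

Tensor sub_composite(const Tensor& self, const Tensor& other, const Scalar& alpha) {
  sub_functional out;                       // functional: result materialized inside the adapter
  sub<sub_ops_generic>(out, self, other, alpha);
  return std::move(out.result_);
}

Tensor& sub__composite(Tensor& self, const Tensor& other, const Scalar& alpha) {
  sub_inplace out;                          // inplace: the tail call mutates self
  sub<sub_ops_generic>(out, self, other, alpha);
  return self;
}

Tensor& sub_out_composite(const Tensor& self, const Tensor& other, const Scalar& alpha, Tensor& result) {
  sub_out out(result);                      // out: the tail call writes into the provided tensor
  sub<sub_ops_generic>(out, self, other, alpha);
  return result;
}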

The generated inplace may not be optimal; if there are multiple ops involved, it might have been better to do multiple in-place operations, but we don't consider this for now.

One downside to this proposal is that the resulting kernels are not "really" structured kernels; e.g., if you want to write a traditional structured kernel, you still need to write a TORCH_META_FUNC, since we cannot derive it from the composite. (Why not? Well, we could write a basic one, but meta funcs can also compute intermediate values and set up auxiliary structs like TensorIterator, and it's not clear whether you would have wanted those things for your traditional structured kernel.)

Another possibility for structured kernels is to make it possible to directly call TORCH_IMPL_FUNCs. This can be "simulated" with DispatchStub, which is how I implemented sub in terms of add in #65851

TORCH_IMPL_FUNC(sub_out) (
  const Tensor& self, const Tensor& other, const Scalar& alpha, const Tensor& result
) {
  add_stub(device_type(), *this, -alpha);
  TORCH_INTERNAL_ASSERT(result.scalar_type() == output().dtype());
}

Instead of requiring a stub, we could instead let a structured kernel inherit from the other structured kernel that it wants to invoke:

TORCH_IMPL_FUNC(sub_out) (
  const Tensor& self, const Tensor& other, const Scalar& alpha, const Tensor& result
) {
  TORCH_IMPL_FUNC_NAME(add_out)(self, other, -alpha, result);
}

requiring you to be intimately familiar with how the target structured kernel is written, but maybe that is not too much to ask. This formulation works best if you only want to call one function, since if you want to call multiple (a true composite) we would need to once again template this function over OPS to let you directly call other implementations without routing through the dispatcher.

@bdhirsh
Contributor

bdhirsh commented Feb 25, 2022

This formulation works best if you only want to call one function, since if you want to call multiple (a true composite) we would need to once again template this function over OPS to let you directly call other implementations without routing through the dispatcher.

It seems useful to try to figure out the "multiple ops decomposition" case (if it's not too hard), since there are probably a lot more aten ops that fit that pattern.

This implies we must explicitly list what operators the composite depends on in native_functions.yaml

I guess this would generalize to multiple ops pretty easily (you tell the yaml up front what all of the ops are in the decomposition, and we fill the struct with all of their dispatcher-less implementations).
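
For instance, a hypothetical composite that listed both mul.Tensor and add.Tensor as dependencies would just get a struct with one static method per listed op (sketch only; the op and struct names are made up):

struct addcmul_ops_cpu {
  static Tensor mul(const Tensor& self, const Tensor& other) {
    return at::cpu::mul(self, other);
  }
  static Tensor add(const Tensor& self, const Tensor& other, const Scalar& alpha) {
    return at::cpu::add(self, other, alpha);
  }
};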

That could be kind of cumbersome, but one really nice benefit is that it would start to give us a real source of truth for aten decompositions. If we ended up doing that, it would be significantly easier to tell people exactly what "the set of primitive aten ops that your backend/functionality hasn't implemented yet" is for a given model.

@ysiraichi
Collaborator

are you interested in implementing any of this?

Definitely!
I will go over the proposals next week.

bdhirsh added a commit that referenced this issue Jun 16, 2022
Fix #50953

This PR introduces changes in the codegen for generating dispatch-less composite
kernels. Summarizing, the idea is to make use of templates as the namespace source for
tensor operations. Then, we only have to generate a struct (the namespace) with the
required operations.

Here's a summary with the main changes in this PR:

- `RegisterDispatchKey.cpp`
    - Add `composite_headers` as code template variable
- `model.py`
    - Add `composite` as a key in `native_functions.yaml`
        - Set of `OperatorName`, indicating the dependent operations
    - Add `CompositeGraph` type alias
    - Make `BackendIndex.get_kernel` return a new `BackendMetadata` for generated
      dispatch-less composite kernels
    - Add `BackendIndex.has_registered_kernel` to check whether a given tuple of
      `NativeFunction` and `NativeFunctionsGroup` is registered to the given dispatch key
        - Shortcut for dealing with both structured and unstructured kernels
    - Add methods for generating the struct and kernel name of the generated dispatch-less
      composite kernel
        - Needed for returning the `BackendMetadata`
- `gen.py`
    - Build the composite graph with `get_composite_graph`
        - Representation of dependency for dispatch-less composite kernels
        - Mapping from `OperatorName` to a list of tuples (one for each dependent kernel)
          of `NativeFunction` and `NativeFunctionsGroup`
    - Collect a set of `#include <ATen/native/composite/op.h>` headers with
      `get_composite_headers`
- `register_dispatch_key.py`
    - Add `composite_graph` as a field of `RegisterDispatchKey` class
    - Generate the struct for that dispatch key
- `native_functions.py`
    - Skip the generation of `op_native.h` header for dispatch-less composite kernels
    - They are already defined in their respective `ATen/native/composite/op.h` header

ghstack-source-id: 1bec053b9c4af9c568a8ec4feb42fb8ac10ed925
Pull Request resolved: #77484
pytorchmergebot pushed a commit that referenced this issue Aug 11, 2022
Fix #50953

Differential Revision: [D36934643](https://our.internmc.facebook.com/intern/diff/D36934643)
Pull Request resolved: #77484
Approved by: https://github.com/bdhirsh
@ysiraichi
Collaborator

ysiraichi commented Aug 15, 2022

@bdhirsh @ezyang
After some discussion with @lezcano and @peterbell10 (correct me if I got anything wrong), we found another way to implement dispatch-less kernels that does not require specifying the dependent operations of each dispatch-less kernel.

  1. Instead of creating one structure per dispatch-less kernel, create one templated function for each operation. Such a templated function should have one template parameter of a device enum type

  2. Dispatch-less calls are generated through template specialization

namespace dispatchless {

template <DeviceEnum DEV> // (1)
Tensor add(Tensor self, Tensor other, Scalar alpha) {
  return at::add(self, other, alpha); // fallback is the dispatcher call.
}

// (2)
template <>
Tensor add<DeviceEnum::CPU>(Tensor self, Tensor other, Scalar alpha) {
  return at::cpu::add(self, other, alpha); // dispatch-less call to the CPU kernel.
}

template <>
Tensor add<DeviceEnum::Meta>(Tensor self, Tensor other, Scalar alpha) {
  return at::meta::add(self, other, alpha); // dispatch-less call to the Meta kernel.
}

}
  3. Composite kernels should, similarly, be templated functions, so that they are able to pick the right implementation for a device

  4. Composite kernels may use these generated specializations arbitrarily, without the need to specify anything extra in native_functions.yaml

  5. If the composite kernel implementation lives in a .cpp file, we also have to explicitly instantiate the templates for the different devices (a macro would make things easier; see the sketch after this list)

namespace native {

// (3)
template <DeviceEnum DEV>
Tensor add_one(Tensor self) {
  return at::dispatchless::add<DEV>(self, at::dispatchless::ones<DEV>(self.sizes(), self.options())); // (4)
}

// (5)
template Tensor add_one<DeviceEnum::CPU>(Tensor self);
template Tensor add_one<DeviceEnum::CUDA>(Tensor self);
template Tensor add_one<DeviceEnum::Meta>(Tensor self);

}
  6. This could be used for other things:
    6.1. Making other intermediary functions dispatch-less
    6.2. Specializing functions that depend on the device
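For the explicit instantiations in item 5, a small macro could cut the boilerplate. This is just a sketch on top of the hypothetical DeviceEnum above; neither the macro nor DeviceEnum exists in ATen today:

// Hypothetical helper: explicitly instantiates a templated composite kernel
// for every device we want a dispatch-less path for.
#define INSTANTIATE_DISPATCHLESS(ret, fn, ...)     \
  template ret fn<DeviceEnum::CPU>(__VA_ARGS__);   \
  template ret fn<DeviceEnum::CUDA>(__VA_ARGS__);  \
  template ret fn<DeviceEnum::Meta>(__VA_ARGS__);

// Replaces the three hand-written instantiations of add_one above:
INSTANTIATE_DISPATCHLESS(Tensor, add_one, Tensor self)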

Let me know what you think.
Is this approach something we are interested in?

Edit: we would also need to tweak the codegen a bit to accept templated functions as native kernels.

bdhirsh added a commit that referenced this issue Aug 15, 2022
@bdhirsh
Contributor

bdhirsh commented Aug 15, 2022

Side note - sorry for the delay in landing the existing version of the PR. The internal failures should be finally cleaned up.

On the suggestion: not having to manually specify what ops are in the decomposition seems better, especially since it doesn't look like native_functions.yaml would be the source of truth for decomp info in the long run (lots of our decomps are getting written in Python).

The extra template boilerplate seems minimal (especially since we can macro-ify the template instantiations like you mentioned), so I agree this feels net better. Curious what Ed thinks though

@ezyang
Contributor Author

ezyang commented Aug 15, 2022

If we're pivoting off of the original plan, I'd like to quash this issue and the associated PRs entirely. @ysiraichi, as I've discussed with you, incremental improvements to the C++ implementations are really not aligned with the current direction (which is PrimTorch implementations in Python, with an overhead reduction backend). I would rather we spend our time and effort on the overhead reduction backend for dynamo instead.

@ysiraichi
Collaborator

@bdhirsh

...not having to manually specify what ops are in the decomposition seems better, especially since it doesn't look like native_functions.yaml would be the source of truth for decomp info in the long run

Yes! I would argue that those decompositions serve a different purpose + are not to be used by anyone but the codegen (i.e. no one should look at them).


@ezyang

incremental improvements on the C++ implementations is really not aligned with the current direction

Got it. That idea came up in my dispatch-less kernels presentation. I just thought it was an interesting way to avoid declaring dependent operations in native_functions.yaml.


I agree with Ed that we should "spend our time and effort on the overhead reduction backend for dynamo instead". So, I would say we can leave this idea as a possible improvement for dispatch-less kernels if it's needed later, i.e. I will stay focused on the overhead reduction backend for dynamo and, if necessary, we can come back to this afterwards.
