Dispatch-less structured wrapper / composite / alias kernels #50953

Open
ezyang opened this issue Jan 22, 2021 · 22 comments
Labels
module: internals (Related to internal abstractions in c10 and ATen), module: structured kernels (Related to new structured kernels functionality), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Jan 22, 2021

A common pattern in PyTorch is to have two implementations of a function which have different signatures:

- func: upsample_nearest1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor
- func: upsample_nearest1d(Tensor self, int[1] output_size, float? scales=None) -> Tensor

Typically, one of these functions is implemented in terms of the other:

Tensor upsample_nearest1d(
    const Tensor& input,
    c10::optional<IntArrayRef> output_size,
    c10::optional<ArrayRef<double>> scale_factors) {
  auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
  auto scale_w = get_scale_value(scale_factors, 0);
  return at::upsample_nearest1d(input, osize, scale_w);
}

Now, there is a very irritating problem with upsample_nearest1d as it is written here, which is that it necessitates two dispatches: once to the wrapper function (shown), and then once again when we call at::upsample_nearest1d. Alternatively, we could write multiple copies of the wrapper function and bypass the second dispatch (using #49505):

Tensor upsample_nearest1d_cpu(
    const Tensor& input,
    c10::optional<IntArrayRef> output_size,
    c10::optional<ArrayRef<double>> scale_factors) {
  auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
  auto scale_w = get_scale_value(scale_factors, 0);
  return at::cpu::upsample_nearest1d(input, osize, scale_w);
}

But this is irritating, and in the worst case scenario needs to be done per backend (CPU, CUDA) and per variant (out, functional, inplace). Oof!

What you would like to do, instead, is describe how to transform the (functional) input arguments from the wrapper function to the real function, and then automatically generate all of the variants.

I'm a little uncertain about what the parameters of this transformation should be. The easiest way to implement the transformation is to insert C++ code directly into native_functions.yaml, and then generate the multiple copies directly based on this code.

- func: upsample_nearest1d.vec(Tensor input, int[]? output_size, float[]? scale_factors) -> Tensor
  structured_wrapper: |
    auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
    auto scale_w = get_scale_value(scale_factors, 0);
    return upsample_nearest1d(input, osize, scale_w);

This would set a new precedent that it is OK to put C++ code inside native_functions.yaml. Maybe you do not like it, and would like the conversion code to live in C++. Unfortunately, I'm not too sure how to do this: recall that the class hierarchy looks like:

upsample_nearest1d
  +- structured_upsample_nearest1d_cpu
       +- structured_upsample_nearest1d_cpu_out
       +- structured_upsample_nearest1d_cpu_inplace
       +- structured_upsample_nearest1d_cpu_functional
  +- structured_upsample_nearest1d_cuda
       +- structured_upsample_nearest1d_cuda_out
       +- structured_upsample_nearest1d_cuda_inplace
       +- structured_upsample_nearest1d_cuda_functional

There is no logical place to interpose an adapter in the class hierarchy here.

cc @ezyang @bhosmer @smessmer @ljk53 @bdhirsh @ailzhang

@smessmer
Contributor

Getting it into C++ would be easy, but I'm not sure that's a better solution. You'd have to specify the interface they'd have to use. One option would be to ask them to write a struct that holds references to all the arguments and has a method for each target argument:

struct UpsampleNearest1d {
  UpsampleNearest1d(
      const Tensor& input,
      const c10::optional<IntArrayRef>& output_size,
      c10::optional<ArrayRef<double>> scale_factors)
    : input_(input), output_size_(output_size), scale_factors_(scale_factors) {}

  std::vector<int64_t> output_size() const {
    return compute_output_size(input_.sizes(), output_size_, scale_factors_);
  }
  c10::optional<double> scales() const {
    return get_scale_value(scale_factors_, 0);
  }

 private:
  const Tensor& input_;
  const c10::optional<IntArrayRef>& output_size_; // reference makes sure we don't copy it. Likely unnecessary for ArrayRef, but nice for other types
  c10::optional<ArrayRef<double>> scale_factors_;
};
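
A generated per-backend wrapper could then consume the adapter along these lines (just a sketch; the wrapper name is made up):

Tensor upsample_nearest1d_vec_cpu(
    const Tensor& input,
    c10::optional<IntArrayRef> output_size,
    c10::optional<ArrayRef<double>> scale_factors) {
  // Build the adapter once, then forward its transformed arguments to the
  // direct CPU entry point, skipping the dispatcher.
  UpsampleNearest1d args(input, output_size, scale_factors);
  return at::cpu::upsample_nearest1d(input, args.output_size(), args.scales());
}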

@ezyang
Contributor Author

ezyang commented Jan 22, 2021

@smessmer yeah, I'm trying hard not to put the arguments into the struct itself; we know from experience that the compiler is bad at optimizing this case (you're forcing it to actually allocate and construct memory for the arguments)

@glaringlee added the module: internals, module: structured kernels, and triaged labels Jan 25, 2021
@ezyang ezyang changed the title Dispatch-less structured wrapper kernels Dispatch-less structured wrapper / alias kernels Mar 31, 2021
@ezyang
Contributor Author

ezyang commented Mar 31, 2021

A simpler case of this is alias kernels, which don't require any C++ code at all.

@ezyang
Contributor Author

ezyang commented Apr 2, 2021

Another interesting situation is pow_Scalar (maybe not truly related to this issue though):

  if (base.isComplex() && base.toComplexDouble() == 1.0) {
    out.fill_(1);
  } else if (!base.isComplex() && base.toDouble() == 1.0) {
    out.fill_(1);
  } else {
    at::pow_out(const_cast<Tensor&>(out), c10::scalar_to_tensor(base, exp.device()), exp); // redispatch!
  }

@ezyang ezyang changed the title Dispatch-less structured wrapper / alias kernels Dispatch-less structured wrapper / composite / alias kernels May 18, 2021
This was referenced May 18, 2021
@ysiraichi
Collaborator

I was thinking about this issue and wondered: why not turn the device namespaces into structs? Then we could easily replace dispatches with templates and automatic code generation. Take the upsample_nearest1d wrapper above as an example.

Correct me if I'm missing anything, but the problem is the redispatch in its last line: at::upsample_nearest1d(input, osize, scale_w). Now, suppose we had structs, instead of namespaces, for both cpu and cuda (and meta), like so:

struct cpu {
    static Tensor upsample_nearest1d(const Tensor& self, c10::optional<IntArrayRef> output_size, c10::optional<ArrayRef<double>> scale_factors);
};

struct cuda {
    static Tensor upsample_nearest1d(const Tensor& self, c10::optional<IntArrayRef> output_size, c10::optional<ArrayRef<double>> scale_factors);
};

Then, templates could help us solve this problem by making upsample_nearest1d a template function:

template <class DEV>
Tensor upsample_nearest1d(const Tensor& self, c10::optional<IntArrayRef> output_size, c10::optional<ArrayRef<double>> scale_factors) {
  auto osize = compute_output_size(self.sizes(), output_size, scale_factors);
  auto scale_w = get_scale_value(scale_factors, 0);
  return DEV::upsample_nearest1d(self, osize, scale_w);
}

As a plus, this preserves the ability to explicitly call kernels such as at::cuda::upsample_nearest1d. That said, I understand that it would increase compilation time (not sure to what extent, though).

What do you think?

@ezyang
Contributor Author

ezyang commented Jul 19, 2021

This might be a case of "perfect being the enemy of good", but I'm kind of not that keen on a template based approach because (1) templates suck (e.g., you get no typechecking until the template gets instantiated) and (2) there are a bunch of auxiliary issues (such as overloading the meaning of integers, running this code in the Python interpreter, handling out and functional simultaneously) that can't be easily solved with templates. I agree that a template style approach would be relatively easy to implement and would solve the immediate problem.

@ezyang
Contributor Author

ezyang commented Jul 21, 2021

It's worth elaborating on the auxiliary issues. A good start is looking at this post https://dev-discuss.pytorch.org/t/where-we-are-headed-and-why-it-looks-a-lot-like-julia-but-not-exactly-like-julia/276 which lays out at a high level what some of the challenges we're facing are.

If my constraint was ONLY that I wanted to avoid extra dispatch, I think templates would probably be the right way to go (indeed, it's basically the only way to do it). But I also (eventually) want to be able to trace through the code in this setting symbolically, overloading the meaning of all types (not just Tensor, which is the only type we can do in C++). If I write int64_t x in C++, I cannot replace x with a symbolic variable and do abstract interpretation on it. I could template over all of the internal types, but you can see the user experience rapidly getting worse and worse.

What I kind of want, but haven't convinced myself is the right thing to do yet, is build some sort of mini-DSL (Python-like, ofc) for writing composite kernels which can compile to C++, but can also be directly run by the Python interpreter. I don't exactly know how it should work in the terminal state, but I know that at least for small examples it should be feasible to do.

@ezyang
Contributor Author

ezyang commented Feb 24, 2022

Based on discussions with @mruberry, we're nuking the Python mini-DSL, so I think templates are the way to go now.

@ezyang
Contributor Author

ezyang commented Feb 25, 2022

OK, so here's a proposal for how to do this. It's actually a pair of proposals: one that is simple but a bit boilerplatey, and another that takes more advantage of structured kernels.

Non-structured composites. We'll start off with @ysiraichi's proposed template syntax for writing these composites. These will go in headers like aten/src/ATen/native/composite/sub.h. Because these are "just" composites, you'll have to write a separate implementation per overload; e.g., one per sub/sub_out/sub_. We'll just look at sub functional for this example:

// I renamed DEV to OPS in analogy to torch.ops
template <class OPS>
Tensor sub(const Tensor& self, const Tensor& other, Scalar alpha) {
  return OPS::add(self, other, -alpha);
}

Yukio originally suggested that we put the operator namespaces into structs, so we can pass them directly. I don't want to do this, because @peterbell10 has been doing a lot of good work on reducing the amount of recompilation we have to do when unrelated function signatures change, and passing a record with ALL of the operators would go against this goal. But remember, we're in codegen world: so we can just generate a struct on the fly with exactly the functions we need, and trust the inliner to remove the indirections.

This implies we must explicitly list what operators the composite depends on in native_functions.yaml:

# native_functions.yaml syntax

- func: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  composite: sub (add.Tensor)

This will give us the following structs (in separate files) which we can now instantiate the template with:

struct sub_ops_cpu {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::cpu::add(self, other, alpha); }
};

struct sub_ops_cuda {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::cuda::add(self, other, alpha); }
};

struct sub_ops_meta {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::meta::add(self, other, alpha); }
};

struct sub_ops_generic {
  static Tensor add(const Tensor& self, const Tensor& other, Scalar alpha) { return at::add(self, other, alpha); }
};

Each instantiation gets registered to the CPU/CUDA/Meta/CompositeImplicitAutograd keys, respectively. The operators are generated using CppSignatureGroup so they exactly match the corresponding at::cpu/at::cuda/at:: APIs. That's it.
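
For illustration, the generated registrations might look something like this (a sketch only; the TORCH_LIBRARY_IMPL blocks, the at::native placement of the template, and the file layout are my assumptions about what the codegen would emit):

// One registration per generated instantiation of the composite template.
TORCH_LIBRARY_IMPL(aten, CPU, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_cpu>));
}
TORCH_LIBRARY_IMPL(aten, CUDA, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_cuda>));
}
TORCH_LIBRARY_IMPL(aten, Meta, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_meta>));
}
TORCH_LIBRARY_IMPL(aten, CompositeImplicitAutograd, m) {
  m.impl("sub.Tensor", TORCH_FN(at::native::sub<sub_ops_generic>));
}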

@ysiraichi are you interested in implementing any of this?

(structured in next comment)

@ezyang
Contributor Author

ezyang commented Feb 25, 2022

Structured composites. A big downside of the formulation above is you have to write functional/inplace/out composites. So I was wondering what it might look like to mash up this feature with structured kernels.

Here are two approaches that aren't quite right: define only the functional composite, or only the out composite, and try to derive the others from it.

  • The out composite does not work as it will internally call an out function, which will break autograd support if you need CompositeImplicitAutograd to work (this is less of a problem if you have an explicit derivatives.yaml entry).
  • The functional composite is suboptimal: a typical conversion to the out form would be to run the functional version, and then copy_ the result to the final result tensor. This will use one more tensor than an optimal out implementation (which would use out in the tail position).

One idea is to pass in separate structs for the non-tail and tail positions. Sub would look something like this:

template <class OPS, class OUT> // OPS is unused as we don't have any non-tail calls
void sub(OUT& out, const Tensor& self, const Tensor& other, Scalar alpha) {
  out.add(self, other, -alpha);
}

OUT is no longer a struct of static methods; it is an actual object which we will use to pass out the return result if we are functional. Now we have two variants of OUT for functional and out:

struct sub_functional {
  Tensor result_;
  sub_functional() : result_() {}
  void add(const Tensor& self, const Tensor& other, Scalar alpha) {
    result_ = at::add(self, other, alpha);
  }
};

struct sub_inplace {
  sub_inplace() {}
  void add(const Tensor& self, const Tensor& other, Scalar alpha) {
    const_cast<Tensor&>(self).add_(other, alpha);
  }
};

struct sub_out {
  Tensor& out_;
  sub_out(Tensor& out) : out_(out) {}
  void add(const Tensor& self, const Tensor& other, Scalar alpha) {
    at::add_out(out_, self, other, alpha);
  }
};
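
To sketch how the codegen might stitch these together (the wrapper names are made up, and I'm reusing sub_ops_generic from the previous comment as the OPS argument):

Tensor sub_composite(const Tensor& self, const Tensor& other, const Scalar& alpha) {
  sub_functional out;                       // functional: result materialized inside the adapter
  sub<sub_ops_generic>(out, self, other, alpha);
  return std::move(out.result_);
}

Tensor& sub__composite(Tensor& self, const Tensor& other, const Scalar& alpha) {
  sub_inplace out;                          // inplace: the tail call mutates self
  sub<sub_ops_generic>(out, self, other, alpha);
  return self;
}

Tensor& sub_out_composite(const Tensor& self, const Tensor& other, const Scalar& alpha, Tensor& result) {
  sub_out out(result);                      // out: the tail call writes into the provided tensor
  sub<sub_ops_generic>(out, self, other, alpha);
  return result;
}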

The generated inplace may not be optimal; if there are multiple ops involved, it might have been better to do multiple in-place operations, but we don't consider this for now.

One downside to this proposal is that the resulting kernels are not "really" structured kernels; e.g., if you want to write a traditional structured kernel, you still need to write a TORCH_META_FUNC, since we cannot derive it from the composite. (Why not? Well, we could write a basic one, but meta funcs can also compute intermediate values and set up auxiliary structs like TensorIterator, and it's not clear whether you would have wanted those things for your traditional structured kernel.)

Another possibility for structured kernels is to make it possible to directly call TORCH_IMPL_FUNCs. This can be "simulated" with DispatchStub, which is how I implemented sub in terms of add in #65851

TORCH_IMPL_FUNC(sub_out) (
  const Tensor& self, const Tensor& other, const Scalar& alpha, const Tensor& result
) {
  add_stub(device_type(), *this, -alpha);
  TORCH_INTERNAL_ASSERT(result.scalar_type() == output().dtype());
}

Instead of requiring a stub, we could instead let a structured kernel inherit from the other structured kernel that it wants to invoke:

TORCH_IMPL_FUNC(sub_out) (
  const Tensor& self, const Tensor& other, const Scalar& alpha, const Tensor& result
) {
  TORCH_IMPL_FUNC_NAME(add_out)(self, other, -alpha, result);
}

requiring you to be intimately familiar with how the target structured kernel is written, but maybe that is not too much to ask. This formulation works best if you only want to call one function, since if you want to call multiple (a true composite) we would need to once again template this function over OPS to let you directly call other implementations without routing through the dispatcher.

@bdhirsh
Contributor

bdhirsh commented Feb 25, 2022

This formulation works best if you only want to call one function, since if you want to call multiple (a true composite) we would need to once again template this function over OPS to let you directly call other implementations without routing through the dispatcher.

It seems useful to try to figure out the "multiple ops decomposition" case (if it's not too hard), since there are probably a lot more aten ops that fit that pattern.

This implies we must explicitly list what operators the composite depends on in native_functions.yaml

I guess this would generalize to multiple ops pretty easily (you tell the yaml up front what all of the ops are in the decomposition, and we fill the struct with all of their dispatcher-less implementations).
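
For instance, a hypothetical composite that listed both mul.Tensor and add.Tensor as dependencies would just get a struct with one static method per listed op (sketch only; the op and struct names are made up):

struct addcmul_ops_cpu {
  static Tensor mul(const Tensor& self, const Tensor& other) {
    return at::cpu::mul(self, other);
  }
  static Tensor add(const Tensor& self, const Tensor& other, const Scalar& alpha) {
    return at::cpu::add(self, other, alpha);
  }
};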

That could be kind of cumbersome, but one really nice benefit is that it would start to give us a real source of truth for aten decompositions. If we ended up doing that, it would be significantly easier to tell people exactly what "the set of primitive aten ops that your backend/functionality hasn't implemented yet" is for a given model.

@ysiraichi
Collaborator

are you interested in implementing any of this?

Definitely!
I will go over the proposals next week.

bdhirsh added a commit that referenced this issue Jun 16, 2022
Fix #50953

This PR introduces changes in the codegen for generating dispatch-less composite
kernels. Summarizing, the idea is to make use of templates as the namespace source for
tensor operations. Then, we only have to generate a struct (the namespace) with the
required operations.

Here's a summary with the main changes in this PR:

- `RegisterDispatchKey.cpp`
    - Add `composite_headers` as code template variable
- `model.py`
    - Add `composite` as a key in `native_functions.yaml`
        - Set of `OperatorName`, indicating the dependent operations
    - Add `CompositeGraph` type alias
    - Make `BackendIndex.get_kernel` return a new `BackendMetadata` for generated
      dispatch-less composite kernels
    - Add `BackendIndex.has_registered_kernel` to check whether a given tuple of
      `NativeFunction` and `NativeFunctionsGroup` is registered to the given dispatch key
        - Shortcut for dealing with both structured and unstructured kernels
    - Add methods for generating the struct and kernel name of the generated dispatch-less
      composite kernel
        - Needed for returning the `BackendMetadata`
- `gen.py`
    - Build the composite graph with `get_composite_graph`
        - Representation of dependency for dispatch-less composite kernels
        - Mapping from `OperatorName` to a list of tuples (one for each dependent kernel)
          of `NativeFunction` and `NativeFunctionsGroup`
    - Collect a set of `#include <ATen/native/composite/op.h>` headers with
      `get_composite_headers`
- `register_dispatch_key.py`
    - Add `composite_graph` as a field of `RegisterDispatchKey` class
    - Generate the struct for that dispatch key
- `native_functions.py`
    - Skip the generation of `op_native.h` header for dispatch-less composite kernels
    - They are already defined in their respective `ATen/native/composite/op.h` header

ghstack-source-id: 1bec053b9c4af9c568a8ec4feb42fb8ac10ed925
Pull Request resolved: #77484
pytorchmergebot pushed a commit that referenced this issue Aug 11, 2022
Fix #50953

Differential Revision: [D36934643](https://our.internmc.facebook.com/intern/diff/D36934643)
Pull Request resolved: #77484
Approved by: https://github.com/bdhirsh
@ysiraichi
Collaborator

ysiraichi commented Aug 15, 2022

@bdhirsh @ezyang
After some discussion with @lezcano and @peterbell10 (correct me if I got anything wrong), we found another way to implement dispatch-less kernels that does not require specifying the dependent operations of each dispatch-less kernel.

  1. Instead of creating one structure per dispatch-less kernel, create one templated function for each operation. Such a templated function should have one template parameter of a device enum type

  2. Dispatch-less calls are generated through template specialization

namespace dispatchless {

template <DeviceEnum DEV> // (1)
Tensor add(Tensor self, Tensor other, Scalar alpha) {
  return at::add(self, other, alpha); // fallback is the dispatcher call.
}

// (2)
template <>
Tensor add<DeviceEnum::CPU>(Tensor self, Tensor other, Scalar alpha) {
  return at::cpu::add(self, other, alpha); // dispatch-less call to the CPU kernel.
}

template <>
Tensor add<DeviceEnum::Meta>(Tensor self, Tensor other, Scalar alpha) {
  return at::meta::add(self, other, alpha); // dispatch-less call to the Meta kernel.
}

}
  3. Composite kernels should, similarly, be templated functions, so that they are able to pick the right implementation for a device

  4. Composite kernels may use these generated specializations arbitrarily, without the need to specify anything extra in native_functions.yaml

  5. If the composite kernel implementation lives in a .cpp file, we also have to explicitly instantiate the templates for the different devices (a macro would make things easier; see the sketch after this list)

namespace native {

// (3)
template <DeviceEnum DEV>
Tensor add_one(Tensor self) {
  return at::dispatchless::add<DEV>(self, at::dispatchless::ones<DEV>(self.sizes(), self.options())); // (4)
}

// (5)
template Tensor add_one<DeviceEnum::CPU>(Tensor self);
template Tensor add_one<DeviceEnum::CUDA>(Tensor self);
template Tensor add_one<DeviceEnum::Meta>(Tensor self);

}
  6. This could be used for other things:
    6.1. Making other intermediary functions dispatch-less
    6.2. Specializing functions that depend on the device
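For the explicit instantiations in item 5, a small macro could cut the boilerplate. This is just a sketch on top of the hypothetical DeviceEnum above; neither the macro nor DeviceEnum exists in ATen today:

// Hypothetical helper: explicitly instantiates a templated composite kernel
// for every device we want a dispatch-less path for.
#define INSTANTIATE_DISPATCHLESS(ret, fn, ...)     \
  template ret fn<DeviceEnum::CPU>(__VA_ARGS__);   \
  template ret fn<DeviceEnum::CUDA>(__VA_ARGS__);  \
  template ret fn<DeviceEnum::Meta>(__VA_ARGS__);

// Replaces the three hand-written instantiations of add_one above:
INSTANTIATE_DISPATCHLESS(Tensor, add_one, Tensor self)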

Let me know what you think.
Is this approach something we are interested in?

Edit: we would also need to tweak the codegen a bit to accept templated functions as native kernels.

bdhirsh added a commit that referenced this issue Aug 15, 2022
@bdhirsh
Contributor

bdhirsh commented Aug 15, 2022

Side note - sorry for the delay in landing the existing version of the PR. The internal failures should be finally cleaned up.

On the suggestion: not having to manually specify what ops are in the decomposition seems better, especially since it doesn't look like native_functions.yaml would be the source of truth for decomp info in the long run (lots of our decomps are getting written in Python).

The extra template boilerplate seems minimal (especially since we can macro-ify the template instantiations like you mentioned), so I agree this feels net better. Curious what Ed thinks though

@ezyang
Contributor Author

ezyang commented Aug 15, 2022

If we're pivoting off of the original plan, I'd like to quash this issue and the associated PRs entirely. @ysiraichi, as I've discussed with you, incremental improvements to the C++ implementations are really not aligned with the current direction (which is PrimTorch implementations in Python, with an overhead reduction backend). I would rather we spend our time and effort on the overhead reduction backend for dynamo instead.

@ysiraichi
Collaborator

@bdhirsh

...not having to manually specify what ops are in the decomposition seems better, especially since it doesn't look like native_functions.yaml would be the source of truth for decomp info in the long run

Yes! I would argue that those decompositions serve a different purpose + are not to be used by anyone but the codegen (i.e. no one should look at them).


@ezyang

incremental improvements on the C++ implementations is really not aligned with the current direction

Got it. That idea came up in my dispatch-less kernels presentation. I just thought it was an interesting way to avoid declaring dependent operations in native_functions.yaml.


I agree with Ed that we should "spend our time and effort on the overhead reduction backend for dynamo instead". So, I would say we can leave this idea as a possible improvement for dispatch-less kernels if it's needed later, i.e. I will stay focused on the overhead reduction backend for dynamo and, if necessary, we can come back to this afterwards.
