So you want to write a new operator or a new kernel for an existing operator. How do you do that and what API should you use? So glad you asked.
All operators that are part of the public API of PyTorch are defined in native_functions.yaml. Just add an entry there and write the corresponding C++ kernel function. It’s very easy, and there is a good introduction at https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/README.md.
There are four main use cases:
- You’re writing a new operator that isn’t supposed to be part of the public PyTorch API.
- You’re writing a new operator but don’t want to change the core PyTorch code base, say because you’re developing a shared library with operators.
- You’re writing a C++ extension for PyTorch, or you’re using inline C++ in your .py model files.
- You’re writing a backend library like XLA or MAIA that adds new kernels to all operators defined in native_functions.yaml.
For these use cases, the custom operator API is the better solution.
If you’re just using the custom operator API to add new kernels for existing operators (e.g. the XLA/MAIA example above), then you’re fine and don’t pay any price. If, however, you define a new operator purely using the custom op API, i.e. your operator never shows up in native_functions.yaml, then you need to be aware of a few caveats.
- It will not get a generated C++ API. There will not be Tensor::your_op() methods or at::your_op() functions to call your operator.
- The API for calling the operator from Python looks a little bit different. It needs to be called through torch.ops.your_op() instead of torch._C.
- Setting up autograd for custom operators is harder. You don’t get it automatically but need to use torch::autograd::Function to implement autograd (example). Note also that torch::autograd::Function does not work together with dispatch yet, so if you have different kernels for different backends (say CPU and CUDA), you need to manually write if/else statements for that.
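For illustration, the autograd caveat can be sketched roughly as follows. This is a hedged sketch, not the official recipe: MyOp, my_kernel_cpu and my_kernel_cuda are hypothetical names, the backward pass is a placeholder rather than a real derivative, and the exact torch::autograd::Function API may differ between PyTorch versions.

```cpp
// Sketch: wrapping backend-specific kernels in an autograd function.
// my_kernel_cpu / my_kernel_cuda are assumed to be defined elsewhere.
struct MyOp : public torch::autograd::Function<MyOp> {
  static Tensor forward(torch::autograd::AutogradContext* ctx,
                        Tensor a, Tensor b) {
    ctx->save_for_backward({a, b});
    // torch::autograd::Function does not dispatch by backend yet,
    // so we branch manually:
    if (a.is_cuda()) {
      return my_kernel_cuda(a, b);
    } else {
      return my_kernel_cpu(a, b);
    }
  }

  static torch::autograd::variable_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::variable_list grad_output) {
    auto saved = ctx->get_saved_variables();
    // Placeholder: compute and return one gradient per forward input here.
    return {grad_output[0], grad_output[0]};
  }
};
```

Callers would then invoke MyOp::apply(a, b) instead of calling the raw kernels directly.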
So, you’ve read all above but still want to use the custom operator API? Great. Here’s how you do it.
There are two ways you can write kernels for a PyTorch operator: as functions or as lambdas.
This is probably the simplest way to write an operator. Just write a kernel function and register it with the PyTorch operator library.
namespace {
Tensor my_kernel_cpu(const Tensor& a, const Tensor& b) {...}
}
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op", torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()));
It is recommended to put your kernel into an anonymous namespace because that allows for better linker optimizations and smaller binary size.
The dispatch key argument (i.e. CPU()) ensures that this kernel is only called for tensors from the CPU backend; more on that below.
Very short and simple kernels can be written as lambdas directly in the registration call:
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op", torch::RegisterOperators::options()
.kernel(CPU(), [] (const Tensor& a) -> Tensor {...}));
These lambdas must be stateless, i.e. they must not capture anything from their environment. Registration will fail if the lambda has a closure.
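As a counterexample, here is a sketch of a registration that would be rejected (assuming the same my_namespace::my_op setup as above):

```cpp
Tensor some_tensor = ...;
// This registration fails: the lambda captures some_tensor,
// so it is not stateless.
static auto registry = torch::RegisterOperators()
    .op("my_namespace::my_op", torch::RegisterOperators::options()
        .kernel(CPU(), [some_tensor] (const Tensor& a) -> Tensor {...}));
```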
You can register catch-all kernels that are called for every backend. This disables dispatch for this operator and just always calls into the kernel you provide. You cannot combine catch-all kernels and regular device-bound kernels for the same operator.
namespace { Tensor my_kernel_fallback(Tensor a, Tensor b) {...} }
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op", torch::RegisterOperators::options()
.catchAllKernel<decltype(my_kernel_fallback), &my_kernel_fallback>());
The other ways of specifying kernels mentioned above (as functions, functors or lambdas) also work with catchAllKernel().
You can use the following syntactic sugar to define a catch-all kernel function more easily:
namespace {
Tensor my_kernel_cpu(const Tensor& a, const Tensor& b) {...}
}
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op", &my_kernel_cpu);
or for lambdas:
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op", [] (Tensor a, Tensor b) {...});
Multiple operator registrations can be chained into the same registry by calling .op() multiple times:
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op_1", torch::RegisterOperators::options()
.kernel<MyKernel1>(CPU()))
.op("my_namespace::my_op_2", torch::RegisterOperators::options()
.kernel<MyKernel2>(CPU()));
You can register different kernels for the same operator for different backends.
namespace {
Tensor my_kernel_cpu(const Tensor& a, const Tensor& b) {...}
Tensor my_kernel_cuda(const Tensor& a, const Tensor& b) {...}
}
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op", torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()))
.op("my_namespace::my_op", torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cuda), &my_kernel_cuda>(CUDA()));
Note that here, the CPU and CUDA kernels were registered directly next to each other, but that's not necessary. You could even put them into different shared libraries if you want, and as long as both are loaded into your process, things will work as you expect.
All examples above automatically inferred the operator schema from the kernel function or lambda. Sometimes, however, you want to specify the schema manually: to specify annotations, for example, or default values for arguments (default values will not be inferred from the C++ kernel function), or simply for documentation purposes or to make sure the schema matches your expectations.
namespace {
Tensor my_kernel_cpu(const Tensor& a, const Tensor& b) {...}
}
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op(Tensor a, Tensor b) -> Tensor",
torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()));
Or with annotations:
namespace {
Tensor my_kernel_cpu(const Tensor& a, int64_t b, at::optional<int64_t> c) {...}
}
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op(Tensor(a) x, int y = 3, int? z = None) -> Tensor(a|b)",
torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()));
If the schema is explicitly specified but doesn't match the kernel signature, you will get an error when registering it.
The kernel function can either return void or a single value like Tensor as in the examples above, or it can return multiple values using std::tuple as shown in the following example:
namespace {
std::tuple<Tensor, int64_t, Tensor>
my_kernel_cpu(const Tensor& a, const Tensor& b, int64_t c) {...}
}
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op", torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()));
The kernel function can take any of the following types as inputs or outputs:
- at::Tensor
- double (note: float is not supported)
- int64_t (note: other integer types like int, uint64_t, int32_t, ... are not supported)
- bool
- c10::string_view
- at::Scalar (this is a type that can hold either an integer or a floating point value)
- at::optional<T> with T being any type from the list above
The kernel function can take and return list inputs by using torch::List<T>. T must be one of the supported types from above, excluding at::Scalar.
The kernel function can take and return dicts by using torch::Dict<Key, Value>. Key must be int64_t, c10::string_view, double or bool, and Value must be from the list of supported types above, excluding at::Scalar.
When taken as input, any of these types can be taken by value (i.e. Tensor) or by const reference (i.e. const Tensor&). We recommend taking all arguments by value, even Tensors. They will be moved in, so there is no performance overhead.
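As a sketch, a kernel signature following that recommendation might look like this (my_kernel and its argument list are hypothetical):

```cpp
// All arguments taken by value; the dispatcher moves them in,
// so this costs no more than taking const Tensor&.
Tensor my_kernel(Tensor a, Tensor b, int64_t n, at::optional<double> scale) {...}
```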
If you need another type, it might work but not be officially supported (yet). Please reach out to Sebastian Messmer and we'll see what we can do.
When multiple kernels are registered for the same operator, they must have the same schema or registration will fail. Note: This also includes schema properties like annotations or default arguments. If one kernel specifies a schema with annotations or a default argument, all kernels for this operator must do this. Schemas automatically inferred from kernel functions will not have annotations or default arguments. This means to use annotations or default arguments, all kernels for this operator must explicitly specify the schema.
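For example, if you want a default argument on an operator with CPU and CUDA kernels, both registrations have to spell out the schema. This is a sketch with hypothetical names:

```cpp
namespace {
Tensor my_kernel_cpu(const Tensor& a, int64_t b) {...}
Tensor my_kernel_cuda(const Tensor& a, int64_t b) {...}
}

static auto registry = torch::RegisterOperators()
    .op("my_namespace::my_op(Tensor a, int b = 3) -> Tensor",
        torch::RegisterOperators::options()
            .kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()))
    // The schema string must match the one above exactly:
    .op("my_namespace::my_op(Tensor a, int b = 3) -> Tensor",
        torch::RegisterOperators::options()
            .kernel<decltype(my_kernel_cuda), &my_kernel_cuda>(CUDA()));
```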
If you want to reuse the same operator name for a different schema, you can use overloads. Overloads must be named and the name is appended to the operator name after a dot:
namespace {
Tensor my_kernel_cpu_1(const Tensor& a) {...}
Tensor my_kernel_cpu_2(const Tensor& a, const Tensor& b) {...}
}
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op.overload1(Tensor a) -> Tensor",
torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cpu_1), &my_kernel_cpu_1>(CPU()))
.op("my_namespace::my_op.overload2(Tensor a, Tensor b) -> Tensor",
torch::RegisterOperators::options()
.kernel<decltype(my_kernel_cpu_2), &my_kernel_cpu_2>(CPU()));
Kernels registered for the same overload must have exactly matching schemas, but kernels registered for different overloads are allowed to have different schemas. This also works when different overloads come from different shared libraries.
You can register an operator without a kernel:
static auto registry = torch::RegisterOperators()
.op("my_namespace::my_op(Tensor a, Tensor b) -> Tensor");
In this case, you must explicitly specify the full schema and you must not specify a dispatch key. This is useful to define the interface of an operator when you don't know a kernel yet. As mentioned above in the “Overloads” section, you will get an error if any kernel registered for this operator has a mismatching signature.
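For example, one library can declare the interface and another can provide the kernel later. This is a sketch; the schema in the second registration must match the declaration:

```cpp
// In the library defining the interface:
static auto schema_registry = torch::RegisterOperators()
    .op("my_namespace::my_op(Tensor a, Tensor b) -> Tensor");

// Later, possibly in a different shared library:
namespace {
Tensor my_kernel_cpu(const Tensor& a, const Tensor& b) {...}
}
static auto kernel_registry = torch::RegisterOperators()
    .op("my_namespace::my_op(Tensor a, Tensor b) -> Tensor",
        torch::RegisterOperators::options()
            .kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()));
```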
All registered operators are automatically available to PyTorch and the JIT under torch.ops.XXX. If your operator was my_namespace::my_op, you can call it from Python or the JIT using torch.ops.my_namespace.my_op(a, b).
Custom operators are not available to the caffe2 frontend by default, but there's a simple macro you can add if you want to make it available. To expose a CPU kernel:
// Expose "my_namespace::my_op" custom operator to caffe2.
// In caffe2, the operator will be called "MyCaffe2OperatorName".
C10_EXPORT_C10_OP_TO_CAFFE2_CPU(
MyCaffe2OperatorName, "my_namespace::my_op")
And to expose a CUDA kernel:
C10_EXPORT_C10_OP_TO_CAFFE2_CUDA(
MyCaffe2OperatorName, "my_namespace::my_op")
Note that this doesn't autogenerate a caffe2 operator schema for you (yet). If there's need, we might consider adding that in the future, but for now you have to write the caffe2 OPERATOR_SCHEMA macro manually if you need it.
Also, there are some requirements on the operator schema for it to be callable from caffe2. Some of these restrictions exist just because the functionality isn't implemented yet. If you have a use case that is blocked by them, please reach out to Sebastian Messmer.
- There must be either one or more arguments of type Tensor, or one argument of type Tensor[]. You cannot have both Tensor and Tensor[].
- Except for Tensor or Tensor[], only arguments of type int, double and bool are supported. These can be in any position in the argument list and will be read from the caffe2 operator arguments, based on the argument name in the operator schema.
- We do not support lists (int[], double[] or bool[]) or optionals (int?, double?, bool?) yet.
- The operator must return a single Tensor or multiple tensors as in (Tensor, Tensor, Tensor). It cannot return a list Tensor[], an optional Tensor? or any primitive types.
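Putting these restrictions together, a schema that caffe2 can call might look like this (a sketch with hypothetical names):

```cpp
namespace {
Tensor my_kernel_cpu(const Tensor& a, const Tensor& b,
                     int64_t axis, bool keepdim) {...}
}

// Tensor inputs, int/bool scalar arguments, and a single Tensor return:
// all allowed by the caffe2 restrictions above.
static auto registry = torch::RegisterOperators()
    .op("my_namespace::my_op(Tensor a, Tensor b, int axis, bool keepdim) -> Tensor",
        torch::RegisterOperators::options()
            .kernel<decltype(my_kernel_cpu), &my_kernel_cpu>(CPU()));

C10_EXPORT_C10_OP_TO_CAFFE2_CPU(
    MyCaffe2OperatorName, "my_namespace::my_op")
```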