[pytorch] reintroduce static dispatch #51554

Conversation
The new static dispatch and c10 registration can work together. The codegen emits static dispatch code for the selected backends (if set) and falls back to regular dispatch for the rest. This way, it can be used to reduce the dispatcher's overhead for perf-sensitive use cases without compromising functionality. If the `static_dispatch_backends` flag is not set, the behavior is the same as before.

Added back the E2E mobile static dispatch CI for testing purposes.

This PR doesn't try to optimize mobile build size yet. We can introduce separate build flags to disable the fallback logic, with which the linker can strip out unused op-invocation code. Static dispatch for manually registered ops / custom ops / autograd kernels is not handled by this PR. We can work on these special cases progressively.

- Sample code (with static dispatch backend = CPU):
```
// aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
Tensor Tensor::add(const Tensor & other, Scalar alpha) const {
  DispatchKeySet _dk_set = c10::detail::multi_dispatch_key_set(other, const_cast<Tensor&>(*this));
  DispatchKey _dk = c10::impl::dispatchTypeId(_dk_set, DispatchKeySet::FULL);
  switch (_dk) {
    case DispatchKey::BackendSelect:
      // fallthrough
    case DispatchKey::CPU:
      return at::cpu::add(const_cast<Tensor&>(*this), other, alpha);
    default:
      // fallback to regular dispatch
      // TORCH_CHECK(false, "Unsupported static dispatch", _dk);
      break;
  }
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::add", "Tensor")
      .typed<Tensor (const Tensor &, const Tensor &, Scalar)>();
  return op.call(const_cast<Tensor&>(*this), other, alpha);
}
```
- If the op has a BackendSelect kernel, it falls back to c10 dispatch:
```
// aten::arange(Scalar end, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor
Tensor arange(Scalar end, const TensorOptions & options) {
  DispatchKey _dk = options.computeDispatchKey();
  switch (_dk) {
    case DispatchKey::CPU:
      return at::math::arange(end, options);
    default:
      // fallback to regular dispatch
      // TORCH_CHECK(false, "Unsupported static dispatch", _dk);
      break;
  }
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::arange", "")
      .typed<Tensor (Scalar, c10::optional<ScalarType>, c10::optional<Layout>, c10::optional<Device>, c10::optional<bool>)>();
  return op.call(end, optTypeMetaToScalarType(options.dtype_opt()), options.layout_opt(), options.device_opt(), options.pinned_memory_opt());
}
```
- If the op only has a math kernel and there is no tensor argument / tensor option to infer the dispatch key from, then always dispatch to the math kernel (only if `static_dispatch_backends` is set):
```
// aten::_nnpack_available() -> bool
bool _nnpack_available() {
  return at::math::_nnpack_available();
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::_nnpack_available", "")
      .typed<bool ()>();
  return op.call();
}
```
- If the op doesn't have a CPU backend, then nothing changes:
```
// aten::quantized_batch_norm(Tensor input, Tensor? weight, Tensor? bias, Tensor mean, Tensor var, float eps, float output_scale, int output_zero_point) -> Tensor
Tensor quantized_batch_norm(const Tensor & input, const c10::optional<Tensor> & weight, const c10::optional<Tensor> & bias, const Tensor & mean, const Tensor & var, double eps, double output_scale, int64_t output_zero_point) {
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::quantized_batch_norm", "")
      .typed<Tensor (const Tensor &, const c10::optional<Tensor> &, const c10::optional<Tensor> &, const Tensor &, const Tensor &, double, double, int64_t)>();
  return op.call(input, weight, bias, mean, var, eps, output_scale, output_zero_point);
}
```
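To make the switch-plus-fallback structure concrete, here is a minimal Python sketch of the generation rule described above. It is an illustration only, assuming a hypothetical `gen_static_dispatch` helper; the names and signature do not correspond to the actual tools/codegen API, they just mirror the shape of the generated samples.
```
# Hedged sketch (not the real tools/codegen code): emit a C++ body with static
# calls for the selected backends and the regular c10 dispatch as the fallback.
from typing import Dict, List, Optional

def gen_static_dispatch(op: str, sig: str, args: str,
                        dispatch: Dict[str, str],
                        static_backends: Optional[List[str]]) -> str:
    fallback = (
        f'  static auto op = c10::Dispatcher::singleton()\n'
        f'      .findSchemaOrThrow("aten::{op}", "")\n'
        f'      .typed<{sig}>();\n'
        f'  return op.call({args});'
    )
    if not static_backends:
        # static_dispatch_backends flag unset: behave exactly as before
        return fallback
    cases = [f'    case DispatchKey::{b}:\n      return at::{b.lower()}::{op}({args});'
             for b in static_backends if b in dispatch]
    if not cases:
        # op has no kernel for the selected backends: nothing changes
        return fallback
    switch = (
        "  DispatchKey _dk = /* from tensor args or TensorOptions */;\n"
        "  switch (_dk) {\n"
        + "\n".join(cases)
        + "\n    default:\n      break;  // fall back to regular dispatch\n  }\n"
    )
    return switch + fallback

print(gen_static_dispatch(
    "add", "Tensor (const Tensor &, const Tensor &, Scalar)",
    "self, other, alpha",
    {"CPU": "add", "CUDA": "add"},
    ["CPU"]))
```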
💊 CI failures summary and remediations, as of commit 8825104 (more details on the Dr. CI page):

Extra GitHub checks: 1 failed
tools/codegen/api/translate.py
Outdated
```
t = b.ctype
if isinstance(t, ConstRefCType) and isinstance(t.elem, OptionalCType) and \
        isinstance(t.elem.elem, BaseCType) and t.elem.elem.type == 'Tensor':
    ctx[ConstRefCType(BaseCType("Tensor", b.name))] = f'({b.name}.has_value() ? *{b.name} : at::Tensor())'
```
nit: Today, functionally it doesn't make a difference, but it would be better to put this translation rule inside `solve` itself, so that we are still uniformly doing backward inference. Because the rule here is very simple it can be done with either forward or backward inference, so it's mostly a uniformity thing.
(The trouble with forward inference is when you start stuffing the context with tons and tons of possible conversions "just because they might help"; backward lets you be a lot more directed about things. Though it's not that bad of an idea; see Datalog for example :)
You know what, I changed my mind: unpacking of optional to Tensor should be done as forward inference.
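To illustrate the forward-vs-backward distinction being discussed, here is a small self-contained sketch of the optional-unpacking rule in both styles. The `CType`, `forward_inference`, and `backward_inference` names are simplified stand-ins for illustration, not the real classes and functions in tools/codegen/api/translate.py.
```
# Simplified stand-ins, not the real translate.py machinery.
from dataclasses import dataclass

@dataclass(frozen=True)
class CType:
    cpp: str   # e.g. "const c10::optional<Tensor> &"
    name: str  # binding name, e.g. "weight"

def forward_inference(ctx: dict) -> None:
    """Eagerly add every expression derivable from what is already in scope."""
    for key, expr in list(ctx.items()):
        if key.cpp == "const c10::optional<Tensor> &":
            # optional<Tensor> -> Tensor is cheap, so add it to the context up front
            ctx[CType("const Tensor &", key.name)] = (
                f"({expr}.has_value() ? *{expr} : at::Tensor())"
            )

def backward_inference(goal: CType, ctx: dict) -> str:
    """Solve for a goal type on demand via known conversion rules."""
    if goal in ctx:
        return ctx[goal]
    if goal.cpp == "const Tensor &":
        src = CType("const c10::optional<Tensor> &", goal.name)
        if src in ctx:
            return f"({ctx[src]}.has_value() ? *{ctx[src]} : at::Tensor())"
    raise RuntimeError(f"cannot produce {goal}")

# Both directions yield the same expression for this simple rule:
ctx = {CType("const c10::optional<Tensor> &", "weight"): "weight"}
print(backward_inference(CType("const Tensor &", "weight"), ctx))
forward_inference(ctx)
print(ctx[CType("const Tensor &", "weight")])
```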
```
assert len(tensor_opts) == 1
# specialized fast pass
stmts.append(f"""\
DispatchKey _dk = {tensor_opts[0].name}.computeDispatchKey();
```
I hope that `c10::detail::multi_dispatch_key_set` is just as good as this ;)
```
for case_key in backends:
    for dispatch_key in (case_key, DispatchKey.DefaultBackend, DispatchKey.Math):
        # FIXME: how do I get dispatch table for function with structured_delegate? Is it correct to
        # always statically dispatch to the delegate?
```
When there's a structured delegate, the dispatch table is automatically generated based on the out variant (https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md#structured-keyword-proposal; there is no dispatch table for `upsample_nearest1d` because it delegates its dispatch to `upsample_nearest1d_out`). We're still on the hook for generating wrapper functions for all the variants.

I'm not sure if that helps you globally here, still reading.
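In codegen terms, that answer suggests the static-dispatch logic can look through the delegate: when a functional op carries a structured_delegate, consult the out variant's dispatch table to decide which backends it is effectively available on. The sketch below is a hypothetical helper with simplified field and kernel names, not the actual tools/codegen/gen.py logic.
```
# Hypothetical helper: derive a structured op's per-backend availability from
# the dispatch table of its out-variant delegate. Field names are simplified.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class NativeFunction:
    name: str
    dispatch: Dict[str, str] = field(default_factory=dict)  # backend -> kernel
    structured_delegate: Optional[str] = None                # e.g. "upsample_nearest1d.out"

def effective_dispatch(f: NativeFunction,
                       index: Dict[str, NativeFunction]) -> Dict[str, str]:
    """Return the backend->kernel table that static dispatch should consult."""
    if f.structured_delegate is not None:
        # No dispatch table of its own: availability comes from the out variant.
        return index[f.structured_delegate].dispatch
    return f.dispatch

out = NativeFunction("upsample_nearest1d.out",
                     dispatch={"CPU": "upsample_nearest1d_out_cpu",
                               "CUDA": "upsample_nearest1d_out_cuda"})
fn = NativeFunction("upsample_nearest1d",
                    structured_delegate="upsample_nearest1d.out")
index = {f.name: f for f in (out, fn)}
print(effective_dispatch(fn, index))  # the out variant's CPU/CUDA table
```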
Summary: Pull Request resolved: #51590

This PR backports a subset of Jiakai's changes from #51554 that adds support for at::cpu in non-structured kernels. The unusual bits:

- Need to add a new forward inference rule for doing conversions of const optional<Tensor>& to const Tensor&
- Need to give the wrapper functions a prefix so that the call to the wrapper is not ambiguous

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D26209871
Pulled By: ezyang
fbshipit-source-id: 8162686039675ab92a2af7a14f6b18941f8944df
Summary: Pull Request resolved: #51957

This is a simplified version of #51554. Compared to #51554, this version only supports statically dispatching to a specific backend. The benefit is that it skips the dispatch key computation logic and thus has less framework overhead. The downside is that if the input tensors do not match the specified backend it will throw an error instead of falling back to regular dispatch.

Sample code:
```
Tensor empty(IntArrayRef size, TensorOptions options, c10::optional<MemoryFormat> memory_format) {
  return at::cpu::empty(size, options, memory_format);
}

// aten::conj(Tensor(a) self) -> Tensor(a)
Tensor conj(const Tensor & self) {
  return at::math::conj(self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_out(Tensor & out, const Tensor & self) {
  return at::cpu::conj_out(out, self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_outf(const Tensor & self, Tensor & out) {
  return at::cpu::conj_out(out, self);
}

// aten::_conj(Tensor self) -> Tensor
Tensor _conj(const Tensor & self) {
  return at::defaultbackend::_conj(self);
}
```

For ops without the specific backend dispatch, it will throw an error:
```
// aten::_use_cudnn_ctc_loss(Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, int blank) -> bool
bool _use_cudnn_ctc_loss(const Tensor & log_probs, const Tensor & targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t blank) {
  TORCH_CHECK(false, "Static dispatch does not support _use_cudnn_ctc_loss for CPU.");
}
```

Differential Revision: D26337857
Test Plan: Imported from OSS
Reviewed By: bhosmer
Pulled By: ljk53
fbshipit-source-id: a8e95799115c349de3c09f04a26b01d21a679364
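The generation rule of this simplified version can be summarized as: prefer the selected backend's kernel, then DefaultBackend, then Math, and emit a hard TORCH_CHECK failure when none exists instead of falling back to the dispatcher. The helper below is a hedged sketch with a hypothetical name, not the code actually landed in tools/codegen.
```
# Hedged sketch of the #51957 rule: static call if a kernel exists for the
# chosen backend (or DefaultBackend/Math), otherwise a hard error, no fallback.
from typing import Dict

NAMESPACE = {"CPU": "at::cpu", "DefaultBackend": "at::defaultbackend", "Math": "at::math"}

def gen_body(op: str, args: str, dispatch: Dict[str, str], backend: str = "CPU") -> str:
    for key in (backend, "DefaultBackend", "Math"):
        if key in dispatch:
            return f"  return {NAMESPACE[key]}::{op}({args});"
    return (f'  TORCH_CHECK(false, "Static dispatch does not support '
            f'{op} for {backend}.");')

print(gen_body("conj", "self", {"Math": "conj"}))
print(gen_body("_use_cudnn_ctc_loss",
               "log_probs, targets, input_lengths, target_lengths, blank",
               {"CUDA": "_use_cudnn_ctc_loss"}))
```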
Stack from ghstack:

Differential Revision: D26197326