add a boxed CPU fallback kernel #58065
Conversation
💊 CI failures summary and remediations
As of commit 233d726 (more details on the Dr. CI page and at hud.pytorch.org/pr/58065):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
It seems to me that if we could work out some way to template on
This is head-in-the-clouds stuff, but seeing how algorithmically similar the codegen and the boxed fallbacks are, I dream of a universe where we can write the semantics of a boxed fallback once, and then translate this into a codegen pass à la partial evaluation.
// To be extra safe I'm converting to a c10::List, IValue will take ownership of the items in the list.
auto cpu_ivalue = c10::IValue(c10::List<at::Tensor>(to_cpu(ivalue.toTensorList().vec())));
(*stack)[arguments_begin + idx] = std::move(cpu_ivalue);
}
You don't have to be worried; the vector constructor is:
template <class T, IValue::enable_if_ivalue_constructible<T>>
inline IValue::IValue(const std::vector<T>& v) : IValue(c10::List<T>()) {
auto list = to<c10::List<T>>();
list.reserve(v.size());
for (const auto& e : v) {
list.push_back(e);
}
}
your version is more efficient anyway tho!
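For illustration, a minimal standalone sketch of the two construction paths being compared here: building the `c10::List` up front (as the PR does) versus letting the `IValue(std::vector<T>)` constructor copy element by element. This isn't code from the PR, just an example:

```
#include <ATen/ATen.h>
#include <ATen/core/List.h>
#include <ATen/core/ivalue.h>
#include <vector>

void construction_paths(const std::vector<at::Tensor>& cpu_tensors) {
  // Path used in the PR: build the c10::List explicitly, then hand it to IValue.
  auto direct = c10::IValue(c10::List<at::Tensor>(cpu_tensors));

  // Alternative: construct the IValue straight from the vector; per the
  // constructor quoted above, this builds a c10::List internally and
  // push_backs each element.
  auto via_vector = c10::IValue(cpu_tensors);

  (void)direct;
  (void)via_vector;
}
```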
(*stack)[arguments_begin + idx] = std::move(cpu_ivalue);
}
}
auto cpu_tensors = to_cpu(tensor_args);
It's a little wonky here that you don't arrange for all tensors to be converted to CPU at once; e.g., if you have `f(Tensor, TensorList)`, first the TensorList will get materialized, then the Tensor.
yeah I'd go further and ask why we're collecting the individual tensor args into a temporary list here and passing it to the helper that has the extra list-related logic... not sure of the motivation for factoring it this way, instead of just doing `to_cpu` directly on the individual tensors here. (Could do the undefined guard as a helper on individual tensors if needed.)
> yeah I'd go further and ask why we're collecting the individual tensor args into a temporary list here and passing it to the helper that has the extra list-related logic

It turns out that this is actually needed for XLA. The individual xla tensor arguments can't be independently copied to CPU; there's some shared context between them involved in the copying. If you look at the xla implementation of `at::_to_cpu()`, it eventually calls some fusing logic here.

> It's a little wonky here that you don't arrange for all tensors to be converted to CPU at once

Is the part you don't like the fact that arguments won't necessarily be copied to CPU in the order that they come in the operator's schema? That seems fair. But just to confirm, you're not suggesting a single `at::_to_cpu()` call with all of the Tensor + TensorList args? That's also doable; it would just require a bunch more indexing logic to separate out the lists afterwards.

EDIT: Actually, I'm not sure what to do about schemas like `f(Tensor, TensorList, Tensor)`. Maybe merging all tensor args into one giant list is the right call? Or do you think it's a valid expectation that args get converted to CPU in the order they come in the schema?
Merging all tensor args into one giant list is actually best for XLA, since if some tensors share a part of the underlying graph, that part can be materialized only once. ;) We probably didn't care too much about perf here, so we indeed did it separately in the current codegen, but I think it can be changed.
For this PR I think it's fine if we document this inefficiency in a comment, and also note that "materializing together" is a requirement specific to xla only. ;)
I'm ok with adding a note: If the only factor is perf, there's also technically a tradeoff where eager backends that use the CPU fallback would get a small perf hit from the extra indexing logic to merge and slice up the list of tensors. But that's probably pretty mild, and I'm thinking of this as a knob we can use later to improve CPU fallback perf for XLA if we need to.
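For concreteness, here is a rough sketch (hypothetical helper names, not code from this PR) of what the merge-and-slice bookkeeping could look like: flatten every Tensor/TensorList argument into one vector, convert everything with a single `at::_to_cpu` call, and record offsets so the results can be sliced back into their original argument positions afterwards.

```
#include <ATen/ATen.h>
#include <vector>

// Records where each original argument landed in the flattened vector.
struct ArgSlice {
  size_t offset;  // start index into the flattened vector
  size_t length;  // 1 for a plain Tensor, list.size() for a TensorList
};

std::vector<at::Tensor> convert_all_at_once(
    const std::vector<at::Tensor>& plain_tensors,
    const std::vector<std::vector<at::Tensor>>& tensor_lists,
    std::vector<ArgSlice>& slices_out) {
  std::vector<at::Tensor> flattened;
  for (const auto& t : plain_tensors) {
    slices_out.push_back({flattened.size(), 1});
    flattened.push_back(t);
  }
  for (const auto& list : tensor_lists) {
    slices_out.push_back({flattened.size(), list.size()});
    flattened.insert(flattened.end(), list.begin(), list.end());
  }
  // One conversion call; for XLA this lets shared graph segments be
  // materialized a single time. Callers use slices_out to rebuild the
  // original Tensor / TensorList arguments from the returned vector.
  return at::_to_cpu(flattened);
}
```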
aten/src/ATen/native/CPUFallback.cpp
Outdated
// However, the new tensor that we created cannot share the same storage,
// since it lives on CPU and the original tensor lives on a different device.
// Because of that, we treat immutable aliases the same way that we treat non-aliases:
// as a fresh tensor that has entirely new storage.
Wouldn't it be better to error in this case?
Yeah, agreed; I left it like that to maintain the current behavior for XLA, but it's a bug that we should probably error on instead.
The only view op that it applies to for xla is `unfold`, since it's non-composite and doesn't have an XLA lowering:
>>> a = torch.arange(4)
>>> b = a.unfold(0, 2, 1)
>>> b[0] = -1
>>> a
tensor([-1, -1, 2, 3]) # b is a view, so the modification changes a
>>> a = torch.arange(4, device=xm.xla_device())
>>> b = a.unfold(0, 2, 1)
>>> b[0] = -1
>>> a
tensor([0, 1, 2, 3], device='xla:0') # modifying b didn't change a!
Yup, we should make it clear that XLA needs to handle all view ops in the backend instead of relying on the fallback. IIRC @JackCaoG tried lowering `unfold` before, but it wasn't merged due to complexity and priority.
(Erroring out here is definitely helpful.)
Haven't updated the PR yet, but I just tested the error message locally:
>>> a = torch.arange(4, device=xm.xla_device())
>>> b = a.unfold(0, 2, 1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: The operator aten::unfold appears to be a view operator, but it has no implementation for the backend "xla:0". View operators don't support falling back to run on the CPU, since the tensor's storage cannot be shared across devices.
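As a rough sketch (the exact `FunctionSchema` accessors may differ from what the PR actually lands), a check that produces an error like this boils down to: find an output whose alias info aliases an input without being a write, i.e. a view, and refuse to fall back.

```
#include <ATen/core/dispatch/Dispatcher.h>
#include <c10/util/Exception.h>

// Hypothetical sketch: refuse to run the CPU fallback for ops whose outputs
// are immutable (view-style) aliases of their inputs, since the CPU result's
// storage can't be shared with the original device tensor.
void check_no_view_outputs(const c10::OperatorHandle& op) {
  const auto& schema = op.schema();
  for (const auto& ret : schema.returns()) {
    const auto& alias_info = ret.alias_info();
    const bool is_view_style_alias = alias_info && !alias_info->isWrite();
    TORCH_CHECK(
        !is_view_style_alias,
        "The operator ", schema.name(),
        " appears to be a view operator, but it has no implementation for this backend. ",
        "View operators don't support falling back to run on the CPU, since the ",
        "tensor's storage cannot be shared across devices.");
  }
}
```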
// mutable alias case: move the input ivalue directly onto the stack
// in place of the existing cpu output tensor.
bool found_alias = false;
for (int64_t i = 0; i < tensor_args_indices.size(); ++i) {
🚨 quadratic police 🚨
This shouldn't really even be happening at call time, should it? We're just re-deriving the same schema info over and over. Not worth blocking the PR, but we could think about precomputing this (and other schema-derived info?) at registration time.
Yeah, this is totally something we could precompute. Right now we have a way to go from `Argument` -> `AliasInfo`, but there's no direct link from `AliasInfo` -> [list_of_args_with_that_alias]. I could avoid the runtime quadratic by pre-computing that map in this fallback, but I'm not sure it's worth the effort - it's not really quadratic, since you only do the linear search for each mutable-alias'd output argument (and there's usually only one). And in xla's case, afaik most people want to call functional ops that don't have any mutable-alias'd outputs.
On the topic of boxed fallback perf: there are probably a bunch of things we could pre-compute at registration time to speed this up, and boxed fallbacks in general. They would all increase the size of `FunctionSchema` though, so it'd probably be worth benchmarking to make sure we actually get a speedup.
Just a few ideas (not sure if these are all fruitful):
- [the one you said] Precompute the mapping between alias'd inputs/outputs, which would avoid the double for loop here (see the sketch below). Right now you can identify them by checking if their alias_info objects are the same, but we could store direct references between the input/output args instead. I'm not sure whether or not we'd have to deal with weird schemas though, like `(Tensor(a!) t1, Tensor(a!) t2) -> Tensor(a!)`.
- Precompute which arguments are tensors/tensorlists, since it's probably a pretty common task in a boxed fallback to iterate through tensor arguments.
- Also, I see the FunctionSchema stores a `std::vector<Argument>` - it might be an easy win to make these `SmallVector`.
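Sketching the first idea: a registration-time pass could precompute, for each mutable output, which input it aliases, so the fallback doesn't re-derive it on every call. This is a hypothetical helper; the exact return type of `alias_info()` differs across versions, and it assumes `AliasInfo` equality identifies the same alias set.

```
#include <ATen/core/function_schema.h>
#include <c10/util/Optional.h>
#include <vector>

// For each schema return, records the index of the input argument it
// mutably aliases (nullopt if the output isn't a mutable alias).
std::vector<c10::optional<size_t>> precompute_output_to_input_alias_map(
    const c10::FunctionSchema& schema) {
  const auto& args = schema.arguments();
  const auto& rets = schema.returns();
  std::vector<c10::optional<size_t>> out_to_in(rets.size(), c10::nullopt);
  for (size_t ret_idx = 0; ret_idx < rets.size(); ++ret_idx) {
    const auto& ret_alias = rets[ret_idx].alias_info();
    if (!ret_alias || !ret_alias->isWrite()) {
      continue;  // only mutable (write) aliases need to be copied back
    }
    for (size_t arg_idx = 0; arg_idx < args.size(); ++arg_idx) {
      const auto& arg_alias = args[arg_idx].alias_info();
      if (arg_alias && *arg_alias == *ret_alias) {
        out_to_in[ret_idx] = arg_idx;
        break;
      }
    }
  }
  return out_to_in;
}
```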
// convenience helper for converting tensors to cpu
std::vector<at::Tensor> to_cpu(const at::TensorList& tensors) {
Perhaps the stock `at::to_cpu` should just support undefined? IDK...
Yeah what's the downside?
Agreed, I'll try adding that in. At some point I was debugging some failures related to undefined tensors and probably put the logic directly here out of frustration / to be extra safe 😛.
Oh, well, one downside worth pointing out is that backends that implement their own `at::_to_cpu` need to remember to handle undefined tensors. I can add the logic explicitly for XLA's version, but it's just another thing to get wrong.
In our schema we currently don't have a good way to tell backends whether they should handle an undefined Tensor in a given TensorList. (Most tensors are concrete, but we do occasionally see tensors that can be undefined in some ops, especially those saved for backward in batchnorm.)
If we don't handle it here, we'll need to send a note to backends to warn about undefined tensors. (We also have optional tensors, but they're not commonly used in TensorList arguments.) So if we need this logic somewhere (either in the backend or in pytorch core), it's probably easier to do it for backends in one place. ;)
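For reference, a sketch of the kind of undefined-tensor guard being discussed (assuming the backend's `at::_to_cpu` itself doesn't handle undefined): only defined tensors go through `_to_cpu`, and undefined slots are passed through untouched, so backends never have to think about them.

```
#include <ATen/ATen.h>
#include <vector>

std::vector<at::Tensor> to_cpu_skipping_undefined(const at::TensorList& tensors) {
  // Output slots default-construct to undefined tensors.
  std::vector<at::Tensor> cpu_tensors(tensors.size());
  std::vector<at::Tensor> defined;
  std::vector<size_t> defined_positions;
  for (size_t i = 0; i < tensors.size(); ++i) {
    if (tensors[i].defined()) {
      defined_positions.push_back(i);
      defined.push_back(tensors[i]);
    }
    // undefined tensors stay undefined in the output
  }
  // Convert only the defined tensors in one call, then scatter the results
  // back into their original positions.
  auto converted = at::_to_cpu(defined);
  for (size_t i = 0; i < defined_positions.size(); ++i) {
    cpu_tensors[defined_positions[i]] = converted[i];
  }
  return cpu_tensors;
}
```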
This looks good. @zdevito, does 10-20% overhead for going through boxed code make sense to you? I had a hypothesis that some of the cost was from traversing metadata, but I don't see any flagrantly terrible logic on the metadata (there is a quadratic loop, but it shouldn't matter).
A high-level question I'd ask you is how you feel about the boxed fallback compared to the code generator pipeline. It seems like it's less code; how did you feel about the complexity?
Other miscellanea:
- A higher-priority boxed fallback seems like a reasonable thing to add.
Couple minor things, but yeah LG!
Re: precedence vs `CompositeImplicitAutograd`, I definitely think we need to be careful though. Longer discussion.
aten/src/ATen/native/CPUFallback.cpp
Outdated
auto cpu_tensors = to_cpu(tensor_args);

for (auto i = 0; i < tensor_args_indices.size(); ++i) {
//auto cpu_ivalue = c10::IValue(cpu_tensors[i]);
not sure the significance of the inline comments...
absolutely no significance, need to clean up some of my comments 😛 (that includes the "I'm worried" comment that Ed pointed out further up - definitely planned on removing that before eventually merging)
aten/src/ATen/native/CPUFallback.h
Outdated
// External backends can add their own custom logging on top of it to customize their own CPU fallbacks.
TORCH_API void cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);

template<c10::KernelFunction::BoxedKernelFunction* fallback_fn, class ReturnType, class... ParameterTypes>
Probably worth a brief comment describing the situation this is useful for.
agreed
aten/src/ATen/native/CPUFallback.h
Outdated
}

template<c10::KernelFunction::BoxedKernelFunction* fallback_fn, class F2>
struct call_fallback_fn2 final {};
Similarly, I'd add a comment here describing why you'd use this one instead of call_fallback_fn
I should have preemptively commented these. The two versions of this function are temporary: once #58092 lands I can kill the version above and use this nicer version (and rename `call_fallback_fn2` to `call_fallback_fn`).
I talked with Ed, and there's probably an even nicer version of this where you just call `at::native::call_fallback_fn<&xla_cpu_fallback, at::_ops::_adaptive_avg_pool3d>::call(self, output_size);`, and `at::_ops::_adaptive_avg_pool3d` has enough information to give you both the op schema type and the op's name + overload name. That way, users of this API only have to pass in the arguments + the fallback that they want to call.
Right now I'm planning on just landing the PR independently from that change and doing a fix-up later.
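To make the unboxed-to-boxed hop concrete, here's a simplified, hypothetical sketch of what a `call_fallback_fn`-style wrapper does. It's cut down from the real template (it ignores overload names, multiple returns, and void-returning ops): look up the operator, push the unboxed arguments onto a stack, run the boxed fallback, and pop the result back out.

```
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/core/stack.h>

// Hypothetical simplified wrapper; the real call_fallback_fn in the PR is
// specialized on the operator's signature and handles overloads properly.
template <c10::KernelFunction::BoxedKernelFunction* fallback_fn, class FuncType>
struct call_fallback_fn_sketch final {};

template <c10::KernelFunction::BoxedKernelFunction* fallback_fn,
          class ReturnType, class... ParameterTypes>
struct call_fallback_fn_sketch<fallback_fn, ReturnType(ParameterTypes...)> final {
  static ReturnType call(const char* op_name, ParameterTypes... args) {
    auto op = c10::Dispatcher::singleton()
                  .findSchemaOrThrow(op_name, /*overload_name=*/"");
    // Box the arguments onto a JIT stack, run the boxed fallback (which
    // replaces the arguments with the results), then unbox the return value.
    torch::jit::Stack stack;
    torch::jit::push(stack, args...);
    fallback_fn(op, &stack);
    return torch::jit::pop(stack).to<ReturnType>();
  }
};
```

Usage would look roughly like `call_fallback_fn_sketch<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha)`, which is the shape of the API the PR description shows.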
This feels to me like a pretty big complexity decrease - the codegen for CPU fallbacks was pretty convoluted and ugly. It comes at the cost of requiring a little more work from the backend, though, since they now need to write their own boxed fallback. That isn't too bad, since the helper functions we provide do 99% of the work; it just needs to be well documented. The different ideas for improving perf of the fallback all come with their own additional complexity, so that seems to me like a bridge we can cross later when/if we want to further improve perf.
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.

Current state: there are a couple of different design ideas that I want to point out, but the logic for the actual kernel is mostly done and passing tests.

### Design

To preface, I'm not 100% tied to the current design; I'm putting the PR up now for opinions and am totally open to alternatives, some of which I listed below. Actually, after writing this description, I'm leaning toward the following changes:
* Confirm whether or not we can remove all C++ logging info directly in the yaml.

**Current Design**

All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have a corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.

```
// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
```

```
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Do custom logging here
  ...
  // Call the actual boxed CPU fallback.
  at::native::cpu_fallback(op, stack);
}

TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}
```

Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, for which we provide a utility function. E.g.:

```
#include <ATen/native/CPUFallback.h>

at::Tensor addmm(const at::Tensor& self, const at::Tensor& mat1, const at::Tensor& mat2, const at::Scalar& beta, const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha);
  }
  ...
}
```

That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.

**Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?**

We could change the api to use `at::redispatch`, which would make it look something like this:

```
at::Tensor addmm(const at::Tensor& self, const at::Tensor& mat1, const at::Tensor& mat2, const at::Scalar& beta, const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha);
  }
  ...
}
```

Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!

Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:

```
at::Tensor addmm(const at::Tensor& self, const at::Tensor& mat1, const at::Tensor& mat2, const at::Scalar& beta, const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
  }
  ...
}
```

Writing that out, I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out.

**More alternatives**

The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides:
* Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
* Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging.

So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.

### Performance impact

While ops that fall back to CPU aren't exactly on the hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587.

I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noisy and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that the noise is due to some heavyweight async startup processing that xla does.

`at::add`:
* before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001
* after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273
* delta: ~15.5% increase

`at::add_out`:
* before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz
* after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227
* delta: ~14.5% increase

High-level takeaway: a framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.

For structured, functional ops that require a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. So the extra work that we end up doing is:
* An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
* An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
* An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel)
* unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later.

Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere).
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is: * An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add) * An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback) * An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel). * unboxing->boxing->unboxing logic (this is the only strictly required piece) There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later. Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere). Differential Revision: [D28833085](https://our.internmc.facebook.com/intern/diff/D28833085) [ghstack-poisoned]
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is: * An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add) * An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback) * An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel). * unboxing->boxing->unboxing logic (this is the only strictly required piece) There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later. Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere). Differential Revision: [D28833085](https://our.internmc.facebook.com/intern/diff/D28833085) [ghstack-poisoned]
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is: * An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add) * An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback) * An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel). * unboxing->boxing->unboxing logic (this is the only strictly required piece) There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later. Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere). Differential Revision: [D28833085](https://our.internmc.facebook.com/intern/diff/D28833085) [ghstack-poisoned]
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is:
* An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
* An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
* An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel)
* unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)) and codegen fallthroughs for all composite ops. That will require more infra to set up, so I see it as more of a perf knob that we can apply later if we need it.

Unfortunately I couldn't dig much deeper into the differences beyond the aggregate change in instruction counts, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere).

Differential Revision: [D28833085](https://our.internmc.facebook.com/intern/diff/D28833085)

[ghstack-poisoned]
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is: * An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add) * An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback) * An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel). * unboxing->boxing->unboxing logic (this is the only strictly required piece) There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later. Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere). Differential Revision: [D28833085](https://our.internmc.facebook.com/intern/diff/D28833085) [ghstack-poisoned]
Pull Request resolved: #58065 This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is: * An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add) * An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback) * An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel). * unboxing->boxing->unboxing logic (this is the only strictly required piece) There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later. Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere). Differential Revision: [D28833085](https://our.internmc.facebook.com/intern/diff/D28833085/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D28833085/)! ghstack-source-id: 132409766
Summary: Pull Request resolved: pytorch#58065 This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once pytorch#58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is: * An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add) * An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback) * An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel). * unboxing->boxing->unboxing logic (this is the only strictly required piece) There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](pytorch#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later. Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere). Test Plan: Imported from OSS Reviewed By: jbschlosser Differential Revision: D28833085 Pulled By: bdhirsh fbshipit-source-id: 537ebd5d7fb5858f1158764ff47132d503c3b92b
Summary: Pull Request resolved: #58065 This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback. Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests. ### Design To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes: * Confirm whether or not we can remove all C++ logging info directly in the yaml. **Current Design** All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1). There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels. ``` // xla_cpu_fallback.h #include <ATen/native/CPUFallback.h> ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack); ... ``` ``` // xla_cpu_fallback.cpp #include "torch_xla/csrc/aten_cpu_fallback.h" ... void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) { // Do custom logging here ... // Call the actual boxed CPU fallback. at::native::cpu_fallback(op, stack); } TORCH_LIBRARY_IMPL(_, XLA, m) { m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>()); } ``` Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.: ``` #include <ATen/native/CPUFallback.h> at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha); } ... } ``` That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands. **Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?** We could change the api to use `at::redispatch`, which would make it look something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha); } ... } ``` Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though! Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. 
Something like this: ``` at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) { .... if (...call_fallback...) { return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha); } ... } ``` Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out. **More alternatives** The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides: * Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback. * Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later. To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback. ### Performance impact While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer. I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind. I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does. `at::add`: before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001 after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273 delta: ~15.5% increase `at::add_out`: before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227 delta: ~14.5% increase High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case. For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. 
So the extra work that we end up doing is: * An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add) * An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback) * An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel). * unboxing->boxing->unboxing logic (this is the only strictly required piece) There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](#55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later. Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere). Test Plan: Imported from OSS Reviewed By: jbschlosser Differential Revision: D28833085 Pulled By: bdhirsh fbshipit-source-id: 537ebd5d7fb5858f1158764ff47132d503c3b92b
Register a CPU kernel to use as a fallback when a specific implementation is not registered for an operator. The CPU fallback kernel was added to PyTorch with pytorch/pytorch#58065. This is a minimal change to register the fallback CPU kernel. Followup changes will remove the existing code-generated TorchFallback kernels. We can also add logic to log which fallback CPU kernels were invoked, to guide work on replacing CPU kernels with ORT implementations.
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.
Current state: there are a couple of different design ideas that I want to point out, but the logic for the actual kernel is mostly done and passing tests.
Design
To preface, I'm not 100% tied to the current design; I'm putting the PR up now for opinions and am totally open to alternatives, some of which I listed below. Actually, after writing this description, I'm leaning toward the following change:

* Confirm whether or not we can remove all of the C++ logging info that's currently passed directly in the yaml.
Current Design
All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have a corresponding xla-side PR with the xla changes.
There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.
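```
// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
```

```
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Do custom logging here
  ...
  // Call the actual boxed CPU fallback.
  at::native::cpu_fallback(op, stack);
}

TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}
```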
Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, for which we provide a utility function. E.g.:
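```
#include <ATen/native/CPUFallback.h>

at::Tensor addmm(const at::Tensor& self, const at::Tensor& mat1, const at::Tensor& mat2,
                 const at::Scalar& beta, const at::Scalar& alpha) {
  ...
  if (...call_fallback...) {
    return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call(
        "aten::addmm", self, mat1, mat2, beta, alpha);
  }
  ...
}
```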
That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.

Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?
We could change the api to use `at::redispatch`, which would make it look something like this:
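```
at::Tensor addmm(const at::Tensor& self, const at::Tensor& mat1, const at::Tensor& mat2,
                 const at::Scalar& beta, const at::Scalar& alpha) {
  ...
  if (...call_fallback...) {
    return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback),
                                 self, mat1, mat2, beta, alpha);
  }
  ...
}
```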
Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!
Another, milder improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:
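```
at::Tensor addmm(const at::Tensor& self, const at::Tensor& mat1, const at::Tensor& mat2,
                 const at::Scalar& beta, const at::Scalar& alpha) {
  ...
  if (...call_fallback...) {
    return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
  }
  ...
}
```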
Writing that out, I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that's nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out.

More alternatives
The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides:

* Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
* Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later.

To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand-written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.

Performance impact
While ops that fall back to CPU aren't exactly on the hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer.
I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind.

I also attached the full callgrind output for each benchmark; the full output is actually pretty noisy and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that the noise is due to some heavyweight async startup processing that xla does.

`at::add`:

before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001
after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273
delta: ~15.5% increase
`at::add_out`:

before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz
after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227
delta: ~14.5% increase
High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.
For structured, functional ops that require a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. So the extra work that we end up doing is:

* An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
* An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
* An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel)
* unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's an issue for it here: #55104), and to codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later.
Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere).

Stack from ghstack:

Differential Revision: D28833085