
Conversation


bdhirsh (Contributor) commented Jun 25, 2021

The PR below this is helpful when there's a schema mismatch, but doesn't help you if you add a new operator to your yaml file, and completely forget to add the corresponding kernel class definition - you still get a linker error.

This PR catches those errors, but not by parsing the schema in the external backend's kernel definition. Instead, it compares the number of kernel definitions for each name with the number of overloads expected. For example, if the backend specifies that they'll write kernels for `add.Tensor` and `add.Scalar`, but only provides a single `XLANativeFunctions::add(...)` definition, we'll error out because we only saw 1 `add` kernel but we expected 2. Any variation (forgetting the `XLANativeFunctions` bit, or messing up the schema) should be caught either by this codegen check or by a compiler error, so we shouldn't end up with any linker errors.
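
A rough sketch of that counting check (a simplified illustration, not the actual codegen; `check_kernel_counts` and its regex-based counting are invented here, while the real check works off parsed `NativeFunction` objects):

```python
import re
from collections import Counter

def check_kernel_counts(expected_overloads, backend_cpp, class_name="XLANativeFunctions"):
    # add.Tensor and add.Scalar share the base name 'add', so together they
    # require two XLANativeFunctions::add(...) definitions.
    expected = Counter(op.split('.')[0] for op in expected_overloads)
    errors = []
    for name, want in expected.items():
        # Count C++ definitions of the form ClassName::name(
        have = len(re.findall(rf'{class_name}::{re.escape(name)}\s*\(', backend_cpp))
        if have != want:
            errors.append(f"saw {have} '{name}' kernel(s) but expected {want}")
    if errors:
        raise RuntimeError('; '.join(errors))
```

With this check, a backend that promised two `add` overloads but wrote one fails at codegen time instead of at link time.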

An alternative would be to scrap this PR completely and write something more ambitious, which @ezyang pointed out to me - if we parse the schemas, we can also generate glue code that would prevent xla from breaking whenever we introduce a new defaultable parameter to an existing op in pytorch. If we see a default argument in the schema, but don't see the corresponding argument in the backend schema, we can add a runtime check to ignore the defaultable argument if it's equal to its default value, and otherwise raise an error.

I didn't bother with that for now, mostly because:

  • I'd already written this when I heard about it 😛
  • As @ailzhang pointed out, that wouldn't solve all BC issues for external backends. Most of the time when people add new defaultable params, they also add new tests that test out that functionality. Those tests would still break for the external backend, and unless we have a friendly pattern for making people aware of XLA and knowing to skip the tests for XLA, we'll end up with the same issue (the pytorch/xla CI tests will fail until that test is fixed or skipped). So the burden still falls on pytorch/xla maintainers to either spread the knowledge of skipping new tests for XLA (and fixing them up later in batches), or just make a patch in pytorch/xla (which is what we do now).

Stack from ghstack:

Differential Revision: D29392615


facebook-github-bot commented Jun 25, 2021

💊 CI failures summary and remediations

As of commit ac59c65 (more details on the Dr. CI page and at hud.pytorch.org/pr/60737):


  • 3/3 failures possibly* introduced in this PR
    • 2/3 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/1)

Step: "Build"

Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/...any arguments to function call, expected 2, have 3
Jul 06 19:30:36 clang++-9 -MMD -MF /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c -c /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp -o /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_clang"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1002"' -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:395:21: error: no member named 'index_put_' in namespace 'torch_xla'
Jul 06 19:30:36   return torch_xla::index_put_(self, indices, values, accumulate);
Jul 06 19:30:36          ~~~~~~~~~~~^
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:455:10: error: use of undeclared identifier 'view'; did you mean 'View'?
Jul 06 19:30:36   return view(self, size);
Jul 06 19:30:36          ^
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/view.h:128:7: note: 'View' declared here
Jul 06 19:30:36 class View {
Jul 06 19:30:36       ^
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:1091:56: error: too many arguments to function call, expected 2, have 3
Jul 06 19:30:36   return torch_xla::div(self, other, /*rounding_mode=*/c10::nullopt);
Jul 06 19:30:36          ~~~~~~~~~~~~~~                                ^~~~~~~~~~~~
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:1090:1: note: 'div' declared here
Jul 06 19:30:36 at::Tensor div(const at::Tensor& self, const at::Tensor& other) {
Jul 06 19:30:36 ^
Jul 06 19:30:36 3 errors generated.
Jul 06 19:30:37 [19/171] clang++-9 -MMD -MF /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/init_python_bindings.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c -c /var/lib/jenkins/workspace/xla/torch_xla/csrc/init_python_bindings.cpp -o /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/init_python_bindings.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_clang"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1002"' -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
Jul 06 19:30:37 In file included from /var/lib/jenkins/workspace/xla/torch_xla/csrc/init_python_bindings.cpp:35:
Jul 06 19:30:37 In file included from /var/lib/jenkins/workspace/torch/csrc/jit/python/pybind.h:8:
Jul 06 19:30:37 In file included from /var/lib/jenkins/workspace/torch/csrc/THP.h:42:

ci.pytorch.org: 1 failed



bdhirsh added a commit that referenced this pull request Jun 25, 2021

bdhirsh commented Jun 25, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bdhirsh added a commit that referenced this pull request Jun 29, 2021

bdhirsh commented Jun 29, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bdhirsh requested review from ezyang and ailzhang June 29, 2021 22:48
# then we can directly match the file against each signature
# This makes regex-ing easier to deal with since clang-format usually spreads the kernel signature over multiple lines.
# (And we don't want the codegen to throw an error at you because you have extra whitespace).
# backend_defns_no_ws_str: str = ''.join(backend_defns.split())
Contributor

What's going on with the commented code here?

Contributor Author

woops, forgot to re-ghstack before adding reviewers
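
For context, the whitespace-stripping trick the commented-out `backend_defns_no_ws_str` line was going for looks like this (a minimal sketch):

```python
def strip_ws(cpp_source: str) -> str:
    # str.split() with no separator splits on any whitespace run (including
    # newlines), so re-joining the pieces deletes all whitespace in one pass.
    return ''.join(cpp_source.split())

# clang-format usually spreads a kernel signature over several lines...
sig = """at::Tensor XLANativeFunctions::add(
    const at::Tensor& self,
    const at::Tensor& other) {"""
# ...but after stripping, the signature can be matched as one contiguous string.
flat = strip_ws(sig)
```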

bdhirsh added a commit that referenced this pull request Jun 30, 2021

expected_backend_op_names: List[OperatorName] = \
    list(backend_indices[backend_key].index.keys()) + list(backend_indices[autograd_key].index.keys())
expected_backend_native_funcs: List[NativeFunction] = [f for f in native_functions if f.func.name in expected_backend_op_names]
Contributor

you sure you want to O(n^2) this? 🚨🚨🚨 quadratic police 🚨🚨🚨

Contributor Author

I might be missing it - where are you seeing the quadratic?
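
For reference, the quadratic the reviewer is pointing at is presumably the membership test: `f.func.name in expected_backend_op_names` scans a Python list, so the filter is O(len(native_functions) × len(op_names)); building a set first makes each lookup O(1). A toy sketch (simplified types, illustrative names):

```python
from typing import List, NamedTuple

class Fn(NamedTuple):
    name: str

def filter_quadratic(native_functions: List[Fn], op_names: List[str]) -> List[Fn]:
    # `in` on a list is a linear scan, repeated once per native function.
    return [f for f in native_functions if f.name in op_names]

def filter_linear(native_functions: List[Fn], op_names: List[str]) -> List[Fn]:
    expected = set(op_names)  # O(1) membership tests
    return [f for f in native_functions if f.name in expected]
```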

{class_name} is missing a kernel definition for {expected_name}. We found {actual_overload_count} kernel(s) with that name,
but expected {expected_overload_count} kernel(s). The expected function schemas for the missing operator are:
{expected_schemas_str}
""")
Contributor

send this to stderr

Contributor

though honestly my preference is to bundle this all up into a single error message and just raise that

Contributor Author

bleh yeah, should've just started with that (bundling up into a single error message)
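
A bundled-error version could look roughly like this (illustrative sketch; the function name and tuple layout are invented):

```python
def report_missing_kernels(failures):
    """Collect every missing-kernel complaint and raise once with the full
    report, instead of printing each problem to stderr as it is found."""
    errors = []
    for class_name, op_name, have, want, schemas in failures:
        if have != want:
            errors.append(
                f"{class_name} is missing a kernel definition for {op_name}. "
                f"Found {have} kernel(s), expected {want}. Expected schemas:\n  "
                + "\n  ".join(schemas))
    if errors:
        raise AssertionError("\n\n".join(errors))
```

Raising one exception with everything in it means a backend author sees the whole list of missing kernels in a single codegen run.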

def create_decl(f: NativeFunction) -> str:
    with native_function_manager(f):
        return DispatcherSignature.from_schema(f.func).decl()

expected_schemas_str = '\n'.join([create_decl(f) for f in funcs])
Contributor

no c++ type matching! but that seems OK as this wouldn't be a linker error in that case

Contributor Author

yeah exactly :) kinda the bare minimum to avoid a linker error

bdhirsh added a commit that referenced this pull request Jul 1, 2021

bdhirsh commented Jul 1, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bdhirsh added a commit that referenced this pull request Jul 6, 2021

bdhirsh commented Jul 6, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@bdhirsh merged this pull request in 8bc2ba3.

facebook-github-bot deleted the gh/bdhirsh/127/head branch July 12, 2021 14:18