
Conversation


bdhirsh (Contributor) commented Jun 25, 2021

The PR below this is helpful when there's a schema mismatch, but doesn't help you if you add a new operator to your yaml file, and completely forget to add the corresponding kernel class definition - you still get a linker error.

This PR catches those errors, but not by parsing the schema in the external backend's kernel definition. Instead, it compares the number of kernel definitions for each name with the number of overloads expected. For example, if the backend specifies that they'll write kernels for `add.Tensor` and `add.Scalar`, but only provides a single `XLANativeFunctions::add(...)` definition, we'll error out because we only saw 1 `add` kernel but we expected 2. Any variation (forgetting the `XLANativeFunctions` bit, or messing up the schema) should be caught either by this codegen check or by a compiler error, so we shouldn't end up with any linker errors.
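
A rough sketch of that counting check (a simplified illustration, not the actual codegen; `check_kernel_counts` and its regex-based counting are invented here, while the real check works off parsed `NativeFunction` objects):

```python
import re
from collections import Counter

def check_kernel_counts(expected_overloads, backend_cpp, class_name="XLANativeFunctions"):
    # add.Tensor and add.Scalar share the base name 'add', so together they
    # require two XLANativeFunctions::add(...) definitions.
    expected = Counter(op.split('.')[0] for op in expected_overloads)
    errors = []
    for name, want in expected.items():
        # Count C++ definitions of the form ClassName::name(
        have = len(re.findall(rf'{class_name}::{re.escape(name)}\s*\(', backend_cpp))
        if have != want:
            errors.append(f"saw {have} '{name}' kernel(s) but expected {want}")
    if errors:
        raise RuntimeError('; '.join(errors))
```

With this check, a backend that promised two `add` overloads but wrote one fails at codegen time instead of at link time.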

An alternative would be to scrap this PR completely and write something more ambitious, which @ezyang pointed out to me - if we parse the schemas, we can also generate glue code that would prevent xla from breaking whenever we introduce a new defaultable parameter to an existing op in pytorch. If we see a default argument in the schema, but don't see the corresponding argument in the backend schema, we can add a runtime check to ignore the defaultable argument if it's equal to its default value, and otherwise raise an error.

I didn't bother with that for now, mostly because:

  • I'd already written this when I heard about it 😛
  • As @ailzhang pointed out, that wouldn't solve all BC issues for external backends. Most of the time when people add new defaultable params, they also add new tests that test out that functionality. Those tests would still break for the external backend, and unless we have a friendly pattern for making people aware of XLA and knowing to skip the tests for XLA, we'll end up with the same issue (the pytorch/xla CI tests will fail until that test is fixed or skipped). So the burden still falls on pytorch/xla maintainers to either spread the knowledge of skipping new tests for XLA (and fixing them up later in batches), or just make a patch in pytorch/xla (which is what we do now).

Stack from ghstack:

Differential Revision: D29392615


facebook-github-bot commented Jun 25, 2021

💊 CI failures summary and remediations

As of commit ac59c65 (more details on the Dr. CI page and at hud.pytorch.org/pr/60737):


  • 3/3 failures possibly* introduced in this PR
    • 2/3 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/1)

Step: "Build"

Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/...any arguments to function call, expected 2, have 3
Jul 06 19:30:36 clang++-9 -MMD -MF /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c -c /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp -o /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/aten_xla_type.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_clang"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1002"' -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:395:21: error: no member named 'index_put_' in namespace 'torch_xla'
Jul 06 19:30:36   return torch_xla::index_put_(self, indices, values, accumulate);
Jul 06 19:30:36          ~~~~~~~~~~~^
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:455:10: error: use of undeclared identifier 'view'; did you mean 'View'?
Jul 06 19:30:36   return view(self, size);
Jul 06 19:30:36          ^
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/view.h:128:7: note: 'View' declared here
Jul 06 19:30:36 class View {
Jul 06 19:30:36       ^
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:1091:56: error: too many arguments to function call, expected 2, have 3
Jul 06 19:30:36   return torch_xla::div(self, other, /*rounding_mode=*/c10::nullopt);
Jul 06 19:30:36          ~~~~~~~~~~~~~~                                ^~~~~~~~~~~~
Jul 06 19:30:36 /var/lib/jenkins/workspace/xla/torch_xla/csrc/aten_xla_type.cpp:1090:1: note: 'div' declared here
Jul 06 19:30:36 at::Tensor div(const at::Tensor& self, const at::Tensor& other) {
Jul 06 19:30:36 ^
Jul 06 19:30:36 3 errors generated.
Jul 06 19:30:37 [19/171] clang++-9 -MMD -MF /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/init_python_bindings.o.d -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/var/lib/jenkins/workspace/xla -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-bin -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/protobuf_archive/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_protobuf/src -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/eigen_archive -I/var/lib/jenkins/workspace/xla/third_party/tensorflow/bazel-tensorflow/external/com_google_absl -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/torch/lib/tmp_install/include -I/opt/conda/lib/python3.6/site-packages/torch/include -I/opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.6/site-packages/torch/include/TH -I/opt/conda/lib/python3.6/site-packages/torch/include/THC -I/opt/conda/include/python3.6m -c -c /var/lib/jenkins/workspace/xla/torch_xla/csrc/init_python_bindings.cpp -o /var/lib/jenkins/workspace/xla/build/temp.linux-x86_64-3.6/torch_xla/csrc/init_python_bindings.o -std=c++14 -Wno-sign-compare -Wno-deprecated-declarations -Wno-return-type -Wno-macro-redefined -Wno-return-std-move -DNDEBUG -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_clang"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1002"' -DTORCH_EXTENSION_NAME=_XLAC -D_GLIBCXX_USE_CXX11_ABI=1
Jul 06 19:30:37 In file included from /var/lib/jenkins/workspace/xla/torch_xla/csrc/init_python_bindings.cpp:35:
Jul 06 19:30:37 In file included from /var/lib/jenkins/workspace/torch/csrc/jit/python/pybind.h:8:
Jul 06 19:30:37 In file included from /var/lib/jenkins/workspace/torch/csrc/THP.h:42:

ci.pytorch.org: 1 failed



bdhirsh added a commit that referenced this pull request Jun 25, 2021

bdhirsh commented Jun 25, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bdhirsh added a commit that referenced this pull request Jun 29, 2021

bdhirsh commented Jun 29, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bdhirsh requested review from ezyang and ailzhang June 29, 2021 22:48
# then we can directly match the file against each signature
# This makes regex-ing easier to deal with since clang-format usually spreads the kernel signature over multiple lines.
# (And we don't want the codegen to throw an error at you because you have extra whitespace).
# backend_defns_no_ws_str: str = ''.join(backend_defns.split())
Contributor

What's going on with the commented code here?

Contributor Author

woops, forgot to re-ghstack before adding reviewers
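
For context, the whitespace-stripping trick the commented-out `backend_defns_no_ws_str` line was going for looks like this (a minimal sketch):

```python
def strip_ws(cpp_source: str) -> str:
    # str.split() with no separator splits on any whitespace run (including
    # newlines), so re-joining the pieces deletes all whitespace in one pass.
    return ''.join(cpp_source.split())

# clang-format usually spreads a kernel signature over several lines...
sig = """at::Tensor XLANativeFunctions::add(
    const at::Tensor& self,
    const at::Tensor& other) {"""
# ...but after stripping, the signature can be matched as one contiguous string.
flat = strip_ws(sig)
```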

bdhirsh added a commit that referenced this pull request Jun 30, 2021

expected_backend_op_names: List[OperatorName] = \
    list(backend_indices[backend_key].index.keys()) + list(backend_indices[autograd_key].index.keys())
expected_backend_native_funcs: List[NativeFunction] = [f for f in native_functions if f.func.name in expected_backend_op_names]
Contributor

you sure you want to O(n^2) this? 🚨🚨🚨 quadratic police 🚨🚨🚨

Contributor Author

I might be missing it - where are you seeing the quadratic?
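
For reference, the quadratic the reviewer is pointing at is presumably the membership test: `f.func.name in expected_backend_op_names` scans a Python list, so the filter is O(len(native_functions) × len(op_names)); building a set first makes each lookup O(1). A toy sketch (simplified types, illustrative names):

```python
from typing import List, NamedTuple

class Fn(NamedTuple):
    name: str

def filter_quadratic(native_functions: List[Fn], op_names: List[str]) -> List[Fn]:
    # `in` on a list is a linear scan, repeated once per native function.
    return [f for f in native_functions if f.name in op_names]

def filter_linear(native_functions: List[Fn], op_names: List[str]) -> List[Fn]:
    expected = set(op_names)  # O(1) membership tests
    return [f for f in native_functions if f.name in expected]
```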

{class_name} is missing a kernel definition for {expected_name}. We found {actual_overload_count} kernel(s) with that name,
but expected {expected_overload_count} kernel(s). The expected function schemas for the missing operator are:
{expected_schemas_str}
""")
Contributor

send this to stderr

Contributor

though honestly my preference is to bundle this all up into a single error message and just raise that

Contributor Author

bleh yeah, should've just started with that (bundling up into a single error message)
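
A bundled-error version could look roughly like this (illustrative sketch; the function name and tuple layout are invented):

```python
def report_missing_kernels(failures):
    """Collect every missing-kernel complaint and raise once with the full
    report, instead of printing each problem to stderr as it is found."""
    errors = []
    for class_name, op_name, have, want, schemas in failures:
        if have != want:
            errors.append(
                f"{class_name} is missing a kernel definition for {op_name}. "
                f"Found {have} kernel(s), expected {want}. Expected schemas:\n  "
                + "\n  ".join(schemas))
    if errors:
        raise AssertionError("\n\n".join(errors))
```

Raising one exception with everything in it means a backend author sees the whole list of missing kernels in a single codegen run.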

def create_decl(f: NativeFunction) -> str:
    with native_function_manager(f):
        return DispatcherSignature.from_schema(f.func).decl()

expected_schemas_str = '\n'.join([create_decl(f) for f in funcs])
Contributor

no c++ type matching! but that seems OK as this wouldn't be a linker error in that case

Contributor Author

yeah exactly :) kinda the bare minimum to avoid a linker error

bdhirsh added a commit that referenced this pull request Jul 1, 2021

bdhirsh commented Jul 1, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bdhirsh added a commit that referenced this pull request Jul 6, 2021

bdhirsh commented Jul 6, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@bdhirsh merged this pull request in 8bc2ba3.

facebook-github-bot deleted the gh/bdhirsh/127/head branch July 12, 2021 14:18