detect missing kernels from external backends in codegen #60737
Conversation
💊 CI failures summary and remediations. As of commit ac59c65 (more details on the Dr. CI page and at hud.pytorch.org/pr/60737):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
tools/codegen/gen_backend_stubs.py
Outdated
# then we can directly match the file against each signature
# This makes regex-ing easier to deal with since clang-format usually spreads the kernel signature over multiple lines.
# (And we don't want the codegen to throw an error at you because you have extra whitespace).
# backend_defns_no_ws_str: str = ''.join(backend_defns.split())
What's going on with the commented code here?
woops, forgot to re-ghstack before adding reviewers
The PR below this is helpful when there's a schema mismatch, but it doesn't help you if you add a new operator to your yaml file and completely forget to add the corresponding kernel class definition - you still get a linker error.

This PR catches those errors, but not by parsing the schema in the external backend's kernel definition. Instead, it compares the number of kernel names it finds against the number it expects. For example, if the backend specifies that it will write kernels for `add.Tensor` and `add.Scalar`, but only provides a single `XLANativeFunctions::add(...)` definition, we'll error out because we only saw 1 `add` kernel but expected 2. Any variation (forgetting the `XLANativeFunctions` bit, or messing up the schema) should be caught either by this codegen check or by a compiler error, so we shouldn't end up with any linker errors.

An alternative would be to scrap this PR completely and write something more ambitious, which @ezyang pointed out to me - if we parse the schemas, we can also generate glue code that would prevent xla from breaking whenever we introduce a new defaultable parameter to an existing op in pytorch. If we see a default argument in the schema, but don't see the corresponding argument in the backend schema, we can add a runtime check to ignore the defaultable argument if it's equal to its default value, and otherwise raise an error.

I didn't bother with that for now, mostly because:

- I'd already written this when I heard about it 😛
- As @ailzhang pointed out, that wouldn't solve all BC issues for external backends. Most of the time when people add new defaultable params, they also add new tests that exercise that functionality. Those tests would still break for the external backend, and unless we have a friendly pattern for making people aware of XLA and knowing to skip the tests for XLA, we'll end up with the same issue (the pytorch/xla CI tests will fail until that test is fixed or skipped). So the burden still falls on pytorch/xla maintainers to either spread the knowledge of skipping new tests for XLA (and fixing them up later in batches), or just make a patch in pytorch/xla (which is what we do now).

Differential Revision: [D29392615](https://our.internmc.facebook.com/intern/diff/D29392615)
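A minimal sketch of the counting check described above, assuming the backend's kernel definitions are read into a single string; the helper name `count_missing_kernels` and its arguments are illustrative, not the actual `gen_backend_stubs.py` code:

```python
import re
from collections import Counter
from typing import Dict, List

def count_missing_kernels(backend_defns: str,
                          expected_kernel_names: List[str],
                          class_name: str = "XLANativeFunctions") -> Dict[str, int]:
    # Strip all whitespace so signatures that clang-format has spread over
    # several lines still match a simple regex.
    defns_no_ws = ''.join(backend_defns.split())
    # add.Tensor and add.Scalar both contribute to the expected count for 'add'.
    expected_counts = Counter(expected_kernel_names)
    missing: Dict[str, int] = {}
    for name, expected in expected_counts.items():
        # Only look for "<ClassName>::<name>(" - the bare minimum needed to avoid a linker error.
        actual = len(re.findall(rf'{re.escape(class_name)}::{re.escape(name)}\(', defns_no_ws))
        if actual < expected:
            missing[name] = expected - actual
    return missing
```

For the example above, calling this with `expected_kernel_names=['add', 'add']` (one entry per overload) against a file that only defines a single `XLANativeFunctions::add(...)` would report one missing `add` kernel.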
expected_backend_op_names: List[OperatorName] = \
    list(backend_indices[backend_key].index.keys()) + list(backend_indices[autograd_key].index.keys())
expected_backend_native_funcs: List[NativeFunction] = [f for f in native_functions if f.func.name in expected_backend_op_names]
you sure you want to O(n^2) this? 🚨🚨🚨 quadratic police 🚨🚨🚨
I might be missing it - where are you seeing the quadratic?
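The quadratic being flagged is presumably the `f.func.name in expected_backend_op_names` test, which scans a Python list once per native function. A minimal sketch of the linear alternative, reusing the variable names from the snippet above (illustrative only, not the code as merged):

```python
# Build a set once so each membership test is O(1) instead of O(len(op_names)).
expected_backend_op_names = set(backend_indices[backend_key].index.keys()) | \
    set(backend_indices[autograd_key].index.keys())
expected_backend_native_funcs = [
    f for f in native_functions if f.func.name in expected_backend_op_names
]
```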
tools/codegen/gen_backend_stubs.py
Outdated
{class_name} is missing a kernel definition for {expected_name}. We found {actual_overload_count} kernel(s) with that name,
but expected {expected_overload_count} kernel(s). The expected function schemas for the missing operator are:
{expected_schemas_str}
""")
send this to stderr
though honestly my preference is to bundle this all up into a single error message and just raise that
bleh yeah, should've just started with that (bundling up into a single error message)
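A minimal sketch of that suggestion (collect everything, then raise once at the end); the function and argument names here are hypothetical, not the code as merged:

```python
from typing import Dict, List

def report_missing_kernels(expected_counts: Dict[str, int],
                           actual_counts: Dict[str, int],
                           class_name: str = "XLANativeFunctions") -> None:
    # Accumulate one message per missing operator instead of printing to stderr
    # or erroring out as soon as the first problem is found.
    errors: List[str] = []
    for name, expected in expected_counts.items():
        actual = actual_counts.get(name, 0)
        if actual != expected:
            errors.append(
                f"{class_name} is missing a kernel definition for {name}. "
                f"Found {actual} kernel(s) with that name, but expected {expected} kernel(s).")
    if errors:
        # Raise a single bundled error covering every missing kernel.
        raise AssertionError("\n\n".join(errors))
```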
def create_decl(f: NativeFunction) -> str:
    with native_function_manager(f):
        return DispatcherSignature.from_schema(f.func).decl()

expected_schemas_str = '\n'.join([create_decl(f) for f in funcs])
no c++ type matching! but that seems OK as this wouldn't be a linker error in that case
yeah exactly :) kinda the bare minimum to avoid a linker error