
ATen operator API versioning #38973

Open
fengyuan14 opened this issue May 25, 2020 · 10 comments
Labels
module: internals (Related to internal abstractions in c10 and ATen)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

fengyuan14 (Collaborator) commented May 25, 2020

🚀 Feature

When implementing a new out-of-source ATen backend extension for PyTorch, we find that ATen operator APIs are incompatible from version to version (even across minor versions, e.g. within the v1.5.x series).

We would like ATen operator API versioning to be provided to improve the user experience when an out-of-source extension does not match the installed PyTorch (i.e. its ATen operator API).

Motivation

End users may get runtime errors about ATen operator API mismatches when they try different PyTorch minor versions with a given Intel Extension for PyTorch. For example:

  • Extension v0.1 is based on PyTorch v1.5.0.
  • An end user may run extension v0.1 on PyTorch v1.5.3+ and get an ATen runtime error due to ATen operator API changes.

In addition, different workloads may hit different ATen runtime errors (different operator API changes). An ATen runtime error is informative enough for extension developers, but not friendly enough for end users.

So, intuitively, we want to raise a single warning up front at runtime if any ATen operator APIs have changed; this is friendlier to users and should not introduce risk.

Pitch

We would like ATen operator API versioning that can be checked at runtime, raising a warning at extension loading time if the PyTorch ATen operator API version is not supported by the extension.

P.S. We considered checking only the PyTorch version, but it would take a huge effort to investigate ATen operator API changes across all PyTorch versions (including all minor versions).
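
To make the pitch concrete, here is a minimal sketch of what such a check could look like at extension load time. Both the runtime version value handed over by PyTorch and the extension-side constant are hypothetical; no such mechanism exists in PyTorch today.

```cpp
#include <c10/util/Exception.h>  // TORCH_WARN

// Hypothetical: the ATen operator API version this extension was built
// against. The name and the whole mechanism are made up for illustration.
constexpr int kSupportedATenOpApiVersion = 5;

// Called once when the extension shared library is loaded (e.g. from the
// module's PyInit_* entry point). `runtime_version` would be whatever
// version value a future PyTorch exposes at runtime.
void check_aten_op_api_version(int runtime_version) {
  if (runtime_version != kSupportedATenOpApiVersion) {
    TORCH_WARN(
        "This extension was built against ATen operator API version ",
        kSupportedATenOpApiVersion,
        " but the installed PyTorch reports version ",
        runtime_version,
        "; operators may fail to register or dispatch.");
  }
}
```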

ezyang added the module: internals and triaged labels on May 26, 2020
ezyang (Contributor) commented May 26, 2020

A few thoughts:

  • One big missing piece here is that when a user-level BC-compatible change is made (e.g., adding a new optional parameter), it hard-breaks backend extensions. Ideally, in this situation, we would automatically introduce a compatibility wrapper that simply checks whether the optional parameter is set to its default value and raises an error if it's not (since of course the old operator doesn't support this new option). Ideally we'd also raise an (optional?) warning so that extension writers could easily tell what they need to update. I don't think it's reasonable to try to do something for BC-breaking changes though: user code needed to change, so of course extension code is going to need to change too.
  • The other problem is that many operators are "leaky", in the sense that they were originally devised as low level implementation details, and not intended to be exposed as public, and ought not be exposed to extension writers either. If these operators end up being extension points you're also likely to get broken by them. "Fortunately", usually breaking these ops also has BC implications for serialized TorchScript models, so we're likely to be more conservative with them.
  • I understand it's really desirable for C++ extensions to be able to modify their code to adjust for API changes. We really need to start publishing a version macro so extensions can maintain one codebase for multiple PyTorch versions.
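
To illustrate the version-macro point, here is a rough sketch of how an extension could keep one codebase across a schema change such as the aten::_embedding_bag change used as an example later in this thread. TORCH_VERSION_MAJOR/TORCH_VERSION_MINOR and the helper my_embedding_bag_impl are assumed names (no such macros were published at the time of this issue), and the signatures are simplified.

```cpp
// Assumed version macros: PyTorch did not publish a version header when this
// issue was filed, so treat the include and macro names as placeholders.
#include <torch/version.h>  // hypothetical: TORCH_VERSION_MAJOR / TORCH_VERSION_MINOR

#include <ATen/ATen.h>
#include <tuple>

// Hypothetical backend kernel implementing the pre-1.5 behaviour.
std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> my_embedding_bag_impl(
    const at::Tensor& weight, const at::Tensor& indices, const at::Tensor& offsets,
    bool scale_grad_by_freq, int64_t mode, bool sparse,
    const at::Tensor& per_sample_weights);

#if TORCH_VERSION_MAJOR > 1 || (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 5)
// PyTorch >= 1.5 schema: `include_last_offset` was added. Reject the new
// option instead of silently ignoring it (the "compatibility wrapper" idea
// from the first bullet above).
std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> my_embedding_bag(
    const at::Tensor& weight, const at::Tensor& indices, const at::Tensor& offsets,
    bool scale_grad_by_freq, int64_t mode, bool sparse,
    const at::Tensor& per_sample_weights, bool include_last_offset) {
  TORCH_CHECK(!include_last_offset,
              "include_last_offset=True is not supported by this extension");
  return my_embedding_bag_impl(weight, indices, offsets, scale_grad_by_freq,
                               mode, sparse, per_sample_weights);
}
#else
// Pre-1.5 schema.
std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> my_embedding_bag(
    const at::Tensor& weight, const at::Tensor& indices, const at::Tensor& offsets,
    bool scale_grad_by_freq, int64_t mode, bool sparse,
    const at::Tensor& per_sample_weights) {
  return my_embedding_bag_impl(weight, indices, offsets, scale_grad_by_freq,
                               mode, sparse, per_sample_weights);
}
#endif
```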

Also cc @ailzhang for some perspective from XLA.

fengyuan14 (Collaborator, Author) commented:

Thanks, @ezyang,

Let me clarify my understanding. Actually, we have two issues:

  1. how to work with (be compatible with) multiple PyTorch versions;
  2. how to provide an elegant exit for end users if the versions are not compatible.

Regarding issue 1, we understand it is hard to provide a fully compatible solution for end users; otherwise either we would have to spend a lot of effort maintaining several extension versions, or PyTorch would have to provide a compatible API. In my mind, issue 1 is a longer-term discussion.

So we hope that, at the current stage, issue 2 can be solved. We think there should be a single clear, high-level warning shown to the end user, rather than varied, detailed warnings for different ATen ops.

A detailed warning is good enough for developers in debug mode, but it is not clear to end users in release mode.
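
One possible interim workaround for issue 2, sketched under the assumption that the extension funnels all of its ATen operator registrations through a single entry point (register_all_ops() is a made-up name): catch the registration failure once and surface a single high-level message instead of the per-operator schema error.

```cpp
#include <c10/util/Exception.h>

// Hypothetical entry point performing all of the extension's ATen operator
// registrations (name made up for illustration).
void register_all_ops();

// Interim workaround for issue 2: turn per-operator schema-mismatch errors
// into one clear, high-level message for end users.
void load_extension_ops() {
  try {
    register_all_ops();
  } catch (const c10::Error& e) {
    TORCH_WARN(
        "Operator registration failed: this extension was built for a "
        "different PyTorch version and is likely incompatible with the "
        "installed one. Original error:\n", e.what());
  }
}
```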

ailzhang (Contributor) commented:

Yeah, I agree that it's probably too much to push for compatibility across multiple PyTorch versions for now (although it's a good long-term goal), and a compatibility check/warning might be good enough and feasible.

From XLA's experience, though, C++ API-level changes mostly show up as compile errors (like function signature changes). @arthuryuan1987, can you provide a few examples of runtime errors due to API changes as well? Thanks!

ezyang (Contributor) commented May 27, 2020

The error message trick in #38739 might be relevant here.

fengyuan14 (Collaborator, Author) commented:

We think the ATen operator API includes two parts: the operator signature and the operator dispatch strategy. Let me show the error logs for each separately.

  1. Operator signature mismatch (at::_embedding_bag)
    bool include_last_offset=False was added in PyTorch v1.5. If we blindly use an extension built against PyTorch v1.4, we get the following when loading the extension:
ImportError: Tried to register multiple operators with the same name and the same overload name but different schemas: aten::_embedding_bag(Tensor weight, Tensor indices, Tensor offsets, bool scale_grad_by_freq=False, int mode=0, bool sparse=False, Tensor? per_sample_weights=None) -> (Tensor, Tensor, Tensor, Tensor) vs aten::_embedding_bag(Tensor weight, Tensor indices, Tensor offsets, bool scale_grad_by_freq=False, int mode=0, bool sparse=False, Tensor? per_sample_weights=None, bool include_last_offset=False) -> (Tensor, Tensor, Tensor, Tensor) (findOrRegisterSchema_ at /home/fengyuan/workspace/pytorch/pytorch-extension/aten/src/ATen/core/dispatch/Dispatcher.cpp:64)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7fb4fe19c06c in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Dispatcher::findOrRegisterSchema_(c10::FunctionSchema&&) + 0x1a7 (0x7fb4f908ac77 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10::Dispatcher::registerSchema(c10::FunctionSchema) + 0x9e (0x7fb4f908bcee in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x7fff27 (0x7fb4f90b9f27 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::RegisterOperators::registerSchemaAndKernel_(c10::FunctionSchema, c10::RegisterOperators::Options::KernelRegistrationConfig&&) + 0xe3 (0x7fb4f90b1ad3 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10::RegisterOperators::registerOp_(c10::RegisterOperators::Options&&) + 0xaf6 (0x7fb4f90b2926 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10::RegisterOperators::checkSchemaAndRegisterOp_(c10::RegisterOperators::Options&&) + 0x97d (0x7fb4f90b53ed in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::RegisterAtenTypeFunctions() + 0x9d6 (0x7fb4d111aee6 in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch_ipex-0.1-py3.7-linux-x86_64.egg/_torch_ipex.so)
frame #8: PyInit__torch_ipex + 0x11e (0x7fb4d10eea6e in /home/fengyuan/pyenv/py3-dev/lib/python3.7/site-packages/torch_ipex-0.1-py3.7-linux-x86_64.egg/_torch_ipex.so)
<omitting python frames>
frame #55: __libc_start_main + 0xe7 (0x7fb501b15b97 in /lib/x86_64-linux-gnu/libc.so.6)

Of course, if we rebase the extension to support PyTorch v1.5, we regenerate the registration code automatically and get a compilation error from the mismatch between the generated code and our native implementation. That is a separate topic (development/debugging mode). What we want to discuss here is that the error log above is confusing for end users (release mode).

  2. Operator dispatch strategy changes (at::tanh)
RuntimeError: Could not run 'aten::tanh' with arguments from the 'XXXTensorId' backend. 'aten::tanh' is only available for these backends: [CPUTensorId, QuantizedCPUTensorId, VariableTensorId].

We think an API version warning would be clearer to end users.

ailzhang (Contributor) commented:

@arthuryuan1987 I see.
For 1), this is actually one reason why we release torch_xla packages together with a corresponding torch package (on Colab and in Docker images).
For 2), XLA's workaround is to have a fallback implementation of every op and register all ops for the backend, which avoids hitting "no aten::tanh is available for the XX backend". These fallbacks are auto-generated from RegistrationDeclarations.h and sit in the generated torch_xla/csrc/aten_xla_type_default.cpp.
Your case might be slightly different from XLA's; I just want to provide some context in case it's helpful.
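
For reference, a hand-written version of that fallback idea for a single op might look like the sketch below. It uses the newer TORCH_LIBRARY_IMPL registration API (not the codegen approach XLA actually uses) and the PrivateUse1 dispatch key as a stand-in for an out-of-tree backend's key.

```cpp
#include <torch/library.h>
#include <ATen/ATen.h>

// Fallback kernel: move the input to CPU, run the stock CPU kernel, and move
// the result back, so callers never hit "Could not run 'aten::tanh' ...".
at::Tensor fallback_tanh(const at::Tensor& self) {
  at::Tensor cpu_result = at::tanh(self.to(at::kCPU));
  return cpu_result.to(self.device());
}

// PrivateUse1 stands in for the extension's backend dispatch key; XLA's real
// fallbacks are auto-generated for every op rather than written by hand.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("tanh", fallback_tanh);
}
```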

ezyang (Contributor) commented May 28, 2020

@arthuryuan1987 I imagine there are some simple rewordings of these error messages that could make things clearer for users. Do you want to submit a PR doing this? Add me as a reviewer.

fengyuan14 (Collaborator, Author) commented:

@ailzhang, for 1), releasing our extension packages together with a corresponding torch package might be too heavy for us. I wonder, do you release a torch_xla package only when ATen operator APIs change? Suppose ATen operator API changes land in PyTorch 1.5.1, 1.5.2, 1.5.4, and 1.5.8; would you then release torch_xla for each of those minor versions? If so, I think that is too heavy for us. For 2), I think being compatible is a good idea.

@ezyang, as you can see, the call stack (in 1) above) may be too wordy for end users. In addition, end users will get different call stacks if there are several ATen operator API changes. Yes, I can submit a PR.

ezyang (Contributor) commented Jun 1, 2020

Being able to release a single package for multiple minor versions of ATen is going to be a hard path to go down. We historically have made ZERO ABI compatibility guarantees, even across minor versions, and infrastructurally speaking we're not set up to do this in the future. If it makes you feel better, we don't release minor versions that often, so it essentially comes down to doing major version releases.

fengyuan14 (Collaborator, Author) commented:

Agreed with what you said about compatibility among minor versions. If ATen API compatibility only breaks on major version releases, that will be fine for backend extensions. We always release extension packages separately for each major version (PyTorch v1.4, v1.5, v1.6).
