Quantization - we need a better solution for tracking quantization backend settings in a model #46749
Comments
These are good ideas. We will also need to emphasize that a quantized model created for one backend can be used on another. This is currently done by setting the backend prior to loading the model. We will need to explain how a model can be transferred to a different backend. Perhaps a `model.to_backend()` sort of API would be useful?
Doesn't that require rerunning the whole quantization flow?
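A `to_backend()`-style method could look something like the sketch below. This is purely hypothetical (no such method exists in PyTorch today; the class, engine names, and re-packing step are all assumptions for illustration): each quantized module would re-pack its weights for the requested engine, so the model no longer depends on a global flag.

```python
# Hypothetical sketch of a `model.to_backend()` API, not an existing
# PyTorch method. `QuantizedModel` and its attributes are invented
# for illustration only.
class QuantizedModel:
    SUPPORTED_ENGINES = ("fbgemm", "qnnpack")

    def __init__(self, engine="fbgemm"):
        self.engine = engine

    def to_backend(self, engine):
        if engine not in self.SUPPORTED_ENGINES:
            raise ValueError(f"unknown quantized engine: {engine}")
        # A real implementation would unpack and re-pack every
        # ConvPackedParams / LinearPackedParams for the new engine here,
        # rather than just recording the name.
        self.engine = engine
        return self

m = QuantizedModel().to_backend("qnnpack")
```

The point of the sketch is only the shape of the API: transferring a model would be a repack of existing weights, not a rerun of the whole quantization flow.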
Sounds good, I think we should be more explicit about the backend. However, I'm also thinking about why we used a global flag in the first place, and I think we should address the concerns that led to that decision as well. One thing I can think of is that we currently use this global flag to decide which kernels to run, e.g.: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qclamp.cpp#L89. We'll need to think about where to put this flag; it feels like it should be an attribute of Tensor, since we need to dispatch on it.
Right. I believe that was one of the main motivations. We would need an additional "backend" in dispatch to distinguish between FBGEMM/QNNPACK and any other potential future optimized backends we want to run on. The idea was that we would use the QuantizedCPU backend and switch between different architecture kernels (server/mobile) based on the global runtime flag. Another use case where this appears in the PT codebase is the MKL-DNN backend, which is also controlled by a global flag: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Convolution.cpp#L233
We could break this down into separate dependent issues:
cc @dreiss
This might be annoying to deal with because (1) freezing could delete it, and (2) it would no longer be visible if you put your quantized model inside of another nn.Module (which we sometimes do when exporting a model for production).
This would be a significant change. I think we should try to avoid it if possible.
This is a good point, but choosing the "right" backend for qclamp is less important because FBGEMM and QNNPACK are numerically equivalent. If I understand correctly, the core of the problem here is that FBGEMM has a known and accepted numerical inaccuracy in conv (and presumably also linear), which is worked around by using reduce_range. If these are the only operators that have problems, then we might be able to take advantage of the fact that they both use prepacking. If we set the flag on the prepacked weights, rather than the model itself, that works around both the freezing and wrapping issues and doesn't need a new dispatch key. What do people think of this proposal: could we add a boolean to the prepacked weights?
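To make the reduce_range workaround concrete, here is a simplified pure-Python illustration of what the flag does to the quantized range (the helper function is my own simplification, not the actual observer code): with reduce_range=True, one bit is dropped so activations stay within 7 bits, which avoids the accumulation overflow in FBGEMM's int8 kernels; QNNPACK keeps the full 8 bits.

```python
# Simplified illustration of reduce_range (not the actual PyTorch
# observer implementation): dropping one bit halves the usable
# quantized range.
def quant_range(dtype_bits=8, unsigned=True, reduce_range=False):
    bits = dtype_bits - 1 if reduce_range else dtype_bits
    if unsigned:
        return 0, (1 << bits) - 1
    half = 1 << (bits - 1)
    return -half, half - 1

print(quant_range(reduce_range=False))  # (0, 255): full quint8 range
print(quant_range(reduce_range=True))   # (0, 127): fbgemm-safe 7-bit range
```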
Are there implications for performance as well between the different backends? E.g., on servers, is fbgemm always faster than qnnpack? Also, what about the availability of backends in different environments? Is it true that qnnpack is available on all platforms and fbgemm is only available on servers?
In my experience, yes, fbgemm is always faster than qnnpack when it is available. I believe this is partially due to the fact that fbgemm uses SIMD instructions that are not available on mobile x86 processors.
More-or-less yes. I believe FBGEMM is available on x86_64 only (server and most desktop/laptop environments). qnnpack is available on arm32, arm64, x86, x86_64, and we include it in both our mobile and server builds.
In that case, how do we decide which backend to use for other operators? I think there are two ways:

Proposal 1: What is described in your proposal: always set to qnnpack (with no user API for modifying the quantized engine). There would be no numeric issues, but perf might be a problem when fbgemm is available and the user wants to use it.

Proposal 2: We set the quantized engine based on backend availability plus performance. That is: when fbgemm is available, use fbgemm; otherwise use qnnpack, and don't expose an API for user modifications. Although there might be a problem when the user actually wants to use qnnpack on a server even when fbgemm exists, e.g. to verify the numerics of a qnnpack quantized model on a server before exporting to mobile.

---- edit ---
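The selection rule in Proposal 2 can be sketched in a few lines. This is an illustrative stand-alone function, not PyTorch code; in practice the supported-engine list would come from `torch.backends.quantized.supported_engines`:

```python
# Sketch of Proposal 2's engine-selection policy: prefer fbgemm when
# available, otherwise fall back to qnnpack. Illustrative only.
def default_engine(supported_engines):
    if "fbgemm" in supported_engines:
        return "fbgemm"   # server / x86_64: faster when available
    if "qnnpack" in supported_engines:
        return "qnnpack"  # mobile and other non-x86_64 builds
    raise RuntimeError("no quantized engine available")

print(default_engine(["fbgemm", "qnnpack"]))  # fbgemm on a server build
print(default_engine(["qnnpack"]))            # qnnpack on a mobile build
```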
This is not quite right. My proposal is to always set to qnnpack only for prepacked objects that have been tagged as requiring the full numerical range.
I think we can still support this. The logic would basically be:
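The logic referenced above was elided from the thread; a guess at its shape, written as a stand-alone hypothetical sketch (not the actual kernel code), would be: a prepacked weight tagged as needing the full 8-bit input range is only safe on qnnpack, and everything else runs on whatever engine is selected.

```python
# Hypothetical dispatch sketch: force qnnpack for prepacked params
# tagged as requiring the full numerical range, otherwise use the
# selected engine. Names are invented for illustration.
def pick_kernel(engine, requires_full_range, qnnpack_available=True):
    if requires_full_range:
        if not qnnpack_available:
            raise RuntimeError(
                "model requires full-range inputs but qnnpack is not built in")
        return "qnnpack"
    return engine

print(pick_kernel("fbgemm", requires_full_range=False))  # fbgemm
print(pick_kernel("fbgemm", requires_full_range=True))   # qnnpack
```

Under this sketch, a user explicitly requesting qnnpack is still supported: the global engine choice is honored whenever the tag does not force a fallback.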
@dreiss: Wanted to check if my understanding is correct:
4. Ideally, quantized models should be backend-independent, and the user can decide which backend to use at run-time.
Then the question is: what is the API for people to explicitly request qnnpack?
Yes.
I don't think this capability is necessary for production use cases, but it might be helpful for debugging.
Like, if you train a model without reduce_range on an op that qnnpack doesn't support? I suppose in that case we'd have to fail, since we have no way to run that op.
Yes, I agree with this, but if you train a model without reduce_range, it's simply not safe to run it with fbgemm.
I think the current one is fine. If we assume that it's only used in serious debugging situations, it doesn't need to be super friendly.
Commenting here to give a user perspective, as this has bitten me twice in the past couple of weeks:

1. User forgets to set the same backend during evaluation as during export.
2. Loading and saving a model increases the size if the backend is different.

I'm not following this discussion that well, but I think
would still have resulted in the above errors, as our errors were due to users not explicitly setting the backend to qnnpack (and fbgemm would be available on server). TL;DR: it would be nice to save some metadata in the TorchScript model indicating which backend was set during quantization (or saving), and to raise a warning when the model is loaded if a different backend is set. However, if this is difficult to implement or leads to undesired effects, we can probably just save some metadata ourselves and make sure to pass it around when we share models.
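The metadata check suggested here could be as small as the following sketch. It is a hypothetical helper, not an existing TorchScript feature; the metadata key name is invented:

```python
import warnings

# Hypothetical helper: record the engine at save time, warn at load
# time if the currently active engine differs.
def check_engine(saved_metadata, current_engine):
    saved = saved_metadata.get("quantized_engine")
    if saved is not None and saved != current_engine:
        warnings.warn(
            f"model was quantized with engine '{saved}' but the current "
            f"engine is '{current_engine}'; results may differ")

check_engine({"quantized_engine": "qnnpack"}, "fbgemm")  # warns
check_engine({"quantized_engine": "fbgemm"}, "fbgemm")   # silent
```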
I don't think this is correct. If you change your notebook to use fbgemm all the way through, then even the first saved model has the larger size. I think this is due to a separate issue that we could address separately if it's a problem.

Here's what I think is going on with model size: these quantized convolutions store their weights in memory in a "packed" form for fast execution. However, the packing format is hardware-dependent, so we need to "unpack" the weights when serializing. fbgemm and qnnpack take different approaches here.

When packing weights for qnnpack, we hold on to references to the original unpacked tensors. When "unpacking", we just return those original tensors. Therefore, any aliasing between multiple convolutions will be preserved. The major downside of this approach is that we store two copies of the weights in memory (packed and unpacked). (To avoid excessive memory usage on mobile, we free the unpacked weights, so it's not possible to save a quantized model on mobile.)

Conversely, fbgemm supports unpacking the packed weights numerically, so it never stores a duplicate copy in memory. But when we serialize an fbgemm qconv, it constructs a fresh unpacked weight tensor every time. So multiple separate qconvs that share the same weight (or even the same qconv asked to serialize itself multiple times, which might be what's going on here) will return unrelated tensors for serialization, and it becomes impossible for the serializer to dedup them.
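The dedup consequence described above can be simulated in a few lines of plain Python (a toy model of the serializer, not real PyTorch code): a serializer that dedups by object identity keeps one copy when unpacking returns the original objects (qnnpack-style) but cannot dedup fresh copies produced per unpack call (fbgemm-style).

```python
# Toy simulation of serializer dedup by object identity.
def serialized_count(unpacked_weights):
    return len({id(w) for w in unpacked_weights})

shared = [1.0, 2.0]                          # stand-in for one weight tensor
qnnpack_style = [shared, shared]             # both convs return the same object
fbgemm_style = [list(shared), list(shared)]  # fresh copy per unpack call

print(serialized_count(qnnpack_style))  # 1: sharing survives, dedup works
print(serialized_count(fbgemm_style))   # 2: duplicate weights in the file
```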
Yup, that was unclear; it should have been something like "loading a model saved using the qnnpack engine and then saving with fbgemm increased model size". Can confirm that model size remains the same with fbgemm (but is larger than with qnnpack).
This actually sounds like a bug. We should just fix it and preserve aliasing through the packing/unpacking cycle. One way to do it is to preserve the original tensor during load but resize it to 0 (to free up memory while running), and then restore it only once when packing back (so that only the first instance of packed weight would populate it). That way the save/load cycle would still preserve sharing. Or, given that sharing is pretty rare, we can also just not release memory when it is shared (we can check Storage.use_count for that).

On a meta level, the original design intention from last year was to hide backend differences from users. That's a much better UX; e.g., people generally don't think about whether their gemms run with MKL/fbgemm/cublas/etc. I see two issues listed:
(1) is indeed the real issue. Can we solve it by detecting whether the weights matrix has >7-bit numbers and auto-switching to qnnpack? Have we explored that in detail? (2) is probably just a bug to be fixed as described above; the actual serialization format is engine-independent today. Are there other issues we see? Creating separate APIs per engine is a one-way change, and it would be hard to undo in the future. So we should be convinced that we don't really have a way to abstract out fbgemm/qnnpack transparently before we go that route.
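The auto-detection idea for (1) could be sketched as a simple range check (a hypothetical helper over already-dequantized integer values, invented for illustration): values outside the reduced 7-bit range imply the model was calibrated without reduce_range and is only safe on qnnpack.

```python
# Hypothetical check: do any quantized values exceed the reduced
# 7-bit range [0, 127] that fbgemm's conv/linear kernels assume?
def needs_full_range(quantized_values):
    return any(v < 0 or v > 127 for v in quantized_values)

print(needs_full_range([0, 64, 127]))   # False: fbgemm-safe
print(needs_full_range([0, 200, 255]))  # True: qnnpack only
```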
I'm confused. Do you mean resize it destructively? Wouldn't that break if you do
Wouldn't this be one if you make 2 packs with the same weight?
Sorry, I was just thinking about the loading part. You're right that it should work correctly even when pack is called several times one after another.

I tried to poke at the sharing behavior more. I actually don't think we should fix it at the PackedParams level, because it semantically behaves like a "tensor". And as you point out, it's pretty hard to detect when the second pack happens that the first one should be reused. I extended @mattcyu1's example from N386754 a bit and made it public: https://colab.research.google.com/drive/1qWUq6PQ4fj67TBbD3xg_BvYJve9E91AZ?usp=sharing

Basically, you see that the difference comes from the fact that repeated jit.trace of a quantized model creates different instances of PackedWeights, but if it's a regular tensor, the instance comes out shared. That seems to be the root cause. In this case we should have just created a single PackedWeights shared across module instances. Then we don't need to worry about engines or anything within, because the sharing occurs at a higher level. (cc @suo to double-check my understanding)

@dreiss - I like your described idea of remembering the setting of reduce_range inside the module. We can even bake it inside PackedParams; that way it won't be eaten up by freezing for sure. Then we can either:
Can we move the size discussion to #45465?
Yes, this is what I was proposing.
Yes.
Thanks for the discussion and the additional context, everyone. I'm not focusing on the size discussion in this comment, since it was moved to a separate issue. My takeaways: assume that a quantization pass generates M output nodes, and there are N target backends (fbgemm and qnnpack today; more may be added in the future). Each of the M output nodes can:
Proposed next steps:
… conv"

Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this:

1. If a conv module is unsafe to run in fbgemm, we now persist this information with a `safe_for_fbgemm` boolean flag stored on `ConvPackedParams{n}d`.
2. If we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error.

For now, this PR is a WIP to get some early feedback on whether this is the right direction, since iteration cost on this is high. In particular, missing things here are:

* testing serialization of saving v3 and loading it back
* getting all the conv callsites (currently just module + conv2d is handled)

Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve that issue; it can be solved by future PRs.

Test Plan:
```
# test that the error gets thrown when we are trying to run an operation which could
# saturate, and does not get thrown otherwise
python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range
# test that loading older versions of conv packed params works as expected
# TODO(before land): extend these tests with the v3 files
python test/test_quantization.py TestSerialization
```

Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285)

[ghstack-poisoned]
…128 flag on quantized conv"

Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this:

1. If a conv module is unsafe to run in fbgemm, we now persist this information with an `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d`, set to False.
2. If we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error.

For now, this PR is a WIP to get some early feedback on whether this is the right direction, since iteration cost on this is high. In particular, missing things here are:

* testing serialization of saving v3 and loading it back
* getting all the conv callsites (currently just module + conv2d is handled)

Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve that issue; it can be solved by future PRs.

Test Plan:
```
# test that the error gets thrown when we are trying to run an operation which could
# saturate, and does not get thrown otherwise
python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range
# test that loading older versions of conv packed params works as expected
# TODO(before land): extend these tests with the v3 files
python test/test_quantization.py TestSerialization
```

Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285)

[ghstack-poisoned]
Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this:

1. If a conv module is unsafe to run in fbgemm, we now persist this information.
2. If we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error.

For now, this PR is a WIP to get some early feedback on whether this is the right direction, since iteration cost on this is high. In particular, missing things here are:

* better unit testing
* serialization, verifying that this is BC
* getting all the conv callsites (currently just module + conv2d is handled)

Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve that issue; it can be solved by future PRs.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range
```

ghstack-source-id: f252b21f284d6e80cb94f5bf4eb0e299892d3f93
Pull Request resolved: #59984
…128 flag on quantized conv" Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information with a `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d` set to False. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * testing serialization of saving v3 and loading it back * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` # test that the error gets thrown when we are trying to run an operation which could # saturate, and does not get thrown otherwise python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range # test that loading older versions of conv packed params works as expected # TODO(before land): extend these tests with the v3 files python test/test_quantization.py TestSerialization ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285) [ghstack-poisoned]
…tized conv" Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information with a `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d` set to False. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * testing serialization of saving v3 and loading it back * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` # test that the error gets thrown when we are trying to run an operation which could # saturate, and does not get thrown otherwise python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range # test that loading older versions of conv packed params works as expected # TODO(before land): extend these tests with the v3 files python test/test_quantization.py TestSerialization ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285) [ghstack-poisoned]
Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * better unit testing * serialization, verifying that this is BC * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 80ab2c77eecd2c95e517219b478e6815f4d93d3a Pull Request resolved: #59984
…128 flag on quantized conv" Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information with a `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d` set to False. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * testing serialization of saving v3 and loading it back * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` # test that the error gets thrown when we are trying to run an operation which could # saturate, and does not get thrown otherwise python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range # test that loading older versions of conv packed params works as expected # TODO(before land): extend these tests with the v3 files python test/test_quantization.py TestSerialization ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285) [ghstack-poisoned]
…tized conv" Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information with a `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d` set to False. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * testing serialization of saving v3 and loading it back * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` # test that the error gets thrown when we are trying to run an operation which could # saturate, and does not get thrown otherwise python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range # test that loading older versions of conv packed params works as expected # TODO(before land): extend these tests with the v3 files python test/test_quantization.py TestSerialization ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285) [ghstack-poisoned]
Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * better unit testing * serialization, verifying that this is BC * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 1f4ee6d2cc6befb91f2e498971c062d8e8bffa9f Pull Request resolved: #59984
…128 flag on quantized conv" Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information with a `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d` set to False. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * testing serialization of saving v3 and loading it back * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` # test that the error gets thrown when we are trying to run an operation which could # saturate, and does not get thrown otherwise python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range # test that loading older versions of conv packed params works as expected # TODO(before land): extend these tests with the v3 files python test/test_quantization.py TestSerialization ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285) [ghstack-poisoned]
…tized conv" Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information with a `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d` set to False. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * testing serialization of saving v3 and loading it back * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` # test that the error gets thrown when we are trying to run an operation which could # saturate, and does not get thrown otherwise python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range # test that loading older versions of conv packed params works as expected # TODO(before land): extend these tests with the v3 files python test/test_quantization.py TestSerialization ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285) [ghstack-poisoned]
Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * better unit testing * serialization, verifying that this is BC * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: ede13fde05c2f40f1a1950013f75fb69ccc46162 Pull Request resolved: #59984
…128 flag on quantized conv" Summary: This is a start of fixing the problems surfaced in #46749. This particular PR only fixes a small part of this: 1. if a conv module is unsafe to run in fbgemm, we now persist this information with a `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d` set to False. 2. if we are in an fbgemm kernel and we detect that the current conv packed params are tagged as unsafe, we throw an error. For now, this PR is a WIP to get some early feedback if this is the right direction, since iteration cost on this is high. In particular, missing things here are: * testing serialization of saving v3 and loading it back * getting all the conv callsites (currently just module + conv2d is handled) Note: there were some potential improvements discussed on dynamically dispatching to qnnpack if it is available and the flag is set. This PR does not attempt to solve this issue - it can be solved by future PRs. Test Plan: ``` # test that the error gets thrown when we are trying to run an operation which could # saturate, and does not get thrown otherwise python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range # test that loading older versions of conv packed params works as expected # TODO(before land): extend these tests with the v3 files python test/test_quantization.py TestSerialization ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285) [ghstack-poisoned]
@vkuzo I have been trying to follow this discussion, but I'm not sure whether a solution was reached. In non-eager mode, do we have a way to know whether reduce_range was used at the time of model generation? Where are the details of any such API/functionality, if implemented?
🐛 Bug: Quantization - we need a better solution for tracking quantization backend settings in a model
Currently, there are various points of confusion:

* `torch.backends.quantized.engine`, which must be manually set, controls whether qnnpack or fbgemm kernels should be used.
* `reduce_range` in particular: it's easy to do the wrong thing.

A potential high level set of improvements could be:

* the target backend is specified during the `prepare` and `convert` steps
* the `torch.backends.quantized.engine` flag is deprecated; instead, a local per-model flag is used to control which kernels are used
* `reduce_range` is handled safely for the `fbgemm` backend

Creating this issue to finalize the end state we want, and track progress.
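To make the per-model proposal concrete, here is a minimal Python sketch of what a model-local backend flag could look like. Everything here is hypothetical (`QuantizedModel`, `to_backend`): the discussion above only floats a `model.to_backend()`-style API, so this is a sketch of the idea, not a real PyTorch interface.

```python
SUPPORTED_BACKENDS = {"fbgemm", "qnnpack"}

class QuantizedModel:
    """Toy model that carries its quantization backend locally,
    instead of reading a process-wide torch.backends.quantized.engine flag."""

    def __init__(self, backend: str):
        if backend not in SUPPORTED_BACKENDS:
            raise ValueError(f"unknown backend: {backend}")
        self.backend = backend

    def to_backend(self, backend: str) -> "QuantizedModel":
        # Hypothetical transfer API; a real implementation would need to
        # repack (and possibly re-quantize) weights for the target kernels,
        # per the rerunning-the-flow concern raised in the thread above.
        if backend not in SUPPORTED_BACKENDS:
            raise ValueError(f"unknown backend: {backend}")
        self.backend = backend
        return self

    def forward(self, x: str) -> str:
        # Kernel selection keys off the model-local flag, not a global.
        return f"{self.backend}_conv({x})"

m = QuantizedModel("fbgemm")
print(m.forward("x"))                         # fbgemm_conv(x)
print(m.to_backend("qnnpack").forward("x"))   # qnnpack_conv(x)
```

Because the flag lives on the model, two models targeting different backends can coexist in one process, which the global `torch.backends.quantized.engine` flag cannot express.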
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @jerryzh168 @jianyuh @dzhulgakov @raghuramank100 @jamesr66a @vkuzo