Quantization - we need a better solution for tracking quantization backend settings in a model #46749

Open
vkuzo opened this issue Oct 23, 2020 · 33 comments
Labels: oncall: quantization (Quantization support in PyTorch); triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@vkuzo (Contributor) commented Oct 23, 2020

🐛 Bug: Quantization - we need a better solution for tracking quantization backend settings in a model

Currently, there are various points of confusion:

  1. a target backend (qnnpack / fbgemm) is usually chosen during model preparation, but it is not saved in the model
  2. in a Python environment, a global flag torch.backends.quantized.engine, which must be manually set, controls whether qnnpack or fbgemm kernels are used (see the sketch after this list)
  3. with reduce_range in particular, it is easy to do the wrong thing (for example, silently running a model calibrated without reduce_range on the fbgemm backend)
  4. there are no clear guidelines on what users should do if they need a model which works well on all backends
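
To make point 2 concrete, here is a minimal sketch of how the global flag interacts with eager-mode post-training quantization today (the toy module and random calibration data are placeholders; the hazard is that nothing ties the qconfig chosen at prepare time to the engine set at inference time):

```python
import torch
import torch.nn as nn
import torch.quantization as tq

# Toy float model; stand-in for a real network.
model = nn.Sequential(tq.QuantStub(), nn.Conv2d(3, 8, 3), nn.ReLU(), tq.DeQuantStub()).eval()

# The backend choice is effectively made here, via the qconfig...
model.qconfig = tq.get_default_qconfig('fbgemm')  # fbgemm observers use reduce_range=True
tq.prepare(model, inplace=True)
model(torch.randn(1, 3, 32, 32))                  # calibration
tq.convert(model, inplace=True)

# ...but nothing in `model` records that choice. The kernels used at inference
# time are selected by this process-wide flag, which the user must remember to
# set consistently with the qconfig used above:
torch.backends.quantized.engine = 'fbgemm'
out = model(torch.randn(1, 3, 32, 32))
```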

A potential high level set of improvements could be:

  1. target backend and reduce_range settings used during model prepare and convert are saved on the model as metadata and persist throughout the prepare and convert steps
  2. the global torch.backends.quantized.engine flag is deprecated. Instead, a local per-model flag is used to control which kernels are used.
  3. there is tooling which throws a clear warning when the user tries to do something which is not supported, such as running a model which was trained without reduce_range with the fbgemm backend

Creating this issue to finalize the end state we want, and track progress.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @jerryzh168 @jianyuh @dzhulgakov @raghuramank100 @jamesr66a @vkuzo

vkuzo added the oncall: quantization label Oct 23, 2020
github-actions bot added this to Need Triage in Quantization Triage Oct 23, 2020
@raghuramank100 (Contributor)

These are good ideas. We will also need to emphasize that a quantized model created for one backend can also be used on another. This is currently done by setting the backend prior to loading a model. We will need to explain how a model can be transferred to a different backend. Perhaps a model.to_backend() sort of API would be useful?
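
For illustration only, a rough sketch of the shape such an API might take — `to_backend` and the `_backend` attribute are hypothetical and do not exist in PyTorch; today the closest equivalent is setting the global engine flag before loading:

```python
import torch

def to_backend(model: torch.nn.Module, backend: str) -> torch.nn.Module:
    # Hypothetical helper: re-target an already-quantized model to another backend.
    # A real implementation would need to re-pack weights for the new backend and
    # check reduce_range compatibility; this sketch only records the choice and
    # sets the existing global flag.
    assert backend in torch.backends.quantized.supported_engines
    model._backend = backend                   # hypothetical per-model metadata
    torch.backends.quantized.engine = backend  # existing global flag
    return model
```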

@jerryzh168 (Contributor)

These are good ideas. We will also need to emphasize that a quantized model created for one backend can also be used on another. This is currently done by setting the backend prior to loading a model. We will need to explain how a model can be transferred to a different backend. Perhaps a model.to_backend() sort of API would be useful?

Doesn't that require re-running the whole quantization flow?

@jerryzh168 (Contributor)

Sounds good; I think we should be more explicit about the backend. However, I'm also thinking about why we used a global flag in the first place, and I think we should address the concerns that led to that decision as well. One thing I can think of is that we are currently using this global flag to decide which kernels to run, e.g.: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qclamp.cpp#L89. We'll need to think about where to put this flag; it feels like it should be an attribute of Tensor, since we need to dispatch on it.

@supriyar (Contributor) commented Oct 23, 2020

However, I'm also thinking about why we used a global flag in the first place, and I think we should address the concerns that led to that decision as well. One thing I can think of is that we are currently using this global flag to decide which kernels to run

Right. I believe that was one of the main motivations. We would need an additional "backend" in dispatch to distinguish between FBGEMM/QNNPACK and any other potential future optimized backends we want to run on. The idea was that we would use the QuantizedCPU backend and switch between different architecture kernels (server/mobile) based on the global runtime flag.

Another use case where this is done in the PyTorch codebase is the MKL-DNN backend, which is also controlled by a global flag: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Convolution.cpp#L233

@vkuzo (Contributor, Author) commented Oct 23, 2020

We could break this down into separate dependent issues:

  1. high-pri - a quantized model does not contain explicit information on which backend, if any, it targets. We can attach that information, set the right global flags automatically based on this metadata during inference, and throw warnings or exceptions if the user is trying to do something unexpected (reduce_range, unsupported ops, etc.). This should be a smaller change.
  2. low-pri - deprecating global flags. This would become more important in an environment where multiple models, each with different backend settings, are executed concurrently. I would not expect this to happen often.

@vkuzo (Contributor, Author) commented Oct 26, 2020

cc @dreiss

@dreiss (Contributor) commented Oct 26, 2020

target backend and reduce_range settings used during model prepare and convert are saved on the model as metadata and persist throughout the prepare and convert steps

This might be annoying to deal with because (1) freezing could delete it, and (2) it would no longer be visible if you put your quantized model inside another nn.Module (which we sometimes do when exporting a model for production).

We would need an additional "backend" in dispatch to distinguish between FBGEMM/QNNPACK

This would be a significant change. I think we should try to avoid it if possible.

currently using this global flag to decide which kernels to run, e.g.: qclamp.cpp

This is a good point, but choosing the "right" backend for qclamp is less important because FBGEMM and QNNPACK are numerically equivalent.

If I understand correctly, the core of the problem here is that FBGEMM has a known and accepted numerical inaccuracy in conv (and presumably also linear), which is worked around by using reduce_range. If these are the only operators that have problems, then we might be able to take advantage of the fact that they both use prepacking. If we set the flag on the prepacked weights, rather than on the model itself, that works around both the freezing and wrapping issues and doesn't need a new dispatch key.

What do people think of this proposal: could we add a boolean to the params_tensor in ConvParamsSerializationType serialize_conv to represent "requires full numerical accuracy" or something like that? Then in deserialize_conv (still in conv_serialization.h), if it's true, we'd always use the QNNPACK engine (or throw if it's not available)? Then, during quantization, you'd just set this to true on all of the qconvs if you're quantizing with non-reduced-range activations?

@jerryzh168 (Contributor)

This is a good point, but choosing the "right" backend for qclamp is less important because FBGEMM and QNNPACK are numerically equivalent.

Are there performance implications between the different backends as well? E.g., on servers, is fbgemm always faster than qnnpack?

Also, what about the availability of backends in different environments? Is it true that qnnpack is available on all platforms and fbgemm is only available on servers?

@dreiss (Contributor) commented Oct 26, 2020

Are there performance implications between the different backends as well? E.g., on servers, is fbgemm always faster than qnnpack?

In my experience, yes, fbgemm is always faster than qnnpack when it is available. I believe this is partially due to the fact that fbgemm uses SIMD instructions that are not available on mobile x86 processors.

Also, what about the availability of backends in different environments? Is it true that qnnpack is available on all platforms and fbgemm is only available on servers?

More-or-less yes. I believe FBGEMM is available on x86_64 only (server and most desktop/laptop environments). qnnpack is available on arm32, arm64, x86, x86_64, and we include it in both our mobile and server builds.

@jerryzh168 (Contributor) commented Oct 26, 2020

In my experience, yes, fbgemm is always faster than qnnpack when it is available. I believe this is partially due to the fact that fbgemm uses SIMD instructions that are not available on mobile x86 processors.

In that case, how do we decide which backend to use for other operators? I think there are two ways:

Proposal 1

What is described in your proposal: always set to qnnpack (with no user API for modifying the quantized engine). There would be no numeric issues, but performance might be a problem when fbgemm is available and the user wants to use it.

Proposal 2

We set the quantized engine based on backend availability and performance. That is: when fbgemm is available, use fbgemm; otherwise, use qnnpack. Don't expose an API for user modification.

Although there might be a problem when the user actually wants to use qnnpack on a server even when fbgemm exists; this might be needed when the user wants to verify the numerics of a qnnpack-quantized model on a server before exporting to mobile.

---- edit ----
Actually, I think Proposal 2 might work, since for a qnnpack model the boolean should be set to True, and the qnnpack engine will be chosen for conv/linear even if the rest of the ops run on fbgemm.

@dreiss (Contributor) commented Oct 27, 2020

What is described in your proposal: always set to qnnpack

This is not quite right. My proposal is to force qnnpack only for prepacked objects that have been tagged as requiring the full numerical range.

@dreiss (Contributor) commented Oct 27, 2020

actually wants to use qnnpack on a server even when fbgemm exists

I think we can still support this. The logic would basically be:

if (requires full range || qnnpack explicitly requested || fbgemm not available) {
  use qnnpack
} else {
  use fbgemm
}
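
A Python-level sketch of that selection, assuming a hypothetical `requires_full_range` tag read from the prepacked weights; backend availability can be checked via `torch.backends.quantized.supported_engines`:

```python
import torch

def choose_engine(requires_full_range: bool, qnnpack_requested: bool = False) -> str:
    # Mirrors the pseudocode above: fall back to qnnpack whenever the full
    # numerical range is required, qnnpack was explicitly requested, or
    # fbgemm is not built into this binary.
    fbgemm_available = 'fbgemm' in torch.backends.quantized.supported_engines
    if requires_full_range or qnnpack_requested or not fbgemm_available:
        return 'qnnpack'
    return 'fbgemm'

torch.backends.quantized.engine = choose_engine(requires_full_range=True)
```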

@raghuramank100 (Contributor) commented Oct 27, 2020

@dreiss: Wanted to check if my understanding is correct:

  1. The quantized model has packed weights containing information on reduce_range (let's call it "backend" so that it is general).

  2. When the model is deserialized, the backend information is loaded. Can the user override this information? For example, one could have a model quantized for FBGEMM that can run on QNNPACK with no numerical changes. Also, down the road, on newer processors there should not be any difference in numerics between fbgemm and qnnpack.

  3. By default, the logic would be as below. Do we have operator parity between qnnpack and fbgemm? Are there cases where this logic would fail?

backend = backend_from_packed_weights
// Override logic: can always use qnnpack
if (backend == 'qnnpack' || qnnpack explicitly requested || fbgemm not available) {
  use qnnpack
} else {
  use fbgemm
}

  4. Ideally, quantized models should be backend independent, and the user can decide which backend to use at run time.

@jerryzh168 (Contributor) commented Oct 27, 2020

actually wants to use qnnpack on a server even when fbgemm exists

I think we can still support this. The logic would basically be:

if (requires full range || qnnpack explicitly requested || fbgemm not available) {
  use qnnpack
} else {
  use fbgemm
}

Then the question is: what is the API for people to explicitly request qnnpack?

@dreiss (Contributor) commented Oct 28, 2020

The quantized model has packed weights containing information on reduce_range (let's call it "backend" so that it is general).

Yes.

the backend information is loaded. Can the user override this information?

I don't think this capability is necessary for production use cases, but it might be helpful for debugging.

Do we have operator parity between qnnpack and fbgemm? Are there cases where this logic would fail?

Like, if you train a model without reduce_range on an op that qnnpack doesn't support? I suppose in that case we'd have to fail, since we have no way to run that op.

Ideally, quantized models should be backend independent, and the user can decide which backend to use at run time.

Yes, I agree with this, but if you train a model without reduce_range, it's simply not safe to run it with fbgemm.

what is the API for people to explicitly request qnnpack?

I think the current one is fine. If we assume that it's only used in serious debugging situations, it doesn't need to be super friendly.

@mattcyu1 commented Oct 28, 2020

Commenting here to give a user perspective, as this has bitten me twice in the past couple of weeks:

User forgets to set the same backend during evaluation as during export
We are building a model to be deployed on mobile. We set torch.backends.quantized.engine = "qnnpack", run PTQ, and save the model as TorchScript. In a separate operator (let's say we're doing this on another server), we load the model and run evaluation. However, since we did not set the backend in the eval operator, we get very different results (than if we had set it to qnnpack). While it was not hard to find the issue and set the backend explicitly, it would be nice to get some kind of warning that we were using a different backend than the one used during quantization (or even when the model was saved). This would have been more dangerous if the results were closer, as it would have been less obvious that there was an issue.
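
A condensed sketch of that workflow and the pitfall, assuming a toy model and an in-memory buffer in place of the real network and file paths (this requires a build with the qnnpack engine available):

```python
import io
import torch
import torch.nn as nn
import torch.quantization as tq

# --- Export side: mobile-targeted PTQ ---
torch.backends.quantized.engine = 'qnnpack'
m = nn.Sequential(tq.QuantStub(), nn.Linear(4, 4), tq.DeQuantStub()).eval()
m.qconfig = tq.get_default_qconfig('qnnpack')
tq.prepare(m, inplace=True)
m(torch.randn(2, 4))                       # calibration
tq.convert(m, inplace=True)
buf = io.BytesIO()
torch.jit.save(torch.jit.script(m), buf)

# --- Eval side: imagine a fresh process on another server ---
buf.seek(0)
loaded = torch.jit.load(buf)
# Pitfall: if the next line is forgotten, the process default engine (fbgemm
# on servers) is used instead, results can differ noticeably, and nothing
# warns about the mismatch.
torch.backends.quantized.engine = 'qnnpack'
print(loaded(torch.randn(2, 4)))
```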

Loading and saving a model increases the size if the backend is different
While investigating #45465, it turned out the cause was that the backend was qnnpack when the model was saved but fbgemm when the model was loaded (repro: N386754). The user who did the load and save was unaware of the backend settings, so when they saved, the default (fbgemm) was used and the model increased in size. This led to some confusing debugging (@dreiss, this is why I was asking you about weight sharing; I went down that rabbit hole trying to figure out the size increase). A warning that the backend during loading was different from the one used to quantize (or save) could have helped debug the issue.

I'm not following this discussion that well, but I think

if (backend == 'qnnpack' || qnnpack explicitly requested || fbgemm not available) {
  use qnnpack
} else {
  use fbgemm
}

would still have resulted in the above errors, as ours were due to users not explicitly setting the backend to qnnpack (and fbgemm would be available on the server).

TL;DR: it would be nice to save some metadata in the TorchScript model indicating which backend was set during quantization (or saving), and raise a warning when the model is loaded if a different backend is set. However, if this is difficult to implement or leads to undesired effects, we can probably just save some metadata ourselves and make sure to pass it around when we share models.
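
One way to do the "save some metadata ourselves" workaround today is TorchScript's extra-files mechanism; a sketch (the file name `quant_backend.json` and the warning policy are our own choices, not a PyTorch convention):

```python
import json
import warnings
import torch

def save_with_backend(scripted_module, path):
    # Record the engine that was active when the model was quantized/saved.
    meta = {'quantized_engine': torch.backends.quantized.engine}
    torch.jit.save(scripted_module, path,
                   _extra_files={'quant_backend.json': json.dumps(meta)})

def load_with_backend_check(path):
    extra = {'quant_backend.json': ''}
    m = torch.jit.load(path, _extra_files=extra)
    saved = json.loads(extra['quant_backend.json'])['quantized_engine']
    if saved != torch.backends.quantized.engine:
        warnings.warn(f"model was saved with quantized engine '{saved}' but the "
                      f"current engine is '{torch.backends.quantized.engine}'")
    return m
```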

@dreiss (Contributor) commented Oct 29, 2020

Loading and saving a model increases the size if the backend is different

I don't think this is correct. If you change your notebook to use fbgemm all the way through, then even the first saved model has the larger size. I think this is due to a separate issue that we could address separately if it's a problem. Here's what I think is going on with model size:

These quantized convolutions store their weights in memory in a "packed" form for fast execution. However, the packing format is hardware-dependent, so we need to "unpack" the weights when serializing. fbgemm and qnnpack take different approaches here.

When packing weights for qnnpack, we hold on to references to the original unpacked tensors. When "unpacking", we just return those original tensors. Therefore, any aliasing between multiple convolutions will be preserved. The major downside of this approach is that we store two copies of the weights in memory (packed and unpacked). (To avoid excessive memory usage on mobile, we free the unpacked weights, so it's not possible to save a quantized model on mobile.)

Conversely, fbgemm supports unpacking the packed weights numerically. So it never stores a duplicate copy in memory, but when we serialize an fbgemm qconv, it constructs a fresh unpacked weight tensor every time. So multiple separate qconvs that share the same weight (or even the same qconv asked to serialize itself multiple times, which might be what's going on here) will return unrelated tensors for serialization, so it becomes impossible for the serializer to dedup them.

@mattcyu1

Yup, that was unclear; it should have been something like "loading a model saved using the qnnpack engine and then saving with fbgemm increased model size". I can confirm that model size remains the same with fbgemm all the way through (but is larger than with qnnpack).

@dzhulgakov (Collaborator)

These quantized convolutions store their weights in memory in a "packed" form for fast execution. However, the packing format is hardware-dependent, so we need to "unpack" the weights when serializing. [...]

This actually sounds like a bug. We should just fix it and preserve aliasing through the packing/unpacking cycle. One way to do it is to preserve the original tensor during load but resize it to 0 (to free up memory while running), and then restore it only once during packing back (so that only the first instance of packed weight would populate it). That way the save/load cycle would still preserve sharing. Or, given that sharing is pretty rare, we can also just not release memory when it is shared (we can check Storage.use_count for it).

On a meta level, the original design intention from last year was to hide backend differences from users. That's a much better UX; e.g., people generally don't think about whether their gemms run with MKL/fbgemm/cublas/etc.

I see two issues listed:

  1. reduce_range caused by lack of VNNI instructions
  2. fbgemm serialized files being larger

(1) is indeed the real issue. Can we solve it by detecting whether the weights matrix has >7-bit numbers and auto-switching to qnnpack? Have we explored this in detail?

(2) is probably just a bug to be fixed as described above. The actual serialization format is engine-independent today.

Are there other issues we see? Creating separate APIs per engine is a one-way change, and it'd be hard to undo in the future. So we should be convinced that we don't really have a way to abstract out fbgemm/qnnpack transparently before we go that route.

@dreiss (Contributor) commented Oct 29, 2020

preserve the original tensor during load but resizing it to 0 (to free up memory while running)

I'm confused. Do you mean resize it destructively? Wouldn't that break if you do p1 = pack(w); p2 = pack(w); because p2 would see a zero-size tensor?

Storage.use_count

Wouldn't this be one if you make 2 packs with the same weight?

@dzhulgakov (Collaborator)

Sorry, I was just thinking about the loading part. You're right that it should work correctly even when pack is called several times one after another.

I tried to poke at the sharing behavior more. I actually don't think we should fix it at the PackedParams level, because it semantically behaves like a "tensor". And, as you point out, when the second pack happens it's pretty hard to detect that the first one should be reused.

I extended @mattcyu1's example from N386754 a bit and made it public: https://colab.research.google.com/drive/1qWUq6PQ4fj67TBbD3xg_BvYJve9E91AZ?usp=sharing

Basically, you can see that the difference is that repeated jit.trace of a quantized model creates different instances of PackedWeights, whereas for a regular tensor the instance comes out shared. That seems to be the root cause. In this case we should have just created a single PackedWeights shared across module instances. Then we don't need to worry about engines or anything within, because the sharing occurs at a higher level. (cc @suo to double-check my understanding)


@dreiss - I like your idea of remembering the reduce_range setting inside the module. We can even bake it inside PackedParams; that way it won't be eaten up by freezing for sure. Then we can either:

  • generate a warning to the user if the backend is wrong
  • when loading the PackedWeights, if no reduce_range was applied, auto-fall-back to qnnpack and print a warning/error. Note that the backend setting is really encapsulated in the in-memory PackedWeight, so one could easily mix both in one model. The only reason we have the global flag is to control which PackedWeights to create when deserializing; it doesn't affect the model at runtime.

@dreiss (Contributor) commented Oct 30, 2020

Can we move the size discussion to #45465?

@dreiss (Contributor) commented Oct 30, 2020

We can even bake it inside PackedParams; that way it won't be eaten up by freezing for sure

Yes, this is what I was proposing.

when loading the PackedWeights, if no reduce_range was applied, auto-fall-back to qnnpack and print a warning/error

Yes.

supriyar moved this from Need Triage to Unassigned in Quantization Triage Oct 30, 2020
supriyar added the triaged label Oct 30, 2020
@vkuzo (Contributor, Author) commented Jan 14, 2021

Thanks for the discussion and the additional context, everyone. I'm not focusing on the size discussion in this comment since it was moved to a separate issue. My takeaways:

Assume that a quantization pass generates M output nodes, and there are N target backends (fbgemm and qnnpack today, more may be added in the future). Each of the M output nodes can:

  1. be supported and work equally well across the N backends (most things fall here)
  2. be supported on a subset of backends, and result in a runtime error on unsupported backends (example: someone writes a new op with only fbgemm kernels, etc)
  3. give numerically incorrect results on some backends, based on input settings (fbgemm requires reduce_range, but fails silently without it)
  • (2) sounds OK for now.
  • (3) is a problem we should definitely fix; we should throw warnings/errors when this happens instead of failing silently.
  • reduce_range is the only instance of (3) that I'm aware of today. More of these may arise as more ops and backends are added (if anyone knows of similar issues coming up, please let us know).

Proposed next steps:

  • (no change) Do not persist the backend throughout the model. Model owners can/should do this outside of PyTorch.
  • (change) Persist the value of reduce_range on each quantized node (see the observer sketch after this list). There is some homework to do on the best way to accomplish this, and on whether to do this for all nodes or just nodes with known problematic kernels. We will also have to make sure BC is not broken, and look into whether this can be done dynamically instead of persisting a flag.
    • (change) In fbgemm, throw an error if a kernel is called without reduce_range (current behavior: overflow).
    • (optional) In qnnpack, optionally display a warning if a kernel is called with reduce_range, since the user is leaving accuracy on the table (current behavior: no warning).
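
For context on where reduce_range lives today (and what would need to be persisted), the default qconfigs differ only in their observer settings; a quick illustration using the eager-mode torch.quantization namespace:

```python
import torch.quantization as tq

fbgemm_qconfig = tq.get_default_qconfig('fbgemm')    # activation observer: reduce_range=True
qnnpack_qconfig = tq.get_default_qconfig('qnnpack')  # activation observer: reduce_range=False

# The flag is visible on the instantiated activation observers, but after
# convert() it is not recorded anywhere on the quantized modules themselves.
print(fbgemm_qconfig.activation().reduce_range)   # True
print(qnnpack_qconfig.activation().reduce_range)  # False
```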

vkuzo added a commit that referenced this issue Jun 29, 2021
… conv"


Summary:

This is a start of fixing the problems surfaced in #46749.
This particular PR only fixes a small part of this:
1. if a conv module is unsafe to run in fbgemm, we now persist this
information with a `safe_for_fbgemm` boolean flag stored on `ConvPackedParams{n}d`.
2. if we are in an fbgemm kernel and we detect that the current conv
packed params are tagged as unsafe, we throw an error.

For now, this PR is a WIP to get some early feedback if this is the
right direction, since iteration cost on this is high. In particular,
missing things here are:
* testing serialization of saving v3 and loading it back
* getting all the conv callsites (currently just module + conv2d is handled)

Note: there were some potential improvements discussed on dynamically
dispatching to qnnpack if it is available and the flag is set.  This PR
does not attempt to solve this issue - it can be solved by future PRs.

Test Plan:

```
# test that the error gets thrown when we are trying to run an operation which could
# saturate, and does not get thrown otherwise
python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range

# test that loading older versions of conv packed params works as expected
# TODO(before land): extend these tests with the v3 files
python test/test_quantization.py TestSerialization
```

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285)

[ghstack-poisoned]
vkuzo added a commit that referenced this issue Jun 29, 2021
…128 flag on quantized conv"


Summary:

This is a start of fixing the problems surfaced in #46749.
This particular PR only fixes a small part of this:
1. if a conv module is unsafe to run in fbgemm, we now persist this
information with an `input_qrange_le_128` boolean flag stored on `ConvPackedParams{n}d`, set to False.
2. if we are in an fbgemm kernel and we detect that the current conv
packed params are tagged as unsafe, we throw an error.

For now, this PR is a WIP to get some early feedback if this is the
right direction, since iteration cost on this is high. In particular,
missing things here are:
* testing serialization of saving v3 and loading it back
* getting all the conv callsites (currently just module + conv2d is handled)

Note: there were some potential improvements discussed on dynamically
dispatching to qnnpack if it is available and the flag is set.  This PR
does not attempt to solve this issue - it can be solved by future PRs.

Test Plan:

```
# test that the error gets thrown when we are trying to run an operation which could
# saturate, and does not get thrown otherwise
python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range

# test that loading older versions of conv packed params works as expected
# TODO(before land): extend these tests with the v3 files
python test/test_quantization.py TestSerialization
```

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D29175285](https://our.internmc.facebook.com/intern/diff/D29175285)

[ghstack-poisoned]
vkuzo added a commit that referenced this issue Jun 29, 2021
Summary:

This is a start of fixing the problems surfaced in #46749.
This particular PR only fixes a small part of this:
1. if a conv module is unsafe to run in fbgemm, we now persist this
information.
2. if we are in an fbgemm kernel and we detect that the current conv
packed params are tagged as unsafe, we throw an error.

For now, this PR is a WIP to get some early feedback if this is the
right direction, since iteration cost on this is high. In particular,
missing things here are:
* better unit testing
* serialization, verifying that this is BC
* getting all the conv callsites (currently just module + conv2d is handled)

Note: there were some potential improvements discussed on dynamically
dispatching to qnnpack if it is available and the flag is set.  This PR
does not attempt to solve this issue - it can be solved by future PRs.

Test Plan:

```
python test/test_quantization.py TestQuantizedOps.test_conv_reduce_range
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: f252b21f284d6e80cb94f5bf4eb0e299892d3f93
Pull Request resolved: #59984
@saurabh-shandilya

@vkuzo I have been trying to follow this discussion, but I am not sure whether there was a solution. Do we have a way to know, in non-eager mode, whether reduce_range was used at model-generation time? Where are the details of any such API/functionality, if implemented?
