[quant] Release qnnpack original weights for conv/linear #37595


Closed
wants to merge 8 commits

Conversation

supriyar
Contributor

@supriyar supriyar commented Apr 30, 2020

Stack from ghstack:

Summary:
QNNPACK does not currently provide an unpack function, so we store the original weights in the packed structure and return them directly to the user when unpack is called.
However, for memory-constrained environments like mobile, keeping these extra weights in memory is expensive, so we need to release them after packing to free up the memory. As a side effect, the user cannot call unpack on mobile once the model has been run.

The change is gated by C10_MOBILE, which is enabled for mobile builds.

The change saves 36 MB on device for the Speech Model.

Test Plan:
python test/test_quantization.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D21365495
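
A minimal sketch of the idea, for illustration only: PackedWeightStandin and maybe_release_original_weights are hypothetical stand-ins, while orig_weight, Tensor::reset(), and the Context::releaseOriginalWeights() flag are names that appear in this PR's diff and discussion.

    // Illustrative sketch, not the literal patch. After the packed QNNPACK
    // operator has been built from orig_weight, the duplicate copy can be
    // dropped when the release flag is enabled (on mobile, per this PR).
    #include <ATen/ATen.h>

    struct PackedWeightStandin {
      at::Tensor orig_weight;  // kept only so that unpack() can return it
      // ... packed QNNPACK operator lives here ...
    };

    void maybe_release_original_weights(PackedWeightStandin& pack_data) {
      if (at::globalContext().releaseOriginalWeights()) {
        // Frees the original weight storage; unpack() no longer works after
        // this point, which is the documented trade-off on mobile.
        pack_data.orig_weight.reset();
      }
    }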

supriyar added a commit that referenced this pull request Apr 30, 2020
ghstack-source-id: d9f6c4d
Pull Request resolved: #37595
@dr-ci

dr-ci bot commented Apr 30, 2020

💊 Build failures summary and remediations

As of commit 37a0c96 (more details on the Dr. CI page):


None of the build failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_cpp_doc_push (1/1)

Step: "Doc Build and Push" (full log | diagnosis details | 🔁 rerun) ❄️

May 07 02:49:49 OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
May 07 02:49:49  
May 07 02:49:49 During handling of the above exception, another exception occurred: 
May 07 02:49:49  
May 07 02:49:49 Traceback (most recent call last): 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/urllib3/contrib/pyopenssl.py", line 313, in recv_into 
May 07 02:49:49     return self.connection.recv_into(*args, **kwargs) 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into 
May 07 02:49:49     self._raise_ssl_error(self._ssl, result) 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error 
May 07 02:49:49     raise SysCallError(errno, errorcode.get(errno)) 
May 07 02:49:49 OpenSSL.SSL.SysCallError: (104, 'ECONNRESET') 
May 07 02:49:49  
May 07 02:49:49 During handling of the above exception, another exception occurred: 
May 07 02:49:49  
May 07 02:49:49 Traceback (most recent call last): 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/urllib3/response.py", line 425, in _error_catcher 
May 07 02:49:49     yield 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/urllib3/response.py", line 507, in read 
May 07 02:49:49     data = self._fp.read(amt) if not fp_closed else b"" 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 62, in read 
May 07 02:49:49     data = self.__fp.read(amt) 

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


This comment was automatically generated by Dr. CI.

Contributor

@kimishpatel kimishpatel left a comment


This is really great. Thanks, Supriya. I left a couple of comments.

supriyar added a commit that referenced this pull request Apr 30, 2020
ghstack-source-id: 9c47b15
Pull Request resolved: #37595
@ezyang
Contributor

ezyang commented May 1, 2020

As @kimishpatel says, I'm generally not a fan of any approach that involves explicitly freeing the original weights, as it is hard to ensure that no one else attempts to access the weights from somewhere else in the meantime.

If we're in the universe where we can special case mobile, my ideal solution would be to simply avoid saving the original weights at all on mobile. This results in a very simple invariant, which is that you can never call unpack on mobile. Is there a reason this is not possible? (If we need a solution that can generalize to server, I think things are a bit harder.)

@supriyar
Contributor Author

supriyar commented May 1, 2020

As @kimishpatel says, I'm generally not a fan of any approach that involves explicitly freeing the original weights, as it is hard to ensure that no one else attempts to access the weights from somewhere else in the meantime.

If we're in the universe where we can special case mobile, my ideal solution would be to simply avoid saving the original weights at all on mobile. This results in a very simple invariant, which is that you can never call unpack on mobile. Is there a reason this is not possible? (If we need a solution that can generalize to server, I think things are a bit harder.)

Thanks for the context @ezyang. For conv and linear ops that use QNNPACK we have a pre-pack function that explicitly packs the weights as a separate aten function call, which is invoked when loading the model. However, QNNPACK was originally written with caffe2 functionality in mind: in PyTorch we need the input scale in order to do the packing, and that is only available at runtime. So we actually pack the weights during the first run call (see https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qlinear.cpp#L263) and store them in memory (essentially rendering the pre-pack op for QNNPACK a no-op).
We currently don't plan to change the pack/run functions in QNNPACK to not depend on the input scale, so we considered the alternative of releasing the original weights on mobile.
Note that this will also be an issue with XNNPACK, where we don't have an unpack function.
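
To make the run-time packing concrete, here is a simplified sketch of the pattern; the struct and helper names are illustrative, and the input_scale caching check mirrors the qlinear.cpp snippet quoted later in this thread.

    // Simplified sketch of lazy, first-run packing for QNNPACK linear/conv.
    // Packing needs the input scale, which is only known at run time, so the
    // prepack op just stashes the original weights and packing happens here.
    #include <ATen/ATen.h>
    #include <c10/util/Optional.h>

    struct QnnpackPackStandin {
      at::Tensor orig_weight;             // original quantized weights
      c10::optional<double> input_scale;  // scale the current packing was built for
      // ... packed QNNPACK operator ...
    };

    void pack_if_needed(QnnpackPackStandin& pack_data, double input_scale) {
      // (Re)pack on the first run, or whenever the input scale changes.
      if (!pack_data.input_scale.has_value() ||
          pack_data.input_scale.value() != input_scale) {
        // ... requantize the bias and build the packed QNNPACK weights from
        //     pack_data.orig_weight using input_scale ...
        pack_data.input_scale = input_scale;
      }
    }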

@kimishpatel
Contributor

As @kimishpatel says, I'm generally not a fan of any approach that involves explicitly freeing the original weights, as it is hard to ensure that no one else attempts to access the weights from somewhere else in the meantime.
If we're in the universe where we can special case mobile, my ideal solution would be to simply avoid saving the original weights at all on mobile. This results in a very simple invariant, which is that you can never call unpack on mobile. Is there a reason this is not possible? (If we need a solution that can generalize to server, I think things are a bit harder.)

Thanks for the context @ezyang. For conv and linear ops that use QNNPACK we have a pre-pack function that explicitly packs the weights as a separate aten function call, which is invoked when loading the model. However, QNNPACK was originally written with caffe2 functionality in mind: in PyTorch we need the input scale in order to do the packing, and that is only available at runtime. So we actually pack the weights during the first run call (see https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qlinear.cpp#L263) and store them in memory (essentially rendering the pre-pack op for QNNPACK a no-op).
We currently don't plan to change the pack/run functions in QNNPACK to not depend on the input scale, so we considered the alternative of releasing the original weights on mobile.
Note that this will also be an issue with XNNPACK, where we don't have an unpack function.

So let's put aside the issue of the prepacking op actually being a no-op; the problem persists regardless of that.
Once we have the torchbind'd class landed for QNNPACK as well, the problem is common to both QNNPACK and XNNPACK. Here is the main issue I see.
The torchbind'd class offers setstate and getstate. It is via setstate that weights are packed (this will be effectively a no-op for QNNPACK (https://github.com/pytorch/pytorch/pull/35923/files#diff-9cda6e782b86822d30fca76d2257e9cfR279)). Now since the packed weights cannot be serialized, if we want to serialize such a model we need to return the original weights. So the model that is saved still has the original weights and not the packed weights.
Now when we load the model again, say this time for mobile, we can have an option of freeing the original weights. But nowhere do we have the option of not saving the original weights in the first place.
Given this scenario we need either 1) a way to free the original weights post packing, or 2) to just reuse the original weight's memory for packing by appropriately resizing the tensor.
Option 1 is certainly better than option 2 if possible (since with option 2 what is stored underneath is not really a tensor, and we should not do that).
So @ezyang, how can we achieve 1 in a safe manner?
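
A rough sketch of the serialization constraint being described, with the torchbind registration omitted and all names illustrative: getstate can only hand back the original tensors, which forces them to be retained unless they are explicitly freed after packing.

    #include <ATen/ATen.h>
    #include <c10/util/Optional.h>
    #include <tuple>

    struct PackedParamsStandin {
      at::Tensor orig_weight;
      c10::optional<at::Tensor> orig_bias;
      // ... packed QNNPACK operator (not serializable) ...

      // __getstate__: only the original tensors can be written out.
      std::tuple<at::Tensor, c10::optional<at::Tensor>> getstate() const {
        return std::make_tuple(orig_weight, orig_bias);
      }

      // __setstate__: store the originals; for QNNPACK the actual packing is
      // deferred to the first run, since it needs the input scale.
      static PackedParamsStandin setstate(
          std::tuple<at::Tensor, c10::optional<at::Tensor>> state) {
        return PackedParamsStandin{std::get<0>(state), std::get<1>(state)};
      }
    };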

@ezyang
Contributor

ezyang commented May 1, 2020

To help me understand the situation better: what happens if the runtime input size varies on mobile? Also, what are the thread safety guarantees for these operators?

@kimishpatel
Contributor

kimishpatel commented May 1, 2020

To help me understand the situation better: what happens if the runtime input size varies on mobile? Also, what are the thread safety guarantees for these operators?

@ezyang, we are just talking about weight packing here, so the input size can vary. That is ok.
Regarding thread safety: there is no shared data structure that is mutated as such. The packed weight is in essence shared; however, unless that weight is used by more than one convolution/linear that happens to be invoked on different threads at the same time, it should not be a problem. In a rare case like that, it would be an issue though. (@supriyar can add more context if I missed something.)
Is that what you were referring to?

@ezyang
Contributor

ezyang commented May 1, 2020

sorry, I meant to ask "input range", not input size

@dreiss
Contributor

dreiss commented May 1, 2020

I think the methods on the packed weights are not thread safe, and we should probably document this.

I agree with Kimish that we should not have a hard ifdef to control this. My vote would be a global flag, checked when packing, that says whether to release the original weights or not. Having an ifdef to make that flag default to true on mobile and false on server should give us the best default experience while retaining the flexibility to save on mobile if we need to.

I agree with Edward that explicitly freeing weights is dangerous. Were we able to determine whether the JIT is retaining a reference to the "setstate" arguments?
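
A sketch of the flag arrangement proposed above: the class here is an illustrative stand-in rather than the real at::Context, setReleaseWeightsWhenPrepacking and releaseOriginalWeights are names that come up later in this thread, and the mobile default shown mirrors the proposal, not necessarily the merged code.

    // Global toggle checked at packing time; the ifdef only changes the default,
    // so server users can still opt in and mobile users can still opt out.
    #include <c10/macros/Macros.h>

    class ContextStandin {
     public:
      void setReleaseWeightsWhenPrepacking(bool enabled) {
        release_original_weights_ = enabled;
      }
      bool releaseOriginalWeights() const {
        return release_original_weights_;
      }

     private:
    #ifdef C10_MOBILE
      bool release_original_weights_ = true;   // mobile: drop originals by default
    #else
      bool release_original_weights_ = false;  // server: keep them so unpack() works
    #endif
    };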

@ezyang
Contributor

ezyang commented May 1, 2020

I think the methods on the packed weights are not thread safe, and we should probably document this.

Documentation would be good, though I'm not sure it's sufficient. I think this case is particularly dangerous because weights is morally a "read-only" argument, and the mutation is under the hood. In general, multithreaded reads to tensors are OK in PyTorch, and so that makes it doubly surprising that weights are not thread safe, unless you happened to know something about the underlying implementation.

That being said, I'm not really sure how you would fix this, except by adding an extra packing stage where you know the input range (which you've said above that you're not going to do; I'm curious to know why, but I could definitely see why this might be time consuming).

@kimishpatel
Contributor

sorry, I meant to ask "input range", not input size

I think I am more confused :). Can you elaborate on what you mean by input range and the context?

@kimishpatel
Contributor

Were we able to determine whether the JIT is retaining a reference to the "setstate" arguments?

Not known yet. Not sure if @ezyang has a better answer or maybe @zdevito?

@kimishpatel
Contributor

Documentation would be good, though I'm not sure it's sufficient. I think this case is particularly dangerous because weights is morally a "read-only" argument, and the mutation is under the hood. In general, multithreaded reads to tensors are OK in PyTorch, and so that makes it doubly surprising that weights are not thread safe, unless you happened to know something about the underlying implementation.

@ezyang, I should clarify. The weights of the model are still immutable. It is the structure in which they are wrapped, in this case the torchbind'd class, that is mutable and that is what gets mutated (at least in the XNNPACK integration code).

That being said, I'm not really sure how you would fix this, except by adding an extra packing stage where you know the input range (which you've said above that you're not going to do. I'm curious to know why, but I could definitely see why this might be time consuming to do).

I still don't understand the input range reference here.

@dreiss
Contributor

dreiss commented May 1, 2020

Does that mean the input scale and zero point? In the case of QNNPACK, that's not enough to make the prepack immutable.

supriyar added a commit that referenced this pull request May 2, 2020
ghstack-source-id: 6db0b53
Pull Request resolved: #37595
@supriyar supriyar requested a review from kimishpatel May 2, 2020 00:30
@@ -149,6 +149,14 @@ bool Context::isXNNPACKAvailable() const {
#endif
}

bool Context::releaseOriginalWeights() const {
Contributor


Also, can we choose a different name? Since as of now this applies specifically to QNNPACK, maybe we should call that out, e.g. releaseQNNPACKOriginalWeights. Not a great name perhaps, but the current one is too broad.

Contributor Author


Sure, I thought you might need to use the same flag for XNNPACK as well. If that is not the case then I will make it QNNPACK specific. Let me know.

Contributor


How about setReleaseWeightsWhenPrepacking?

@ezyang
Contributor

ezyang commented May 4, 2020

Does that mean the input scale and zero point? In the case of QNNPACK, that's not enough to make the prepack immutable.

Oops, I guess I meant input scale. Re your point about making it immutable, is this reflected in the code right now? When I was reading the code, I saw:

    // Re-quantizing the bias based on input scale and weight scale.
    if (!pack_data.input_scale.has_value() ||
        pack_data.input_scale.value() != input_scale) {

which suggested to me that input scale was the only thing being tested.

The weights of the model are still immutable. It is the structure in which they are wrapped, in this case the torchbind'd class, that is mutable and that is what gets mutated (at least in the XNNPACK integration code).

This doesn't feel materially different to me. You have a wrapper object as a stand in for the weights; unless you are telling me that this is actually some generalized "mutable context" for an entire operation that has more thread safety requirements?

@kimishpatel
Contributor

This doesn't feel materially different to me. You have a wrapper object as a stand in for the weights; unless you are telling me that this is actually some generalized "mutable context" for an entire operation that has more thread safety requirements?

@ezyang, I was just clarifying. In theory thread-safety is an issue, but practically speaking qnnpack_linear won't be called from multiple threads with the same wrapper object, assuming we don't have identical instances of the linear op as two different nodes in the graph. It is possible, yes, but not likely I assume. Please correct me if otherwise.
The way it is done today is largely due to how QNNPACK is implemented. This could be done differently, but that would probably be a little convoluted as well.
The same is not true for FP32 ops that use XNNPACK. There, the weights and their packed versions are filled in at model loading time and are immutable afterwards.

@@ -128,6 +130,7 @@ class CAFFE2_API Context {
bool deterministic_cudnn = false;
bool benchmark_cudnn = false;
bool enabled_mkldnn = true;
bool release_original_weights = false;
Contributor


Should we default this to true on mobile?

Contributor Author


Yeah, we could. The only issue I see with that is that someone trying out a mobile build on a server for experimentation purposes might want unpacking to work. But I'm okay with making it the default if you want.

@dreiss
Contributor

dreiss commented May 4, 2020

qnnpack_linear won't be called from multiple threads with the same wrapper object, assuming we don't have identical instances of the linear op as two different nodes in the graph. It is possible, yes, but not likely I assume. Please correct me if otherwise.

It could also happen if two different threads call into the same model at the same time.

@dreiss
Contributor

dreiss commented May 4, 2020

This doesn't feel materially different to me. You have a wrapper object as a stand in for the weights; unless you are telling me that this is actually some generalized "mutable context" for an entire operation that has more thread safety requirements?

Yeah, I think of it as an opaque context object. It's possible in theory to make it thread safe, but it's a significant chunk of work, and given the general lack of multi-threaded inference on mobile (except for within kernels), I don't think it's worth it.

@dreiss
Contributor

dreiss commented May 4, 2020

This change looks good to me overall. Any objections to me accepting?

supriyar added a commit that referenced this pull request May 5, 2020
ghstack-source-id: c52df46
Pull Request resolved: #37595
@supriyar supriyar requested review from dreiss and kimishpatel May 5, 2020 20:46
@ezyang
Contributor

ezyang commented May 5, 2020

This change looks good to me overall. Any objections to me accepting?

The change to use reset() adequately addresses the concerns I had related to the question @supriyar brought to PyTorch core. So on this limited aspect, I think the patch is fine.

I still find the overall design a bit questionable. Before I summarize my concerns, I want to preface that I am sort of parachuting into this without the historical context. So I don't think you should block this particular PR on these concerns. But I do think it's important to convey some of the lower level concerns.

  1. In general, most of everything in PyTorch is thread safe when used in a semantically read-only manner. In cases where it was convenient to maintain some mutable state as an optimization, this was done under the hood, in a thread safe way, invisibly to the user (cuDNN benchmarking is a good example). I'm sympathetic to the fact that we don't really do multithreaded mobile, and I know we have a lot of places in our code where on mobile we have ifdefs that just YOLO thread safety (e.g., fb limited capability, where the thread_local keyword isn't even supported). Nothing about the code being modified in this PR makes it clear that this is a "mobile only" thing. The releasing of original weights (from this PR) is intended to be mobile only, but then there is code (https://github.com/pytorch/pytorch/pull/37595/files#diff-83e2267c4f232a4a83c2fc67a9551697) which lets you toggle this functionality from Python. Any reasonable engineer would infer, by reading this, that this feature was intended to be used from server (otherwise, why would there be Python bindings for it?) And there is still the point that the server-side repacking of quantization parameters is not thread safe. So, to respond directly to @dreiss's comment, "It's possible in theory to make it thread safe, but it's a significant chunk of work, and given the general lack of multi-threaded inference on mobile (except for within kernels), I don't think it's worth it": this argument doesn't hold water for the CPU side of things!
  2. I disagree that it would be a lot of work to make things thread safe. If the object is truly never to be used in a multithreaded setting, you can simply give bad performance characteristics (or throw an error) when the context is used in a multithreaded setting. The primitive you need is just a way to atomically retrieve the packed Tensor; since our refcounting is thread safe, that's sufficient to guarantee that the packed Tensor stays valid even if another thread decides to repack. As for repacking, you can just take out a big lock in that case. I think this level of care is warranted, because it is really, really painful to debug multithreading errors when you know nothing about what is going on internally, and as established in (1), it doesn't seem like there are enough safety barriers to prevent intrepid end users from trying to make something multithreaded.
  3. In general, we like to minimize situations where (1) there is a global toggle on behavior (e.g., releaseWeightsWhenPrepacking) that makes previously working programs start erroring, and (2) there are mobile ifdefs (e.g., changing the default of this behavior). So it makes my heart bleed when we add another mobile-specific behavior toggle. Obviously, there are cases when there's no other choice, and you've gotta do it. But if it's possible to internalize the behavior change to the model itself (e.g., when allocating the qnnpack context, specify whether or not it releases weights on packing), that makes it easier to test this behavior, since you no longer have to build on mobile and/or toggle a global flag; you can just test the behavior with respect to a change in the function parameter. (Speaking of which: where are the tests? :)
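
For point 2, a minimal sketch of what that could look like (illustrative types, not actual PyTorch code): readers take a refcounted handle to the packed data so it stays valid even if another thread repacks, and repacking itself serializes behind a single lock.

    #include <memory>
    #include <mutex>

    struct PackedBlob { /* packed QNNPACK weights */ };

    class PackedWeightCell {
     public:
      // Readers get a shared_ptr; refcounting keeps the old packing alive for
      // this caller even if another thread swaps in a new one.
      std::shared_ptr<const PackedBlob> get() const {
        std::lock_guard<std::mutex> guard(mu_);
        return packed_;
      }

      // "Big lock" around repacking: concurrent repacks simply wait.
      void repack(double input_scale) {
        auto fresh = std::make_shared<PackedBlob>();
        // ... fill `fresh` by packing the weights for `input_scale` ...
        (void)input_scale;  // unused in this sketch
        std::lock_guard<std::mutex> guard(mu_);
        packed_ = std::move(fresh);
      }

     private:
      mutable std::mutex mu_;
      std::shared_ptr<const PackedBlob> packed_;
    };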

@supriyar
Contributor Author

supriyar commented May 6, 2020

Thanks for the review and comments. Addressing some of them here

Any reasonable engineer would infer, by reading this, that this feature was intended to be used from server (otherwise, why would there be Python bindings for it?)

I intended to remove the python bindings for it. I didn't update the PR, my bad...

I know we have a lot of places in our code where on mobile we have ifdefs that just YOLO thread safety (e.g., fb limited capability where thread_local keyword isn't even supported.)

Looks like thread safety on mobile is a general issue that will need to be addressed even outside of this PR. In this case I can throw an error if we try to access the original weights (say from another thread) in the pack logic after they have been released by another thread. Checking pack_ptr.orig_weight.defined() should achieve that. @dreiss, thoughts?
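
A minimal sketch of that guard (hypothetical helper name; orig_weight and defined() follow the names used in this thread, and TORCH_CHECK is the standard ATen error macro):

    #include <ATen/ATen.h>
    #include <c10/util/Exception.h>

    // Fail loudly if the pack logic is reached after the original weights have
    // already been released (e.g. by another thread, or on a prior run).
    void check_original_weights_present(const at::Tensor& orig_weight) {
      TORCH_CHECK(
          orig_weight.defined(),
          "Cannot repack QNNPACK weights: the original weights were released "
          "after prepacking (releaseOriginalWeights is enabled).");
    }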

we like to minimize situations where (1) there is a global toggle on behavior (e.g., releaseWeightsWhenPrepacking)

For this case I'm not sure how we can achieve this without modifying the APIs. One solution I can think of is modifying the *_prepack functions to take an additional argument specifying whether the original weights should be released. But prepack isn't called directly by the user (it gets called as part of the __getstate__/__setstate__ calls in the module), so we would have to modify the module init function to swap it. But that would mean having different modules (for Linear and Conv) for mobile vs server, which would make this change much more involved.

I'd also like to add that this is blocking some folks, so I'm trying to get this to land at the earliest.

supriyar added a commit that referenced this pull request May 6, 2020
ghstack-source-id: 7272533
Pull Request resolved: #37595
@ezyang
Contributor

ezyang commented May 6, 2020

I'd also like to add that this is blocking some folks, so I'm trying to get this to land at the earliest.

Yes. Reamplifying: you shouldn't block this PR on my comments, they're for thinking about the future. I don't think I'm the right person to give the literal Approve here, though.

supriyar added 2 commits May 6, 2020 18:08
supriyar added a commit that referenced this pull request May 7, 2020
ghstack-source-id: 8e3272c
Pull Request resolved: #37595
@facebook-github-bot
Contributor

This pull request has been merged in 7bf9d98.

@dreiss
Contributor

dreiss commented May 7, 2020

But I do think it's important to convey some of the lower level concerns.

Points 1 and 2 are very convincing. I'll make sure we follow up to put locks around all of these mutable packed weights in {Q,X}NNPACK. For release_original_weights, maybe we can just make it a std::atomic?

For point 3, I don't know how to encode this in the model. For a given model, I might want to load it, wrap it, and save it. In that case, the original weights must be preserved. Or I might want to load it and run it. In that case, I don't need the original weights, and freeing them might be necessary to hit my memory budget. So I think it needs to be controllable even for a single model. And since server-side seems to not really care about memory usage (at least on the QNNPACK scale) and mobile is very unlikely to want to re-save a model, the divergent defaults give us automatically correct behavior with no user intervention in all cases that I'm aware of.

@kimishpatel
Contributor


Do note that, if we were to pack weights inside the prepacking ops, then with torchbind support we remove these ops completely. They become attributes of the model and will be packed at model load time (sadly, for QNNPACK we are not there yet).
If packing happens at model load time then I think this is the safest option, avoiding the need for locks and maintaining the invariant of immutability.
That said, the fact that the prepacking ops exist as separate ops does expose us to the thread-safety issue. So if you have a model where the prepacking ops are not removed then you are not maintaining the immutability invariant. As of now for XNNPACK the optimization JIT pass converts the model to contain the prepacking variants of conv and linear, as well as removing the prepacking ops. Perhaps such a pass should make sure that this is done atomically.

@facebook-github-bot facebook-github-bot deleted the gh/supriyar/103/head branch May 11, 2020 14:20
7 participants