[quant] Release qnnpack original weights for conv/linear #37595


Closed
wants to merge 8 commits

Conversation

supriyar
Contributor

@supriyar supriyar commented Apr 30, 2020

Stack from ghstack:

Summary:
QNNPACK does not currently provide an unpack function, so we store the original weights in the packed structure and return them directly to the user when unpack is called.
However, for memory-constrained environments like mobile, keeping these extra weights in memory is expensive, so we need to release them after packing to free up the memory. As a side effect, the user cannot call unpack on mobile once the model has been run.

The change is gated by C10_MOBILE, which is enabled for mobile builds.

The change saves 36 MB on device for the Speech Model.

Test Plan:
python test/test_quantization.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D21365495
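
A minimal sketch of the idea, for illustration only: PackedWeightStandin and maybe_release_original_weights are hypothetical stand-ins, while orig_weight, Tensor::reset(), and the Context::releaseOriginalWeights() flag are names that appear in this PR's diff and discussion.

    // Illustrative sketch, not the literal patch. After the packed QNNPACK
    // operator has been built from orig_weight, the duplicate copy can be
    // dropped when the release flag is enabled (on mobile, per this PR).
    #include <ATen/ATen.h>

    struct PackedWeightStandin {
      at::Tensor orig_weight;  // kept only so that unpack() can return it
      // ... packed QNNPACK operator lives here ...
    };

    void maybe_release_original_weights(PackedWeightStandin& pack_data) {
      if (at::globalContext().releaseOriginalWeights()) {
        // Frees the original weight storage; unpack() no longer works after
        // this point, which is the documented trade-off on mobile.
        pack_data.orig_weight.reset();
      }
    }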

supriyar added a commit that referenced this pull request Apr 30, 2020
ghstack-source-id: d9f6c4d
Pull Request resolved: #37595
@dr-ci

dr-ci bot commented Apr 30, 2020

💊 Build failures summary and remediations

As of commit 37a0c96 (more details on the Dr. CI page):


None of the build failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_cpp_doc_push (1/1)

Step: "Doc Build and Push" (full log | diagnosis details | 🔁 rerun) ❄️

May 07 02:49:49 OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
May 07 02:49:49  
May 07 02:49:49 During handling of the above exception, another exception occurred: 
May 07 02:49:49  
May 07 02:49:49 Traceback (most recent call last): 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/urllib3/contrib/pyopenssl.py", line 313, in recv_into 
May 07 02:49:49     return self.connection.recv_into(*args, **kwargs) 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1840, in recv_into 
May 07 02:49:49     self._raise_ssl_error(self._ssl, result) 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error 
May 07 02:49:49     raise SysCallError(errno, errorcode.get(errno)) 
May 07 02:49:49 OpenSSL.SSL.SysCallError: (104, 'ECONNRESET') 
May 07 02:49:49  
May 07 02:49:49 During handling of the above exception, another exception occurred: 
May 07 02:49:49  
May 07 02:49:49 Traceback (most recent call last): 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/urllib3/response.py", line 425, in _error_catcher 
May 07 02:49:49     yield 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/urllib3/response.py", line 507, in read 
May 07 02:49:49     data = self._fp.read(amt) if not fp_closed else b"" 
May 07 02:49:49   File "/opt/conda/lib/python3.6/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 62, in read 
May 07 02:49:49     data = self.__fp.read(amt) 

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


This comment was automatically generated by Dr. CI.

Contributor

@kimishpatel kimishpatel left a comment


This is really great. Thanks, Supriya. I left a couple of comments.

supriyar added a commit that referenced this pull request Apr 30, 2020
ghstack-source-id: 9c47b15
Pull Request resolved: #37595
@ezyang
Contributor

ezyang commented May 1, 2020

As @kimishpatel says, I'm generally not a fan of any approach that involves explicitly freeing the original weights, as it is hard to ensure that no one else attempts to access the weights from somewhere else in the meantime.

If we're in the universe where we can special case mobile, my ideal solution would be to simply avoid saving the original weights at all on mobile. This results in a very simple invariant, which is that you can never call unpack on mobile. Is there a reason this is not possible? (If we need a solution that can generalize to server, I think things are a bit harder.)

@supriyar
Contributor Author

supriyar commented May 1, 2020

As @kimishpatel says, I'm generally not a fan of any approach that involves explicitly freeing the original weights, as it is hard to ensure that no one else attempts to access the weights from somewhere else in the meantime.

If we're in the universe where we can special case mobile, my ideal solution would be to simply avoid saving the original weights at all on mobile. This results in a very simple invariant, which is that you can never call unpack on mobile. Is there a reason this is not possible? (If we need a solution that can generalize to server, I think things are a bit harder.)

Thanks for the context @ezyang. For conv and linear ops that use QNNPACK we have a pre-pack function that explicitly packs the weights as a separate aten function call, which is invoked when loading the model. However, QNNPACK was originally written with caffe2 functionality in mind: in PyTorch we need the input scale in order to do the packing, and that is only available at runtime. So we actually pack the weights during the first run call (see https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qlinear.cpp#L263) and store them in memory (essentially rendering the pre-pack op for QNNPACK a no-op).
We currently don't plan to change the pack/run functions in QNNPACK to not depend on the input scale, so we considered the alternative of releasing the original weights on mobile.
Note that this will also be an issue with XNNPACK, where we don't have an unpack function.
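
To make the run-time packing concrete, here is a simplified sketch of the pattern; the struct and helper names are illustrative, and the input_scale caching check mirrors the qlinear.cpp snippet quoted later in this thread.

    // Simplified sketch of lazy, first-run packing for QNNPACK linear/conv.
    // Packing needs the input scale, which is only known at run time, so the
    // prepack op just stashes the original weights and packing happens here.
    #include <ATen/ATen.h>
    #include <c10/util/Optional.h>

    struct QnnpackPackStandin {
      at::Tensor orig_weight;             // original quantized weights
      c10::optional<double> input_scale;  // scale the current packing was built for
      // ... packed QNNPACK operator ...
    };

    void pack_if_needed(QnnpackPackStandin& pack_data, double input_scale) {
      // (Re)pack on the first run, or whenever the input scale changes.
      if (!pack_data.input_scale.has_value() ||
          pack_data.input_scale.value() != input_scale) {
        // ... requantize the bias and build the packed QNNPACK weights from
        //     pack_data.orig_weight using input_scale ...
        pack_data.input_scale = input_scale;
      }
    }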

@kimishpatel
Contributor

As @kimishpatel says, I'm generally not a fan of any approach that involves explicitly freeing the original weights, as it is hard to ensure that no one else attempts to access the weights from somewhere else in the meantime.
If we're in the universe where we can special case mobile, my ideal solution would be to simply avoid saving the original weights at all on mobile. This results in a very simple invariant, which is that you can never call unpack on mobile. Is there a reason this is not possible? (If we need a solution that can generalize to server, I think things are a bit harder.)

Thanks for the context @ezyang. For conv and linear ops that use QNNPACK we have a pre-pack function that explicitly packs the weights as a separate aten function call, which is invoked when loading the model. However, QNNPACK was originally written with caffe2 functionality in mind: in PyTorch we need the input scale in order to do the packing, and that is only available at runtime. So we actually pack the weights during the first run call (see https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qlinear.cpp#L263) and store them in memory (essentially rendering the pre-pack op for QNNPACK a no-op).
We currently don't plan to change the pack/run functions in QNNPACK to not depend on the input scale, so we considered the alternative of releasing the original weights on mobile.
Note that this will also be an issue with XNNPACK, where we don't have an unpack function.

So let's put aside the issue of the prepacking op actually being a no-op; the problem persists regardless of that.
Once we have the torchbind'd class landed for QNNPACK as well, the problem is common to both QNNPACK and XNNPACK. Here is the main issue I see.
The torchbind'd class offers setstate and getstate. It is via setstate that weights are packed (this will be effectively a no-op for QNNPACK (https://github.com/pytorch/pytorch/pull/35923/files#diff-9cda6e782b86822d30fca76d2257e9cfR279)). Now since the packed weights cannot be serialized, if we want to serialize such a model we need to return the original weights. So the model that is saved still has the original weights and not the packed weights.
Now when we load the model again, say this time for mobile, we can have an option of freeing the original weights. But nowhere do we have the option of not saving the original weights in the first place.
Given this scenario we need either 1) a way to free the original weights post packing, or 2) to just reuse the original weight's memory for packing by appropriately resizing the tensor.
Option 1 is certainly better than option 2 if possible (since with option 2 what is stored underneath is not really a tensor, and we should not do that).
So @ezyang, how can we achieve 1 in a safe manner?
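
A rough sketch of the serialization constraint being described, with the torchbind registration omitted and all names illustrative: getstate can only hand back the original tensors, which forces them to be retained unless they are explicitly freed after packing.

    #include <ATen/ATen.h>
    #include <c10/util/Optional.h>
    #include <tuple>

    struct PackedParamsStandin {
      at::Tensor orig_weight;
      c10::optional<at::Tensor> orig_bias;
      // ... packed QNNPACK operator (not serializable) ...

      // __getstate__: only the original tensors can be written out.
      std::tuple<at::Tensor, c10::optional<at::Tensor>> getstate() const {
        return std::make_tuple(orig_weight, orig_bias);
      }

      // __setstate__: store the originals; for QNNPACK the actual packing is
      // deferred to the first run, since it needs the input scale.
      static PackedParamsStandin setstate(
          std::tuple<at::Tensor, c10::optional<at::Tensor>> state) {
        return PackedParamsStandin{std::get<0>(state), std::get<1>(state)};
      }
    };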

@ezyang
Contributor

ezyang commented May 1, 2020

To help me understand the situation better: what happens if the runtime input size varies on mobile? Also, what are the thread safety guarantees for these operators?

@kimishpatel
Contributor

kimishpatel commented May 1, 2020

To help me understand the situation better: what happens if the runtime input size varies on mobile? Also, what are the thread safety guarantees for these operators?

@ezyang, we are just talking about weight packing here, so the input size can vary. That is ok.
Regarding thread safety: there is no shared data structure that is mutated as such. The packed weight is in essence shared; however, unless that weight is used by more than one convolution/linear that happens to be invoked on different threads at the same time, it should not be a problem. In a rare case like that, it would be an issue though. (@supriyar can add more context if I missed something.)
Is that what you were referring to?

@ezyang
Contributor

ezyang commented May 1, 2020

sorry, I meant to ask "input range", not input size

@dreiss
Contributor

dreiss commented May 1, 2020

I think the methods on the packed weights are not thread safe, and we should probably document this.

I agree with Kimish that we should not have a hard ifdef to control this. My vote would be a global flag, checked when packing, that says whether to release the original weights or not. Having an ifdef to make that flag default to true on mobile and false on server should give us the best default experience while retaining the flexibility to save on mobile if we need to.

I agree with Edward that explicitly freeing weights is dangerous. Were we able to determine whether the JIT is retaining a reference to the "setstate" arguments?
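
A sketch of the flag arrangement proposed above: the class here is an illustrative stand-in rather than the real at::Context, setReleaseWeightsWhenPrepacking and releaseOriginalWeights are names that come up later in this thread, and the mobile default shown mirrors the proposal, not necessarily the merged code.

    // Global toggle checked at packing time; the ifdef only changes the default,
    // so server users can still opt in and mobile users can still opt out.
    #include <c10/macros/Macros.h>

    class ContextStandin {
     public:
      void setReleaseWeightsWhenPrepacking(bool enabled) {
        release_original_weights_ = enabled;
      }
      bool releaseOriginalWeights() const {
        return release_original_weights_;
      }

     private:
    #ifdef C10_MOBILE
      bool release_original_weights_ = true;   // mobile: drop originals by default
    #else
      bool release_original_weights_ = false;  // server: keep them so unpack() works
    #endif
    };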

@ezyang
Contributor

ezyang commented May 1, 2020

I think the methods on the packed weights are not thread safe, and we should probably document this.

Documentation would be good, though I'm not sure it's sufficient. I think this case is particularly dangerous because weights is morally a "read-only" argument, and the mutation is under the hood. In general, multithreaded reads to tensors are OK in PyTorch, and so that makes it doubly surprising that weights are not thread safe, unless you happened to know something about the underlying implementation.

That being said, I'm not really sure how you would fix this, except by adding an extra packing stage where you know the input range (which you've said above that you're not going to do; I'm curious to know why, but I could definitely see why this might be time consuming).

@kimishpatel
Contributor

sorry, I meant to ask "input range", not input size

I think I am more confused :). Can you elaborate on what you mean by input range and the context?

@kimishpatel
Contributor

Were we able to determine whether the JIT is retaining a reference to the "setstate" arguments?

Not known yet. Not sure if @ezyang has a better answer or maybe @zdevito?

@kimishpatel
Contributor

Documentation would be good, though I'm not sure it's sufficient. I think this case is particularly dangerous because weights is morally a "read-only" argument, and the mutation is under the hood. In general, multithreaded reads to tensors are OK in PyTorch, and so that makes it doubly surprising that weights are not thread safe, unless you happened to know something about the underlying implementation.

@ezyang, I should clarify. The weights of the model are still immutable. It is the structure in which they are wrapped, in this case the torchbind'd class, that is mutable and that is what gets mutated (at least in the XNNPACK integration code).

That being said, I'm not really sure how you would fix this, except by adding an extra packing stage where you know the input range (which you've said above that you're not going to do. I'm curious to know why, but I could definitely see why this might be time consuming to do).

I still don't understand the input range reference here.

@dreiss
Contributor

dreiss commented May 1, 2020

Does that mean the input scale and zero point? In the case of QNNPACK, that's not enough to make the prepack immutable.

supriyar added a commit that referenced this pull request May 2, 2020
ghstack-source-id: 6db0b53
Pull Request resolved: #37595
@supriyar supriyar requested a review from kimishpatel May 2, 2020 00:30
@@ -149,6 +149,14 @@ bool Context::isXNNPACKAvailable() const {
#endif
}

bool Context::releaseOriginalWeights() const {
Contributor


Also, can we choose a different name? Since as of now this applies specifically to QNNPACK, maybe we should call that out, e.g. releaseQNNPACKOriginalWeights. Not a great name perhaps, but the current one is too broad.

Contributor Author


Sure, I thought you might need to use the same flag for XNNPACK as well. If that is not the case then I will make it QNNPACK specific. Let me know.

Contributor


How about setReleaseWeightsWhenPrepacking?

@ezyang
Contributor

ezyang commented May 4, 2020

Does that mean the input scale and zero point? In the case of QNNPACK, that's not enough to make the prepack immutable.

Oops, I guess I meant input scale. Re your point about making it immutable, is this reflected in the code right now? When I was reading the code, I saw:

    // Re-quantizing the bias based on input scale and weight scale.
    if (!pack_data.input_scale.has_value() ||
        pack_data.input_scale.value() != input_scale) {

which suggested to me that input scale was the only thing being tested.

The weights of the model are still immutable. It is the structure in which they are wrapped, in this case the torchbind'd class, that is mutable and that is what gets mutated (at least in the XNNPACK integration code).

This doesn't feel materially different to me. You have a wrapper object as a stand in for the weights; unless you are telling me that this is actually some generalized "mutable context" for an entire operation that has more thread safety requirements?

@kimishpatel
Contributor

This doesn't feel materially different to me. You have a wrapper object as a stand in for the weights; unless you are telling me that this is actually some generalized "mutable context" for an entire operation that has more thread safety requirements?

@ezyang, I was just clarifying. In theory thread-safety is an issue, but practically speaking qnnpack_linear won't be called from multiple threads with the same wrapper object, assuming we don't have identical instances of the linear op as two different nodes in the graph. It is possible, yes, but not likely I assume. Please correct me if otherwise.
The way it is done today is largely due to how QNNPACK is implemented. This could be done differently, but that would probably be a little convoluted as well.
The same is not true for FP32 ops that use XNNPACK. There, the weights and their packed versions are filled in at model loading time and are immutable afterwards.

@@ -128,6 +130,7 @@ class CAFFE2_API Context {
bool deterministic_cudnn = false;
bool benchmark_cudnn = false;
bool enabled_mkldnn = true;
bool release_original_weights = false;
Contributor


Should we default this to true on mobile?

Contributor Author


Yeah, we could. The only issue I see with that is that someone trying out a mobile build on a server for experimentation purposes might want unpacking to work. But I'm okay with making it the default if you want.

@dreiss
Contributor

dreiss commented May 4, 2020

qnnpack_linear won't be called from multiple threads with the same wrapper object, assuming we don't have identical instances of the linear op as two different nodes in the graph. It is possible, yes, but not likely I assume. Please correct me if otherwise.

It could also happen if two different threads call into the same model at the same time.

@dreiss
Contributor

dreiss commented May 4, 2020

This doesn't feel materially different to me. You have a wrapper object as a stand in for the weights; unless you are telling me that this is actually some generalized "mutable context" for an entire operation that has more thread safety requirements?

Yeah, I think of it as an opaque context object. It's possible in theory to make it thread safe, but it's a significant chunk of work, and given the general lack of multi-threaded inference on mobile (except for within kernels), I don't think it's worth it.

@dreiss
Contributor

dreiss commented May 4, 2020

This change looks good to me overall. Any objections to me accepting?

supriyar added a commit that referenced this pull request May 5, 2020
ghstack-source-id: c52df46
Pull Request resolved: #37595
@supriyar supriyar requested review from dreiss and kimishpatel May 5, 2020 20:46
@ezyang
Contributor

ezyang commented May 5, 2020

This change looks good to me overall. Any objections to me accepting?

The change to use reset() adequately addresses the concerns I had related to the question @supriyar brought to PyTorch core. So on this limited aspect, I think the patch is fine.

I still find the overall design a bit questionable. Before I summarize my concerns, I want to preface that I am sort of parachuting into this without the historical context. So I don't think you should block this particular PR on these concerns. But I do think it's important to convey some of the lower level concerns.

  1. In general, most of everything in PyTorch is thread safe when used in a semantically read-only manner. In cases where it was convenient to maintain some mutable state as an optimization, this was done under the hood, in a thread safe way, invisibly to the user (cuDNN benchmarking is a good example). I'm sympathetic to the fact that we don't really do multithreaded mobile, and I know we have a lot of places in our code where on mobile we have ifdefs that just YOLO thread safety (e.g., fb limited capability, where the thread_local keyword isn't even supported). Nothing about the code being modified in this PR makes it clear that this is a "mobile only" thing. The releasing of original weights (from this PR) is intended to be mobile only, but then there is code (https://github.com/pytorch/pytorch/pull/37595/files#diff-83e2267c4f232a4a83c2fc67a9551697) which lets you toggle this functionality from Python. Any reasonable engineer would infer, by reading this, that this feature was intended to be used from server (otherwise, why would there be Python bindings for it?) And there is still the point that the server-side repacking of quantization parameters is not thread safe. So, to respond directly to @dreiss's comment, "It's possible in theory to make it thread safe, but it's a significant chunk of work, and given the general lack of multi-threaded inference on mobile (except for within kernels), I don't think it's worth it": this argument doesn't hold water for the CPU side of things!
  2. I disagree that it would be a lot of work to make things thread safe. If the object is truly never to be used in a multithreaded setting, you can simply give bad performance characteristics (or throw an error) when the context is used in a multithreaded setting. The primitive you need is just a way to atomically retrieve the packed Tensor; since our refcounting is thread safe, that's sufficient to guarantee that the packed Tensor stays valid even if another thread decides to repack. As for repacking, you can just take out a big lock in that case. I think this level of care is warranted, because it is really, really painful to debug multithreading errors when you know nothing about what is going on internally, and as established in (1), it doesn't seem like there are enough safety barriers to prevent intrepid end users from trying to make something multithreaded.
  3. In general, we like to minimize situations where (1) there is a global toggle on behavior (e.g., releaseWeightsWhenPrepacking) that makes previously working programs start erroring, and (2) there are mobile ifdefs (e.g., changing the default of this behavior). So it makes my heart bleed when we add another mobile-specific behavior toggle. Obviously, there are cases when there's no other choice, and you've gotta do it. But if it's possible to internalize the behavior change to the model itself (e.g., when allocating the qnnpack context, specify whether or not it releases weights on packing), that makes it easier to test this behavior, since you no longer have to build on mobile and/or toggle a global flag; you can just test the behavior with respect to a change in the function parameter. (Speaking of which: where are the tests? :)
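
For point 2, a minimal sketch of what that could look like (illustrative types, not actual PyTorch code): readers take a refcounted handle to the packed data so it stays valid even if another thread repacks, and repacking itself serializes behind a single lock.

    #include <memory>
    #include <mutex>

    struct PackedBlob { /* packed QNNPACK weights */ };

    class PackedWeightCell {
     public:
      // Readers get a shared_ptr; refcounting keeps the old packing alive for
      // this caller even if another thread swaps in a new one.
      std::shared_ptr<const PackedBlob> get() const {
        std::lock_guard<std::mutex> guard(mu_);
        return packed_;
      }

      // "Big lock" around repacking: concurrent repacks simply wait.
      void repack(double input_scale) {
        auto fresh = std::make_shared<PackedBlob>();
        // ... fill `fresh` by packing the weights for `input_scale` ...
        (void)input_scale;  // unused in this sketch
        std::lock_guard<std::mutex> guard(mu_);
        packed_ = std::move(fresh);
      }

     private:
      mutable std::mutex mu_;
      std::shared_ptr<const PackedBlob> packed_;
    };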

@supriyar
Contributor Author

supriyar commented May 6, 2020

Thanks for the review and comments. Addressing some of them here

Any reasonable engineer would infer, by reading this, that this feature was intended to be used from server (otherwise, why would there be Python bindings for it?)

I intended to remove the python bindings for it. I didn't update the PR, my bad...

I know we have a lot of places in our code where on mobile we have ifdefs that just YOLO thread safety (e.g., fb limited capability where thread_local keyword isn't even supported.)

Looks like thread safety on mobile is a general issue that will need to be addressed even outside of this PR. In this case I can throw an error if we try to access the original weights (say from another thread) in the pack logic after they have been released by another thread. Checking pack_ptr.orig_weight.defined() should achieve that. @dreiss, thoughts?
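
A minimal sketch of that guard (hypothetical helper name; orig_weight and defined() follow the names used in this thread, and TORCH_CHECK is the standard ATen error macro):

    #include <ATen/ATen.h>
    #include <c10/util/Exception.h>

    // Fail loudly if the pack logic is reached after the original weights have
    // already been released (e.g. by another thread, or on a prior run).
    void check_original_weights_present(const at::Tensor& orig_weight) {
      TORCH_CHECK(
          orig_weight.defined(),
          "Cannot repack QNNPACK weights: the original weights were released "
          "after prepacking (releaseOriginalWeights is enabled).");
    }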

we like to minimize situations where (1) there is a global toggle on behavior (e.g., releaseWeightsWhenPrepacking)

For this case I'm not sure how we can achieve this without modifying the APIs. One solution I can think of is modifying the *_prepack functions to take an additional argument specifying whether the original weights should be released. But prepack isn't called directly by the user (it gets called as part of the __getstate__/__setstate__ calls in the module), so we would have to modify the module init function to swap it. But that would mean having different modules (for Linear and Conv) for mobile vs server, which would make this change much more involved.

I'd also like to add that this is blocking some folks, so I'm trying to get this to land at the earliest.

supriyar added a commit that referenced this pull request May 6, 2020
ghstack-source-id: 7272533
Pull Request resolved: #37595
@ezyang
Contributor

ezyang commented May 6, 2020

I'd also like to add that this is blocking some folks, so I'm trying to get this to land at the earliest.

Yes. Reamplifying: you shouldn't block this PR on my comments, they're for thinking about the future. I don't think I'm the right person to give the literal Approve here, though.

supriyar added 2 commits May 6, 2020 18:08
supriyar added a commit that referenced this pull request May 7, 2020
ghstack-source-id: 8e3272c
Pull Request resolved: #37595
@facebook-github-bot
Contributor

This pull request has been merged in 7bf9d98.

@dreiss
Contributor

dreiss commented May 7, 2020

But I do think it's important to convey some of the lower level concerns.

Points 1 and 2 are very convincing. I'll make sure we follow up to put locks around all of these mutable packed weights in {Q,X}NNPACK. For release_original_weights, maybe we can just make it a std::atomic?

For point 3, I don't know how to encode this in the model. For a given model, I might want to load it, wrap it, and save it. In that case, the original weights must be preserved. Or I might want to load it and run it. In that case, I don't need the original weights, and freeing them might be necessary to hit my memory budget. So I think it needs to be controllable even for a single model. And since server-side seems to not really care about memory usage (at least on the QNNPACK scale) and mobile is very unlikely to want to re-save a model, the divergent defaults give us automatically correct behavior with no user intervention in all cases that I'm aware of.

@kimishpatel
Contributor


Do note that, if we were to pack weights inside the prepacking ops, then with torchbind support we remove these ops completely. They become attributes of the model and will be packed at model load time (sadly, for QNNPACK we are not there yet).
If packing happens at model load time then I think this is the safest option, avoiding the need for locks and maintaining the invariant of immutability.
That said, the fact that the prepacking ops exist as separate ops does expose us to the thread-safety issue. So if you have a model where the prepacking ops are not removed then you are not maintaining the immutability invariant. As of now for XNNPACK the optimization JIT pass converts the model to contain the prepacking variants of conv and linear, as well as removing the prepacking ops. Perhaps such a pass should make sure that this is done atomically.

@facebook-github-bot facebook-github-bot deleted the gh/supriyar/103/head branch May 11, 2020 14:20
7 participants