[quant] Release qnnpack original weights for conv/linear #37595
Conversation
This is really great. Thanks, Supriya. I left a couple of comments.
As @kimishpatel says, I'm generally not a fan of any approach that involves explicitly freeing the original weights, as it is hard to ensure that no one else attempts to access the weights from somewhere else in the meantime. If we're in the universe where we can special-case mobile, my ideal solution would be to simply avoid saving the original weights at all on mobile. This results in a very simple invariant, which is that you can never call unpack on mobile. Is there a reason this is not possible? (If we need a solution that can generalize to server, I think things are a bit harder.)
Thanks for the context @ezyang. For conv and linear ops that use QNNPACK we have a pre-pack function that explicitly packs the weights as a separate aten function call, which is invoked when loading the model. However, QNNPACK was originally written with Caffe2 functionality in mind; in PyTorch we need the input scale in order to do the packing, and that is only available at runtime. So we actually pack the weights during the first run call (see https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qlinear.cpp#L263) and store them in memory (essentially rendering the pre-pack op for QNNPACK a no-op).
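A minimal sketch of the lazy-packing pattern described above; the names are illustrative only, not the actual qlinear.cpp code, and `pack_weights` is a hypothetical helper:

```cpp
// Sketch only: the real logic lives in
// aten/src/ATen/native/quantized/cpu/qlinear.cpp and differs in detail.
#include <memory>

#include <ATen/ATen.h>

struct PackedMatrix {};  // stand-in for QNNPACK's packed-weight representation

// Hypothetical helper: packs `weight` for the given input scale.
inline std::unique_ptr<PackedMatrix> pack_weights(
    const at::Tensor& /*weight*/, double /*input_scale*/) {
  return std::make_unique<PackedMatrix>();
}

struct PackedLinearWeightsSketch {
  at::Tensor orig_weight;           // retained so unpack() can return it
  std::unique_ptr<PackedMatrix> w;  // built lazily; needs the input scale
  double input_scale = -1.0;        // sentinel: not packed yet

  // Called on the first run (and again if the input scale changes),
  // because the input scale is only known at runtime.
  void pack_if_needed(double scale) {
    if (!w || scale != input_scale) {
      input_scale = scale;
      w = pack_weights(orig_weight, scale);
    }
  }
};
```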
So even if we put aside the issue of the prepacking op actually being a no-op, the problem persists regardless of that.
To help me understand the situation better: what happens if the runtime input size varies on mobile? Also, what are the thread-safety guarantees for these operators?
@ezyang, we are just talking about weight packing here, so the input size can vary. That is ok.
Sorry, I meant to ask about "input range", not input size.
I think the methods on the packed weights are not thread safe, and we should probably document this. I agree with Kimish that we should not have a hard ifdef to control this. My vote would be a global flag, checked when packing, that says whether to retain the original weights or not. Having an ifdef to make that flag default to true on mobile and false on server should give us the best default experience while retaining the flexibility to save on mobile if we need to. I agree with Edward that explicitly freeing weights is dangerous. Were we able to determine whether the JIT is retaining a reference to the "setstate" arguments?
Documentation would be good, though I'm not sure it's sufficient. I think this case is particularly dangerous because the weights are morally a "read-only" argument, and the mutation happens under the hood. In general, multithreaded reads of tensors are OK in PyTorch, so it is doubly surprising that the weights are not thread safe unless you happen to know something about the underlying implementation. That being said, I'm not really sure how you would fix this, except by adding an extra packing stage where you know the input range (which you've said above you're not going to do; I'm curious to know why, but I can definitely see how it might be time-consuming).
I think I am more confused :). Can you elaborate on what you mean by input range and the context?
@ezyang, I should clarify. The weights of the model are still immutable. It is the structure in which they are wrapped, in this case a torchbind'd class, that is mutable, and that is what gets mutated (at least in the XNNPACK integration code).
I still don't understand the input range reference here.
Does that mean the input scale and zero point? In the case of QNNPACK, that's not enough to make the prepack immutable.
aten/src/ATen/Context.cpp (Outdated)

@@ -149,6 +149,14 @@ bool Context::isXNNPACKAvailable() const {
#endif
}

bool Context::releaseOriginalWeights() const {
Also, can we choose a different name? Since as of now this applies specifically to QNNPACK, maybe we should call that out, like releaseQNNPACKOriginalWeights. Not a great name perhaps, but the current one is too broad.
Sure, I thought you might need to use the same flag for XNNPACK as well. If that is not the case then I will make it QNNPACK-specific. Let me know.
How about setReleaseWeightsWhenPrepacking?
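For reference, a rough sketch of what the flag's accessors on at::Context could look like under the name proposed above; this is an assumed shape, not necessarily the exact code that was merged:

```cpp
// Sketch only: flag plus accessors under the proposed name. The real
// Context class lives in ATen and has many more members.
class Context {
 public:
  void setReleaseWeightsWhenPrepacking(bool release) {
    release_original_weights = release;
  }
  bool releaseWeightsWhenPrepacking() const {
    return release_original_weights;
  }

 private:
  bool release_original_weights = false;
};
```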
Oops, I guess I meant input scale. Re your point about making it immutable, is this reflected in the code right now? When I was reading the code, I saw:
which suggested to me that input scale was the only thing being tested.
This doesn't feel materially different to me. You have a wrapper object as a stand-in for the weights; unless you are telling me that this is actually some generalized "mutable context" for an entire operation that has more thread-safety requirements?
@ezyang, I was just making a clarification. In theory thread-safety is an issue, but practically speaking
@@ -128,6 +130,7 @@ class CAFFE2_API Context {
bool deterministic_cudnn = false;
bool benchmark_cudnn = false;
bool enabled_mkldnn = true;
bool release_original_weights = false;
Should we default this to true on mobile?
Yeah, we could. The only issue I see with that is that someone trying out a mobile build on a server for experimentation purposes might want unpacking to work. But I'm okay with making it the default if you want.
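A short sketch of the ifdef-controlled default being discussed here, assuming C10_MOBILE is the macro gating mobile builds (as the PR summary states):

```cpp
// Release the original weights by default on mobile builds (to save memory),
// but keep them by default elsewhere so unpack() keeps working.
#ifdef C10_MOBILE
  bool release_original_weights = true;
#else
  bool release_original_weights = false;
#endif
```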
It could also happen if two different threads call into the same model at the same time.
Yeah, I think of it as an opaque context object. It's possible in theory to make it thread safe, but it's a significant chunk of work, and given the general lack of multi-threaded inference on mobile (except for within kernels), I don't think it's worth it.
This change looks good to me overall. Any objections to me accepting?
I still find the overall design a bit questionable. Before I summarize my concerns, I want to preface that I am sort of parachuting into this without the historical context, so I don't think you should block this particular PR on these concerns. But I do think it's important to convey some of the lower-level concerns.
Thanks for the review and comments. Addressing some of them here:
I intended to remove the python bindings for it. I didn't update the PR, my bad...
Looks like thread safety on mobile is a general issue that will need to be addressed even outside of this PR. In this case I can throw an error if we try to access the original weights (say from another thread) in the pack logic after they have been released by another thread. Checking
For this case I'm not sure how we can achieve this without modifying the APIs. One solution I can think of is modifying the
I'd also like to add that this is blocking some folks, so I'm trying to get this to land at the earliest.
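A minimal sketch of the error check mentioned above (hypothetical names; it assumes the packed structure simply drops the original weight tensor once it has been released):

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>
#include <c10/util/Optional.h>

struct PackedWeightsSketch {
  c10::optional<at::Tensor> orig_weight;  // cleared after release on mobile

  // Release the original weights after prepacking to free memory.
  void release_original_weights() {
    orig_weight = c10::nullopt;
  }

  // unpack() path: fail loudly instead of returning stale or freed data.
  at::Tensor unpack_or_throw() const {
    TORCH_CHECK(
        orig_weight.has_value(),
        "Cannot unpack: original weights were released after prepacking "
        "(release_original_weights is enabled, e.g. on mobile builds).");
    return *orig_weight;
  }
};
```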
Yes. Reamplifying: you shouldn't block this PR on my comments; they're for thinking about the future. I don't think I'm the right person to give the literal Approve here, though.
This pull request has been merged in 7bf9d98.
Points 1 and 2 are very convincing. I'll make sure we follow up to put locks around all of these mutable packed weights in {Q,X}NNPACK. For point 3, I don't know how to encode this in the model. For a given model, I might want to load it, wrap it, and save it; in that case, the original weights must be preserved. Or I might want to load it and run it; in that case, I don't need the original weights, and freeing them might be necessary to hit my memory budget. So I think it needs to be controllable even for a single model. And since server-side seems to not really care about memory usage (at least on the QNNPACK scale) and mobile is very unlikely to want to re-save a model, the divergent defaults give us automatically correct behavior with no user intervention in all cases that I'm aware of.
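As a usage illustration of the two workflows described above; the setter name follows the naming discussion earlier in this thread and may not match the merged API exactly:

```cpp
#include <ATen/Context.h>

// Inference-only (mobile-style): free the original weights after prepacking
// to save memory; unpack() is no longer possible afterwards.
void configure_for_inference_only() {
  at::globalContext().setReleaseWeightsWhenPrepacking(true);
}

// Load-wrap-save workflow: keep the original weights so the model can be
// unpacked and re-serialized.
void configure_for_resave() {
  at::globalContext().setReleaseWeightsWhenPrepacking(false);
}
```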
Do note that, if we were to pack weights inside the prepacking ops, then with torchbind support we could remove these ops completely. They would become an attribute of the model and would be packed at model load time (sadly, for QNNPACK we are not there yet).
Stack from ghstack:
Summary:
QNNPACK currently does not support an unpack function, so we store the original weights in the packed structure and return them directly to the user when unpack is called.
However, for memory-constrained environments (like mobile), storing these extra weights in memory is expensive. We need to release these weights after packing on mobile to free up the memory. As a side effect, the user cannot call unpack on mobile once the model is run.
The change is gated by C10_MOBILE which is enabled for mobile builds.
The change saves 36MB on device for Speech Model.
Test Plan:
python test/test_quantization.py
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D21365495