Changes to enable per channel requant. #37620
Conversation
Summary: Channel-wise quantization is now supported for linear/conv. Depthwise convs are still pending. Tests are altered to generate per-channel zero points and requant scales. All the kernels are fixed appropriately. Added a per_channel member to the conv_param structure, and replicated conv tests to exercise per-channel conv. This was not strictly needed, since the conv kernels were changed such that they did per-channel anyway; when per-channel is not needed, the zero point and scale are simply the same across channels. This was done to minimize code duplication, as the perf impact is estimated (though still to be measured) to be low. However, this is likely not the case for depthwise convs, so they will have separate kernels, which required us to introduce the per_channel member to the conv_param structure to know which kernels to apply for depthwise. The ensuing modifications were to keep everything in sync for both regular and depthwise conv, so that when reading the code there is no caveat about why depthwise has a separate per-channel test while non-depthwise conv does not. Test Plan: Via tests inside qnnpack, i.e., q8gemm-test, q8conv/dwconv-test, fully-connected-test, convolution-test. Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
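As a reader's aid, the requantization step the summary describes can be sketched as a small helper that looks up the scale and zero point by output channel. This is a hypothetical illustration, not QNNPACK's actual API; all names here are invented for the sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: requantize one int32 accumulator back to uint8,
// indexing the per-channel requant scale and output zero point by
// output channel. Names are hypothetical, not QNNPACK's symbols.
inline std::uint8_t requantize_per_channel(
    std::int32_t acc,
    const float* requant_scales,        // one requant scale per output channel
    const std::uint8_t* output_zps,     // one zero point per output channel
    std::size_t output_channel_index) {
  float scaled = static_cast<float>(acc) * requant_scales[output_channel_index];
  long rounded = std::lrintf(scaled) +
                 static_cast<long>(output_zps[output_channel_index]);
  if (rounded < 0) rounded = 0;         // clamp to the uint8 range
  if (rounded > 255) rounded = 255;
  return static_cast<std::uint8_t>(rounded);
}
```

When the per-channel path is not needed, the same arrays simply hold one repeated value, which is why a single kernel can serve both modes.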
💊 CI failures summary and remediations. As of commit dcc721f (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failure does not appear to be due to upstream breakages: pytorch_windows_vs2019_py36_cuda10.1_build (1/1). Step: "Build" (full log | diagnosis details | 🔁 rerun)
Summary: Channel-wise quantization is now supported for linear/conv. Depthwise convs are still pending. The approach is the same as with zero points: a pointer to the requantization array is passed, and it is looked up using output_channel_index. All the kernels are appropriately modified except for the depthwise ones. Tests are altered to generate per-channel zero points and requant scales. Unit tests are replicated for conv to exercise per-channel conv. This was not strictly needed, since the conv kernels were changed such that they did per-channel anyway; when per-channel is not needed, the zero point and requant scale are simply the same across channels. However, for depthwise convolutions we will be using a different set of kernels to do per-channel, which required us to introduce the per_channel member to the conv_param structure to know which kernels to use for depthwise conv. The ensuing modifications were to keep everything in sync for both regular and depthwise conv, so that when reading the code there is no caveat about why depthwise has a separate per-channel test while non-depthwise conv does not. Test Plan: Via tests inside qnnpack, i.e., q8gemm-test, q8conv/dwconv-test, fully-connected-test, convolution-test. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D21339041](https://our.internmc.facebook.com/intern/diff/D21339041) [ghstack-poisoned]
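The role of the per_channel flag in the conv parameters can be illustrated with a toy dispatch sketch: regular conv kernels handle both modes in one code path, while depthwise needs the flag to pick a dedicated kernel variant. The struct and kernel names below are hypothetical, not the actual QNNPACK symbols.

```cpp
#include <string>

// Toy sketch of kernel selection driven by a per_channel member.
struct ConvParamsSketch {
  bool depthwise;
  bool per_channel;
};

std::string pick_kernel(const ConvParamsSketch& p) {
  if (!p.depthwise) {
    // One kernel covers both modes: with uniform quantization, every
    // channel just shares the same zero point and requant scale.
    return "q8conv";
  }
  // Depthwise gets a separate per-channel kernel variant.
  return p.per_channel ? "q8dwconv_per_channel" : "q8dwconv";
}
```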
Looked at all but the microkernels and it looks good except for minor comments.
@@ -505,8 +505,10 @@ at::Tensor PackedConvWeightsQnnp<kSpatialDim>::apply_impl(
double act_input_scale = act_nhwc.q_scale();
Revert?
@@ -5,6 +5,61 @@
#include <cstdlib>

namespace qnnpack {
// For runtime quantization packing.
Is runtime quantization the same as dynamic quantization? If so, maybe a follow-up diff to unify the naming.
Dynamic quantization is when you quantize your input data every single time and dequantize the output to fp32. Runtime quantization is the other mode. The only reason it is called "runtime" is the way QNNPACK was integrated into PyTorch.
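The "dynamic" mode described above can be sketched as choosing the input's quantization parameters at every call from its observed min/max (the output is later dequantized to fp32). This is a toy illustration under that description, not QNNPACK's API; the struct and function names are invented.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: pick asymmetric uint8 quantization parameters
// for an fp32 input at call time, as dynamic quantization does.
struct QuantParams {
  float scale;
  std::uint8_t zero_point;
};

QuantParams choose_dynamic_params(const std::vector<float>& x) {
  // Include 0 in the range so real zero is exactly representable.
  float lo = std::min(0.0f, *std::min_element(x.begin(), x.end()));
  float hi = std::max(0.0f, *std::max_element(x.begin(), x.end()));
  float scale = (hi - lo) / 255.0f;
  if (scale == 0.0f) scale = 1.0f;  // degenerate all-zero input
  float zp = std::round(-lo / scale);
  zp = std::min(255.0f, std::max(0.0f, zp));
  return {scale, static_cast<std::uint8_t>(zp)};
}
```

In the "runtime" mode, by contrast, these parameters are fixed ahead of time and reused across calls.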
if (kzp != 0) {
// This part fills the packed weights with zero points for output channels
// when they are not divisible by the nr blocking parameter.
// In that case
Missing the end of this comment?
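For context, the padding idea the truncated comment gestures at can be sketched with a toy packing layout (not the real packing routine; names are illustrative): when the output-channel count is not a multiple of the nr blocking parameter, the tail of the packed buffer is filled with the kernel zero point so the microkernel's fixed-width loads read neutral values.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy sketch: round the channel count up to a multiple of nr and fill
// the padded tail with the kernel zero point.
std::vector<std::uint8_t> pack_with_zp_padding(
    const std::vector<std::uint8_t>& per_channel_values,
    std::size_t nr,
    std::uint8_t kernel_zero_point) {
  std::size_t padded = (per_channel_values.size() + nr - 1) / nr * nr;
  std::vector<std::uint8_t> packed(padded, kernel_zero_point);
  std::copy(per_channel_values.begin(), per_channel_values.end(),
            packed.begin());
  return packed;
}
```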
Got halfway through, will continue tomorrow.
@@ -57,8 +57,12 @@ BEGIN_FUNCTION pytorch_q8conv_ukernel_8x8__aarch64_neon
LDR x9, [sp]
# Load pointer to per channel zero points array
LDR x10, [x8]
# To go to a_zero_point
ADD x8, x8, 8
LDR x10, [x8], 8
Doesn't compile on Android.
ADD x8, x8, 4
# Load pointer to per channel requant scale
LDR x10, [x8, 8]!
ADD x8, x8, 8
LD1R {v24.8b}, [x8], 8
LDR x10, [x8], 8
Same problem as before with LDRs.
// - v26 = requantization_scale channels 0-3
// - v31 = requantization_scale channels 4-7
LD1 {v26.4s}, [x10], 16
LD1 {v30.4s}, [x10]
This is a long shot, but I am wondering if you can do this and remove the LSL x9, x9, 2 and ADD x10, x10, x9 above, in case the values of x9 and x10 do not matter after this point:
LD1 {v26.4s}, [x10], x9, lsl 2
LD1 {v30.4s}, [x10, 16]
Does that form exist? I checked, but it does not seem to me that a variant of LD1 exists where the shift can be folded in.
It does, at least for LDR and STR. Check section 3, titled "Offset form: Scaled register as the offset" of this document here.
Yes, for LDR there is, but not for LD1. I think I can still use this in the base + offset calculation done via ADD.
LGTM. thanks.
@@ -100,14 +108,10 @@ BEGIN_FUNCTION pytorch_q8conv_ukernel_4x8__aarch32_neon
# Load a_zero_point:
# - d14 = a_zero_point
VLD1.8 {d14[]}, [r9]
ADD r9, r9, 4
# add 8 bytes to get to vfmax
12?
Sorry, I did not fix it here. It gets fixed by the perf-related PR.
LD1R {v26.4s}, [x8], 4
// - v26 = requantization_scale channels 0-3
// - v27 = requantization_scale channels 4-7
LD1 {v26.4s}, [x17], 16
At a glance it seems to me that you can save a couple of instructions in the block above if interested. It basically boils down to modifying the pointer arithmetic to take advantage of the free offsetting in the loads.
Yes, the pointer arithmetic can save the LSL by folding it in. For LD1 I don't think so, since it has only no-offset and post-index variants.
I am going to make these changes in the perf-optimization PR.
Summary: Now channel wise quantization is supported for linear/conv. Depthwise convs are still pending. Approach is same as with zero points. Pointer to requantization array is passed and it is looked up using output_channel_index. All the kernels are appropriately modified except for the depthwise ones. Tests are altered to generate per channel zero points and requant scales. Unit tests are replicated for conv tests to exercise per_channel conv. This was not strictly needed since conv kernels were changed such that they did per channel anyway. When per channels is not needed zero point and requant scale were same across channels. However for depthwise convolutions we will be using different set of kernels to do per channel, which required us to introduce per_channel member to conv_param structure, to know which kernels to use for depthwise conv. Ensuing modifications were to keep everything in sync for both regular conv and depthwise so that we dont have caveat when reading the code, that why does depthwise have separate test for per channel and non-depthwise conv does not. Test Plan: Via tests inside qnnpack, i.e., q8gemm-test, q8conv/dwconv test. fully-conntected-test, convolution-test. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D21339041](https://our.internmc.facebook.com/intern/diff/D21339041) [ghstack-poisoned]
Summary: Now channel wise quantization is supported for linear/conv. Depthwise convs are still pending. Approach is same as with zero points. Pointer to requantization array is passed and it is looked up using output_channel_index. All the kernels are appropriately modified except for the depthwise ones. Tests are altered to generate per channel zero points and requant scales. Unit tests are replicated for conv tests to exercise per_channel conv. This was not strictly needed since conv kernels were changed such that they did per channel anyway. When per channels is not needed zero point and requant scale were same across channels. However for depthwise convolutions we will be using different set of kernels to do per channel, which required us to introduce per_channel member to conv_param structure, to know which kernels to use for depthwise conv. Ensuing modifications were to keep everything in sync for both regular conv and depthwise so that we dont have caveat when reading the code, that why does depthwise have separate test for per channel and non-depthwise conv does not. Test Plan: Via tests inside qnnpack, i.e., q8gemm-test, q8conv/dwconv test. fully-conntected-test, convolution-test. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D21339041](https://our.internmc.facebook.com/intern/diff/D21339041) [ghstack-poisoned]
This pull request has been merged in 1f16d4c. |
Stack from ghstack:
Summary:
Channel-wise quantization is now supported for linear and conv. Depthwise convs are still pending.
The approach is the same as with zero points: a pointer to the requantization-scale array is passed into the kernel and indexed with output_channel_index.
All the kernels are modified appropriately except the depthwise ones.
Tests are altered to generate per-channel zero points and requantization scales.
The conv unit tests are replicated to exercise per_channel conv. This was not strictly needed, since the conv kernels were changed to operate per channel anyway; when per-channel quantization is not needed, the zero point and requantization scale are simply identical across channels.
Depthwise convolutions, however, will use a separate set of kernels for the per-channel case, which required introducing a per_channel member in the conv_param structure so that the correct depthwise kernels can be selected.
The remaining modifications keep regular conv and depthwise conv in sync, so a reader is not left wondering why depthwise has a separate per-channel test while non-depthwise conv does not.
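The kernel selection that the new per_channel member enables can be sketched roughly as follows. This is a hypothetical illustration, not QNNPACK code: ConvParam and the kernel names are placeholders; only the per_channel flag itself comes from this change.

```python
from dataclasses import dataclass

@dataclass
class ConvParam:
    groups: int
    per_channel: bool  # new member introduced by this change

def select_dwconv_kernel(p: ConvParam) -> str:
    # Placeholder kernel names, not real QNNPACK symbols: with per_channel
    # set, setup picks the depthwise kernel that looks up a scale and zero
    # point per output channel; otherwise the original kernel is used.
    return "q8dwconv_per_channel" if p.per_channel else "q8dwconv"
```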
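The per-channel requantization described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the actual fixed-point QNNPACK kernel: the function name and shapes are assumptions, but the per-output-channel scale/zero-point lookup mirrors the output_channel_index indexing the change introduces.

```python
import numpy as np

def requantize(acc, scales, zero_points, qmin=0, qmax=255):
    # acc: (out_channels, n) int32 accumulators from the GEMM/conv.
    # scales / zero_points: one entry per output channel.
    out = np.empty(acc.shape, dtype=np.uint8)
    for c in range(acc.shape[0]):  # the output_channel_index lookup
        q = np.rint(acc[c] * scales[c]) + zero_points[c]
        out[c] = np.clip(q, qmin, qmax).astype(np.uint8)
    return out

acc = np.array([[100, -50], [400, 30]], dtype=np.int32)
# Distinct per-channel parameters...
per_ch = requantize(acc, scales=[0.5, 0.1], zero_points=[10, 128])
# ...versus identical parameters across channels, which reproduces
# per-tensor behavior -- matching the note above that uniform
# zero point and scale make the per-channel path equivalent.
uniform = requantize(acc, scales=[0.5, 0.5], zero_points=[10, 10])
```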
Test Plan:
Via the tests inside QNNPACK, i.e., q8gemm-test, q8conv-test, q8dwconv-test,
fully-connected-test, and convolution-test.
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D21339041