[Quant][FX] Lower QLinearLeakyReLU for onednn backend #88668
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88668
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 1770001. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -17,4 +18,5 @@
    "BackendPatternConfig",
    "DTypeConfig",
    "ObservationType",
    "get_onednn_backend_config",
should this change happen in the previous PR?
OK
def test_fuse_linear_bn_leaky_relu_eval(self):
    # linear - bn - leaky_relu is fused for the onednn backend only
    from torch.ao.quantization.backend_config import get_onednn_backend_config
    expected_nodes = [
        ns.call_module(nni.LinearLeakyReLU),
    ]
    expected_occurrence = {
        ns.call_module(nn.BatchNorm1d): 0,
        ns.call_module(nn.LeakyReLU): 0,
    }

    for with_bn in [True, False]:
        # test eval mode
        m = LinearBnLeakyReluModel(with_bn).eval()
        # fuse_fx is a top-level API and only supports eval mode
        m = fuse_fx(m,
                    backend_config=get_onednn_backend_config())
        self.checkGraphModuleNodes(
            m,
            expected_node_list=expected_nodes,
            expected_node_occurrence=expected_occurrence)

def test_no_fuse_linear_bn_leaky_relu_eval(self):
these two tests should be in the previous PR as well, I think
OK
@@ -253,6 +253,7 @@ def should_skip_lowering(op: torch.fx.node.Node, qconfig_map: Dict[str, QConfigA
# 2) The replacement static quantized module class for lowering
STATIC_LOWER_FUSED_MODULE_MAP: Dict[Type[nn.Module], Tuple[Type[nn.Module], Type[WeightedQuantizedModule]]] = {
    nni.LinearReLU: (nnqr.Linear, nniq.LinearReLU),
    nni.LinearLeakyReLU: (nnqr.Linear, nniq.LinearLeakyReLU),
I guess this change is related to lowering, but currently it is not tested. We should add a test that goes through the PTQ flow of FX graph mode quantization and lowers the linear - leaky_relu pattern, and probably another test for linear - bn - leaky_relu.
I have added a test case in test/quantization/fx/test_quantize_fx.py as TestQuantizeFx.test_linear_leaky_relu_lowering.
do we also fuse this for fbgemm and qnnpack? Should we skip it for those backends?
No. We don't fuse this for other backends. I have added a test case to confirm the pattern is not fused by default.
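For illustration, a minimal sketch of such a check (a hypothetical toy model; fuse_fx and nni.LinearLeakyReLU are standard PyTorch names, but this is not the exact test that landed):

import torch
import torch.nn as nn
import torch.ao.nn.intrinsic as nni
from torch.ao.quantization.quantize_fx import fuse_fx

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.leaky_relu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leaky_relu(self.linear(x))

# With the default backend_config (no onednn config passed),
# linear + leaky_relu should remain separate modules after fusion.
m = fuse_fx(M().eval())
assert not any(isinstance(mod, nni.LinearLeakyReLU) for mod in m.modules())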
in that case I think we should have a separate lowering function for onednn?
For fbgemm and qnnpack, linear + leaky_relu are not fused; they are lowered separately. If users specify backend='onednn', they still need to use onednn's backend config explicitly for prepare_fx and convert_fx to fuse and lower linear - leaky_relu. So, a new lowering function is not needed.
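For reference, a minimal sketch of that explicit flow (assuming the standard FX PTQ APIs get_default_qconfig_mapping, prepare_fx, and convert_fx; the toy model and shapes are illustrative, not taken from the PR's tests):

import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization.backend_config import get_onednn_backend_config

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.leaky_relu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leaky_relu(self.linear(x))

m = M().eval()
example_inputs = (torch.randn(1, 4),)
qconfig_mapping = get_default_qconfig_mapping("onednn")
# The onednn backend config must be passed explicitly to both prepare_fx
# and convert_fx; otherwise linear + leaky_relu is not fused or lowered
# as a single pattern.
m = prepare_fx(m, qconfig_mapping, example_inputs,
               backend_config=get_onednn_backend_config())
m(*example_inputs)  # calibration
m = convert_fx(m, backend_config=get_onednn_backend_config())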
yeah putting the comments here looks good
@@ -110,6 +110,7 @@
    nni.ConvReLU2d: nniq.ConvReLU2d,
    nni.ConvReLU3d: nniq.ConvReLU3d,
    nni.LinearReLU: nniq.LinearReLU,
    nni.LinearLeakyReLU: nniq.LinearLeakyReLU,
same here, this is in the eager mode quantization flow; we should add a similar test in https://github.com/pytorch/pytorch/blob/master/test/quantization/eager/test_quantize_eager_ptq.py#L76 I think
We only support fusion and lowering in FX mode. I have added a test case in test_quantize_fx.py.
this one needs some changes, please see comments inline
Hi @jerryzh168. I have made the changes per your comments. Do you have more comments? Thanks!
Hi @jerryzh168. Is it OK to land this? Thanks!
Hi @jerryzh168. Do you have more comments on this PR? Thanks!
need to discuss how lowering works for onednn, when these things are not supported in fbgemm/qnnpack
I guess the current behavior is: if people configure quantization for linear -> leaky_relu, it will produce the quantized::linear_leaky_relu op, but when you run the model it will error out and say it is not supported in fbgemm/qnnpack. Is that correct?
No, linear + leaky_relu fusion is only enabled if users use onednn's backend config explicitly (#88668 (comment)).
Linear + leaky_relu are not fused for fbgemm/qnnpack.
I see, can we add a comment to the lowering code to explain the behavior? I think there are two things:
1) backend_config
2) quantized engine
I think currently we are saying that unless we set backend_config to the onednn backend_config, we won't see the fused linear - leaky_relu pattern. But my question is: if we do have the linear - leaky_relu pattern (e.g. if a user accidentally uses the onednn backend_config), the current lowering code will lower it, since it is shared by all backends, and if the quantized engine is set to fbgemm, we'll run quantized::linear_leaky_relu with the fbgemm backend.
I think in the future, since we are asking users to explicitly specify backend_config, we should probably also ask users to explicitly call the lowering code, and we should have separate lowering code for each backend as well.
Maybe a comment explaining the above would be helpful.
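As a hedged illustration of the two knobs being distinguished here (both are standard PyTorch APIs; the available engines depend on the build):

import torch
from torch.ao.quantization.backend_config import get_onednn_backend_config

# Knob 1: backend_config decides which patterns get fused and lowered at
# prepare/convert time (linear + leaky_relu only appears with this config).
backend_config = get_onednn_backend_config()

# Knob 2: the quantized engine decides which kernel library executes the
# lowered ops at runtime, and is set independently of the backend_config.
print(torch.backends.quantized.supported_engines)  # build-dependent list
torch.backends.quantized.engine = "onednn"  # assumes an onednn-enabled build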
I see. I have added comments.
by "backend" you are talking about quantized engine right? we don't have this mechanism right now, but we might consider adding this in the future I think |
Yes, I mean qengine here. |
I think we may need to store target
I see. Now people use
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Summary
Add quantization mappings for QLinearLeakyReLU for int8 inference for the onednn backend. The fusion and lowering are supported only in FX mode.

Test plan
python test_quantization.py TestQuantizeFx

cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @leslie-fang-intel @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10