
[Quant][FX] Lower QLinearLeakyReLU for onednn backend #88668

Closed
wants to merge 24 commits

Conversation

@Xia-Weiwen (Collaborator) commented Nov 8, 2022

Stack from ghstack (oldest at bottom):

Summary
Add quantization mappings for QLinearLeakyReLU for int8 inference with the onednn backend. Fusion and lowering are supported only in FX mode.

Test plan
python test_quantization.py TestQuantizeFx

cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @leslie-fang-intel @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
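
For orientation, here is a minimal sketch (mine, not taken from the PR) of how the new mapping is exercised end to end. It assumes the torch.ao.quantization FX APIs (get_default_qconfig_mapping, prepare_fx, convert_fx, get_onednn_backend_config) and a PyTorch build where the onednn quantized engine is available; model and shapes are illustrative only.

```python
# Minimal sketch (illustrative, not from the PR): FX PTQ flow that exercises the
# linear + leaky_relu fusion and lowering added for the onednn backend.
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.backend_config import get_onednn_backend_config
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)
        self.leaky_relu = nn.LeakyReLU(0.01)

    def forward(self, x):
        return self.leaky_relu(self.linear(x))

model = M().eval()
example_inputs = (torch.randn(1, 16),)

# The onednn backend_config must be passed to both prepare_fx and convert_fx
# for the pattern to be fused and lowered to the fused quantized module.
backend_config = get_onednn_backend_config()
qconfig_mapping = get_default_qconfig_mapping("onednn")

prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
prepared(*example_inputs)  # calibration
quantized = convert_fx(prepared, backend_config=backend_config)

# The onednn quantized engine is what actually runs the fused kernel at inference time.
torch.backends.quantized.engine = "onednn"
quantized(*example_inputs)
```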

@pytorch-bot bot commented Nov 8, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88668

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1770001:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions bot added the oncall: quantization label Nov 8, 2022
Xia-Weiwen added a commit that referenced this pull request Nov 8, 2022
ghstack-source-id: 12a730021a77e66efa24daa9f13ed466da9f91d8
Pull Request resolved: #88668
@Xia-Weiwen marked this pull request as draft November 8, 2022 10:22
Xia-Weiwen added a commit that referenced this pull request Nov 8, 2022
ghstack-source-id: 5b2aa1842dad9d92bd4c5bf3e45df1ec0beb79a4
Pull Request resolved: #88668
Xia-Weiwen added a commit that referenced this pull request Nov 10, 2022
ghstack-source-id: 0700231f4a9ecfde0989cd0ae4b683843862623b
Pull Request resolved: #88668
@Xia-Weiwen marked this pull request as ready for review November 11, 2022 05:30
@@ -17,4 +18,5 @@
    "BackendPatternConfig",
    "DTypeConfig",
    "ObservationType",
    "get_onednn_backend_config",
Contributor:

Should this change happen in the previous PR?

Xia-Weiwen (Collaborator, author):

OK

Comment on lines 369 to 391
    def test_fuse_linear_bn_leaky_relu_eval(self):
        # linear - bn - leaky_relu is fused for onednn backend only
        from torch.ao.quantization.backend_config import get_onednn_backend_config
        expected_nodes = [
            ns.call_module(nni.LinearLeakyReLU),
        ]
        expected_occurrence = {
            ns.call_module(nn.BatchNorm1d): 0,
            ns.call_module(nn.LeakyReLU): 0,
        }

        for with_bn in [True, False]:
            # test eval mode
            m = LinearBnLeakyReluModel(with_bn).eval()
            # fuse_fx is a top level api and only supports eval mode
            m = fuse_fx(m,
                        backend_config=get_onednn_backend_config())
            self.checkGraphModuleNodes(
                m,
                expected_node_list=expected_nodes,
                expected_node_occurrence=expected_occurrence)

    def test_no_fuse_linear_bn_leaky_relu_eval(self):
Contributor:

These two tests should be in the previous PR as well, I think.

Xia-Weiwen (Collaborator, author):

OK

@@ -253,6 +253,7 @@ def should_skip_lowering(op: torch.fx.node.Node, qconfig_map: Dict[str, QConfigA
# 2) The replacement static quantized module class for lowering
STATIC_LOWER_FUSED_MODULE_MAP: Dict[Type[nn.Module], Tuple[Type[nn.Module], Type[WeightedQuantizedModule]]] = {
    nni.LinearReLU: (nnqr.Linear, nniq.LinearReLU),
    nni.LinearLeakyReLU: (nnqr.Linear, nniq.LinearLeakyReLU),
Contributor:

I guess this change is related to lowering, but currently it is not tested. We should add a test that goes through the PTQ flow of FX graph mode quantization and lowers the linear - leaky_relu pattern, and probably another test for linear - bn - leaky_relu.

Xia-Weiwen (Collaborator, author):

I have added a test case in test/quantization/fx/test_quantize_fx.py as TestQuantizeFx.test_linear_leaky_relu_lowering.

Contributor:

Do we also fuse this for fbgemm and qnnpack? Should we skip that for those other backends?

Xia-Weiwen (Collaborator, author):

No, we don't fuse this for other backends. I have added a test case to confirm the pattern is not fused by default.

Contributor:

In that case, should we have a separate lowering function for onednn?

Xia-Weiwen (Collaborator, author):

For fbgemm and qnnpack, linear + leaky_relu are not fused; they are lowered separately. Even if users specify backend='onednn', they still need to pass onednn's backend config explicitly to prepare_fx and convert_fx to fuse and lower linear - leaky_relu. So a new lowering function is not needed.
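
To illustrate the point above, a hedged sketch (mine, not from the PR) contrasting fusion with and without the onednn backend config. It assumes nni refers to torch.ao.nn.intrinsic (older releases expose the same classes under torch.nn.intrinsic) and that fuse_fx accepts a backend_config keyword, as in the test excerpt earlier in this thread.

```python
# Hedged sketch: fuse_fx with the default backend_config leaves nn.Linear and
# nn.LeakyReLU as separate modules; only the onednn backend_config fuses them
# into a single nni.LinearLeakyReLU module.
import copy
import torch.nn as nn
import torch.ao.nn.intrinsic as nni  # assumed import path; may be torch.nn.intrinsic in older releases
from torch.ao.quantization.backend_config import get_onednn_backend_config
from torch.ao.quantization.quantize_fx import fuse_fx

model = nn.Sequential(nn.Linear(8, 8), nn.LeakyReLU(0.01)).eval()

fused_default = fuse_fx(copy.deepcopy(model))  # default config: pattern is not fused
fused_onednn = fuse_fx(copy.deepcopy(model), backend_config=get_onednn_backend_config())

assert not any(isinstance(m, nni.LinearLeakyReLU) for m in fused_default.modules())
assert any(isinstance(m, nni.LinearLeakyReLU) for m in fused_onednn.modules())
```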

Contributor:

Yeah, putting the comments here looks good.

@@ -110,6 +110,7 @@
    nni.ConvReLU2d: nniq.ConvReLU2d,
    nni.ConvReLU3d: nniq.ConvReLU3d,
    nni.LinearReLU: nniq.LinearReLU,
    nni.LinearLeakyReLU: nniq.LinearLeakyReLU,
Contributor:

Same here: this is in the eager mode quantization flow, so we should add a similar test in https://github.com/pytorch/pytorch/blob/master/test/quantization/eager/test_quantize_eager_ptq.py#L76, I think.

Xia-Weiwen (Collaborator, author):

We only support fusion and lowering in FX mode. I have added a test case in test_quantize_fx.py.

@jerryzh168 (Contributor) left a review comment:

This one needs some changes; please see the comments inline.

@Xia-Weiwen (Collaborator, author):

> This one needs some changes; please see the comments inline.

Hi @jerryzh168, I have made the changes per your comments. Do you have more comments? Thanks!

@Xia-Weiwen (Collaborator, author) commented Dec 9, 2022:

Hi @jerryzh168, is it OK to land this? Thanks!


@Xia-Weiwen (Collaborator, author):

Hi @jerryzh168, do you have more comments on this PR? Thanks!

@jerryzh168 (Contributor) left a review comment:

We need to discuss how lowering works for onednn when these things are not supported in fbgemm/qnnpack.

@jerryzh168 (Contributor):

I guess the current behavior is: if people configure quantization for linear -> leaky_relu, it will produce the quantized::linear_leaky_relu op, but when they run the model it will error out and say it is not supported in fbgemm/qnnpack. Is that correct?

@Xia-Weiwen (Collaborator, author):

> I guess the current behavior is: if people configure quantization for linear -> leaky_relu, it will produce the quantized::linear_leaky_relu op, but when they run the model it will error out and say it is not supported in fbgemm/qnnpack. Is that correct?

No. linear + leaky_relu fusion is only enabled if users use onednn's backend config explicitly; see #88668 (comment).

@Xia-Weiwen (Collaborator, author):

> We need to discuss how lowering works for onednn when these things are not supported in fbgemm/qnnpack.

Linear + leaky_relu are not fused for fbgemm/qnnpack.


@jerryzh168 (Contributor) left a review comment:

I see. Can we add a comment to the lowering code to explain the behavior? I think there are two things:
1) backend_config
2) quantized engine

I think currently we are saying that unless we set backend_config to the onednn backend_config, we won't see the linear - leaky_relu pattern. But my question is: if we do have the "linear - leaky_relu" pattern (e.g. if a user accidentally uses the onednn backend_config), the current lowering code will fuse it, since the lowering code is shared by all backends; and if the quantized engine is set to fbgemm, we'll run quantized::linear_leaky_relu with the fbgemm backend.

I think in the future, since we are asking users to explicitly specify backend_config, we should probably also ask users to explicitly call the lowering code, and we should have separate lowering code for each backend as well.

Maybe a comment explaining the above would be helpful.
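
For clarity, a short sketch (mine, not from the PR) separating the two knobs named above; the "onednn" string assumes a PyTorch build where the onednn quantized engine is available.

```python
import torch
from torch.ao.quantization.backend_config import get_onednn_backend_config

# 1) backend_config: consumed by prepare_fx/convert_fx (and lowering) to decide
#    which patterns, such as linear + leaky_relu, get fused and quantized.
backend_config = get_onednn_backend_config()

# 2) quantized engine: a global runtime switch that decides which kernel library
#    (fbgemm, qnnpack, onednn, ...) executes the quantized ops at inference time.
torch.backends.quantized.engine = "onednn"
```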


@Xia-Weiwen (Collaborator, author) commented Dec 15, 2022:

> I see. Can we add a comment to the lowering code to explain the behavior? […]

I see. I have added comments.
BTW, do we have a mechanism to prevent end users from using a backend_config that does not match the backend they are using?


@jerryzh168 (Contributor):

> BTW, do we have a mechanism to prevent end users from using a backend_config that does not match the backend they are using?

By "backend" you are talking about the quantized engine, right? We don't have this mechanism right now, but we might consider adding it in the future, I think.

@Xia-Weiwen (Collaborator, author):

> By "backend" you are talking about the quantized engine, right? We don't have this mechanism right now, but we might consider adding it in the future, I think.

Yes, I mean the qengine here.
Maybe we can choose the qengine's backend_config automatically behind the scenes when users do not specify backend_config explicitly. In that case, we would not encourage end users to set backend_config explicitly unless they want to do some customization. How does that sound to you?
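
A hypothetical helper (not part of PyTorch; function names invented for illustration) sketching the suggestion above: derive the backend_config from the active quantized engine when the user does not pass one explicitly.

```python
import torch
from torch.ao.quantization.backend_config import (
    get_fbgemm_backend_config,
    get_onednn_backend_config,
    get_qnnpack_backend_config,
)

def _backend_config_for_current_qengine():
    """Hypothetical: pick a default backend_config matching torch.backends.quantized.engine."""
    engine = torch.backends.quantized.engine
    if engine == "fbgemm":
        return get_fbgemm_backend_config()
    if engine == "qnnpack":
        return get_qnnpack_backend_config()
    if engine == "onednn":
        return get_onednn_backend_config()
    raise ValueError(f"no default backend_config known for qengine {engine!r}")

def prepare_fx_with_default_config(model, qconfig_mapping, example_inputs, backend_config=None):
    """Hypothetical wrapper: fall back to the qengine's backend_config when none is given."""
    from torch.ao.quantization.quantize_fx import prepare_fx
    if backend_config is None:
        backend_config = _backend_config_for_current_qengine()
    return prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
```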


@jerryzh168 (Contributor) commented Dec 16, 2022:

> Yes, I mean the qengine here. Maybe we can choose the qengine's backend_config automatically behind the scenes when users do not specify backend_config explicitly. In that case, we would not encourage end users to set backend_config explicitly unless they want to do some customization. How does that sound to you?

I think we may need to store the target backend/qengine in the model to do that, since the qengine is used when we do inference and the backend_config is used when we quantize the model, and we can have serialization/deserialization etc. in between. cc @vkuzo: what is the latest plan on storing this information in the model?

@Xia-Weiwen (Collaborator, author):

> I think we may need to store the target backend/qengine in the model to do that […]

I see. Currently people use backend_config for quantization and the qengine for inference. If the qengine info were stored in the model during quantization, the qengine could be selected automatically for inference, so there would be no need to specify it explicitly. That makes sense.


@Xia-Weiwen (Collaborator, author):

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).


@facebook-github-bot deleted the gh/Xia-Weiwen/4/head branch June 8, 2023 14:57
Labels: ciflow/trunk, intel, Merged, oncall: quantization, open source, release notes: quantization
5 participants