[Quant] onednn backend switch to ideep new api without affecting performance #91056
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91056
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit e166522. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hmm, this seems to be using the new iDeep APIs. Why was it landed before the internal update?
Hi @malfet. Since the double checkout issue was solved by #92239, I thought it would be OK to land this. If it is breaking something, please go ahead and revert it.
Stack from ghstack (oldest at bottom):
Summary

The onednn quantization backend switches to the new API in `third_party/ideep`:

- `struct forward_params` for conv/deconv has changed; the primitive cache is modified accordingly.
- The new versions of the `prepare` and `compute` APIs are used, with the fp32 and int8 paths separated; the old versions will be deprecated. (A hedged sketch of this call pattern follows the list.)
- `ideep::tensor::reorder_if_differ_in` now supports block-to-block reorder, so it is used instead of defining a util function `onednn_utils::try_reorder`.
- For the new transposed-convolution API, a flag keeps the weight desc aligned with oneDNN, so the weight no longer needs to be transposed explicitly in PyTorch.
- The `is_channels_last` flag is used to specify the layout of src/dst when querying the expected weight desc.

This does not impact correctness, and performance should be unaffected or slightly better.
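To make the caching scheme concrete, here is a minimal C++ sketch of the prepare-once/compute-many pattern. It is an illustration, not this PR's actual code: the cache type and key, the `prepare` argument list, and the `params.pd` accessor are assumptions modeled on ideep's style, and the real definitions in `third_party/ideep` differ in detail.

```cpp
// Illustrative only: a prepare/compute pattern with a primitive-params cache.
// convolution_forward_params, prepare(), compute(), and reorder_if_differ_in()
// are ideep-style names; exact signatures vary across ideep versions.
#include <ideep.hpp>
#include <string>
#include <unordered_map>

using params_t = ideep::convolution_forward_params;

// Hypothetical cache: prepare() runs once per unique shape/attr combination.
static std::unordered_map<std::string, params_t> g_primitive_cache;

ideep::tensor run_conv(const ideep::tensor& src,
                       const ideep::tensor& weight,
                       const ideep::tensor& bias,
                       const ideep::dims& dst_dims,
                       const std::string& cache_key) {
  ideep::tensor dst;
  auto it = g_primitive_cache.find(cache_key);
  if (it == g_primitive_cache.end()) {
    params_t params;
    // Expensive first call: builds the primitive descriptor, expected
    // src/weight descs, and scratchpad, storing them in `params`.
    // (This argument list is schematic.)
    ideep::convolution_forward::prepare(
        params, src, weight, bias, dst_dims, dst,
        /*strides=*/{1, 1}, /*dilates=*/{1, 1},
        /*padding_l=*/{0, 0}, /*padding_r=*/{0, 0}, /*groups=*/1);
    it = g_primitive_cache.emplace(cache_key, std::move(params)).first;
  }
  params_t& params = it->second;
  // With newer ideep, reorder_if_differ_in() also handles block-to-block
  // reorders, so no separate onednn_utils::try_reorder helper is needed.
  // (`params.pd.weights_desc()` is an assumed accessor.)
  auto w = weight.reorder_if_differ_in(params.pd.weights_desc());
  // Cheap steady-state call: reuses the cached primitive.
  ideep::convolution_forward::compute(params, src, w, bias, dst);
  return dst;
}
```

In the steady state only `compute` runs, which is why the switch is expected to leave performance unchanged as long as primitive-cache hits behave as before.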
FBGEMM and QNNPACK backends are not affected.
Performance results are given below.
1. End-to-end performance of static quantized models (from torchvision) (throughput: fps, higher is better)
![image](https://user-images.githubusercontent.com/12522207/206105879-45c59996-9804-4531-aa1f-dc962e6db5ab.png)
2. Op benchmark of dynamic quantized linear (latency: ms, lower is better)
![image](https://user-images.githubusercontent.com/12522207/206124949-77352991-0fda-4285-a484-e20a5797262b.png)
Test method & env:
- Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
- Multi-instance runs on a single node, one core per instance
- Jemalloc and Intel OpenMP
Test plan
python test/test_quantization.py
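As a supplement to the test plan, a small hedged sketch of forcing the oneDNN qengine at the ATen level before exercising the quantized kernels; it assumes a PyTorch build with oneDNN support, where `QEngine::ONEDNN` is a valid engine (the Python-side equivalent is `torch.backends.quantized.engine = 'onednn'`).

```cpp
// Hedged sketch: select the oneDNN quantization engine so the quantized ops
// covered by test_quantization.py dispatch to the onednn backend.
#include <ATen/Context.h>

void use_onednn_qengine() {
  at::globalContext().setQEngine(at::QEngine::ONEDNN);
}
```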
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10