
[Quant] onednn backend switch to ideep new api without affecting performance #91056

Closed
wants to merge 4 commits

Conversation

@Xia-Weiwen (Collaborator) commented Dec 17, 2022

Stack from ghstack (oldest at bottom):

Reopen of #90354

Summary
The onednn quantization backend switches to the new API in third_party/ideep.

  • struct forward_params for conv/deconv has changed, so the primitive cache is modified accordingly.
  • Use the new versions of the prepare and compute APIs, which separate the fp32 and int8 paths; the old versions will be deprecated. (A sketch of the cached prepare/compute flow is given below.)
  • ideep::tensor::reorder_if_differ_in now supports block-to-block reorder, so it is used instead of defining a util function onednn_utils::try_reorder. (A sketch of this pattern follows right after this list.)
  • The new transposed-convolution API takes a flag that keeps the weight desc aligned with oneDNN, so PyTorch no longer needs to transpose the weight explicitly.
  • Use the is_channels_last flag to specify the layout of src/dst when querying the expected weight desc.
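
For readers unfamiliar with the reorder-if-differ pattern mentioned above, here is a minimal, self-contained C++ sketch of the idea. The `Desc` and `Tensor` types below are illustrative stand-ins, not ideep's actual classes; the real `ideep::tensor::reorder_if_differ_in` works on oneDNN memory descriptors rather than layout strings.

```cpp
#include <iostream>
#include <string>

// Illustrative stand-ins for ideep's descriptor/tensor types (not the real API).
struct Desc {
  std::string layout;  // e.g. "plain_nchw", "blocked_8c", "blocked_16c"
  bool operator==(const Desc& other) const { return layout == other.layout; }
};

struct Tensor {
  Desc desc;
  // Reorder only when the current descriptor differs from the expected one.
  // One call covers both plain-to-block and block-to-block cases.
  Tensor reorder_if_differ_in(const Desc& expected) const {
    if (desc == expected) {
      return *this;  // layouts already match: no copy, no reorder
    }
    std::cout << "reorder: " << desc.layout << " -> " << expected.layout << "\n";
    return Tensor{expected};  // pretend we produced a reordered copy
  }
};

int main() {
  Tensor weight{{"blocked_8c"}};
  Desc expected{"blocked_16c"};
  Tensor prepared = weight.reorder_if_differ_in(expected);  // block-to-block reorder
  Tensor again = prepared.reorder_if_differ_in(expected);   // no-op this time
  std::cout << "final layout: " << again.desc.layout << "\n";
  return 0;
}
```

The point is simply that a single call handles both the "already in the expected layout" and the "needs a (block-to-block) reorder" cases, which is what makes a separate onednn_utils::try_reorder helper unnecessary.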

It won't impact correctness. Performance should be unaffected or slightly better.
FBGEMM and QNNPACK backends are not affected.
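
To make the primitive-cache and prepare/compute points above concrete, here is a minimal, self-contained C++ sketch of a prepare-once / compute-many flow guarded by a cache. All names here (ForwardParams, prepare, compute, conv_forward, the string cache key) are hypothetical stand-ins for illustration, not the actual ideep or PyTorch symbols.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the parameters produced by a "prepare" step
// (primitive descriptor, expected layouts, scratchpad info, ...).
struct ForwardParams {
  std::string primitive_desc;
};

// Cache keyed by whatever uniquely identifies the primitive
// (shapes, dtypes, strides, post-ops, ...), simplified to a string here.
std::unordered_map<std::string, ForwardParams> primitive_cache;

ForwardParams prepare(const std::string& key) {
  std::cout << "prepare (expensive, once per key): " << key << "\n";
  return ForwardParams{"pd(" + key + ")"};
}

void compute(const ForwardParams& params) {
  std::cout << "compute (cheap, every call): " << params.primitive_desc << "\n";
}

void conv_forward(const std::string& key) {
  auto it = primitive_cache.find(key);
  if (it == primitive_cache.end()) {
    it = primitive_cache.emplace(key, prepare(key)).first;  // cache miss: prepare once
  }
  compute(it->second);  // cache hit: reuse the prepared params
}

int main() {
  conv_forward("int8, NHWC, 3x3, stride=1");  // prepares, then computes
  conv_forward("int8, NHWC, 3x3, stride=1");  // reuses the cached params
  return 0;
}
```

Because the expensive prepare step only runs on a cache miss, moving to the new prepare/compute API should leave steady-state throughput essentially unchanged, which is what the benchmarks below are meant to confirm.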

Performance results are given below.

  1. End-to-end performance of static quantized models (from torchvision)
    (throughput: fps, higher is better)
    ![image](https://user-images.githubusercontent.com/12522207/206105879-45c59996-9804-4531-aa1f-dc962e6db5ab.png)

  2. Op benchmark of dynamic quantized linear
    (Latency: ms, lower is better)
    ![image](https://user-images.githubusercontent.com/12522207/206124949-77352991-0fda-4285-a484-e20a5797262b.png)

Test method & env:

  • Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
  • Run multiple instances on a single node, using one core per instance.
  • Use Jemalloc and Intel OpenMP

Test plan
python test/test_quantization.py

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot bot commented Dec 17, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91056

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e166522:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: quantization release notes category label Dec 17, 2022
Xia-Weiwen added a commit that referenced this pull request Dec 17, 2022
[Quant] onednn backend switch to ideep new api without affecting performance

ghstack-source-id: 0df32fbbe7639bc0f5d4335715213502eb0ab1bd
Pull Request resolved: #91056
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 17, 2022
Xia-Weiwen added a commit that referenced this pull request Dec 18, 2022
[Quant] onednn backend switch to ideep new api without affecting performance

ghstack-source-id: c583ee8ab681ee8a1bb95609fe66c1fac742bbb8
Pull Request resolved: #91056
@Xia-Weiwen Xia-Weiwen added intel This tag is for PR from Intel ciflow/trunk Trigger trunk jobs on your pull request labels Dec 18, 2022
Xia-Weiwen added a commit to Xia-Weiwen/pytorch that referenced this pull request Jan 11, 2023
[Quant] onednn backend switch to ideep new api without affecting performance

ghstack-source-id: c583ee8ab681ee8a1bb95609fe66c1fac742bbb8
Pull Request resolved: pytorch#91056
Xia-Weiwen added a commit to Xia-Weiwen/pytorch that referenced this pull request Jan 16, 2023
[Quant] onednn backend switch to ideep new api without affecting performance

ghstack-source-id: c583ee8ab681ee8a1bb95609fe66c1fac742bbb8
Pull Request resolved: pytorch#91056
Xia-Weiwen added a commit that referenced this pull request Jan 16, 2023
[Quant] onednn backend switch to ideep new api without affecting performance

ghstack-source-id: 5151171ed10b77c817a260c6d816a17d75f6e1f6
Pull Request resolved: #91056
Xia-Weiwen added a commit that referenced this pull request Jan 16, 2023
[Quant] onednn backend switch to ideep new api without affecting performance

ghstack-source-id: 70704c08e9fb76d91951aab332fba94a128c1b15
Pull Request resolved: #91056
@Xia-Weiwen (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@malfet (Contributor) commented Jan 19, 2023

Hmm, this seems to be using the new iDeep APIs; why was it landed before the internal update?

@Xia-Weiwen (Collaborator, Author)

> Hmm, this seems to be using the new iDeep APIs; why was it landed before the internal update?

Hi @malfet. Since the double-checkout issue has been solved by #92239, I thought it would be OK to land this. If it is breaking something, please go ahead and revert it.
BTW, how can I know when the internal update is done? Thanks.

@facebook-github-bot facebook-github-bot deleted the gh/Xia-Weiwen/9/head branch June 8, 2023 14:58
Labels: ciflow/trunk, intel, Merged, module: cpu, open source, release notes: quantization