
Initial integration of ZenDNN as backend into PyTorch #76242

Closed

ratan-prasad wants to merge 4 commits

Conversation


ratan-prasad commented Apr 22, 2022

This PR covers the following:

- Added ZenDNN as a third-party library
    - ZenDNN is added as a submodule of PyTorch
    - modified the PyTorch build infrastructure to build with ZenDNN when USE_ZENDNN is set to 1
    - the PyTorch build reverts to default behavior if the USE_ZENDNN environment variable is set to zero or unset
- Added AOCL-BLIS as a dependency of ZenDNN
    - AOCL-BLIS is added as a submodule of PyTorch
    - AOCL-BLIS is built if USE_ZENDNN is set to 1
- Added ZenDNN as a backend to PyTorch
    - added user-level control to enable or disable the ZenDNN backend
- Added the following changes
    - added the zendnn layout and the ZendnnCPU dispatch key
    - tensor conversion between dense and zendnn layouts
    - zendnn convolution as a new ConvBackend
        - when built with ZenDNN (USE_ZENDNN=1) and gradient calculation is disabled (torch.no_grad), convolution is computed using the ZenDNN library (see the usage sketch after this list)
    - backward propagation is not supported
    - support is added only for the fp32 data type
- Added a few unit tests for PyTorch + ZenDNN
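A minimal usage sketch of the flow described above. The build command follows standard PyTorch conventions; the layout-conversion method names are assumptions modeled on the existing `to_mkldnn()`/`to_dense()` pair, not confirmed by this PR:

```python
import torch

# Build PyTorch with the backend enabled, e.g.:
#   USE_ZENDNN=1 python setup.py develop
# With USE_ZENDNN=0 or unset, the build reverts to default behavior.

model = torch.nn.Conv2d(3, 64, kernel_size=3).eval()
x = torch.randn(1, 3, 224, 224)  # fp32 only; other dtypes are not supported

# Inference only: with gradients disabled, convolution is dispatched to ZenDNN.
with torch.no_grad():
    y = model(x)

# Hypothetical layout conversions, by analogy with Tensor.to_mkldnn():
# z = x.to_zendnn()   # dense -> zendnn layout
# d = z.to_dense()    # zendnn -> dense layout
```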

Signed-off-by: Ratan Prasad ratan.prasad2@amd.com

Fixes #ISSUE_NUMBER

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel

facebook-github-bot commented Apr 22, 2022

Dr. CI: ✅ No failures (0 pending) as of commit 67503f7. Commit 67503f7 was recently pushed; waiting for builds. This comment was automatically generated by Dr. CI.

@facebook-github-bot added the oncall: jit (Add this issue/PR to JIT oncall triage queue) label Apr 22, 2022
@mrshenli added the module: convolution (Problems related to convolutions (THNN, THCUNN, CuDNN)) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Apr 24, 2022
@albanD removed their request for review April 26, 2022 13:07
@soulitzer removed their request for review April 28, 2022 16:24

malfet commented May 4, 2022

@ratan-prasad can you please share some performance numbers in the PR description? Also, what is the expected binary footprint increase as a result of this change?


soumith commented Jun 1, 2022

Also, why is ZenDNN not just part of the ROCm backend integration?

ratan-prasad (Author) replied:

> @ratan-prasad can you please share some performance numbers in the PR description? Also, what is the expected binary footprint increase as a result of this change?
@malfet This PR integrates only one op, to validate the procedure ahead of the remaining PRs that will integrate the rest of the ZenDNN ops. To showcase the potential uplift of ZenDNN vs. Intel MKL-DNN on AMD CPUs, we benchmarked our fully integrated version on PyTorch v1.11.
PyTorch v1.11 throughput (images/sec), BS 640, Milan 64C, NPS4:

| Model | Improvement |
| --- | --- |
| ResNet50 | 6.49% |
| GoogleNet | 118.83% |
| vgg11 | 18.17% |
| Inception-v4 | 69.96% |
| Alexnet | 17.99% |

PyTorch v1.11 latency (ms), BS 1, Milan 64C, NPS4:

| Model | Improvement |
| --- | --- |
| ResNet50 | 6.49% |
| GoogleNet | 118.83% |
| vgg11 | 18.17% |
| Inception-v4 | 69.96% |
| Alexnet | 17.99% |

Wheel file size comparison with and without the ZenDNN build:

| Build | Size (MB) |
| --- | --- |
| Default build | 105 |
| Build with ZenDNN | 107 |

ratan-prasad (Author) replied:

> Also, why is ZenDNN not just part of the ROCm backend integration?

@soumith ZenDNN is a CPU inference library, while ROCm targets GPUs. The two have different release cycles, driven by customer needs and by the separate CPU and GPU product release cycles with their corresponding need for timing-aligned software releases. That is why ZenDNN is not part of the ROCm integration.

@ratan-prasad force-pushed the local_pt_pr_1 branch 2 times, most recently from c3b22d4 to 67604e3, on June 16, 2022 12:38

huiguoo commented Jun 17, 2022

@ratan-prasad Are the performance numbers exactly the same for batch size 1 and batch size 640? Have you tried other batch sizes?


ratan-prasad commented Jul 5, 2022

> @ratan-prasad Are the performance numbers exactly the same for batch size 1 and batch size 640? Have you tried other batch sizes?

Hi @huiguoo, sorry for the late reply. We have benchmarked other batch sizes as well; there was a typo for batch size 1 in the previous reply. We also observed that the conda installation of PyTorch v1.11.0 performs better than the pip-based installation for oneDNN, so we have regenerated the benchmarking data. Please find the updated benchmarking details below.

PyTorch v1.11 throughput (images/sec), BS 640, Milan 64C, NPS4:

| Model | Improvement |
| --- | --- |
| ResNet50 | 11.09% |
| GoogleNet | 113.91% |
| vgg11 | 17.57% |
| Inception-v4 | 70.13% |
| Alexnet | 5.05% |

PyTorch v1.11 throughput (images/sec), BS 128, Milan 64C, NPS4:

| Model | Improvement |
| --- | --- |
| ResNet50 | 4.14% |
| GoogleNet | 107.84% |
| vgg11 | 13.66% |
| Inception-v4 | 63.66% |
| Alexnet | 14.61% |

PyTorch v1.11 throughput (images/sec), BS 1, Milan 64C, NPS4:

| Model | Improvement |
| --- | --- |
| ResNet50 | 12.38% |
| GoogleNet | 66.27% |
| vgg11 | 154.63% |
| Inception-v4 | 47.40% |
| Alexnet | 324.45% |

malfet left a review comment:

The proposed solution makes ZenDNN and BLIS mandatory submodules of PyTorch even though the dependency itself will be used very sporadically (there is currently no CI/CD support).

Can we please solve the mutual-exclusivity issue, or modify the PR to make those optional dependencies (checked out only if USE_ZENDNN is set to true) and reuse the same DNNCPU dispatch key?

CMakeLists.txt (outdated)

```cmake
@@ -128,9 +128,11 @@ endif()
set(CPU_AARCH64 OFF)
set(CPU_INTEL OFF)
set(CPU_AMD OFF)
```
Reviewer:

Can you please explain what the purpose of CPU_AMD is here? You can replace CPU_INTEL with CPU_X64 if you want.

Author:

Thanks for the suggestion. We wanted to keep the existing code untouched as much as possible.
We will submit this correction in another patch on the same pull request.

CMakeLists.txt (outdated)

```cmake
cmake_dependent_option(
  USE_ZENDNN "Use ZENDNN. Only available on x86 and x86_64." "${CPU_AMD}"
```
Reviewer:

It should not be gated on the new CPU_AMD but rather on the generic x86 architecture.

Author:

Thanks for the suggestion. We will submit this correction in another patch on the same pull request.

Comment on lines 53 to 54:

```cpp
MkldnnCPU,
ZendnnCPU,
```
Reviewer:

Hmm, as those are mutually exclusive right now, what's the harm in reusing the same dispatch key (renamed to a neutral DNNCPU)?

Author:

If we use the same dispatch key, we anticipate some challenges in the ZenDNN integration.

For example: how should “aten/src/ATen/native/native_functions.yaml” be updated for the mkldnn and zendnn integrations?

Currently, the relu entry in “aten/src/ATen/native/native_functions.yaml” looks as follows:

```yaml
- func: relu(Tensor self) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function, method
  dispatch:
    CPU, CUDA: relu
    MkldnnCPU: mkldnn_relu
    ZendnnCPU: zendnn_relu
    QuantizedCPU: relu_quantized_cpu
```

How can the same DNNCPU dispatch key be made to point to mkldnn_relu or zendnn_relu based on CMake-level flags? Can you please point us to an example of this?

Note: ATen calls mkldnn through the ideep API and library. ATen will call ZenDNN directly through the ZenDNN API, without the ideep API/library.

xw285cornell:

@malfet I think one issue with sharing the same key is that we would have to build separate packages for AMD and Intel, which isn't ideal. Ideally we could use the same package and pick a different backend at runtime.

@ratan-prasad curious about the mutually exclusive part: why is that the case? Is it symbol conflict or something?
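To illustrate the runtime-selection idea in the comment above, here is a rough sketch (not part of this PR; Linux-only, and the final switch is hypothetical): detect the CPU vendor once at startup, and let kernels behind a single shared dispatch key consult the result.

```python
def cpu_vendor() -> str:
    # Read the CPU vendor string on Linux, e.g. "GenuineIntel" or "AuthenticAMD".
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("vendor_id"):
                return line.split(":", 1)[1].strip()
    return "unknown"

# Hypothetical runtime switch that a shared DNNCPU kernel could consult.
USE_ZENDNN_AT_RUNTIME = cpu_vendor() == "AuthenticAMD"
```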

Author:

@xw285cornell Yes, ZenDNN uses AMD-BLIS (a CBLAS API library) for optimal matrix operations on AMD hardware. AMD-BLIS has symbol clashes with MKL. That's why we are planning for mutually exclusive builds.
Can you help us with sample code or an existing, similar integration where the same dispatch key is shared between two different libraries?

Comment on lines 977 to 1021:

```cpp
constexpr auto sparse_and_sparsecsr_and_mkldnn_ks_and_zendnn_ks =
    c10::sparse_ks | c10::sparse_csr_ks | c10::mkldnn_ks | c10::zendnn_ks;
if (!key_set_.has_any(sparse_and_sparsecsr_and_mkldnn_ks_and_zendnn_ks)) {
```
Reviewer:

Same as above: if there were a single shared key, there would be no need to add an additional check here.

Author:

Same answer as previous.

ratan-prasad (Author) replied:

> The proposed solution makes ZenDNN and BLIS mandatory submodules of PyTorch even though the dependency itself will be used very sporadically (there is currently no CI/CD support).
>
> Can we please solve the mutual-exclusivity issue, or modify the PR to make those optional dependencies (checked out only if USE_ZENDNN is set to true) and reuse the same DNNCPU dispatch key?

Should we modify “CMakeLists.txt” or “cmake/Dependencies.cmake” so that the ZenDNN and BLIS repositories are downloaded during setup.py execution (when USE_ZENDNN=1 is exported)? A sketch of this follows below.

Note: we are also looking to enable CI runs for ZenDNN builds.
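A minimal sketch of what that opt-in checkout could look like from setup.py (illustrative only: the submodule paths and helper name are assumptions, not part of this PR):

```python
import os
import subprocess

def maybe_checkout_zendnn() -> None:
    # Fetch the ZenDNN/BLIS sources only when the user opts in via USE_ZENDNN=1.
    if os.environ.get("USE_ZENDNN", "0") != "1":
        return
    for path in ("third_party/ZenDNN", "third_party/blis"):
        subprocess.check_call(
            ["git", "submodule", "update", "--init", "--recursive", path]
        )
```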

ratan-prasad and others added 3 commits August 9, 2022 12:15
This PR covers the following:
- Added ZenDNN as a third-party library
    - ZenDNN is added as a submodule of PyTorch
    - modified the PyTorch build infrastructure to build with ZenDNN when
      USE_ZENDNN is set to 1
    - the PyTorch build reverts to default behavior if the USE_ZENDNN
      environment variable is set to zero or unset
- Added blis (a dependency of ZenDNN) as a third-party library
    - blis is added as a submodule of PyTorch
    - modified the PyTorch build infrastructure to build with blis when BLAS is set to BLIS
    - blis is built if USE_ZENDNN is set to 1
- Added ZenDNN as a backend to PyTorch
    - added user-level control to enable or disable the ZenDNN backend
- Added the following changes
    - added the zendnn layout and the ZendnnCPU dispatch key
    - tensor conversion between dense and zendnn layouts
    - zendnn convolution as a new ConvBackend
        - when built with ZenDNN (USE_ZENDNN=1) and gradient calculation is
          disabled (torch.no_grad), convolution is computed using the ZenDNN library
    - backward propagation is not supported
    - support is added only for the fp32 data type
- Added a few unit tests for PyTorch + ZenDNN

Signed-off-by: Ratan Prasad <ratan.prasad2@amd.com>
Fixed the following CI errors:
- lint errors and warnings
- failure in test_autograd.py
- failure in test_mkldnn.py

Signed-off-by: ratan-prasad <ratan.prasad2@amd.com>
Fixed the following CI issues:
- failure in test_public_bindings.py
- failure in test_overrides.py
- failure in test_sparse_csr.py
- failure in documentation building

Signed-off-by: ratan-prasad <ratan.prasad2@amd.com>
Signed-off-by: Naveen Kumar T <NAVEEN.THANGUDU@amd.com>
Fixed the following:
- Android build issue
- ZenDNN uses the AOCL-blis library as a dependency; AOCL-blis is used by
  ZenDNN independently of the BLAS setting
- the BLAS=BLIS build changes are restored as the system default
- failure in test_profiler.py for the USE_ZENDNN=1 build

Signed-off-by: ratan-prasad <ratan.prasad2@amd.com>

b0noI commented Aug 31, 2022

Is there an RFC for this? If not, we should probably proceed with the review, but for the next big change like this we should have an RFC first. If an RFC has already been reviewed and approved, please link it.

ratan-prasad (Author) replied:

> Is there an RFC for this? If not, we should probably proceed with the review, but for the next big change like this we should have an RFC first. If an RFC has already been reviewed and approved, please link it.

Hi @b0noI,
There is no RFC for this. In an earlier meeting/email exchange it was suggested that we submit a small PR first, before the rest of the PRs. Here is the link to the issue that was raised: #76244.
Once this patch is merged, we will submit an RFC first and then the rest of the PRs.

facebook-github-bot commented:

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

linux-foundation-easycla: CLA Missing ID / CLA Not Signed

Willian-Zhang commented:

Very excited to see this moving along rapidly. I just saw the Zen 4 EPYC release today, with a huge performance boost along with it; I can't wait to get one myself to play around with.

github-actions (bot) commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the Stale label Jan 10, 2023
@pytorch-bot bot added the release notes: jit (release notes category) label Jan 10, 2023
ratan-prasad (Author) commented:

Hi @xw285cornell, @malfet, @bdhirsh, @robieta, can you please help us remove the Stale tag?

@kit1980 removed the Stale label Jan 11, 2023

malfet commented Jan 11, 2023

@ratan-prasad I'm not sure why it should be removed. The PR in its current form has reached a stalemate and as such cannot be merged.

@kit1980 added the Stale label Jan 11, 2023

jgong5 commented Jan 11, 2023

I took a quick look at the ZenDNN code: https://github.com/amd/ZenDNN. It seems a lot of the code is just a clone of oneDNN (a.k.a. MKLDNN). Have you considered contributing the optimizations to oneDNN directly? That would make the PyTorch side simpler: you wouldn't have to clone the oneDNN integration again here. cc @vpirogov

naveenthangudu commented:

> @ratan-prasad I'm not sure why it should be removed. The PR in its current form has reached a stalemate and as such cannot be merged.

Can you please let us know what can be done to break this stalemate?


soumith commented Jan 13, 2023

What would definitely remove the stalemate is to upstream your patches to oneDNN.
I didn't realize ZenDNN is a fork of oneDNN with a bunch of duplicated code, but after @jgong5 pointed it out, I went and looked, and it seems true.
It seems completely wasteful and bug-prone to fork it.

If you wish to continue maintaining the fork because of company dynamics, etc., then as Nikita stated above, this PR needs to:

1. be rebased onto the latest master commit
2. be made a completely optional cmake flag (and no submodules should be committed to the pytorch repo), as @ratan-prasad described in this comment: Initial integration of ZenDNN as backend into PyTorch #76242 (comment)
3. because ZenDNN is a fork of oneDNN with minor API changes, similar to how ROCm does a regex of `s/cuda/hip/`, the ZenDNN patch can regex `s/mkldnn/zendnn/` to keep the changes to PyTorch core minimal (instead of having ~3000 lines enter PyTorch core that we'll have to maintain)


naveenthangudu commented Feb 9, 2023

@malfet @soumith Thank you for the inputs on avoiding a stalemate on the ZenDNN integration.

> 1. be rebased onto the latest `master` commit

With the substitution approach, the integration changes will be different from those in this PR. Hence, we are planning to raise a new PR.

> 2. be made a completely optional cmake flag (and no submodules should be committed to the pytorch repo), as @ratan-prasad described in this comment: [Initial integration of ZenDNN as backend into PyTorch #76242 (comment)](https://github.com/pytorch/pytorch/pull/76242#issuecomment-1179977899)

Can we use the CMake API (ExternalProject_Add) to download the ZenDNN repos?

> 3. because ZenDNN is a fork of oneDNN with minor API changes, similar to how ROCm does a regex of `s/cuda/hip/`, the ZenDNN patch can regex `s/mkldnn/zendnn/` to keep the changes to PyTorch core minimal (instead of having ~3000 lines enter PyTorch core that we'll have to maintain)

We are currently working on a POC of this approach; a toy sketch of the substitution idea is below. Should I create an RFC for this approach?

Note: PyTorch doesn't call oneDNN directly; it calls oneDNN through another library, ideep.
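For reference, a toy sketch of what the `s/mkldnn/zendnn/` substitution step could look like, in the spirit of ROCm's hipify (the token map, function name, and file glob are made up for illustration; the real POC may differ):

```python
import re
from pathlib import Path

# Illustrative token map, analogous to ROCm's s/cuda/hip/ rewrite.
SUBSTITUTIONS = {
    r"\bmkldnn\b": "zendnn",
    r"\bMkldnnCPU\b": "ZendnnCPU",
    r"\bMKLDNN\b": "ZENDNN",
}

def zendnnify(src_dir: str, out_dir: str) -> None:
    # Rewrite every C++ source under src_dir into out_dir with tokens swapped.
    for src in Path(src_dir).rglob("*.cpp"):
        text = src.read_text()
        for pattern, repl in SUBSTITUTIONS.items():
            text = re.sub(pattern, repl, text)
        dst = Path(out_dir) / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(text)
```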


gottbrath commented Feb 10, 2023

> 3. because ZenDNN is a fork of oneDNN with minor API changes, similar to how ROCm does a regex of `s/cuda/hip/`, the ZenDNN patch can regex `s/mkldnn/zendnn/` to keep the changes to PyTorch core minimal (instead of having ~3000 lines enter PyTorch core that we'll have to maintain)
>
> We are currently working on a POC of this approach. Should I create an RFC for this approach?

I think a POC and an RFC are a great idea for something of this magnitude and potential complexity.

gottbrath commented:

See pytorch/rfcs#52.

Labels: cla signed, module: convolution, oncall: jit, open source, release notes: jit, Stale, triaged