Initial integration of ZenDNN as backend into PyTorch #76242
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit 67503f7. This comment was automatically generated by Dr. CI.
@ratan-prasad can you please share some performance numbers in the PR description? Also, what is the expected binary footprint increase as a result of this change?
Also, why is ZenDNN not just part of the rocm backend integration?
PyTorch v1.11 Latency(ms) BS:1, Milan 64C, NPS4
Wheel file size comparison with and without ZenDNN build as below
@soumith ZenDNN is a CPU inference library, and the two have different release cycles based on customer needs, as well as different CPU vs. GPU product release cycles and the corresponding need for timing-aligned software releases. This is why ZenDNN is not part of the rocm integration.
@ratan-prasad Are the performance numbers exactly the same for batch size 1 and batch size 640? Have you tried other batch sizes?
Hi @huiguoo Sorry for the late reply. We have benchmarked other batch sizes as well; there was a typo for batch size 1 in the previous reply. We also observed that the conda installation of PyTorch v1.11.0 performs better than the pip-based installation for oneDNN, so we have regenerated the benchmarking data. Please find the updated improvement info below.
Please find the benchmarking details below:
PyTorch v1.11 Throughput(images/sec) BS:640, Milan 64C, NPS4
PyTorch v1.11 Throughput(images/sec) BS:128, Milan 64C, NPS4
PyTorch v1.11 Throughput(images/sec) BS:1, Milan 64C, NPS4
The proposed solution makes ZenDNN and BLIS mandatory submodules of PyTorch, even though the dependency itself will be used very sporadically (there is currently no CI/CD support for it).
Can we please solve the mutual-exclusivity issue, or modify the PR to make these optional dependencies (checked out only if USE_ZENDNN is set to true) and reuse the same DNNCPU dispatch key?
CMakeLists.txt
Outdated
@@ -128,9 +128,11 @@ endif()

set(CPU_AARCH64 OFF)
set(CPU_INTEL OFF)
set(CPU_AMD OFF)
Can you please explain what the purpose of CPU_AMD is here? You can replace CPU_INTEL with CPU_X64 if you want.
Thanks for the suggestion. We wanted to keep the existing code untouched as much as possible.
We will submit this correction in another patch on the same pull request.
CMakeLists.txt
Outdated
cmake_dependent_option(
  USE_ZENDNN "Use ZENDNN. Only available on x86 and x86_64." "${CPU_AMD}"
It should not be gated on the new CPU_AMD flag, but rather on the generic x86 architecture.
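A minimal CMake sketch of the gating the reviewer suggests, assuming a generic CPU_X64 detection variable in place of CPU_AMD (the processor regex and variable name are illustrative, not the actual PyTorch build code):

```cmake
# Illustrative sketch only: gate USE_ZENDNN on a generic x86 check
# (CPU_X64) instead of a vendor-specific CPU_AMD flag.
include(CMakeDependentOption)

if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64|amd64|i[3-6]86)$")
  set(CPU_X64 ON)
else()
  set(CPU_X64 OFF)
endif()

# Default USE_ZENDNN to ON on x86/x86_64 hosts; force it OFF elsewhere.
cmake_dependent_option(
  USE_ZENDNN "Use ZENDNN. Only available on x86 and x86_64." ON
  "CPU_X64" OFF)
```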
Thanks for the suggestion. We will submit this correction in another patch on the same pull request.
MkldnnCPU,
ZendnnCPU,
Hmm, as those are mutually exclusive right now, what's the harm of reusing the same dispatch key? (Rename it to a neutral DNNCPU.)
If we use the same dispatch key, we anticipate some challenges in the ZenDNN integration.
E.g.
How do we update “aten/src/ATen/native/native_functions.yaml” for both the mkldnn and zendnn integrations?
Currently a relu function entry in “aten/src/ATen/native/native_functions.yaml” looks as follows:

- func: relu(Tensor self) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function, method
  dispatch:
    CPU, CUDA: relu
    MkldnnCPU: mkldnn_relu
    ZendnnCPU: zendnn_relu
    QuantizedCPU: relu_quantized_cpu
How do we make the same DNNCPU dispatch key point to mkldnn_relu or zendnn_relu based on CMake-level flags? Can you please point us to an example of this?
Note: ATen calls mkldnn through the IDEEP API and the IDEEP library. ATen will call ZenDNN directly using the ZenDNN API, without the IDEEP API/library.
@malfet I think one issue with sharing the same key is that we would have to build separate packages for AMD and Intel, which isn't ideal. Ideally we could use the same package and pick a different backend at runtime.
@ratan-prasad curious about the mutually exclusive part: why is that the case? Is it symbol conflict or something?
@xw285cornell Yes, ZenDNN uses AMD-BLIS (a CBLAS API library) for optimal matrix operations on AMD hardware. AMD-BLIS has symbol clashes with MKL. That's why we are planning for mutually exclusive builds.
Can you help us with sample code or existing similar integrations where the same dispatch key is shared between two different libraries?
constexpr auto sparse_and_sparsecsr_and_mkldnn_ks_and_zendnn_ks =
    c10::sparse_ks | c10::sparse_csr_ks | c10::mkldnn_ks | c10::zendnn_ks;
if (!key_set_.has_any(sparse_and_sparsecsr_and_mkldnn_ks_and_zendnn_ks)) {
Same as above: if there were a single shared key, there would be no need to add an additional check here.
Same answer as previous.
Should we modify “CMakeLists.txt” or “cmake/Dependencies.cmake” so that the ZenDNN and BLIS repositories are downloaded during setup.py execution (when USE_ZENDNN=1 is exported)? Note: we are also looking forward to enabling CI runs for ZenDNN builds.
This PR covers the following:
- Added ZenDNN as third-party library
  - ZenDNN is a submodule for pytorch
  - modified pytorch build infra to build with ZenDNN when USE_ZENDNN is set to 1
  - pytorch build reverts to default behavior if the USE_ZENDNN environment variable is set to zero or unset
- Added blis (dependency of ZenDNN) as third-party library
  - blis is a submodule for pytorch
  - modified pytorch build infra to build with blis when BLAS is set to BLIS
  - blis is built if USE_ZENDNN is set to 1
- Added ZenDNN as backend to pytorch
  - added support for user-level control to disable or enable the zendnn backend
- Added the following changes:
  - added zendnn layout and ZendnnCPU dispatch key
  - tensor conversion between dense and zendnn layouts
  - zendnn convolution as a new ConvBackend
  - when built with zendnn (USE_ZENDNN=1) and gradient calculations are disabled (torch.no_grad), convolution is computed using the ZenDNN library
  - backward propagation is not supported
  - support is only added for the fp32 data type
- Added a few unit tests for pytorch + zendnn

Signed-off-by: Ratan Prasad <ratan.prasad2@amd.com>
Fixing the following CI errors:
- lint errors and warnings
- failure in test_autograd.py
- failure in test_mkldnn.py

Signed-off-by: ratan-prasad <ratan.prasad2@amd.com>

Fixed the following CI issues:
- failure in test_public_bindings.py
- failure in test_overrides.py
- failure in test_sparse_csr.py
- failure in documentation building

Signed-off-by: ratan-prasad <ratan.prasad2@amd.com>
Signed-off-by: Naveen Kumar T <NAVEEN.THANGUDU@amd.com>

Fixed the following:
- Fixed android build issue
- ZenDNN uses the AOCL-blis library as a dependency; AOCL-blis is used by ZenDNN independent of the BLAS setting
- BLAS=BLIS build changes restored as system default
- fixed failure in test_profiler.py for the USE_ZENDNN=1 build

Signed-off-by: ratan-prasad <ratan.prasad2@amd.com>
Is there an RFC for this? If not, we should probably proceed with the review, but for the next big change like this we should have an RFC first. If there is an RFC that has already been reviewed and approved, please link it.
Hi @b0noI,
/easycla As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
Very excited to see this going on so rapidly. Just saw the Zen 4 EPYC release today and the huge performance boost along with it; can't wait to get one myself to play around with.
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark it as Stale.
Hi @xw285cornell, @malfet, @bdhirsh, @robieta, can you please help us remove the Stale tag?
@ratan-prasad I'm not sure why it should be removed. The PR in its current form has reached a stalemate and as such cannot be merged.
I took a quick look at the ZenDNN code: https://github.com/amd/ZenDNN. It seems a lot of the code is just a clone of oneDNN (a.k.a. MKLDNN). Have you considered contributing the optimizations to oneDNN directly? That would make the PyTorch side simpler: you wouldn't have to clone the oneDNN integration again here. cc @vpirogov
Can you please let us know what can be done to break this stalemate?
What would definitely remove the stalemate is upstreaming your patches to oneDNN. If you wish to continue maintaining the fork because of company dynamics, etc., then as Nikita stated above, this PR needs to:
@malfet @soumith Thank you for the inputs on avoiding a stalemate on the ZenDNN integration.
With the substitution approach, the integration changes will be different compared to this PR. Hence, we are planning to raise a new PR.
Can we use the CMake API (“ExternalProject_Add”) to download the ZenDNN repos?
We are currently working on a POC of this approach. Should I create an RFC for it? Note: PyTorch doesn't call oneDNN directly; it calls oneDNN through another library, ideep.
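A minimal sketch of what such an optional fetch might look like with ExternalProject_Add, assuming the upstream ZenDNN repository URL; the tag placeholder and build wiring are illustrative, not the actual PyTorch build code:

```cmake
# Hypothetical sketch: fetch ZenDNN only when USE_ZENDNN is enabled,
# rather than registering it as a mandatory git submodule.
if(USE_ZENDNN)
  include(ExternalProject)
  ExternalProject_Add(zendnn_external
    GIT_REPOSITORY https://github.com/amd/ZenDNN.git
    GIT_TAG        <pinned-release-tag>   # pin a release for reproducible builds
    SOURCE_DIR     ${CMAKE_BINARY_DIR}/third_party/ZenDNN
    CONFIGURE_COMMAND ""
    BUILD_COMMAND     ""   # build would be wired into the main tree instead
    INSTALL_COMMAND   ""
  )
endif()
```

This keeps the checkout gated on USE_ZENDNN, addressing the earlier request that ZenDNN and BLIS not be mandatory submodules.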
I think a POC and an RFC is a great idea for something of this magnitude/potential complexity.
See pytorch/rfcs#52.
Fixes #ISSUE_NUMBER
cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel