BLAS library build for AArch64 wheels #679
Comments
Note, some other issues identified with the 1.8 wheels for AArch64:
Hi @nSircombe, sorry if I'm asking a naive question. May I know why you build OpenBLAS with a TARGET of `NEOVERSEN1`? As you noticed, Neoverse-N1 is a very specialised Arm64 core. In OpenBLAS, for a balance between generality and performance, maybe you can consider `TARGET=CORTEXA72`.

This file [1] from OpenBLAS explains what you described above (the GCC7 vs GCC9 flags for Neoverse-N1). This blog [2] from Arm explains the differences between the various compiler flags on Arm64 (`-march`, `-mtune`, and `-mcpu`), with several very easy-to-read graphs. That's why I think `CORTEXA72` is a better idea.

PS: I have no idea about the performance side of Cortex-A72, so no comments about that.
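To illustrate the distinction [2] draws (a sketch only; `kernel.c` is a hypothetical source file, not something from this thread):

```sh
# On AArch64, -mcpu selects both the ISA and the scheduling model for a specific core;
# it is roughly shorthand for a matching -march/-mtune pair.
gcc -mcpu=neoverse-n1 -c kernel.c                 # ~= -march=armv8.2-a (+extras) -mtune=neoverse-n1
gcc -march=armv8-a -mtune=cortex-a72 -c kernel.c  # portable v8.0 code, scheduled for Cortex-A72
```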
Hi @docularxu,

Simply because I'm only intending to use my builds on N1 hardware. In practice I would usually just rely on OpenBLAS to pick the right target (assuming I'm building on the target hardware), but this creates a problem for building a portable whl. As it stands, the current default in `build_aarch64_wheel.py` isn't portable, so I think something needs to be done. There are two options here: a generic (portable) build, or an explicitly targeted one.

With the current prevalence of N1 designs in the infrastructure space (e.g. Graviton2), I would argue there's a case for supporting this SoC explicitly in some form. ...microarchitecture tuning can have a significant impact on the performance of BLAS implementations, however I don't know whether that's the case here.
I agree that having a wheel specifically targeted at N1 is ideal. For other users a generic v8.0 wheel could be created, so it would work on Raspberry Pis and other platforms. @nSircombe, can you check what performance difference you find?
Hi @AGSaidi,

I'll take a look and report back. When it comes to performance though, I think the missing `USE_OPENMP=1` in the OpenBLAS build is likely to be the bigger issue... unless there's a good reason why this build is single-threaded? As I said, I'm not aware of one, but if there is an issue it's clearly something we'll need to work on.
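One quick way to see how a given wheel is threaded (a sketch; the output format varies between PyTorch versions):

```python
import torch

# Shows the intra-op thread count and the parallel backend in use
# (OpenMP, native thread pool, etc.) for this build.
print(torch.get_num_threads())
print(torch.__config__.parallel_info())
```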
@nSircombe @docularxu - was there an update on this?
I don't have any benchmarking that I can share for the impact of `-mtune=neoverse-n1`. However, I think it probably makes sense to split this into two issues: OpenBLAS with OpenMP, and the compiler flags.

The issue with the compiler flags is more about consistency; at the moment the choices don't appear to be consistent throughout the build. A Neoverse-N1 tuned build would be desirable, but from a performance point of view, the OpenBLAS build is going to be far more impactful. On this front, I'd like to understand if there's a reason, that I'm currently unaware of, for not setting the OpenMP flag, and if so, what a reasonable set of tests / expectations would be, so that we can work on a resolution or mitigation.
There are a number of issues with the current AArch64 whl build in https://github.com/pytorch/builder/blob/master/build_aarch64_wheel.py which appear to be impacting the performance of the finished whl.

1. OpenBLAS has not been built with `USE_OPENMP=1`. As a result, the finished PyTorch build is not using a multithreaded BLAS backend. This impacts performance, and results in the following warning (printed `OMP_NUM_THREADS` times) for a simple TorchVision ResNet50 inference example:

   `OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.`
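For context, a minimal sketch of the kind of inference run that surfaces the warning (the model construction and input shape here are assumptions, not the original reproducer):

```python
import torch
import torchvision.models as models

# Simple ResNet50 forward pass; with a single-threaded OpenBLAS and PyTorch's
# OpenMP parallel regions active, each OpenMP thread prints the warning once.
model = models.resnet50().eval()  # pretrained weights omitted; not needed to exercise the BLAS path
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(x)
print(out.shape)  # torch.Size([1, 1000])
```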
2. OpenBLAS is built for a NeoverseN1 target, but with a version of GCC that does not support `-mtune=neoverse-n1`. OpenBLAS correctly identifies the t6g (Neoverse-N1) platform it is being built on, but GCC only provides support for `-mtune=neoverse-n1` from v9 onwards, so the build proceeds with `-march=armv8.2-a -mtune=cortex-a72` instead. Note: targeting the v8.2 ISA risks generating a binary which is not portable; a "generic" build would need to be provided for portability, although this would impact performance.
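A quick way to check whether the compiler on a given build host accepts the N1 tuning flag (a sketch; the probe file is hypothetical):

```sh
# GCC < 9 rejects -mtune=neoverse-n1; this probe shows whether the build
# would silently fall back to a different tuning target.
echo 'int main(void){return 0;}' > /tmp/probe.c
if gcc -mtune=neoverse-n1 /tmp/probe.c -o /tmp/probe 2>/dev/null; then
    echo "-mtune=neoverse-n1 supported (GCC >= 9)"
else
    echo "-mtune=neoverse-n1 not supported; expect a cortex-a72 fallback"
fi
```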
3. The build has `USE_EIGEN_FOR_BLAS` set. This can be seen in the output of `print(*torch.__config__.show().split("\n"), sep="\n")`. As I understand it, this should not be required if a BLAS library like OpenBLAS is provided.
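For example, to pull out just the BLAS-related lines from that output (a sketch; the exact keys vary between PyTorch versions):

```python
import torch

# Filter the build configuration for BLAS/OpenMP-related settings,
# e.g. BLAS_INFO, USE_EIGEN_FOR_BLAS, USE_OPENMP.
for line in torch.__config__.show().split("\n"):
    if any(key in line for key in ("BLAS", "OPENMP", "OpenMP")):
        print(line)
```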
4. `-march` and `-mtune` do not appear to have been set for the PyTorch build. Building with `-mcpu=native` will choose the appropriate `-march` and `-mtune` for the host system (again, this will have implications for portability).
Updating `build_aarch64_wheel.py` so that the OpenBLAS build uses `USE_OPENMP=1`, and the PyTorch build uses appropriate `-march`/`-mtune` settings, results in no more

`OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.`

warnings.

Will it be possible to update the AArch64 build to support: multi-threaded OpenBLAS; disabling of Eigen BLAS; and use of the correct Neoverse optimisations throughout? This will ensure the .whl gives better performance, consistent with what you would get when building from source.
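For reference, a sketch of the kind of OpenBLAS build invocation this implies (the exact values from the original report were not preserved in this copy, so the flags below are assumptions):

```sh
# Hypothetical reconstruction: a multi-threaded, N1-targeted OpenBLAS build.
# TARGET=ARMV8 would give the portable "generic" build discussed above.
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make -j"$(nproc)" USE_OPENMP=1 TARGET=NEOVERSEN1
sudo make USE_OPENMP=1 TARGET=NEOVERSEN1 PREFIX=/usr/local install
```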