
BLAS library build for AArch64 wheels #679

Closed

nSircombe opened this issue Mar 17, 2021 · 8 comments
Labels
pypi PyPI, pip, wheel related issues

Comments

@nSircombe

nSircombe commented Mar 17, 2021

There are a number of issues with the current AArch64 whl build in https://github.com/pytorch/builder/blob/master/build_aarch64_wheel.py which appear to impact the performance of the finished whl.

  1. OpenBLAS has not been built with USE_OPENMP=1.
    As a result, the finished PyTorch build is not using a multithreaded BLAS backend. This impacts performance, and produces the following warning (repeated OMP_NUM_THREADS times) for a simple TorchVision ResNet50 inference example: OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

  2. OpenBLAS is built for a NeoverseN1 target, but with a version of GCC that does not support -mtune=neoverse-n1.
    OpenBLAS correctly identifies the t6g (NeoverseN1) platform it is being built on, but GCC only supports -mtune=neoverse-n1 from v9 onwards, so the build proceeds with -march=armv8.2-a -mtune=cortex-a72 instead. Note: targeting the v8.2 ISA risks producing a binary that is not portable; a "generic" build would be needed for portability, although this would impact performance.

  3. The build has USE_EIGEN_FOR_BLAS set.
    This can be seen in the output of print(*torch.__config__.show().split("\n"), sep="\n") (see the check after this list). As I understand it, this should not be required when a BLAS library like OpenBLAS is provided.

  4. -march and -mtune do not appear to have been set for the PyTorch build.
    Building with -mcpu=native will choose the appropriate -march and -mtune for the host system (again, this has implications for portability).
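For reference, a minimal way to inspect these build settings from an installed wheel, using PyTorch's torch.__config__ API (the grep pattern is just illustrative):

python -c 'import torch; print(torch.__config__.show())' | grep -E 'BLAS|OpenMP|EIGEN'
python -c 'import torch; print(torch.__config__.parallel_info())'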

Updating build_aarch64_wheel.py so that the OpenBLAS build uses:

LDFLAGS=-lgfortran make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8
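For context, a sketch of the full OpenBLAS build and install sequence under this change (assuming the /opt/OpenBLAS prefix referenced by OpenBLAS_HOME below):

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
LDFLAGS=-lgfortran make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8
make install PREFIX=/opt/OpenBLAS NO_SHARED=1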

and the PyTorch build uses:

build_vars += f"OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 CXXFLAGS='-mcpu=native -O3'"
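Illustratively, outside the builder script these variables would drive a manual wheel build along these lines (python setup.py bdist_wheel is PyTorch's standard wheel entry point; the exact invocation inside build_aarch64_wheel.py may differ):

OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 \
  USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 \
  CXXFLAGS='-mcpu=native -O3' python setup.py bdist_wheel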

Results in:

  • the disappearance of the warning: OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.
  • a 30% speedup in a simple ResNet50 inference example.
  • a 70% fall in latency for a simple BERT example.

Would it be possible to update the AArch64 build to support multi-threaded OpenBLAS, disable Eigen BLAS, and use the correct Neoverse optimisations throughout? This would ensure the .whl gives better performance, consistent with what you would get when building from source.

@docularxu

2. OpenBLAS is built for a NeoverseN1 target, but with a version of GCC that does not support -mtune=neoverse-n1.
OpenBLAS correctly identifies the t6g (NeoverseN1) platform it is being built on, but GCC only supports -mtune=neoverse-n1 from v9 onwards, so the build proceeds with -march=armv8.2-a -mtune=cortex-a72 instead. Note: targeting the v8.2 ISA risks producing a binary that is not portable; a "generic" build would be needed for portability, although this would impact performance.

Hi @nSircombe, sorry if I'm asking a naive question. May I know why you build OpenBLAS with TARGET set to NEOVERSEN1? As you noticed, Neoverse-N1 is a very specialised Arm64 core. In OpenBLAS, for a balance between generality and performance, you might consider TARGET=CORTEXA72, which corresponds to -march=armv8-a -mtune=cortex-a72.

This file [1] from OpenBLAS explains the behaviour you described above (the GCC 7 vs. GCC 9 flags for NEOVERSEN1):
[1] https://github.com/xianyi/OpenBLAS/blob/63b03efc2af332c88b86d4fd8079d00f4b439adf/Makefile.arm64#L27

This blog post [2] from Arm explains the differences between the -march, -mtune, and -mcpu compiler flags on Arm64, with several easy-to-read diagrams. It shows why binaries built with -march=armv8-a are guaranteed to run on more architectures than those built with -march=armv8.2-a.
[2] https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
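To illustrate the distinction (hello.c is just a stand-in source file):

gcc -march=armv8-a -c hello.c      # ISA baseline: code runs on any Armv8.0-A core
gcc -mtune=cortex-a72 -c hello.c   # tune instruction scheduling for A72; ISA unchanged
gcc -mcpu=neoverse-n1 -c hello.c   # sets both -march and -mtune for N1 (GCC 9 onwards)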

That's why I think CORTEXA72 is a better choice. P.S. I have no data on the performance side of Cortex-A72, so no comments about that.

@nSircombe
Author

nSircombe commented Mar 30, 2021

Hi @docularxu,

May I know why you build OpenBLAS for a TARGET for 'neoverseN1'?

Simply because I'm only intending to use my builds on N1 hardware. In practice I would usually just rely on OpenBLAS to pick the right target (assuming I'm building on target hardware). But this creates a problem for building a portable whl.

As it stands, the current default in build_aarch64_wheel.py doesn't explicitly set a target, so it's left up to OpenBLAS to decide what to build for. If the script is run on t6g (a Neoverse-N1 based SoC), OpenBLAS will identify it as such and set -mtune=neoverse-n1 -march=armv8.2-a for GCC 9 or greater, and -mtune=cortex-a72 -march=armv8.2-a for older GCC versions. So, at present, it's not choosing a72 by design; that's just a side-effect of builds done with GCC < 9. Importantly, in neither case does it pick -march=armv8-a, so it runs the risk of generating v8.2 instructions, which will impact the portability of the binary.

So, I think something needs to be done - I think there are two options here:

  • pick a target for OpenBLAS that is more portable (i.e. one that doesn't use v8.2), such as a72 or generic (see the sketch after this list).
  • as above, but also provide an N1 build as a separate whl.
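A sketch of what these options would look like, using OpenBLAS's own TARGET names (ARMV8 is OpenBLAS's fully generic Armv8 target):

make TARGET=CORTEXA72 USE_OPENMP=1 NO_SHARED=1 -j8    # portable: -march=armv8-a
make TARGET=ARMV8 USE_OPENMP=1 NO_SHARED=1 -j8        # fully generic Armv8 build
make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8   # N1-optimised (v8.2) separate whl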

With the current prevalence of N1 designs in the infrastructure space (e.g. Graviton2), I would argue there's a case for supporting this SoC explicitly in some form.

...microarchitecture tuning can have a significant impact on the performance of BLAS implementations; however, I don't know if that's the case for TARGET=CORTEXA72 vs. TARGET=NEOVERSEN1 running on N1 hardware (for example) - I should check!

@AGSaidi

AGSaidi commented Apr 6, 2021

I agree that having a wheel specifically targeted at N1 is ideal. For other users, a generic v8.0 wheel could be created so it would work on Raspberry Pis and other platforms. @nSircombe, can you check what performance difference you find?

@nSircombe
Author

nSircombe commented Apr 7, 2021

Hi @AGSaidi,

can you check what performance difference you find?

I'll take a look and report back.

When it comes to performance, though, I think the missing USE_OPENMP=1 build flag is going to be much more significant. I've always built PyTorch linked against a multithreaded BLAS library without issue, and I believe it would make sense for the builder script to do this too, irrespective of the performance and portability implications of the various mcpu/mtune/march options.
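As a quick sanity check on any candidate wheel (assuming stock PyTorch behaviour, where torch.get_num_threads() reports the intra-op thread count):

OMP_NUM_THREADS=8 python -c 'import torch; print(torch.get_num_threads())'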

...unless there's a good reason why this build is single-threaded? As I said, I'm not aware of one, but if there is an issue it's clearly something we'll need to work on.

@AWSNB

AWSNB commented Apr 27, 2021

@nSircombe @docularxu - was there an update on this?

@nSircombe
Author

I don't have any benchmarking that I can share for the impact of mcpu etc.

However, I think it probably makes sense to split this into two issues: OpenBLAS with OpenMP, and the mcpu/march/mtune flags.

The issue with the compiler flags is more about consistency - at the moment the choices don't appear to be consistent through the build. A Neoverse-N1 tuned build would be desirable, but from a performance point of view, the OpenBLAS build is going to be far more impactful. On this front, I'd like to understand if there's a reason for not setting the OpenMP flag that I'm currently unaware of, and if so, what a reasonable set of tests / expectations would be, so that we can work on a resolution or mitigation.

@rgommers added the pypi (PyPI, pip, wheel related issues) label Nov 26, 2021
@nSircombe
Author

Closing this as I think the outstanding issues are resolved by fae8f0f and #818.
