
BLAS library build for AArch64 wheels #679

Closed

nSircombe opened this issue Mar 17, 2021 · 8 comments
Labels
pypi PyPI, pip, wheel related issues

Comments

@nSircombe

nSircombe commented Mar 17, 2021

There are a number of issues with the current AArch64 whl build in https://github.com/pytorch/builder/blob/master/build_aarch64_wheel.py which appear to impact the performance of the finished whl.

  1. OpenBLAS has not been built with USE_OPENMP=1.
    As a result, the finished PyTorch build is not using a multithreaded BLAS backend. This impacts performance, and produces the following warning (repeated OMP_NUM_THREADS times) for a simple TorchVision ResNet50 inference example: OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

  2. OpenBLAS is built for a NeoverseN1 target, but with a version of GCC that does not support -mtune=neoverse-n1.
    OpenBLAS correctly identifies the t6g (NeoverseN1) platform it is being built on, but GCC only supports -mtune=neoverse-n1 from v9 onwards, so the build proceeds with -march=armv8.2-a -mtune=cortex-a72 instead. Note: targeting the v8.2 ISA risks producing a binary that is not portable; a "generic" build would be needed for portability, although this would impact performance.

  3. The build has USE_EIGEN_FOR_BLAS set.
    This can be seen in the output of print(*torch.__config__.show().split("\n"), sep="\n") (see the check after this list). As I understand it, this should not be required when a BLAS library like OpenBLAS is provided.

  4. -march and -mtune do not appear to have been set for the PyTorch build.
    Building with -mcpu=native will choose the appropriate -march and -mtune for the host system (again, this has implications for portability).
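For reference, a minimal way to inspect these build settings from an installed wheel, using PyTorch's torch.__config__ API (the grep pattern is just illustrative):

python -c 'import torch; print(torch.__config__.show())' | grep -E 'BLAS|OpenMP|EIGEN'
python -c 'import torch; print(torch.__config__.parallel_info())'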

Updating build_aarch64_wheel.py so that the OpenBLAS build uses:

LDFLAGS=-lgfortran make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8
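For context, a sketch of the full OpenBLAS build and install sequence under this change (assuming the /opt/OpenBLAS prefix referenced by OpenBLAS_HOME below):

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
LDFLAGS=-lgfortran make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8
make install PREFIX=/opt/OpenBLAS NO_SHARED=1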

and the PyTorch build uses:

build_vars += f"OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 CXXFLAGS='-mcpu=native -O3'"
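Illustratively, outside the builder script these variables would drive a manual wheel build along these lines (python setup.py bdist_wheel is PyTorch's standard wheel entry point; the exact invocation inside build_aarch64_wheel.py may differ):

OpenBLAS_HOME='/opt/OpenBLAS' BLAS='OpenBLAS' USE_MKLDNN=0 USE_OPENMP=1 \
  USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 \
  CXXFLAGS='-mcpu=native -O3' python setup.py bdist_wheel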

Results in:

  • the disappearance of the warning: OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.
  • a 30% speedup in a simple ResNet50 inference example.
  • a 70% fall in latency for a simple BERT example.

Would it be possible to update the AArch64 build to support multi-threaded OpenBLAS, disable Eigen BLAS, and use the correct Neoverse optimisations throughout? This would ensure the .whl gives better performance, consistent with what you would get when building from source.

@docularxu

2. OpenBLAS is built for a NeoverseN1 target, but with a version of GCC that does not support -mtune=neoverse-n1.
OpenBLAS correctly identifies the t6g (NeoverseN1) platform it is being built on, but GCC only supports -mtune=neoverse-n1 from v9 onwards, so the build proceeds with -march=armv8.2-a -mtune=cortex-a72 instead. Note: targeting the v8.2 ISA risks producing a binary that is not portable; a "generic" build would be needed for portability, although this would impact performance.

Hi @nSircombe, sorry if I'm asking a naive question. May I know why you build OpenBLAS with TARGET set to NEOVERSEN1? As you noticed, Neoverse-N1 is a very specialised Arm64 core. In OpenBLAS, for a balance between generality and performance, you might consider TARGET=CORTEXA72, which corresponds to -march=armv8-a -mtune=cortex-a72.

This file [1] from OpenBLAS explains the behaviour you described above (the GCC 7 vs. GCC 9 flags for NEOVERSEN1):
[1] https://github.com/xianyi/OpenBLAS/blob/63b03efc2af332c88b86d4fd8079d00f4b439adf/Makefile.arm64#L27

This blog post [2] from Arm explains the differences between the -march, -mtune, and -mcpu compiler flags on Arm64, with several easy-to-read diagrams. It shows why binaries built with -march=armv8-a are guaranteed to run on more architectures than those built with -march=armv8.2-a.
[2] https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
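To illustrate the distinction (hello.c is just a stand-in source file):

gcc -march=armv8-a -c hello.c      # ISA baseline: code runs on any Armv8.0-A core
gcc -mtune=cortex-a72 -c hello.c   # tune instruction scheduling for A72; ISA unchanged
gcc -mcpu=neoverse-n1 -c hello.c   # sets both -march and -mtune for N1 (GCC 9 onwards)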

That's why I think CORTEXA72 is a better choice. P.S. I have no data on the performance side of Cortex-A72, so no comments about that.

@nSircombe
Author

nSircombe commented Mar 30, 2021

Hi @docularxu,

May I know why you build OpenBLAS for a TARGET for 'neoverseN1'?

Simply because I'm only intending to use my builds on N1 hardware. In practice I would usually just rely on OpenBLAS to pick the right target (assuming I'm building on target hardware). But this creates a problem for building a portable whl.

As it stands, the current default in build_aarch64_wheel.py doesn't explicitly set a target, so it's left up to OpenBLAS to decide what to build for. If the script is run on t6g (a Neoverse-N1 based SoC), OpenBLAS will identify it as such and set -mtune=neoverse-n1 -march=armv8.2-a for GCC 9 or greater, and -mtune=cortex-a72 -march=armv8.2-a for older GCC versions. So, at present, it's not choosing a72 by design; that's just a side-effect of builds done with GCC < 9. Importantly, in neither case does it pick -march=armv8-a, so it runs the risk of generating v8.2 instructions, which will impact the portability of the binary.

So, I think something needs to be done - I think there are two options here:

  • pick a target for OpenBLAS that is more portable (i.e. one that doesn't use v8.2), such as a72 or generic (see the sketch after this list).
  • as above, but also provide an N1 build as a separate whl.
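A sketch of what these options would look like, using OpenBLAS's own TARGET names (ARMV8 is OpenBLAS's fully generic Armv8 target):

make TARGET=CORTEXA72 USE_OPENMP=1 NO_SHARED=1 -j8    # portable: -march=armv8-a
make TARGET=ARMV8 USE_OPENMP=1 NO_SHARED=1 -j8        # fully generic Armv8 build
make TARGET=NEOVERSEN1 USE_OPENMP=1 NO_SHARED=1 -j8   # N1-optimised (v8.2) separate whl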

With the current prevalence of N1 designs in the infrastructure space (e.g. Graviton2), I would argue there's a case for supporting this SoC explicitly in some form.

...microarchitecture tuning can have a significant impact on the performance of BLAS implementations; however, I don't know if that's the case for TARGET=CORTEXA72 vs. TARGET=NEOVERSEN1 running on N1 hardware (for example) - I should check!

@AGSaidi

AGSaidi commented Apr 6, 2021

I agree that having a wheel specifically targeted at N1 is ideal. For other users, a generic v8.0 wheel could be created so it would work on Raspberry Pis and other platforms. @nSircombe, can you check what performance difference you find?

@nSircombe
Author

nSircombe commented Apr 7, 2021

Hi @AGSaidi,

can you check what performance difference you find?

I'll take a look and report back.

When it comes to performance, though, I think the missing USE_OPENMP=1 build flag is going to be much more significant. I've always built PyTorch linked against a multithreaded BLAS library without issue, and I believe it would make sense for the builder script to do this too, irrespective of the performance and portability implications of the various mcpu/mtune/march options.
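As a quick sanity check on any candidate wheel (assuming stock PyTorch behaviour, where torch.get_num_threads() reports the intra-op thread count):

OMP_NUM_THREADS=8 python -c 'import torch; print(torch.get_num_threads())'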

...unless there's a good reason why this build is single-threaded? As I said, I'm not aware of one, but if there is an issue it's clearly something we'll need to work on.

@AWSNB

AWSNB commented Apr 27, 2021

@nSircombe @docularxu - was there an update on this?

@nSircombe
Author

I don't have any benchmarking that I can share for the impact of mcpu etc.

However, I think it probably makes sense to split this into two issues: OpenBLAS with OpenMP, and the mcpu/march/mtune flags.

The issue with the compiler flags is more about consistency - at the moment the choices don't appear to be consistent through the build. A Neoverse-N1 tuned build would be desirable, but from a performance point of view, the OpenBLAS build is going to be far more impactful. On this front, I'd like to understand if there's a reason for not setting the OpenMP flag that I'm currently unaware of, and if so, what a reasonable set of tests / expectations would be, so that we can work on a resolution or mitigation.

@rgommers added the pypi (PyPI, pip, wheel related issues) label Nov 26, 2021
@nSircombe
Author

Closing this as I think the outstanding issues are resolved by fae8f0f and #818.
