Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Aarch64 support #183

Closed
wants to merge 6 commits into from
Closed

Conversation

coreyjjames
Copy link

@coreyjjames coreyjjames commented Dec 2, 2019

This pull request adds Aarch64 support to the FLAC project.
fixes #156

What is included in this pull request:

  • Edited configure script to check for aarch64 CPU and define FLAC__CPU_AARCH64 if aarch64 is detected.
  • Edited configure script to check for arm_neon.h and define FLAC__HAS_NEONINTRIN if arm_neon.h is detected.
  • Add the logic required to choose the aarch64 versions of the lpc_compute_autocorrelation function. When on an aarch64 machine.
  • Translated SSE intrinsics from the lpc_intrin_sse.c file into neon intrinsics and put them into a file called lpc_intrin_neon.c.
  • Added the new file lpc_intrin_neon.c to the src/libFLAC/Makefile.am.

Performance Boost to encoding:
I tested the performance with two aarch64 machines. The test I ran was encoding a .wav file to .flac. The size of the wave file was 1.57 gigabytes.
The machine, with a cortex-a57 8 threads, I got a performance increase of 106.57% for the compute autocorrelation function.
A savings of 0m8.281 seconds.
The machine, with a cortex-a53 24 threads, I got a performance increase of 254.6% for the compute autocorrelation function.
A savings of 1m40.166 seconds.

@petterreinholdtsen
Copy link

Will this patch do CPU feature detection at runtime or compile time? It look like it will do it at compile time, and I wonder if it is better to do it at run time?

@coreyjjames
Copy link
Author

From what I could see all CPU detection for this project is done at compile time. So I followed the convention that the project is currently doing.

@petterreinholdtsen
Copy link

To me it look like the sorting into x86 or ppc is done at compile time, and the feature detection is done at run time. Does something like that make sense for aarch64?

@coreyjjames
Copy link
Author

Oh sorry I miss understood you. Yes, I implemented the featured detection for aarch64 at runtime. I followed the same pattern as the x86 and ppc. The program will decide what feature to run based on conditions at runtime. For example, it will choose which auto correlation function to run based on the variable encoder->protected_->max_lpc_order.

@erikd
Copy link
Member

erikd commented Jan 4, 2020

Sorry, forgot about / lost track of this PR.

@coreyjjames would you be able to squash this down to a single commit? Also, is there anyway we can set up CI for Aarch64?

@coreyjjames
Copy link
Author

@erikd I squashed the commits into a single commit and check out this link from the Travis CI documentation. It looks like Travis CI should be able to do Arrch64 testing. https://docs.travis-ci.com/user/multi-cpu-architectures/

@coreyjjames
Copy link
Author

coreyjjames commented Jan 7, 2020

@erikd I added your PR #191 to this branch. I think the issue may be a problem with the GCC compiler version. Travis is using GCC version 5.4.1 and I tested with GCC version 8.3.1.

@erikd
Copy link
Member

erikd commented Jan 7, 2020

@coreyjjames What Linux distro are you running? Are you able to figure out which header file provides these functions and what package provides that header file?

@coreyjjames
Copy link
Author

@erikd Been busy just got some time to look into this issue again. So it seems like the intrinsic "vcopyq_laneq_f32" causes the problem.

From my research, the reason for the issue is the "vcopyq_laneq_f32" intrinsic it is one of the Aarch64 exclusive intrinsic and Travis CI is running arm64 (ARMv8) that is why we are getting an error.

I am going to look into a substitute for the "vcopyq_laneq_f32" intrinsic. I am going to see if I can find a solution that is more compatible with the different versions of ARM.

@NotTsunami
Copy link
Contributor

NotTsunami commented Jan 23, 2020

Can you update the CMake configuration as well?

@NotTsunami
Copy link
Contributor

Can you rebase this again? The CPU detection for CMake was recently changed so that might be the fault of the falsely reported optimizations. @coreyjjames

@NotTsunami NotTsunami mentioned this pull request May 21, 2020
@coreyjjames
Copy link
Author

@NotTsunami

Ya, I can rebase again.

Also, are you seeing the issues on the arm64 and OSX builds?
I just noticed it is building with arm64 OSX that is a bug with the merge. On my fork, Travis did not do that. It is not supposed to build with arm64 and OSX.

@NotTsunami
Copy link
Contributor

Exactly that.

@coreyjjames
Copy link
Author

I just pushed the rebase, and Travis is still building with arm64 and OSX. On my fork, it does not do that. Does anyone know why it is doing this?

@coreyjjames
Copy link
Author

I believe I fixed the problem with Travis building with arm and osx.

I am still looking into why the Autotools tests are considerably slower than the Cmake tests. From my debugging so far, the Autotools tests seem to be running the optimization code, but I am not getting the speed increase as I see with Cmake tests.

I am thinking its something with the build files if anybody wants to take a look, I would appreciate a second perspective.

configure.ac Outdated
@@ -156,6 +156,12 @@ case "$host_cpu" in
AH_TEMPLATE(FLAC__CPU_PPC, [define if building for PowerPC])
asm_optimisation=$asm_opt
;;
arm*|aarch64*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe all we need is:
arm64|aarch64)

Otherwise, armv7 gets caught up here too (and is still very common). I don't believe we need anything else here, but someone with more architecture knowledge can chime in.

elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "([xX]|i[346])86")
set(IA32 TRUE)
add_definitions(-DFLAC__CPU_IA32 -DFLAC__ALIGN_MALLOC_DATA)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "(arm|aarch64)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be:
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "(arm64|aarch64)")

Otherwise you will catch armv7, which I don't believe you are supporting.

@@ -154,6 +154,7 @@ typedef enum {
FLAC__CPUINFO_TYPE_IA32,
FLAC__CPUINFO_TYPE_X86_64,
FLAC__CPUINFO_TYPE_PPC,
FLAC__CPUINFO_TYPE_ARM,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisited this today, and I think FLAC__CPUINFO_TYPE_ARM64 is a more accurate name

@NotTsunami
Copy link
Contributor

Any progress on debugging? Hopefully this can get merged soon.

@coreyjjames Friendly ping

@coreyjjames
Copy link
Author

@NotTsunami

Hey,

Still working on the auto tools build configuration. I am thinking there is a compiler option that needs to be set or unset. That is causing the tests to run slow.

Either that or everything is correct and it just runs slower building with autotools vs CMake.

I should have some time tonight to work on it.

@coreyjjames
Copy link
Author

Update on progress

I cannot find any issue that would explain why the Autotools test suite is significantly slower than the CMake test suite. I did notice that on all platforms, the Autotools test suite is slower than the CMake test suite; this is leading me to think that it is correct, and if we want it to build faster, more optimizations need to be added to improve the performance.

I believe I am ready for this pull request to be reviewed, please let me know if you would like any changes to be made.

@nnghuy
Copy link

nnghuy commented Nov 21, 2020

I found myself here after seeing the native Apple M1 ARM64 vs Rosetta translated FLAC encoding benchmark on Phoronix.

FWIW, this PR compiles and runs on an M1 Air.

As a sanity check, I ran the benchmark above with native 1.3.3 and got a similar 28.x sec. I reran with this PR and got 14.x sec.

Rosetta translation is still slightly faster (!!!), but the current improvements are large.

@coreyjjames
Copy link
Author

@nnghuy Cool, Thanks for testing this PR on the new M1 chip! Glad to hear it compiles and runs.

This PR basically only adds one optimization to the FLAC project. It would be interesting if we could beat Rosetta's score with more optimizations.

tmkk added a commit to tmkk/flac that referenced this pull request Dec 9, 2020
@moisespr123
Copy link

Seems to be running fine on Termux on a Samsung Galaxy S20 Ultra. Great job!

@ktmf01
Copy link
Collaborator

ktmf01 commented Apr 29, 2022

Closing because of the merge of #270 and #332

@ktmf01 ktmf01 closed this Apr 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aarch64 support
7 participants