Add Aarch64 support #183
Will this patch do CPU feature detection at runtime or at compile time? It looks like it does it at compile time, and I wonder whether it would be better to do it at runtime.
From what I could see, all CPU detection for this project is done at compile time, so I followed the project's current convention.
To me it looks like the sorting into x86 or ppc is done at compile time, and the feature detection is done at runtime. Does something like that make sense for aarch64?
Oh sorry, I misunderstood you. Yes, I implemented the feature detection for aarch64 at runtime, following the same pattern as x86 and ppc. The program decides which implementation to run based on conditions at runtime. For example, it chooses which autocorrelation function to run based on the variable encoder->protected_->max_lpc_order.
Sorry, forgot about / lost track of this PR. @coreyjjames would you be able to squash this down to a single commit? Also, is there any way we can set up CI for Aarch64?
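The split described above (architecture chosen at compile time, kernel chosen at runtime) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names autocorr_fn, autocorr_generic, autocorr_small_lag, select_autocorr and BUILD_IS_AARCH64 are hypothetical, the threshold of 8 is an assumption, and the "specialised" kernel is only a placeholder that forwards to the portable one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature for illustration; not FLAC's actual prototype. */
typedef void (*autocorr_fn)(const float *data, size_t len, size_t lag, float *autoc);

/* Portable reference kernel: autoc[l] = sum over i of data[i] * data[i + l]. */
static void autocorr_generic(const float *data, size_t len, size_t lag, float *autoc)
{
    assert(lag <= len);
    for (size_t l = 0; l < lag; l++) {
        float sum = 0.0f;
        for (size_t i = 0; i + l < len; i++)
            sum += data[i] * data[i + l];
        autoc[l] = sum;
    }
}

/* Stand-in for a kernel specialised for small lags. A real one would use
 * NEON intrinsics; this placeholder just forwards to the reference kernel. */
static void autocorr_small_lag(const float *data, size_t len, size_t lag, float *autoc)
{
    autocorr_generic(data, len, lag, autoc);
}

/* Compile-time part: which architecture is this build targeting? */
#if defined(__aarch64__)
#define BUILD_IS_AARCH64 1
#else
#define BUILD_IS_AARCH64 0
#endif

/* Run-time part: pick a kernel from a value known only while encoding.
 * The threshold of 8 is an illustrative assumption. */
static autocorr_fn select_autocorr(unsigned max_lpc_order)
{
    if (BUILD_IS_AARCH64 && max_lpc_order < 8)
        return autocorr_small_lag;
    return autocorr_generic;
}
```

A caller would store the returned pointer once during encoder setup and invoke it per block, so the selection cost is paid only once.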
f1d79f2 to 9f08b02
@erikd I squashed the commits into a single commit. Also, check out this link from the Travis CI documentation; it looks like Travis CI should be able to do Aarch64 testing: https://docs.travis-ci.com/user/multi-cpu-architectures/
9f08b02 to 9242aab
@coreyjjames What Linux distro are you running? Are you able to figure out which header file provides these functions and what package provides that header file?
@erikd Been busy, just got some time to look into this issue again. It seems the intrinsic "vcopyq_laneq_f32" causes the problem. From my research, "vcopyq_laneq_f32" is one of the Aarch64-exclusive intrinsics, and Travis CI is running arm64 (ARMv8), which is why we are getting an error. I am going to look for a substitute for "vcopyq_laneq_f32" that is more compatible with the different versions of ARM.
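One possible substitute, sketched under assumptions rather than taken from the PR: a lane copy can be expressed with vgetq_lane_f32 and vsetq_lane_f32, which are available in both ARMv7 NEON and AArch64. The helper name copy_lane0_to_lane3 and the choice of lanes are hypothetical; the scalar branch exists only so the sketch also builds on machines without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes dst_in to out, with lane 3 replaced by lane 0
 * of src_in. On AArch64 this is what vcopyq_laneq_f32(dst, 3, src, 0) does. */
static void copy_lane0_to_lane3(const float dst_in[4], const float src_in[4], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t dst = vld1q_f32(dst_in);
    float32x4_t src = vld1q_f32(src_in);
    /* Extract one lane and insert it into another; lane indices must be
     * compile-time constants. */
    dst = vsetq_lane_f32(vgetq_lane_f32(src, 0), dst, 3);
    vst1q_f32(out, dst);
#else
    /* Scalar fallback producing the same result. */
    for (int i = 0; i < 4; i++)
        out[i] = dst_in[i];
    out[3] = src_in[0];
#endif
}
```

Whether the get/set pair compiles to code as efficient as the single AArch64 intrinsic depends on the compiler, so it would be worth benchmarking before adopting it.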
Can you update the CMake configuration as well?
Can you rebase this again? The CPU detection for CMake was recently changed, so that might be the cause of the falsely reported optimizations. @coreyjjames
Ya, I can rebase again. Also, are you seeing the issues on the arm64 and OSX builds?
Exactly that.
8143fe5 to f592baa
I just pushed the rebase, and Travis is still building with arm64 and OSX. On my fork, it does not do that. Does anyone know why it is doing this?
878afa7 to 605dccf
fadc721 to 6223ab6
I believe I fixed the problem with Travis building with arm and OSX. I am still looking into why the Autotools tests are considerably slower than the CMake tests. From my debugging so far, the Autotools tests seem to be running the optimization code, but I am not getting the speed increase I see with the CMake tests. I think it's something in the build files. If anybody wants to take a look, I would appreciate a second perspective.
configure.ac
@@ -156,6 +156,12 @@ case "$host_cpu" in
AH_TEMPLATE(FLAC__CPU_PPC, [define if building for PowerPC])
asm_optimisation=$asm_opt
;;
arm*|aarch64*)
I believe all we need is:
arm64|aarch64)
Otherwise, armv7 gets caught up here too (and is still very common). I don't believe we need anything else here, but someone with more architecture knowledge can chime in.
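To illustrate the concern, here is a small sketch that models the glob alternatives of a shell case statement with POSIX fnmatch. It is only an approximation of what configure does; the names cpu_matches, broad_patterns and narrow_patterns are hypothetical, and "armv7l" is used as an example of a CPU string reported by a 32-bit ARM host.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Alternatives from the patch under review and from the suggested fix. */
static const char *const broad_patterns[]  = { "arm*", "aarch64*" };
static const char *const narrow_patterns[] = { "arm64", "aarch64" };

/* Returns true when cpu matches any of the glob patterns. */
static bool cpu_matches(const char *cpu, const char *const *patterns, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* fnmatch returns 0 when the string matches the pattern. */
        if (fnmatch(patterns[i], cpu, 0) == 0)
            return true;
    }
    return false;
}
```

With these tables, "armv7l" matches the broad patterns but not the narrow ones, while "aarch64" matches both, which is the behaviour the review comment asks for.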
src/libFLAC/CMakeLists.txt
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "([xX]|i[346])86")
set(IA32 TRUE)
add_definitions(-DFLAC__CPU_IA32 -DFLAC__ALIGN_MALLOC_DATA)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "(arm|aarch64)")
Should probably be:
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "(arm64|aarch64)")
Otherwise you will catch armv7, which I don't believe you are supporting.
@@ -154,6 +154,7 @@ typedef enum {
FLAC__CPUINFO_TYPE_IA32,
FLAC__CPUINFO_TYPE_X86_64,
FLAC__CPUINFO_TYPE_PPC,
FLAC__CPUINFO_TYPE_ARM,
Revisited this today, and I think FLAC__CPUINFO_TYPE_ARM64 is a more accurate name
Any progress on debugging? Hopefully this can get merged soon. @coreyjjames Friendly ping
Hey, still working on the Autotools build configuration. I think there is a compiler option that needs to be set or unset, and that is causing the tests to run slowly. Either that, or everything is correct and the tests just run slower when built with Autotools than with CMake. I should have some time tonight to work on it.
Update on progress
I cannot find any issue that would explain why the Autotools test suite is significantly slower than the CMake test suite. I did notice that the Autotools test suite is slower than the CMake test suite on all platforms, which leads me to think the build is correct, and that if we want it to run faster, more optimizations need to be added. I believe this pull request is ready for review; please let me know if you would like any changes.
I found myself here after seeing the native Apple M1 ARM64 vs Rosetta translated FLAC encoding benchmark on Phoronix. FWIW, this PR compiles and runs on an M1 Air. As a sanity check, I ran the benchmark above with native 1.3.3 and got a similar result. Rosetta translation is still slightly faster (!!!), but the current improvements are large.
@nnghuy Cool, thanks for testing this PR on the new M1 chip! Glad to hear it compiles and runs. This PR basically adds only one optimization to the FLAC project. It would be interesting to see whether we could beat Rosetta's score with more optimizations.
Seems to be running fine on Termux on a Samsung Galaxy S20 Ultra. Great job!
This pull request adds Aarch64 support to the FLAC project.
fixes #156
What is included in this pull request:
Performance Boost to encoding:
I tested the performance on two aarch64 machines. The test encoded a .wav file to .flac; the .wav file was 1.57 gigabytes.
On the machine with a Cortex-A57 (8 threads), I got a performance increase of 106.57% for the compute autocorrelation function, a savings of 0m8.281 seconds.
On the machine with a Cortex-A53 (24 threads), I got a performance increase of 254.6% for the compute autocorrelation function, a savings of 1m40.166 seconds.
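The percentages and savings above can be related with a little arithmetic. The sketch below assumes, without confirmation from the PR, that a "performance increase of P%" means old_time / new_time = 1 + P/100 (an increase above 100% rules out reading it as a reduction in run time). The function names are hypothetical.

```c
/* Assumption: old_time / new_time = 1 + increase_pct / 100, and
 * savings_s = old_time - new_time. Solving for new_time gives
 * new_time = savings_s / (increase_pct / 100). */
static double new_time_seconds(double savings_s, double increase_pct)
{
    return savings_s / (increase_pct / 100.0);
}

/* The old time is the new time plus the reported savings. */
static double old_time_seconds(double savings_s, double increase_pct)
{
    return new_time_seconds(savings_s, increase_pct) + savings_s;
}
```

For example, new_time_seconds(8.281, 106.57) estimates the optimized time for the Cortex-A57 run under that assumption; if the author measured the increase differently, the estimate does not apply.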