Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove all assembler #452

Merged
merged 2 commits into from Oct 15, 2022
Merged

Remove all assembler #452

merged 2 commits into from Oct 15, 2022

Conversation

ktmf01
Copy link
Collaborator

@ktmf01 ktmf01 commented Sep 15, 2022

libFLAC still contains quite a bit of assembler. However, I'm unable to properly maintain this and I'm not sure it is being covered by fuzzing. Also, it only benefits 32-bit machines that lack SSE4.1, which was introduced in 2006. The last Intel processor lacking SSE4.1 were the Intel Atom CPUs with codename Penwell, which were succeeded by Merrifield in 2014. The last AMD processor lacking SSE4.1 was the Bobcat series, which was succeeded by Jaguar in 2013.

I revived an old computer I had in storage with an AMD Sempron 3000+ CPU, which has ISA extensions up to SSE3, thus lacking SSSE3 and SSE4.1.

There was no measurable difference for 16-bit inputs.

For 24-bit inputs, preset 2 (only using fixed subframes) gave a speed increase of about 3%. Preset 8 shows a speed decrease of about 8% Setting -8p gives a speed decrease of 12%. To me, these decreases seem an acceptable trade-off. Note that this only applies to encoding on 32-bit machines lacking SSE 4.1, and does not apply to 16-bit inputs.

I probably won't merge this for a while. Feedback is welcome.

@ktmf01
Copy link
Collaborator Author

ktmf01 commented Sep 16, 2022

I just did some more checking with an Intel Atom N450, which has ISA extensions up to SSSE3.

Similarly to the AMD Sempron and as was expected, this made no change for 16-bit inputs.

For 24-bit inputs encoded with compression level 2 no measurable difference was seen. For -8 the slowdown was less than 1%. For -8p the slowdown was 2%.

So, differences with this CPU are even smaller than with the Sempron.

@ktmf01
Copy link
Collaborator Author

ktmf01 commented Sep 22, 2022

More testing, this time with a Pentium 4 2.6GHz, Family 15, Model 2. As far as I can find out, that means it has codename Northwood, has a 130nm lithography and supports MMX, SSE and SSE2.

No changes for 16-bit inputs.

For 24-bit inputs encoded with compression level 2 no measurable difference was seen. For -8 the slowdown was 3%. For -8p the slowdown was 5%.

@manxorist
Copy link
Contributor

Are your performance numbers ASM vs -march=i686 (or anything similarly generic) or ASM vs -march=pentium4/bonnell/athlon-xp for the particular CPU? Given that such old CPUs are nowadays kind of a special case, generating optimized code for exactly the CPU one wants to target may be somewhat more manageable than it was when they were current.

Note that even though I am someone who likes to keep old systems around and still tries to properly support them myself, and regardless of which numbers you did provide, I actually do agree with the decision to remove this assembler code form FLAC.
I personally do not have any use case to care about FLAC encode performance on such systems though, so my opinion probably does not weigh much.

@ktmf01
Copy link
Collaborator Author

ktmf01 commented Sep 23, 2022

Actually, this has nothing to do with march tunings. FLAC is by default configured to add -msse2 to the command line (and similar to Visual Studio project files). I made two static builds without any -march or -mtune, one with and one without this patch. So, these performance number are HEAD vs HEAD^, nothing else really.

Regarding support, this does not drop support for anything. You probably know, but I just wanted to write that down for anyone reading this. FLAC can pretty much be compiled for anything, I've done testing on a 486 some time ago, no problem. It is just that FLAC is configured by default on x86 to add -msse2 to the command line (but this is easy to undo) and this PR drops hardware-specific improvements. FLAC can still be compiled for a 486 (and probably even older), it will just not run as fast. I think that is fair, given how old these CPUs are.

@manxorist
Copy link
Contributor

Actually, this has nothing to do with march tunings. FLAC is by default configured to add -msse2 to the command line (and similar to Visual Studio project files).

I had originally missed that you have an SSE3 Sempron, so probably K8-based instead of Athlon-XP-based.
-msse2 would not work on an Athlon-XP era Sempron, as they only have SSE1. Note that there were a lot of CPUs called "Sempron 3000+" with very different CPU cores and features sets (https://en.wikipedia.org/wiki/List_of_AMD_Sempron_processors).

I made two static builds without any -march or -mtune, one with and one without this patch. So, these performance number are HEAD vs HEAD^, nothing else really.

My point/question was only to gauge how much (if anything) of the performance loss of removing the x86 assembler implementation could potentially be re-gained by compiling explicitly for the specific CPU at hand.

In any case, as said, I personally am fine with removing the x86 assembler implementation.

@ktmf01
Copy link
Collaborator Author

ktmf01 commented Oct 6, 2022

I've just tested on an AMD Phenom II X2 550. That is of a different generation than the Sempron

Here's an overview of all tested systems

CPU SSE 24-bit audio -2 24-bit audio -8 24-bit audio -8p
AMD Sempron 3000+ SSE3 +3% -8% -12%
AMD Phenom II X2 550 SSSE3 + SSE4a 0% -1% -2%
Intel Pentium 4 2.6GHz, Family 15, Model 2 SSE2 0% -3% -5%
Intel Atom N450 SSSE3 0% -1% -2%

On none of the systems a measureable difference for 16-bit audio was found.

@sezero sezero mentioned this pull request Oct 8, 2022
@ktmf01 ktmf01 linked an issue Oct 10, 2022 that may be closed by this pull request
@ktmf01 ktmf01 merged commit 75ef795 into xiph:master Oct 15, 2022
@manxorist
Copy link
Contributor

The numbers provided by @ktmf01 do not reflect the cost on CPUs where the impact is to be expected the most severe: Those which do not support SSE2, and thus did not already make use of SSE2 for most algorithms anyway:

CPU: AMD Duron 1800 (AthlonXP core, MMX, MMXEXT, 3DNOW, 3DNOWEXT, SSE, no SSE2)
Compiler: GCC 12.2.0

build settings:

  • 1.4.0: ./configure --disable-sse --enable-static --disable-shared
  • remove-asm: ./configure --disable-sse --enable-static --disable-shared
  • remove-asm-opt: CFLAGS="-march=native -mtune=native -O3 -funroll-loops" CXXFLAGS="-march=native -mtune=native -O3 -funroll-loops" ./configure --disable-sse --enable-static --disable-shared

test commands:

  • 16bit -2: time ./src/flac/flac -c -2 /tmp/16.wav > /dev/null
  • 16bit -8: time ./src/flac/flac -c -8 /tmp/16.wav > /dev/null
  • 16bit -8p: time ./src/flac/flac -c -8p /tmp/16.wav > /dev/null
  • 24bit -2: time ./src/flac/flac -c -2 /tmp/24.wav > /dev/null
  • 24bit -8: time ./src/flac/flac -c -8 /tmp/24.wav > /dev/null

Reported time is user time, best of 3 runs.

version 16bit -2 16bit -8 16bit -8p 24bit -2 24bit -8
1.4.0 1.252s 10.483s 47.617s 2.000s 15.302s
remove-asm 1.395s (-11%) 12.424s (-19%) 66.886s (-40%) 1.964s (+2%) 16.742s (-9%)
remove-asm-opt 1.447s (-16%) 12.536s (-20%) 67.768s (-42%) 2.001s (-0%) 16.911s (-11%)

So, on systems without SSE2, the impact on 16bit input is actually way worse than the impact on 24bit input.

My hunch that some of the loss could be mitigated by tuning for the particular CPU used did turn out to not be true. I guess modern GCC does not tune properly for old CPUs any more.

I still agree with removing the x86 asm implementation because of maintenance reasons.

@ktmf01
Copy link
Collaborator Author

ktmf01 commented Oct 17, 2022

Thanks for testing! I didn't have access to a pre-SSE2 CPU that I could run these tests on without too much trouble. I have a Pentium Pro around but that has Windows 98 installed, on which FLAC doesn't work because of unicode support.

I think the performance hit for -8 is fair, 20% isn't too bad. Frankly I expected way worse. -8p is IMO slow on modern CPUs already, so I wouldn't use it on CPUs that old anyway.

@sezero
Copy link
Contributor

sezero commented Oct 17, 2022

The OBJ_FORMAT stuff in configury can be removed too after this:
https://github.com/xiph/flac/blob/master/configure.ac#L225-L234

@sezero
Copy link
Contributor

sezero commented Oct 17, 2022

The OBJ_FORMAT stuff in configury can be removed too after this: https://github.com/xiph/flac/blob/master/configure.ac#L225-L234

Well I was wrong about this one because the visibility attribute checks later rely on OBJ_FORMAT for simplicity. That AC_SUBST(OBJ_FORMAT) can certainly be removed though.

@ktmf01 ktmf01 deleted the remove-asm branch May 12, 2023 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

android build issues
3 participants