Remove all assembler #452
I just did some more checking with an Intel Atom N450, which has ISA extensions up to SSSE3. As with the AMD Sempron, and as expected, this made no difference for 16-bit inputs. For 24-bit inputs encoded with compression level 2, no measurable difference was seen. For -8 the slowdown was less than 1%. For -8p the slowdown was 2%. So the differences with this CPU are even smaller than with the Sempron.
More testing, this time with a Pentium 4 2.6GHz, Family 15, Model 2. As far as I can find out, that means it has the codename Northwood, is built on a 130nm process and supports MMX, SSE and SSE2. No changes for 16-bit inputs. For 24-bit inputs encoded with compression level 2, no measurable difference was seen. For -8 the slowdown was 3%. For -8p the slowdown was 5%.
Are your performance numbers ASM vs Note that even though I am someone who likes to keep old systems around and still tries to properly support them myself, and regardless of which numbers you did provide, I actually do agree with the decision to remove this assembler code from FLAC.
Actually, this has nothing to do with march tunings. FLAC is by default configured to add -msse2 to the command line (and the Visual Studio project files do something similar). I made two static builds without any Regarding support, this does not drop support for anything. You probably know, but I just wanted to write that down for anyone reading this. FLAC can be compiled for pretty much anything; I tested on a 486 some time ago without problems. It is just that FLAC is configured by default on x86 to add -msse2 to the command line (which is easy to undo), and this PR drops hardware-specific improvements. FLAC can still be compiled for a 486 (and probably even older CPUs), it will just not run as fast. I think that is fair, given how old these CPUs are.
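To illustrate the point about -msse2 for anyone reading along, here is a minimal sketch of how C code can take an SSE2 path when the compiler is allowed to use it and fall back to plain C otherwise. This is not libFLAC code: `residual_order1` and `built_with_sse2` are hypothetical names, and the order-1 difference is only a stand-in for a real encoder routine. GCC and Clang define `__SSE2__` when SSE2 code generation is enabled (for example via -msse2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Returns 1 if this translation unit was compiled with SSE2 enabled. */
int built_with_sse2(void)
{
#if defined(__SSE2__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper (not a libFLAC function):
 * res[0] = x[0], res[i] = x[i] - x[i-1] for i >= 1.
 * Assumes sample values small enough (e.g. 24-bit audio) that the
 * subtraction cannot overflow a 32-bit integer. */
void residual_order1(const int32_t *x, size_t n, int32_t *res)
{
    size_t i = 1;

    if (n == 0)
        return;
    assert(x != NULL && res != NULL);
    res[0] = x[0];

#if defined(__SSE2__)
    /* Four samples per iteration, using unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128i cur  = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i prev = _mm_loadu_si128((const __m128i *)(x + i - 1));
        _mm_storeu_si128((__m128i *)(res + i), _mm_sub_epi32(cur, prev));
    }
#endif

    /* Scalar tail; this is the whole loop when SSE2 is not enabled. */
    for (; i < n; i++)
        res[i] = x[i] - x[i - 1];
}
```

Both builds give identical output; only the speed differs, which is the trade-off being discussed here.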
I had originally missed that you have an SSE3 Sempron, so probably K8-based instead of Athlon-XP-based.
My point/question was only to gauge how much (if anything) of the performance loss of removing the x86 assembler implementation could potentially be re-gained by compiling explicitly for the specific CPU at hand. In any case, as said, I personally am fine with removing the x86 assembler implementation.
I've just tested on an AMD Phenom II X2 550. That is of a different generation than the Sempron Here's an overview of all tested systems
No measurable difference for 16-bit audio was found on any of the systems.
The numbers provided by @ktmf01 do not reflect the cost on the CPUs where the impact is expected to be most severe: those which do not support SSE2, and thus did not already make use of SSE2 for most algorithms anyway.
CPU: AMD Duron 1800 (AthlonXP core, MMX, MMXEXT, 3DNOW, 3DNOWEXT, SSE, no SSE2)
build settings:
test commands:
Reported time is user time, best of 3 runs.
So, on systems without SSE2, the impact on 16-bit input is actually far worse than the impact on 24-bit input. My hunch that some of the loss could be mitigated by tuning for the particular CPU used turned out not to be true. I guess modern GCC no longer tunes properly for old CPUs. I still agree with removing the x86 asm implementation for maintenance reasons.
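For reference, the "user time, best of 3 runs" method mentioned above can be sketched in C roughly as follows. This is only an in-process approximation under stated assumptions (a POSIX system with `getrusage`); it is not how the numbers in this thread were produced, which came from timing the flac command itself. `best_user_time` and `busy_work` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* User CPU time consumed so far by this process, in seconds. */
static double user_seconds(void)
{
    struct rusage ru;

    /* Not expected to fail with RUSAGE_SELF; report 0 if it does. */
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return 0.0;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
}

/* Runs work(arg) `runs` times and returns the smallest user time seen.
 * Returns a negative value if runs <= 0. */
double best_user_time(void (*work)(void *), void *arg, int runs)
{
    double best = -1.0;
    int r;

    assert(work != NULL);
    for (r = 0; r < runs; r++) {
        double t0 = user_seconds();
        double t;

        work(arg);
        t = user_seconds() - t0;
        if (best < 0.0 || t < best)
            best = t;
    }
    return best;
}

/* Placeholder workload; arg points to an unsigned long iteration count. */
void busy_work(void *arg)
{
    unsigned long iters = *(const unsigned long *)arg;
    volatile unsigned long acc = 0;
    unsigned long i;

    for (i = 0; i < iters; i++)
        acc += i;
}
```

Taking the minimum rather than the mean filters out runs disturbed by other activity on the machine, which matters on slow single-core systems like the ones tested here.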
Thanks for testing! I didn't have access to a pre-SSE2 CPU that I could run these tests on without too much trouble. I have a Pentium Pro around, but that has Windows 98 installed, on which FLAC doesn't work because of Unicode support. I think the performance hit for -8 is fair, 20% isn't too bad. Frankly, I expected way worse. -8p is IMO slow on modern CPUs already, so I wouldn't use it on CPUs that old anyway.
The OBJ_FORMAT stuff in configury can be removed too after this:
Well, I was wrong about this one because the visibility attribute checks later rely on OBJ_FORMAT for simplicity. That
libFLAC still contains quite a bit of assembler. However, I'm unable to properly maintain this and I'm not sure it is being covered by fuzzing. Also, it only benefits 32-bit machines that lack SSE4.1, which was introduced in 2006. The last Intel processors lacking SSE4.1 were the Intel Atom CPUs with codename Penwell, which were succeeded by Merrifield in 2014. The last AMD processors lacking SSE4.1 were the Bobcat series, which was succeeded by Jaguar in 2013.
I revived an old computer I had in storage with an AMD Sempron 3000+ CPU, which has ISA extensions up to SSE3, thus lacking SSSE3 and SSE4.1.
There was no measurable difference for 16-bit inputs.
For 24-bit inputs, preset 2 (only using fixed subframes) gave a speed increase of about 3%. Preset 8 shows a speed decrease of about 8%. Setting -8p gives a speed decrease of 12%. To me, these decreases seem an acceptable trade-off. Note that this only applies to encoding on 32-bit machines lacking SSE4.1, and does not apply to 16-bit inputs.
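To check whether a given machine and build fall into the affected category described above (32-bit x86 without SSE4.1), something like the following sketch could be used. It assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available; on other compilers or architectures the check reports "unknown". The function names are hypothetical and not part of libFLAC.

```c
#include <assert.h>

/* Returns 1 if this program was built as 32-bit x86, 0 otherwise. */
int is_32bit_x86_build(void)
{
#if defined(__i386__) || defined(_M_IX86)
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the running CPU reports SSE4.1, 0 if it does not,
 * and -1 if this build has no way of asking. */
int cpu_has_sse41(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") ? 1 : 0;
#else
    return -1;
#endif
}

/* Returns 1 only for a 32-bit x86 build running on a CPU known to lack
 * SSE4.1, i.e. the case that could see a slowdown from this change. */
int affected_by_asm_removal(void)
{
    int sse41 = cpu_has_sse41();

    assert(sse41 >= -1 && sse41 <= 1);
    return is_32bit_x86_build() && sse41 == 0;
}
```

On a 64-bit build this always returns 0, matching the statement that only 32-bit machines are affected.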
I probably won't merge this for a while. Feedback is welcome.