Remove all assembler and intrinsics from decoder #347

ktmf01 · 2022-05-12T18:24:06Z

This commit drops all use of assembler and intrinsics from the libFLAC decoder. This is because they are only for 32-bit x86, hard to debug, maintain and fuzz properly, and because the decoder has much greater security risks than the encoder.

I've tested the impact on decoding speed of this change with clang 14, gcc 11.3 and MSVC 2022, on a Intel Kaby Lake-R processor with 16-bit and 24-bit input. For each compiler, 4 32-bit compiles were tested: one with asm optimizations (called asm), one without asm optimizations (but with -msse2 or /arch:sse2, called unrolled), one without asm optimizations and with loop unrolling of (called plain) and one with a switch-case statement (called switchcase).
16-bit input
24-bit input

These graphs are a bit hard to read, so here are my own findings:

Not supplying either pre-unrolled code or a switch-case statements tremendously hurts decoding speed for all compilers
Clang cannot properly optimize the switch-case statements
MSVC optimizes switch-case statement slightly better than a fully unrolled for 16-bit input
For both GCC and Clang disabling asm optimizations in favor or unrolled loops increases decoding speed by about 5% for 16-bit input
For MSVC disabling asm optimizations in favor of unrolled loops does not change much for 16-bit inputs
For GCC disabling asm optimizations in favor of unrolled loops decreases decoding speed by about 5% for 24-bit inputs
For Clang and MSVC disabling asm optimizations in favor of unrolled loops decreases decoding speed by about 10% for 24-bit inputs

TL;DR: this change generally slightly improves decoding speed with 16-bit input and slightly decreases decoding speed with 24-bit input

I'm currently also running tests on a AMD Jaguar CPU to compare the results

This commit drops all use of assembler and intrinsics from the libFLAC decoder. This is because they are only for 32-bit x86, hard to debug, maintain and fuzz properly, and because the decoder has much greater security risks than the encoder.

ktmf01 · 2022-05-13T12:45:49Z

Results on AMD jaguar:

jaguar-32bit-decoding16.pdf
jaguar-32bit-decoding24.pdf

Results seem reasonably close to the results on Kaby Lake-R

for 16-bit input, GCC is faster with this patch, Clang doesn't measurably change and MSVC is slightly slower
for 24-bit input, all three compilers show about 10% less performance on plain C code as opposed to assembly/intrinsics

ktmf01 merged commit febff86 into xiph:master May 26, 2022

ktmf01 mentioned this pull request Mar 8, 2023

Speed up bitreader #564

Merged

ktmf01 deleted the remove-decoder-asm branch May 12, 2023 19:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove all assembler and intrinsics from decoder #347

Remove all assembler and intrinsics from decoder #347

ktmf01 commented May 12, 2022

ktmf01 commented May 13, 2022

Remove all assembler and intrinsics from decoder #347

Remove all assembler and intrinsics from decoder #347

Conversation

ktmf01 commented May 12, 2022

ktmf01 commented May 13, 2022