Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove all assembler and intrinsics from decoder #347

Merged
merged 1 commit into from
May 26, 2022

Conversation

ktmf01
Copy link
Collaborator

@ktmf01 ktmf01 commented May 12, 2022

This commit drops all use of assembler and intrinsics from the libFLAC decoder. This is because they are only for 32-bit x86, hard to debug, maintain and fuzz properly, and because the decoder has much greater security risks than the encoder.

I've tested the impact on decoding speed of this change with clang 14, gcc 11.3 and MSVC 2022, on a Intel Kaby Lake-R processor with 16-bit and 24-bit input. For each compiler, 4 32-bit compiles were tested: one with asm optimizations (called asm), one without asm optimizations (but with -msse2 or /arch:sse2, called unrolled), one without asm optimizations and with loop unrolling of (called plain) and one with a switch-case statement (called switchcase).
16-bit input
24-bit input

These graphs are a bit hard to read, so here are my own findings:

  • Not supplying either pre-unrolled code or a switch-case statements tremendously hurts decoding speed for all compilers
  • Clang cannot properly optimize the switch-case statements
  • MSVC optimizes switch-case statement slightly better than a fully unrolled for 16-bit input
  • For both GCC and Clang disabling asm optimizations in favor or unrolled loops increases decoding speed by about 5% for 16-bit input
  • For MSVC disabling asm optimizations in favor of unrolled loops does not change much for 16-bit inputs
  • For GCC disabling asm optimizations in favor of unrolled loops decreases decoding speed by about 5% for 24-bit inputs
  • For Clang and MSVC disabling asm optimizations in favor of unrolled loops decreases decoding speed by about 10% for 24-bit inputs

TL;DR: this change generally slightly improves decoding speed with 16-bit input and slightly decreases decoding speed with 24-bit input

I'm currently also running tests on a AMD Jaguar CPU to compare the results

This commit drops all use of assembler and intrinsics from the libFLAC
decoder. This is because they are only for 32-bit x86, hard to debug,
maintain and fuzz properly, and because the decoder has much greater
security risks than the encoder.
@ktmf01
Copy link
Collaborator Author

ktmf01 commented May 13, 2022

Results on AMD jaguar:

jaguar-32bit-decoding16.pdf
jaguar-32bit-decoding24.pdf

Results seem reasonably close to the results on Kaby Lake-R

  • for 16-bit input, GCC is faster with this patch, Clang doesn't measurably change and MSVC is slightly slower
  • for 24-bit input, all three compilers show about 10% less performance on plain C code as opposed to assembly/intrinsics

@ktmf01 ktmf01 merged commit febff86 into xiph:master May 26, 2022
@ktmf01 ktmf01 mentioned this pull request Mar 8, 2023
@ktmf01 ktmf01 deleted the remove-decoder-asm branch May 12, 2023 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant