Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create SIMD-accelerated version of compute_gru function #191

wants to merge 12 commits into
base: master
Choose a base branch


Copy link

@Ameobea Ameobea commented Jul 19, 2021

I did some CPU profiling of pulseeffects / easyeffects which includes this library as a dependency for its noise reduction functionality. This led me to see that the compute_gru function was the hottest one in the whole application:

The changes in this pull request create a new function compute_gru_avx which has the same functionality as compute_gru but uses SIMD intrinsics to accelerate the function dramatically. The main changes focus on computing the sums in the GRU function 8 at a time and using FMAs to combine multiplications and adds into a single operation, increasing accuracy as a result. This also serves to reduce the overhead of loop counter checking which my profiling let me to believe was the most expensive part of the whole function before these changes. Additionally, converting the weights (which are stored as 8-bit signed integers) is done 8 at a time using SIMD for an additional speedup.

I also made some changes to the build configuration for compiler flags that use -O3 instead of -O2 which yielded some benefits on my machine as well as pasing -march=native which facilitates the SIMD used. If these CPU features aren't available, they will be disabled at build-time.

After the optimizations, the compute_gru_avx function uses only ~4.4% of the total CPU time compared to ~19.63% from before - a 4.45x speedup:

Here is a compiler explorer that shows the full assembly produced by the optimized compute_gru_avx function:

Testing done on my own machine using pulseeffects + built with the optimized code seems to work identically to before with reduced CPU usage for the application.

Let me know if you think this is something that you'd like to get merged into the project. I'm happy to make any changes necessary. There may be a better/different way you'd like to handle the CPU feature detection and and I'd love suggestions on how to handle that.

Copy link

@a-rose a-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work ! I have worked on an AVX implementation, so there are a few pitfalls I can help you avoid. I'm not a maintainer on this repo though, so I can't guarantee this will ever be accepeted.

src/rnn.c Outdated Show resolved Hide resolved
src/rnn.c Outdated Show resolved Hide resolved
src/rnn.c Outdated Show resolved Hide resolved
Copy link

Ameobea commented Jul 20, 2021

@a-rose thank you very much for the review. I'm glad you were aware of the fact that FMA support doesn't always exist when AVX2 support does.

Ameobea and others added 2 commits July 20, 2021 17:09
 * We already have the value in a register, so avoid spilling to stack and reading back
Copy link

a-rose commented Jul 21, 2021

Thanks for taking the time to look into it :) I have created a PR on your fork to improve SIMD detection: Ameobea#1

Copy link

Ameobea commented Jul 25, 2021

I have also opened a pull request on the Xiph-run Gitlab instance referenced in the README:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
None yet

Successfully merging this pull request may close these issues.

None yet

3 participants