Add benchmarks for floating point math #618
Conversation
Force-pushed from 4d58cdb to f1f9786
What is the reason for using global_asm? I would think that normal inline asm should be enough as long as you specify the constraints? Or is it because you want to accurately measure function call overhead as well?
Normal inline assembly seemed to save rax even when not needed (https://rust.godbolt.org/z/8vfnWE84s), so I just used global_asm to be closer to what LLVM generates. Am I just missing a constraint?
Ah,
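For reference, a minimal sketch of the two approaches being discussed, assuming x86_64; `asm_add_f64` and `addsd_inline` are hypothetical names, not functions from this PR:

```rust
// Approach 1: a standalone function defined with global_asm!, so the measured
// code is exactly these instructions plus normal call/return overhead.
#[cfg(target_arch = "x86_64")]
core::arch::global_asm!(
    ".global asm_add_f64",
    "asm_add_f64:",
    // Both the System V and Windows x64 ABIs pass the first two f64 arguments
    // in xmm0/xmm1 and return the result in xmm0.
    "addsd xmm0, xmm1",
    "ret",
);

#[cfg(target_arch = "x86_64")]
extern "C" {
    fn asm_add_f64(a: f64, b: f64) -> f64;
}

// Approach 2: inline asm! with explicit register-class constraints, so the
// compiler knows only the listed XMM registers are read or written.
#[cfg(target_arch = "x86_64")]
fn addsd_inline(a: f64, b: f64) -> f64 {
    let mut acc = a;
    unsafe {
        core::arch::asm!(
            "addsd {acc}, {rhs}",
            acc = inout(xmm_reg) acc,
            rhs = in(xmm_reg) b,
            options(pure, nomem, nostack),
        );
    }
    acc
}
```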
Force-pushed from eea497f to 3cee637
Everything is passing now, so I think this should be good. Here is the output of a full run on a Ryzen 5900X: `cargo bench` output
Good to know that we are pretty close to the system implementations for almost everything. Interesting that our f128 add/sub/mul routines appear to be 2x as fast; I wonder why.
Some assorted notes after looking at the graphs, just for reference:
This adds comparisons among the compiler-builtins functions, system functions if available, and optionally handwritten assembly. These also help us identify inconsistencies between this crate and system functions, which may otherwise go unnoticed if intrinsics get lowered to inline operations rather than library calls.
On most CPUs, division instructions have variable latency depending on the input values.
Enough to be slower than soft float on average? That seems surprising. Thanks for the review.
The hardware division instruction blocks the pipeline, while the soft-float implementation is fully pipelined. This means that with a sufficiently wide CPU, the execution units could be running instructions from different iterations of the benchmark loop concurrently. On the other hand, CPUs tend to have only one division unit, so no such parallelism is possible for the hardware instruction.
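A small illustration of that point (hypothetical loops, not the PR's benchmark code): whether divisions can overlap depends on the data dependencies between iterations.

```rust
// Latency-bound: each division depends on the previous result, so neither
// the hardware divider nor a soft-float routine can overlap iterations.
fn dependent_divs(mut x: f64, d: f64, iters: u32) -> f64 {
    for _ in 0..iters {
        x /= d;
    }
    x
}

// Throughput-bound: the divisions are independent, so a wide out-of-order
// core can keep several soft-float division sequences in flight at once,
// while a single hardware divide unit still handles one division at a time.
fn independent_divs(xs: &[f64], d: f64) -> f64 {
    xs.iter().map(|x| x / d).sum()
}
```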
This adds comparisons among the compiler-builtins functions, system functions if available, and optionally handwritten assembly.
These also serve as some additional testing, since we check our functions against assembly operations.
Run with `cargo bench --features benchmarking-reports` to get the HTML charts.
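A minimal sketch of what one such comparison could look like, assuming a Criterion harness (suggested by the HTML reports); `builtin_add`, `system_add`, and `asm_add` are placeholder names, not this PR's actual functions:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Placeholder implementations standing in for the three variants that are
// compared per operation and float width in the real benchmarks.
fn builtin_add(a: f64, b: f64) -> f64 { a + b }
fn system_add(a: f64, b: f64) -> f64 { a + b }
fn asm_add(a: f64, b: f64) -> f64 { a + b }

fn bench_add_f64(c: &mut Criterion) {
    let mut group = c.benchmark_group("add_f64");
    let (a, b) = (1.2345_f64, 6.789_f64);

    // black_box keeps the inputs opaque so the additions are not constant-folded.
    group.bench_function("compiler-builtins", |bench| {
        bench.iter(|| builtin_add(black_box(a), black_box(b)))
    });
    group.bench_function("system", |bench| {
        bench.iter(|| system_add(black_box(a), black_box(b)))
    });
    group.bench_function("assembly", |bench| {
        bench.iter(|| asm_add(black_box(a), black_box(b)))
    });
    group.finish();
}

criterion_group!(benches, bench_add_f64);
criterion_main!(benches);
```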