Add benchmarks for floating point math #618

Merged
merged 1 commit into from
May 24, 2024

Conversation

tgross35
Contributor

@tgross35 tgross35 commented May 20, 2024

This adds comparisons among the compiler-builtins function, system functions if available, and optionally handwritten assembly.

These also serve as some additional testing since we check our functions against assembly operations.

Run with `cargo bench --features benchmarking-reports` to get the HTML charts.
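The "additional testing" idea above can be sketched roughly as follows. This is not the harness from this PR; `first_mismatch` is a hypothetical helper name, and the NaN handling is an assumption about what a reasonable comparison would tolerate.

```rust
/// Hypothetical helper: run a function under test and a reference
/// implementation over the same inputs and return the first input pair
/// (if any) on which their results disagree.
fn first_mismatch(
    inputs: &[(f32, f32)],
    under_test: impl Fn(f32, f32) -> f32,
    reference: impl Fn(f32, f32) -> f32,
) -> Option<(f32, f32)> {
    inputs.iter().copied().find(|&(a, b)| {
        let x = under_test(a, b);
        let y = reference(a, b);
        // Compare bit patterns so that +0.0 and -0.0 are distinguished.
        // NaN payloads may differ between implementations, so any two
        // NaNs are treated as equal (an assumption of this sketch).
        !(x.to_bits() == y.to_bits() || (x.is_nan() && y.is_nan()))
    })
}
```

In a benchmark setup, `under_test` would be the compiler-builtins routine and `reference` the system or assembly version, checked once before timing starts.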

@tgross35 tgross35 force-pushed the benchmarking branch 8 times, most recently from 4d58cdb to f1f9786 Compare May 20, 2024 22:56
@Amanieu
Member

Amanieu commented May 21, 2024

What is the reason for using global_asm? I would think that normal inline asm should be enough as long as you specify the constraints? Or is it because you want to accurately measure function call overhead as well?

@tgross35
Contributor Author

Normal inline assembly seemed to save rax even when not needed (https://rust.godbolt.org/z/8vfnWE84s), so I just used global_asm to be closer to what LLVM generates. Am I just missing a constraint?

@tgross35
Contributor Author

tgross35 commented May 21, 2024

Ah, options(nomem, nostack) seems to eliminate those extra instructions. I'll change them.
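For reference, a minimal sketch of what an inline-asm baseline with these options might look like on x86_64. The function name `add_f32_asm` and the exact operand layout are assumptions for illustration, not the code from this PR; `pure` is added here on the assumption that the result depends only on the inputs.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Hypothetical baseline: add two f32 values with a single `addss`.
#[cfg(target_arch = "x86_64")]
fn add_f32_asm(a: f32, b: f32) -> f32 {
    let mut out = a;
    // SAFETY: `addss` only touches the two named SSE registers.
    unsafe {
        asm!(
            "addss {a}, {b}",
            a = inout(xmm_reg) out,
            b = in(xmm_reg) b,
            // nomem: no memory is read or written.
            // nostack: the stack is not touched, so no frame setup is needed.
            // pure: no side effects, output depends only on the inputs.
            options(pure, nomem, nostack),
        );
    }
    out
}

/// Fallback so the sketch still compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
fn add_f32_asm(a: f32, b: f32) -> f32 {
    a + b
}
```

Without `nomem` and `nostack`, the compiler has to assume the block may touch memory or the stack, which is a plausible source of the extra instructions seen in the Godbolt output.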

@tgross35 tgross35 force-pushed the benchmarking branch 6 times, most recently from eea497f to 3cee637 Compare May 23, 2024 08:57
@tgross35
Contributor Author

Everything is passing now, so I think this should be good. Here is the output of a full run on a Ryzen 5900X:

`cargo bench` output
add_f32/compiler-builtins
                        time:   [45.859 µs 45.931 µs 46.008 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
add_f32/system          time:   [43.582 µs 43.668 µs 43.771 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
add_f32/assembly (x86_64 unix)
                        time:   [10.701 µs 10.716 µs 10.731 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

add_f64/compiler-builtins
                        time:   [46.210 µs 46.277 µs 46.351 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
add_f64/system          time:   [47.215 µs 47.291 µs 47.370 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
add_f64/assembly (x86_64 unix)
                        time:   [12.821 µs 12.839 µs 12.857 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

add_f128/compiler-builtins
                        time:   [85.654 µs 85.787 µs 85.918 µs]
add_f128/system         time:   [228.41 µs 228.80 µs 229.20 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

cmp_f32_gt/compiler-builtins
                        time:   [23.489 µs 23.527 µs 23.564 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
cmp_f32_gt/system       time:   [18.432 µs 18.453 µs 18.476 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
cmp_f32_gt/assembly (x86_64 unix)
                        time:   [10.769 µs 10.782 µs 10.794 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

cmp_f32_unord/compiler-builtins
                        time:   [15.071 µs 15.089 µs 15.107 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
cmp_f32_unord/system    time:   [15.101 µs 15.121 µs 15.141 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
cmp_f32_unord/assembly (x86_64 unix)
                        time:   [12.849 µs 12.866 µs 12.883 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

cmp_f64_gt/compiler-builtins
                        time:   [21.825 µs 21.922 µs 22.087 µs]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
cmp_f64_gt/system       time:   [20.405 µs 20.439 µs 20.477 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
cmp_f64_gt/assembly (x86_64 unix)
                        time:   [10.724 µs 10.736 µs 10.749 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

cmp_f64_unord/compiler-builtins
                        time:   [15.126 µs 15.156 µs 15.199 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
cmp_f64_unord/system    time:   [13.036 µs 13.053 µs 13.071 µs]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
cmp_f64_unord/assembly (x86_64 unix)
                        time:   [12.846 µs 12.864 µs 12.882 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

cmp_f128_gt/compiler-builtins
                        time:   [28.313 µs 28.347 µs 28.381 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
cmp_f128_gt/system      time:   [154.94 µs 155.14 µs 155.35 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

cmp_f128_unord/compiler-builtins
                        time:   [19.992 µs 20.031 µs 20.070 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
cmp_f128_unord/system   time:   [149.79 µs 150.04 µs 150.31 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

conv_u32_f32/compiler-builtins
                        time:   [1.2045 µs 1.2069 µs 1.2093 µs]
conv_u32_f32/system     time:   [1.1904 µs 1.1927 µs 1.1950 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high severe
conv_u32_f32/assembly (x86_64 unix)
                        time:   [733.35 ns 734.44 ns 735.60 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

conv_u32_f64/compiler-builtins
                        time:   [971.62 ns 973.20 ns 974.87 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
conv_u32_f64/system     time:   [1.0867 µs 1.0887 µs 1.0907 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
conv_u32_f64/assembly (x86_64 unix)
                        time:   [728.52 ns 729.41 ns 730.37 ns]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

conv_u64_f32/compiler-builtins
                        time:   [1.5048 µs 1.5090 µs 1.5158 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
conv_u64_f32/system     time:   [1.5019 µs 1.5049 µs 1.5084 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

conv_u64_f64/compiler-builtins
                        time:   [1.5474 µs 1.5496 µs 1.5519 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
conv_u64_f64/system     time:   [1.5490 µs 1.5520 µs 1.5554 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

conv_u128_f32/compiler-builtins
                        time:   [2.8557 µs 2.8599 µs 2.8646 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
conv_u128_f32/system    time:   [2.7020 µs 2.7088 µs 2.7171 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

conv_u128_f64/compiler-builtins
                        time:   [2.9879 µs 2.9936 µs 2.9999 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
conv_u128_f64/system    time:   [2.8498 µs 2.8555 µs 2.8615 µs]

conv_i32_f32/compiler-builtins
                        time:   [1.4327 µs 1.4353 µs 1.4379 µs]
conv_i32_f32/system     time:   [1.3060 µs 1.3086 µs 1.3112 µs]
conv_i32_f32/assembly (x86_64 unix)
                        time:   [612.79 ns 613.87 ns 614.98 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

conv_i32_f64/compiler-builtins
                        time:   [1.3251 µs 1.3281 µs 1.3315 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
conv_i32_f64/system     time:   [1.0827 µs 1.0853 µs 1.0881 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
conv_i32_f64/assembly (x86_64 unix)
                        time:   [612.04 ns 613.06 ns 614.09 ns]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

conv_i64_f32/compiler-builtins
                        time:   [1.7645 µs 1.7683 µs 1.7725 µs]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
conv_i64_f32/system     time:   [1.6442 µs 1.6468 µs 1.6497 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
conv_i64_f32/assembly (x86_64 unix)
                        time:   [805.54 ns 807.70 ns 810.15 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

conv_i64_f64/compiler-builtins
                        time:   [1.6287 µs 1.6321 µs 1.6356 µs]
conv_i64_f64/system     time:   [1.4991 µs 1.5013 µs 1.5037 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
conv_i64_f64/assembly (x86_64 unix)
                        time:   [673.76 ns 674.93 ns 676.18 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

conv_i128_f32/compiler-builtins
                        time:   [3.5739 µs 3.5804 µs 3.5870 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
conv_i128_f32/system    time:   [3.6621 µs 3.6694 µs 3.6773 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild

conv_i128_f64/compiler-builtins
                        time:   [3.6649 µs 3.6708 µs 3.6774 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
conv_i128_f64/system    time:   [3.5847 µs 3.5906 µs 3.5966 µs]

conv_f64_u32/compiler-builtins
                        time:   [966.48 ns 967.94 ns 969.43 ns]
conv_f64_u32/system     time:   [933.81 ns 934.81 ns 935.83 ns]

conv_f64_u64/compiler-builtins
                        time:   [1.0334 µs 1.0352 µs 1.0368 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
conv_f64_u64/system     time:   [1.0020 µs 1.0034 µs 1.0047 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

conv_f64_u128/compiler-builtins
                        time:   [1.0873 µs 1.0888 µs 1.0904 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
conv_f64_u128/system    time:   [968.30 ns 969.63 ns 971.09 ns]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

conv_f64_i32/compiler-builtins
                        time:   [1.3794 µs 1.3813 µs 1.3833 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
conv_f64_i32/system     time:   [1.2685 µs 1.2707 µs 1.2731 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

conv_f64_i64/compiler-builtins
                        time:   [1.3832 µs 1.3860 µs 1.3892 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
conv_f64_i64/system     time:   [1.0905 µs 1.0922 µs 1.0940 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

conv_f64_i128/compiler-builtins
                        time:   [1.2663 µs 1.2681 µs 1.2699 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
conv_f64_i128/system    time:   [1.1310 µs 1.1337 µs 1.1374 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

conv_f32_u32/compiler-builtins
                        time:   [1.1730 µs 1.1742 µs 1.1755 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
conv_f32_u32/system     time:   [842.07 ns 843.04 ns 844.02 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

conv_f32_u64/compiler-builtins
                        time:   [1.1232 µs 1.1264 µs 1.1314 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
conv_f32_u64/system     time:   [1.0487 µs 1.0539 µs 1.0607 µs]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

conv_f32_u128/compiler-builtins
                        time:   [1.3045 µs 1.3074 µs 1.3105 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
conv_f32_u128/system    time:   [871.99 ns 872.76 ns 873.56 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

conv_f32_i32/compiler-builtins
                        time:   [1.1105 µs 1.1120 µs 1.1135 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
conv_f32_i32/system     time:   [1.1064 µs 1.1079 µs 1.1096 µs]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

conv_f32_i64/compiler-builtins
                        time:   [1.2249 µs 1.2264 µs 1.2280 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
conv_f32_i64/system     time:   [1.1197 µs 1.1211 µs 1.1227 µs]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

conv_f32_i128/compiler-builtins
                        time:   [1.2571 µs 1.2588 µs 1.2606 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
conv_f32_i128/system    time:   [973.33 ns 974.45 ns 975.64 ns]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

div_f32/compiler-builtins
                        time:   [55.069 µs 55.176 µs 55.310 µs]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
div_f32/system          time:   [56.354 µs 56.438 µs 56.522 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
div_f32/assembly (x86_64 unix)
                        time:   [15.879 µs 15.902 µs 15.925 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

div_f64/compiler-builtins
                        time:   [68.537 µs 68.645 µs 68.764 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
div_f64/system          time:   [69.545 µs 69.673 µs 69.810 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
div_f64/assembly (x86_64 unix)
                        time:   [18.197 µs 18.225 µs 18.251 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

extend_f16_f32/compiler-builtins
                        time:   [3.6086 µs 3.6147 µs 3.6210 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
extend_f16_f32/system   time:   [1.0113 µs 1.0126 µs 1.0138 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

extend_f16_f128/compiler-builtins
                        time:   [5.5726 µs 5.5803 µs 5.5886 µs]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

extend_f32_f64/compiler-builtins
                        time:   [1.2409 µs 1.2434 µs 1.2459 µs]
extend_f32_f64/system   time:   [985.84 ns 987.18 ns 988.57 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

extend_f32_f128/compiler-builtins
                        time:   [3.2178 µs 3.2237 µs 3.2295 µs]
extend_f32_f128/system  time:   [6.5044 µs 6.5150 µs 6.5273 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

extend_f64_f128/compiler-builtins
                        time:   [2.9025 µs 2.9071 µs 2.9120 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
extend_f64_f128/system  time:   [6.6715 µs 6.6821 µs 6.6932 µs]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

mul_f32/compiler-builtins
                        time:   [36.301 µs 36.354 µs 36.415 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
mul_f32/system          time:   [36.700 µs 36.743 µs 36.790 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
mul_f32/assembly (x86_64 unix)
                        time:   [15.002 µs 15.205 µs 15.459 µs]

mul_f64/compiler-builtins
                        time:   [38.250 µs 38.322 µs 38.401 µs]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
mul_f64/system          time:   [40.807 µs 40.869 µs 40.937 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
mul_f64/assembly (x86_64 unix)
                        time:   [14.896 µs 15.211 µs 15.517 µs]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

mul_f128/compiler-builtins
                        time:   [135.89 µs 136.10 µs 136.31 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
mul_f128/system         time:   [346.06 µs 346.71 µs 347.39 µs]

powi_f32/compiler-builtins
                        time:   [204.51 µs 204.93 µs 205.37 µs]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
powi_f32/system         time:   [203.78 µs 204.24 µs 204.72 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

powi_f64/compiler-builtins
                        time:   [208.40 µs 209.05 µs 209.71 µs]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  1 (1.00%) high severe
powi_f64/system         time:   [211.65 µs 212.10 µs 212.54 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

sub_f32/compiler-builtins
                        time:   [44.472 µs 44.535 µs 44.597 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
sub_f32/system          time:   [45.334 µs 45.413 µs 45.496 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
sub_f32/assembly (x86_64 unix)
                        time:   [12.774 µs 12.793 µs 12.813 µs]

sub_f64/compiler-builtins
                        time:   [47.760 µs 47.821 µs 47.884 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
sub_f64/system          time:   [47.426 µs 47.499 µs 47.572 µs]
sub_f64/assembly (x86_64 unix)
                        time:   [12.788 µs 12.817 µs 12.856 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

sub_f128/compiler-builtins
                        time:   [86.759 µs 86.897 µs 87.058 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
sub_f128/system         time:   [229.70 µs 230.08 µs 230.49 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

trunc_f32_f16/compiler-builtins
                        time:   [1.4415 µs 1.4440 µs 1.4465 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
trunc_f32_f16/system    time:   [1.4037 µs 1.4061 µs 1.4087 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

trunc_f64_f16/compiler-builtins
                        time:   [1.4715 µs 1.4740 µs 1.4765 µs]
trunc_f64_f16/system    time:   [1.4664 µs 1.4689 µs 1.4715 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

trunc_f64_f32/compiler-builtins
                        time:   [1.5099 µs 1.5125 µs 1.5154 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
trunc_f64_f32/system    time:   [1.3014 µs 1.3049 µs 1.3102 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
trunc_f64_f32/assembly (x86_64 unix)
                        time:   [632.63 ns 633.95 ns 635.34 ns]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

trunc_f128_f16/compiler-builtins
                        time:   [1.6181 µs 1.6204 µs 1.6230 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

trunc_f128_f32/compiler-builtins
                        time:   [1.7718 µs 1.7747 µs 1.7779 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
trunc_f128_f32/system   time:   [27.131 µs 27.169 µs 27.211 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

trunc_f128_f64/compiler-builtins
                        time:   [1.5871 µs 1.5890 µs 1.5909 µs]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
trunc_f128_f64/system   time:   [26.529 µs 26.578 µs 26.630 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild


running 52 tests
test memcmp_builtin_1048576           ... bench:      16,811 ns/iter (+/- 1,132) = 62374 MB/s
test memcmp_builtin_16                ... bench:           2 ns/iter (+/- 0) = 8000 MB/s
test memcmp_builtin_32                ... bench:           2 ns/iter (+/- 0) = 16000 MB/s
test memcmp_builtin_4096              ... bench:          30 ns/iter (+/- 0) = 136533 MB/s
test memcmp_builtin_64                ... bench:           2 ns/iter (+/- 0) = 32000 MB/s
test memcmp_builtin_8                 ... bench:           2 ns/iter (+/- 0) = 4000 MB/s
test memcmp_builtin_unaligned_1048575 ... bench:      17,306 ns/iter (+/- 426) = 60590 MB/s
test memcmp_builtin_unaligned_15      ... bench:           2 ns/iter (+/- 0) = 8000 MB/s
test memcmp_builtin_unaligned_31      ... bench:           2 ns/iter (+/- 0) = 16000 MB/s
test memcmp_builtin_unaligned_4095    ... bench:          37 ns/iter (+/- 1) = 110702 MB/s
test memcmp_builtin_unaligned_63      ... bench:           2 ns/iter (+/- 0) = 32000 MB/s
test memcmp_builtin_unaligned_7       ... bench:           2 ns/iter (+/- 0) = 4000 MB/s
test memcmp_rust_1048576              ... bench:      27,765 ns/iter (+/- 747) = 37766 MB/s
test memcmp_rust_16                   ... bench:           2 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_32                   ... bench:           3 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_4096                 ... bench:         114 ns/iter (+/- 3) = 35929 MB/s
test memcmp_rust_64                   ... bench:           3 ns/iter (+/- 0) = 21333 MB/s
test memcmp_rust_8                    ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      27,738 ns/iter (+/- 983) = 37802 MB/s
test memcmp_rust_unaligned_15         ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_unaligned_31         ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_unaligned_4095       ... bench:         112 ns/iter (+/- 2) = 36571 MB/s
test memcmp_rust_unaligned_63         ... bench:           4 ns/iter (+/- 0) = 16000 MB/s
test memcmp_rust_unaligned_7          ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcpy_builtin_1048576           ... bench:      17,696 ns/iter (+/- 766) = 59254 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      18,245 ns/iter (+/- 562) = 57471 MB/s
test memcpy_builtin_1048576_offset    ... bench:      18,075 ns/iter (+/- 371) = 58012 MB/s
test memcpy_builtin_4096              ... bench:          34 ns/iter (+/- 1) = 120470 MB/s
test memcpy_builtin_4096_misalign     ... bench:          46 ns/iter (+/- 0) = 89043 MB/s
test memcpy_builtin_4096_offset       ... bench:          46 ns/iter (+/- 1) = 89043 MB/s
test memcpy_rust_1048576              ... bench:     241,145 ns/iter (+/- 29,893) = 4348 MB/s
test memcpy_rust_1048576_misalign     ... bench:      18,112 ns/iter (+/- 1,545) = 57893 MB/s
test memcpy_rust_1048576_offset       ... bench:      17,993 ns/iter (+/- 349) = 58276 MB/s
test memcpy_rust_4096                 ... bench:          44 ns/iter (+/- 1) = 93090 MB/s
test memcpy_rust_4096_misalign        ... bench:          59 ns/iter (+/- 1) = 69423 MB/s
test memcpy_rust_4096_offset          ... bench:          59 ns/iter (+/- 0) = 69423 MB/s
test memmove_builtin_1048576          ... bench:      17,453 ns/iter (+/- 477) = 60079 MB/s
test memmove_builtin_1048576_misalign ... bench:      17,493 ns/iter (+/- 506) = 59942 MB/s
test memmove_builtin_4096             ... bench:          27 ns/iter (+/- 0) = 151703 MB/s
test memmove_builtin_4096_misalign    ... bench:          30 ns/iter (+/- 1) = 136533 MB/s
test memmove_rust_1048576             ... bench:      35,803 ns/iter (+/- 1,682) = 29287 MB/s
test memmove_rust_1048576_misalign    ... bench:      35,578 ns/iter (+/- 2,772) = 29472 MB/s
test memmove_rust_4096                ... bench:         153 ns/iter (+/- 3) = 26771 MB/s
test memmove_rust_4096_misalign       ... bench:         177 ns/iter (+/- 4) = 23141 MB/s
test memset_builtin_1048576           ... bench:      62,855 ns/iter (+/- 1,751) = 16682 MB/s
test memset_builtin_1048576_offset    ... bench:      62,675 ns/iter (+/- 2,074) = 16730 MB/s
test memset_builtin_4096              ... bench:          40 ns/iter (+/- 1) = 102400 MB/s
test memset_builtin_4096_offset       ... bench:          49 ns/iter (+/- 1) = 83591 MB/s
test memset_rust_1048576              ... bench:      62,896 ns/iter (+/- 1,459) = 16671 MB/s
test memset_rust_1048576_offset       ... bench:      10,490 ns/iter (+/- 373) = 99959 MB/s
test memset_rust_4096                 ... bench:          48 ns/iter (+/- 0) = 85333 MB/s
test memset_rust_4096_offset          ... bench:          50 ns/iter (+/- 1) = 81920 MB/s
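The MB/s column in the memory benchmarks above is consistent with a simple integer calculation from the byte count and the ns/iter figure, with 1 MB taken as 10^6 bytes. A small sketch of that arithmetic follows; `mb_per_s` is a hypothetical name, and the byte counts are read off the benchmark names.

```rust
/// Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` processed in
/// `ns_per_iter` nanoseconds, using integer division.
fn mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    // bytes per nanosecond equals GB/s, so scale by 1000 to get MB/s.
    // A reading of 0 ns is clamped to 1 to avoid dividing by zero.
    bytes * 1000 / ns_per_iter.max(1)
}
```

For example, `mb_per_s(1_048_576, 16_811)` reproduces the figure on the `memcmp_builtin_1048576` line, and `mb_per_s(4096, 30)` the one on `memcmp_builtin_4096`.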

Good to know that we are pretty close to the system implementations for almost everything. Interesting that our f128 add/sub/mul routines appear to be more than 2x as fast (roughly 2.5x to 2.7x in this run); I wonder why.

@tgross35 tgross35 marked this pull request as ready for review May 23, 2024 09:26
@tgross35
Contributor Author

tgross35 commented May 23, 2024

Some assorted notes after looking at the graphs, just for reference:

  • most operations are close to or faster than the system implementations. Everything involving f128 is significantly faster, for whatever reason

  • float<->int conversions are slower

  • f16->f32 is significantly slower

  • powi is slower

  • I don't know what is going on with div assembly tests


This adds comparisons among the compiler-builtins function, system
functions if available, and optionally handwritten assembly.

These also help us identify inconsistencies between this crate and
system functions, which may otherwise go unnoticed if intrinsics get
lowered to inline operations rather than library calls.
@Amanieu
Member

Amanieu commented May 24, 2024

On most CPUs, division instructions have variable latency depending on the input values.

@Amanieu Amanieu merged commit 2fe1fd1 into rust-lang:master May 24, 2024
24 checks passed
@tgross35 tgross35 deleted the benchmarking branch May 24, 2024 21:24
@tgross35
Contributor Author

Enough to be slower than soft float on average? That seems surprising.

Thanks for the review.

@Amanieu
Member

Amanieu commented May 25, 2024

The hardware division instruction blocks the pipeline, while the soft-float implementation is fully pipelined. This means that with a sufficiently wide CPU, the execution units could be running instructions from different iterations of the benchmark loop concurrently. On the other hand, CPUs tend to have only one division unit, so no parallelism is possible.
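One rough way to probe how much overlap the divider allows on a given machine is to time independent divisions against a dependent chain. The sketch below is not part of this PR; the function names and the `time_it` helper are hypothetical, and a real measurement would use a proper harness such as Criterion.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Each quotient is independent of the others, so an out-of-order core is
/// free to overlap work from neighbouring iterations (only the cheap
/// additions of the final sum form a chain).
fn div_independent(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| black_box(x) / black_box(y))
        .sum()
}

/// Each division consumes the previous quotient, so iterations cannot
/// overlap and the loop is bound by division latency.
fn div_dependent(start: f64, divisors: &[f64]) -> f64 {
    divisors.iter().fold(start, |acc, &d| acc / black_box(d))
}

/// Hypothetical timing helper: run `f` once and report the elapsed time.
fn time_it<R>(f: impl FnOnce() -> R) -> (R, Duration) {
    let start = Instant::now();
    let out = black_box(f());
    (out, start.elapsed())
}
```

If the two variants take similar time per division, the divider is the bottleneck either way; a large gap suggests the hardware can overlap independent divisions to some degree. Results vary by microarchitecture, so treat any single run as indicative only.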
