Use memcpy and memset from musl for performance benefit #277

Merged 2 commits on Sep 15, 2022

stskeeps commented on Sep 14, 2022

Base: commit 37e18b7bf307fa4a8c745feebfcba54a0ba74f30

  • src/string/memcpy.c
  • src/string/memset.c

These files were compiled to assembly with:

clang-14 -target riscv32 -march=rv32im -O3 -S memcpy.c -nostdlib -fno-builtin -funroll-loops

and the labels were then renamed by hand so that they do not conflict.

License is MIT, see https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT

This work was prompted by the slow performance of the memcpy/memset implementations in Rust's compiler-builtins:

https://github.com/rust-lang/compiler-builtins/blob/master/src/mem/mod.rs#L25

cargo bench output from risc0/zkvm/sdk/rust, with the baseline first run on f99fe08 (main at the time) and all CPUs set to the performance governor on an Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz. Please confirm independently. The impact shows up in the memset and memcpy benchmarks:

Benchmarking raw_sha/0: Warming up for 3.0000 s
Benchmarking raw_sha/0: Collecting 100 samples in estimated 20.002 s (20k iterations)
Benchmarking raw_sha/0: Analyzing
raw_sha/0               time:   [1.0618 ms 1.0627 ms 1.0640 ms]
                        thrpt:  [0.0000   B/s 0.0000   B/s 0.0000   B/s]
                 change:
                        time:   [+0.1294% +0.2691% +0.4031%] (p = 0.00 < 0.05)
                        thrpt:  [-0.4015% -0.2684% -0.1293%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
Benchmarking raw_sha/64
Benchmarking raw_sha/64: Warming up for 3.0000 s
Benchmarking raw_sha/64: Collecting 100 samples in estimated 20.049 s (20k iterations)
Benchmarking raw_sha/64: Analyzing
raw_sha/64              time:   [1.1025 ms 1.1036 ms 1.1051 ms]
                        thrpt:  [56.555 KiB/s 56.633 KiB/s 56.691 KiB/s]
                 change:
                        time:   [+1.1370% +1.3065% +1.4829%] (p = 0.00 < 0.05)
                        thrpt:  [-1.4612% -1.2897% -1.1242%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Benchmarking raw_sha/512
Benchmarking raw_sha/512: Warming up for 3.0000 s
Benchmarking raw_sha/512: Collecting 100 samples in estimated 20.061 s (18k iterations)
Benchmarking raw_sha/512: Analyzing
raw_sha/512             time:   [1.3080 ms 1.3089 ms 1.3103 ms]
                        thrpt:  [381.59 KiB/s 382.00 KiB/s 382.28 KiB/s]
                 change:
                        time:   [-1.1905% -1.0083% -0.8434%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8506% +1.0186% +1.2049%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
Benchmarking raw_sha/2048
Benchmarking raw_sha/2048: Warming up for 3.0000 s
Benchmarking raw_sha/2048: Collecting 100 samples in estimated 20.134 s (6600 iterations)
Benchmarking raw_sha/2048: Analyzing
raw_sha/2048            time:   [4.0441 ms 4.0465 ms 4.0500 ms]
                        thrpt:  [493.83 KiB/s 494.25 KiB/s 494.54 KiB/s]
                 change:
                        time:   [+0.0022% +0.0709% +0.1631%] (p = 0.08 > 0.05)
                        thrpt:  [-0.1628% -0.0708% -0.0022%]
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking raw_sha/8192
Benchmarking raw_sha/8192: Warming up for 3.0000 s
Benchmarking raw_sha/8192: Collecting 100 samples in estimated 22.252 s (200 iterations)
Benchmarking raw_sha/8192: Analyzing
raw_sha/8192            time:   [342.26 ms 342.38 ms 342.53 ms]
                        thrpt:  [23.356 KiB/s 23.366 KiB/s 23.374 KiB/s]
                 change:
                        time:   [+0.5052% +0.5480% +0.5962%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5927% -0.5450% -0.5027%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
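
A note for reading the paired time and thrpt lines above: throughput is inversely proportional to time, so a relative time change t corresponds to a throughput change of 1/(1+t) - 1. A minimal Python sketch of that conversion follows. The helper name throughput_change is made up for illustration and is not part of Criterion.

```python
def throughput_change(time_change_pct: float) -> float:
    """Convert a relative time change (in percent) to the matching
    relative throughput change (in percent).

    Assumes throughput = bytes / time with a fixed byte count, so the
    throughput ratio is the inverse of the time ratio.
    """
    time_ratio = 1.0 + time_change_pct / 100.0  # new_time / old_time
    return (1.0 / time_ratio - 1.0) * 100.0


# Point estimate copied from the raw_sha/64 report above.
print(f"{throughput_change(1.3065):+.2f}%")
```

With the raw_sha/64 point estimate (+1.3065% time) this gives roughly -1.29%, in line with the reported -1.2897%. The same relation explains why a time reduction of about 36% shows up as a throughput gain of about 57% in the memcpy-aligned/32 report.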

Benchmarking memset/memset/32
Benchmarking memset/memset/32: Warming up for 3.0000 s
Benchmarking memset/memset/32: Collecting 100 samples in estimated 5.1077 s (4400 iterations)
Benchmarking memset/memset/32: Analyzing
memset/memset/32        time:   [1.4813 ms 1.4826 ms 1.4848 ms]
                        change: [+5.8207% +6.0225% +6.2050%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
Benchmarking memset/memset/64
Benchmarking memset/memset/64: Warming up for 3.0000 s
Benchmarking memset/memset/64: Collecting 100 samples in estimated 5.1182 s (3900 iterations)
Benchmarking memset/memset/64: Analyzing
memset/memset/64        time:   [1.7197 ms 1.7215 ms 1.7244 ms]
                        change: [-14.422% -14.312% -14.160%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
Benchmarking memset/memset/128
Benchmarking memset/memset/128: Warming up for 3.0000 s
Benchmarking memset/memset/128: Collecting 100 samples in estimated 5.0341 s (2500 iterations)
Benchmarking memset/memset/128: Analyzing
memset/memset/128       time:   [2.4919 ms 2.4948 ms 2.4990 ms]
                        change: [-18.771% -18.663% -18.531%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
Benchmarking memset/memset/256
Benchmarking memset/memset/256: Warming up for 3.0000 s
Benchmarking memset/memset/256: Collecting 100 samples in estimated 5.0387 s (2000 iterations)
Benchmarking memset/memset/256: Analyzing
memset/memset/256       time:   [3.3286 ms 3.3320 ms 3.3372 ms]
                        change: [-39.949% -39.831% -39.715%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Benchmarking memset/memset/512
Benchmarking memset/memset/512: Warming up for 3.0000 s
Benchmarking memset/memset/512: Collecting 100 samples in estimated 5.2650 s (1200 iterations)
Benchmarking memset/memset/512: Analyzing
memset/memset/512       time:   [5.5545 ms 5.5604 ms 5.5694 ms]
                        change: [-49.845% -49.787% -49.692%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking memset/memset/1024
Benchmarking memset/memset/1024: Warming up for 3.0000 s
Benchmarking memset/memset/1024: Collecting 100 samples in estimated 5.7669 s (700 iterations)
Benchmarking memset/memset/1024: Analyzing
memset/memset/1024      time:   [9.9648 ms 9.9737 ms 9.9878 ms]
                        change: [-60.101% -60.002% -59.921%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking memset/memset/2048
Benchmarking memset/memset/2048: Warming up for 3.0000 s
Benchmarking memset/memset/2048: Collecting 100 samples in estimated 6.3538 s (400 iterations)
Benchmarking memset/memset/2048: Analyzing
memset/memset/2048      time:   [24.343 ms 24.416 ms 24.524 ms]
                        change: [-54.541% -54.365% -54.156%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
Benchmarking memset/memset/4096
Benchmarking memset/memset/4096: Warming up for 3.0000 s
Benchmarking memset/memset/4096: Collecting 100 samples in estimated 6.2785 s (200 iterations)
Benchmarking memset/memset/4096: Analyzing
memset/memset/4096      time:   [51.883 ms 51.952 ms 52.065 ms]
                        change: [-58.149% -58.049% -57.944%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
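
The memset reports above list only the new time and its relative change. Assuming the change is simply new/old - 1 (Criterion estimates it from the two sample sets, so this is an approximation), the baseline time and the speedup factor can be recovered as sketched below. The function names are hypothetical, and the derived baseline is an estimate, not a figure measured in this run.

```python
def estimated_baseline(new_time: float, time_change_pct: float) -> float:
    """Estimate the baseline time from the new time and the reported
    relative change, assuming change = new / old - 1."""
    return new_time / (1.0 + time_change_pct / 100.0)


def speedup(time_change_pct: float) -> float:
    """Speedup factor old_time / new_time implied by a relative time change."""
    return 1.0 / (1.0 + time_change_pct / 100.0)


# Point estimates copied from the memset/memset/1024 report above.
new_ms, change_pct = 9.9737, -60.002
print(f"baseline ~ {estimated_baseline(new_ms, change_pct):.1f} ms, "
      f"speedup ~ {speedup(change_pct):.2f}x")
```

For memset/1024 this puts the baseline at roughly 24.9 ms, which is about a 2.5x speedup.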

Benchmarking memcpy/memcpy-aligned/32
Benchmarking memcpy/memcpy-aligned/32: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-aligned/32: Collecting 100 samples in estimated 5.0745 s (4400 iterations)
Benchmarking memcpy/memcpy-aligned/32: Analyzing
memcpy/memcpy-aligned/32
                        time:   [1.5943 ms 1.5958 ms 1.5980 ms]
                        thrpt:  [19.555 KiB/s 19.583 KiB/s 19.601 KiB/s]
                 change:
                        time:   [-36.552% -36.378% -36.236%] (p = 0.00 < 0.05)
                        thrpt:  [+56.828% +57.179% +57.610%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/32
Benchmarking memcpy/memcpy-src-unaligned/32: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-src-unaligned/32: Collecting 100 samples in estimated 5.1645 s (2000 iterations)
Benchmarking memcpy/memcpy-src-unaligned/32: Analyzing
memcpy/memcpy-src-unaligned/32
                        time:   [4.7691 ms 4.7752 ms 4.7859 ms]
                        thrpt:  [6.5296 KiB/s 6.5442 KiB/s 6.5525 KiB/s]
                 change:
                        time:   [+48.326% +48.690% +49.060%] (p = 0.00 < 0.05)
                        thrpt:  [-32.913% -32.746% -32.581%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-dst-unaligned/32
Benchmarking memcpy/memcpy-dst-unaligned/32: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-dst-unaligned/32: Collecting 100 samples in estimated 5.1212 s (2300 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/32: Analyzing
memcpy/memcpy-dst-unaligned/32
                        time:   [3.0566 ms 3.0613 ms 3.0676 ms]
                        thrpt:  [10.187 KiB/s 10.208 KiB/s 10.224 KiB/s]
                 change:
                        time:   [-9.7214% -9.5813% -9.3898%] (p = 0.00 < 0.05)
                        thrpt:  [+10.363% +10.597% +10.768%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/32
Benchmarking memcpy/memcpy-both-unaligned/32: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-both-unaligned/32: Collecting 100 samples in estimated 5.1614 s (2000 iterations)
Benchmarking memcpy/memcpy-both-unaligned/32: Analyzing
memcpy/memcpy-both-unaligned/32
                        time:   [4.7674 ms 4.7749 ms 4.7871 ms]
                        thrpt:  [6.5279 KiB/s 6.5446 KiB/s 6.5549 KiB/s]
                 change:
                        time:   [+47.505% +47.852% +48.250%] (p = 0.00 < 0.05)
                        thrpt:  [-32.546% -32.365% -32.206%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-aligned/64
Benchmarking memcpy/memcpy-aligned/64: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-aligned/64: Collecting 100 samples in estimated 5.0931 s (2600 iterations)
Benchmarking memcpy/memcpy-aligned/64: Analyzing
memcpy/memcpy-aligned/64
                        time:   [3.6453 ms 3.6504 ms 3.6589 ms]
                        thrpt:  [17.082 KiB/s 17.121 KiB/s 17.145 KiB/s]
                 change:
                        time:   [-18.152% -17.915% -17.676%] (p = 0.00 < 0.05)
                        thrpt:  [+21.471% +21.826% +22.177%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/64
Benchmarking memcpy/memcpy-src-unaligned/64: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-src-unaligned/64: Collecting 100 samples in estimated 5.3355 s (1300 iterations)
Benchmarking memcpy/memcpy-src-unaligned/64: Analyzing
memcpy/memcpy-src-unaligned/64
                        time:   [7.5037 ms 7.5145 ms 7.5318 ms]
                        thrpt:  [8.2981 KiB/s 8.3173 KiB/s 8.3292 KiB/s]
                 change:
                        time:   [-0.1381% +0.1473% +0.4127%] (p = 0.36 > 0.05)
                        thrpt:  [-0.4111% -0.1470% +0.1383%]
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Benchmarking memcpy/memcpy-dst-unaligned/64
Benchmarking memcpy/memcpy-dst-unaligned/64: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-dst-unaligned/64: Collecting 100 samples in estimated 5.2836 s (1600 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/64: Analyzing
memcpy/memcpy-dst-unaligned/64
                        time:   [6.1180 ms 6.1255 ms 6.1388 ms]
                        thrpt:  [10.181 KiB/s 10.203 KiB/s 10.216 KiB/s]
                 change:
                        time:   [-24.664% -24.449% -24.218%] (p = 0.00 < 0.05)
                        thrpt:  [+31.958% +32.362% +32.738%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/64
Benchmarking memcpy/memcpy-both-unaligned/64: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-both-unaligned/64: Collecting 100 samples in estimated 5.3421 s (1300 iterations)
Benchmarking memcpy/memcpy-both-unaligned/64: Analyzing
memcpy/memcpy-both-unaligned/64
                        time:   [7.4843 ms 7.4940 ms 7.5110 ms]
                        thrpt:  [8.3211 KiB/s 8.3400 KiB/s 8.3508 KiB/s]
                 change:
                        time:   [-1.4076% -1.1277% -0.8816%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8895% +1.1405% +1.4277%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-aligned/128
Benchmarking memcpy/memcpy-aligned/128: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-aligned/128: Collecting 100 samples in estimated 5.1577 s (2000 iterations)
Benchmarking memcpy/memcpy-aligned/128: Analyzing
memcpy/memcpy-aligned/128
                        time:   [5.5651 ms 5.5719 ms 5.5830 ms]
                        thrpt:  [22.389 KiB/s 22.434 KiB/s 22.461 KiB/s]
                 change:
                        time:   [-38.617% -38.458% -38.303%] (p = 0.00 < 0.05)
                        thrpt:  [+62.082% +62.491% +62.910%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/128
Benchmarking memcpy/memcpy-src-unaligned/128: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-src-unaligned/128: Collecting 100 samples in estimated 5.4060 s (900 iterations)
Benchmarking memcpy/memcpy-src-unaligned/128: Analyzing
memcpy/memcpy-src-unaligned/128
                        time:   [12.341 ms 12.356 ms 12.380 ms]
                        thrpt:  [10.097 KiB/s 10.117 KiB/s 10.129 KiB/s]
                 change:
                        time:   [-20.704% -20.500% -20.288%] (p = 0.00 < 0.05)
                        thrpt:  [+25.451% +25.786% +26.109%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-dst-unaligned/128
Benchmarking memcpy/memcpy-dst-unaligned/128: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-dst-unaligned/128: Collecting 100 samples in estimated 5.0109 s (1000 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/128: Analyzing
memcpy/memcpy-dst-unaligned/128
                        time:   [11.045 ms 11.059 ms 11.082 ms]
                        thrpt:  [11.280 KiB/s 11.303 KiB/s 11.317 KiB/s]
                 change:
                        time:   [-29.840% -29.640% -29.454%] (p = 0.00 < 0.05)
                        thrpt:  [+41.751% +42.126% +42.532%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/128
Benchmarking memcpy/memcpy-both-unaligned/128: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-both-unaligned/128: Collecting 100 samples in estimated 5.3790 s (900 iterations)
Benchmarking memcpy/memcpy-both-unaligned/128: Analyzing
memcpy/memcpy-both-unaligned/128
                        time:   [12.246 ms 12.259 ms 12.283 ms]
                        thrpt:  [10.177 KiB/s 10.196 KiB/s 10.208 KiB/s]
                 change:
                        time:   [-21.285% -21.079% -20.869%] (p = 0.00 < 0.05)
                        thrpt:  [+26.373% +26.709% +27.040%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Benchmarking memcpy/memcpy-aligned/256
Benchmarking memcpy/memcpy-aligned/256: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-aligned/256: Collecting 100 samples in estimated 5.0056 s (1000 iterations)
Benchmarking memcpy/memcpy-aligned/256: Analyzing
memcpy/memcpy-aligned/256
                        time:   [13.227 ms 13.248 ms 13.277 ms]
                        thrpt:  [18.830 KiB/s 18.871 KiB/s 18.900 KiB/s]
                 change:
                        time:   [-39.723% -39.590% -39.432%] (p = 0.00 < 0.05)
                        thrpt:  [+65.104% +65.534% +65.902%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/256
Benchmarking memcpy/memcpy-src-unaligned/256: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-src-unaligned/256: Collecting 100 samples in estimated 5.1123 s (500 iterations)
Benchmarking memcpy/memcpy-src-unaligned/256: Analyzing
memcpy/memcpy-src-unaligned/256
                        time:   [26.605 ms 26.630 ms 26.673 ms]
                        thrpt:  [9.3727 KiB/s 9.3880 KiB/s 9.3969 KiB/s]
                 change:
                        time:   [-21.295% -21.217% -21.078%] (p = 0.00 < 0.05)
                        thrpt:  [+26.707% +26.931% +27.056%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
Benchmarking memcpy/memcpy-dst-unaligned/256
Benchmarking memcpy/memcpy-dst-unaligned/256: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-dst-unaligned/256: Collecting 100 samples in estimated 5.8625 s (600 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/256: Analyzing
memcpy/memcpy-dst-unaligned/256
                        time:   [22.837 ms 22.867 ms 22.911 ms]
                        thrpt:  [10.912 KiB/s 10.933 KiB/s 10.947 KiB/s]
                 change:
                        time:   [-33.115% -32.968% -32.809%] (p = 0.00 < 0.05)
                        thrpt:  [+48.830% +49.182% +49.509%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/256
Benchmarking memcpy/memcpy-both-unaligned/256: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-both-unaligned/256: Collecting 100 samples in estimated 5.1244 s (500 iterations)
Benchmarking memcpy/memcpy-both-unaligned/256: Analyzing
memcpy/memcpy-both-unaligned/256
                        time:   [26.658 ms 26.683 ms 26.726 ms]
                        thrpt:  [9.3541 KiB/s 9.3691 KiB/s 9.3779 KiB/s]
                 change:
                        time:   [-21.284% -21.205% -21.085%] (p = 0.00 < 0.05)
                        thrpt:  [+26.718% +26.912% +27.039%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-aligned/512
Benchmarking memcpy/memcpy-aligned/512: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-aligned/512: Collecting 100 samples in estimated 5.4270 s (400 iterations)
Benchmarking memcpy/memcpy-aligned/512: Analyzing
memcpy/memcpy-aligned/512
                        time:   [58.562 ms 58.599 ms 58.660 ms]
                        thrpt:  [8.5237 KiB/s 8.5325 KiB/s 8.5380 KiB/s]
                 change:
                        time:   [-26.740% -26.685% -26.612%] (p = 0.00 < 0.05)
                        thrpt:  [+36.262% +36.398% +36.500%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/512
Benchmarking memcpy/memcpy-src-unaligned/512: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-src-unaligned/512: Collecting 100 samples in estimated 5.0885 s (200 iterations)
Benchmarking memcpy/memcpy-src-unaligned/512: Analyzing
memcpy/memcpy-src-unaligned/512
                        time:   [116.51 ms 116.60 ms 116.72 ms]
                        thrpt:  [4.2836 KiB/s 4.2882 KiB/s 4.2916 KiB/s]
                 change:
                        time:   [-4.5735% -4.4402% -4.3179%] (p = 0.00 < 0.05)
                        thrpt:  [+4.5128% +4.6465% +4.7926%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Benchmarking memcpy/memcpy-dst-unaligned/512
Benchmarking memcpy/memcpy-dst-unaligned/512: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-dst-unaligned/512: Collecting 100 samples in estimated 5.0658 s (200 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/512: Analyzing
memcpy/memcpy-dst-unaligned/512
                        time:   [116.91 ms 116.99 ms 117.11 ms]
                        thrpt:  [4.2697 KiB/s 4.2739 KiB/s 4.2767 KiB/s]
                 change:
                        time:   [-3.6594% -3.5414% -3.4122%] (p = 0.00 < 0.05)
                        thrpt:  [+3.5328% +3.6715% +3.7983%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/512
Benchmarking memcpy/memcpy-both-unaligned/512: Warming up for 3.0000 s
Benchmarking memcpy/memcpy-both-unaligned/512: Collecting 100 samples in estimated 5.1402 s (200 iterations)
Benchmarking memcpy/memcpy-both-unaligned/512: Analyzing
memcpy/memcpy-both-unaligned/512
                        time:   [117.23 ms 117.30 ms 117.42 ms]
                        thrpt:  [4.2581 KiB/s 4.2625 KiB/s 4.2653 KiB/s]
                 change:
                        time:   [-3.9466% -3.8320% -3.7054%] (p = 0.00 < 0.05)
                        thrpt:  [+3.8480% +3.9847% +4.1087%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Benchmarking memcpy/memcpy-aligned/1024
Benchmarking memcpy/memcpy-aligned/1024: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.4s, or reduce sample count to 90.
Benchmarking memcpy/memcpy-aligned/1024: Collecting 100 samples in estimated 5.3572 s (100 iterations)
Benchmarking memcpy/memcpy-aligned/1024: Analyzing
memcpy/memcpy-aligned/1024
                        time:   [440.01 ms 440.35 ms 440.77 ms]
                        thrpt:  [2.2688 KiB/s 2.2709 KiB/s 2.2727 KiB/s]
                 change:
                        time:   [-2.4192% -2.3331% -2.2312%] (p = 0.00 < 0.05)
                        thrpt:  [+2.2821% +2.3888% +2.4791%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/1024
Benchmarking memcpy/memcpy-src-unaligned/1024: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.6s, or reduce sample count to 70.
Benchmarking memcpy/memcpy-src-unaligned/1024: Collecting 100 samples in estimated 6.5826 s (100 iterations)
Benchmarking memcpy/memcpy-src-unaligned/1024: Analyzing
memcpy/memcpy-src-unaligned/1024
                        time:   [452.05 ms 452.93 ms 453.83 ms]
                        thrpt:  [2.2035 KiB/s 2.2079 KiB/s 2.2121 KiB/s]
                 change:
                        time:   [-2.0596% -1.8762% -1.6849%] (p = 0.00 < 0.05)
                        thrpt:  [+1.7138% +1.9120% +2.1029%]
                        Performance has improved.
Benchmarking memcpy/memcpy-dst-unaligned/1024
Benchmarking memcpy/memcpy-dst-unaligned/1024: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.5s, or reduce sample count to 70.
Benchmarking memcpy/memcpy-dst-unaligned/1024: Collecting 100 samples in estimated 6.5283 s (100 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/1024: Analyzing
memcpy/memcpy-dst-unaligned/1024
                        time:   [446.84 ms 447.56 ms 448.34 ms]
                        thrpt:  [2.2305 KiB/s 2.2343 KiB/s 2.2379 KiB/s]
                 change:
                        time:   [-3.0021% -2.8397% -2.6701%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7434% +2.9227% +3.0950%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/1024
Benchmarking memcpy/memcpy-both-unaligned/1024: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.6s, or reduce sample count to 70.
Benchmarking memcpy/memcpy-both-unaligned/1024: Collecting 100 samples in estimated 6.6369 s (100 iterations)
Benchmarking memcpy/memcpy-both-unaligned/1024: Analyzing
memcpy/memcpy-both-unaligned/1024
                        time:   [449.23 ms 449.67 ms 450.19 ms]
                        thrpt:  [2.2213 KiB/s 2.2238 KiB/s 2.2260 KiB/s]
                 change:
                        time:   [-2.0304% -1.9207% -1.8147%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8482% +1.9584% +2.0725%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
Benchmarking memcpy/memcpy-aligned/2048
Benchmarking memcpy/memcpy-aligned/2048: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 21.6s, or reduce sample count to 20.
Benchmarking memcpy/memcpy-aligned/2048: Collecting 100 samples in estimated 21.553 s (100 iterations)
Benchmarking memcpy/memcpy-aligned/2048: Analyzing
memcpy/memcpy-aligned/2048
                        time:   [646.24 ms 646.77 ms 647.37 ms]
                        thrpt:  [3.0894 KiB/s 3.0923 KiB/s 3.0948 KiB/s]
                 change:
                        time:   [-3.0802% -2.9934% -2.8965%] (p = 0.00 < 0.05)
                        thrpt:  [+2.9829% +3.0858% +3.1781%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/2048
Benchmarking memcpy/memcpy-src-unaligned/2048: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 24.9s, or reduce sample count to 20.
Benchmarking memcpy/memcpy-src-unaligned/2048: Collecting 100 samples in estimated 24.893 s (100 iterations)
Benchmarking memcpy/memcpy-src-unaligned/2048: Analyzing
memcpy/memcpy-src-unaligned/2048
                        time:   [665.93 ms 666.16 ms 666.45 ms]
                        thrpt:  [3.0010 KiB/s 3.0023 KiB/s 3.0033 KiB/s]
                 change:
                        time:   [-28.022% -27.973% -27.926%] (p = 0.00 < 0.05)
                        thrpt:  [+38.746% +38.838% +38.932%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking memcpy/memcpy-dst-unaligned/2048
Benchmarking memcpy/memcpy-dst-unaligned/2048: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 24.9s, or reduce sample count to 20.
Benchmarking memcpy/memcpy-dst-unaligned/2048: Collecting 100 samples in estimated 24.890 s (100 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/2048: Analyzing
memcpy/memcpy-dst-unaligned/2048
                        time:   [665.98 ms 666.25 ms 666.59 ms]
                        thrpt:  [3.0003 KiB/s 3.0019 KiB/s 3.0031 KiB/s]
                 change:
                        time:   [-28.079% -28.039% -27.998%] (p = 0.00 < 0.05)
                        thrpt:  [+38.885% +38.964% +39.041%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/2048
Benchmarking memcpy/memcpy-both-unaligned/2048: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 24.8s, or reduce sample count to 20.
Benchmarking memcpy/memcpy-both-unaligned/2048: Collecting 100 samples in estimated 24.759 s (100 iterations)
Benchmarking memcpy/memcpy-both-unaligned/2048: Analyzing
memcpy/memcpy-both-unaligned/2048
                        time:   [661.32 ms 661.56 ms 661.87 ms]
                        thrpt:  [3.0217 KiB/s 3.0231 KiB/s 3.0243 KiB/s]
                 change:
                        time:   [-28.640% -28.583% -28.529%] (p = 0.00 < 0.05)
                        thrpt:  [+39.918% +40.023% +40.134%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe
Benchmarking memcpy/memcpy-aligned/4096
Benchmarking memcpy/memcpy-aligned/4096: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 64.2s, or reduce sample count to 10.
Benchmarking memcpy/memcpy-aligned/4096: Collecting 100 samples in estimated 64.241 s (100 iterations)
Benchmarking memcpy/memcpy-aligned/4096: Analyzing
memcpy/memcpy-aligned/4096
                        time:   [1.2954 s 1.2960 s 1.2966 s]
                        thrpt:  [3.0850 KiB/s 3.0865 KiB/s 3.0878 KiB/s]
                 change:
                        time:   [-3.0484% -2.9906% -2.9388%] (p = 0.00 < 0.05)
                        thrpt:  [+3.0278% +3.0827% +3.1442%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
Benchmarking memcpy/memcpy-src-unaligned/4096
Benchmarking memcpy/memcpy-src-unaligned/4096: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 106.8s, or reduce sample count to 10.
Benchmarking memcpy/memcpy-src-unaligned/4096: Collecting 100 samples in estimated 106.82 s (100 iterations)
Benchmarking memcpy/memcpy-src-unaligned/4096: Analyzing
memcpy/memcpy-src-unaligned/4096
                        time:   [1.3275 s 1.3279 s 1.3283 s]
                        thrpt:  [3.0113 KiB/s 3.0123 KiB/s 3.0132 KiB/s]
                 change:
                        time:   [-27.869% -27.840% -27.808%] (p = 0.00 < 0.05)
                        thrpt:  [+38.520% +38.582% +38.638%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking memcpy/memcpy-dst-unaligned/4096
Benchmarking memcpy/memcpy-dst-unaligned/4096: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 107.3s, or reduce sample count to 10.
Benchmarking memcpy/memcpy-dst-unaligned/4096: Collecting 100 samples in estimated 107.30 s (100 iterations)
Benchmarking memcpy/memcpy-dst-unaligned/4096: Analyzing
memcpy/memcpy-dst-unaligned/4096
                        time:   [1.3365 s 1.3369 s 1.3374 s]
                        thrpt:  [2.9909 KiB/s 2.9919 KiB/s 2.9929 KiB/s]
                 change:
                        time:   [-27.403% -27.369% -27.332%] (p = 0.00 < 0.05)
                        thrpt:  [+37.613% +37.682% +37.746%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
Benchmarking memcpy/memcpy-both-unaligned/4096
Benchmarking memcpy/memcpy-both-unaligned/4096: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 107.7s, or reduce sample count to 10.
Benchmarking memcpy/memcpy-both-unaligned/4096: Collecting 100 samples in estimated 107.74 s (100 iterations)
Benchmarking memcpy/memcpy-both-unaligned/4096: Analyzing
memcpy/memcpy-both-unaligned/4096
                        time:   [1.3384 s 1.3388 s 1.3392 s]
                        thrpt:  [2.9868 KiB/s 2.9878 KiB/s 2.9887 KiB/s]
                 change:
                        time:   [-27.310% -27.276% -27.243%] (p = 0.00 < 0.05)
                        thrpt:  [+37.443% +37.506% +37.571%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Signed-off-by: Carsten Munk <carsten@zippie.com>

@jbruestle
Contributor

CI is angry, looks like you need to run rustfmt.

@stskeeps
Contributor Author

> CI is angry, looks like you need to run rustfmt.

Indeed, it had every right to be annoyed. Pushed an update.

@stskeeps
Contributor Author

That error looks more like some kind of CI/build-system/cache error: the commit clearly includes risc0/zkvm/sdk/rust/guest/src/memcpy.s, and https://doc.rust-lang.org/std/macro.include_str.html says the file is located relative to the current file.

@stskeeps
Contributor Author

Fixed, bazelbuild/rules_rust#459

@flaub
Member

flaub commented Sep 14, 2022

Can you include the copyright header in each .s?

Base: commit 37e18b7bf307fa4a8c745feebfcba54a0ba74f30
- src/string/memcpy.c
- src/string/memset.c

This was compiled into assembly with:

 clang-14 -target riscv32 -march=rv32im -O3 -S memcpy.c -nostdlib -fno-builtin -funroll-loops

and labels manually updated to not conflict

License is MIT, see https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT and indicated in the .s files

Signed-off-by: Carsten Munk <carsten@zippie.com>
@stskeeps
Contributor Author

Fixes:

  • The guest SDK is apparently also built for x86_64 under Bazel, so the global_asm blocks are now only active on riscv32
  • Added the musl copyright notice to the assembly files
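
For readers following along, here is a minimal sketch of the gating described in the first fix. It assumes the assembly is pulled in with `global_asm!(include_str!(...))` from files sitting next to the Rust source, as the earlier `include_str!` discussion suggests. The `memset.s` file name and the `asm_routines_enabled` helper are illustrative assumptions, not taken from the commit.

```rust
// Sketch only: a hypothetical guest-crate source file with memcpy.s and
// memset.s (memset.s is an assumed name) in the same directory.
// include_str! resolves its path relative to the file that invokes it.

// The cfg attribute removes each item on other targets before the macro
// is expanded, so an x86_64 host build never tries to assemble RISC-V code.
#[cfg(target_arch = "riscv32")]
core::arch::global_asm!(include_str!("memcpy.s"));

#[cfg(target_arch = "riscv32")]
core::arch::global_asm!(include_str!("memset.s"));

/// Hypothetical helper: reports whether the assembly routines were compiled in.
pub const fn asm_routines_enabled() -> bool {
    cfg!(target_arch = "riscv32")
}
```

On a host build the helper returns `false` and the `.s` files are never read; on a riscv32 build both files are assembled into the crate.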

Member

@flaub flaub left a comment


Great! Thanks!

@flaub flaub enabled auto-merge (squash) September 15, 2022 19:47
@flaub flaub merged commit d471cc6 into risc0:main Sep 15, 2022