
[ci] add gitlab pipeline for criterion benchmarks #422

Merged
merged 34 commits from sk-add-gitlab-pipeline into master on Sep 16, 2022

Conversation

sergejparity (Contributor)

No description provided.

@codecov-commenter commented Sep 2, 2022

Codecov Report

Merging #422 (e916aec) into master (8fd6464) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #422   +/-   ##
=======================================
  Coverage   79.38%   79.38%           
=======================================
  Files          72       72           
  Lines        6198     6198           
=======================================
  Hits         4920     4920           
  Misses       1278     1278           


@Robbepop (Collaborator) left a comment

Looking forward to trying it out. What are your plans for proper formatting of the output?

.gitlab-ci.yml Outdated
- cargo --version
- rustup +nightly show
- cargo +nightly --version
- cargo spellcheck --version
Collaborator

not needed, right?

Contributor Author

Yes. It's just debug info. Will remove later

Collaborator

please remove this line for merge

@Robbepop (Collaborator) commented Sep 2, 2022

We might be able to perform the benchmark pass via GitHub Actions using the new hosted GHA runners:
https://github.blog/2022-09-01-github-actions-introducing-the-new-larger-github-hosted-runners-beta/

This would simplify our CI setup compared to having 2 different ones.
What do you think?

@sergejparity (Contributor Author)

Yes, those look interesting. But they are still in beta; I'm not sure whether they are available to us right now.

@paritytech-cicd-pr commented Sep 9, 2022

CRITERION BENCHMARKS

| BENCHMARK | MASTER | PR | Diff |
|---|---|---|---|
| compile_and_validate_v1 | 5.3960 ms | 5.4034 ms | ⚪ -0.0543% |
| execute_count_until_v1 | 2.0226 ms | 2.0235 ms | ⚪ +0.0058% |
| execute_factorial_iterative_v1 | 918.89 ns | 921.30 ns | ⚪ +0.2528% |
| execute_factorial_recursive_v1 | 1.2885 µs | 1.2668 µs | 🟢 -1.6309% |
| execute_fib_iterative_v1 | 4.7993 ms | 4.8103 ms | ⚪ +0.2261% |
| execute_fib_recursive_v1 | 11.649 ms | 11.088 ms | 🟢 -4.8002% |
| execute_global_bump_v1 | 2.8162 ms | 2.8178 ms | ⚪ +0.0311% |
| execute_host_calls_v1 | 31.023 µs | 30.696 µs | ⚪ -0.6575% |
| execute_memory_fill_v1 | 4.0522 ms | 4.0485 ms | ⚪ -0.0776% |
| execute_memory_sum_v1 | 3.8000 ms | 3.7960 ms | ⚪ -0.0977% |
| execute_memory_vec_add_v1 | 8.0440 ms | 8.0777 ms | ⚪ +0.3714% |
| execute_recursive_is_even_v1 | 2.3110 ms | 2.2377 ms | 🟢 -3.1723% |
| execute_recursive_ok_v1 | 308.19 µs | 295.35 µs | 🟢 -4.1052% |
| execute_recursive_scan_v1 | 373.12 µs | 357.69 µs | 🟢 -4.1572% |
| execute_recursive_trap_v1 | 25.767 µs | 25.363 µs | 🟢 -1.7334% |
| execute_regex_redux_v1 | 1.4169 ms | 1.3883 ms | 🟢 -2.0764% |
| execute_rev_complement_v1 | 1.3855 ms | 1.3879 ms | ⚪ +0.2077% |
| execute_tiny_keccak_v1 | 1.2512 ms | 1.2656 ms | ⚪ +1.3573% |
| execute_trunc_f2i_v1 | 1.8029 ms | 1.8044 ms | ⚪ +0.1029% |
| instantiate_v1 | 75.835 µs | 74.327 µs | 🟢 -2.0046% |

Link to pipeline
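For context, a rough sketch (not the pipeline's actual script) of how each row's diff and colour marker could be derived from the two timings; the ±1.5% "no change" band and the helper name are assumptions for illustration only:

# Sketch only: compute the PR-vs-master percentage and pick a marker.
# Both inputs must be in the same unit; the ±1.5% threshold is assumed.
diff_marker() {
  awk -v master="$1" -v pr="$2" 'BEGIN {
    pct = (pr - master) / master * 100
    marker = "⚪"
    if (pct <= -1.5) marker = "🟢"
    if (pct >=  1.5) marker = "🔴"
    printf "%s %+.4f%%\n", marker, pct
  }'
}
diff_marker 100 97    # prints: 🟢 -3.0000%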

@sergejparity (Contributor Author)

@Robbepop please take a look at the sample report above. Does it look like what you expected?
It was generated from the output of cargo bench --bench benches -- --noplot --baseline master.
As the base I took what can be seen in the logs:

Benchmarking execute/memory_vec_add/v1
Benchmarking execute/memory_vec_add/v1: Warming up for 1.0000 s
Benchmarking execute/memory_vec_add/v1: Collecting 10 samples in estimated 2.1970 s (275 iterations)
Benchmarking execute/memory_vec_add/v1: Analyzing
execute/memory_vec_add/v1
                        time:   [7.9111 ms 7.9149 ms 7.9200 ms]
                        change: [-0.1041% +0.0076% +0.1570%] (p = 0.93 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

and transformed into this:

Benchmarks results
execute/memory_vec_add/v1
execute/memory_vec_add/v1
time: [7.9111 ms 7.9149 ms 7.9200 ms]
change: [-0.1041% +0.0076% +0.1570%] (p = 0.93 > 0.05)
No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe

Any suggestions on what needs to be improved, changed, added, etc.?
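For reference, a minimal sketch of the two runs behind such a comparison, assuming Criterion's --save-baseline/--baseline flags and GitLab's predefined CI_COMMIT_REF_NAME variable; this is not the PR's actual .gitlab-ci.yml, and the output file name below is made up:

# Save a baseline on master, then benchmark the PR branch against it.
git checkout master
cargo bench --bench benches -- --noplot --save-baseline master
git checkout "$CI_COMMIT_REF_NAME"
cargo bench --bench benches -- --noplot --baseline master \
  | tee target/criterion/output_pr.txt    # hypothetical output path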

@Robbepop (Collaborator) commented Sep 9, 2022

Hey @sergejparity thank you for the update!
Having those CI based benchmarks is already a big upgrade!
The benchmarked times look a bit slow; do you know what machine they were running on?
(I did not expect that my old laptop from 2015 would outperform our CI runners 😅 )

I really like the color encodings where red indicates worse performance, green is improved performance and white is no change!
However, imo the default Criterion output for benchmarks is kind of unreadable at large, and I'd much prefer a tabular layout.

Something like the following:

| Benchmark | master | PR | Diff % |
|---|---|---|---|
| compile_and_validate | 5.3814 ms | 5.3807 ms | ⚪ -0.8455% |
| instantiate | 74.820 µs | 74.782 µs | ⚪ +0.6305% |
| execute/tiny_keccak | 1.2501 ms | 1.2468 ms | ⚪ -0.1709% |
| execute/count_until | 1.9322 ms | 1.8791 ms | 🟢 -1.8681% |

etc...

Note that I made up the master column numbers and copied over the PR column numbers as well as the %-diffs.

One thing we might want to do is make the benchmarks in CI more stable. Right now the benchmarks are somewhat unstable, since I preferred to make them execute faster. But for CI we could change that. Maybe we can implement it in a dynamic way that allows for both options: a fast but imprecise mode as well as a slow but precise mode.
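As an illustration only (not something this PR implements), such a toggle in a CI job script could look roughly like this, assuming Criterion's --sample-size and --measurement-time command-line options and a hypothetical BENCH_PRECISE variable:

# Hypothetical switch between a fast/imprecise and a slow/precise benchmark run.
if [ "${BENCH_PRECISE:-0}" = "1" ]; then
  SAMPLES=100; SECONDS_PER_BENCH=10    # slow but precise
else
  SAMPLES=10; SECONDS_PER_BENCH=2      # fast but imprecise
fi
cargo bench --bench benches -- --noplot --baseline master \
  --sample-size "$SAMPLES" --measurement-time "$SECONDS_PER_BENCH"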

Generally I have big plans for this CI based benchmarking infrastructure.
For example, in the future we want to extend it to also run Wasm coremark benchmarks both natively and under Wasmtime, since that is how wasmi will mostly be run via Substrate. However, that would exceed the scope of the current PR, and I will write a detailed issue for this extension in the future.

@sergejparity (Contributor Author)

@Robbepop thank you for the feedback. Will update the template.

Regarding performance, I am also just using random regular runners for now. I will tune up this component later too.

@Robbepop (Collaborator) commented Sep 9, 2022

@Robbepop thank you for the feedback. Will update the template.

Regarding performance, I am also just using random regular runners for now. I will tune up this component later too.

Hyped to see your updates! :)
Glad to hear that runner performance is going to improve later.

@sergejparity sergejparity changed the title [wip][ci][do_not_merge] add gitlab pipeline [ci] add gitlab pipeline for criterion benchmarks Sep 16, 2022
@sergejparity (Contributor Author)

@Robbepop now it should be much closer to what you expect.

@Robbepop (Collaborator)

@sergejparity yes it looks pretty good to me! Great job! 🚀

One small nit: I'd love it if the benchmark names were written in a mono font, i.e. using ` around the names.

Is this PR done?

@sergejparity (Contributor Author)

names were written in mono fonts

Here you go :)


Comment on lines +13 to +26
echo "PARSING MASTER REPORT"
sed -e 's/^Found.*//g' \
-e 's/^\s\+[[:digit:]].*$//g' \
-e 's/\//_/g' \
-e 's/^[a-z0-9_]\+/"&": {/g' \
-e 's/time:\s\+\[.\{10\}/"time": "/g' \
-e 's/.\{10\}\]/"},/g' \
-e '1s/^/{\n/g' \
-e '/^$/d' \
-e 's/ */ /g' \
-e 's/^ *\(.*\) *$/\1/' $1 \
| sed -z 's/.$//' \
| sed -e '$s/.$/}/g' \
| tee target/criterion/output_master.json
Collaborator

As a follow-up to this PR we should try to make this part more readable.
For example, Criterion supports JSON output as well: https://bheisler.github.io/criterion.rs/book/cargo_criterion/external_tools.html
With tools such as jq we could make extracting the information significantly more readable.
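For illustration, a rough sketch of what that could look like, assuming the cargo criterion --message-format=json stream described at the linked page, where each finished benchmark is emitted as a benchmark-complete record with an id and a typical estimate in nanoseconds; the field names are taken from those docs and the output path is made up:

# Sketch only: extract "<benchmark id>  <typical estimate in ms>" pairs from the
# cargo-criterion JSON message stream.
cargo criterion --message-format=json 2>/dev/null \
  | jq -r 'select(.reason == "benchmark-complete")
           | "\(.id)\t\(.typical.estimate / 1e6) ms"' \
  | tee target/criterion/output_pr.tsv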

I will merge this PR and propose a follow-up PR for you. :) Is that okay with you?

Contributor Author

Agree. My homegrown parsing "engine" looks ugly and can definitely be improved.
But the JSON export tool you mention is part of cargo-criterion, which is still under development. By the way, have you tried it in action?
The reports from cargo bench also produce JSONs with raw data in target/criterion. I tried to use them as well and found that it is possible to get execution times and other metrics from them. The only thing that stopped me from using them is that I would have been forced to write my own statistical analysis tool to interpret them :)
So parsing the raw command-line output into JSON appeared to be the easier solution.

Contributor Author

Update: I tried to install and use cargo-criterion, and the first attempt was not successful:

Benchmarking compile_and_validate/v1thread 'main' panicked at 'could not open benchmark file benches/wasm/wasm_kernel/target/wasm32-unknown-unknown/release/wasm_kernel.wasm: No such file or directory (os er',  wasmi_v1/benches/bench/mod.rs : 16 : 33 
 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Benchmarking compile_and_validate/v1: Warming up for 1.0000 sthread 'main' panicked at 'Unexpected message FinishedBenchmarkGroup { group: "compile_and_validate/v1" }', /home/sergej/.cargo/registry/src/github.com-1ecc6299db9ec823/cargo-criterion-1.1.0/src/bench_target.rs:306:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', /home/sergej/.cargo/registry/src/github.com-1ecc6299db9ec823/criterion-0.4.0/src/benchmark_group.rs:380:18
stack backtrace:

I double-checked that benches/wasm/wasm_kernel/target/wasm32-unknown-unknown/release/wasm_kernel.wasm is in place. What could be the reason for the failure?

Collaborator

The problem with raw command-line output compared to JSON output (or generally machine-readable output) is that the former is subject to change, whereas the latter is more or less guaranteed to stay stable so as not to break dependents. Therefore, as a long-term, non-brittle solution we really want to read output that is intended to be machine readable.

Collaborator

Update: I tried to install and use cargo-criterion, and the first attempt was not successful: […]

You are probably doing this in the root directory instead of the wasmi_v1 subdirectory.

Contributor Author

You are right. Now it works. At first glance, the JSONs produced by cargo-criterion are almost what is needed. Of course, some processing is still needed to produce readable output, like converting times from ns into more convenient units. But at least we'll have machine-readable data out of the box.
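For example (a sketch with assumed inputs, not code from this PR), the unit conversion could be as simple as a small helper that divides the nanosecond estimate and picks a suffix:

# Hypothetical helper: pretty-print a nanosecond estimate as ns, µs, or ms.
format_ns() {
  awk -v ns="$1" 'BEGIN {
    if (ns >= 1e6)      printf "%.4f ms\n", ns / 1e6
    else if (ns >= 1e3) printf "%.3f µs\n", ns / 1e3
    else                printf "%.0f ns\n", ns
  }'
}
format_ns 74327      # prints: 74.327 µs
format_ns 5403400    # prints: 5.4034 ms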

@Robbepop (Collaborator)

Thanks a lot @sergejparity for this great work! I will merge this PR now and will write some issues for future work items for you and the benchmarking CI. :)

🚀

@Robbepop Robbepop merged commit 97fb1d2 into master Sep 16, 2022
@Robbepop Robbepop deleted the sk-add-gitlab-pipeline branch September 16, 2022 11:49