zkVM tests: add feature flag to build multi-test with docker environment #912

SchmErik · 2023-09-27T17:47:14Z

This PR adds a test-exact-cycles feature to risc0-zkvm. When the feature is enabled, the cargo test command builds test guest binaries such as multi-test inside the docker environment. The intention is to run the tests against a reproducible ELF binary, eliminating test failures across architectures that are caused by variation in the ELF binaries produced by the Rust build tools.

I've gated 4 tests behind this feature flag. Initially, I had in mind isolating these tests in a separate CI test step, but I realized it was cumbersome to split the 4 tests out as integration tests and prevent the other tests from running, and I would have had to proliferate a bunch of #[cfg(feature = ...)] directives. My solution is a new feature flag that CI can enable; with it, all tests run, including the ones that rely on exact cycle counts. Users who do not wish to run the tests that require the reproducible binary can simply leave the test-exact-cycles feature flag out of their cargo test invocation.
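For readers unfamiliar with feature-gated tests, here is a minimal sketch of the pattern. It is not the code from this PR: `count_cycles` is a hypothetical stand-in for running a guest and reading its cycle count, and the numbers are invented for illustration. Only the feature name `test-exact-cycles` comes from this PR.

```rust
/// Hypothetical stand-in for executing a guest and reporting its cycle count.
/// Toy model: a fixed startup cost plus a fixed cost per iteration.
fn count_cycles(iterations: u64) -> u64 {
    1_000 + 4 * iterations
}

/// Always compiled under `cargo test`: checks a property that does not
/// depend on the exact bytes of the guest ELF.
#[test]
fn cycles_grow_with_iterations() {
    assert!(count_cycles(10) < count_cycles(100));
}

/// Only compiled when the feature is enabled, e.g.
/// `cargo test --features test-exact-cycles`.
/// Asserts an exact count, so it needs a reproducible guest binary.
#[cfg(feature = "test-exact-cycles")]
#[test]
fn exact_cycle_count() {
    assert_eq!(count_cycles(100), 1_400);
}
```

Without the feature, the second test is not compiled at all, so a plain `cargo test` never sees it.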

Most of the code changes in this PR involve moving code from the cargo risczero build command to the risc0-build crate.

The reproducible build implementation used to be part of the cargo-risczero utility. This change moves that code from cargo-risczero to risc0-build, so that docker builds can be integrated into the risc0-build mechanism. Eventually, it would be nice for users to be able to build guest code with docker by setting feature flags on the risc0-build crate. This code movement is the first step toward letting users build their guest code with docker.
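As a rough illustration of how a build script could pick a build strategy from a feature flag, here is a sketch. It relies only on Cargo's documented behavior of exposing each enabled feature to build scripts as a `CARGO_FEATURE_<NAME>` environment variable. The `GuestBuild` enum and all function names are hypothetical and are not the risc0-build API.

```rust
use std::env;

/// Hypothetical enum naming the two ways a guest could be built.
#[derive(Debug, PartialEq)]
enum GuestBuild {
    /// Build with the host toolchain (ELF may differ between machines).
    Local,
    /// Build inside the docker environment (reproducible ELF).
    Docker,
}

/// Name of the environment variable Cargo sets for an enabled feature:
/// upper-cased, with `-` replaced by `_`.
fn feature_env_var(feature: &str) -> String {
    format!("CARGO_FEATURE_{}", feature.to_uppercase().replace('-', "_"))
}

/// Pure decision function, kept separate so it is easy to test.
fn select_build(feature_enabled: bool) -> GuestBuild {
    if feature_enabled {
        GuestBuild::Docker
    } else {
        GuestBuild::Local
    }
}

/// Intended to be called from the `main` function of a `build.rs`.
fn build_strategy_from_env() -> GuestBuild {
    let var = feature_env_var("test-exact-cycles");
    select_build(env::var_os(var).is_some())
}
```

Keeping the decision in `select_build` means the environment lookup stays a thin wrapper that needs no special test setup.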

The tests gated by this feature flag measure cycles and segments and are extremely sensitive to changes in the ELF binary. We have seen CI machines fail these tests on different architectures depending on the commit. Interestingly, the test behavior changes even for commits that did not actually change the tests. The root of the problem is that the Rust toolchain does not guarantee reproducible builds, so each architecture runs an ELF binary that is slightly different from the others. This is an attempt to eliminate test failures caused by non-reproducible builds. The tests under this new flag must be run on ELFs generated by the docker environment. All others can be run with the usual `cargo test` command.
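One simple way to confirm that two machines really produced different guest binaries is to compare the ELF bytes directly. The helper below is a hypothetical sketch for that purpose and is not part of this PR; it reports the first byte offset at which two buffers differ.

```rust
/// Returns the offset of the first differing byte between two buffers,
/// or `None` if they are byte-for-byte identical.
/// If one buffer is a strict prefix of the other, the offset is the
/// length of the shorter buffer.
fn first_difference(a: &[u8], b: &[u8]) -> Option<usize> {
    // Compare the overlapping part byte by byte.
    if let Some(i) = a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        return Some(i);
    }
    // The overlapping part matches; the buffers differ only if lengths do.
    if a.len() != b.len() {
        Some(a.len().min(b.len()))
    } else {
        None
    }
}
```

In practice the two inputs could be read with `std::fs::read` from the ELF files produced on each machine; a `None` result means the builds were identical.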

risc0/zkvm/methods/build.rs

.github/workflows/main.yml

risc0/build/src/docker.rs

Co-authored-by: Frank Laub <flaub@risc0.com>

github-actions · 2023-09-27T19:32:36Z

Benchmark for Linux-cuda `fabecb5`

Test	Base	PR	%
fib/100/execute	5.3±0.17ms	5.2±0.14ms	-1.89%
fib/100/prove	1165.7±28.87ms	851.3±21.79ms	-26.97%
fib/100/total	1157.2±11.06ms	838.4±15.07ms	-27.55%
fib/1000/execute	5.9±0.07ms	5.7±0.11ms	-3.39%
fib/1000/prove	1192.3±22.61ms	872.0±13.56ms	-26.86%
fib/1000/total	1174.8±16.46ms	877.2±14.20ms	-25.33%
fib/10000/execute	12.0±0.19ms	11.9±0.20ms	-0.83%
fib/10000/prove	3.6±0.03s	3.4±0.02s	-5.56%
fib/10000/total	3.5±0.01s	3.4±0.02s	-2.86%

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default `fabecb5`

Test	Base	PR	%
fib/100/execute	2.8±0.16ms	2.8±0.12ms	0.00%
fib/100/prove	3.7±0.05s	3.6±0.05s	-2.70%
fib/100/total	3.7±0.06s	3.7±0.07s	0.00%
fib/1000/execute	3.1±0.10ms	3.0±0.08ms	-3.23%
fib/1000/prove	3.7±0.06s	3.7±0.08s	0.00%
fib/1000/total	3.7±0.10s	3.7±0.08s	0.00%
fib/10000/execute	6.2±0.10ms	6.1±0.08ms	-1.61%
fib/10000/prove	15.1±0.10s	15.0±0.15s	-0.66%
fib/10000/total	15.0±0.13s	15.0±0.19s	0.00%

Benchmark for macOS-metal `fabecb5`

Test	Base	PR	%
fib/100/execute	2.9±0.12ms	2.7±0.11ms	-6.90%
fib/100/prove	814.4±6.26ms	802.4±4.93ms	-1.47%
fib/100/total	833.6±7.47ms	830.4±5.16ms	-0.38%
fib/1000/execute	3.2±0.09ms	3.1±0.02ms	-3.13%
fib/1000/prove	833.8±6.21ms	820.1±4.04ms	-1.64%
fib/1000/total	850.3±6.23ms	849.6±4.25ms	-0.08%
fib/10000/execute	6.2±0.09ms	6.1±0.15ms	-1.61%
fib/10000/prove	3.1±0.01s	3.1±0.02s	0.00%
fib/10000/total	3.1±0.01s	3.1±0.01s	0.00%

github-actions · 2023-09-27T21:25:42Z

Benchmark for Linux-cuda `f7725bc`

Test	Base	PR	%
fib/100/execute	5.0±0.10ms	5.0±0.08ms	0.00%
fib/100/prove	1522.1±78.09ms	1146.4±8.49ms	-24.68%
fib/100/total	1416.2±23.73ms	1123.9±6.46ms	-20.64%
fib/1000/execute	5.6±0.11ms	5.6±0.09ms	0.00%
fib/1000/prove	1515.3±32.76ms	1173.4±15.12ms	-22.56%
fib/1000/total	1402.8±23.78ms	1149.8±9.98ms	-18.04%
fib/10000/execute	11.6±0.12ms	11.6±0.09ms	0.00%
fib/10000/prove	4.6±0.02s	3.7±0.03s	-19.57%
fib/10000/total	4.4±0.04s	3.8±0.08s	-13.64%

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-metal

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

github-actions · 2023-09-27T22:26:07Z

Benchmark for Linux-cuda

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default `9e31afc`

Test	Base	PR	%
fib/100/execute	2.8±0.15ms	2.8±0.15ms	0.00%
fib/100/prove	3.6±0.06s	3.6±0.08s	0.00%
fib/100/total	3.7±0.05s	3.6±0.05s	-2.70%
fib/1000/execute	3.1±0.08ms	3.0±0.05ms	-3.23%
fib/1000/prove	3.7±0.06s	3.6±0.08s	-2.70%
fib/1000/total	3.7±0.09s	3.6±0.07s	-2.70%
fib/10000/execute	6.2±0.10ms	6.2±0.07ms	0.00%
fib/10000/prove	15.1±0.08s	15.0±0.11s	-0.66%
fib/10000/total	15.1±0.17s	15.0±0.18s	-0.66%

Benchmark for macOS-metal `9e31afc`

Test	Base	PR	%
fib/100/execute	2.9±0.16ms	2.8±0.04ms	-3.45%
fib/100/prove	802.2±5.54ms	802.0±4.84ms	-0.02%
fib/100/total	827.4±4.64ms	823.9±6.47ms	-0.42%
fib/1000/execute	3.1±0.05ms	3.1±0.03ms	0.00%
fib/1000/prove	821.9±4.12ms	821.5±3.41ms	-0.05%
fib/1000/total	852.2±6.72ms	846.6±5.96ms	-0.66%
fib/10000/execute	6.2±0.05ms	6.0±0.13ms	-3.23%
fib/10000/prove	3.1±0.01s	3.1±0.02s	0.00%
fib/10000/total	3.1±0.01s	3.1±0.01s	0.00%

github-actions · 2023-09-27T23:29:25Z

Benchmark for Linux-cuda

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default `c2a7b26`

Test	Base	PR	%
fib/100/execute	2.8±0.09ms	2.8±0.10ms	0.00%
fib/100/prove	3.6±0.06s	3.6±0.07s	0.00%
fib/100/total	3.6±0.05s	3.6±0.06s	0.00%
fib/1000/execute	3.1±0.09ms	3.0±0.07ms	-3.23%
fib/1000/prove	3.7±0.06s	3.7±0.08s	0.00%
fib/1000/total	3.7±0.09s	3.6±0.09s	-2.70%
fib/10000/execute	6.2±0.04ms	6.0±0.05ms	-3.23%
fib/10000/prove	15.1±0.10s	15.0±0.08s	-0.66%
fib/10000/total	15.0±0.10s	15.0±0.08s	0.00%

Benchmark for macOS-metal

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

The cargo metadata crate seems to not play nicely with absolute paths...

github-actions · 2023-09-28T01:00:33Z

Benchmark for Linux-cuda

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-metal

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

github-actions · 2023-09-28T01:49:11Z

Benchmark for Linux-cuda `101d599`

Test	Base	PR	%
fib/100/execute	5.1±0.09ms	5.1±0.09ms	0.00%
fib/100/prove	1171.4±47.79ms	1161.4±20.35ms	-0.85%
fib/100/total	1141.3±22.91ms	1103.7±8.83ms	-3.29%
fib/1000/execute	5.7±0.10ms	5.7±0.11ms	0.00%
fib/1000/prove	1181.4±15.36ms	1111.5±12.84ms	-5.92%
fib/1000/total	1145.0±6.63ms	1135.1±16.79ms	-0.86%
fib/10000/execute	11.9±0.11ms	11.7±0.12ms	-1.68%
fib/10000/prove	4.1±0.04s	3.5±0.03s	-14.63%
fib/10000/total	4.2±0.03s	3.5±0.01s	-16.67%

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-metal `101d599`

Test	Base	PR	%
fib/100/execute	2.8±0.08ms	2.8±0.13ms	0.00%
fib/100/prove	805.6±5.23ms	798.7±3.36ms	-0.86%
fib/100/total	826.0±4.18ms	822.5±7.08ms	-0.42%
fib/1000/execute	3.1±0.07ms	3.1±0.05ms	0.00%
fib/1000/prove	820.8±3.57ms	820.6±3.35ms	-0.02%
fib/1000/total	843.7±5.99ms	843.4±5.67ms	-0.04%
fib/10000/execute	6.2±0.06ms	6.1±0.05ms	-1.61%
fib/10000/prove	3.1±0.01s	3.1±0.02s	0.00%
fib/10000/total	3.1±0.01s	3.1±0.01s	0.00%

github-actions · 2023-09-28T08:27:06Z

Benchmark for Linux-cuda

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default `848e1ee`

Test	Base	PR	%
fib/100/execute	2.8±0.10ms	2.8±0.16ms	0.00%
fib/100/prove	3.6±0.04s	3.6±0.07s	0.00%
fib/100/total	3.6±0.07s	3.6±0.05s	0.00%
fib/1000/execute	3.0±0.08ms	3.0±0.10ms	0.00%
fib/1000/prove	3.7±0.04s	3.6±0.04s	-2.70%
fib/1000/total	3.7±0.06s	3.7±0.05s	0.00%
fib/10000/execute	6.1±0.06ms	6.0±0.05ms	-1.64%
fib/10000/prove	15.0±0.13s	15.0±0.17s	0.00%
fib/10000/total	15.1±0.14s	15.0±0.16s	-0.66%

Benchmark for macOS-metal

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

github-actions · 2023-09-28T18:20:01Z

Benchmark for Linux-cuda `acfdd8e`

Test	Base	PR	%
fib/100/execute	5.1±0.15ms	5.1±0.14ms	0.00%
fib/100/prove	1461.9±26.41ms	1179.0±66.73ms	-19.35%
fib/100/total	1348.0±43.42ms	1134.5±9.88ms	-15.84%
fib/1000/execute	5.8±0.06ms	5.7±0.12ms	-1.72%
fib/1000/prove	1405.4±65.90ms	1192.3±69.01ms	-15.16%
fib/1000/total	1357.6±17.26ms	1156.3±8.05ms	-14.83%
fib/10000/execute	12.1±0.18ms	11.7±0.16ms	-3.31%
fib/10000/prove	4.6±0.01s	4.3±0.01s	-6.52%
fib/10000/total	4.6±0.03s	4.1±0.03s	-10.87%

Benchmark for Linux-default

    <details open>
      <summary>Click to hide benchmark</summary>
      Benchmarks have changed between the two branches, unable to diff.
    </details>

Benchmark for macOS-default `acfdd8e`

Test	Base	PR	%
fib/100/execute	2.8±0.15ms	2.7±0.14ms	-3.57%
fib/100/prove	3.7±0.05s	3.6±0.05s	-2.70%
fib/100/total	3.6±0.08s	3.6±0.05s	0.00%
fib/1000/execute	3.1±0.09ms	3.0±0.10ms	-3.23%
fib/1000/prove	3.7±0.06s	3.6±0.06s	-2.70%
fib/1000/total	3.7±0.05s	3.7±0.07s	0.00%
fib/10000/execute	6.2±0.13ms	6.2±0.06ms	0.00%
fib/10000/prove	15.0±0.16s	15.0±0.12s	0.00%
fib/10000/total	15.0±0.15s	15.0±0.10s	0.00%

Benchmark for macOS-metal `acfdd8e`

Test	Base	PR	%
fib/100/execute	2.9±0.14ms	2.8±0.09ms	-3.45%
fib/100/prove	800.3±5.01ms	795.1±4.01ms	-0.65%
fib/100/total	824.3±5.82ms	820.4±4.46ms	-0.47%
fib/1000/execute	3.1±0.08ms	3.0±0.04ms	-3.23%
fib/1000/prove	815.9±3.98ms	814.8±4.45ms	-0.13%
fib/1000/total	845.4±3.01ms	838.9±7.42ms	-0.77%
fib/10000/execute	6.1±0.15ms	6.0±0.04ms	-1.64%
fib/10000/prove	3.1±0.01s	3.1±0.01s	0.00%
fib/10000/total	3.1±0.01s	3.1±0.01s	0.00%

SchmErik added 2 commits September 26, 2023 21:49

SchmErik requested review from flaub and capossele September 27, 2023 17:47

Merge branch 'main' into erik/repro-multi-test

b53d1c6