Register machine wasmi execution engine #729

Merged
merged 673 commits into from
Sep 21, 2023

Robbepop
Member

@Robbepop Robbepop commented May 29, 2023

Closes #361.

Precursor: #367

ToDo

These items are things we want to do before merging the PR:

  • Simplify call argument encoding for function calls with exactly 1 argument.

For function calls we need to set up the arguments so that they are all stored in contiguous registers. Currently we also do this for function calls with exactly 1 argument even though this is not needed. During translation we should therefore check for this case and simplify (and thus optimize) the argument encoding for those function calls.

  • Properly implement translation of ConsumeFuel instructions and their fixed costs.

So far we have concentrated on getting the Wasm to wasmi bytecode translation up and running with the simplest possible setup. This means we have ignored the translation of ConsumeFuel instructions and their associated fuel costs so far, and we need to do this work before we merge the PR. There should not be any difficulties compared to the already existing implementation for the stack-machine engine backend.

  • Fix attack vector of local.set stack preservation:

Currently, for every local.set or local.tee that we encounter while translating Wasm bytecode to wasmi bytecode, we iterate over all values on the emulated value stack. Usually this isn't a big deal since the stack rarely grows large in practical workloads. However, this can easily be attacked by a Wasm blob that blows up the stack and then performs tons of local.set instructions. We can eliminate this attack vector by storing the indices of the local.get providers on the emulated stack in a separate stack and iterating over that instead. While iterating we also remove the local.get indices from that separate stack so that a subsequent local.set operation will see an empty stack and thus perform no work. In effect, this caches the local.get providers.
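
A minimal sketch of this caching idea follows. It is not the wasmi implementation: the `Provider` and `ValueStack` types are hypothetical, and instead of a single side stack it keeps one position list per local (a small variation), so that a `local.set x` only touches entries that refer to `x`.

```rust
use std::collections::HashMap;

/// A value on the emulated value stack (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provider {
    /// Pushed by `local.get` for the given local index.
    Local(u32),
    /// A preserved copy of a local, living in the given preservation register.
    Preserved(u32),
    /// Any other temporary value.
    Temp(u32),
}

#[derive(Default)]
struct ValueStack {
    providers: Vec<Provider>,
    /// For each local: positions in `providers` that still refer to it.
    local_positions: HashMap<u32, Vec<usize>>,
}

impl ValueStack {
    fn push(&mut self, provider: Provider) {
        if let Provider::Local(local) = provider {
            self.local_positions
                .entry(local)
                .or_default()
                .push(self.providers.len());
        }
        self.providers.push(provider);
    }

    fn pop(&mut self) -> Option<Provider> {
        let provider = self.providers.pop()?;
        if let Provider::Local(local) = provider {
            // The popped entry was the top-most one, so it is the last recorded position.
            if let Some(positions) = self.local_positions.get_mut(&local) {
                positions.pop();
            }
        }
        Some(provider)
    }

    /// Called when translating `local.set local` or `local.tee local`.
    /// Replaces all cached `local.get` entries for `local` and returns how many were replaced.
    /// The cache for `local` is emptied, so an immediately following call does no work.
    fn preserve_local(&mut self, local: u32, preserved_reg: u32) -> usize {
        let positions = self.local_positions.remove(&local).unwrap_or_default();
        for &pos in &positions {
            self.providers[pos] = Provider::Preserved(preserved_reg);
        }
        positions.len()
    }
}
```

With this layout each pushed `local.get` entry is visited at most once by a preservation, so a long run of `local.set` instructions no longer rescans the whole emulated stack.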

  • Improve local.set result replacement optimization when preserving local.get at the same time:

When translating local.set or local.tee we replace the result register of the previous instruction instead of emitting a copy instruction if possible. However, when preserving a local.get on the emulation stack we do not perform this optimization. The problem is that the copy instruction required for the preservation has to take place before the instruction i whose result register should be replaced. However, instruction i is already encoded and could in the worst case consist of multiple instruction words, which would require shifting already encoded instructions by one index. In theory this could also interfere with pre-calculated branch offsets, but so far this has not been demonstrated and might not be true.

  • Avoid expensive register space defragmentation if possible.

After successful translation of Wasm bytecode, the wasmi translation performs a final pass over all encoded instructions to defragment the register space. This is needed for registers that have been preserved for local.set in certain situations. However, this entire process is very costly and can be avoided entirely or partially. Defragmentation is only ever needed if there were actual register preservations during the translation procedure, so it can be skipped entirely otherwise. Furthermore, we only need to defragment the instructions that have been encoded after encountering the first local.set register preservation. This way we can keep the simple loop over the instructions but still avoid most of the unnecessary work and thus speed up translation.
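
The following is a rough sketch of that bookkeeping, under simplifying assumptions: the `Add` instruction, the `Translator` type and the `remap` closure are all hypothetical stand-ins, not wasmi types. The translator remembers where the first preservation happened and the final pass only visits instructions from that point on.

```rust
/// Hypothetical, simplified instruction: `result = lhs + rhs` on register indices.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Add {
    result: u16,
    lhs: u16,
    rhs: u16,
}

struct Translator {
    instrs: Vec<Add>,
    /// Index of the first instruction encoded after a register preservation, if any.
    first_preservation: Option<usize>,
}

impl Translator {
    fn new() -> Self {
        Self { instrs: Vec::new(), first_preservation: None }
    }

    fn push_instr(&mut self, instr: Add) {
        self.instrs.push(instr);
    }

    /// Records that a `local.set` register preservation happened at this point.
    fn note_preservation(&mut self) {
        if self.first_preservation.is_none() {
            self.first_preservation = Some(self.instrs.len());
        }
    }

    /// Remaps registers only where needed and returns the number of visited instructions.
    fn defragment(&mut self, remap: impl Fn(u16) -> u16) -> usize {
        let start = match self.first_preservation {
            Some(start) => start,
            None => return 0, // no preservation happened: skip the pass entirely
        };
        let tail = &mut self.instrs[start..];
        for instr in tail.iter_mut() {
            instr.result = remap(instr.result);
            instr.lhs = remap(instr.lhs);
            instr.rhs = remap(instr.rhs);
        }
        tail.len()
    }
}
```

The loop itself stays as simple as before; only its starting point changes.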

  • Recycle preservation register slots if possible.

Currently, when preserving local.get x values on the stack upon a local.set x translation, a new register is reserved for the N local.get x values that have been found and replaced on the emulated value stack at the time of translation. When this happens multiple times, a new register slot on the preservation stack is registered each time. Right now, we do not track how many of the preserved local.get x values have already been used. If we did, we could recycle preservation register slots that are no longer used instead of allocating a new slot every time, which could lead to fewer registers being used, especially by larger functions. A data structure that allows this efficiently in O(1) is the so-called Stash data structure.
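
To illustrate, here is a small stash-like sketch with an intrusive free list. The `PreservationSlots` type and its methods are hypothetical names chosen for this example; it only shows how use counting plus a free list gives O(1) allocation and recycling.

```rust
#[derive(Debug)]
enum Slot {
    /// Slot holds a preserved value that still has this many pending uses.
    Occupied { remaining_uses: usize },
    /// Slot is free; `next_free` links to the next free slot, if any.
    Vacant { next_free: Option<usize> },
}

#[derive(Debug, Default)]
struct PreservationSlots {
    slots: Vec<Slot>,
    first_free: Option<usize>,
}

impl PreservationSlots {
    /// Allocates a slot for a preserved value with `uses` pending uses, in O(1).
    fn alloc(&mut self, uses: usize) -> usize {
        assert!(uses > 0, "a preserved value needs at least one use");
        match self.first_free {
            Some(index) => {
                // Recycle the head of the free list.
                let next = match self.slots[index] {
                    Slot::Vacant { next_free } => next_free,
                    Slot::Occupied { .. } => unreachable!("free list points at an occupied slot"),
                };
                self.first_free = next;
                self.slots[index] = Slot::Occupied { remaining_uses: uses };
                index
            }
            None => {
                self.slots.push(Slot::Occupied { remaining_uses: uses });
                self.slots.len() - 1
            }
        }
    }

    /// Records one use of the slot, in O(1). Returns `true` if the slot became free.
    fn consume(&mut self, index: usize) -> bool {
        let now_free = match &mut self.slots[index] {
            Slot::Occupied { remaining_uses } => {
                *remaining_uses -= 1;
                *remaining_uses == 0
            }
            Slot::Vacant { .. } => panic!("slot {} is not in use", index),
        };
        if now_free {
            self.slots[index] = Slot::Vacant { next_free: self.first_free };
            self.first_free = Some(index);
        }
        now_free
    }

    /// Number of slots ever created, i.e. the register space needed for preservations.
    fn len_allocated(&self) -> usize {
        self.slots.len()
    }
}
```

In this sketch the slot index plays the role of the preservation register, so a recycled slot directly translates into a reused register.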

Plan & Steps

  • Define the new register-machine based bytecode for the new wasmi executor.
    • Branch instructions
      • Unconditional Branches
      • Conditional Branches
      • Branch Table
    • Trap Instructions, e.g. Wasm unreachable instruction
    • Wasm select and select <ty> Instructions
    • Return Instructions
      • Unconditional Returns
      • Conditional Returns
    • Call Instructions
      • Nested Calls
        • Internal Function Calls
        • Imported Function Calls
        • Indirect Function Calls
      • Tail Calls
        • Internal Function Calls
        • Imported Function Calls
        • Indirect Function Calls
    • Comparison Instructions (or equivalents)
      • i32.{eq, eqz, ne, lt_{s|u}, le_{s|u}, gt_{s|u}, ge_{s|u}} instructions
      • i64.{eq, eqz, ne, lt_{s|u}, le_{s|u}, gt_{s|u}, ge_{s|u}} instructions
      • f32.{eq, ne, lt, le, gt, ge} instructions
      • f64.{eq, ne, lt, le, gt, ge} instructions
    • Memory load Instructions (or equivalents)
      • i32.load and i32.loadN_{s|u} instructions
      • i64.load and i64.loadN_{s|u} instructions
      • f32.load instruction
      • f64.load instruction
    • Memory store instructions (or equivalents)
      • i32.store and i32.storeN instructions
      • i64.store and i64.storeN instructions
      • f32.store instruction
      • f64.store instruction
    • Compute Instructions (or equivalents)
      • i32 compute instructions, e.g. i32.popcnt, i32.add, i32.rotl etc.
      • i64 compute instructions, e.g. i64.popcnt, i64.add, i64.rotl etc.
      • f32 compute instructions, e.g. f32.sqrt, f32.add, etc.
      • f64 compute instructions, e.g. f64.sqrt, f64.add, etc.
    • Conversion Instructions
      • MVP conversion instructions
      • sign-extension proposal instructions
      • non-trapping float-to-int conversion proposal instructions
    • Global Variable Instructions (or equivalents)
      • global.get
      • global.set (and immediate versions)
    • Wasm table instructions
      • table.size instruction
      • table.grow instruction
      • table.get instruction
      • table.set instruction
      • table.fill instruction
      • table.copy instruction
      • table.init instruction
    • Wasm memory instructions
      • memory.size instruction
      • memory.grow instruction
      • memory.fill instruction
      • memory.copy instruction
      • memory.init instruction
  • Translation of Wasm bytecode to register-machine based wasmi bytecode
    • Note: This includes translation as well as proper unit tests coverage for all wasmi instructions.
    • Control Flow
      • block control flow
      • loop control flow
      • if control flow
    • Branches
      • Unconditional Branches
      • Conditional Branches
      • Branch Table: br_table
    • Returns
    • Call Instructions
      • Nested Calls
        • Internal Function Calls
        • Imported Function Calls
        • Indirect Function Calls
      • Tail Calls
        • Internal Function Calls
        • Imported Function Calls
        • Indirect Function Calls
    • Select Instruction
      • Untyped select (from Wasm MVP)
      • Typed select (result <ty>) (from reference-types proposal)
    • Wasm drop instruction
    • Wasm unreachable instruction
    • Wasm local.set and local.tee
    • Comparison Instructions (or equivalents)
      • i32.{eq, ne, eqz} instructions
      • i64.{eq, ne, eqz} instructions
      • f32.{eq, ne} instructions
      • f64.{eq, ne} instructions
      • i32.{lt_s, lt_u, le_s, le_u, gt_s, gt_u, ge_s, ge_u} instructions
      • i64.{lt_s, lt_u, le_s, le_u, gt_s, gt_u, ge_s, ge_u} instructions
      • f32.{lt, le, gt, ge} instructions
      • f64.{lt, le, gt, ge} instructions
    • Memory load Instructions (or equivalents)
      • i32.load and i32.loadN_{s|u} instructions
      • i64.load and i64.loadN_{s|u} instructions
      • f32.load instruction
      • f64.load instruction
    • Memory store instructions (or equivalents)
      • i32.store and i32.storeN instructions
      • i64.store and i64.storeN instructions
      • f32.store instruction
      • f64.store instruction
    • Compute instructions
      • Unary instructions
        • i32.{clz, ctz, popcnt}
        • i64.{clz, ctz, popcnt}
        • f32.{abs, neg, ceil, floor, trunc, nearest, sqrt}
        • f64.{abs, neg, ceil, floor, trunc, nearest, sqrt}
      • Commutative instructions
        • i32.{add, mul, and, or, xor}
        • i64.{add, mul, and, or, xor}
        • f32.{add, mul, min, max}
        • f64.{add, mul, min, max}
      • Non-commutative instructions
        • i32.sub
        • i64.sub
        • f32.{sub, div, copysign}
        • f64.{sub, div, copysign}
      • Shift instructions
        • i32.{shl, shr_s, shr_u, rotl, rotr}
        • i64.{shl, shr_s, shr_u, rotl, rotr}
      • Divide and remainder instructions
        • i32.{div_u, div_s, rem_u, rem_s}
        • i64.{div_u, div_s, rem_u, rem_s}
    • Conversion Instructions
      • MVP conversion instructions
      • sign-extension proposal instructions
      • non-trapping float-to-int conversion proposal instructions
    • Global Variable Instructions (or equivalents)
      • global.get instruction
      • global.set instruction
    • Wasm reftype instructions
      • ref.null
      • ref.is_null
      • ref.func
    • Wasm table instructions
      • table.size instruction
      • table.grow instruction
      • table.get instruction
      • table.set instruction
      • table.fill instruction
      • table.copy instruction
      • table.init instruction
      • elem.drop instruction
    • Wasm memory instructions
      • memory.size instruction
      • memory.grow instruction
      • memory.fill instruction
      • memory.copy instruction
      • memory.init instruction
      • data.drop instruction
  • Execution of wasmi register-machine bytecode
    • Trap instruction
    • Consume fuel instruction
    • Branch instructions
    • Return instructions
    • Copy instructions
    • Select instructions
    • Wasm ref.func instruction
    • Call instructions
      • Internal calls
        • Nested
        • Tail
      • Imported calls
        • Nested
        • Tail
      • Indirect calls
        • Nested
        • Tail
      • Host calls
    • Wasm load instructions
    • Wasm store instructions
    • Wasm table instructions
      • table.get instructions
      • table.set instructions
      • table.size instruction
      • table.copy instruction
      • table.init instruction
      • table.fill instruction
      • table.grow instruction
      • elem.drop instruction
    • Wasm memory instructions
      • memory.size instruction
      • memory.copy instruction
      • memory.init instruction
      • memory.fill instruction
      • memory.grow instruction
      • data.drop instruction
    • Wasm global instructions
      • global.get instructions
      • global.set instructions
    • Unary instructions
      • i32 instructions
      • i64 instructions
      • f32 instructions
      • f64 instructions
    • Comparison instructions
    • Shift & rotate instructions
    • Binary instructions
      • i32 instructions
      • i64 instructions
      • f32 instructions
      • f64 instructions
    • Conversion instructions

Unresolved Questions

Ideas

The following list contains ideas that came up and might be iterated on here for experimentation purposes.

  • Wasm br 0 in Wasm basic block control frames should not be translated as a branch and instead as a Wasm end of the basic block since all code after the br 0 is unreachable. Fortunately this bytecode sequence seems to not be very common in practical Wasm blobs.
  • Use inline encoding for 64-bit immediate instructions.
    • Note: At the moment 64-bit encoded immediates use ConstRef in order to store the actual value in a const pool which is external to the bytecode itself. This has several downsides for performance and also for bytecode integrity. Performance is affected since an additional indirect memory fetch is required in order to compute on the constant value. Bytecode integrity is worse since analysing wasmi bytecode now also needs to inspect the external const pool. The latter point affects testability of wasmi bytecode. A way to improve this situation in both areas is to store the 64-bit constants inline. The problem is that this requires 8 bytes, but instruction words only support up to 6 bytes of parameters per word. Thus the 64-bit constant needs to be split up into multiple pieces. The experiment in this GitHub Gist shows that a solution that splits the 64-bit constant value into 3 pieces (2 x 2-byte and 1 x 4-byte pieces) can be done efficiently on x86 and Wasm platforms. However, further benchmark tests are required to prove this.
  • For {i32, i64}.{div_u, div_s, rem_u, rem_s} we can apply a bytecode encoding optimization for the cases where the right-hand side divisor is a constant value. During translation we guarantee for all those operations that the right-hand side constant value is non-zero due to the fact that a zero right-hand side value is translated as a trap instruction during translation. Therefore we can replace the Const32 or Const16 parameter of those instructions with a NonZero32 or NonZero16 value and use Rust's built-in Div<NonZeroU{32,64}> for u{32,64} and Rem<NonZeroU{32,64}> for u{32,64}. Note that those APIs are only available for unsigned integers.
  • Add op-assign instructions.
    • Note: The idea is to introduce op-assign instructions such as i32.add_assign. The advantage is that we no longer have to store both the result and lhs fields since they are always the same and always of type Register. This also allows for inline Const32 or ConstRef rhs fields and thus we could save some encoding space, too. Another benefit is that the instruction is probably also a bit faster at execution since the compiler has more information about which registers to read from and write to. However, this has to be checked with experiments. A downside of having these op-assign instructions is that they do not play well with local.set and local.tee optimizations where we replace the result register of the previous instruction during the translation phase.
    • Example instructions for AddAssign or += operator:
      • I32AddAssign(UnaryInstr)
      • I32AddAssignImm(UnaryInstrImm32): Requires just 1 Instruction for encoding.
      • I64AddAssign(UnaryInstr)
      • I64AddAssignImm(UnaryInstrImm): Requires just 1 Instruction for encoding.
      • I64AddAssignImm32(UnaryInstrImm32): 32-bit small value optimization instead of 16-bit
      • F32AddAssign(UnaryInstr)
      • F32AddAssignImm(UnaryInstrImm32): Requires just 1 Instruction for encoding.
      • F64AddAssign(UnaryInstr)
      • F64AddAssignImm(UnaryInstrImm): Requires just 1 Instruction for encoding.
  • Fuse comparison and branch instructions as proposed in this issue.
    • Note: The idea behind this optimization is that conditional branch instructions often follow comparison instructions and thus this sequence can easily be identified and optimized. Furthermore both instructions are very common in execution hot paths, for example in loops and thus optimizing them has greater effects.
  • Introduce global.op_assign instructions
    • These special instructions act as an optimization on Wasm instruction sequences such as global.get g; i32.add; global.set g which we could then represent using a single wasmi instruction such as global.i32.add g r or global.i32.add_imm g c where g represents a global, r represents an input register and c a constant value. Further research is needed to find out how common these sequences are and if the proposed global.op_assign instructions actually improve execution performance significantly.
    • Note: It is likely that global variables are very sparsely used and thus this optimization likely won't improve performance too much while adding tons of new special instructions to wasmi bytecode. For example spidermonkey.wasm contains exactly a single global variable that is only sparsely used throughout the Wasm file.
  • Add pointer arithmetic to load and store instructions.
    • Note: Fuse instruction sequences for load and store instructions that oddly compute a ptr+offset or ptr*scale + offset or even ptr shift scale + offset outside of Wasm's load and store instructions. Technically we can make load and store instructions very powerful by including simple pointer arithmetic into these instructions.
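
For the inline 64-bit immediate idea in the list above, a minimal sketch of the 3-piece split could look as follows. The piece order (low 32 bits first, then two 16-bit pieces) and the function names are assumptions made for illustration; the linked Gist may use a different layout.

```rust
/// Splits a 64-bit constant into one 4-byte and two 2-byte pieces.
/// Assumed layout: (bits 0..32, bits 32..48, bits 48..64).
fn split_u64(value: u64) -> (u32, u16, u16) {
    (value as u32, (value >> 32) as u16, (value >> 48) as u16)
}

/// Reassembles the 64-bit constant from the three pieces produced by `split_u64`.
fn join_u64(lo: u32, mid: u16, hi: u16) -> u64 {
    u64::from(lo) | (u64::from(mid) << 32) | (u64::from(hi) << 48)
}
```

Signed or floating-point 64-bit constants could go through the same functions via their bit patterns, for example `f64::to_bits` and `f64::from_bits`.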

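For the non-zero divisor idea in the list above, the following sketch shows the Rust standard library operators it relies on. The function names are hypothetical; only `NonZeroU32` and the `Div`/`Rem` implementations for `u32` come from the standard library.

```rust
use std::num::NonZeroU32;

/// Translation time: a zero divisor yields `None`, which stands for
/// "emit a trap instruction instead of a division".
fn translate_divisor(rhs: u32) -> Option<NonZeroU32> {
    NonZeroU32::new(rhs)
}

/// Execution time: `u32 / NonZeroU32` needs no division-by-zero check.
fn exec_i32_div_u_imm(lhs: u32, rhs: NonZeroU32) -> u32 {
    lhs / rhs
}

/// Execution time: `u32 % NonZeroU32` needs no division-by-zero check either.
fn exec_i32_rem_u_imm(lhs: u32, rhs: NonZeroU32) -> u32 {
    lhs % rhs
}
```

The signed variants have no such operator implementations, which matches the note that these APIs are only available for unsigned integers.
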
@paritytech-cicd-pr

paritytech-cicd-pr commented May 29, 2023

BENCHMARKS

| BENCHMARK | NATIVE MASTER | NATIVE PR | NATIVE DIFF | WASMTIME MASTER | WASMTIME PR | WASMTIME DIFF | WASMTIME OVERHEAD |
|---|---|---|---|---|---|---|---|
| execute/bare_call_0 | 1.53ms | 1.53ms | 🔴 0.00% | 1.07ms | 1.11ms | 🔴 3.56% | 🟢 -28% |
| execute/bare_call_0/typed | 1.16ms | 1.18ms | 🔴 1.82% | 706.50µs | 720.65µs | 🔴 1.99% | 🟢 -39% |
| execute/bare_call_1 | 1.58ms | 1.59ms | 🔴 0.86% | 1.14ms | 1.27ms | 🔴 11.45% | 🟢 -20% |
| execute/bare_call_16 | 2.53ms | 2.46ms | 🔴 -2.43% | 3.25ms | 3.23ms | 🔴 -0.69% | 🟢 31% |
| execute/bare_call_16/typed | 1.54ms | 1.58ms | 🔴 2.56% | 1.59ms | 1.65ms | 🔴 3.73% | 🟢 5% |
| execute/bare_call_1/typed | 1.24ms | 1.24ms | ⚪ 0.22% | 910.24µs | 1.04ms | 🔴 14.59% | 🟢 -16% |
| execute/bare_call_4 | 1.74ms | 1.76ms | ⚪ 0.88% | 1.53ms | 1.71ms | 🔴 11.70% | 🟢 -3% |
| execute/bare_call_4/typed | 1.23ms | 1.21ms | ⚪ -1.67% | 883.66µs | 985.02µs | 🔴 11.47% | 🟢 -19% |
| execute/br_table | 1.32ms | 1.39ms | 🔴 6.08% | 1.08ms | 1.26ms | 🔴 17.42% | 🟢 -9% |
| execute/count_until | 678.27µs | 575.90µs | 🟢 -15.44% | 1.33ms | 1.61ms | 🔴 20.56% | 🔴 179% |
| execute/factorial_iterative | 319.88µs | 318.53µs | ⚪ -0.46% | 520.97µs | 519.45µs | ⚪ -0.78% | 🟡 63% |
| execute/factorial_recursive | 489.01µs | 491.65µs | ⚪ 0.71% | 660.32µs | 673.63µs | 🔴 1.97% | 🟢 37% |
| execute/fibonacci_iter | 1.43ms | 1.39ms | 🟢 -2.81% | 2.75ms | 2.64ms | 🟢 -3.91% | 🟡 89% |
| execute/fibonacci_rec | 3.99ms | 3.97ms | ⚪ -0.46% | 6.13ms | 6.21ms | 🔴 1.49% | 🟡 57% |
| execute/fibonacci_tail | 860.29µs | 861.49µs | ⚪ 0.20% | 1.59ms | 1.59ms | ⚪ -0.06% | 🟡 85% |
| execute/global_bump | 740.65µs | 728.83µs | 🟢 -1.58% | 1.64ms | 1.60ms | 🟢 -2.74% | 🔴 119% |
| execute/global_const | 661.16µs | 682.22µs | 🔴 3.09% | 1.33ms | 1.39ms | 🔴 4.22% | 🔴 104% |
| execute/host_calls | 37.16µs | 37.16µs | ⚪ 0.20% | 38.43µs | 39.34µs | 🔴 2.39% | 🟢 6% |
| execute/memory_fill | 1.21ms | 1.21ms | ⚪ -0.27% | 2.32ms | 2.31ms | ⚪ -0.47% | 🟡 90% |
| execute/memory_sum | 1.18ms | 1.14ms | 🟢 -3.71% | 2.28ms | 2.28ms | ⚪ -0.04% | 🔴 101% |
| execute/memory_vec_add | 2.46ms | 2.35ms | ⚪ -3.93% | 4.72ms | 4.81ms | 🔴 1.86% | 🔴 104% |
| execute/recursive_is_even | 668.98µs | 669.16µs | ⚪ -0.24% | 979.45µs | 982.57µs | ⚪ 0.48% | 🟢 47% |
| execute/recursive_ok | 94.69µs | 93.61µs | ⚪ -1.02% | 144.35µs | 142.94µs | ⚪ -0.90% | 🟡 53% |
| execute/recursive_scan | 129.76µs | 129.30µs | ⚪ -0.82% | 192.89µs | 198.47µs | 🔴 2.89% | 🟡 54% |
| execute/recursive_trap | 8.94µs | 8.72µs | 🟢 -2.50% | 13.87µs | 13.80µs | ⚪ -0.59% | 🟡 58% |
| execute/regex_redux | 456.90µs | 460.20µs | ⚪ 0.71% | 825.43µs | 837.37µs | 🔴 1.48% | 🟡 82% |
| execute/rev_complement | 424.62µs | 419.57µs | ⚪ -1.10% | 807.47µs | 821.58µs | 🔴 1.71% | 🟡 96% |
| execute/tiny_keccak | 323.73µs | 321.26µs | ⚪ -0.73% | 666.89µs | 724.40µs | 🔴 8.39% | 🔴 125% |
| execute/trunc_f2i | 732.18µs | 727.73µs | ⚪ -0.57% | 1.56ms | 1.53ms | 🟢 -1.38% | 🔴 111% |
| instantiate/wasm_kernel | 57.06µs | 55.66µs | ⚪ -2.26% | 52.73µs | 54.63µs | 🔴 2.95% | 🟢 -2% |
| translate/erc1155 | 198.49µs | 210.88µs | 🔴 6.23% | 334.97µs | 361.77µs | 🔴 7.93% | 🟡 72% |
| translate/erc20 | 97.16µs | 104.12µs | 🔴 6.78% | 162.45µs | 175.14µs | 🔴 7.94% | 🟡 68% |
| translate/erc721 | 137.41µs | 147.05µs | 🔴 6.74% | 232.44µs | 253.96µs | 🔴 9.27% | 🟡 73% |
| translate/spidermonkey | 61.89ms | 65.40ms | 🔴 5.57% | 0.00ns | 0.00ns | 🔴 7.95% | 🟢 -100% |
| translate/wasm_kernel | 4.15ms | 4.40ms | 🔴 5.95% | 6.02ms | 6.52ms | 🔴 8.26% | 🟢 48% |

Link to pipeline

@codecov-commenter

codecov-commenter commented May 29, 2023

Codecov Report

Merging #729 (283b121) into master (837bfdb) will increase coverage by 1.65%.
The diff coverage is 85.50%.

@@            Coverage Diff             @@
##           master     #729      +/-   ##
==========================================
+ Coverage   79.42%   81.07%   +1.65%     
==========================================
  Files         105      270     +165     
  Lines        9075    23217   +14142     
==========================================
+ Hits         7208    18824   +11616     
- Misses       1867     4393    +2526     
Files Changed Coverage Δ
crates/cli/src/args.rs 0.00% <0.00%> (ø)
crates/cli/src/context.rs 0.00% <0.00%> (ø)
crates/cli/src/main.rs 0.00% <0.00%> (ø)
crates/core/src/trap.rs 51.47% <ø> (ø)
crates/wasmi/src/engine/executor.rs 80.30% <ø> (+0.24%) ⬆️
crates/wasmi/src/engine/func_builder/error.rs 4.44% <0.00%> (-2.46%) ⬇️
...ates/wasmi/src/engine/func_builder/inst_builder.rs 80.64% <ø> (ø)
crates/wasmi/src/engine/regmach/bytecode/mod.rs 0.00% <0.00%> (ø)
.../wasmi/src/engine/regmach/tests/op/cmp/i32_gt_s.rs 100.00% <ø> (ø)
.../wasmi/src/engine/regmach/tests/op/cmp/i32_gt_u.rs 100.00% <ø> (ø)
... and 182 more

... and 1 file with indirect coverage changes


@Robbepop Robbepop added the research Research intense work item. label May 30, 2023
@Robbepop Robbepop changed the title [experimental] Register machine wasmi execution engine (take 2) WIP: Register machine wasmi execution engine (take 2) Jun 22, 2023
@Robbepop Robbepop mentioned this pull request Jun 27, 2023
10 tasks
@yamt yamt mentioned this pull request Jul 24, 2023
All call instructions now uniformly require their parameters to be placed in contiguous register spans. This necessitates copy instructions before a call is initiated in some cases. Future plans include to optimise longer sequences of copy instructions but we left that optimisation out for now.
They now have the same form as their nested call counterparts.
This is missing translation tests for now.
@Robbepop
Member Author

Robbepop commented Sep 21, 2023

As discussed with @athei I will merge this PR now and start working on the remaining TODO items in isolation via follow-up PRs. For this process I am going to write a bunch of issues to track their progress.

The register-machine backend introduced by this PR passes the entire Wasm spec testsuite. However, this does not mean it is bug free and stable. I actually am aware of a few bugs that are going to be fixed soon.

The roadmap of wasmi's new register machine is as follows:

  1. Merge of this PR. (today)
  2. Implementing all the TODOs and fixing the bugs.
  3. Releasing a new wasmi version. This is going to be the last version where the stack-machine is still the default engine.
  4. Switch the default engine to the new register-machine.
  5. Release yet another wasmi version. This is going to be a temporary release to make transitioning simpler.
  6. Remove the old stack-machine engine. We do not plan to support multiple engines longer than we need to. Additionally there is some Wasm translation overhead (~8-9%) from just having multiple engines. Also, feature development is a lot simpler with just a single engine.

Labels
research Research intense work item.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add register machine based wasmi engine backend
4 participants