Register machine `wasmi` execution engine #729

Robbepop · 2023-05-29T19:11:42Z

Closes #361.

Precursor: #367

ToDo

These items are things we want to do before merging the PR:

Simplify call argument encoding for function calls with exactly 1 argument.

For function calls we need to set up the arguments so that they are all stored in contiguous registers. Currently we also do this for function calls with exactly 1 argument even though it is not needed, since a single register is trivially contiguous. Thus during translation we should implement a check and simplify (and thus optimize) function call argument encoding for those function calls.
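A minimal sketch of such a check, not the actual wasmi implementation: the types `Register`, `RegisterSpan` and `CopyInstr` and the function `place_call_args` are hypothetical names introduced for illustration, and the sketch assumes all arguments already live in registers (constants are ignored). Copies are only emitted when the arguments are not already contiguous, which is never the case for a single argument.

```rust
/// Hypothetical register index; the real wasmi type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Register(u16);

/// A run of `len` contiguous registers starting at `head`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RegisterSpan {
    head: Register,
    len: u16,
}

/// Illustrative copy instruction: `result <- value`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CopyInstr {
    result: Register,
    value: Register,
}

/// Returns the span holding the call arguments.
///
/// If `args` is already contiguous (trivially true for exactly one
/// argument) the existing registers are reused and no copy is emitted.
/// Otherwise every argument is copied into the registers starting at
/// `next_free`, which the caller is assumed to have reserved.
fn place_call_args(
    args: &[Register],
    next_free: Register,
    copies: &mut Vec<CopyInstr>,
) -> RegisterSpan {
    let len = args.len() as u16;
    let contiguous = args
        .windows(2)
        .all(|w| w[0].0.checked_add(1) == Some(w[1].0));
    match args.first() {
        Some(&head) if contiguous => RegisterSpan { head, len },
        _ => {
            for (offset, &value) in args.iter().enumerate() {
                let result = Register(next_free.0 + offset as u16);
                copies.push(CopyInstr { result, value });
            }
            RegisterSpan { head: next_free, len }
        }
    }
}
```

With a single argument in register 7 the function returns a span of length 1 starting at register 7 and pushes no copy instruction. With arguments in registers 3 and 1 it emits two copies into the fresh span.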

Properly implement translation of ConsumeFuel instructions and their fixed costs.

So far we have concentrated on getting the Wasm to wasmi bytecode translation up and running with the simplest setup possible. This means we have ignored translation of ConsumeFuel instructions and their associated fuel costs, and we need to do this work before we merge the PR. There should not be any difficulties compared to the already existing implementation for the stack-machine engine backend.

Fix attack vector of local.set stack preservation:

Currently for every local.set or local.tee that we encounter while translating Wasm bytecode to wasmi bytecode we iterate over all values on the emulated value stack. Usually this isn't a big deal since the stack usually doesn't grow big in practical workloads. However, this can easily be attacked by a Wasm blob that blows up the stack and then performs tons of local.set instructions. We can eliminate this attack vector by storing the indices of the local.get providers on the emulated stack in a separate stack and iterating over that instead. While iterating we also remove the local.get indices from that separate stack so that a consecutive local.set operation will see an empty stack and thus perform no operations. In effect we cache the local.get providers this way.
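The side-stack idea can be sketched as follows. This is an illustration under assumptions, not the code from this PR: `Provider`, `EmulatedStack` and `preserve_locals` are hypothetical names, every drained local.get provider simply receives a fresh preservation slot, and the copy instructions a real translator would emit are omitted.

```rust
/// Hypothetical stack entry kinds of the emulated value stack.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provider {
    /// Pushed by `local.get`; refers to the local directly.
    Local(u32),
    /// A preserved copy; the payload is an illustrative slot id.
    Preserved(u32),
    /// Any other temporary value.
    Temp(u32),
}

#[derive(Default)]
struct EmulatedStack {
    values: Vec<Provider>,
    /// Side stack: positions in `values` that currently hold a `Local`.
    local_positions: Vec<usize>,
    next_slot: u32,
}

impl EmulatedStack {
    fn push(&mut self, provider: Provider) {
        if matches!(provider, Provider::Local(_)) {
            self.local_positions.push(self.values.len());
        }
        self.values.push(provider);
    }

    fn pop(&mut self) -> Option<Provider> {
        let provider = self.values.pop()?;
        // `values.len()` is now the position of the popped entry.
        if self.local_positions.last() == Some(&self.values.len()) {
            self.local_positions.pop();
        }
        Some(provider)
    }

    /// Called for `local.set` / `local.tee`. Visits only the recorded
    /// `local.get` positions and drains them, so an immediately
    /// following call does no work. Returns the number of visits.
    fn preserve_locals(&mut self) -> usize {
        let visited = self.local_positions.len();
        for position in self.local_positions.drain(..) {
            self.values[position] = Provider::Preserved(self.next_slot);
            self.next_slot += 1;
        }
        visited
    }
}
```

The work done per local.set is bounded by the number of local.get providers pushed since the previous local.set, independent of the total height of the emulated stack.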

Improve local.set result replacement optimization when preserving local.get at the same time:

When translating local.set or local.tee we replace the result register of the previous instruction instead of emitting a copy instruction if possible. However, when preserving a local.get on the emulation stack we do not perform this optimization. The problem is that the copy instruction required for the preservation has to take place before the instruction i that should have its result register replaced. However, the instruction i is already encoded and could in the worst case consist of multiple instruction words, which would require shifting already encoded instructions by one index. In theory this could also interfere with pre-calculated branch offsets, but so far this has not been demonstrated and might not be true.

Avoid expensive register space defragmentation if possible.

After successful translation of Wasm bytecode the wasmi translation performs a final pass over all encoded instructions to defragment the register space. This is needed for registers that have been preserved for local.set in certain situations. However, this entire process is very costly and can be avoided entirely or partially. We can avoid it entirely if there were no register preservations during the translation procedure, since only then is there nothing to defragment. Otherwise we only need to defragment the instructions that have been encoded after encountering the first local.set register preservation. This way we can keep the simple loop over instructions but still avoid most of the unnecessary work and thus speed up translation.
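A rough sketch of this restriction, assuming a simplified three-register instruction format: `Instr`, `Encoder`, `notify_preservation` and `defragment` are hypothetical names chosen for illustration and do not mirror the actual InstrEncoder API. The encoder remembers the index of the first instruction encoded after a preservation and only rewrites registers from that index onwards.

```rust
/// Simplified instruction with three register operands (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Instr {
    result: i16,
    lhs: i16,
    rhs: i16,
}

#[derive(Default)]
struct Encoder {
    instrs: Vec<Instr>,
    /// Index of the first instruction encoded after the first
    /// register preservation, or `None` if none happened.
    first_preserved: Option<usize>,
}

impl Encoder {
    fn push(&mut self, instr: Instr) {
        self.instrs.push(instr);
    }

    /// Records where preservation started; later calls are no-ops.
    fn notify_preservation(&mut self) {
        if self.first_preserved.is_none() {
            self.first_preserved = Some(self.instrs.len());
        }
    }

    /// Applies `remap` to the registers of all instructions that could
    /// refer to preserved registers. Returns how many were visited.
    fn defragment(&mut self, remap: impl Fn(i16) -> i16) -> usize {
        let start = match self.first_preserved {
            Some(start) => start,
            None => return 0, // no preservation: skip the pass entirely
        };
        let tail = &mut self.instrs[start..];
        for instr in tail.iter_mut() {
            instr.result = remap(instr.result);
            instr.lhs = remap(instr.lhs);
            instr.rhs = remap(instr.rhs);
        }
        tail.len()
    }
}
```

Instructions encoded before the first preservation cannot mention a preservation register, so skipping them does not change the result of the pass.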

Recycle preservation register slots if possible.

Currently, when preserving local.get x values on the stack upon a local.set x translation, a new register is reserved for the N local.get x values that have been found and replaced on the emulated value stack at the time of translation. When this happens multiple times, a new register slot on the preservation stack is registered each time. Right now, we do not track how many of the preserved local.get x values have already been used. However, if we did this we could recycle no longer used preservation register slots instead of allocating a new slot every time, which could lead to fewer registers being used, especially by larger functions. A data structure that would allow this efficiently in O(1) is the so-called Stash data structure.
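For readers unfamiliar with the term, a stash can be sketched as a vector whose vacant entries form an intrusive free list. The following is a generic illustration, not code from this PR: `Stash`, `put`, `take` and `slots` are names chosen here, and in the preservation use case the stored value would presumably be a use counter for the slot.

```rust
/// One entry of the stash: either a value or a link in the free list.
enum Entry<T> {
    Vacant { next_free: Option<usize> },
    Occupied(T),
}

/// Minimal stash: O(1) insert and remove with index reuse.
struct Stash<T> {
    entries: Vec<Entry<T>>,
    /// Head of the free list of vacant entries.
    free_head: Option<usize>,
}

impl<T> Stash<T> {
    fn new() -> Self {
        Stash { entries: Vec::new(), free_head: None }
    }

    /// Stores `value` and returns its slot, reusing a vacant slot if any.
    fn put(&mut self, value: T) -> usize {
        match self.free_head {
            Some(index) => {
                let old = std::mem::replace(&mut self.entries[index], Entry::Occupied(value));
                self.free_head = match old {
                    Entry::Vacant { next_free } => next_free,
                    Entry::Occupied(_) => unreachable!("free list points at an occupied slot"),
                };
                index
            }
            None => {
                self.entries.push(Entry::Occupied(value));
                self.entries.len() - 1
            }
        }
    }

    /// Removes the value at `index` and makes the slot reusable.
    fn take(&mut self, index: usize) -> Option<T> {
        if !matches!(self.entries.get(index), Some(Entry::Occupied(_))) {
            return None;
        }
        let vacant = Entry::Vacant { next_free: self.free_head };
        let old = std::mem::replace(&mut self.entries[index], vacant);
        self.free_head = Some(index);
        match old {
            Entry::Occupied(value) => Some(value),
            Entry::Vacant { .. } => None,
        }
    }

    /// Number of slots ever allocated (an upper bound on registers used).
    fn slots(&self) -> usize {
        self.entries.len()
    }
}
```

Once the last use of a preserved value has been translated, calling `take` on its slot lets the next preservation reuse the same register instead of growing the preservation space.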

Plan & Steps

Unresolved Questions

Ideas

The following list contains ideas that came up and might be iterated on here for experimentation purposes.

Wasm br 0 in Wasm basic block control frames should not be translated as a branch and instead as a Wasm end of the basic block since all code after the br 0 is unreachable. Fortunately this bytecode sequence seems to not be very common in practical Wasm blobs.
Use inline encoding for 64-bit immediate instructions.
- Note: At the moment 64-bit encoded immediates use ConstRef in order to store the actual value in a const pool which is external to the bytecode itself. This has several downsides for performance and also for bytecode integrity. Performance is affected since an additional indirect memory fetch is required in order to compute on the constant value. Bytecode integrity is worse since analysing wasmi bytecode now also needs to inspect the external const pool. The latter point affects testability of wasmi bytecode. A way to improve this situation in both areas is to store the 64-bit constants inline. The problem is that this requires 8 bytes but instruction words only support up to 6 bytes of parameters per word. Thus there is a need to split up the 64-bit constant into multiple pieces. The experiment in this GitHub Gist shows that a solution that splits the 64-bit constant value into 3 pieces (2 x 2-byte and 1 x 4-byte pieces) can be done efficiently on x86 and Wasm platforms. However, further benchmark tests are required to prove this.
For {i32, i64}.{div_u, div_s, rem_u, rem_s} we can apply a bytecode encoding optimization for the cases where the right-hand side divisor is a constant value. During translation we guarantee for all those operations that the right-hand side constant value is non-zero due to the fact that a zero right-hand side value is translated as a trap instruction during translation. Therefore we can replace the Const32 or Const16 parameter of those instructions with a NonZero32 or NonZero16 value and use Rust's built-in Div<NonZeroU{32,64}> for u{32,64} and Rem<NonZeroU{32,64}> for u{32,64}. Note that those APIs are only available for unsigned integers.
Add op-assign instructions.
- Note: The idea is to introduce op-assign instructions such as i32.add_assign. The advantage is that we no longer have to store both the result and lhs fields since they are always the same and always of type Register. This also allows for inline Const32 or ConstRef rhs fields and thus we could save some encoding space, too. Another benefit is that the instruction is probably also a bit faster at execution since the compiler has more information about which registers to read from and write to. However, this has to be checked with experiments. A downside of having these op-assign instructions is that they do not play well with local.set and local.tee optimizations where we replace the result register of the previous instruction during the translation phase.
- Example instructions for AddAssign or += operator:
  - I32AddAssign(UnaryInstr)
  - I32AddAssignImm(UnaryInstrImm32): Requires just 1 Instruction for encoding.
  - I64AddAssign(UnaryInstr)
  - I64AddAssignImm(UnaryInstrImm): Requires just 1 Instruction for encoding.
  - I64AddAssignImm32(UnaryInstrImm32): 32-bit small value optimization instead of 16-bit
  - F32AddAssign(UnaryInstr)
  - F32AddAssignImm(UnaryInstrImm32): Requires just 1 Instruction for encoding.
  - F64AddAssign(UnaryInstr)
  - F64AddAssignImm(UnaryInstrImm): Requires just 1 Instruction for encoding.
Fuse comparison and branch instructions as proposed in this issue.
- Note: The idea behind this optimization is that conditional branch instructions often follow comparison instructions and thus this sequence can easily be identified and optimized. Furthermore both instructions are very common in execution hot paths, for example in loops and thus optimizing them has greater effects.
Introduce global.op_assign instructions
- These special instructions act as an optimization on Wasm instruction sequences such as global.get g; i32.add; global.set g which we could then represent using a single wasmi instruction such as global.i32.add g r or global.i32.add_imm g c where g represents a global, r represents an input register and c a constant value. Further research is needed to find out how common these sequences are and whether the proposed global.op_assign instructions actually improve execution performance significantly.
- Note: It is likely that global variables are very sparsely used and thus this optimization likely won't improve performance too much while adding tons of new special instructions to wasmi bytecode. For example spidermonkey.wasm contains exactly a single global variable that is only sparsely used throughout the Wasm file.
Add pointer arithmetic to load and store instructions.
- Note: Fuse instruction sequences for load and store instructions that oddly compute a ptr+offset or ptr*scale + offset or even ptr shift scale + offset outside of Wasm's load and store instructions. Technically we can make load and store instructions very powerful by including simple pointer arithmetic into these instructions.
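Two of the ideas above lend themselves to a short sketch: splitting a 64-bit constant into 2 x 2-byte and 1 x 4-byte pieces, and dividing by a divisor that translation has proven to be non-zero. The function names and the piece order (high 16 bits, middle 16 bits, low 32 bits) are assumptions made for illustration and are not taken from the linked Gist or from wasmi. The only library feature relied on is the standard `Div<NonZeroU32>` and `Rem<NonZeroU32>` implementations for `u32`.

```rust
use std::num::NonZeroU32;

/// Splits a 64-bit constant into pieces that each fit into the
/// parameter bytes of an instruction word. The piece order is an
/// assumption of this sketch.
fn split_u64(value: u64) -> (u16, u16, u32) {
    let hi = (value >> 48) as u16;
    let mid = (value >> 32) as u16;
    let lo = value as u32;
    (hi, mid, lo)
}

/// Reassembles the constant from its three pieces at execution time.
fn join_u64(hi: u16, mid: u16, lo: u32) -> u64 {
    (u64::from(hi) << 48) | (u64::from(mid) << 32) | u64::from(lo)
}

/// Translation-time check for a constant divisor: `None` stands for
/// the zero case, which is translated as a trap instead.
fn translate_divisor(rhs: u32) -> Option<NonZeroU32> {
    NonZeroU32::new(rhs)
}

/// Executes `i32.div_u` with a divisor known to be non-zero.
fn exec_i32_div_u(lhs: u32, rhs: NonZeroU32) -> u32 {
    lhs / rhs
}

/// Executes `i32.rem_u` with a divisor known to be non-zero.
fn exec_i32_rem_u(lhs: u32, rhs: NonZeroU32) -> u32 {
    lhs % rhs
}
```

For example, `split_u64(0x0123_4567_89AB_CDEF)` yields `(0x0123, 0x4567, 0x89AB_CDEF)` and `join_u64` restores the original value. `translate_divisor(0)` returns `None`, while a divisor of 7 gives a quotient of 14 and a remainder of 2 for a left-hand side of 100.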

paritytech-cicd-pr · 2023-05-29T19:23:10Z

BENCHMARKS

	NATIVE			WASMTIME
BENCHMARK	MASTER	PR	DIFF	MASTER	PR	DIFF	WASMTIME OVERHEAD
`execute/` `bare_call_0`	1.53ms	1.53ms	🔴 0.00%	1.07ms	1.11ms	🔴 3.56%	🟢 -28%
`execute/` `bare_call_0/typed`	1.16ms	1.18ms	🔴 1.82%	706.50µs	720.65µs	🔴 1.99%	🟢 -39%
`execute/` `bare_call_1`	1.58ms	1.59ms	🔴 0.86%	1.14ms	1.27ms	🔴 11.45%	🟢 -20%
`execute/` `bare_call_16`	2.53ms	2.46ms	🔴 -2.43%	3.25ms	3.23ms	🔴 -0.69%	🟢 31%
`execute/` `bare_call_16/typed`	1.54ms	1.58ms	🔴 2.56%	1.59ms	1.65ms	🔴 3.73%	🟢 5%
`execute/` `bare_call_1/typed`	1.24ms	1.24ms	⚪ 0.22%	910.24µs	1.04ms	🔴 14.59%	🟢 -16%
`execute/` `bare_call_4`	1.74ms	1.76ms	⚪ 0.88%	1.53ms	1.71ms	🔴 11.70%	🟢 -3%
`execute/` `bare_call_4/typed`	1.23ms	1.21ms	⚪ -1.67%	883.66µs	985.02µs	🔴 11.47%	🟢 -19%
`execute/` `br_table`	1.32ms	1.39ms	🔴 6.08%	1.08ms	1.26ms	🔴 17.42%	🟢 -9%
`execute/` `count_until`	678.27µs	575.90µs	🟢 -15.44%	1.33ms	1.61ms	🔴 20.56%	🔴 179%
`execute/` `factorial_iterative`	319.88µs	318.53µs	⚪ -0.46%	520.97µs	519.45µs	⚪ -0.78%	🟡 63%
`execute/` `factorial_recursive`	489.01µs	491.65µs	⚪ 0.71%	660.32µs	673.63µs	🔴 1.97%	🟢 37%
`execute/` `fibonacci_iter`	1.43ms	1.39ms	🟢 -2.81%	2.75ms	2.64ms	🟢 -3.91%	🟡 89%
`execute/` `fibonacci_rec`	3.99ms	3.97ms	⚪ -0.46%	6.13ms	6.21ms	🔴 1.49%	🟡 57%
`execute/` `fibonacci_tail`	860.29µs	861.49µs	⚪ 0.20%	1.59ms	1.59ms	⚪ -0.06%	🟡 85%
`execute/` `global_bump`	740.65µs	728.83µs	🟢 -1.58%	1.64ms	1.60ms	🟢 -2.74%	🔴 119%
`execute/` `global_const`	661.16µs	682.22µs	🔴 3.09%	1.33ms	1.39ms	🔴 4.22%	🔴 104%
`execute/` `host_calls`	37.16µs	37.16µs	⚪ 0.20%	38.43µs	39.34µs	🔴 2.39%	🟢 6%
`execute/` `memory_fill`	1.21ms	1.21ms	⚪ -0.27%	2.32ms	2.31ms	⚪ -0.47%	🟡 90%
`execute/` `memory_sum`	1.18ms	1.14ms	🟢 -3.71%	2.28ms	2.28ms	⚪ -0.04%	🔴 101%
`execute/` `memory_vec_add`	2.46ms	2.35ms	⚪ -3.93%	4.72ms	4.81ms	🔴 1.86%	🔴 104%
`execute/` `recursive_is_even`	668.98µs	669.16µs	⚪ -0.24%	979.45µs	982.57µs	⚪ 0.48%	🟢 47%
`execute/` `recursive_ok`	94.69µs	93.61µs	⚪ -1.02%	144.35µs	142.94µs	⚪ -0.90%	🟡 53%
`execute/` `recursive_scan`	129.76µs	129.30µs	⚪ -0.82%	192.89µs	198.47µs	🔴 2.89%	🟡 54%
`execute/` `recursive_trap`	8.94µs	8.72µs	🟢 -2.50%	13.87µs	13.80µs	⚪ -0.59%	🟡 58%
`execute/` `regex_redux`	456.90µs	460.20µs	⚪ 0.71%	825.43µs	837.37µs	🔴 1.48%	🟡 82%
`execute/` `rev_complement`	424.62µs	419.57µs	⚪ -1.10%	807.47µs	821.58µs	🔴 1.71%	🟡 96%
`execute/` `tiny_keccak`	323.73µs	321.26µs	⚪ -0.73%	666.89µs	724.40µs	🔴 8.39%	🔴 125%
`execute/` `trunc_f2i`	732.18µs	727.73µs	⚪ -0.57%	1.56ms	1.53ms	🟢 -1.38%	🔴 111%
`instantiate/` `wasm_kernel`	57.06µs	55.66µs	⚪ -2.26%	52.73µs	54.63µs	🔴 2.95%	🟢 -2%
`translate/` `erc1155`	198.49µs	210.88µs	🔴 6.23%	334.97µs	361.77µs	🔴 7.93%	🟡 72%
`translate/` `erc20`	97.16µs	104.12µs	🔴 6.78%	162.45µs	175.14µs	🔴 7.94%	🟡 68%
`translate/` `erc721`	137.41µs	147.05µs	🔴 6.74%	232.44µs	253.96µs	🔴 9.27%	🟡 73%
`translate/` `spidermonkey`	61.89ms	65.40ms	🔴 5.57%	0.00ns	0.00ns	🔴 7.95%	🟢 -100%
`translate/` `wasm_kernel`	4.15ms	4.40ms	🔴 5.95%	6.02ms	6.52ms	🔴 8.26%	🟢 48%

Link to pipeline

codecov-commenter · 2023-05-29T19:25:52Z

Codecov Report

Merging #729 (283b121) into master (837bfdb) will increase coverage by 1.65%.
The diff coverage is 85.50%.

@@            Coverage Diff             @@
##           master     #729      +/-   ##
==========================================
+ Coverage   79.42%   81.07%   +1.65%     
==========================================
  Files         105      270     +165     
  Lines        9075    23217   +14142     
==========================================
+ Hits         7208    18824   +11616     
- Misses       1867     4393    +2526

Files Changed	Coverage Δ
crates/cli/src/args.rs	`0.00% <0.00%> (ø)`
crates/cli/src/context.rs	`0.00% <0.00%> (ø)`
crates/cli/src/main.rs	`0.00% <0.00%> (ø)`
crates/core/src/trap.rs	`51.47% <ø> (ø)`
crates/wasmi/src/engine/executor.rs	`80.30% <ø> (+0.24%)`	⬆️
crates/wasmi/src/engine/func_builder/error.rs	`4.44% <0.00%> (-2.46%)`	⬇️
...ates/wasmi/src/engine/func_builder/inst_builder.rs	`80.64% <ø> (ø)`
crates/wasmi/src/engine/regmach/bytecode/mod.rs	`0.00% <0.00%> (ø)`
.../wasmi/src/engine/regmach/tests/op/cmp/i32_gt_s.rs	`100.00% <ø> (ø)`
.../wasmi/src/engine/regmach/tests/op/cmp/i32_gt_u.rs	`100.00% <ø> (ø)`
... and 182 more

... and 1 file with indirect coverage changes


All call instructions now uniformly require their parameters to be placed in contiguous register spans. In some cases this necessitates copy instructions before a call is initiated. We plan to optimise longer sequences of copy instructions in the future but have left that optimisation out for now.

Robbepop · 2023-09-21T09:49:36Z

As discussed with @athei I will merge this PR now and start working on the remaining TODO items in isolation via follow-up PRs. For this process I am going to write a bunch of issues to track their progress.

The register-machine backend introduced by this PR passes the entire Wasm spec testsuite. However, this does not mean it is bug free and stable. I actually am aware of a few bugs that are going to be fixed soon.

The roadmap of wasmi's new register machine is as follows:

1. Merge of this PR. (today)
2. Implementing all the TODOs and fixing the bugs.
3. Releasing a new wasmi version. This is going to be the last version where the stack-machine is still the default engine.
4. Switch the default engine to the new register-machine.
5. Release yet another wasmi version. This is going to be a temporary release to make transitioning simpler.
6. Remove the old stack-machine engine. We do not plan to support multiple engines longer than we need to. Additionally there is some Wasm translation overhead (~8-9%) from just having multiple engines. Also feature development is a lot simpler with just a single engine.

Robbepop added the research label (Research intense work item.) May 30, 2023

This was referenced May 30, 2023

Implement register machine based bytecode #367

Closed

wasmi: Register-machine based Engine #731

Closed

Robbepop mentioned this pull request Jun 9, 2023

Add lazy Wasm compilation #732

Closed

yjhmelody reviewed Jun 16, 2023

crates/core/src/trap.rs (resolved)

Robbepop changed the title ~~[experimental] Register machine wasmi execution engine (take 2)~~ WIP: Register machine wasmi execution engine (take 2) Jun 22, 2023

Robbepop mentioned this pull request Jun 27, 2023

Add ResourceLimiter to wasmi::Store #728

Closed

10 tasks

Robbepop mentioned this pull request Jul 4, 2023

Optimization: Opcode fusion of branch and comparison instructions #712

Closed

yjhmelody reviewed Jul 13, 2023

crates/wasmi/src/engine/bytecode2/mod.rs (outdated, resolved)

Robbepop added 10 commits July 19, 2023 20:28

remove no longer needed Instruction::ConstRef

efd106f

improve some doc comments

3d6efcd

rename RegisterSlice to RegisterSpan

6f8d65a

initial implementation of Wasm call translation

4767250

refactor ProviderSliceAlloc to RegisterSliceAlloc

e9ed236

add proper translation for Wasm calls with more than 3 parameters

02709f2

fix intra doc link

7346510

add Return2, Return3 and ReturnNez2 instructions

19f160b

These are (probably) more efficient than their ReturnMany and ReturnNezMany respective counterparts because they store the returned registers inline.

add translation test for Wasm call translation

dfbc135

This also tests the new Return2 and Return3 instructions.

fix docs

1815c44

yamt mentioned this pull request Jul 24, 2023

Benchmarks yamt/toywasm#8

Open

Robbepop added 8 commits July 24, 2023 16:27

refactor return_call wasmi instructions

6e509fd

They now have the same form as their nested call counterparts.

remove commented out code

7967fee

refactor call_indirect wasmi instructions

2744281

add InstrEncoder::encode_call_params helper method

ae80862

add Wasm call_indirect translation

27f64db

This is missing translation tests for now.

remove WIP todo!()

25f6c33

add Wasm call_indirect translation tests

d1a1577

Robbepop added 19 commits September 12, 2023 17:49

add safety comments to unsafe blocks

01970a1

implement root host function calls

4144911

move code sections closer together

b522fbb

make it possible to choose the execution engine in wasmi_cli

073e1a0

re-enable all benchmarks and use the stack-machine

9cc4fd7

silence incorrect clippy warning

f5cdfe9

only reorder copy instruction if they overwrite each other

dc58fec

optimize local.set preservation defragmentation

87e666f

We do this by avoiding or at least limiting the procedure to a conservative subset of all instructions that could have been affected by the register space fragmentation.

fix bug in reset of notified_preservation in InstrEncoder

0cd1503

improve preservation notification API

a6b1588

reorder methods

ba769a4

introduce RegisterSpace abstraction

e1a64b8

fix potential attack vector with local.get preservation

b854853

make wabt_example test pass again

4674d70

remove unused method

02fa5f8

add whitespace line

15d695e

fix bug in ProviderStack::push_const_local

8b58428

add dev comment

6313e53

add warning to EngineBackend::RegisterMachine

283b121

Robbepop marked this pull request as ready for review September 21, 2023 09:54

Robbepop merged commit 6fb940f into master Sep 21, 2023
17 checks passed

Robbepop deleted the rf-engine-regmach branch September 21, 2023 09:55

Robbepop changed the title ~~WIP: Register machine wasmi execution engine (take 2)~~ Register machine wasmi execution engine Sep 21, 2023
