Optimize VM Engine (Part 1) #70

kateinoigakukun · 2024-02-01T12:49:57Z

Benchmark

Reduced version of test/Concurrency/Runtime/clock.swift

| | Before (`1c4dbf6`) | After (`c2357e9`) |
|---|---|---|
| Execution time | 1644.81s | 295.46s (5.5x faster) |
| Cache misses | 35.965% | 3.741% |

Optimization techniques

- Make frequently used types POD
- Reduce Array CoW
- Flatten the Instruction enum's index space
- Linearize instruction sequences instead of using a structured layout
  - This contributes to better memory access locality.
- Split Stack buffers by type: Frame, Label, and Value
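
As a rough illustration of the linearization item above, here is a minimal sketch. It assumes a tiny hypothetical instruction set (`StructuredInst`, `FlatInst`, `linearize`, and `run` are invented names, not the types in this PR) and leaves out branches, which a real engine would resolve to offsets into the flat array.

```swift
// Hypothetical miniature instruction set, not the project's actual `Instruction` enum.
enum StructuredInst {
    case i32Const(Int32)
    case i32Add
    case block([StructuredInst]) // nested body lives in a separate heap-allocated array
}

enum FlatInst {
    case i32Const(Int32)
    case i32Add
    case end // sentinel marking the end of the sequence
}

/// Flattens nested blocks into one contiguous array, terminated by `.end`.
func linearize(_ body: [StructuredInst]) -> [FlatInst] {
    var out: [FlatInst] = []
    func emit(_ insts: [StructuredInst]) {
        for inst in insts {
            switch inst {
            case .i32Const(let value): out.append(.i32Const(value))
            case .i32Add: out.append(.i32Add)
            case .block(let inner): emit(inner) // inline the nested body
            }
        }
    }
    emit(body)
    out.append(.end)
    return out
}

/// Executes the flat sequence with a single program counter.
func run(_ code: [FlatInst]) -> Int32? {
    var stack: [Int32] = []
    var pc = 0
    while true {
        switch code[pc] {
        case .i32Const(let value):
            stack.append(value)
        case .i32Add:
            let rhs = stack.removeLast()
            let lhs = stack.removeLast()
            stack.append(lhs &+ rhs)
        case .end:
            return stack.last
        }
        pc += 1 // the next instruction is adjacent in memory
    }
}
```

With the flat layout, instruction fetch is a single indexed load from one buffer instead of a walk through nested arrays, which is the locality effect the list item refers to.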

TODO

Many optimization opportunities remain, even for an interpreter-style engine:

- Guard page-based memory access checks
- A register-based machine instead of a stack-based one
- Skip frame pushing for leaf functions
- Keep frame-local info (like the module index and the current default memory) in registers during the VM loop
  - Each instruction implementation should be a free function instead of a method, to avoid memory accesses through self (ExecutionState), which doesn't fit in the 64-bit r13 register.
- etc.

Deprecated Features

The guest code profiler built on the interceptor API is now unavailable in Release builds for performance reasons.
- We can consider a sampling-based profiler later.
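
One way to get this behavior is to compile the interception hook out of release builds. The sketch below assumes the gating is done with the `DEBUG` compilation condition (which SwiftPM defines for debug builds). The PR text does not say how the gating is actually implemented, and `Interceptor`, `CallCounter`, and `Engine` are hypothetical names.

```swift
// Hypothetical interceptor protocol; names are illustrative, not the project's API.
protocol Interceptor: AnyObject {
    func onEnterFunction(_ name: String)
}

final class CallCounter: Interceptor {
    private(set) var calls = 0
    func onEnterFunction(_ name: String) { calls += 1 }
}

struct Engine {
    var interceptor: Interceptor?

    @inline(__always)
    func invoke(_ name: String) {
        #if DEBUG
        // Assumption: `DEBUG` is set only for debug builds, so this optional
        // check and dynamic call disappear entirely from release builds.
        interceptor?.onEnterFunction(name)
        #endif
        // ... execute the function body here ...
    }
}
```

In a release build the hot path then carries no per-call check at all, at the cost of the profiler silently observing nothing.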

kateinoigakukun added 30 commits January 17, 2024 23:37

Track base stack indices of each frames

718d4a3

Homogeneous Stack

887e5e2

Add CustomStringConvertible conformance

bcde58c

Revert tests

84a112f

Optimize basic Stack operations

f5f88ad

8% faster

Reduce Array mutations

91a5a77

To avoid triggering CoW at popping values from stack
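
A minimal sketch of the idea, not the commit's actual change: if popping only moves a top-of-stack index and reads the element, the backing Array is never mutated on pop, so pop cannot trigger a uniqueness check or a buffer copy. `ValueStack` here is a hypothetical, fixed-capacity type invented for illustration.

```swift
// Hypothetical fixed-capacity stack; storage is allocated once up front.
struct ValueStack {
    private var storage: [UInt64]
    private var top = 0

    init(capacity: Int) {
        storage = Array(repeating: 0, count: capacity)
    }

    var count: Int { top }

    mutating func push(_ value: UInt64) {
        storage[top] = value // in-place write into preallocated storage
        top += 1
    }

    mutating func pop() -> UInt64 {
        top -= 1
        return storage[top] // read-only access: the Array itself is not mutated
    }
}
```

Compared with `removeLast()`, the stale slot above `top` is simply overwritten by the next push.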

Skip some push/pop traffic on entering/exiting frame or label

40f7658

Place all frames' locals into a single Array buffer

0011d29

Instead of allocating a separate Array for each frame's locals, allocate a single Array buffer and use it for all frames. This helps make the `Frame` type POD, which reduces ARC traffic.
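
The layout described in this commit message could look roughly like the following sketch. `Frame`, `CallStack`, and their members are hypothetical names chosen for illustration. The point is only that a frame stores integer offsets into one shared buffer rather than owning an Array.

```swift
// Hypothetical frame: only plain integers, so copying it needs no retain/release.
struct Frame {
    var localsBase: Int
    var localsCount: Int
}

// Hypothetical call stack that keeps every frame's locals in one shared buffer.
struct CallStack {
    private(set) var locals: [UInt64] = []
    private(set) var frames: [Frame] = []

    mutating func pushFrame(arguments: [UInt64], extraLocals: Int) {
        let base = locals.count
        locals.append(contentsOf: arguments)
        locals.append(contentsOf: repeatElement(UInt64(0), count: extraLocals))
        frames.append(Frame(localsBase: base, localsCount: arguments.count + extraLocals))
    }

    mutating func popFrame() {
        let frame = frames.removeLast()
        locals.removeLast(frame.localsCount) // release this frame's slice of the buffer
    }

    /// Reads a local of the current (topmost) frame.
    func local(_ index: Int) -> UInt64 {
        locals[frames[frames.count - 1].localsBase + index]
    }

    mutating func setLocal(_ index: Int, _ value: UInt64) {
        locals[frames[frames.count - 1].localsBase + index] = value
    }
}
```

Because `Frame` holds no references, pushing and popping frames moves only a few integers around.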

Skip local concatenation

3186d0a

WIP: Comment out unit tests

362b4a2

Spectest pass

adf8248

Split out ValueStack

0583d52

Unify locals and values storage

af127fd

Fix CoW performance issue

49056fd

Flatten instruction dispatcher

8d555c6

Optimize memory store

73fd887

Disable interceptor

3163982

Skip returning frame from pushFrame

99495c4

Use memcpy for pushing values

882f0b1

Reduce ARC traffic

138946b

Native memory load

6a21fb2

Ungrowable stack

9f3519e

Skip pop-push traffic on block/loop

424a855

Update inst dispatch generator

c53ff8e

Add Instruction.name

1b8d395

Skip local index validation at runtime

d3b66eb

since the index is already checked at validation time

Allocate defaultLocals as raw buffer instead of Array

233fa22

to reduce ARC traffic for every call

Inline Store.function(at:) to reduce ARC traffic

9197bd8

The returning enum is not POD, so it causes ARC traffic while returning StoreFunction.
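
To illustrate why a non-POD enum costs ARC traffic, the sketch below contrasts two hypothetical enums (`FunctionRef`, `FunctionTag`, and `HostFunction` are invented here and are not the definition of `StoreFunction`). It uses the standard library's underscored `_isPOD` helper to check whether a type is trivially copyable.

```swift
// Hypothetical host function object; being a class, it is reference-counted.
final class HostFunction {
    let name: String
    init(name: String) { self.name = name }
}

// Carries a class reference in one case, so copies may retain/release.
enum FunctionRef {
    case wasm(index: Int)
    case host(HostFunction)
}

// Carries only integers, so copies are plain bit copies.
enum FunctionTag {
    case wasm(index: Int)
    case host(index: Int)
}

print(_isPOD(FunctionRef.self)) // false
print(_isPOD(FunctionTag.self)) // true
```

Returning a value like `FunctionRef` from a function can force the compiler to emit retain/release pairs on every call, which is the overhead the commit avoids by inlining.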

Generate Instruction enum from code

42f1ff3

Make Instruction POD type

833d957

50% faster

kateinoigakukun added 28 commits February 1, 2024 21:31

Revert "Add sentinel guard for the end of expression"

0ee5b91

This reverts commit e7b9e4f.

Use Store.withMemory to reuse borrowing memory instance

0fbc14a

Linearize instruction sequence for data locality

a3ee5a6

8% faster; cache-misses 35.965% -> 11.892%

Sentinel guard for the end of function

b4b87d7

2% faster

Repeat doExecute while frame cannot be changed

400e196

cache-misses 11.892% -> 3.741%

Disable interceptor

4258a0c

Reduce number of memory load for each instruction fetch

c82d440

Make PC non-nullable

51bb77c

doExecute is now small enough to be inlined without @_transparent

2f39e21

Skip type section reference for structured controls at runtime

16c38a4

Skip heap allocation at root entry point

0a348c2

Revert dropping @inline(__always)

d558ab2

Skip overflow check for stack pointer

9d5bdc3

Track head stack pointer

2967c24

@_transparent withExecution

bc711b8

Flatten unary numeric instructions

6873b95

Reduce sizeof(Instruction) 21 -> 17

441b4ce

Remove unnecessary throws

a5e488d

Skip runtime global boundary and mutability check

2a64413

Track current stack pointer in FixedSizeStack

c864d8f

Revert "Track current stack pointer in FixedSizeStack"

9f58ecd

This reverts commit 86fd9d1.

Reduce stored fields

bc14ce4

Pass locals ptr by register

c910c92

Flatten part of IntBinary operator

40ba2ca

Update for direct local passing

807db0a

Remove too implementation-specific test cases for the VM now

bda3fd4

Re-enable interceptor API on debug builds

1d6783e

But not on release builds to avoid the non-trivial overhead of interception.

Remove unused already flattened Numeric operators

c2357e9

kateinoigakukun merged commit 4049a4e into main Feb 1, 2024
6 checks passed

kateinoigakukun deleted the yt/stack-opt branch February 1, 2024 13:14
