Optimize VM Engine (Part 1) #70

Merged
merged 63 commits into from
Feb 1, 2024

Conversation

Member

@kateinoigakukun kateinoigakukun commented Feb 1, 2024

Benchmark

Reduced version of test/Concurrency/Runtime/clock.swift

                 Before (1c4dbf6)   After (c2357e9)
Execution Time   1644.81s           295.46s (5.5x faster)
Cache miss       35.965%            3.741%

Optimization techniques

  • Make frequently used types POD
  • Reduce Array CoW
  • Flatten Instruction enum's indices space
  • Linearize instruction sequences instead of using a structured layout
    • This contributes to better memory access locality.
  • Split the Stack buffer by type: Frame, Label, and Value.
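
A minimal sketch of the linearization idea, not the code from this PR: a structured `if`/`else` is flattened into one contiguous array and control flow becomes pc-relative jumps. The `Inst` enum, `program(condition:)`, and `run` are hypothetical names made up for illustration. Every payload is a trivial value, so the enum stays POD, and the value stack is its own buffer (frames and labels would get theirs).

```swift
// Hypothetical miniature instruction set; not WasmKit's real `Instruction` type.
// All payloads are trivial values, so copying an `Inst` needs no retain/release.
enum Inst {
    case const32(Int32)
    case add
    case brIfNonZero(offset: Int)  // pc-relative jump, taken when the popped value != 0
    case br(offset: Int)           // unconditional pc-relative jump
    case end
}

// `(if cond then 10 else 20) + 1`, flattened into a single linear sequence.
func program(condition: Int32) -> [Inst] {
    return [
        .const32(condition),     // 0: push the condition
        .brIfNonZero(offset: 3), // 1: non-zero -> jump to 4 (then branch)
        .const32(20),            // 2: else branch
        .br(offset: 2),          // 3: skip the then branch -> 5
        .const32(10),            // 4: then branch
        .const32(1),             // 5
        .add,                    // 6
        .end,                    // 7
    ]
}

func run(_ code: [Inst]) -> Int32 {
    var values: [Int32] = []  // value stack only, kept apart from frames and labels
    values.reserveCapacity(16)
    var pc = 0
    while true {
        switch code[pc] {
        case .const32(let v):
            values.append(v)
            pc += 1
        case .add:
            let rhs = values.removeLast()
            let lhs = values.removeLast()
            values.append(lhs &+ rhs)
            pc += 1
        case .brIfNonZero(let offset):
            pc += (values.removeLast() != 0) ? offset : 1
        case .br(let offset):
            pc += offset
        case .end:
            return values.removeLast()
        }
    }
}
```

The interpreter loop only ever indexes one array with `pc`, which is the access pattern the locality bullet above refers to.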

TODO

There are still many optimization opportunities, even while staying with an interpreter:

  • Guard page-based memory access check
  • Register-based machine instead of stack-based one
  • Skip frame pushing on leaf function
  • Place frame-local info (like module index, current default memory) in registers during VM-loop
    • Each instruction implementation should be a function instead of a method, to avoid memory access on self (ExecutionState), which doesn't fit in the 64-bit r13 register.
  • etc...
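
To make the "register-based machine" item concrete, here is a rough sketch under assumed names (`RegInst` and `runRegisters` are hypothetical, nothing here exists in the codebase yet). Operands are slot indices into a frame-local register file, so an `add` reads and writes slots directly instead of popping and pushing.

```swift
// Hypothetical register-style instructions: operands name slots in a
// frame-local register file instead of implicit stack positions.
enum RegInst {
    case const32(dst: Int, value: Int32)
    case add(dst: Int, lhs: Int, rhs: Int)
    case ret(src: Int)
}

func runRegisters(_ code: [RegInst], registerCount: Int) -> Int32 {
    var regs = [Int32](repeating: 0, count: registerCount)
    var pc = 0
    while true {
        switch code[pc] {
        case .const32(let dst, let value):
            regs[dst] = value
        case .add(let dst, let lhs, let rhs):
            // One instruction, no push/pop traffic on a value stack.
            regs[dst] = regs[lhs] &+ regs[rhs]
        case .ret(let src):
            return regs[src]
        }
        pc += 1
    }
}

// 2 + 40, expressed with explicit slots.
let sum = runRegisters([
    .const32(dst: 0, value: 2),
    .const32(dst: 1, value: 40),
    .add(dst: 2, lhs: 0, rhs: 1),
    .ret(src: 2),
], registerCount: 3)
```

A stack machine needs the same three value-producing instructions here, but each one also moves the stack pointer; the register form trades that for wider instructions.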

Deprecated Features

  • The guest code profiler built on the interceptor API is now unavailable in Release builds for performance reasons.
    • We can consider a sampling-based profiler later

To avoid triggering CoW when popping values from the stack
Instead of allocating a separate Array for each frame's locals, allocate a
single Array buffer and use it for all frames. This helps make `Frame` a
POD type and reduces ARC traffic.
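
A sketch of the shared-buffer layout described above, with assumed field and method names (`localsBase`, `localsCount`, `pushFrame`, and so on are illustrative, not the actual definitions). `Frame` holds only integers, so it carries no reference-counted storage of its own.

```swift
// Hypothetical types; names are illustrative, not the real layout.
struct Frame {
    var localsBase: Int   // index of this frame's first local in the shared buffer
    var localsCount: Int  // number of locals owned by this frame
}

struct Stack {
    var frames: [Frame] = []
    var locals: [Int64] = []  // one buffer shared by every frame

    mutating func pushFrame(localsCount: Int) {
        frames.append(Frame(localsBase: locals.count, localsCount: localsCount))
        // Locals start zero-initialized at the end of the shared buffer.
        locals.append(contentsOf: repeatElement(Int64(0), count: localsCount))
    }

    mutating func popFrame() {
        let frame = frames.removeLast()
        locals.removeLast(frame.localsCount)
    }

    func local(_ index: Int) -> Int64 {
        return locals[frames[frames.count - 1].localsBase + index]
    }

    mutating func setLocal(_ index: Int, _ value: Int64) {
        locals[frames[frames.count - 1].localsBase + index] = value
    }
}
```

Pushing or popping a frame then only adjusts the tail of `locals`; no per-frame Array is allocated, retained, or released.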
since the index is already checked at validation time
to reduce ARC traffic for every call
The returning enum is not POD, so it causes ARC traffic while returning
StoreFunction.
8% faster
cache-misses 35.965% -> 11.892%
cache-misses 11.892% -> 3.741%
But not on release builds to avoid the non-trivial overhead of
interception.
@kateinoigakukun kateinoigakukun merged commit 4049a4e into main Feb 1, 2024
6 checks passed
@kateinoigakukun kateinoigakukun deleted the yt/stack-opt branch February 1, 2024 13:14