perf(vm): fast-path the executor dispatch loop #223
Merged
Nine focused optimizations that cut interpreter overhead in half on
representative microbenchmarks and bring the gap to Luerl from
1.3-2x slower down to 0.95-1.35x (and faster on string concatenation).
Profile-driven changes, each measured before/after with tprof:
1. Numeric fast path for arithmetic and comparison opcodes. Numbers can
never carry metatables in Lua, so the `try_binary_metamethod` wrapper
that fired on every `+`/`-`/`*`/`==`/`<`/`<=`/etc was pure overhead.
Two extra clauses per opcode head: int-int and number-number, falling
through to the existing dispatch for non-numbers. Cuts fib(25) by
~12%. (Strings get the same treatment for `==`/`<`/`<=`/etc — they
can't carry __eq either.)
2. Call-args staging without an intermediate list. Every Lua-to-Lua call
was building an args list via `Enum.with_index |> Enum.reduce`. New
`copy_args_to_regs/5` recursively moves caller regs into callee regs
in place. `collect_args/3` still materializes a list for native
callbacks and `__call`.
3. `Lua.VM.Table` `order_tail` field. The `order: order ++ [key]` on
every brand-new insert was O(n), making `t = {} for i = 1,N do
t[i]=v end` O(n²). Writes now prepend onto a separate `order_tail`
(O(1)); `next_entry` merges lazily; `lua_next` flushes once per
iteration to amortize. table_build+sum(n=500) drops from ~490μs to
~180μs (~63% faster).
4. Short-circuit `Map.reject(open_upvalues, ...)` when the map is
empty. Every numeric/generic for iteration was paying for an empty-
map walk. New `close_open_upvalues_at_or_above/2` guards on
`map_size(ou) == 0`.
5. Fast-path `:get_field` and `:get_table` for trefs. When the table
has the key directly (and either no metatable, or the key is
present), skip the whole `index_value → table_index → normalize_key
→ Map.fetch! → Map.get` pipeline.
6. Fast-path `table_newindex/5` for tables with no metatable. Skips a
`has_data?` lookup and the nested-case scrutiny that only the
metamethod path needs.
7. Single-result `:return` fast path. Every `return f(...)` and `return
n` hit a `for i <- 0..0` range comprehension; replaced with a direct
`elem` read.
8. Fast-path string concatenation when both operands are binary or
numeric. The string metatable is for `string.*` library access, not
`__concat` dispatch — that lookup was wasted.
9. Single process-dict key for the source-position bridge. Native
callback dispatch was doing six `Process.put`/`get`/`delete` ops per
call (two each for line + source, separately). One tuple-valued key
halves it.
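The `order_tail` idea from item 3 can be sketched roughly as follows. This is a minimal illustration, not the actual `Lua.VM.Table` code: the module name `OrderTailSketch`, its three-field struct, and the `new/0`, `put/3`, `flush/1`, and `keys/1` functions are hypothetical stand-ins. Only the `order` and `order_tail` field names come from this PR.

```elixir
defmodule OrderTailSketch do
  # Hypothetical stand-in for a table. Only `order` and `order_tail`
  # are names taken from the PR; everything else is illustrative.
  defstruct data: %{}, order: [], order_tail: []

  def new, do: %__MODULE__{}

  # Existing key: overwrite the value, insertion order is untouched.
  # Brand-new key: prepend to order_tail, which is O(1), instead of
  # the O(n) `order ++ [key]`.
  def put(%__MODULE__{data: data, order_tail: tail} = t, key, value) do
    if Map.has_key?(data, key) do
      %{t | data: Map.put(data, key, value)}
    else
      %{t | data: Map.put(data, key, value), order_tail: [key | tail]}
    end
  end

  # Nothing pending: return the table unchanged.
  def flush(%__MODULE__{order_tail: []} = t), do: t

  # Pending keys are stored newest-first, so reverse them once and
  # append them in a single pass.
  def flush(%__MODULE__{order: order, order_tail: tail} = t) do
    %{t | order: order ++ Enum.reverse(tail), order_tail: []}
  end

  # Readers that need insertion order flush first.
  def keys(t), do: flush(t).order
end
```

In this sketch N inserts cost N constant-time prepends plus one linear merge when the order is first read, rather than N appends that each copy the whole list.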
Results, microbenchmarks (median μs, perf/skip-metamethod-overhead vs
Luerl 1.5 on the same hardware):
  fib(25)          71μs vs  63μs   ratio 1.13x  (was ~1.57x)
  table_build+sum 179μs vs 141μs   ratio 1.27x  (was ~1.50x)
  string_concat    40μs vs  43μs   ratio 0.95x  (was ~1.31x)
  oop             109μs vs  81μs   ratio 1.35x  (was ~2.05x)
All 1692 tests pass (plus 55 doctests and 51 properties); the lua53
suite count is unchanged.
davydog187 added a commit that referenced this pull request on May 21, 2026
davydog187 added a commit that referenced this pull request on May 22, 2026
Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at 3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on table_build. The real table-workload bottlenecks live inside Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%) and in :erlang.setelement (17.5% on table writes, 20.9% on OOP). Those are B7's targets, not B6's. B6's projected wall-clock win is now below 1%, inside benchee's deviation band on every measured workload. Audit cleanup may still be worth doing later as a refactor, but not as a perf plan and not before B7.
Nine focused optimizations to the executor's dispatch loop, each
measured with `mix profile.tprof` before and after. Cuts interpreter
overhead in half on representative microbenchmarks and brings the gap
to Luerl from 1.3–2x slower down to 0.95–1.35x (and faster than Luerl
on string concatenation).
Why
Recent baseline measurements (the kind A33 will formalize) showed the
VM trailing Luerl by 1.5–2x on most workloads. Profiling fib, OOP,
table_build, and string_concat revealed a small handful of structural
overheads dominating the dispatch loop — not any single hot opcode, but
the same dispatch pattern paying for metamethod lookup, list-cons
allocation, and `Map.fetch!` indirection on every iteration.

This PR is the "cheap wins" pass: fast paths that the interpreter
already knew how to skip but wasn't. The bigger structural changes
(flat instruction stream, Erlang-function compilation, array+hash
table split) are scoped as follow-up plans B4–B8 in a separate PR.
What changed
1. Numeric fast path for arithmetic and comparison opcodes. Numbers
   can never carry metatables in Lua, so the `try_binary_metamethod`
   wrapper on every `+`/`-`/`*`/`==`/`<`/`<=`/etc was pure overhead.
   Adds an int-int guard clause (with Lua 5.3 §3.4.1 wrap-around via
   `Numeric.to_signed_int64`) and a number-number cond branch above
   the existing dispatch. Strings get the same treatment for
   `==`/`<`/`<=`; they can't carry `__eq` either.
2. Call-arg staging without an intermediate list. Every Lua-to-Lua
   call was building an args list via `Enum.with_index |> Enum.reduce`.
   New `copy_args_to_regs/5` moves caller regs into callee regs
   directly. `collect_args/3` still materializes a list for native
   callbacks and `__call`, where the contract demands one.
3. `Lua.VM.Table.order_tail` field for O(1) insert ordering. Every
   brand-new key was doing `order ++ [key]`, an O(n) append. Writes now
   prepend onto a separate `order_tail` (O(1)); `next_entry` merges
   lazily; `lua_next` flushes once per iteration so subsequent steps
   see a clean list. This is the single biggest win for table-heavy
   workloads.
4. Empty-map short-circuit for `open_upvalues` cleanup. Every
   for-loop iteration was paying `Map.reject` on an empty map (the
   overwhelmingly common case: no nested closures captured a
   loop-local). New `close_open_upvalues_at_or_above/2` guards on
   `map_size == 0`.
5. Fast-path `:get_field` and `:get_table` for trefs. When the
   table has the key directly and either no metatable or the key is
   present, skip the whole
   `index_value → table_index → normalize_key → Map.fetch! → Map.get`
   pipeline.
6. Fast-path `table_newindex/5` for tables with no metatable. Skips
   the `has_data?` lookup and the nested-case scrutiny that only the
   metamethod path needs.
7. Single-result `:return` fast path. Every `return n` and
   `return f(...) + g(...)` hit a `for i <- 0..0` range comprehension;
   replaced with a direct `elem` read.
8. Fast-path string concatenation. When both operands are binary
   (or numeric), skip the `__concat` metamethod lookup: the string
   metatable holds `__index` for `string.*` library access, not
   `__concat`.
9. Single process-dict key for source-position bridge. Native
   callback dispatch was doing six `Process.put`/`get`/`delete` ops per
   call (two each for line + source, separately). One tuple-valued key
   halves the cost.
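As a rough illustration of the numeric fast path described in the first item above, the clause ordering might look like the sketch below. It is written under stated assumptions rather than taken from the PR's code: the module `NumericFastPathSketch`, the `add/2` function, and the `{:slow_path, ...}` placeholder are hypothetical, and `wrap_int64/1` is a stand-in for whatever `Numeric.to_signed_int64` actually does.

```elixir
defmodule NumericFastPathSketch do
  import Bitwise

  # Fast path 1: integer + integer, wrapped to signed 64-bit because
  # Lua 5.3 integer arithmetic wraps around on overflow.
  def add(a, b) when is_integer(a) and is_integer(b), do: wrap_int64(a + b)

  # Fast path 2: any other pair of numbers. At least one is a float,
  # so Elixir's `+` already yields a float.
  def add(a, b) when is_number(a) and is_number(b), do: a + b

  # Everything else falls through to the general path. The tagged
  # tuple is a placeholder for the real metamethod-aware dispatch.
  def add(a, b), do: {:slow_path, :__add, a, b}

  # Hypothetical stand-in for Numeric.to_signed_int64: keep the low
  # 64 bits, then reinterpret them as a two's-complement value.
  defp wrap_int64(n) do
    low = n &&& 0xFFFFFFFFFFFFFFFF
    if low >= 0x8000000000000000, do: low - 0x10000000000000000, else: low
  end
end
```

Because the guards are tried in order, numeric operands never reach the third clause, which is the point of the optimization: the metamethod lookup is only paid by values that could actually have one.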
Results
Microbenchmark medians (lower is better), measured against Luerl 1.5 on
the same M1 Pro hardware:
The `n=500` table_build benchmark dropped from ~490μs median to
~180μs (~63% faster), largely from the `order_tail` change.

Profile snapshot, fib(22): `do_execute/8` self-time went from 23% to
43% (of a 50% smaller total), `lookup_metamethod` dropped from 6.4%
to negligible, and the `Enum.map`/`Range.new`/`with_index` cluster
that dominated the call path is gone.
Tests
`mix test --only lua53` count unchanged.
[…] path needed to handle the `results = []` case (function fell off end
with no explicit return; the Lua spec says missing returns yield nil),
caught by the `vararg.lua` suite test.
Follow-ups (separate PR)
A separate `chore(plans):` PR will land five B-series plans scoping the
remaining structural work:
[…] `Map.fetch!`)
[…] `Numeric.to_signed_int64` fast path
Those plans aim to close the remaining 1.13–1.35x gap on the harder
workloads.