perf(vm): fast-path the executor dispatch loop by davydog187 · Pull Request #223 · tv-labs/lua

davydog187 · 2026-05-21T20:27:29Z

Nine focused optimizations to the executor's dispatch loop, each
measured with mix profile.tprof before and after. Together they cut
interpreter overhead in half on representative microbenchmarks and
narrow the gap to Luerl from 1.3–2x slower to 0.95–1.35x (string
concatenation is now faster than Luerl).

Why

Recent baseline measurements (the kind A33 will formalize) showed the
VM trailing Luerl by 1.5–2x on most workloads. Profiling fib, OOP,
table_build, and string_concat revealed a small handful of structural
overheads dominating the dispatch loop — not any single hot opcode, but
the same dispatch pattern paying for metamethod lookup, list-cons
allocation, and Map.fetch! indirection on every iteration.

This PR is the "cheap wins" pass: fast paths for work the interpreter
already knew how to skip but wasn't skipping. The bigger structural
changes (flat instruction stream, Erlang-function compilation,
array+hash table split) are scoped as follow-up plans B4–B8 in a
separate PR.

What changed

1. Numeric fast path for arithmetic and comparison opcodes. Numbers
   can never carry metatables in Lua, so the try_binary_metamethod
   wrapper on every +/-/*/==/</<=/etc was pure overhead.
   Adds an int-int guard clause (with Lua 5.3 §3.4.1 wrap-around via
   Numeric.to_signed_int64) and a number-number cond branch above
   the existing dispatch. Strings get the same treatment for == /
   < / <=, since they can't carry __eq either.
2. Call-arg staging without an intermediate list. Every Lua-to-Lua
   call was building an args list via Enum.with_index |> Enum.reduce.
   New copy_args_to_regs/5 moves caller regs into callee regs
   directly. collect_args/3 still materializes a list for native
   callbacks and __call, where the contract demands one.
3. Lua.VM.Table.order_tail field for O(1) insert ordering. Every
   brand-new key was doing order ++ [key], an O(n) append. Writes now
   prepend onto a separate order_tail (O(1)); next_entry merges
   lazily; lua_next flushes once per iteration so subsequent steps
   see a clean list. This is the single biggest win for table-heavy
   workloads.
4. Empty-map short-circuit for open_upvalues cleanup. Every
   for-loop iteration was paying Map.reject on an empty map (the
   overwhelmingly common case: no nested closures captured a loop-local).
   New close_open_upvalues_at_or_above/2 guards on map_size == 0.
5. Fast-path :get_field and :get_table for trefs. When the
   table has the key directly and either no metatable or the key is
   present, skip the whole
   index_value → table_index → normalize_key → Map.fetch! → Map.get
   pipeline.
6. Fast-path table_newindex/5 for tables with no metatable. Skips
   the has_data? lookup and the nested-case scrutiny that only the
   metamethod path needs.
7. Single-result :return fast path. Every return n and
   return f(...) + g(...) hit a for i <- 0..0 range comprehension;
   replaced with a direct elem read.
8. Fast-path string concatenation. When both operands are binary
   (or numeric), skip the __concat metamethod lookup: the string
   metatable holds __index for string.* library access, not
   __concat.
9. Single process-dict key for source-position bridge. Native
   callback dispatch was doing six Process.put/get/delete ops per
   call (two each for line + source, separately). One tuple-valued key
   halves the cost.
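
To make the numeric fast path concrete, here is a minimal sketch of the clause ordering it describes: an int-int clause with 64-bit wrap-around, a number-number clause, and a fallback. The module name FastArithSketch, its add/2 function, and the slow_add/2 placeholder are hypothetical and written for illustration; to_signed_int64/1 here is a stand-in with the behavior the name suggests, not the project's Numeric.to_signed_int64.

```elixir
defmodule FastArithSketch do
  import Bitwise

  # Constants for two's-complement wrapping into the signed 64-bit range.
  @modulus 1 <<< 64
  @sign_bit 1 <<< 63
  @mask @modulus - 1

  # Hypothetical stand-in for a to_signed_int64 helper: Elixir integers are
  # arbitrary precision, so the result is wrapped by hand.
  def to_signed_int64(n) when is_integer(n) do
    masked = n &&& @mask
    if masked >= @sign_bit, do: masked - @modulus, else: masked
  end

  # Fast path 1: both operands are integers, no metamethod lookup needed.
  def add(a, b) when is_integer(a) and is_integer(b), do: to_signed_int64(a + b)

  # Fast path 2: any other pair of numbers (at least one float).
  def add(a, b) when is_number(a) and is_number(b), do: a + b

  # Everything else falls through to the existing (slow) dispatch.
  def add(a, b), do: slow_add(a, b)

  # Placeholder for the metamethod-aware path; it only tags the operands here.
  defp slow_add(a, b), do: {:metamethod_dispatch, :__add, a, b}
end
```

Because the guards are checked in clause order, numeric operands never reach the fallback, which is the point of placing the fast clauses above the existing dispatch.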

Results

Microbenchmark medians (lower is better), measured against Luerl 1.5 on
the same M1 Pro hardware:

Benchmark	Before vs Luerl	After vs Luerl
fib (recursive)	1.72x slower	1.13x slower
table_build+sum	1.51x slower	1.27x slower
string_concat	1.31x slower	0.95x — faster
oop	2.05x slower	1.35x slower

The n=500 table_build benchmark dropped from ~490μs median to
~180μs (~63% faster), largely from the order_tail change.
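
The order_tail idea can be sketched as follows, assuming a simplified table with only data, order, and order_tail fields. The module OrderTailSketch and its new/0, put/3, flush/1, and keys/1 functions are hypothetical; the real Lua.VM.Table has more fields and different function names (the description mentions next_entry and lua_next).

```elixir
defmodule OrderTailSketch do
  # `order` holds keys in insertion order; `order_tail` holds the most recent
  # inserts in reverse order so that each new key is an O(1) prepend.
  defstruct data: %{}, order: [], order_tail: []

  def new, do: %__MODULE__{}

  # Existing key: the ordering does not change, only the value.
  def put(%__MODULE__{data: data} = t, key, value) when is_map_key(data, key) do
    %{t | data: Map.put(data, key, value)}
  end

  # Brand-new key: prepend to order_tail instead of `order ++ [key]`.
  def put(%__MODULE__{data: data, order_tail: tail} = t, key, value) do
    %{t | data: Map.put(data, key, value), order_tail: [key | tail]}
  end

  # Merge the pending tail into `order` once, paying O(n) per flush
  # rather than O(n) per insert.
  def flush(%__MODULE__{order_tail: []} = t), do: t

  def flush(%__MODULE__{order: order, order_tail: tail} = t) do
    %{t | order: order ++ Enum.reverse(tail), order_tail: []}
  end

  # Keys in insertion order, merging lazily on read.
  def keys(t), do: flush(t).order
end
```

With this layout, building a table of N fresh keys does N constant-time prepends plus one merge, instead of N appends whose cost grows with the table.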

Profile snapshot, fib(22): do_execute/8 self-time went from 23% to
43% (of a 50% smaller total), lookup_metamethod dropped from 6.4%
to negligible, and the Enum.map/Range.new/with_index cluster
that dominated the call path is gone.

Tests

All 1,692 tests pass (plus 55 doctests and 51 properties).
mix test --only lua53 count unchanged.
One bug found and fixed mid-iteration: the single-result return fast
path needed to handle the results = [] case (a function that falls off
its end with no explicit return; the Lua spec says missing returns
yield nil). The vararg.lua suite test caught it.
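
A minimal sketch of that edge case, assuming registers are stored in a tuple and results are collected into a list. ReturnSketch, collect/3, and first_result/1 are hypothetical names; the executor's real data shapes and function signatures are not shown in this PR description.

```elixir
defmodule ReturnSketch do
  # No results: the function fell off its end without an explicit return.
  def collect(_regs, _base, 0), do: []

  # Single result: a direct elem read, no range comprehension.
  def collect(regs, base, 1), do: [elem(regs, base)]

  # General case: read `count` consecutive registers starting at `base`.
  def collect(regs, base, count) when count > 1 do
    for i <- 0..(count - 1), do: elem(regs, base + i)
  end

  # A caller that wants exactly one value must treat an empty result list
  # as nil rather than assuming a first element exists.
  def first_result([]), do: nil
  def first_result([value | _rest]), do: value
end
```

The zero-count clause comes first so that the single-result shortcut never reads a register for a function that returned nothing.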

Follow-ups (separate PR)

A separate chore(plans): PR will land five B-series plans scoping the
remaining structural work:

B4: Flat instruction stream + PC dispatch
B5: Compile prototypes to Erlang functions
B6: Direct table refs (kill per-access Map.fetch!)
B7: Array+hash split for table storage
B8: Inline Numeric.to_signed_int64 fast path

Those plans aim to close the remaining 1.13–1.35x gap on the harder
workloads.

Nine focused optimizations that cut interpreter overhead in half on representative microbenchmarks and bring the gap to Luerl from 1.3-2x slower down to 0.95-1.35x (and faster on string concatenation).

Profile-driven changes, each measured before/after with tprof:

1. Numeric fast path for arithmetic and comparison opcodes. Numbers can never carry metatables in Lua, so the `try_binary_metamethod` wrapper that fired on every `+`/`-`/`*`/`==`/`<`/`<=`/etc was pure overhead. Two extra clauses per opcode head: int-int and number-number, falling through to the existing dispatch for non-numbers. Cuts fib(25) by ~12%. (Strings get the same treatment for `==`/`<`/`<=`/etc; they can't carry __eq either.)
2. Call-args staging without an intermediate list. Every Lua-to-Lua call was building an args list via `Enum.with_index |> Enum.reduce`. New `copy_args_to_regs/5` recursively moves caller regs into callee regs in place. `collect_args/3` still materializes a list for native callbacks and `__call`.
3. `Lua.VM.Table` `order_tail` field. The `order: order ++ [key]` on every brand-new insert was O(n), making `t = {} for i = 1,N do t[i]=v end` O(n²). Writes now prepend onto a separate `order_tail` (O(1)); `next_entry` merges lazily; `lua_next` flushes once per iteration to amortize. table_build+sum(n=500) drops from ~490μs to ~180μs (~63% faster).
4. Short-circuit `Map.reject(open_upvalues, ...)` when the map is empty. Every numeric/generic for iteration was paying for an empty-map walk. New `close_open_upvalues_at_or_above/2` guards on `map_size(ou) == 0`.
5. Fast-path `:get_field` and `:get_table` for trefs. When the table has the key directly (and either no metatable, or the key is present), skip the whole `index_value → table_index → normalize_key → Map.fetch! → Map.get` pipeline.
6. Fast-path `table_newindex/5` for tables with no metatable. Skips a `has_data?` lookup and the nested-case scrutiny that only the metamethod path needs.
7. Single-result `:return` fast path. Every `return f(...)` and `return n` hit a `for i <- 0..0` range comprehension; replaced with a direct `elem` read.
8. Fast-path string concatenation when both operands are binary or numeric. The string metatable is for `string.*` library access, not `__concat` dispatch, so that lookup was wasted.
9. Single process-dict key for the source-position bridge. Native callback dispatch was doing six `Process.put`/`get`/`delete` ops per call (two each for line + source, separately). One tuple-valued key halves it.

Results, microbenchmarks (median μs, perf/skip-metamethod-overhead vs Luerl 1.5 on the same hardware):

Benchmark	This branch	Luerl 1.5	Ratio	Ratio before
fib(25)	71μs	63μs	1.13x	~1.57x
table_build+sum	179μs	141μs	1.27x	~1.50x
string_concat	40μs	43μs	0.95x	~1.31x
oop	109μs	81μs	1.35x	~2.05x

All 1692 tests pass (plus 55 doctests and 51 properties); the lua53 suite count is unchanged.

Co-authored-by: dave <dave@tvlabs.com>

Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at 3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on table_build. The real table-workload bottlenecks live inside Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%) and in :erlang.setelement (17.5% on table writes, 20.9% on OOP). Those are B7's targets, not B6's. B6's projected wall-clock win is now below 1%, inside benchee's deviation band on every measured workload. Audit cleanup may still be worth doing later as a refactor, but not as a perf plan and not before B7.

davydog187 merged commit 4f2894b into main May 21, 2026
4 checks passed

davydog187 deleted the perf/skip-metamethod-overhead branch May 21, 2026 20:31

davydog187 mentioned this pull request May 21, 2026

chore(plans): scope B4-B8 perf follow-ups after PR #223 #226

Merged

davydog187 added a commit that referenced this pull request May 21, 2026

chore(plans): scope B4-B8 perf follow-ups after PR #223 (#226)

eeca921

Co-authored-by: dave <dave@tvlabs.com>

davydog187 mentioned this pull request May 22, 2026

docs(roadmap): consolidate B-series findings (B4, B6, B7, B8 + harness) #234

Merged
