perf(vm): fast-path the executor dispatch loop #223

Merged: davydog187 merged 1 commit into main from perf/skip-metamethod-overhead on May 21, 2026

@davydog187

Nine focused optimizations to the executor's dispatch loop, each
measured with mix profile.tprof before and after. Together they cut
interpreter overhead in half on representative microbenchmarks and bring
the gap to Luerl from 1.3–2x slower down to 0.95–1.35x (faster than
Luerl on string concatenation).

Why

Recent baseline measurements (the kind A33 will formalize) showed the
VM trailing Luerl by 1.5–2x on most workloads. Profiling fib, OOP,
table_build, and string_concat revealed a small handful of structural
overheads dominating the dispatch loop — not any single hot opcode, but
the same dispatch pattern paying for metamethod lookup, list-cons
allocation, and Map.fetch! indirection on every iteration.

This PR is the "cheap wins" pass: fast paths for work the interpreter
could already have skipped but didn't. The bigger structural changes
(flat instruction stream, Erlang-function compilation, array+hash
table split) are scoped as follow-up plans B4–B8 in a separate PR.

What changed

  1. Numeric fast path for arithmetic and comparison opcodes. Numbers
    can never carry metatables in Lua, so the try_binary_metamethod
    wrapper on every +/-/*/==/</<=/etc was pure overhead.
    Adds an int-int guard clause (with Lua 5.3 §3.4.1 wrap-around via
    Numeric.to_signed_int64) and a number-number cond branch above
    the existing dispatch. Strings get the same treatment for == /
    < / <= — they can't carry __eq either.

  2. Call-arg staging without an intermediate list. Every Lua-to-Lua
    call was building an args list via Enum.with_index |> Enum.reduce.
    New copy_args_to_regs/5 moves caller regs into callee regs
    directly. collect_args/3 still materializes a list for native
    callbacks and __call, where the contract demands one.

  3. Lua.VM.Table.order_tail field for O(1) insert ordering. Every
    brand-new key was doing order ++ [key] — O(n) append. Writes now
    prepend onto a separate order_tail (O(1)); next_entry merges
    lazily; lua_next flushes once per iteration so subsequent steps
    see a clean list. This is the single biggest win for table-heavy
    workloads.

  4. Empty-map short-circuit for open_upvalues cleanup. Every
    for-loop iteration was paying Map.reject on an empty map (the
    overwhelming common case: no nested closures captured a loop-local).
    New close_open_upvalues_at_or_above/2 guards on map_size == 0.

  5. Fast-path :get_field and :get_table for trefs. When the table
    has no metatable, or the key is present in the table itself, no
    __index lookup can apply, so skip the whole
    index_value → table_index → normalize_key → Map.fetch! → Map.get
    pipeline.

  6. Fast-path table_newindex/5 for tables with no metatable. Skips
    the has_data? lookup and the nested-case scrutiny that only the
    metamethod path needs.

  7. Single-result :return fast path. Every return n and return f(...) + g(...) hit a for i <- 0..0 range comprehension; replaced
    with a direct elem read.

  8. Fast-path string concatenation. When both operands are binary
    (or numeric), skip the __concat metamethod lookup — the string
    metatable holds __index for string.* library access, not
    __concat.

  9. Single process-dict key for source-position bridge. Native
    callback dispatch was doing six Process.put/get/delete ops per
    call (two each for line + source, separately). One tuple-valued key
    halves the cost.
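
As an illustration of item 1, the clause ordering might look like the sketch below. Everything in it is hypothetical: `FastPathSketch`, `add/2`, and `slow_add/2` are illustrative names rather than the VM's actual functions, and `to_signed_int64/1` is only a guess at the behavior of `Numeric.to_signed_int64`, assuming it wraps to a two's-complement 64-bit integer as Lua 5.3 §3.4.1 describes.

```elixir
defmodule FastPathSketch do
  import Bitwise

  # Hypothetical stand-in for Numeric.to_signed_int64: wrap an arbitrary
  # integer into the signed 64-bit range (two's complement).
  def to_signed_int64(n) do
    wrapped = band(n, 0xFFFFFFFFFFFFFFFF)

    if wrapped >= 0x8000000000000000 do
      wrapped - 0x10000000000000000
    else
      wrapped
    end
  end

  # Fast path 1: integer + integer, with wrap-around on overflow.
  def add(a, b) when is_integer(a) and is_integer(b), do: to_signed_int64(a + b)

  # Fast path 2: any other pair of numbers. At least one operand is a float
  # here, so the result is a float.
  def add(a, b) when is_number(a) and is_number(b), do: a + b

  # Everything else falls through to the (assumed) metamethod-aware path.
  def add(a, b), do: slow_add(a, b)

  # Placeholder for the existing dispatch; the real path would handle
  # string coercion and look up __add.
  defp slow_add(a, b), do: {:metamethod_dispatch, :__add, a, b}
end
```

Because the guards sit in the function heads, numeric operands never reach the metamethod lookup, while non-numeric operands take the same route as before.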

Results

Microbenchmark medians (lower is better), measured against Luerl 1.5 on
the same M1 Pro hardware:

| Benchmark | Before (vs Luerl) | After (vs Luerl) |
|---|---|---|
| fib (recursive) | 1.72x slower | 1.13x slower |
| table_build+sum | 1.51x slower | 1.27x slower |
| string_concat | 1.31x slower | 0.95x (faster) |
| oop | 2.05x slower | 1.35x slower |

The n=500 table_build benchmark dropped from ~490μs median to
~180μs (~63% less time), largely from the order_tail change.
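
The order_tail idea can be sketched roughly as follows. This is a simplified, hypothetical model, not the real `Lua.VM.Table`: the module name, the `data` field, and the `new/0`, `put/3`, `flush/1`, and `keys/1` functions are assumptions made for illustration.

```elixir
defmodule OrderTailSketch do
  # `order` holds keys oldest-first; `order_tail` holds keys inserted since
  # the last flush, newest-first, so each insert is a constant-time prepend.
  defstruct data: %{}, order: [], order_tail: []

  def new, do: %__MODULE__{}

  # Existing key: update the value only, the ordering is unchanged.
  def put(%__MODULE__{data: data} = t, key, value) when is_map_key(data, key) do
    %{t | data: Map.put(data, key, value)}
  end

  # Brand-new key: prepend to order_tail instead of `order ++ [key]`.
  def put(%__MODULE__{data: data, order_tail: tail} = t, key, value) do
    %{t | data: Map.put(data, key, value), order_tail: [key | tail]}
  end

  # Merge the pending tail into `order`; nothing to do when it is empty.
  def flush(%__MODULE__{order_tail: []} = t), do: t

  def flush(%__MODULE__{order: order, order_tail: tail} = t) do
    %{t | order: order ++ Enum.reverse(tail), order_tail: []}
  end

  # Keys in insertion order, merging lazily on read.
  def keys(t), do: flush(t).order
end
```

In this model the list append is paid once per flush rather than once per insert, which is where the quadratic cost of building a table key by key goes away.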

Profile snapshot, fib(22): do_execute/8 self-time went from 23% to
43% (of a 50% smaller total), lookup_metamethod dropped from 6.4%
to negligible, and the Enum.map/Range.new/with_index cluster
that dominated the call path is gone.

Tests

  • All 1,692 tests pass (plus 55 doctests and 51 properties).
  • mix test --only lua53 count unchanged.
  • One bug found and fixed mid-iteration: the single-result return fast
    path needed to handle the results = [] case (function fell off end
    with no explicit return — Lua spec says missing returns yield nil),
    caught by the vararg.lua suite test.

Follow-ups (separate PR)

A separate chore(plans): PR will land five B-series plans scoping the
remaining structural work:

  • B4: Flat instruction stream + PC dispatch
  • B5: Compile prototypes to Erlang functions
  • B6: Direct table refs (kill per-access Map.fetch!)
  • B7: Array+hash split for table storage
  • B8: Inline Numeric.to_signed_int64 fast path

Those plans aim to close the remaining 1.13–1.35x gap on the harder
workloads.

Nine focused optimizations that cut interpreter overhead in half on
representative microbenchmarks and bring the gap to Luerl from
1.3-2x slower down to 0.95-1.35x (and faster on string concatenation).

Profile-driven changes, each measured before/after with tprof:

1. Numeric fast path for arithmetic and comparison opcodes. Numbers can
   never carry metatables in Lua, so the `try_binary_metamethod` wrapper
   that fired on every `+`/`-`/`*`/`==`/`<`/`<=`/etc was pure overhead.
   Two extra clauses per opcode head: int-int and number-number, falling
   through to the existing dispatch for non-numbers. Cuts fib(25) by
   ~12%. (Strings get the same treatment for `==`/`<`/`<=`/etc — they
   can't carry __eq either.)

2. Call-args staging without an intermediate list. Every Lua-to-Lua call
   was building an args list via `Enum.with_index |> Enum.reduce`. New
   `copy_args_to_regs/5` recursively moves caller regs into callee regs
   in place. `collect_args/3` still materializes a list for native
   callbacks and `__call`.

3. `Lua.VM.Table` `order_tail` field. The `order: order ++ [key]` on
   every brand-new insert was O(n), making `t = {} for i = 1,N do
   t[i]=v end` O(n²). Writes now prepend onto a separate `order_tail`
   (O(1)); `next_entry` merges lazily; `lua_next` flushes once per
   iteration to amortize. table_build+sum(n=500) drops from ~490μs to
   ~180μs (~63% faster).

4. Short-circuit `Map.reject(open_upvalues, ...)` when the map is
   empty. Every numeric/generic for iteration was paying for an empty-
   map walk. New `close_open_upvalues_at_or_above/2` guards on
   `map_size(ou) == 0`.

5. Fast-path `:get_field` and `:get_table` for trefs. When the table
   has the key directly (and either no metatable, or the key is
   present), skip the whole `index_value → table_index → normalize_key
   → Map.fetch! → Map.get` pipeline.

6. Fast-path `table_newindex/5` for tables with no metatable. Skips a
   `has_data?` lookup and the nested-case scrutiny that only the
   metamethod path needs.

7. Single-result `:return` fast path. Every `return f(...)` and `return
   n` hit a `for i <- 0..0` range comprehension; replaced with a direct
   `elem` read.

8. Fast-path string concatenation when both operands are binary or
   numeric. The string metatable is for `string.*` library access, not
   `__concat` dispatch — that lookup was wasted.

9. Single process-dict key for the source-position bridge. Native
   callback dispatch was doing six `Process.put`/`get`/`delete` ops per
   call (two each for line + source, separately). One tuple-valued key
   halves it.
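
The empty-map guard from item 4 might look like the sketch below. The function name and arity come from the description above, but the module name, the shape of `open_upvalues` (assumed here to map register index to upvalue cell), and the `idx >= reg` predicate are assumptions rather than the VM's actual code.

```elixir
defmodule UpvalueSketch do
  # Fast path: no open upvalues, so return the map untouched without
  # walking it. map_size/1 is allowed in guards.
  def close_open_upvalues_at_or_above(open_upvalues, _reg)
      when map_size(open_upvalues) == 0,
      do: open_upvalues

  # Slow path: drop every entry at or above the given register (assumed
  # semantics, inferred from the function name).
  def close_open_upvalues_at_or_above(open_upvalues, reg) do
    Map.reject(open_upvalues, fn {idx, _cell} -> idx >= reg end)
  end
end
```

Since most loop iterations capture no loop-local in a closure, the first clause is the one that runs in the common case.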

Results, microbenchmarks (median μs, perf/skip-metamethod-overhead vs
Luerl 1.5 on the same hardware):

  fib(25)            71μs vs 63μs  ratio 1.13x  (was ~1.57x)
  table_build+sum   179μs vs 141μs ratio 1.27x  (was ~1.50x)
  string_concat      40μs vs 43μs  ratio 0.95x  (was ~1.31x)
  oop               109μs vs  81μs ratio 1.35x  (was ~2.05x)

All 1692 tests pass (plus 55 doctests and 51 properties); the lua53
suite count is unchanged.
@davydog187 davydog187 merged commit 4f2894b into main May 21, 2026
4 checks passed
@davydog187 davydog187 deleted the perf/skip-metamethod-overhead branch May 21, 2026 20:31
davydog187 added a commit that referenced this pull request May 21, 2026
Co-authored-by: dave <dave@tvlabs.com>
davydog187 added a commit that referenced this pull request May 22, 2026
Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at
3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on
table_build. The real table-workload bottlenecks live inside
Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%)
and in :erlang.setelement (17.5% on table writes, 20.9% on OOP).
Those are B7's targets, not B6's.

B6's projected wall-clock win is now below 1%, inside benchee's
deviation band on every measured workload. Audit cleanup may still be
worth doing later as a refactor, but not as a perf plan and not
before B7.