perf(vm): dispatcher coverage for table opcodes and :numeric_for by davydog187 · Pull Request #275 · tv-labs/lua

davydog187 · 2026-05-28T12:13:01Z

Dispatcher table opcodes — make table-heavy workloads bypass the interpreter

Plan: .agents/plans/B5b-v2-dispatcher-tables.md
Closes #271

Goal

Extend Lua.VM.Dispatcher and Lua.Compiler.Bytecode to lower the
table opcode family plus :numeric_for. After this PR, all four
table_ops benchmarks and the closures orchestrator (run_closures)
compile end-to-end and stay out of the interpreter fallback path.

Success criteria

Performance

Mini-bench (mix run -e, 100–200 iter median, warmed):

| Workload | Dispatcher | Interpreter | Speedup |
| --- | --- | --- | --- |
| `run_table_build(500)` | 90 µs | 106 µs | 1.18x |
| `run_table_sum(500)` | 121 µs | 129 µs | 1.06x |
| `run_table_sum(1000)` | 254 µs | 289 µs | 1.13x |
| `run_table_map_reduce(500)` | 241 µs | 245 µs | 1.02x |
| `fib(22)` | 12.2 ms | 18.7 ms | 1.54x |

fib(22) lands at 1.54x, well above B5a-v2's 1.17x median for fib(25),
a sign that the table-opcode additions did not push the arithmetic
path off any inlining cliff.

Discoveries

- :call with result_count == 0 was needed. run_table_sort
  calls table.sort(t) at statement position, which lowers to
  {:call, _, _, 0, _}. Added @op_call_zero (reusing slot 25,
  freed by the removed @op_test_true). The dispatcher branch
  shares the same shape as :call_one, with a :discard sentinel
  in the frame's base slot signalling "throw the return value
  away."
- string_ops orchestrators do not fully compile. The
  issue described "orchestrators in closures and string_ops"
  as compiling. run_closures does. But both string_ops orchestrators
  end with return table.concat(...) / return string.format(...),
  which are multi-return shapes (:call with result_count = -1 plus
  :return_vararg) that are explicitly out of scope per the
  parent plan. Documented as a B5c-v2 follow-up.
- :break inside :numeric_for forces fallback. The
  interpreter's :break walks find_loop_exit/1 against the
  continuation stack. Reproducing that in the dispatcher requires
  mixing post-test markers with {:loop_exit, _} markers. That is the
  same machinery generic-for / while-loop / break all want, which
  is the domain of B5c-v2. The encoder rejects :numeric_for
  bodies containing :break upfront (walking recursively through
  nested :test branches) and the whole enclosing prototype falls
  back.
- The :cps_for continuation marker integrates cleanly. B5a's
  cont stack carried only {code, pc} resume points. Adding
  {:cps_for, base, loop_var, body_bc, code, pc + 1} markers
  expands finish_body/6 from two clauses to three. The marker
  stays on the stack across iterations (re-pushed when the body
  restarts), so nested numeric-fors compose naturally on the same
  stack.
- Soft perf target missed; hard floor met. setelement/3 +
  Table.put/3 allocation dominate the table workloads. The
  parent plan explicitly flagged this and ruled mutable storage
  out of scope; the next plan for closing the table-workload gap
  is mutable register/table storage, not more dispatch work.
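
To make the :cps_for discovery concrete, here is a toy model of a continuation stack that holds both plain resume points and loop markers. It is a minimal sketch, not the code in Lua.VM.Dispatcher: the module name CpsForSketch, the three toy instructions (:add, :block, :for), the map-based registers, the fixed step of 1, and the marker layout are all assumptions made for illustration. Resume points are modeled as "the remaining instructions" rather than {code, pc}.

```elixir
defmodule CpsForSketch do
  # Toy model only. Opcodes, marker layout and names are illustrative
  # assumptions, not the real dispatcher.

  # run/3 executes a list of instructions against a register map.
  # `cont` is the continuation stack.
  def run(code, regs, cont \\ [])

  # Current instruction list exhausted: consult the continuation stack.
  def run([], regs, cont), do: finish_body(regs, cont)

  # regs[dst] = regs[dst] + regs[src]
  def run([{:add, dst, src} | rest], regs, cont) do
    run(rest, Map.update!(regs, dst, &(&1 + Map.fetch!(regs, src))), cont)
  end

  # A nested block pushes a plain resume point.
  def run([{:block, body} | rest], regs, cont) do
    run(body, regs, [{:resume, rest} | cont])
  end

  # Numeric for with step 1. `rest` plays the role of the {code, pc + 1}
  # resume point stored inside the marker.
  def run([{:for, var, start, stop, body} | rest], regs, cont) do
    if start > stop do
      run(rest, regs, cont)
    else
      run(body, Map.put(regs, var, start), [{:cps_for, var, stop, body, rest} | cont])
    end
  end

  # Clause 1: nothing left to resume, execution is finished.
  defp finish_body(regs, []), do: regs

  # Clause 2: a plain resume point.
  defp finish_body(regs, [{:resume, rest} | cont]), do: run(rest, regs, cont)

  # Clause 3: a loop marker. Restart the body with the marker re-pushed,
  # or pop the marker and continue after the loop.
  defp finish_body(regs, [{:cps_for, var, stop, body, rest} = marker | cont]) do
    next = Map.fetch!(regs, var) + 1

    if next <= stop do
      run(body, Map.put(regs, var, next), [marker | cont])
    else
      run(rest, regs, cont)
    end
  end
end
```

In this toy, running `[{:for, :i, 1, 3, [{:for, :j, 1, 2, [{:add, :acc, :i}]}]}]` against `%{acc: 0, i: 0, j: 0}` leaves `acc` at 12: the inner marker sits above the outer one on the same stack, and neither loop needs to know about the other.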

Changes

 .agents/plans/B5b-v2-dispatcher-tables.md   |   ~233  (status: review, discoveries appended)
 lib/lua/compiler/bytecode.ex                |   +95  (7 new opcodes + :call_zero + accessors)
 lib/lua/vm/dispatcher.ex                    |  +231  (7 new branches + :cps_for + helpers)
 lib/lua/vm/executor.ex                      |   +73  (6 dispatcher_* bridges)
 test/lua/compiler/bytecode_test.exs         |   ~54  (flip 2 fallback tests, add 3)
 test/lua/vm/dispatcher_test.exs             |  +306  (21 new goldens: 7 table, 6 numeric_for,
                                              |        4 table_ops shapes, 1 :call_zero, plus
                                              |        the test_ops setup block)

Verification

mix format                                            ✓
mix compile --warnings-as-errors                      ✓
mix test                                              ✓ 1902 tests, 0 failures
mix test --only lua53                                 ✓ 29 tests, 0 failures
mix test test/lua/vm/dispatcher_test.exs              ✓ 48 tests (27 → 48)
mix test test/lua/compiler/bytecode_test.exs          ✓ 15 tests (14 → 15)
mix test test/lua/vm/leak_regression_test.exs         ✓ 3 tests

Out of scope (intentional)

- :closure opcode and varargs → B5c-v2.
- Multi-return :call (result_count = -1) and :return_vararg
  → still B5c-v2. This is why string_ops orchestrators
  don't fully compile.
- :while_loop, :repeat_loop, :generic_for → defer to a future
  plan (not blockers for table_ops).
- :break inside :numeric_for → fallback for now. The
  loop-exit continuation machinery lands with the rest of B5c-v2.
- :set_list with {:multi, _} → fallback.
- Mutable table storage → its own follow-up plan if the table-workload
  gap matters.
- Line attribution for table errors → B5d-v2. All new bridges
  pass line: 0.
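
The upfront rejection of :break inside :numeric_for can be pictured as a recursive scan over the loop body. The following is a minimal sketch under assumed instruction shapes: the module name BreakScanSketch, the `{:test, reg, then_body, else_body}` and `{:numeric_for, base, var, body}` tuples, and the `:fallback` return value are hypothetical and do not reproduce the real encoder in Lua.Compiler.Bytecode. The toy only looks through :test branches and does not descend into other nested constructs.

```elixir
defmodule BreakScanSketch do
  # Illustrative only: instruction shapes here are assumptions, not the real IR.

  # True if `body` contains a :break, looking through nested
  # {:test, reg, then_body, else_body} branches.
  def contains_break?(body) when is_list(body), do: Enum.any?(body, &break_in?/1)

  defp break_in?(:break), do: true

  defp break_in?({:test, _reg, then_body, else_body}),
    do: contains_break?(then_body) or contains_break?(else_body)

  defp break_in?(_other), do: false

  # Encoder decision for a numeric for: lower it, or signal that the whole
  # enclosing prototype should stay on the interpreter.
  def encode_numeric_for({:numeric_for, _base, _var, body} = instr) do
    if contains_break?(body), do: :fallback, else: {:ok, instr}
  end
end
```

Deciding at encode time keeps the dispatcher itself free of loop-exit handling: a prototype is either lowered in full or left to the interpreter.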

Extend the v2 dispatcher and bytecode encoder to cover the table opcode family (:new_table, :get_table, :set_table, :set_field, :set_list non-multi-return form, :length), :numeric_for with a new :cps_for continuation marker, and :call with result_count == 0 (the statement-call form like table.sort(t)).

After this PR all four table_ops benchmarks compile end-to-end and the run_closures orchestrator joins them; only multi-return shapes (return f(...), :return_vararg) still keep workloads on the interpreter.

Mini-bench: fib(22) 1.54x interpreter (up from 1.17x in B5a-v2), run_table_build(500) 1.18x, run_table_sum(1000) 1.13x, run_table_map_reduce(500) 1.02x. The soft target (>=1.5x on table_sum) is not met because Table.put/3 allocation churn and setelement/3 register writes dominate the table workloads; both were flagged as out of scope by the parent plan. The hard floor (no regression) is met across the board.

The dispatcher delegates the slow paths to new Executor.dispatcher_* bridges (dispatcher_get_table, dispatcher_set_table, dispatcher_set_field, dispatcher_length, dispatcher_coerce_numeric_for_controls, dispatcher_close_open_upvalues_at_or_above), each wrapping the existing defp helpers so metamethod fidelity matches the interpreter for free.

Plan: B5b-v2
Closes #271

- Tighten `:set_list` encoder guard to `count > 0`. The literal-count zero shape is the interpreter's multi-return sentinel (consumes `state.multi_return_count` trailing values); codegen never emits it from a literal constructor today, but encoding it as a no-op would silently diverge if that ever changed. Two new tests pin the fallback contract for both `count == 0` and `{:multi, _}`.
- Fix misleading comment about step == 0 in `:numeric_for` body completion. Neither the dispatcher nor the interpreter implements PUC-Lua's runtime check; both infinite-loop on step == 0. Parity is preserved; the comment now describes it accurately.
- Alias `Lua.VM.Table` at the top of `Lua.VM.Dispatcher` and use the short `Table.put/3` form in `set_list_into_table/6`, matching the other sibling-module aliases.
- Tighten `@spec` on `dispatcher_set_table` / `dispatcher_set_field` to `State.t() | no_return()` (the non-tref clauses always raise).
- Document the unused `_proto` parameter on `dispatcher_length` as forward-compat for B5d-v2 error attribution. The other table bridges already thread `proto.source` through their slow paths; `__len` does not yet, but will when error positions land for compiled prototypes.
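
A guard of the kind described in the first bullet might look like the sketch below. Everything here is hypothetical: the module name SetListGuardSketch, the `{:set_list, table_reg, start, count}` tuple shape, the `:op_set_list` tag, and the `:fallback` return value are stand-ins, whereas the real contract is that `Bytecode.compile/1` yields `bytecode == nil`. One detail worth noting is that Elixir's term ordering makes any tuple compare greater than any number, so a bare `count > 0` guard would also accept `{:multi, _}`; the sketch therefore adds `is_integer/1`.

```elixir
defmodule SetListGuardSketch do
  # Hypothetical encoder clauses; tuple shapes and return values are assumptions.

  # Only a literal, strictly positive count is lowered. is_integer/1 matters:
  # in Elixir's term ordering {:multi, n} > 0 is true, so `count > 0` alone
  # would let the multi-return form through.
  def encode({:set_list, table_reg, start, count})
      when is_integer(count) and count > 0 do
    {:ok, {:op_set_list, table_reg, start, count}}
  end

  # count == 0 (the multi-return sentinel) and {:multi, _} are not lowered.
  def encode({:set_list, _table_reg, _start, _count}), do: :fallback
end
```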

davydog187 · 2026-05-28T12:47:24Z

Addressed the review feedback in 75e616d. Summary:

| # | Issue | Fix |
| --- | --- | --- |
| 1 | Misleading `step == 0` comment in `:numeric_for` | Rewrote to accurately note that both executors infinite-loop on `step == 0`: parity preserved, no PUC-style runtime check on either path. |
| 2 | Latent encoder/dispatcher divergence on `:set_list count == 0` | Tightened the encoder guard to `count > 0`. Added two regression tests (`count == 0` and `{:multi, _}`) that build synthetic prototypes and assert `Bytecode.compile/1` returns `bytecode == nil`. Pins the contract so a future codegen change can't silently diverge. |
| 3 | Inconsistent aliasing of `Lua.VM.Table` | Added `alias Lua.VM.Table` at the top; `set_list_into_table/6` now uses the short `Table.put/3` form. |
| 4 | `@spec` overstates return for `dispatcher_set_table`/`dispatcher_set_field` | Tightened to `State.t() \| no_return()`. |
| 5 | Unused `_proto` on `dispatcher_length` | Added a comment documenting forward-compat intent for B5d-v2 error attribution. The other table bridges already thread `proto.source`; `__len` will join them when error positions land. |
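
For item 4, the shape of a function whose spec reads `state | no_return()` can be illustrated as follows. This is a sketch under stated assumptions: NoReturnSpecSketch, its plain-map state, and the error message are invented for the example and are not the real `Executor.dispatcher_set_field`.

```elixir
defmodule NoReturnSpecSketch do
  # Illustrative stand-in for a bridge function; the state layout is an assumption.
  @type state :: map()

  # The {:tref, id} clause returns an updated state; every other clause raises,
  # which is what the `| no_return()` part of the spec documents.
  @spec set_field(state(), term(), String.t(), term()) :: state() | no_return()
  def set_field(state, {:tref, id}, key, value) do
    tables = Map.get(state, :tables, %{})
    table = Map.get(tables, id, %{})
    Map.put(state, :tables, Map.put(tables, id, Map.put(table, key, value)))
  end

  def set_field(_state, other, key, _value) do
    raise ArgumentError, "attempt to index a #{inspect(other)} value (field '#{key}')"
  end
end
```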

Verification:

mix format --check-formatted                         ✓
mix compile --warnings-as-errors                     ✓
mix test                                             ✓ 1902 → 1904 tests, 0 failures
mix test --only lua53                                ✓ 29 tests, 0 failures
mix test test/lua/compiler/bytecode_test.exs         ✓ 17 → 19 tests
mix test test/lua/vm/dispatcher_test.exs             ✓ 48 tests
mix test test/lua/vm/leak_regression_test.exs        ✓ 3 tests

davydog187 · 2026-05-28T12:50:51Z

Benchmarks

dave@dave-mac-mini ~/code/tvlabs/lua (perf/dispatcher-tables) $ mix lua.bench
==> closures
Compiling 16 files (.ex)
Generated lua app
Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) ...
Benchmarking lua (chunk) ...
Benchmarking lua (eval) ...
Benchmarking luerl ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
C Lua (luaport)       32.86 K       30.44 μs    ±16.72%          29 μs       43.13 μs
luerl                  2.62 K      381.93 μs     ±4.79%      379.50 μs      442.03 μs
lua (chunk)            2.38 K      420.66 μs     ±5.42%      415.88 μs      493.72 μs
lua (eval)             2.35 K      426.12 μs     ±5.49%      421.29 μs      498.47 μs

Comparison:
C Lua (luaport)       32.86 K
luerl                  2.62 K - 12.55x slower +351.50 μs
lua (chunk)            2.38 K - 13.82x slower +390.23 μs
lua (eval)             2.35 K - 14.00x slower +395.69 μs
==> dispatcher_vs_interpreter

--- closure tags ---
dispatcher: :compiled_closure
interpreter: :lua_closure

fib(20) dispatcher=6765 interpreter=6765 match=true

Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 8 s
Excluding outliers: false

Benchmarking dispatcher fib(25) ...
Benchmarking interpreter fib(25) ...
Calculating statistics...
Formatting results...

Name                          ips        average  deviation         median         99th %
dispatcher fib(25)          18.46       54.17 ms     ±1.31%       53.88 ms       56.02 ms
interpreter fib(25)         13.35       74.89 ms     ±1.13%       74.74 ms       76.68 ms

Comparison:
dispatcher fib(25)          18.46
interpreter fib(25)         13.35 - 1.38x slower +20.72 ms
==> fibonacci
Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) ...
Benchmarking lua (chunk) ...
Benchmarking lua (eval) ...
Benchmarking luerl ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
C Lua (luaport)         37.53       26.65 ms     ±1.32%       26.58 ms       28.03 ms
lua (eval)               1.63      614.51 ms     ±0.68%      614.71 ms      619.56 ms
lua (chunk)              1.62      617.95 ms     ±0.70%      617.21 ms      625.23 ms
luerl                    1.40      715.25 ms     ±1.30%      710.93 ms      725.36 ms

Comparison:
C Lua (luaport)         37.53
lua (eval)               1.63 - 23.06x slower +587.86 ms
lua (chunk)              1.62 - 23.19x slower +591.31 ms
luerl                    1.40 - 26.84x slower +688.60 ms
==> helpers
==> oop
Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) ...
Benchmarking lua (chunk) ...
Benchmarking lua (eval) ...
Benchmarking luerl ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
C Lua (luaport)       33.61 K       29.76 μs    ±14.35%       28.92 μs       39.67 μs
luerl                  8.16 K      122.51 μs    ±17.80%      115.88 μs      199.08 μs
lua (chunk)            7.68 K      130.28 μs    ±20.63%      122.46 μs      243.83 μs
lua (eval)             7.32 K      136.61 μs    ±18.15%      129.96 μs      231.50 μs

Comparison:
C Lua (luaport)       33.61 K
luerl                  8.16 K - 4.12x slower +92.75 μs
lua (chunk)            7.68 K - 4.38x slower +100.53 μs
lua (eval)             7.32 K - 4.59x slower +106.85 μs
==> string_ops

=== String Concatenation via table.concat (n=100) (mode: quick) ===

Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) ...
Benchmarking lua (chunk) ...
Benchmarking lua (eval) ...
Benchmarking luerl ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
C Lua (luaport)       55.68 K       17.96 μs    ±19.29%       17.38 μs       24.42 μs
luerl                 24.73 K       40.44 μs     ±8.25%       40.48 μs       53.21 μs
lua (chunk)           22.75 K       43.96 μs    ±16.99%       46.67 μs       56.51 μs
lua (eval)            20.78 K       48.13 μs    ±16.12%       49.63 μs       64.17 μs

Comparison:
C Lua (luaport)       55.68 K
luerl                 24.73 K - 2.25x slower +22.48 μs
lua (chunk)           22.75 K - 2.45x slower +26.00 μs
lua (eval)            20.78 K - 2.68x slower +30.17 μs

=== String Formatting via string.format (n=100) (mode: quick) ===

Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) ...
Benchmarking lua (chunk) ...
Benchmarking lua (eval) ...
Benchmarking luerl ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
C Lua (luaport)       30.07 K       33.25 μs     ±6.85%       32.25 μs       39.84 μs
luerl                  9.58 K      104.40 μs    ±10.41%      103.13 μs      138.01 μs
lua (chunk)            5.43 K      184.13 μs    ±16.87%      175.21 μs      307.68 μs
lua (eval)             5.32 K      187.88 μs    ±16.14%      179.13 μs      304.97 μs

Comparison:
C Lua (luaport)       30.07 K
luerl                  9.58 K - 3.14x slower +71.14 μs
lua (chunk)            5.43 K - 5.54x slower +150.88 μs
lua (eval)             5.32 K - 5.65x slower +154.62 μs
==> table_ops

=== Table Build (mode: quick) ===

Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: medium (n=100)
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) with input medium (n=100) ...
Benchmarking lua (chunk) with input medium (n=100) ...
Benchmarking lua (eval) with input medium (n=100) ...
Benchmarking luerl with input medium (n=100) ...
Calculating statistics...
Formatting results...

##### With input medium (n=100) #####
Name                      ips        average  deviation         median         99th %
C Lua (luaport)      105.28 K        9.50 μs    ±21.11%        9.21 μs       13.54 μs
lua (chunk)           63.07 K       15.85 μs     ±9.24%       15.92 μs       19.92 μs
luerl                 54.89 K       18.22 μs    ±14.61%          18 μs       24.79 μs
lua (eval)            54.61 K       18.31 μs    ±12.32%       18.08 μs       25.67 μs

Comparison:
C Lua (luaport)      105.28 K
lua (chunk)           63.07 K - 1.67x slower +6.36 μs
luerl                 54.89 K - 1.92x slower +8.72 μs
lua (eval)            54.61 K - 1.93x slower +8.81 μs

=== Table Sort (mode: quick) ===

Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: medium (n=100)
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) with input medium (n=100) ...
Benchmarking lua (chunk) with input medium (n=100) ...
Benchmarking lua (eval) with input medium (n=100) ...
Benchmarking luerl with input medium (n=100) ...
Calculating statistics...
Formatting results...

##### With input medium (n=100) #####
Name                      ips        average  deviation         median         99th %
C Lua (luaport)       55.81 K       17.92 μs    ±14.53%       17.54 μs       22.63 μs
luerl                 38.13 K       26.23 μs    ±36.19%       21.21 μs       46.04 μs
lua (eval)            27.94 K       35.79 μs    ±11.16%       35.54 μs       52.04 μs
lua (chunk)           26.83 K       37.27 μs    ±19.89%       34.38 μs       56.38 μs

Comparison:
C Lua (luaport)       55.81 K
luerl                 38.13 K - 1.46x slower +8.31 μs
lua (eval)            27.94 K - 2.00x slower +17.87 μs
lua (chunk)           26.83 K - 2.08x slower +19.35 μs

=== Table Iterate/Sum (mode: quick) ===

Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: medium (n=100)
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) with input medium (n=100) ...
Benchmarking lua (chunk) with input medium (n=100) ...
Benchmarking lua (eval) with input medium (n=100) ...
Benchmarking luerl with input medium (n=100) ...
Calculating statistics...
Formatting results...

##### With input medium (n=100) #####
Name                      ips        average  deviation         median         99th %
C Lua (luaport)      104.66 K        9.56 μs    ±20.41%        9.33 μs       13.29 μs
lua (chunk)           44.04 K       22.70 μs    ±17.59%       21.79 μs       40.25 μs
lua (eval)            39.91 K       25.06 μs    ±13.24%       24.75 μs       35.25 μs
luerl                 35.65 K       28.05 μs     ±7.87%          28 μs       34.13 μs

Comparison:
C Lua (luaport)      104.66 K
lua (chunk)           44.04 K - 2.38x slower +13.15 μs
lua (eval)            39.91 K - 2.62x slower +15.50 μs
luerl                 35.65 K - 2.94x slower +18.50 μs

=== Table Map + Reduce (mode: quick) ===

Operating System: macOS
CPU Information: Apple M4
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.19.4
Erlang 27.3.4.7
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 3 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: medium (n=100)
Estimated total run time: 16 s
Excluding outliers: false

Benchmarking C Lua (luaport) with input medium (n=100) ...
Benchmarking lua (chunk) with input medium (n=100) ...
Benchmarking lua (eval) with input medium (n=100) ...
Benchmarking luerl with input medium (n=100) ...
Calculating statistics...
Formatting results...

##### With input medium (n=100) #####
Name                      ips        average  deviation         median         99th %
C Lua (luaport)       88.87 K       11.25 μs    ±20.19%       11.29 μs       15.38 μs
lua (eval)            20.87 K       47.91 μs    ±13.73%       46.58 μs          69 μs
lua (chunk)           19.37 K       51.63 μs    ±18.05%       52.63 μs       73.50 μs
luerl                 16.70 K       59.88 μs    ±29.83%          52 μs      113.21 μs

Comparison:
C Lua (luaport)       88.87 K
lua (eval)            20.87 K - 4.26x slower +36.66 μs
lua (chunk)           19.37 K - 4.59x slower +40.38 μs
luerl                 16.70 K - 5.32x slower +48.63 μs

davydog187 added 4 commits May 28, 2026 04:58

chore(B5b-v2): start plan

25410bb

chore(B5b-v2): record PR #275, discoveries, and what-changed

43e83c0

davydog187 merged commit 262b9be into main May 28, 2026
5 checks passed

davydog187 deleted the perf/dispatcher-tables branch May 28, 2026 12:57
