chore(B4): defer plan; PC+elem dispatch is 9-15% slower than list-cons by davydog187 · Pull Request #232 · tv-labs/lua

davydog187 · 2026-05-22T01:43:34Z

B4: Flat instruction stream + PC dispatch — deferred

Plan: .agents/plans/B4-flat-instruction-stream.md

Mirrors the B7 deferral pattern (#231): the plan's core hypothesis was
falsified by a pre-flight spike, so we ship the writeup as a deferral
rather than a ~2,400-line rewrite that won't pay off.

The spike

Before committing to the full executor + codegen + loop-opcode rewrite,
a synthetic microbench compared the two dispatch shapes on identical
work (10,000 mixed opcodes, same tagged-tuple shape, same register
state — only the dispatch read changes):

| Dispatch | IPS | Mean | vs current |
| --- | --- | --- | --- |
| list-cons (current) | 13.86 K | 72.13 µs | baseline |
| pc+elem `case` (proposed) | 12.69 K | 78.83 µs | 1.09x slower |
| pc+elem multi-head | 12.10 K | 82.65 µs | 1.15x slower |

Memory: identical to three decimal places. Stable over multiple runs.
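
As a quick sanity check on the table, the "vs current" column is just each mean divided by the baseline mean, and IPS is the reciprocal of the mean. The snippet below only reworks the numbers reported above; the variable names are illustrative and not taken from the benchmark script.

```elixir
# Mean times in microseconds, copied from the table above.
baseline_us = 72.13
candidates = [{"pc+elem case", 78.83}, {"pc+elem multi-head", 82.65}]

# Slowdown factor relative to list-cons, rounded to two decimals.
slowdowns =
  for {name, mean_us} <- candidates do
    {name, Float.round(mean_us / baseline_us, 2)}
  end

# Iterations per second (in thousands) implied by the baseline mean.
baseline_kips = Float.round(1_000_000 / baseline_us / 1000, 2)

IO.inspect(slowdowns, label: "slowdown vs list-cons")
IO.inspect(baseline_kips, label: "baseline K IPS")
```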

Why

The tagged-tuple jump table is the same in both shapes — BEAM compiles
both into a jump on the tag of the matched tuple. The only difference
is the dispatch read itself:

- `[h | t]` is a single indirect load. The BEAM is heavily tuned for
  cons-list iteration; it is the native iteration idiom on the platform.
- `elem(instrs, pc)` is a bounds-checked indirect load plus integer
  arithmetic.

Cons-list iteration wins by 9-15% on raw dispatch.
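
For readers who have not seen the plan, the three dispatch shapes compared above can be sketched roughly as follows. This is a minimal illustration, not the spike's benchmark code: the module name, the three-opcode instruction set, and the single accumulator standing in for register state are all assumptions, and "multi-head" is interpreted here as dispatching through function clauses instead of a `case`.

```elixir
defmodule DispatchSketch do
  # Hypothetical opcodes {:add, n}, {:sub, n}, {:mul, n} acting on one accumulator.

  # Shape 1: list-cons. Matching [instr | rest] reads the instruction and
  # advances the stream in a single pattern match.
  def run_list([], acc), do: acc
  def run_list([{:add, n} | rest], acc), do: run_list(rest, acc + n)
  def run_list([{:sub, n} | rest], acc), do: run_list(rest, acc - n)
  def run_list([{:mul, n} | rest], acc), do: run_list(rest, acc * n)

  # Shape 2: pc+elem with `case`. The stream is a tuple; each step does a
  # bounds-checked elem/2 read and increments the program counter.
  def run_case(instrs, pc, acc) when pc >= tuple_size(instrs), do: acc

  def run_case(instrs, pc, acc) do
    case elem(instrs, pc) do
      {:add, n} -> run_case(instrs, pc + 1, acc + n)
      {:sub, n} -> run_case(instrs, pc + 1, acc - n)
      {:mul, n} -> run_case(instrs, pc + 1, acc * n)
    end
  end

  # Shape 3: pc+elem multi-head. Same read, but the tag is matched by
  # function clauses of a helper.
  def run_multi(instrs, pc, acc) when pc >= tuple_size(instrs), do: acc
  def run_multi(instrs, pc, acc), do: step(elem(instrs, pc), instrs, pc, acc)

  defp step({:add, n}, instrs, pc, acc), do: run_multi(instrs, pc + 1, acc + n)
  defp step({:sub, n}, instrs, pc, acc), do: run_multi(instrs, pc + 1, acc - n)
  defp step({:mul, n}, instrs, pc, acc), do: run_multi(instrs, pc + 1, acc * n)
end

program = [{:add, 5}, {:mul, 3}, {:sub, 4}]
flat = List.to_tuple(program)

list_result = DispatchSketch.run_list(program, 0)
case_result = DispatchSketch.run_case(flat, 0, 0)
multi_result = DispatchSketch.run_multi(flat, 0, 0)
```

All three compute the same result on the same tagged tuples; only the way the next instruction is fetched differs, which is the variable the spike isolated.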

The structural target is still correct — the shape is wrong

fib(22) baseline profile (main @ bc69a2e):

```
Lua.VM.Executor.do_execute/8 802388 50.98% self
:erlang.setelement/3 601788 25.49%
Lua.VM.Executor.do_frame_return/6 57313 5.96%
Lua.VM.Executor.copy_args_to_regs/5 114626 4.94%
Lua.VM.Numeric.to_signed_int64/1 85968 3.35%
```

`do_execute/8` is 51% of fib(22) self-time (the plan cited 43.6% from
a pre-PR-223 baseline; the proportional cost has grown). Attacking
dispatch is the right target; the proposed shape just doesn't help.

The plan's Risk #1 anticipated this exact outcome:

> If the post-merge profile shows no improvement (or worse, a
> regression), the structural change isn't paying for itself and
> B5 (Erlang functions) is the better lever.

That exit condition is met pre-merge.

Recommendation

Defer B4. The 51% do_execute/8 self-time should be attacked by B5
(compile instruction streams to native Erlang functions), which
collapses dispatch entirely into the BEAM's function-call mechanism —
the BEAM-tuned operation that beat every data-shape alternative we
tried.
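
To make the B5 idea concrete, here is one way "compile the instruction stream to a function" could look in miniature. It is a sketch under heavy assumptions: B5 has not been scoped, so the module names, the toy opcode set, and the choice of `Module.create/3` with quoted expressions are illustrative only and say nothing about how the real codegen would be built.

```elixir
defmodule CompileSketch do
  # Hypothetical: turn a list of {:add | :sub | :mul, n} instructions into a
  # module exposing run/1. The opcodes become straight-line arithmetic in the
  # function body, so no per-instruction dispatch remains at run time.
  def compile(module_name, instrs) do
    acc = Macro.var(:acc, __MODULE__)

    # Fold the instruction list into one nested quoted expression over `acc`.
    body =
      Enum.reduce(instrs, acc, fn
        {:add, n}, expr -> quote(do: unquote(expr) + unquote(n))
        {:sub, n}, expr -> quote(do: unquote(expr) - unquote(n))
        {:mul, n}, expr -> quote(do: unquote(expr) * unquote(n))
      end)

    contents =
      quote do
        def run(unquote(acc)), do: unquote(body)
      end

    Module.create(module_name, contents, Macro.Env.location(__ENV__))
    module_name
  end
end

# CompiledProgram is a made-up module name for this example.
mod = CompileSketch.compile(CompiledProgram, [{:add, 5}, {:mul, 3}, {:sub, 4}])
compiled_result = mod.run(0)
```

A real Lua function has branches, loops, and calls, so an actual B5 would need to map basic blocks to functions and jumps to calls; that is where integer entry points in the IR might become relevant.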

A future plan could revisit B4 as a structural prerequisite for B5
if codegen wants integer entry points in the IR. Success criteria
would change: the bar would be "B5 compiles cleanly from the new
layout," not "dispatch gets faster" (which is now disproven).

Changes

```
.agents/plans/B4-flat-instruction-stream.md | 92 ++++++++++++++++++++-
1 file changed, 90 insertions(+), 2 deletions(-)
```

Verification

```
mix compile --warnings-as-errors # passes (no code changes)
```

Plan-file-only change; no tests affected.

Out of scope (intentional)

- Implementing any of B4. The point of this PR is to stop the work
  before sunk cost begins.
- B5. That's its own plan, to be scoped separately once we decide to
  pick it up.


davydog187 merged commit 9c873ed into main May 22, 2026
4 checks passed

davydog187 deleted the chore/defer-b4 branch May 22, 2026 09:54

davydog187 mentioned this pull request May 22, 2026

chore(B4): defer plan; flat-tuple+PC dispatch did not beat list-cons #233

Closed
