chore(B4): defer plan; flat-tuple+PC dispatch did not beat list-cons #233

Closed
davydog187 wants to merge 1 commit into main from chore/defer-b4

@davydog187
Contributor

Defer B4: flat-tuple + PC dispatch did not beat list-cons head-match

Updates .agents/plans/B4-flat-instruction-stream.md from ready to deferred, adding the measurement data and design notes from a complete-but-not-shipped implementation.

Why we tried

The plan's hypothesis: replace the current [head | rest] list-cons dispatch in Lua.VM.Executor.do_execute/8 with a case on :erlang.element(pc + 1, instrs), with all nested bodies (:test then/else, while/repeat/for loop bodies) lifted into a single flat instruction tuple by a compile-time pass. Stretch target: 20% reduction in fib(25) median. Floor: no workload regresses by more than 2%.
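For readers unfamiliar with the two dispatch shapes being compared, here is a minimal sketch on a toy three-opcode stack machine. The module name, opcodes, and function names are all hypothetical and chosen for illustration; this is not the code in Lua.VM.Executor, which has 64 clauses and an 8-argument do_execute.

```elixir
defmodule DispatchSketch do
  # Hypothetical toy VM; NOT the real Lua.VM.Executor.

  # Shape 1: list-cons dispatch. Each clause pattern-matches the head
  # instruction and recurses on the tail of the list.
  def run_list([{:push, n} | rest], stack), do: run_list(rest, [n | stack])
  def run_list([:add | rest], [b, a | stack]), do: run_list(rest, [a + b | stack])
  def run_list([:halt | _rest], [top | _]), do: top

  # Shape 2: flat tuple + program counter. A single function with one case;
  # the instruction tuple is threaded through every recursive call.
  def run_tuple(instrs, pc, stack) do
    case :erlang.element(pc + 1, instrs) do
      {:push, n} ->
        run_tuple(instrs, pc + 1, [n | stack])

      :add ->
        [b, a | rest] = stack
        run_tuple(instrs, pc + 1, [a + b | rest])

      :halt ->
        hd(stack)
    end
  end
end

program = [{:push, 2}, {:push, 3}, :add, :halt]
DispatchSketch.run_list(program, [])
DispatchSketch.run_tuple(List.to_tuple(program), 0, [])
```

Both calls evaluate the same program; the only difference is how the next instruction is fetched and matched.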

What I built

Full end-to-end on a throwaway branch:

  • Lua.Compiler.Linearize — lifts every nested-body opcode into the flat top-level stream, with :test_pc, :goto_pc, :while_test_pc, :numeric_for_step_pc, etc. as explicit jump opcodes. Labels and :break resolve to PCs at compile time, so find_label/2 and find_loop_exit/1 become dead code.
  • Lua.Compiler.Prototype — gained a labels field and changed instructions from list to tuple().
  • Lua.VM.Executor.do_execute/8 — rewritten from 64 list-cons defp clauses into a single function whose body is one case with one arm per opcode. The continuation stack (cont) was removed entirely; CPS frames for loops became :numeric_for_step_pc / :generic_for_step_pc opcodes emitted by the linearizer.
  • A new synthetic opcode :end_of_function was added so the linearizer could safely append a fall-off-end terminator (which yields zero results, distinct from explicit return yielding a single nil).
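The lifting step can be illustrated with a minimal sketch, assuming a nested input form {:test, reg, then_body, else_body} and zero-based PCs. The module name, the input shape, and the field layout of :test_pc and :goto_pc are assumptions made for this example; the actual Lua.Compiler.Linearize output may differ. Here :test_pc is taken to mean "jump to the given PC when the test fails".

```elixir
defmodule LinearizeSketch do
  # Hypothetical sketch; NOT the real Lua.Compiler.Linearize.
  # Input: a list where {:test, reg, then_body, else_body} nests two bodies
  # and every other term is treated as a leaf instruction.

  def linearize(instrs) do
    {flat, _next_pc} = lift(instrs, 0)
    # Append the fall-off-end terminator and freeze into a tuple.
    List.to_tuple(flat ++ [:end_of_function])
  end

  # Returns {flat_instructions, pc_of_the_slot_after_them}.
  defp lift([], pc), do: {[], pc}

  defp lift([{:test, reg, then_body, else_body} | rest], pc) do
    # Layout: test_pc | then... | goto_pc | else... | rest...
    {then_flat, after_then} = lift(then_body, pc + 1)
    # The goto_pc that ends the then-branch occupies one slot.
    else_pc = after_then + 1
    {else_flat, end_pc} = lift(else_body, else_pc)
    {rest_flat, next_pc} = lift(rest, end_pc)

    flat =
      [{:test_pc, reg, else_pc}] ++
        then_flat ++ [{:goto_pc, end_pc}] ++ else_flat ++ rest_flat

    {flat, next_pc}
  end

  defp lift([instr | rest], pc) do
    {rest_flat, next_pc} = lift(rest, pc + 1)
    {[instr | rest_flat], next_pc}
  end
end

LinearizeSketch.linearize([{:test, :r0, [:a], [:b]}, :c])
```

Because every jump target is computed during this pass, the executor never has to search for a label or loop exit at run time, which is the property the first bullet describes.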

All 1705 tests + 29 lua53 suite tests passed under the rewrite. The correctness work is sound.

Why we closed it

| workload | main | B4 | delta |
| --- | --- | --- | --- |
| fib(30) chunk | ~850 ms | ~875 ms | +3% ⚠️ |
| OOP n=50 | 137 µs | 137 µs | flat |
| Table Build n=100 | 17.33 µs | 16.44 µs | -5% |
| Table Sort n=100 | 34.83 µs | 36.24 µs | +4% |
| Table Iterate | 24.17 µs | 23.01 µs | -5% |
| Table Map+Reduce | ~50 µs | 49.06 µs | -2% |

fib(30) regressed by about 3%, past the plan's 2% floor.

Profile confirms the structural change didn't matter: do_execute/8 self-time was 50.64% under B4 vs 50.83% on main. Essentially unchanged.

The plan's risks section anticipated this:

If the post-merge profile shows no improvement (or worse, a regression), the structural change isn't paying for itself and B5 (Erlang functions) is the better lever.

On the BEAM concretely: a [head | rest] pattern match destructures the list head and tail in a single op, whereas case :erlang.element(pc + 1, instrs) do takes two (element fetch, then case discrimination). Threading instrs through every recursive call also adds register pressure. The hoped-for jump-table optimization on the case did not produce a net win against the optimized list-cons path.

What's preserved

The implementation isn't entirely throwaway. If/when B5 (compile prototypes to Erlang functions) starts, the linearizer design can be reintroduced only at compile time — feeding the B5 codegen with flat bytecode, leaving the runtime executor on its proven list-cons path.

Changes

 .agents/plans/B4-flat-instruction-stream.md | 94 +++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 2 deletions(-)

Only the plan file changes; no library code touched.

Verification

mix format
mix compile --warnings-as-errors
mix test    # 1705 tests pass (unchanged)

Implemented the full B4 spec on a throwaway branch:
- Lua.Compiler.Linearize lifts nested bodies into flat bytecode
- Prototype.instructions becomes a tuple with a labels map
- do_execute/8 rewritten to PC-dispatch via case-on-elem
- CPS continuation stack and find_label/find_loop_exit removed

All 1705 tests + 29 lua53 suite tests passed. Closed unmerged because:

- fib(30) chunk: ~850ms (main) -> ~875ms (B4), +3% regression
- do_execute self-time: 50.83% (main) -> 50.64% (B4), unchanged
- Table workloads: mixed +/-5%, within deviation bands

The plan's risks section anticipated this outcome explicitly. On the
BEAM, list-cons head-match destructures head+tail in one op, while
case-on-elem is two ops (fetch + discriminate). The hoped-for jump-
table optimization did not produce a net win, and threading instrs
through every tail call added register pressure.

Plan file documents the full implementation findings and the
conditions under which the work could be reopened. The linearizer
design will be reused by B5 (compile to Erlang) as a compile-time
preparation step, without touching the runtime executor.
@davydog187
Contributor Author

Closing as superseded by #232.

#232 merged first with a cleaner approach: an isolated dispatch microbenchmark (10k-instruction synthetic stream, three shape variants) that falsified the hypothesis pre-implementation. The plan file on main already reflects the deferral with that evidence.
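To give a rough idea of what an isolated dispatch microbenchmark can look like, here is a minimal sketch. It is not the benchmark from #232: the module name, the single {:add, k} opcode, and the choice of only two shapes (rather than three) are assumptions made for this example, and a single :timer.tc run is far too noisy to draw conclusions from.

```elixir
defmodule DispatchBench do
  # Hypothetical harness; NOT the microbenchmark merged in #232.

  # Build a synthetic stream of n {:add, 1} instructions plus a terminator.
  def build(n), do: List.duplicate({:add, 1}, n) ++ [:halt]

  # List-cons shape: match the head, recurse on the tail.
  def run_list([{:add, k} | rest], acc), do: run_list(rest, acc + k)
  def run_list([:halt | _], acc), do: acc

  # Tuple + PC shape: fetch by index, then discriminate with case.
  def run_tuple(instrs, pc, acc) do
    case :erlang.element(pc + 1, instrs) do
      {:add, k} -> run_tuple(instrs, pc + 1, acc + k)
      :halt -> acc
    end
  end

  # Returns {microseconds, result} per shape, as produced by :timer.tc/1.
  def compare(n \\ 10_000) do
    list = build(n)
    tuple = List.to_tuple(list)

    %{
      list_cons: :timer.tc(fn -> run_list(list, 0) end),
      tuple_pc: :timer.tc(fn -> run_tuple(tuple, 0, 0) end)
    }
  end
end

DispatchBench.compare()
```

A usable measurement would repeat each run many times after a warm-up and compare medians; the point of the sketch is only that the two dispatch loops can be timed in isolation, without the rest of the executor.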

My branch implemented the full executor rewrite end-to-end (all 1705 tests + 29 lua53 suite tests passed) and measured the regression on real workloads (+3% fib(30)). That's additional confirming evidence — same conclusion, different angle — but it doesn't change the decision, and the plan is already in the right state on main.

The implementation findings live in this conversation; the Lua.Compiler.Linearize design and the :end_of_function sentinel insight will be relevant if B5 (compile prototypes to Erlang) starts and needs a flat-bytecode compile-time preparation step.

@davydog187 closed this May 22, 2026