Skip to content

fix(vm): keep heap effects across protected-call error unwinding#333

Merged
davydog187 merged 12 commits into
mainfrom
fix/pcall-state-preservation
Jun 4, 2026
Merged

fix(vm): keep heap effects across protected-call error unwinding#333
davydog187 merged 12 commits into
mainfrom
fix/pcall-state-preservation

Conversation

@davydog187
Copy link
Copy Markdown
Contributor

@davydog187 davydog187 commented Jun 4, 2026

Closes #331.

Problem

When a function called via pcall/xpcall (or Lua.call_function/3 from Elixir) raised an error, every mutation made before the error — global writes, table field updates, upvalue assignments, metatable changes — was rolled back. Reference Lua keeps those heap effects and only unwinds control state (Lua 5.3 §2.3):

$ lua -e 'x = 1; pcall(function() x = 2; error("boom") end); print(x)'
2

Root cause: all VM state threads through the immutable %Lua.VM.State{} struct, and errors are Elixir exceptions that carried no state — so a protected call's rescue only had its stale entry snapshot.

Fix

VM exceptions (RuntimeError, TypeError, AssertionError, ArgumentError) gain a :state field carrying the raise-time state. Protected boundaries recover it via the new State.unwind_to/2: heap fields (tables, userdata, metatables, upvalue cells, private) come from the raise-time snapshot; control fields (call stack, call depth, open upvalues) restore from the entry state.

State is attached in layers so no raise path is missed, with zero happy-path overhead:

  1. Raise sites with state in scopeerror(), assert(), stack overflow, call-type errors (both engines) annotate directly.
  2. Native-call choke point — both native invocation sites reraise escaping VM exceptions annotated with the call's entry state, covering every stdlib bad-argument check without touching ~200 raise sites. Innermost annotation wins.
  3. In-frame pure helpers — arithmetic/comparison/concat/index/bitwise/length/numeric-for helpers now take the caller's threaded state (one extra already-bound argument; raise opts only built on the error path), so x = 2; return nil + 1 under pcall keeps x = 2.
  4. Frame-boundary backstop — frame entries in both engines annotate-if-nil, bounding any future missed raise site to losing only the innermost frame's in-frame mutations. One try per frame entry; the executor loop's tail recursion is untouched.

The xpcall message handler now runs against the recovered state, so it observes the mutations (matching PUC-Lua). Lua.call_function/3's {:error, reason, lua} likewise returns the recovered state. VM exceptions derive Inspect excluding :state so logging a trapped error never dumps the VM heap.

TDD

The first commit is a red 37-case matrix (test/lua/vm/pcall_state_preservation_test.exs) pinning reference semantics — global/table/upvalue/metatable mutations crossed with error(), assert, runtime type errors, stdlib bad-argument errors, stack overflow, nested pcall, xpcall (including an erroring handler), and the Elixir API — each run under both engines (dispatcher and bytecode-stripped interpreter). Expected values were cross-checked against PUC Lua. Each subsequent commit turns a slice green; control tests pin that no-mutation and success paths are unchanged.

Verification

  • mix test: 2190 passed, 0 failures (includes the official Lua 5.3 suite tests)
  • Issue repro now returns [2, false, "boom"] (was [1, false, "boom"])
  • Focused benchmark vs main (benchmarks/scripts/pcall_state_bench.exs): fib(26) −0.2%, table-write 200k −6.6%, concat 5k +1.2%, pcall loop 50k +0.4% — all within noise

Notes

@davydog187
Copy link
Copy Markdown
Contributor Author

Automated multi-agent review

This is an automated multi-agent review of PR #333 (fix(vm): keep heap effects across protected-call error unwinding, closes #331). A review-fix loop ran 4 rounds; each finding below was independently confirmed by multiple voting agents and then fixed on this branch. The loop hit its max round count without reaching full stability, so a residual class of edge cases may remain (see status note at the end).

Scope note: error-value passthrough / stringification (issue #334) was explicitly out of scope and not reviewed.

Confirmed findings (all fixed on this branch)

[HIGH] Native generic-for iterator raise drops current-frame heap mutations
lib/lua/vm/executor.ex:2140
The :native_func clause of call_value/5 (the choke point for generic-for iterators) wrapped the native call in try/after for position restore but had no state-annotating rescue. A VM exception raised by a native iterator (e.g. utf8.codes on invalid UTF-8, or next/pairs) escaped with :state == nil and was only caught by an outer frame backstop that annotated it with the frame's entry state, rolling back all heap mutations made before/inside the loop. Asymmetric with the :call opcode native path which annotated correctly.
Fixed in 0056ef1fix(vm): annotate native iterator raises with call-time state (added reraise annotate_frame_state(e, state) mirroring the :call path).

[HIGH] deflua/set! user-function error tuple drops heap mutations under pcall
lib/lua.ex:1086 (and lib/lua.ex:277)
When a deflua-style module function (or set!/3 user function) returned {:error, reason, %Lua{state: mutated}}, the wrapper raised RuntimeError, value: reason without attaching returned_lua.state. The only annotation that fired was the native-call choke point using the pre-call entry state, so under pcall/xpcall the user function's heap mutations were rolled back. Same class as the gsub finding (callback mutates state, then we raise without ferrying post-callback state).
Fixed in 48c98f7fix(lua): ferry state out of user-function error tuples under pcall (raises now carry state: returned_lua.state).

[MEDIUM] gsub function-replacement raise drops callback heap mutations
lib/lua/vm/stdlib/pattern.ex:237
string.gsub threads state through a function replacement, but when the callback returned an invalid value (e.g. a table), apply_replacement/4 raised RuntimeError with no :state, discarding the post-callback state in scope. The native-call boundary then refilled nil with the pre-callback entry snapshot, rolling back the callback's heap effects. Repro: x=1; pcall(function() return string.gsub('ab','.',function(c) x=2; return {} end) end); return x yielded x==1 (reference Lua keeps x==2).
Fixed in ead6148fix(vm): thread state into gsub invalid-replacement raise.

[MEDIUM] No coverage for error in a native-stdlib callback under pcall
test/lua/vm/pcall_state_preservation_test.exs:173
No test covered a native stdlib function that calls back into Lua where the Lua callback raises (e.g. table.sort with an erroring comparator, or an erroring pairs/ipairs loop body). This crosses the native-call choke point twice and exercises the "innermost annotation wins" reraise logic that no other case did. A coverage gap, not a live bug — but it left the native reraise annotation unguarded against regression.
Fixed in debb1actest(vm): cover native-callback errors under pcall.

[LOW] lua_next discards order_tail flush mutation on invalid-key raise
lib/lua/vm/stdlib.ex:335
lua_next/2 flushes a table's deferred-append order_tail into a fresh state before checking the key, but the :invalid_key branch raised ArgumentError without that flushed state or any :state annotation. The flush itself is idempotent cache normalization (harmless), but combined with the missing call_value rescue it was one more native iterator raise escaping with nil state.
Addressed in 0056ef1 (the native-iterator rescue fix covers the propagation path).

[LOW] Doc comments reference private/nonexistent annotate_state/2
lib/lua/vm/dispatcher.ex:153 (and lib/lua/vm/executor.ex:184)
Comments pointed at Lua.VM.Executor.annotate_state/2, which was a private alias; the public function actually called was annotate_frame_state/2.
Fixed in ead6148 — comments now reference annotate_frame_state/2 and the redundant private annotate_state/2 alias was collapsed across both engines.

[LOW] Benchmark script fails repo Styler format check
benchmarks/scripts/pcall_state_bench.exs:11
defmodule PcallStateBench lacked @moduledoc false, which Styler flags when the file is formatted explicitly (benchmarks/ is outside the default formatter scope, so CI was unaffected).
Fixed in 98fa087chore(benchmarks): add @moduledoc false to pcall_state_bench.

[LOW] pcall_state_bench omits memory measurement and the generic-for path
benchmarks/scripts/pcall_state_bench.exs:63
memory_time: 0 measured only wall-clock, missing per-call allocation/GC pressure (the primary observable cost of added try frames), and no case routed through call_value/5's lua-closure clause — the one path that gained a genuinely new per-iteration try.
Fixed in 42cab9dbench(vm): measure memory and exercise lua-closure generic_for in pcall bench.

Refuted / not actioned findings

Notable findings raised during the loop but refuted (for transparency)
  • [HIGH] No coverage for error raised inside a metamethod under pcall (pcall_state_preservation_test.exs) — refuted; existing in-frame helper threading already covers metamethod raise paths.
  • [LOW] Duplicated raise-time state extraction across lua.ex and stdlib.ex (lib/lua.ex) — refuted as not a defect; intentional per-boundary extraction.
  • [LOW] No coverage for a successful pcall following a failed one (pcall_state_preservation_test.exs) — refuted; recovered-state continuation is exercised by existing cases.
  • [LOW] call_value/5 Lua-iterator path gains a per-iteration try block (lib/lua/vm/executor.ex) — refuted as a concern; the try frame is required for correctness and its cost is addressed by the benchmark update above.

Some lower-confidence findings drew a split vote (e.g. the lua_next and benchmark items had one dissenting reviewer who considered them harmless/out-of-consequence); they were fixed anyway since the changes are cheap and strictly correct.

Status

  • mix format --check-formatted: clean (exit 0)
  • mix test: 2202 passed (61 doctests, 51 properties, 2090 tests), 19 skipped, 1 excluded — 0 failures
  • Branch is pushed and in sync with origin/fix/pcall-state-preservation.

Note: the review-fix loop reached its maximum round count without converging to full stability, so a residual edge case may remain unsurfaced. All findings confirmed through round 4 have been fixed and the full suite is green.

davydog187 added 12 commits June 4, 2026 10:40
Red tests for protected-call error unwinding: heap effects (globals,
table fields, upvalue cells, metatables) made before an error must
survive pcall/xpcall trapping the error, matching reference Lua.
Covers both the dispatcher and interpreter engines plus the
Lua.call_function Elixir API error path.
VM exceptions gain a :state field carrying the raise-time State.
pcall/xpcall/Lua.call_function recover it via State.unwind_to/2 —
heap fields (tables, userdata, metatables, upvalue cells, private)
come from the raise-time snapshot, control fields (call stack,
open upvalues) restore from the protected-call entry, matching
Lua 5.3 §2.3 error semantics.

Annotates the raise sites that already have state in scope:
error(), assert(), stack-overflow, and call-type errors in both
engines.
Both native invocation points (Executor.call_function and the
interpreter's inline native path) reraise VM exceptions with the
call's entry state when none was attached deeper, covering every
stdlib bad-argument raise without threading state through each
check. Innermost annotation wins.
Arithmetic, comparison, concat, index, bitwise, length, and
numeric-for coercion helpers raise from pure code where the freshest
state lived only in the executor loop binding. They now take the
caller's threaded state and attach it to the exception, so protected
calls keep heap effects made earlier in the same frame (e.g.
x = 2; return nil + 1 under pcall). One extra already-bound argument
per helper — raise opts are only built on the error path.
Frame entries in both engines (Executor.execute, the lua_closure
paths of call_function/call_value, and the dispatcher's
do_execute_top) annotate escaping VM exceptions with the frame's
entry state when nothing deeper attached one. Converts any missed
raise site from 'loses all mutations under pcall' into 'loses only
in-frame mutations of the innermost unannotated frame'. One try per
frame entry — the executor loop's tail recursion is untouched.
Also derives Inspect excluding :state on VM exceptions so inspecting
a trapped error never dumps the whole VM heap into logs.
string.gsub with a function replacement threads state through the
callback, but the invalid-return RuntimeError dropped the freshest
post-callback state, so a protected unwind rolled back the callback's
heap effects. Attach the live %State{} to the raise (guarded so the
non-stateful gsub/4 entry, which passes nil, is unaffected).

Also point the frame-boundary rescue comments at the real public
Executor.annotate_frame_state/2 and collapse the redundant private
annotate_state/2 alias so the name is uniform across both engines.
Add matrix cases (both engines) for heap mutations made inside a native
stdlib callback that errors before returning: a gsub function
replacement that returns an invalid value, a table.sort comparator that
errors, and a pairs loop body that errors. These cross the native-call
choke point and exercise the innermost-annotation reraise path.
The native-function clause of call_value/5 — the choke point for
generic-for iterators reached from both engines — wrapped the native
call in try/after for position restore but had no state-annotating
rescue. A VM exception raised by a native iterator (utf8.codes, next
via pairs) escaped with :state == nil and was only caught by the
enclosing frame backstop, which annotated it with the frame's ENTRY
state, rolling back heap mutations the protected function made before
the loop (violating Lua 5.3 §2.3).

Mirror the :call opcode native dispatch: add a rescue that reraises
annotate_frame_state(e, state) so the iterator-call's entry state
ferries out on the exception. Both engines route through this one
clause via Executor.dispatcher_call_value/4, so no parity gap.

Also thread the flushed state into lua_next's invalid-key ArgumentError
so a next() invalid-key raise keeps the order_tail normalization and
any prior heap mutations across the protected boundary.
Consistent with the repo's Styler plugin output; the file failed
mix format --check-formatted otherwise.
The set!/3 arity-2 and deflua execute_function wrappers raised
RuntimeError without attaching the returned struct's state when a user
function signaled {:error, reason, %Lua{}}. The only annotation that
fired was the native-call choke point using the pre-call entry state,
which predates the mutations the user function made before erroring, so
pcall/xpcall rolled them back.

Bind the returned struct and pass state: returned_lua.state on the
raise, mirroring the gsub callback fix, so heap effects survive
protected-call unwinding.

Add pcall_state_preservation tests covering both wrapper sites.

Closes #331
…ll bench

Enable memory_time so the per-call allocation/GC delta of the added
try frames is observable, and add a generic_for case driven by a Lua
closure iterator so executor.ex call_value/5's lua_closure clause (the
path that gained a new per-iteration try block) is actually exercised.
@davydog187 davydog187 force-pushed the fix/pcall-state-preservation branch from 42cab9d to d785722 Compare June 4, 2026 17:43
@davydog187 davydog187 merged commit a558050 into main Jun 4, 2026
5 checks passed
@davydog187 davydog187 deleted the fix/pcall-state-preservation branch June 4, 2026 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pcall discards state mutations made before error() (diverges from reference Lua)

1 participant