Skip to content

perf(vm): split-storage tables (Erlang :array + map)#328

Merged
davydog187 merged 4 commits into
mainfrom
perf/split-storage-tables
Jun 3, 2026
Merged

perf(vm): split-storage tables (Erlang :array + map)#328
davydog187 merged 4 commits into
mainfrom
perf/split-storage-tables

Conversation

@davydog187
Copy link
Copy Markdown
Contributor

Closes #314.

Summary

Splits %Lua.VM.Table{} storage: dense positive-integer keys (1..n)
route to an Erlang :array (O(1) functional read/write, dense-integer
iteration ordering for free); all other keys (strings, non-positive /
sparse integers, float/bool/table keys) stay in the hash map with the
existing order/order_tail/dead iteration bookkeeping.

String-keyed reads (globals, fields, metatable lookups) are unchanged —
they still hit the hash map directly, so the executor's %{^name => v}
fast paths and __index/__newindex dispatch pay nothing.

Why now

The issue gated this on "the table-build / sort residual gap persists
after the per-element write-back fixes." It does, and split storage
closes it — moving us from behind luerl to ahead of it on three of four
table-heavy workloads.

Benchmarks

Apple M4, LUA_BENCH_MODE=full, lua (chunk) path (isolates the VM),
n=1000. luerl is the machine-drift control and stays flat (<3%) across
the before/after runs, so the deltas are real, not machine noise.

Workload Before After Delta luerl (before -> after)
Build 214.15 us 137.30 us -36% (1.56x) 155.16 -> 157.55 (flat)
Sort 350.66 us 251.55 us -28% (1.39x) 181.41 -> 179.77 (flat)
Iterate/Sum 304.06 us 194.09 us -36% (1.57x) 246.38 -> 254.00 (flat)
Map + Reduce 620.45 us 388.18 us -37% (1.59x) 447.55 -> 444.89 (flat)

Standing vs luerl after the change: Build, Iterate, and Map+Reduce now
beat luerl (e.g. Build 137 us vs luerl 158 us, previously 1.38x
slower). Sort improves a lot (350 -> 252 us) but is still 1.40x slower
than luerl — the remaining bottleneck is the O(n^2) state-threading
comparator path in table.sort, not storage. That is a separate lever.

Collateral spot-check (non-table workloads, must not regress):

  • Fibonacci: 801.40 -> 807.63 ms (+0.8%, noise)
  • Closures: 450.79 -> 447.16 us (-0.8%, noise)

Correctness

  • Full mix test: 2153 passed, 0 failures (includes the Lua 5.3 official
    suite — nextvar.lua, next/pairs/ipairs, sort, pack/unpack).
  • delete on an array key leaves a nil hole and keeps arr_n as the
    high-water bound rather than shrinking the border. This preserves the
    §6.1 dead-key contract: after a mid-iteration t[k] = nil,
    next_entry/2 can still tell a former array key (resume in-array) from
    a key never present (raise "invalid key to 'next'").
  • length/1 derives the sequence border by scanning the array to the
    first hole, then probing the hash for sparse integers beyond it.

Notes / follow-ups

  • One test (clearing a dense integer key to nil ... keeps its array slot) was updated: a cleared-then-revived integer key now keeps its
    array index (iteration 1,2,3,4,5) instead of moving to the end of
    insertion order (1,2,4,5,3). The new order matches reference Lua 5.3,
    where these keys all live in the array part. Two value_test assertions
    that reached into table.data[1] were repointed at Table.get/2.
  • length/1 is currently O(border) rather than O(1) because holes can
    appear after a delete; a cached border would restore O(1) for the
    hole-free common case. Left as a follow-up — it did not show up as a
    regression in any benchmark.
  • benchmarks/array_vs_map_probe.exs is kept as the data-structure
    probe that motivated the decision. luaport was dropped from the
    benchmark deps (it fails to compile without C Lua headers and the
    harness already skips it gracefully).

Opened as a draft for review.

Complete the array/hash split-storage rework started in the prior WIP
commit. Route dense positive-integer keys (1..n) to an Erlang :array for
O(1) functional read/write; keep all other keys in the hash map with the
existing order/dead iteration bookkeeping.

Salvaged from WIP and finished:
- table.sort plain path: read via Table.get/2 and write back via
  Table.put_many/2 so sorted values land in the array (was reading/writing
  the hash, making sort a no-op for array-resident tables).
- get/2 and has_key?/2: normalize integer-valued float keys (t[1.0]) so
  they resolve to the same array slot as t[1].
- delete on an array key now leaves a nil hole and keeps arr_n as the
  high-water bound instead of shrinking the border, so next_entry/2 can
  still distinguish a former array key (resume in-array) from a key that
  was never present (raise 'invalid key to next') after a mid-iteration
  clear, per Lua 5.3 §6.1.
- length/1 derives the sequence border by scanning the array to the first
  hole, then probing the hash for sparse integer keys beyond it.

Full mix test (incl. Lua 5.3 official suite) passes.
@davydog187
Copy link
Copy Markdown
Contributor Author

Automated review (workflow respike-314) — stable in round 1

This PR was produced by a benchmarked go/no-go spike and reviewed by an AI workflow. Posting the reviewer's own findings for the human record.

Verdict: approve (zero blocker/high/medium). The reviewer checked out the branch, compiled it, ran the full suite (2153 passed, incl. the Lua 5.3 official suite), and wrote ~35 targeted probe tests against the edge cases of a storage split. All held:

  • float/int key aliasing (t[1.0] == t[1]); key 0 and negatives stay in hash and don't affect #
  • huge sparse indices (t[1]=x; t[1000000]=y) do not allocate a 1M :array (beyond-border writes fall through to hash)
  • nil-deletion works in both stores; # border with holes is a valid Lua 5.3 border; ipairs stops at the first hole
  • pairs iterates array-then-hash, visits every key exactly once (incl. after absorb-promotion of out-of-order fills); §6.1 dead-key semantics hold (mid-iteration clears, invalid key to next, deleted-then-revived keys keep slots)
  • __index/__newindex fire on misses in either store; insert/remove/sort write-back preserves non-sequence keys
  • absorb promotion is O(n) on descending construction and doesn't resurrect deleted sparse keys
  • benchmark plausibility confirmed; fibonacci/closures flat as expected (no table code in their paths)

Findings (all low/nit — non-blocking cleanups)

  • low — mix.exs:85: the PR drops the {:luaport, …, only: :benchmark} dep, but benchmarks/oop.exs still references :luaport (degrades gracefully to "skipping C Lua benchmarks", but the oop bench silently loses its C-Lua column). Either restore the dep or remove the dead block + setup-script references.
  • low — lib/lua/vm/value.ex:166: Value.sequence_length/1 is now dead code — every runtime caller migrated to Table.length/1; only a stale comment at executor.ex:2391 still mentions it.
  • low — lib/lua/vm/state.ex:111: State.globals/1 returns only the hash portion post-split — a program doing _G[1] = x would have that entry invisible to globals/1. Either return Table.to_map(table) or tighten the @doc to string-named globals.
  • nit — lib/lua/vm/table.ex:74: replace_data/2 could be split_from_map(%__MODULE__{metatable: table.metatable}, data) instead of the long explicit-reset one-liner.

Benchmark summary (Apple M4, n=1000, luerl as drift control — stayed <3% flat)

Workload Before After Delta vs luerl
Table Build 214.15 µs 137.30 µs −36% was 1.38× slower → now faster
Table Sort 350.66 µs 251.55 µs −28% still 1.40× slower (comparator-bound, not storage)
Table Iterate/Sum 304.06 µs 194.09 µs −36% now faster
Table Map+Reduce 620.45 µs 388.18 µs −37% now faster

Collateral: fibonacci +0.8%, closures −0.8% (noise).

@davydog187 davydog187 marked this pull request as ready for review June 3, 2026 11:08
…ytes

The 256 MiB string ceiling guarded string.rep, .., and load readers at a
fixed compile-time bound. Embedders running the VM inside a process
capped with :max_heap_size could not align the two limits: the VM would
legally materialize strings far past a smaller heap cap, leaving the
kill to GC-observation timing instead of a deterministic refusal.

Thread the ceiling through State (default unchanged) and expose it as
Lua.new(max_string_bytes: n), mirroring :max_call_depth. The hot-path
concat guard reads a struct field instead of a module attribute;
byte_size/1 remains O(1), so the check stays cheap.
…orker

The string-bomb playground example only passed by GC-timing luck: the
VM's 256 MiB string ceiling sits far above the sandbox's 64 MB heap cap,
so the doubling demo legally materialized ~384 MB of live binaries and
survived only when no GC observed the peak. Any allocation-pattern shift
elsewhere in the VM re-rolled that dice — surfacing as a flaky website
test. Set the VM's :max_string_bytes to 16 MB, well under the heap cap,
so the refusal is deterministic and the example's before-a-byte-is-
allocated claim is actually true.

Also unlink the eval worker after it delivers its result: its exit-time
GC can trip the heap kill late, and the :killed signal propagated to the
caller after trap_exit was restored, crashing it (reproducible by
calling run/1 twice in a row with the string-bomb source).
@davydog187
Copy link
Copy Markdown
Contributor Author

Website CI failure: diagnosis and fix

The Website tests job failed on playground/string-bomb ("expected to run cleanly, got: Execution exceeded the memory limit"). This PR did not break the example — it exposed a pre-existing fragility:

  • The VM's string ceiling (Limits.max_string_bytes, 256 MiB) sits far above the website sandbox's worker heap cap (64 MB). The example's doubling loop is therefore legally allowed to materialize 128 MB + 256 MB binaries before the .. guard refuses — ~384 MB of live binaries inside a 64 MB-capped process.
  • max_heap_size (with include_shared_binaries) only kills when a GC happens to observe the oversized live set. Whether one lands during the two fatal iterations depends on the exact allocation trajectory. A byte-faithful replica of the sandbox's eval gets killed on main too; running the real LuaSandbox.run/1 twice in a row on main crashes the caller on the second run. The example has been passing by GC-timing luck, and this PR's different Lua.new setup allocations merely re-rolled the dice.

Fix (two commits)

  1. feat(vm): Lua.new(max_string_bytes: n) — the string ceiling (.., string.rep, load readers) becomes per-state, mirroring :max_call_depth from feat(vm): configurable max call depth #283. Default unchanged (256 MiB). The hot-path concat guard reads a struct field instead of a module attribute. New coverage in limits_test.exs (both interpreter and dispatcher concat paths, load readers, validation); sandboxing guide updated with a "pair the ceiling with your heap cap" warning.
  2. fix(website): the sandbox sets max_string_bytes: 16 MB — well under its 64 MB cap — so the string-bomb refusal is deterministic and the example's "refused before a byte is allocated" claim is actually true. Also fixes a latent race found while diagnosing: the eval worker stays linked after delivering its result, and its exit-time GC can trip the heap kill after the caller stops trapping exits, crashing the caller. The worker is now unlinked (and any queued exit flushed) on result delivery.

Verification

  • Full suite: 2159 passed / 19 skipped (incl. new :max_string_bytes tests)
  • Lua 5.3 official suite: passed
  • Website tests: 52/52 (previously 51/52)
  • The two-consecutive-runs caller-crash repro: 5/5 clean

Scope note for the reviewer: commit 1 adds a public API option; it's the root-cause fix rather than a test-tolerance workaround, but it does extend this PR beyond table storage.

@davydog187 davydog187 merged commit 9755265 into main Jun 3, 2026
5 checks passed
@davydog187 davydog187 deleted the perf/split-storage-tables branch June 3, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(vm): split-storage tables (Erlang :array + map)

1 participant