perf(vm): split table storage into array + hash parts #229
Records PR #229. Documents the discovery that the plan's projected 30% win was reachable in theory but bounded by BEAM tuple semantics; the realized wins concentrate on time (6-22% across table workloads, new wins over luerl on 3/4 of them) with a memory regression that follows from immutable-tuple growth. Also records B6's deferral and B8's merge in the plan changelogs.
Post-PR #223 / #227 profile shows Map.get/2 + Map.get/3 combined at 3.28% on fib(22) (plan claimed 6.4%), 2.81% on OOP, and 0.04% on table_build. The real table-workload bottlenecks live inside Lua.VM.Table (insert/put 18%, normalize_key 3.3%, sequence_length 4%) and in :erlang.setelement (17.5% on table writes, 20.9% on OOP). Those are B7's targets, not B6's. B6's projected wall-clock win is now below 1%, inside benchee's deviation band on every measured workload. Audit cleanup may still be worth doing later as a refactor, but not as a perf plan and not before B7.
Reshapes Lua.VM.Table so contiguous integer keys live in a tuple-backed array part with exponential capacity growth, while non-integer / sparse keys stay in the existing hash map. Mirrors PUC-Lua's internal layout.

Adds Table.get/2, Table.has?/2, Table.length/1 helpers that consult both parts; migrates every site that read `table.data` for an integer key (rawget, rawlen, ipairs, get_table fast path, lua.ex traversal, decode, display) onto the new helpers. Sites that only touch known-string keys (metatable __index/__newindex/__call lookups, package.loaded module caching, _G global lookups) continue reading `data` directly.

The array part uses exponential capacity growth (doubling with a floor of 4) so sequential `t[i] = ...` writes are amortized O(1) per append rather than O(n) for naive Tuple.append. An `array_has_holes` flag keeps `#t` at O(1) for the overwhelmingly common case where no slot has been explicitly cleared.

Nil-valued slots are allowed in the array part as PUC-Lua-compatible hole markers; `t[k] = nil` mid-array sets the slot to nil and flags holes rather than demoting the tail to the hash side. Reads return nil naturally via element/2; iteration via next_entry skips nil slots.

Benchmarks (n=500, 10s benchee, 2s warmup; lua chunk path):

- Table Build: 89.45 µs → 83.80 µs (-6.3%; beats luerl)
- Table Sort: 245.45 µs → 191.93 µs (-21.8%)
- Iterate/Sum: 129.91 µs → 117.19 µs (-9.8%; beats luerl)
- Map+Reduce: 277.32 µs → 249.02 µs (-10.2%; beats luerl)
- OOP: 135.69 µs → 122.26 µs (-10%)
- table.concat: 44.21 µs → 32.22 µs (-27%)
- fib(30): within noise (±3%)

The plan's 30% stretch on table_build was not hit; most of the win the plan projected came from eliminating per-key Map.put. The exponential-growth tuple is faster than the per-key Map.put, but setelement on a growing tuple still has real cost.

Memory regresses on table-heavy workloads (e.g. table_build 0.65 MB → 1.68 MB) because of intermediate tuple copies during the build. PUC-Lua avoids these by mutating its array part in place; on the BEAM each setelement is a copy. Still well below C Lua × 1.0 of course, and time wins on these workloads were the priority.

Plan: .agents/plans/B7-table-array-hash-split.md
Closing after multi-n measurement on the merged bench harness (#230) revealed a hard crossover that makes this PR unsafe to ship.

What the data showed

Five-run variance check + full-mode multi-n sweep (n ∈ {10, 100, 1000}):

Memory at n=1000 is also bad: ~3-5× main's allocations (e.g. Sort 2.08 MB → 12.40 MB).

Why the regression at scale

B7 routes contiguous integer keys into a tuple-backed array part with exponential capacity growth. At n=100 the tuple is ~128 cells and [...]

The single n=500 number that motivated investigation was right at the crossover, which explains the run-to-run inconsistency we saw before #230 landed.

What would unblock this

Threshold-based promotion: keep contiguous integer keys in the hash map until [...]

Plan status
Split table storage into array + hash parts
Plan: `.agents/plans/B7-table-array-hash-split.md`

Also defers B6 (`.agents/plans/B6-direct-table-refs.md`) and marks B8 as merged via #227.

Goal

Reshape `Lua.VM.Table` so contiguous integer keys live in a tuple-backed array part with exponential capacity growth, while non-integer / sparse keys stay in the existing hash map. Mirrors PUC-Lua's internal layout. The plan's main lever was reducing the per-key cost of sequential `t[i] = ...` writes (the dominant pattern on every table-heavy workload).

Design

- `array :: tuple()` holds values for the contiguous prefix `1..array_len`. Tuple capacity (`tuple_size(array)`) is >= `array_len`; the headroom is filled with `nil` so reads beyond `array_len` short-circuit naturally. Within a 500-element build only ~9 grow events fire (capacities 4 → 8 → ... → 512 → 1024).
- `array_has_holes :: boolean()` keeps `#t` at O(1) for the overwhelmingly common case where no slot has been explicitly cleared via `t[k] = nil`. Falls back to a linear scan only when the user punches a hole.
- Nil-valued slots are allowed in the array part as PUC-Lua-compatible hole markers. `t[k] = nil` mid-array sets the slot to nil and flips the holes flag rather than demoting the tail to the hash side.
- `Table.get/2`, `Table.has?/2`, `Table.length/1`, `Table.next_entry/2`, `Table.to_map/1`, and `Table.keys/1` all consult both parts. Every call site that read `table.data` for an integer key has been migrated to the new helpers.

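The layout described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the module name `SketchTable`, the `new/0` helper, and every function body are assumptions; only the field names (`array`, `array_len`, `array_has_holes`, `data`) and the doubling-with-a-floor-of-4 policy come from the text. It skips details a real implementation needs, such as migrating keys out of the hash part when the array grows to meet them.

```elixir
defmodule SketchTable do
  # Hypothetical stand-in for Lua.VM.Table. Field names follow the PR text;
  # all function bodies are assumptions made for illustration.
  defstruct array: {}, array_len: 0, array_has_holes: false, data: %{}

  @min_capacity 4

  def new, do: %__MODULE__{}

  # Read: in-range integer keys hit the tuple, everything else hits the map.
  def get(%__MODULE__{array: arr, array_len: len}, k)
      when is_integer(k) and k >= 1 and k <= len,
      do: elem(arr, k - 1)

  def get(%__MODULE__{data: data}, k), do: Map.get(data, k)

  # Append at array_len + 1, doubling the capacity when the tuple is full.
  def put(%__MODULE__{array: arr, array_len: len} = t, k, v)
      when is_integer(k) and k == len + 1 and not is_nil(v) do
    arr = if tuple_size(arr) == len, do: grow(arr), else: arr
    %{t | array: put_elem(arr, len, v), array_len: len + 1}
  end

  # Overwrite inside the array part. Writing nil leaves the slot in place
  # and flags the hole instead of shrinking or demoting the tail.
  def put(%__MODULE__{array: arr, array_len: len} = t, k, v)
      when is_integer(k) and k >= 1 and k <= len do
    %{t | array: put_elem(arr, k - 1, v), array_has_holes: t.array_has_holes or is_nil(v)}
  end

  # Everything else (strings, sparse integers) stays in the hash part.
  def put(%__MODULE__{data: data} = t, k, nil), do: %{t | data: Map.delete(data, k)}
  def put(%__MODULE__{data: data} = t, k, v), do: %{t | data: Map.put(data, k, v)}

  # Copy the existing cells into a tuple twice as large, padding with nil.
  defp grow(arr) do
    old = tuple_size(arr)
    new = max(@min_capacity, old * 2)
    List.to_tuple(Tuple.to_list(arr) ++ List.duplicate(nil, new - old))
  end
end
```

With this shape, `SketchTable.new() |> SketchTable.put(1, "a") |> SketchTable.get(1)` reads through `elem/2` without touching the map, while a string key takes the second `get/2` clause.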
Migrations (correctness)

Sites that previously inspected `table.data` directly and could see integer keys have been migrated:

- `Lua.VM.Stdlib.lua_rawget`, `lua_rawlen`, `lua_ipairs`
- `Lua.VM.Executor.table_index/4`, `table_newindex/5`, `table_length/2`, the `:length` opcode, the `get_table` fast path
- `Lua.eval!` / `Lua.get!` / `Lua.set!` traversal in `lib/lua.ex`
- `Lua.VM.Value.decode/2`
- `Lua.VM.Display.peek_table/3`
- `Lua.VM.State.globals/1`

Sites that only read known-string keys (`mt.data` for `__call`, `__index`, `__newindex`; `package.data` for `"loaded"`, `"path"`, `"preload"`; `loaded_table.data` for module names; `_G.data` for global names; `string.data` for `"unpack"`) continue to read `data` directly; those keys never live in the array part.

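To illustrate why integer-key call sites had to move while string-key sites could stay, the sketch below contrasts a direct read of `data` with a helper that consults both parts. `SplitLookup` and the plain-map table shape are hypothetical stand-ins for `Lua.VM.Table` and `Table.get/2`, not the PR's code.

```elixir
defmodule SplitLookup do
  # Hypothetical helper in the spirit of Table.get/2:
  # array part first for in-range integer keys, hash part otherwise.
  def get(%{array: arr, array_len: len}, k)
      when is_integer(k) and k >= 1 and k <= len,
      do: elem(arr, k - 1)

  def get(%{data: data}, k), do: Map.get(data, k)
end

# Assumed post-split shape of the Lua table {"a", "b", name = "point"}.
table = %{array: {"a", "b", nil, nil}, array_len: 2, data: %{"name" => "point"}}

# String keys never enter the array part, so a direct read stays correct.
direct_string = Map.get(table.data, "name")

# Integer keys in 1..array_len are no longer stored in `data`,
# so a direct read misses them and only the helper finds the value.
direct_integer = Map.get(table.data, 1)
via_helper = SplitLookup.get(table, 1)
```

Here `direct_integer` is `nil` while `via_helper` is `"a"`, which is the class of bug the migration list guards against.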
Success criteria

- `Lua.VM.Table` carries `array` and `array_len` fields plus `array_has_holes` invariant flag.
- `Lua.VM.Table.length/1` is O(1) for the no-holes case (the dominant workload). Falls back to O(n) scan only when holes are known to exist.
- `t[i]` for integer `i` in `1..array_len` is `element/2`. No `Map.get`, no key normalization for in-range integer keys.
- `t[#t + 1] = v` is amortized O(1) via exponential tuple growth.
- `ipairs(t)` iterates the array via `Table.get/2`.
- `mix test` passes: 1692 tests, 51 properties, 55 doctests, 0 failures.
- `mix test --only lua53` does not regress: 29 tests, 0 failures.
- [...] hit; floor "no workload regresses by more than 2% on time" met.

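The O(1) length criterion can be pictured with a sketch like the one below. It assumes the flag is consulted first and that the fallback scans downward from `array_len` to the last non-nil slot, which yields one valid Lua border; the PR only says "linear scan", so the scan direction, the module name `SketchLength`, and the function name `border/1` are all assumptions.

```elixir
defmodule SketchLength do
  # Fast path: no slot was ever cleared, so the stored prefix length is #t.
  def border(%{array_has_holes: false, array_len: len}), do: len

  # Slow path (assumed strategy): walk down to the last non-nil slot.
  def border(%{array_has_holes: true, array: arr, array_len: len}),
    do: last_non_nil(arr, len)

  defp last_non_nil(_arr, 0), do: 0

  defp last_non_nil(arr, i) do
    if elem(arr, i - 1) == nil, do: last_non_nil(arr, i - 1), else: i
  end
end
```

The point of the flag is that the second clause is never reached for tables built purely by appending, so `#t` costs a single field read in the common case.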
Benchmarks (n=500, 10s benchee, 2s warmup; lua chunk path)

Memory regressed on table-heavy workloads (e.g. table_build 0.65 MB → 1.68 MB). The cause is intermediate tuple copies during the build: PUC-Lua avoids them by mutating its array part in place, whereas on the BEAM each `setelement/3` is conceptually a copy. The time wins on these workloads were the priority. Memory is still order-of-magnitude comparable to luerl; only specific table-heavy bench shapes regress.

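A back-of-the-envelope model makes the copy cost concrete. The sketch below assumes every sequential write copies the whole tuple at its current capacity, and that capacity is the smallest power of two that fits the element count, with a floor of 4. Both are simplifications: the runtime can sometimes avoid a copy, and the PR's exact growth trigger is not shown here. `CopyCost` and its functions are hypothetical names.

```elixir
defmodule CopyCost do
  @min_capacity 4

  # Capacity holding i elements under doubling with a floor of 4
  # (smallest power of two >= i, never below 4).
  def capacity(i) when i <= @min_capacity, do: @min_capacity
  def capacity(i), do: 2 * capacity(div(i + 1, 2))

  # Total tuple cells copied by n sequential appends, assuming each
  # write copies the full tuple at its current capacity.
  def cells_copied(n) when n >= 1 do
    1..n |> Enum.map(&capacity/1) |> Enum.sum()
  end
end
```

Under these assumptions the total grows roughly with the square of n, so doubling the table size about quadruples the cells copied. Exponential growth bounds the number of resize events, but not the per-write copy.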
Stretch targets not hit

- [...] (~84 µs). Most of the plan's projected win came from eliminating the per-key `Map.put` + `order_tail` cons + `dead` check pipeline. Replacing that with an exponential-growth tuple plus `setelement/3` is cheaper, but `setelement/3` on a doubled-capacity tuple is not free. Bounded by BEAM tuple semantics; would require either NIF mutability or a more invasive co-design with the codegen (e.g. sized-tuple emission for table literals) to close further. Out of scope here.
- `big.lua` < 30s: not assessed; it's still in the lua53 skipped list pending the larger A10 work it gates.

Discoveries

- [...] `:erlang.append_element/2` which is O(n) per call. That regressed table_build by +12% time and +160% memory. The plan's risks section warned about exactly this; exponential growth was the cited mitigation and is what landed.
- [...] `t[k] = nil` punched a hole, to keep the array contiguous. That broke `for k,v in pairs(t) do t[k] = nil end` because the cleared key was no longer findable for `next(t, k)` to advance past it. Switching to PUC-Lua's nil-as-hole semantics (set the slot to nil in place, flip `array_has_holes`) is both simpler and correct.

Changes
Verification
Out of scope (intentional)

- [...] `{1, 2, 3}`). Would reduce memory churn on the static-table case but adds a new compiler contract; follow-up.
- [...] `Lua.VM.Value.sequence_length/1` (the old map-based helper). Still works on a bare map; removing it is a public-API question, separate concern.