perf(stdlib): iolist string.format and plain-table sort/concat fast paths by davydog187 · Pull Request #299 · tv-labs/lua

davydog187 · 2026-05-31T14:03:00Z

stdlib hot paths: string.format iolist + plain-table fast paths

Plan: .agents/plans/B9-stdlib-hot-paths.md
Closes #273

Goal

Close three stdlib-side performance gaps surfaced by the benchmark suite.
Observable behavior is byte-for-byte identical before and after.

string.format literal accumulation. format_string/3 built its
result with per-character binary concatenation, which is awkward on long
format strings. It now accumulates an iolist collapsed once with
IO.iodata_to_binary/1, and copies each run of literal (non-%) bytes
as a single chunk via :binary.split/2 rather than one iolist cell per
character. (% is ASCII 0x25 and never appears as a UTF-8 continuation
byte, so byte-splitting is multibyte-safe.) See the note below — the
initial per-character iolist regressed literal-heavy formats badly; the
literal-run coalescing reverses that and beats the original.
table.sort / table.concat plain-table fast path. When
metatable == nil, both functions now read slots directly via
Lua.VM.Table.get_data/2 and table.sort writes back via
Lua.VM.Table.put/3 + a single Map.put/3, skipping the
Executor.table_index/3 / table_newindex/3 dispatch. Tables with a
metatable keep the Executor path unchanged.
apply_width_flags byte-consistent width. Both the width threshold
and the padding fill are now measured in bytes, matching PUC-Lua
(which hands the width straight to C's printf). Previously the
threshold used byte_size/1 but the fill used String.pad_*/String.slice,
which count codepoints — so a multibyte %s under a width was padded by
codepoints against a byte threshold (e.g. format("%6s", "café") emitted
7 bytes). Now byte-for-byte.

⚠️ Regression found in review, and fixed

Benchmarking against the merge-base surfaced a regression introduced by the
as-is iolist change (1, above): appending one iolist cell per literal
character ([acc, <<char::utf8>>]) ballooned both the iolist and its final
IO.iodata_to_binary/1 flatten on literal-heavy format strings, while
the per-character binary append it replaced was being coalesced by the BEAM's
append optimization. A ~430-char template (3 specifiers, mostly literal text)
regressed ~2.6× in time and ~4.7× in memory.

d2f4e55 fixes this by grabbing each literal run in one chunk
(:binary.split/2) instead of per character. The literal-heavy case is now
~2.4× faster than the merge-base with slightly less memory, and
specifier-dense formats keep their win.

Benchmarks (`lua` chunk path, idle Apple M4, quick mode)

base = merge-base 4f93396. PR as-is = the per-character iolist
(bddb677). PR fixed = with literal-run coalescing (d2f4e55).

workload	base	PR as-is	PR fixed	fixed vs base
`string.format` long literal-heavy (n=1000)	269 ips · 3.72 ms · 6.21 MB	103 ips · 9.70 ms · 29.1 MB	658 ips · 1.52 ms · 5.72 MB	~2.4× faster, less mem
`string.format` width-flagged specifiers (n=1000)	250 ips · 12.4 MB	287 ips · 11.4 MB	260 ips · 11.4 MB	+4%
`string.format` many specifiers (n=1000)	182 ips · 26.3 MB	247 ips · 24.5 MB	205 ips · 24.3 MB	+13%
`table.sort` (n=100)	19.0K ips · 52.6 µs	33.0K ips · 30.3 µs	34.3K ips · 29.1 µs	+80%

Notes:

The fixed branch beats the merge-base on every measured workload, with
no regression anywhere.
On formats with many short literal runs (width-flagged, many-specifiers)
the fixed version lands slightly below the as-is peak — :binary.split/2
carries a small per-run cost the per-character version avoided — but stays
ahead of base. The trade buys back the catastrophic literal-heavy loss.
Control workloads on unchanged paths (table build / iterate / map+reduce)
stayed flat within noise, confirming the machine was stable.

57e9ff2 adds benchmarks/string_format.exs
so the format-string-shape axis stays covered: the existing string_ops
benchmark only formats a short specifier-dense string ("item_%d=%f"), which
exercises argument conversion but not literal accumulation. The new workload
varies the format string itself (long literal-heavy, width-flagged,
many-specifier) and follows suite convention (helpers.exs run modes;
eval/chunk vs Luerl vs optional C Lua). Run with
mix lua.bench --workload string_format.

Success criteria

Changes

 .agents/plans/B9-stdlib-hot-paths.md | 215 +++++++++++++++++++++++++++++++++++
 benchmarks/string_format.exs         | 129 +++++++++++++++++++++
 lib/lua/vm/stdlib/string.ex          |  60 ++++++----
 lib/lua/vm/stdlib/table.ex           | 152 ++++++++++++++++---------
 test/lua/vm/string_test.exs          |  30 +++++
 5 files changed, 514 insertions(+), 72 deletions(-)

Verification

mix format                         # no diff
mix compile --warnings-as-errors   # clean
mix test test/lua/vm/string_test.exs   # 111 tests, 41 properties, 0 failures
mix test                           # green
mix test --only lua53              # no regression
mix lua.bench --workload string_format  # before/after numbers above

Out of scope (intentional)

The O(n^2) insertion-sort path used when a user-supplied Lua comparator
is passed (sort_values/2 / insert_sorted/4). The comparator branch
keeps going through Executor.call_function/3.
Lua.VM.Table.put/3's order_tail allocation on every write.
Any change to the comparator-driven write-back, error messages, or
argument validation.

🤖 Generated with Claude Code

…aths string.format accumulated its result with per-character binary concatenation, making it O(n^2) in the format string length; it now builds an iolist and collapses it once with IO.iodata_to_binary/1. table.sort and table.concat now read metatable-less tables directly from the data map (and table.sort writes back with a single Map.put) instead of routing every slot through Executor.table_index/table_newindex. apply_width_flags measures padding width with byte_size/1 instead of the grapheme-walking String.length/1. All three are behavior-preserving constant-factor wins; tables with a metatable keep the Executor path. Plan: B9 Closes #273

apply_width_flags now measures and fills width in bytes consistently. The threshold already used byte_size/1, but the padding branches used String.pad_*/String.slice, which count codepoints, so a multibyte %s under a width was padded by codepoints against a byte threshold. That mismatch produced output longer than the requested width (e.g. format("%6s", "café") emitted 7 bytes). Padding is now byte-for-byte, matching PUC-Lua, which passes the width straight to C's printf. Adds regression tests pinning multibyte %s width semantics: right- and left-justified fills measured in bytes, and the no-pad boundary when the byte length already meets the width.

The iolist rewrite of `format_string/3` appended one cell per literal character (`[acc, <<char::utf8>>]`), which ballooned both the iolist and its final `IO.iodata_to_binary/1` flatten on literal-heavy format strings — a ~430-char template regressed 2.75x in time and 4.7x in memory versus the per-character binary append it replaced, because the BEAM's append optimization had been coalescing those writes. Copy each maximal run of non-`%` bytes as a single chunk via `:binary.split/2` instead. `%` is ASCII 0x25 and never appears as a UTF-8 continuation byte, so byte-splitting is safe for multibyte literals. Specifier-dense formats keep the iolist win; literal-heavy formats no longer regress. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The existing string_ops benchmark formats a short specifier-dense string ("item_%d=%f"), which only exercises argument conversion. This adds a dedicated string.format workload that varies the format string itself: a long literal-heavy template, width-flagged specifiers, and a conversion-heavy many-specifier string. The literal-heavy case guards the literal-run accumulation path, whose cost is dominated by copying literal bytes rather than converting arguments. Follows the suite convention: helpers.exs run modes, and eval/chunk vs Luerl vs optional C Lua comparisons. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

davydog187 added 4 commits May 31, 2026 07:00

chore(B9): start plan

1416ad3

chore(B9): mark plan as review

68e9018

davydog187 force-pushed the perf/stdlib-hot-paths branch from b18e50d to a654cf3 Compare May 31, 2026 14:09

davydog187 and others added 2 commits May 31, 2026 10:55

davydog187 merged commit 5e868e6 into main May 31, 2026
5 checks passed

davydog187 deleted the perf/stdlib-hot-paths branch May 31, 2026 19:38

This was referenced Jun 1, 2026

perf(stdlib): batch write-back in table.sort plain fast path #307

Closed

perf(stdlib): defer string.format width padding to iolist #310

Open

perf(stdlib): defer string.format width padding to iolist #316

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(stdlib): iolist string.format and plain-table sort/concat fast paths#299

perf(stdlib): iolist string.format and plain-table sort/concat fast paths#299
davydog187 merged 6 commits into
mainfrom
perf/stdlib-hot-paths

davydog187 commented May 31, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

davydog187 commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

stdlib hot paths: string.format iolist + plain-table fast paths

Goal

⚠️ Regression found in review, and fixed

Benchmarks (lua chunk path, idle Apple M4, quick mode)

Success criteria

Changes

Verification

Out of scope (intentional)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

davydog187 commented May 31, 2026 •

edited

Loading

Benchmarks (`lua` chunk path, idle Apple M4, quick mode)