Skip to content

perf(stdlib): iolist string.format and plain-table sort/concat fast paths#299

Merged
davydog187 merged 6 commits into
mainfrom
perf/stdlib-hot-paths
May 31, 2026
Merged

perf(stdlib): iolist string.format and plain-table sort/concat fast paths#299
davydog187 merged 6 commits into
mainfrom
perf/stdlib-hot-paths

Conversation

@davydog187
Copy link
Copy Markdown
Contributor

@davydog187 davydog187 commented May 31, 2026

stdlib hot paths: string.format iolist + plain-table fast paths

Plan: .agents/plans/B9-stdlib-hot-paths.md
Closes #273

Goal

Close three stdlib-side performance gaps surfaced by the benchmark suite.
Observable behavior is byte-for-byte identical before and after.

  1. string.format literal accumulation. format_string/3 built its
    result with per-character binary concatenation, which is awkward on long
    format strings. It now accumulates an iolist collapsed once with
    IO.iodata_to_binary/1, and copies each run of literal (non-%) bytes
    as a single chunk via :binary.split/2
    rather than one iolist cell per
    character. (% is ASCII 0x25 and never appears as a UTF-8 continuation
    byte, so byte-splitting is multibyte-safe.) See the note below — the
    initial per-character iolist regressed literal-heavy formats badly; the
    literal-run coalescing reverses that and beats the original.
  2. table.sort / table.concat plain-table fast path. When
    metatable == nil, both functions now read slots directly via
    Lua.VM.Table.get_data/2 and table.sort writes back via
    Lua.VM.Table.put/3 + a single Map.put/3, skipping the
    Executor.table_index/3 / table_newindex/3 dispatch. Tables with a
    metatable keep the Executor path unchanged.
  3. apply_width_flags byte-consistent width. Both the width threshold
    and the padding fill are now measured in bytes, matching PUC-Lua
    (which hands the width straight to C's printf). Previously the
    threshold used byte_size/1 but the fill used String.pad_*/String.slice,
    which count codepoints — so a multibyte %s under a width was padded by
    codepoints against a byte threshold (e.g. format("%6s", "café") emitted
    7 bytes). Now byte-for-byte.

⚠️ Regression found in review, and fixed

Benchmarking against the merge-base surfaced a regression introduced by the
as-is iolist change (1, above): appending one iolist cell per literal
character ([acc, <<char::utf8>>]) ballooned both the iolist and its final
IO.iodata_to_binary/1 flatten on literal-heavy format strings, while
the per-character binary append it replaced was being coalesced by the BEAM's
append optimization. A ~430-char template (3 specifiers, mostly literal text)
regressed ~2.6× in time and ~4.7× in memory.

d2f4e55 fixes this by grabbing each literal run in one chunk
(:binary.split/2) instead of per character. The literal-heavy case is now
~2.4× faster than the merge-base with slightly less memory, and
specifier-dense formats keep their win.

Benchmarks (lua chunk path, idle Apple M4, quick mode)

base = merge-base 4f93396. PR as-is = the per-character iolist
(bddb677). PR fixed = with literal-run coalescing (d2f4e55).

workload base PR as-is PR fixed fixed vs base
string.format long literal-heavy (n=1000) 269 ips · 3.72 ms · 6.21 MB 103 ips · 9.70 ms · 29.1 MB 658 ips · 1.52 ms · 5.72 MB ~2.4× faster, less mem
string.format width-flagged specifiers (n=1000) 250 ips · 12.4 MB 287 ips · 11.4 MB 260 ips · 11.4 MB +4%
string.format many specifiers (n=1000) 182 ips · 26.3 MB 247 ips · 24.5 MB 205 ips · 24.3 MB +13%
table.sort (n=100) 19.0K ips · 52.6 µs 33.0K ips · 30.3 µs 34.3K ips · 29.1 µs +80%

Notes:

  • The fixed branch beats the merge-base on every measured workload, with
    no regression anywhere.
  • On formats with many short literal runs (width-flagged, many-specifiers)
    the fixed version lands slightly below the as-is peak — :binary.split/2
    carries a small per-run cost the per-character version avoided — but stays
    ahead of base. The trade buys back the catastrophic literal-heavy loss.
  • Control workloads on unchanged paths (table build / iterate / map+reduce)
    stayed flat within noise, confirming the machine was stable.

57e9ff2 adds benchmarks/string_format.exs
so the format-string-shape axis stays covered: the existing string_ops
benchmark only formats a short specifier-dense string ("item_%d=%f"), which
exercises argument conversion but not literal accumulation. The new workload
varies the format string itself (long literal-heavy, width-flagged,
many-specifier) and follows suite convention (helpers.exs run modes;
eval/chunk vs Luerl vs optional C Lua). Run with
mix lua.bench --workload string_format.

Success criteria

  • mix format produces no diff (ran before commit).
  • mix compile --warnings-as-errors passes.
  • mix test green, no regressions.
  • mix test --only lua53 no regression.
  • string.format builds its result via an iolist with a single IO.iodata_to_binary/1; no acc <> ... concat remains in format_string/3.
  • string.format copies literal runs in one chunk (:binary.split/2), not one iolist cell per character, so literal-heavy formats don't regress.
  • table.sort / table.concat read plain tables via Table.get_data/2, and table.sort writes back via Table.put/3 + Map.put/3, only when metatable == nil.
  • Tables with a metatable still go through Executor.table_index/3 / table_newindex/3.
  • apply_width_flags/3 measures the width threshold and the padding fill in bytes (multibyte %s is padded byte-for-byte, matching PUC-Lua).
  • No behavioral change to existing string/table stdlib semantics; multibyte %s width pinned by new regression tests.
  • Benchmark coverage added for the format-string-shape axis (benchmarks/string_format.exs); literal-heavy case confirmed ~2.4× faster than base with less memory.

Changes

 .agents/plans/B9-stdlib-hot-paths.md | 215 +++++++++++++++++++++++++++++++++++
 benchmarks/string_format.exs         | 129 +++++++++++++++++++++
 lib/lua/vm/stdlib/string.ex          |  60 ++++++----
 lib/lua/vm/stdlib/table.ex           | 152 ++++++++++++++++---------
 test/lua/vm/string_test.exs          |  30 +++++
 5 files changed, 514 insertions(+), 72 deletions(-)

Verification

mix format                         # no diff
mix compile --warnings-as-errors   # clean
mix test test/lua/vm/string_test.exs   # 111 tests, 41 properties, 0 failures
mix test                           # green
mix test --only lua53              # no regression
mix lua.bench --workload string_format  # before/after numbers above

Out of scope (intentional)

  • The O(n^2) insertion-sort path used when a user-supplied Lua comparator
    is passed (sort_values/2 / insert_sorted/4). The comparator branch
    keeps going through Executor.call_function/3.
  • Lua.VM.Table.put/3's order_tail allocation on every write.
  • Any change to the comparator-driven write-back, error messages, or
    argument validation.

🤖 Generated with Claude Code

…aths

string.format accumulated its result with per-character binary
concatenation, making it O(n^2) in the format string length; it now
builds an iolist and collapses it once with IO.iodata_to_binary/1.
table.sort and table.concat now read metatable-less tables directly from
the data map (and table.sort writes back with a single Map.put) instead
of routing every slot through Executor.table_index/table_newindex.
apply_width_flags measures padding width with byte_size/1 instead of the
grapheme-walking String.length/1. All three are behavior-preserving
constant-factor wins; tables with a metatable keep the Executor path.

Plan: B9
Closes #273
apply_width_flags now measures and fills width in bytes consistently.
The threshold already used byte_size/1, but the padding branches used
String.pad_*/String.slice, which count codepoints, so a multibyte %s
under a width was padded by codepoints against a byte threshold. That
mismatch produced output longer than the requested width (e.g.
format("%6s", "café") emitted 7 bytes). Padding is now byte-for-byte,
matching PUC-Lua, which passes the width straight to C's printf.

Adds regression tests pinning multibyte %s width semantics: right- and
left-justified fills measured in bytes, and the no-pad boundary when the
byte length already meets the width.
@davydog187 davydog187 force-pushed the perf/stdlib-hot-paths branch from b18e50d to a654cf3 Compare May 31, 2026 14:09
davydog187 and others added 2 commits May 31, 2026 10:55
The iolist rewrite of `format_string/3` appended one cell per literal
character (`[acc, <<char::utf8>>]`), which ballooned both the iolist and
its final `IO.iodata_to_binary/1` flatten on literal-heavy format
strings — a ~430-char template regressed 2.75x in time and 4.7x in
memory versus the per-character binary append it replaced, because the
BEAM's append optimization had been coalescing those writes.

Copy each maximal run of non-`%` bytes as a single chunk via
`:binary.split/2` instead. `%` is ASCII 0x25 and never appears as a
UTF-8 continuation byte, so byte-splitting is safe for multibyte
literals. Specifier-dense formats keep the iolist win; literal-heavy
formats no longer regress.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The existing string_ops benchmark formats a short specifier-dense string
("item_%d=%f"), which only exercises argument conversion. This adds a
dedicated string.format workload that varies the format string itself:
a long literal-heavy template, width-flagged specifiers, and a
conversion-heavy many-specifier string. The literal-heavy case guards
the literal-run accumulation path, whose cost is dominated by copying
literal bytes rather than converting arguments.

Follows the suite convention: helpers.exs run modes, and eval/chunk vs
Luerl vs optional C Lua comparisons.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@davydog187 davydog187 merged commit 5e868e6 into main May 31, 2026
5 checks passed
@davydog187 davydog187 deleted the perf/stdlib-hot-paths branch May 31, 2026 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

stdlib hot paths: string.format iolist + plain-table fast paths (B9)

1 participant