perf(stdlib): iolist string.format and plain-table sort/concat fast paths #299
Merged
…aths string.format accumulated its result with per-character binary concatenation, making it O(n^2) in the format string length; it now builds an iolist and collapses it once with IO.iodata_to_binary/1. table.sort and table.concat now read metatable-less tables directly from the data map (and table.sort writes back with a single Map.put) instead of routing every slot through Executor.table_index/table_newindex. apply_width_flags measures padding width with byte_size/1 instead of the grapheme-walking String.length/1. All three are behavior-preserving constant-factor wins; tables with a metatable keep the Executor path. Plan: B9 Closes #273
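The iolist approach described in this commit message can be sketched as follows. This is a minimal illustration, not the PR's `format_string/3`: the module `IolistSketch` and its `join/1` function are hypothetical names introduced here, and only `IO.iodata_to_binary/1` and `Enum.reduce/3` are real library calls.

```elixir
defmodule IolistSketch do
  # Hypothetical helper (not part of the PR): each piece is nested into the
  # accumulator as [acc, piece], which is O(1) per step, and the nested
  # structure is flattened into a binary exactly once at the end.
  def join(pieces) do
    pieces
    |> Enum.reduce([], fn piece, acc -> [acc, piece] end)
    |> IO.iodata_to_binary()
  end
end

IO.puts(IolistSketch.join(["item_", Integer.to_string(7), "=", "1.5"]))
# prints item_7=1.5
```

The nesting order `[acc, piece]` preserves left-to-right output order, so no `Enum.reverse/1` is needed before flattening.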
apply_width_flags now measures and fills width in bytes consistently.
The threshold already used byte_size/1, but the padding branches used
String.pad_*/String.slice, which count codepoints, so a multibyte %s
under a width was padded by codepoints against a byte threshold. That
mismatch produced output longer than the requested width (e.g.
format("%6s", "café") emitted 7 bytes). Padding is now byte-for-byte,
matching PUC-Lua, which passes the width straight to C's printf.
Adds regression tests pinning multibyte %s width semantics: right- and
left-justified fills measured in bytes, and the no-pad boundary when the
byte length already meets the width.
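The width mismatch described above can be reproduced in a few lines. This is a sketch under stated assumptions: `BytePadSketch.pad_leading_bytes/2` is a hypothetical stand-in for the right-justified padding branch, not the PR's `apply_width_flags`; `String.pad_leading/2`, `String.duplicate/2`, and `byte_size/1` are standard Elixir.

```elixir
defmodule BytePadSketch do
  # Hypothetical stand-in for the right-justified fill: the number of pad
  # bytes is computed from byte_size/1, so the result never exceeds `width`
  # bytes unless the input already does.
  def pad_leading_bytes(str, width) when is_binary(str) and is_integer(width) do
    missing = width - byte_size(str)
    if missing > 0, do: String.duplicate(" ", missing) <> str, else: str
  end
end

s = "caf\u00E9"                                # 4 characters, 5 bytes in UTF-8
old = String.pad_leading(s, 6)                 # pads by character count
new = BytePadSketch.pad_leading_bytes(s, 6)    # pads by byte count
IO.inspect({byte_size(s), byte_size(old), byte_size(new)})
# prints {5, 7, 6}
```

The character-counting pad adds two spaces to a 5-byte string, which gives the 7-byte output mentioned in the commit message; the byte-counting pad adds one.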
b18e50d to
a654cf3
The iolist rewrite of `format_string/3` appended one cell per literal character (`[acc, <<char::utf8>>]`), which ballooned both the iolist and its final `IO.iodata_to_binary/1` flatten on literal-heavy format strings. A ~430-char template regressed 2.75x in time and 4.7x in memory versus the per-character binary append it replaced, because the BEAM's append optimization had been coalescing those writes.
Copy each maximal run of non-`%` bytes as a single chunk via `:binary.split/2` instead. `%` is ASCII 0x25 and never appears as a UTF-8 continuation byte, so byte-splitting is safe for multibyte literals. Specifier-dense formats keep the iolist win; literal-heavy formats no longer regress.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
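A minimal sketch of literal-run splitting with `:binary.split/2`, assuming a simplified format grammar in which every `%` is followed by exactly one conversion character. The module `LiteralRunSketch` and the `{:literal, _}` / `{:spec, _}` tuples are hypothetical; flags, widths, and `%%` escapes are deliberately not handled.

```elixir
defmodule LiteralRunSketch do
  # Hypothetical scanner (not the PR's format_string/3): splits a format
  # string into literal runs and one-character specifiers.
  def split_runs(fmt) when is_binary(fmt), do: do_split(fmt, [])

  defp do_split("", acc), do: Enum.reverse(acc)

  defp do_split(fmt, acc) do
    # :binary.split/2 splits at the first "%" only, so everything before it
    # is one maximal literal run taken as a single chunk.
    case :binary.split(fmt, "%") do
      [literal] ->
        # No "%" left: the remainder is a single literal run.
        Enum.reverse([{:literal, literal} | acc])

      [literal, <<conv, rest::binary>>] ->
        # One byte after "%" is treated as the conversion character.
        do_split(rest, [{:spec, <<conv>>}, {:literal, literal} | acc])

      [_literal, ""] ->
        raise ArgumentError, "format string ends with a bare %"
    end
  end
end

IO.inspect(LiteralRunSketch.split_runs("héllo %d wörld %s"))
```

Because every byte of a multibyte UTF-8 sequence is at least 0x80, the byte 0x25 can only match a real `%`, so the multibyte literals `héllo ` and ` wörld ` come back intact.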
The existing string_ops benchmark formats a short specifier-dense string
("item_%d=%f"), which only exercises argument conversion. This adds a
dedicated string.format workload that varies the format string itself:
a long literal-heavy template, width-flagged specifiers, and a
conversion-heavy many-specifier string. The literal-heavy case guards
the literal-run accumulation path, whose cost is dominated by copying
literal bytes rather than converting arguments.
Follows the suite convention: helpers.exs run modes, and eval/chunk vs
Luerl vs optional C Lua comparisons.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 1, 2026
stdlib hot paths: string.format iolist + plain-table fast paths
Plan: `.agents/plans/B9-stdlib-hot-paths.md`
Closes #273
Goal
Close three stdlib-side performance gaps surfaced by the benchmark suite.
Observable behavior is byte-for-byte identical before and after.
1. `string.format` literal accumulation. `format_string/3` built its result with per-character binary concatenation, which is O(n^2) in the length of the format string. It now accumulates an iolist collapsed once with `IO.iodata_to_binary/1`, and copies each run of literal (non-`%`) bytes as a single chunk via `:binary.split/2` rather than one iolist cell per character. (`%` is ASCII `0x25` and never appears as a UTF-8 continuation byte, so byte-splitting is multibyte-safe.) See the note below: the initial per-character iolist regressed literal-heavy formats badly; the literal-run coalescing reverses that and beats the original.
2. `table.sort` / `table.concat` plain-table fast path. When `metatable == nil`, both functions now read slots directly via `Lua.VM.Table.get_data/2`, and `table.sort` writes back via `Lua.VM.Table.put/3` + a single `Map.put/3`, skipping the `Executor.table_index/3` / `table_newindex/3` dispatch. Tables with a metatable keep the Executor path unchanged.
3. `apply_width_flags` byte-consistent width. Both the width threshold and the padding fill are now measured in bytes, matching PUC-Lua (which hands the width straight to C's `printf`). Previously the threshold used `byte_size/1` but the fill used `String.pad_*` / `String.slice`, which count codepoints, so a multibyte `%s` under a width was padded by codepoints against a byte threshold (e.g. `format("%6s", "café")` emitted 7 bytes). Now byte-for-byte.
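The plain-table fast path in item 2 can be sketched roughly as below. Everything here is an assumption made for illustration: the table is modeled as a bare map with `:data` and `:metatable` keys, and the `slow_index` function argument stands in for the metamethod-aware dispatch. The real `Lua.VM.Table` and `Executor` signatures are not shown in this PR text, so none of them are used.

```elixir
defmodule PlainTableSketch do
  # Hypothetical table shape: %{data: map, metatable: nil | term}.

  # Fast path: with no metatable no metamethod can fire, so slots 1..len
  # are read straight from the data map.
  def concat(%{metatable: nil, data: data}, sep, len, _slow_index) do
    1..len//1
    |> Enum.map(&Map.fetch!(data, &1))
    |> Enum.intersperse(sep)
    |> IO.iodata_to_binary()
  end

  # Slow path: every slot goes through the caller-supplied lookup, which
  # stands in for the general metamethod-aware indexing.
  def concat(table, sep, len, slow_index) do
    1..len//1
    |> Enum.map(&slow_index.(table, &1))
    |> Enum.intersperse(sep)
    |> IO.iodata_to_binary()
  end
end

plain = %{metatable: nil, data: %{1 => "a", 2 => "b", 3 => "c"}}
IO.puts(PlainTableSketch.concat(plain, ",", 3, fn _t, _k -> raise "unused" end))
# prints a,b,c
```

Dispatching on `metatable: nil` in the function head keeps the two paths separate, so a table with a metatable can never take the direct-read branch.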
Benchmarking against the merge-base surfaced a regression introduced by the as-is iolist change (1, above): appending one iolist cell per literal character (`[acc, <<char::utf8>>]`) ballooned both the iolist and its final `IO.iodata_to_binary/1` flatten on literal-heavy format strings, while the per-character binary append it replaced was being coalesced by the BEAM's append optimization. A ~430-char template (3 specifiers, mostly literal text) regressed ~2.6× in time and ~4.7× in memory.

`d2f4e55` fixes this by grabbing each literal run in one chunk (`:binary.split/2`) instead of per character. The literal-heavy case is now ~2.4× faster than the merge-base with slightly less memory, and specifier-dense formats keep their win.
Benchmarks (`lua` chunk path, idle Apple M4, quick mode)

`base` = merge-base `4f93396`. `PR as-is` = the per-character iolist (`bddb677`). `PR fixed` = with literal-run coalescing (`d2f4e55`).

[Result tables not preserved. Workloads measured: `string.format` long literal-heavy (n=1000); `string.format` width-flagged specifiers (n=1000); `string.format` many specifiers (n=1000); `table.sort` (n=100)]

Notes:
- … no regression anywhere.
- … the fixed version lands slightly below the as-is peak (`:binary.split/2` carries a small per-run cost the per-character version avoided) but stays ahead of base. The trade buys back the catastrophic literal-heavy loss.
- … (`table` build / iterate / map+reduce) stayed flat within noise, confirming the machine was stable.
`57e9ff2` adds `benchmarks/string_format.exs` so the format-string-shape axis stays covered: the existing `string_ops` benchmark only formats a short specifier-dense string (`"item_%d=%f"`), which exercises argument conversion but not literal accumulation. The new workload varies the format string itself (long literal-heavy, width-flagged, many-specifier) and follows suite convention (`helpers.exs` run modes; eval/chunk vs Luerl vs optional C Lua). Run with `mix lua.bench --workload string_format`.

Success criteria
- `mix format` produces no diff (ran before commit).
- `mix compile --warnings-as-errors` passes.
- `mix test` green, no regressions.
- `mix test --only lua53` no regression.
- `string.format` builds its result via an iolist with a single `IO.iodata_to_binary/1`; no `acc <> ...` concat remains in `format_string/3`.
- `string.format` copies literal runs in one chunk (`:binary.split/2`), not one iolist cell per character, so literal-heavy formats don't regress.
- `table.sort` / `table.concat` read plain tables via `Table.get_data/2`, and `table.sort` writes back via `Table.put/3` + `Map.put/3`, only when `metatable == nil`.
- … `Executor.table_index/3` / `table_newindex/3`.
- `apply_width_flags/3` measures the width threshold and the padding fill in bytes (multibyte `%s` is padded byte-for-byte, matching PUC-Lua).
- … `%s` width pinned by new regression tests.
- … (`benchmarks/string_format.exs`); literal-heavy case confirmed ~2.4× faster than base with less memory.

Changes
Verification
Out of scope (intentional)
- … is passed (`sort_values/2` / `insert_sorted/4`). The comparator branch keeps going through `Executor.call_function/3`.
- … `Lua.VM.Table.put/3`'s `order_tail` allocation on every write.
- … argument validation.
🤖 Generated with Claude Code