Skip to content

Speed up parsing by interning TOMLChar instances#488

Open
tfoutrein wants to merge 1 commit into
python-poetry:masterfrom
AstekGroup:perf/intern-tomlchar
Open

Speed up parsing by interning TOMLChar instances#488
tfoutrein wants to merge 1 commit into
python-poetry:masterfrom
AstekGroup:perf/intern-tomlchar

Conversation

@tfoutrein
Copy link
Copy Markdown
Contributor

What

The parser reads a document one character at a time, wrapping each in a TOMLChar. Since TOML draws on a tiny alphabet, the same single-character strings are reconstructed on every read.

This interns TOMLChar instances in a module-level cache and returns the cached instance, so every repeat construction becomes a dict lookup.

Correctness

The NUL character is deliberately left un-interned: Source uses TOMLChar("\0") as its end-of-input sentinel and detects EOF by identity (current is EOF). Caching "\0" would make a real U+0000 in the input alias the sentinel and be mistaken for end-of-file. Leaving it un-interned keeps EOF a unique sentinel. The full toml-test conformance suite — including the invalid/control and invalid/encoding/utf16 cases that exercise embedded nulls — passes.

Benchmarks

Parsing speedup across a range of document sizes/shapes (median, interleaved A/B vs master):

document speedup
poetry.lock-like (300 [[package]], ~64 KB) 1.14×
large flat (~90 KB) 1.12×
typical mixed (~4 KB) 1.13×
pyproject.toml 1.09×
string-heavy 1.09×
array-heavy 1.08×

No regression on any shape (tiny inputs are unchanged).

Tests

Full suite passes (972 tests, incl. the toml-test conformance submodule). No public API or behaviour change — round-trip output is byte-identical to master on a varied corpus.

The parser reads a document one character at a time, wrapping each in a
`TOMLChar`. Since TOML draws on a tiny alphabet, the same single-character
strings are reconstructed over and over.

Cache one `TOMLChar` per distinct character in a module-level dict and
return the cached instance, turning every repeat construction into a dict
lookup. The NUL character is deliberately left un-interned: `Source` uses
`TOMLChar("\0")` as its end-of-input sentinel and detects EOF by identity,
so caching it would make a real U+0000 in the input alias the sentinel and
be mistaken for end-of-file.

No behaviour change (full test suite incl. the toml-test conformance cases
passes); ~1.1x faster parsing across a range of document sizes/shapes.
@tfoutrein tfoutrein force-pushed the perf/intern-tomlchar branch from 0e0d381 to ce66d20 Compare June 5, 2026 14:06
@dimbleby
Copy link
Copy Markdown
Contributor

dimbleby commented Jun 5, 2026

One of the commits in #467 removed TOMLChar altogether, which feels as though it ought to be a better version of this?

@tfoutrein
Copy link
Copy Markdown
Contributor Author

Good point — yes, removing TOMLChar entirely is the better end-state than interning it, and that's where this should land eventually.

Context on the sequencing: #488 is the first and smallest of a short series of focused PRs (#489 index-based Source, #490 bulk run-scanning, #491 single-line string bodies). The last two matter here — they replace the per-character inc() loops with bulk scans over raw str slices, so they already avoid the vast majority of TOMLChar construction in the hot paths; it only gets built at run boundaries afterwards.

I sized the remaining effect honestly (size×shape matrix, interleaved A/B vs master): on top of #490/#491, interning the leftover TOMLChar constructions — a proxy for "stop paying for the per-char object" — adds only ~0–5% depending on the document, mostly in the noise. The 2–5× wins are the bulk-scans themselves, not the per-char object strategy. Removing TOMLChar outright would save that construction plus the is_*() method dispatch, so a little more, but still single-digit % over the scanned baseline.

So I'd suggest keeping #488 as a low-risk stepping stone and doing the full TOMLChar removal as a clean follow-up — it's a larger change (Source(str), the EOF sentinel, and the is_bare_key_char/is_spaces/… helpers all move to plain str + sets). Happy to go straight to removal instead if you and @frostming prefer; just flagging that, after the bulk-scans, the remaining headroom there is small.

@tfoutrein
Copy link
Copy Markdown
Contributor Author

Following up on @dimbleby's point above: I've opened #492, which removes TOMLChar entirely — the parser now reads plain str characters and detects end-of-input positionally — rather than interning it. That's the better end-state he suggested, and it builds on the bulk-scan series (#489#491). It's iso-functional (full suite incl. the toml-test conformance submodule, plus an ~11.5k-input adversarial differential over EOF/truncation, real-NUL placement, empty/whitespace and structural fuzz — byte-identical to master).

That makes this interning PR redundant, so my suggestion would be to close this one in favour of #492. Leaving it open for now in case you'd rather take the smaller, lower-risk interning change first — your call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants