Speed up parsing by interning TOMLChar instances by tfoutrein · Pull Request #488 · python-poetry/tomlkit

tfoutrein · 2026-06-05T14:04:25Z

What

The parser reads a document one character at a time, wrapping each in a TOMLChar. Since TOML draws on a tiny alphabet, the same single-character strings are reconstructed on every read.

This interns TOMLChar instances in a module-level cache and returns the cached instance, so every repeat construction becomes a dict lookup.

Correctness

The NUL character is deliberately left un-interned: Source uses TOMLChar("\0") as its end-of-input sentinel and detects EOF by identity (current is EOF). Caching "\0" would make a real U+0000 in the input alias the sentinel and be mistaken for end-of-file. Leaving it un-interned keeps EOF a unique sentinel. The full toml-test conformance suite — including the invalid/control and invalid/encoding/utf16 cases that exercise embedded nulls — passes.

Benchmarks

Parsing speedup across a range of document sizes/shapes (median, interleaved A/B vs master):

document	speedup
poetry.lock-like (300 `[[package]]`, ~64 KB)	1.14×
large flat (~90 KB)	1.12×
typical mixed (~4 KB)	1.13×
pyproject.toml	1.09×
string-heavy	1.09×
array-heavy	1.08×

No regression on any shape (tiny inputs are unchanged).

Tests

Full suite passes (972 tests, incl. the toml-test conformance submodule). No public API or behaviour change — round-trip output is byte-identical to master on a varied corpus.

The parser reads a document one character at a time, wrapping each in a `TOMLChar`. Since TOML draws on a tiny alphabet, the same single-character strings are reconstructed over and over. Cache one `TOMLChar` per distinct character in a module-level dict and return the cached instance, turning every repeat construction into a dict lookup. The NUL character is deliberately left un-interned: `Source` uses `TOMLChar("\0")` as its end-of-input sentinel and detects EOF by identity, so caching it would make a real U+0000 in the input alias the sentinel and be mistaken for end-of-file. No behaviour change (full test suite incl. the toml-test conformance cases passes); ~1.1x faster parsing across a range of document sizes/shapes.

dimbleby · 2026-06-05T22:00:02Z

One of the commits in #467 removed TOMLChar altogether, which feels as though it ought to be a better version of this?

tfoutrein · 2026-06-06T16:08:03Z

Good point — yes, removing TOMLChar entirely is the better end-state than interning it, and that's where this should land eventually.

Context on the sequencing: #488 is the first and smallest of a short series of focused PRs (#489 index-based Source, #490 bulk run-scanning, #491 single-line string bodies). The last two matter here — they replace the per-character inc() loops with bulk scans over raw str slices, so they already avoid the vast majority of TOMLChar construction in the hot paths; it only gets built at run boundaries afterwards.

I sized the remaining effect honestly (size×shape matrix, interleaved A/B vs master): on top of #490/#491, interning the leftover TOMLChar constructions — a proxy for "stop paying for the per-char object" — adds only ~0–5% depending on the document, mostly in the noise. The 2–5× wins are the bulk-scans themselves, not the per-char object strategy. Removing TOMLChar outright would save that construction plus the is_*() method dispatch, so a little more, but still single-digit % over the scanned baseline.

So I'd suggest keeping #488 as a low-risk stepping stone and doing the full TOMLChar removal as a clean follow-up — it's a larger change (Source(str), the EOF sentinel, and the is_bare_key_char/is_spaces/… helpers all move to plain str + sets). Happy to go straight to removal instead if you and @frostming prefer; just flagging that, after the bulk-scans, the remaining headroom there is small.

tfoutrein · 2026-06-06T21:18:33Z

Following up on @dimbleby's point above: I've opened #492, which removes TOMLChar entirely — the parser now reads plain str characters and detects end-of-input positionally — rather than interning it. That's the better end-state he suggested, and it builds on the bulk-scan series (#489–#491). It's iso-functional (full suite incl. the toml-test conformance submodule, plus an ~11.5k-input adversarial differential over EOF/truncation, real-NUL placement, empty/whitespace and structural fuzz — byte-identical to master).

That makes this interning PR redundant, so my suggestion would be to close this one in favour of #492. Leaving it open for now in case you'd rather take the smaller, lower-risk interning change first — your call.

tfoutrein force-pushed the perf/intern-tomlchar branch from 0e0d381 to ce66d20 Compare June 5, 2026 14:06

tfoutrein mentioned this pull request Jun 6, 2026

Remove the internal TOMLChar wrapper #492

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up parsing by interning TOMLChar instances#488

Speed up parsing by interning TOMLChar instances#488
tfoutrein wants to merge 1 commit into
python-poetry:masterfrom
AstekGroup:perf/intern-tomlchar

tfoutrein commented Jun 5, 2026

Uh oh!

dimbleby commented Jun 5, 2026

Uh oh!

tfoutrein commented Jun 6, 2026

Uh oh!

tfoutrein commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

tfoutrein commented Jun 5, 2026

What

Correctness

Benchmarks

Tests

Uh oh!

dimbleby commented Jun 5, 2026

Uh oh!

tfoutrein commented Jun 6, 2026

Uh oh!

tfoutrein commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants