perf(tokens): cache TokenizerCore per thread (#7547)
Merged

**Collaborator:** Seems like this caused a segfault.
Rebuilding `TokenizerCore` on every `parse_one` call was ~6µs of wasted work; the core's construction is purely a function of the Tokenizer subclass, so caching it per (thread, class) is safe. Also drops two `list[T](...)` subscripted-generic constructions that were pure type-annotation theatre, and narrows `bit_strings` / `hex_strings` to `has_bit_strings` / `has_hex_strings` bools, since `TokenizerCore` only truthy-checks them.

`ThreadLocalCache` lives in tokens.py (not sqlglotc-compiled). Subclassing `threading.local` inside a mypyc-compiled module causes a segfault because mypyc's fixed-slot attribute access bypasses `threading.local`'s per-thread `__dict__` swap, racing all threads on the same C slot.
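A minimal sketch of the per-(thread, class) caching pattern described above, built on `threading.local`. The names here (`ThreadLocalCache`, `get`, the placeholder `Tokenizer`/core factory) are assumptions for illustration, not sqlglot's actual implementation:

```python
import threading


class ThreadLocalCache(threading.local):
    """Per-thread cache keyed by class. Because threading.local swaps
    __dict__ per thread, each thread sees its own `cores` dict.
    (Hypothetical sketch, not sqlglot's actual helper.)"""

    def __init__(self):
        # threading.local runs __init__ once per thread, on that
        # thread's first access to the instance.
        self.cores = {}

    def get(self, cls, factory):
        core = self.cores.get(cls)
        if core is None:
            core = self.cores[cls] = factory()
        return core


_CORE_CACHE = ThreadLocalCache()


class Tokenizer:
    def __init__(self):
        # Reuse this thread's core for this exact subclass instead of
        # rebuilding it on every Tokenizer construction.
        self._core = _CORE_CACHE.get(type(self), lambda: object())
```

Within one thread, repeated `Tokenizer()` constructions share a single core; a second thread builds its own, so no mutable state crosses threads.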
**Owner (Author):** @georgesittas fixed
**Contributor:** SQLGlot Integration Test Results

Overall:

- main: 113234 total, 112044 passed (pass rate: 98.9%)
- perf/cache-tokenizer-core: 106858 total, 106858 passed (pass rate: 100.0%)

Transitions (dialect pair changes): 0 previous results not found, 2 current results not found.

✅ 34 test(s) passed

(Per-dialect breakdown not captured.)
**Owner (Author):** /benchmark
**Owner (Author):** /bench
**Contributor:** Benchmark Results

Legend: 🟢🟢 = 5%+ faster | 🟢 = 3-5% faster | 🟩 = 1-3% faster | ⚪ = unchanged | 🟧 = 1-3% slower | 🔴 = 3-5% slower | 🔴🔴 = 5%+ slower

(Result tables for `sqlglot` and `sqlglot[c]` not captured.)
Summary

- `Tokenizer.__init__` was rebuilding `TokenizerCore` on every `parse_one` call (~6µs). Its args are pure functions of the `Tokenizer` subclass, so it's cached per (thread, class) via a new `ThreadLocalCache` helper in `sqlglot/helper.py`.
- Dropped `list[t.Union[str, tuple[str, str]]](...)` subscripted-generic constructions at init that were pure type-annotation theatre (2.7µs / 41% of init cost).
- `bit_strings` / `hex_strings` on `TokenizerCore` are only truthy-checked, so they're narrowed to `has_bit_strings` / `has_hex_strings` bools.

Benchmark — `parse_one("1")`, pure Python, best of 5 × 30k iters, same session:

(Benchmark table not captured.)
The speedup is a fixed overhead reduction, so the win is largest on short SQL. Large queries (tpch, many_joins, etc.) see the same absolute ~7µs drop, but it's drowned in parse time.
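The "fixed overhead" argument can be made concrete with a little arithmetic: a constant ~7µs saving is a big fraction of a ~20µs tiny-query parse, and noise against a 10ms one. The per-query totals below are illustrative assumptions, not the PR's measurements:

```python
# Illustrative model: parse time = fixed overhead + size-dependent work.
SAVED_US = 7.0  # approx. constant overhead removed per parse_one call


def relative_speedup(total_parse_us: float) -> float:
    """Fraction of total parse time eliminated by removing fixed overhead."""
    return SAVED_US / total_parse_us


short_query_us = 20.0      # tiny query, e.g. parse_one("1")
large_query_us = 10_000.0  # large query, e.g. a TPC-H statement

print(f"short: {relative_speedup(short_query_us):.0%}")   # 35%
print(f"large: {relative_speedup(large_query_us):.2%}")   # 0.07%
```

Same absolute saving, three orders of magnitude apart in relative impact.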
Thread safety

`TokenizerCore.tokenize()` mutates internal state (`sql`, `_current`, `_line`, …) via `reset()` before each call, so sharing one core across threads would race on those fields. `ThreadLocalCache` subclasses `threading.local`, so each thread gets its own cache dict and its own per-class core, preserving the same guarantees as today's "construct-fresh-each-call" behavior.

Verified with a stress test modeled on #520: 32 threads × 10 iterations parsing a mix of SQL across 6 dialects concurrently; all results matched the serial baseline exactly.
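The shape of that stress test can be sketched as follows. The `parse` function and its per-thread "core" are stand-ins (the real test runs `parse_one` across 6 dialects); the thread count and compare-against-serial structure follow the description above:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class _Cache(threading.local):
    # Stand-in for ThreadLocalCache: each thread gets its own dict.
    def __init__(self):
        self.cores = {}


_cache = _Cache()


def parse(sql: str) -> str:
    # Stand-in for parse_one: uses and mutates a per-thread "core".
    core = _cache.cores.setdefault("core", [])
    core.clear()            # like reset(): would race if the core were shared
    core.extend(sql.split())
    return "|".join(core)


WORKLOAD = ["SELECT 1", "SELECT a FROM t", "SELECT x FROM t WHERE x > 1"] * 10

# Serial baseline, then the same workload on 32 threads concurrently.
baseline = [parse(sql) for sql in WORKLOAD]
with ThreadPoolExecutor(max_workers=32) as pool:
    concurrent = list(pool.map(parse, WORKLOAD))

assert concurrent == baseline  # per-thread cores keep results deterministic
```

Because `pool.map` preserves input order, a direct list comparison against the serial baseline catches any cross-thread corruption of the mutable core.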
Test plan

- `make style` passes
- `make unit` passes (pure Python; `SKIP_INTEGRATION=1 python -m unittest`, 1231 tests)
- Bit/hex string literals (`b'101'`, `x'ff'`, `0b101`, `0xff`) still tokenize correctly post-rename