perf(tokens): cache TokenizerCore per thread #7547

Merged
tobymao merged 1 commit into main from perf/cache-tokenizer-core
Apr 23, 2026
Conversation

@tobymao
Owner

@tobymao tobymao commented Apr 23, 2026

Summary

  • Tokenizer.__init__ was rebuilding TokenizerCore on every parse_one call (~6µs). Its args are pure functions of the Tokenizer subclass, so it's cached per (thread, class) via a new ThreadLocalCache helper in sqlglot/helper.py.
  • Drops two list[t.Union[str, tuple[str, str]]](...) subscripted-generic constructions at init that were pure type-annotation theatre (2.7µs / 41% of init cost).
  • bit_strings / hex_strings on TokenizerCore are only truthy-checked, so they're narrowed to has_bit_strings / has_hex_strings bools.
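The per-(thread, class) caching pattern can be sketched as below. The exact `ThreadLocalCache` interface in sqlglot/helper.py isn't shown in this PR, so the helper's shape, the `get_core` function, and the `object()` stand-in for `TokenizerCore` are all illustrative assumptions:

```python
import threading
from typing import Any, Dict, Type


class ThreadLocalCache(threading.local):
    """Per-thread cache: threading.local swaps the instance __dict__ per
    thread, so each thread runs __init__ once and gets its own dict."""

    def __init__(self) -> None:
        self.cache: Dict[Type[Any], Any] = {}


_CORES = ThreadLocalCache()


def get_core(cls: Type[Any]) -> Any:
    # One entry per (thread, Tokenizer subclass): the thread dimension
    # comes from threading.local, the class dimension from the dict key.
    core = _CORES.cache.get(cls)
    if core is None:
        # Stand-in for constructing TokenizerCore(...) from `cls` attributes,
        # which the PR says are pure functions of the Tokenizer subclass.
        core = _CORES.cache[cls] = object()
    return core
```

Within one thread, `get_core(SomeTokenizer)` returns the same object on every call; a different thread builds (and reuses) its own copy.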

Benchmark — parse_one("1")

Pure Python, best of 5 × 30k iters, same session:

|        | time          | delta |
| ------ | ------------- | ----- |
| before | 31.98 µs/call |       |
| after  | 24.58 µs/call | −23%  |

The speedup is a fixed-overhead reduction, so the win is largest on short SQL. Large queries (tpch, many_joins, etc.) see the same absolute ~7µs drop, but it's drowned out by parse time.

Thread safety

TokenizerCore.tokenize() mutates internal state (sql, _current, _line, …) via reset() before each call. Sharing one core across threads would race on those fields. ThreadLocalCache subclasses threading.local, so each thread gets its own cache dict and its own per-class core — same guarantees as today's "construct-fresh-each-call" behavior.

Verified with a stress test modeled on #520: 32 threads × 10 iterations parsing a mix of SQL across 6 dialects concurrently; all results matched the serial baseline exactly.
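A harness in the shape of that stress test might look like the sketch below. The `parse` function here is a deterministic stand-in for `sqlglot.parse_one(sql).sql(dialect=...)` (the real test uses 6 dialects); the point is the harness structure, comparing every thread's output against a serial baseline:

```python
import threading
from typing import List


def parse(sql: str) -> str:
    # Deterministic stand-in for sqlglot.parse_one(sql).sql(); the harness
    # only needs "same input -> same output" to detect cross-thread races.
    return " ".join(sql.split()).upper()


def stress(queries: List[str], n_threads: int = 32, iters: int = 10) -> bool:
    # Serial baseline computed once, before any concurrency.
    baseline = [parse(q) for q in queries]
    failures: List[List[str]] = []

    def worker() -> None:
        for _ in range(iters):
            got = [parse(q) for q in queries]
            if got != baseline:
                failures.append(got)  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not failures
```

With a shared mutable core and no thread-local isolation, interleaved `reset()`/`tokenize()` calls would make `got` diverge from `baseline` and `stress` would return False.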

Test plan

  • make style passes
  • make unit passes (pure Python; SKIP_INTEGRATION=1 python -m unittest — 1231 tests)
  • Threaded stress test (32 workers, mixed dialects) — no divergence from serial
  • Bit/hex literal paths exercised manually (b'101', x'ff', 0b101, 0xff) post-rename

@tobymao tobymao changed the title perf(tokens): cache TokenizerCore per thread [CLAUDE] perf(tokens): cache TokenizerCore per thread Apr 23, 2026
@tobymao tobymao force-pushed the perf/cache-tokenizer-core branch from a22a0c5 to 51b2bf8 on April 23, 2026 06:11
@georgesittas
Collaborator

Seems like this caused a segfault.

Rebuilding TokenizerCore on every `parse_one` call was ~6µs of wasted work;
the core's construction is purely a function of the Tokenizer subclass, so
caching it per (thread, class) is safe. Also drops two `list[T](...)`
subscripted-generic constructions that were pure type-annotation theatre,
and narrows `bit_strings` / `hex_strings` to `has_bit_strings` /
`has_hex_strings` bools since TokenizerCore only truthy-checks them.

ThreadLocalCache lives in tokens.py (not sqlglotc-compiled). Subclassing
threading.local inside a mypyc-compiled module causes a segfault because
mypyc's fixed-slot attribute access bypasses threading.local's per-thread
__dict__ swap, racing all threads on the same C slot.
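The per-thread `__dict__` swap that mypyc's fixed-slot access bypasses can be observed directly in pure Python. This snippet is illustrative only, not sqlglot code:

```python
import threading


class Local(threading.local):
    def __init__(self) -> None:
        # Runs again on first access from each new thread, because
        # threading.local gives every thread a fresh instance __dict__.
        self.value = None


loc = Local()
loc.value = "main"
seen = {}


def worker() -> None:
    seen["before"] = loc.value  # this thread's fresh copy, not "main"
    loc.value = "worker"
    seen["after"] = loc.value


t = threading.Thread(target=worker)
t.start()
t.join()
# Back in the main thread, loc.value is still "main".
```

Pure-Python attribute access routes through that swapped `__dict__`, which is what makes the isolation work; compiled fixed-slot access reads one shared C field instead, hence the segfault risk the commit message describes.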
@tobymao tobymao force-pushed the perf/cache-tokenizer-core branch from c453c28 to 1bc41d2 on April 23, 2026 14:49
@tobymao
Owner Author

tobymao commented Apr 23, 2026

@georgesittas fixed

@github-actions
Contributor

SQLGlot Integration Test Results

Comparing:

  • this branch (sqlglot:perf/cache-tokenizer-core, sqlglot version: perf/cache-tokenizer-core)
  • baseline (main, sqlglot version: 0.0.1.dev1)

By Dialect

| dialect | main | sqlglot:perf/cache-tokenizer-core | transitions | links |
| --- | --- | --- | --- | --- |
| bigquery -> bigquery | 24645/24650 passed (100.0%) | 23495/23495 passed (100.0%) | No change | full result / delta |
| bigquery -> duckdb | 867/1154 passed (75.1%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| duckdb -> duckdb | 5823/5823 passed (100.0%) | 5823/5823 passed (100.0%) | No change | full result / delta |
| snowflake -> duckdb | 1063/1961 passed (54.2%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| snowflake -> snowflake | 65133/65133 passed (100.0%) | 63027/63027 passed (100.0%) | No change | full result / delta |
| databricks -> databricks | 1370/1370 passed (100.0%) | 1370/1370 passed (100.0%) | No change | full result / delta |
| postgres -> postgres | 6042/6042 passed (100.0%) | 6042/6042 passed (100.0%) | No change | full result / delta |
| redshift -> redshift | 7101/7101 passed (100.0%) | 7101/7101 passed (100.0%) | No change | full result / delta |

Overall

main: 113234 total, 112044 passed (pass rate: 98.9%), sqlglot version: 0.0.1.dev1

sqlglot:perf/cache-tokenizer-core: 106858 total, 106858 passed (pass rate: 100.0%), sqlglot version: perf/cache-tokenizer-core

Transitions:
No change

Dialect pair changes: 0 previous results not found, 2 current results not found

✅ 34 test(s) passed

@tobymao
Owner Author

tobymao commented Apr 23, 2026

/benchmark

@tobymao
Owner Author

tobymao commented Apr 23, 2026

/bench

@github-actions
Contributor

Benchmark Results

Legend: 🟢🟢 = 5%+ faster | 🟢 = 3-5% faster | 🟩 = 1-3% faster | ⚪ = unchanged | 🟧 = 1-3% slower | 🔴 = 3-5% slower | 🔴🔴 = 5%+ slower

sqlglot

| Query | main | PR | diff |
| --- | --- | --- | --- |
| tpch | 2.7ms | 2.7ms | 0.1% slower |
| short | 199us | 195us | 2.0% faster 🟩 |
| deep_arithmetic | 8.4ms | 8.5ms | 0.6% slower |
| large_in | 449.2ms | 446.0ms | 0.7% faster |
| values | 513.6ms | 511.4ms | 0.4% faster |
| many_joins | 11.4ms | 11.3ms | 0.6% faster |
| many_unions | 40.8ms | 41.3ms | 1.3% slower 🟧 |
| nested_subqueries | 1.1ms | 1.1ms | 2.1% slower 🟧 |
| many_columns | 13.0ms | 13.1ms | 0.0% |
| large_case | 38.5ms | 37.6ms | 2.5% faster 🟩 |
| complex_where | 28.2ms | 28.4ms | 0.9% slower |
| many_ctes | 17.0ms | 16.4ms | 3.1% faster 🟢 |
| many_windows | 20.9ms | 21.0ms | 0.7% slower |
| nested_functions | 691us | 701us | 1.4% slower 🟧 |
| large_strings | 5.6ms | 5.6ms | 0.5% slower |
| many_numbers | 109.4ms | 102.3ms | 6.5% faster 🟢🟢 |

sqlglot[c]

| Query | main | PR | diff |
| --- | --- | --- | --- |
| tpch | 657us | 665us | 1.1% slower 🟧 |
| short | 54us | 47us | 13.3% faster 🟢🟢 |
| deep_arithmetic | 2.2ms | 2.7ms | 26.2% slower 🔴🔴 |
| large_in | 114.8ms | 121.9ms | 6.2% slower 🔴🔴 |
| values | 126.8ms | 134.3ms | 5.9% slower 🔴🔴 |
| many_joins | 2.5ms | 2.5ms | 0.2% slower |
| many_unions | 8.5ms | 8.3ms | 2.2% faster 🟩 |
| nested_subqueries | 239us | 231us | 3.3% faster 🟢 |
| many_columns | 3.2ms | 3.0ms | 3.9% faster 🟢 |
| large_case | 10.1ms | 8.9ms | 11.6% faster 🟢🟢 |
| complex_where | 7.4ms | 6.7ms | 9.9% faster 🟢🟢 |
| many_ctes | 3.5ms | 3.5ms | 0.4% faster |
| many_windows | 4.9ms | 5.2ms | 5.0% slower 🔴🔴 |
| nested_functions | 161us | 148us | 7.8% faster 🟢🟢 |
| large_strings | 1.4ms | 1.3ms | 5.7% faster 🟢🟢 |
| many_numbers | 25.8ms | 28.1ms | 8.9% slower 🔴🔴 |

Comment /benchmark to re-run.

@tobymao tobymao merged commit 63f8dc6 into main Apr 23, 2026
8 checks passed
@tobymao tobymao deleted the perf/cache-tokenizer-core branch April 23, 2026 16:15