v0.8.0 — performance & hardening

@raeq raeq released this 10 Jun 23:44
· 125 commits to main since this release
1776f89

A performance and hardening release. The headline is a benchmark-gated optimisation programme (#233) that makes short-string transliterate calls roughly 15–21× faster than Unidecode (up from ~7–9×) and beats Unidecode on its own benchmark, while also shrinking the library's static and resident memory.

Highlights

  • Faster per call: a transliterate call now crosses the Python→Rust boundary exactly once and returns already-ASCII input as the original str object — roughly 70 ns with no allocation. Short strings are ~15–21× faster than Unidecode, and translit wins all four cells of Unidecode's own benchmark (#277, #281).
  • Smaller footprint: the default BMP table is now a page-table + interned-blob trie (~1 MB → ~58 KB), hanzi→pinyin is a dense interned array (~600 KB → ~50 KB), and Hangul is a single packed blob (#237). Context dictionaries are now zero-copy, roughly halving their resident memory (#238). Replacement and slug scanning use Aho-Corasick automata, and emoji match through a code-point trie (#242).
  • Security hardening: is_safe_hostname flags every mixed-script label (#254); security presets no longer synthesise path separators from confusables (#248); rag_ingest runs the confusables step (#258); the stateful slugifiers validate lang (#257).
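
To make the "page-table + interned-blob" idea from the footprint bullet concrete, here is a minimal Python sketch. It is an illustration only and not the library's Rust implementation: the names build_table and lookup, the 256-code-point page size, and the tuple layout are all assumptions chosen for readability. The point it shows is where the savings can come from: identical pages (mostly empty ones) are stored once, and each distinct replacement string is stored once in a shared blob.

```python
# Toy two-level table for the BMP. All names and sizes here are hypothetical;
# the real table's layout is not described in these notes.
PAGE_BITS = 8
PAGE_SIZE = 1 << PAGE_BITS  # 256 code points per page


def build_table(mapping):
    """Build (page_index, pages, blob) from a {code_point: replacement} dict."""
    blob = ""    # every distinct replacement string, concatenated once
    spans = {}   # replacement -> (start, length) inside blob (interning)

    def intern(text):
        nonlocal blob
        if text not in spans:
            spans[text] = (len(blob), len(text))
            blob += text
        return spans[text]

    pages = []       # distinct pages only
    page_ids = {}    # page contents -> position in pages
    page_index = []  # one entry per 256-code-point block of the BMP
    for hi in range(0x10000 >> PAGE_BITS):
        page = tuple(
            intern(mapping[cp]) if cp in mapping else None
            for cp in range(hi << PAGE_BITS, (hi + 1) << PAGE_BITS)
        )
        if page not in page_ids:  # identical pages share one slot
            page_ids[page] = len(pages)
            pages.append(page)
        page_index.append(page_ids[page])
    return page_index, pages, blob


def lookup(table, ch):
    """Return the replacement for a single character, or None if unmapped."""
    page_index, pages, blob = table
    cp = ord(ch)
    if cp > 0xFFFF:  # this sketch covers the BMP only
        return None
    span = pages[page_index[cp >> PAGE_BITS]][cp & (PAGE_SIZE - 1)]
    if span is None:
        return None
    start, length = span
    return blob[start:start + length]


table = build_table({ord("é"): "e", ord("è"): "e", ord("ß"): "ss", ord("Ж"): "Zh"})
print(lookup(table, "ß"), lookup(table, "Ж"), lookup(table, "x"))
```

With only four mapped characters the table holds three distinct pages (two populated, one shared empty page) instead of 256, and "e" appears in the blob once even though two code points map to it.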

Upgrade notes

  • Minimum Python is now 3.10 (was 3.9). The extension targets the stable-ABI floor abi3-py310 (#277); Python 3.9 wheels are no longer produced.
  • is_safe_hostname now flags every mixed-script label as unsafe (#254), not only the Latin-paired high-risk combinations. The check fails closed by design; if you need a more permissive policy, inspect the mixed_script / scripts fields and apply your own rule.
  • Output may change for some inputs: the security-preset path-separator fix (#248), rag_ingest confusables canonicalisation (#258), stateful-slugifier lang validation (#257), and a few correctness edge cases (#249, #253, #255).

See the full changelog for the complete list.