v0.8.0 — performance & hardening
A performance and hardening release. The headline is a benchmark-gated optimisation programme (#233) that makes short-string `transliterate` calls roughly 15–21× faster than Unidecode (up from ~7–9×) and beats Unidecode on its own benchmark, while shrinking the library's static and resident memory.
Highlights
- Faster per call: a `transliterate` call now crosses the Python→Rust boundary exactly once and returns already-ASCII input as the original `str` object (roughly 70 ns, with no allocation). Short strings are ~15–21× faster than Unidecode, and translit wins all four cells of Unidecode's own benchmark (#277, #281).
- Smaller footprint: the default BMP table is a page-table + interned-blob trie (~1 MB → ~58 KB), hanzi→pinyin a dense interned array (~600 KB → ~50 KB), Hangul a single packed blob (#237); context dictionaries are now zero-copy, roughly halving their resident memory (#238); replacement and slug scanning use Aho-Corasick automata, and emoji match through a code-point trie (#242).
- Security hardening: `is_safe_hostname` flags every mixed-script label (#254); security presets no longer synthesise path separators from confusables (#248); `rag_ingest` runs the confusables step (#258); the stateful slugifiers validate `lang` (#257).
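
The first two highlights can be pictured with a toy model. The sketch below is illustrative Python, not the library's Rust implementation: `build_page_table`, `transliterate_sketch`, the 256-entry page size, and the three-entry mapping are all assumptions made for this example. It shows an ASCII fast path that returns the input object itself, and a two-level table in which identical pages are stored only once.

```python
# Toy model only: the real tables are built in Rust and are not shown in these notes.
PAGE_SIZE = 256  # code points per page (an assumed size for this sketch)
BMP_END = 0x10000  # first code point beyond the Basic Multilingual Plane


def build_page_table(mapping: dict[int, str]) -> tuple[list[int], list[tuple[str, ...]]]:
    """Split the BMP into fixed-size pages and store each distinct page once."""
    pages: list[tuple[str, ...]] = []       # unique pages, in first-seen order
    seen: dict[tuple[str, ...], int] = {}   # page contents -> position in `pages`
    index: list[int] = []                   # one entry per page of the BMP
    for start in range(0, BMP_END, PAGE_SIZE):
        page = tuple(mapping.get(cp, "") for cp in range(start, start + PAGE_SIZE))
        if page not in seen:
            seen[page] = len(pages)
            pages.append(page)
        index.append(seen[page])
    return index, pages


def transliterate_sketch(text: str, index: list[int], pages: list[tuple[str, ...]]) -> str:
    """Hypothetical stand-in for `transliterate`, not the library function."""
    if text.isascii():
        return text  # fast path: hand back the very same object, no copy
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII passes through unchanged
        elif cp < BMP_END:
            # Two-step lookup: page number first, then offset within the page.
            out.append(pages[index[cp // PAGE_SIZE]][cp % PAGE_SIZE])
        # Code points beyond the BMP are simply dropped in this sketch.
    return "".join(out)


# Illustrative entries only; unmapped code points become "" here.
TOY_MAPPING = {0xE9: "e", 0xDF: "ss", 0x4E2D: "zhong"}
INDEX, PAGES = build_page_table(TOY_MAPPING)
```

With this toy mapping, the 256 BMP pages collapse to three distinct stored pages, which is the same deduplication idea behind the size reductions quoted above, at a much smaller scale. The identity check `transliterate_sketch(s, INDEX, PAGES) is s` holds for any all-ASCII `s`.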
Upgrade notes
- Minimum Python is now 3.10 (was 3.9). The extension targets the stable-ABI floor `abi3-py310` (#277); Python 3.9 wheels are no longer produced.
- `is_safe_hostname` now flags every mixed-script label as unsafe (#254), not only the Latin-paired high-risk combinations. Inspect the `mixed_script` / `scripts` fields for a more permissive policy; the check fails closed by design.
- Output may change for some inputs: the security-preset path-separator fix (#248), `rag_ingest` confusables canonicalisation (#258), stateful-slugifier `lang` validation (#257), and a few correctness edge cases (#249, #253, #255).
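
For callers who need a more permissive hostname policy, one possible shape is an explicit allow-list layered on top of the reported fields. The sketch below does not call the library: `LabelReport` is a hypothetical stand-in, and only the field names `mixed_script` and `scripts` come from these notes. Their types, the script-name strings, and whether the fields are reported per label or per hostname are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelReport:
    """Hypothetical report for one hostname label (shape assumed, not the library's)."""
    mixed_script: bool
    scripts: frozenset[str]


# Example application policy: script mixes that are routine in Japanese labels.
ALLOWED_MIXES = [frozenset({"Han", "Hiragana", "Katakana"})]


def is_acceptable(report: LabelReport) -> bool:
    """Accept single-script labels, plus mixes fully covered by an allowed set."""
    if not report.mixed_script:
        return True
    # Anything not explicitly allowed is still rejected, so the policy
    # keeps failing closed for unknown combinations.
    return any(report.scripts <= allowed for allowed in ALLOWED_MIXES)
```

Keeping the allow-list explicit means a new or unexpected script combination is rejected by default, in line with the fail-closed behaviour described above.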
See the full changelog for the complete list.