v0.6.1 — bug-fix, correctness & performance
A bug-fix and test-hardening release. No public API was removed and no new
public names were added. One fix changes key output for inputs containing
invisible characters — see Upgrade notes.
Upgrade notes (output-affecting fix)
`search_key` / `catalog_key` / `sort_key` now strip bidi overrides and
soft-hyphen / format characters (#93). Previously a value stored with an
invisible character (e.g. "password", "usertxt") produced a
different key from its clean equivalent, so dedup and lookup silently
missed. The new key is the correct one; if you persist these keys, regenerate
any that were computed over text that could contain invisible characters.
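As a rough way to find persisted rows that need regenerating, the sketch below flags source text that still contains invisible characters after NFKC. It uses only the standard library and treats every Unicode format character (category `Cf`) as "invisible", which is an assumption and may be broader or narrower than what the key functions actually strip. The names `strip_invisible` and `needs_regeneration` are hypothetical helpers, not part of the package.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """NFKC-normalize, then drop Unicode format (Cf) characters.

    Stand-in for the library's behavior: Cf covers the soft hyphen
    (U+00AD) and the bidi overrides (U+202D, U+202E), among others.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


def needs_regeneration(source_text: str) -> bool:
    """True if a key computed over this text may have changed in 0.6.1."""
    normalized = unicodedata.normalize("NFKC", source_text)
    return strip_invisible(source_text) != normalized


# Hypothetical stored values: one with a soft hyphen, one with a bidi override.
stored = ["pass\u00adword", "user\u202etxt", "password"]
to_rebuild = [value for value in stored if needs_regeneration(value)]
```

Only the flagged values need their keys recomputed with the library's own key functions; clean values produce the same key as before.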
Fixed
- #93 — key functions (`search_key` / `catalog_key` / `sort_key`) leaked bidi
  and soft-hyphen characters, so visually-identical inputs produced
  non-colliding keys. They now `strip_bidi` after NFKC, matching the other
  canonicalization presets.
- #82 — Greek reverse transliteration (`transliterate(text, target="el")`)
  left literal Latin letters in the output ("psychi"→"ψyχη"). The forward
  direction romanizes Υ/υ as `Y`/`y` (including the ου/αυ/ευ diphthongs), so the
  `el` reverse table now maps `Y`/`y` back to Greek; round-trips no longer leak
  Latin letters.
- #69 — `transliterate()` resolved conflicting kwargs differently for `str`
  vs `list` input (one path silently dropped `target`, the other `context`).
  Conflicts are now checked once, before the dispatch, so both raise identically:
  `context`+`target` and `context`+`tones` raise `ValueError`.
- #72 — `translit.unidecode()` now mirrors the Unidecode 1.3 signature
  `unidecode(string, errors="ignore", replace_str="?")`, mapping Unidecode's
  `errors` modes (`ignore`/`replace`/`preserve`/`strict`) onto the native error
  handling, instead of raising `TypeError` on those kwargs.
- #95 — Greek Extended polytonic capitals for omicron/upsilon/omega/rho
  were corrupted, emitting unrelated Latin letters (Ὅμηρος→Xmiros,
  Ὑγίεια→Pgieia). Corrected all 50 affected entries to the proper base
  romanization, consistent with the monotonic forms (Ὅμηρος→Omiros).
- #99.3 — a typo'd `form=`/`errors=` value now raises even for pure-ASCII
  input. Previously the ASCII fast-path returned before reaching Rust, so the
  bad enum silently no-opped on ASCII and only raised on the first non-ASCII
  string. Validation now runs before the fast-path in `normalize()` and
  `transliterate()`.
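The ordering behind #69 and #99.3 (validate first, then take the fast path or dispatch) can be sketched in pure Python. This is an illustration of the pattern only, not the library's code: `normalize_sketch`, `check_kwarg_conflicts`, `transliterate_sketch`, and `_convert` are hypothetical names, and raising `ValueError` for a bad `form` is an assumption (the notes above only confirm `ValueError` for the kwarg conflicts).

```python
import unicodedata

VALID_FORMS = frozenset({"NFC", "NFD", "NFKC", "NFKD"})


def normalize_sketch(text: str, form: str = "NFC") -> str:
    # Validate first, so a typo raises regardless of the input's content.
    if form not in VALID_FORMS:
        raise ValueError(f"unknown normalization form: {form!r}")
    # ASCII fast path: every normalization form leaves ASCII unchanged.
    if text.isascii():
        return text
    return unicodedata.normalize(form, text)


def check_kwarg_conflicts(*, context=None, target=None, tones=None) -> None:
    # One shared check, so str and list inputs cannot diverge.
    if context is not None and target is not None:
        raise ValueError("context and target cannot be combined")
    if context is not None and tones is not None:
        raise ValueError("context and tones cannot be combined")


def _convert(text: str) -> str:
    # Placeholder for the real per-string conversion.
    return text


def transliterate_sketch(text, *, context=None, target=None, tones=None):
    # Conflicts are checked once, before choosing the single or batch path.
    check_kwarg_conflicts(context=context, target=target, tones=tones)
    if isinstance(text, str):
        return _convert(text)
    return [_convert(item) for item in text]
```

Putting the checks ahead of both the fast path and the `str`/`list` dispatch means a bad argument fails on the first call, not on the first input that happens to reach the slow path.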
Performance
- #70 — the batch entry points (`transliterate`, `slugify`, `normalize`,
  `strip_accents` on `list[str]`) now release the GIL around their pure-Rust
  compute loop via `py.allow_threads`. Multi-threaded callers processing large
  batches now get real parallelism (~1.8× wall-clock with two threads) instead
  of serialising on the interpreter lock. Output is unchanged. Documented in the
  new "Concurrency (GIL)" section of `docs/performance.md`.
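One way a caller might take advantage of this is to split a large batch into one chunk per thread and hand each chunk to a batch entry point. The sketch below is a generic helper under that assumption: `parallel_batch` is a hypothetical name, and `upper_batch` is a pure-Python stand-in for a GIL-releasing batch function, so the demo itself shows the calling pattern but not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def parallel_batch(
    fn: Callable[[List[str]], List[str]],
    items: Sequence[str],
    workers: int = 2,
) -> List[str]:
    """Run a batch function over contiguous chunks, one chunk per thread."""
    if not items:
        return []
    size = -(-len(items) // workers)  # ceiling division
    chunks = [list(items[i:i + size]) for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in submission order.
        results = list(pool.map(fn, chunks))
    return [out for chunk in results for out in chunk]


def upper_batch(batch: List[str]) -> List[str]:
    # Stand-in for a batch entry point; holds the GIL, so no real speedup here.
    return [s.upper() for s in batch]


result = parallel_batch(upper_batch, ["a", "b", "c"], workers=2)
```

Passing whole chunks, not single strings, keeps each thread inside the batch path, which is where the GIL is released according to the note above.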
Documentation
- #94 — `strict_iso9` is no longer described as "ISO 9:1995". It emits ASCII
  digraphs (ж→zh, ч→ch, ш→sh), not the standard's diacritics (ž/č/š) — translit
  tables are ASCII-only by design. Docstrings, the data-file header, and the docs
  now describe it as a scholarly ASCII (ISO 9-style) transliteration and warn it
  is not ISO 9-conformant. No behavior change.
- #98 — `docs/user-guide/transliteration.md` no longer instructs users to
  `pip install translit-rs[arabic|hebrew|context]` (those empty extras were
  removed in 0.6.0); it now documents the `bootstrap_dicts.sh` / `TRANSLIT_DICT_DIR`
  path, matching the README and the runtime error message.
- #99.1 / #99.2 — fixed two false docstrings: `sort_key` no longer claims to
  preserve accents (it folds them via transliteration, coinciding with
  `search_key`), and `slugify` no longer documents a `pretranslate` kwarg it
  never had.
- #84 — corrected the README throughput table (Cyrillic ~106M chars/sec,
  slugify ~712K slugs/sec on commodity 4-vCPU hardware) and added a
  hardware/methodology footnote; added a matching variance note to
  `docs/performance.md`.
- #77 — fixed the `Text` fluent-builder docstring example (`normalize` is
  keyword-only: `.normalize(form="NFC")`), reconciled the language-profile count
  (README now agrees with the docs at 83), and documented the `context` kwarg in
  the `transliterate()` docstring.
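To make the distinction in #94 concrete, the sketch below contrasts ASCII digraphs with ISO 9 diacritics for the three letters named there. The two tables are tiny illustrative excerpts written for this example, not the package's data files, and `map_chars` is a hypothetical helper.

```python
# Illustrative three-letter excerpts only; not the library's tables.
ASCII_DIGRAPHS = {"ж": "zh", "ч": "ch", "ш": "sh"}
ISO9_DIACRITICS = {"ж": "ž", "ч": "č", "ш": "š"}


def map_chars(text: str, table: dict) -> str:
    """Replace each character found in the table; pass others through."""
    return "".join(table.get(ch, ch) for ch in text)


ascii_style = map_chars("жчш", ASCII_DIGRAPHS)   # ASCII-only output
iso9_style = map_chars("жчш", ISO9_DIACRITICS)   # uses Latin diacritics
```

The digraph output stays within ASCII, while the diacritic output does not, which is why an ASCII-only table cannot be ISO 9-conformant.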
Internal / tests
- #78 — added adversarial coverage for the raw-bytes decode path
  (`detect_encoding` / `decode_to_utf8`): deterministic hostile-byte cases in
  CI plus a Hypothesis `st.binary()` fuzz suite proving no-panic and
  invariant-preservation. Documented in `THREAT_MODEL.md` that the decode path
  has no input-size cap (caller's responsibility, per the 0.6.0 cap removal).
- #79 — added a single-vs-batch kwarg parity regression test across the full
  kwarg matrix and a multi-script corpus (the `tones` batch drop fixed in 0.6.0
  can no longer recur silently).
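The shape of a single-vs-batch parity check like the one in #79 can be sketched with the standard library alone. Everything here is illustrative: `fake_transliterate` and its `lower`/`strip` kwargs are hypothetical stand-ins for the real function and its kwarg matrix, and `check_parity` is not the project's test code.

```python
from itertools import product


def fake_transliterate(text, *, lower=False, strip=False):
    """Stand-in that accepts a str or a list of str, like a batch API."""
    def one(s):
        if strip:
            s = s.strip()
        if lower:
            s = s.lower()
        return s

    if isinstance(text, str):
        return one(text)
    return [one(s) for s in text]


KWARG_MATRIX = {"lower": [False, True], "strip": [False, True]}
CORPUS = [" Ὅμηρος ", "Привет", "ASCII only", ""]


def check_parity(fn, corpus, matrix):
    """Return every kwarg combination where single and batch calls differ."""
    names = list(matrix)
    mismatches = []
    for values in product(*(matrix[name] for name in names)):
        kwargs = dict(zip(names, values))
        single = [fn(s, **kwargs) for s in corpus]
        batch = fn(list(corpus), **kwargs)
        if single != batch:
            mismatches.append(kwargs)
    return mismatches


mismatches = check_parity(fake_transliterate, CORPUS, KWARG_MATRIX)
```

An empty `mismatches` list means the batch path honors every kwarg combination the single-string path does; a batch path that silently dropped a kwarg would show up as a listed combination.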