Release v0.6.1 — bug-fix, correctness & performance · raeq/translit

A bug-fix and test-hardening release. No public API was removed and no new
public names were added. One fix changes the keys produced for inputs containing
invisible characters; see Upgrade notes.

Upgrade notes (output-affecting fix)

search_key / catalog_key / sort_key now strip bidi overrides and
soft-hyphen / format characters (#93). Previously a value stored with an
invisible character (e.g. "password", "user‮txt") produced a
different key from its clean equivalent, so deduplication and lookups silently
missed matches. The new key is the correct one; if you persist these keys, regenerate
any that were computed over text that could contain invisible characters.
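
To make the failure mode concrete, here is a minimal standard-library sketch. It is not translit's implementation: illustrative_key and needs_regeneration are hypothetical helpers, and treating "invisible character" as Unicode general category Cf (which covers the soft hyphen U+00AD and the bidi override U+202E) is an assumption about roughly what the key functions now strip.

```python
import unicodedata


def illustrative_key(text: str) -> str:
    """Hypothetical stand-in for a key function: NFKC, drop Cf, casefold."""
    normalized = unicodedata.normalize("NFKC", text)
    # Category Cf ("format") includes soft hyphens and bidi controls.
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.casefold()


def needs_regeneration(text: str) -> bool:
    """True if the stored text contains a format (Cf) character.

    Assumption: keys persisted for such values are the ones worth recomputing.
    """
    return any(unicodedata.category(ch) == "Cf" for ch in text)


dirty = "pass\u00adword"    # soft hyphen hidden inside the word
spoofed = "user\u202etxt"   # right-to-left override hidden inside the word

assert illustrative_key(dirty) == illustrative_key("password")
assert illustrative_key(spoofed) == illustrative_key("usertxt")
assert needs_regeneration(dirty) and not needs_regeneration("password")
```

A filter like needs_regeneration could be run over the persisted source values to find the rows whose keys should be recomputed with the real search_key / catalog_key / sort_key after upgrading.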

Fixed

#93 — key functions (search_key/catalog_key/sort_key) leaked bidi
and soft-hyphen characters, so visually identical inputs produced
different keys. They now apply strip_bidi after NFKC normalization, matching the other
canonicalization presets.
#82 — Greek reverse transliteration (transliterate(text, target="el"))
left literal Latin letters in the output ("psychi" → "ψyχη"). The forward
direction romanizes Υ/υ as Y/y (including the ου/αυ/ευ diphthongs), so the
el reverse table now maps Y/y back to Greek; round-trips no longer leak
Latin letters.
#69 — transliterate() resolved conflicting kwargs differently for str
vs list input (one path silently dropped target, the other context).
Conflicts are now checked once, before the dispatch, so both raise identically:
context+target and context+tones raise ValueError.
#72 — translit.unidecode() now mirrors the Unidecode 1.3 signature
unidecode(string, errors="ignore", replace_str="?"), mapping Unidecode's
errors modes (ignore/replace/preserve/strict) onto the native error
handling, instead of raising TypeError on those kwargs.
#95 — Greek Extended polytonic capitals for omicron/upsilon/omega/rho
were corrupted, emitting unrelated Latin letters (Ὅμηρος → Xmiros,
Ὑγίεια → Pgieia). Corrected all 50 affected entries to the proper base
romanization, consistent with the monotonic forms (Ὅμηρος → Omiros).
#99.3 — a typo'd form=/errors= value now raises even for pure-ASCII
input. Previously the ASCII fast-path returned before reaching Rust, so an
invalid value was silently ignored for ASCII input and only raised on the first
non-ASCII string. Validation now runs before the fast-path in normalize() and
transliterate().
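
The ordering behind #69 and #99.3 can be sketched in pure Python. This is an illustration of the pattern only, not the library's code: normalize_sketch, check_conflicts, transliterate_sketch, the None defaults, and the error messages are all hypothetical.

```python
import unicodedata

VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_sketch(text: str, *, form: str = "NFC") -> str:
    # Validate first, so a misspelled form raises for every input.
    if form not in VALID_FORMS:
        raise ValueError(f"unknown form: {form!r}")
    # Fast path: ASCII text is unchanged by all four normalization forms.
    if text.isascii():
        return text
    return unicodedata.normalize(form, text)


def check_conflicts(*, context=None, target=None, tones=None) -> None:
    # Assumed defaults of None; the real signature may differ.
    if context is not None and target is not None:
        raise ValueError("context and target cannot be combined")
    if context is not None and tones is not None:
        raise ValueError("context and tones cannot be combined")


def transliterate_sketch(text, **kwargs):
    # One check before the str/list dispatch, so both paths behave the same.
    check_conflicts(**kwargs)
    if isinstance(text, str):
        return text  # placeholder for the single-string path
    return [item for item in text]  # placeholder for the batch path
```

With validation placed after the fast path instead, normalize_sketch("abc", form="NFX") would return "abc" without complaint, which is the behavior this release removes.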

Performance

#70 — the batch entry points (transliterate, slugify, normalize,
strip_accents on list[str]) now release the GIL around their pure-Rust
compute loop via py.allow_threads. Multi-threaded callers processing large
batches now get real parallelism (~1.8× wall-clock speedup with two threads)
instead of serialising on the interpreter lock. Output is unchanged. Documented in the
new "Concurrency (GIL)" section of docs/performance.md.
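
A caller might take advantage of this roughly as follows. This is a sketch under assumptions: run_batches_in_threads is a hypothetical helper, and upper_batch is a pure-Python stand-in so the example runs without translit installed. In real use you would pass one of the batch entry points named above (for example translit.transliterate with a list[str] argument); the stand-in itself does not release the GIL and will not show a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batches_in_threads(batch_fn, items, workers=2):
    """Split items into one chunk per worker and run batch_fn on each chunk.

    batch_fn must accept a list[str] and return a list of the same length.
    """
    if not items:
        return []
    chunk_size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so output order is kept.
        parts = list(pool.map(batch_fn, chunks))
    return [out for part in parts for out in part]


def upper_batch(batch):
    # Stand-in for a GIL-releasing batch call.
    return [s.upper() for s in batch]


result = run_batches_in_threads(upper_batch, ["a", "b", "c"], workers=2)
assert result == ["A", "B", "C"]
```

Chunks should be large enough that per-call overhead is small next to the compute loop; the ~1.8× figure above is quoted for large batches.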

Documentation

#94 — strict_iso9 is no longer described as "ISO 9:1995". It emits ASCII
digraphs (ж→zh, ч→ch, ш→sh), not the standard's diacritics (ž/č/š) — translit
tables are ASCII-only by design. Docstrings, the data-file header, and the docs
now describe it as a scholarly ASCII (ISO 9-style) transliteration and warn it
is not ISO 9-conformant. No behavior change.
#98 — docs/user-guide/transliteration.md no longer instructs users to
pip install translit-rs[arabic|hebrew|context] (those empty extras were
removed in 0.6.0); it now documents the bootstrap_dicts.sh / TRANSLIT_DICT_DIR
path, matching the README and the runtime error message.
#99.1 / #99.2 — fixed two false docstrings: sort_key no longer claims to
preserve accents (it folds them via transliteration, coinciding with
search_key), and slugify no longer documents a pretranslate kwarg it
never had.
#84 — corrected the README throughput table (Cyrillic ~106M chars/sec,
slugify ~712K slugs/sec on commodity 4-vCPU hardware) and added a
hardware/methodology footnote; added a matching variance note to
docs/performance.md.
#77 — fixed the Text fluent-builder docstring example (normalize is
keyword-only: .normalize(form="NFC")), reconciled the language-profile count
(README now agrees with the docs at 83), and documented the context kwarg in
the transliterate() docstring.

Internal / tests

#78 — added adversarial coverage for the raw-bytes decode path
(detect_encoding / decode_to_utf8): deterministic hostile-byte cases in
CI plus a Hypothesis st.binary() fuzz suite proving no-panic and
invariant-preservation. Documented in THREAT_MODEL.md that the decode path
has no input-size cap (caller's responsibility, per the 0.6.0 cap removal).
#79 — added a single-vs-batch kwarg parity regression test across the full
kwarg matrix and a multi-script corpus (the tones batch drop fixed in 0.6.0
can no longer recur silently).
