
Releases: raeq/disarm

v0.9.0 — first disarm release

11 Jun 14:26
25d638d


The first release under the disarm name — the continuation of translit-rs (last released as 0.8.1). See #264 for the rename rationale.

disarm unifies the distribution and import names: pip install disarm then import disarm.

Breaking changes (migrating from translit-rs)

  • Install/import: pip install translit-rs → pip install disarm; import translit → import disarm.
  • Exception: the base exception TranslitError → DisarmError (the subclasses InvalidArgumentError / ResourceLimitError / UnsupportedError keep their names). DisarmError is still a ValueError subclass, so except ValueError keeps working.
  • Env var: context-dictionary path TRANSLIT_DICT_DIR → DISARM_DICT_DIR.
  • Native module: translit._translit → disarm._disarm; console script: translit → disarm.

The public transform API is otherwise unchanged: transliterate(), normalize(), slugify(), the security/pipeline helpers all keep their names and behaviour.
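
For code that has to run against either package during a migration window, one possible shim is sketched below. The helper names import_first_available and dict_dir_from_env are hypothetical and not part of either package; only the module names and environment-variable names come from the list above.

```python
import importlib
import os


def import_first_available(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {module_names!r} is installed")


def dict_dir_from_env(environ=os.environ):
    """Prefer the new variable name, fall back to the pre-rename one."""
    return environ.get("DISARM_DICT_DIR") or environ.get("TRANSLIT_DICT_DIR")


# Hypothetical usage while both names may be present in an environment:
# lib = import_first_available("disarm", "translit")
# base_error = getattr(lib, "DisarmError", None) or getattr(lib, "TranslitError")
```

Because the base exception remains a ValueError subclass under both names, handlers written as except ValueError need no shim at all.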

Install: pip install disarm==0.9.0

Full changelog: https://github.com/raeq/disarm/blob/main/CHANGELOG.md

v0.8.1 — honest benchmark numbers (final translit-rs release)

11 Jun 13:14
778b969


The final translit-rs release and the close of the 0.8 performance-hardening arc.

The project continues as disarm from 0.9.0 (#264). 0.8.1 ships honest, production-true benchmark numbers before the rename.

Changed

  • Fresh-string benchmark regime (#277, #302): every timed call now receives a newly constructed str, the way production traffic always does. The prior cached-object measurement let CPython's per-object AsUTF8 cache hide ~105–137 ns/call of UTF-8 encode cost that only translit pays (pure-Python comparators never call AsUTF8), which flattered translit's numbers. JSON records now carry regime: fresh-string/v2; pre-flip history uses the cached v1 regime, and results must not be compared across regimes.
  • README short-string figures updated to the measured fresh-regime values: ~17× vs Unidecode (Latin), ~14× (mixed scripts), ~13× (Cyrillic/Greek); ~65 ns ASCII passthrough; the four-cell Unidecode-own sweep still holds (~1.3× on Unidecode's strongest case to ~25×).
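
The difference between the two regimes can be illustrated with a minimal timing sketch. This is not the project's benchmark harness; fresh_strings and bench_ns_per_call are hypothetical helpers, and the approach assumes that decoding bytes yields a new str object per call for multi-character text.

```python
import time


def fresh_strings(template, n):
    """Build n equal but distinct str objects, so no per-object cache is reused."""
    raw = template.encode("utf-8")
    # bytes.decode creates a new str object on every call for multi-character text.
    return [raw.decode("utf-8") for _ in range(n)]


def bench_ns_per_call(func, inputs):
    """Average wall-clock nanoseconds per call of func over the given inputs."""
    start = time.perf_counter_ns()
    for text in inputs:
        func(text)
    return (time.perf_counter_ns() - start) / len(inputs)


# Cached regime: the same object is passed every time.
cached = ["Crème brûlée"] * 1000
# Fresh regime: each call sees an object it has never seen before.
fresh = fresh_strings("Crème brûlée", 1000)
```

Passing cached and fresh to bench_ns_per_call with the same function shows whether per-object caching is hiding any cost in the function under test.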

Install: pip install translit-rs==0.8.1

Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md

v0.8.0 — performance & hardening

10 Jun 23:44
1776f89


A performance and hardening release. The headline is a benchmark-gated optimisation programme (#233) that makes short-string transliterate roughly 15–21× faster than Unidecode (up from ~7–9×) and beats Unidecode on its own benchmark, while shrinking the library's static and resident memory.

Highlights

  • Faster per call: a transliterate call now crosses the Python→Rust boundary exactly once and returns already-ASCII input as the original str object — roughly 70 ns with no allocation. Short strings are ~15–21× faster than Unidecode, and translit wins all four cells of Unidecode's own benchmark (#277, #281).
  • Smaller footprint: the default BMP table is a page-table + interned-blob trie (~1 MB → ~58 KB), hanzi→pinyin a dense interned array (~600 KB → ~50 KB), Hangul a single packed blob (#237); context dictionaries are now zero-copy, roughly halving their resident memory (#238); replacement and slug scanning use Aho-Corasick automata and emoji match through a code-point trie (#242).
  • Security hardening: is_safe_hostname flags every mixed-script label (#254); security presets no longer synthesise path separators from confusables (#248); rag_ingest runs the confusables step (#258); the stateful slugifiers validate lang (#257).

Upgrade notes

  • Minimum Python is now 3.10 (was 3.9). The extension targets the stable-ABI floor abi3-py310 (#277); Python 3.9 wheels are no longer produced.
  • is_safe_hostname now flags every mixed-script label as unsafe (#254), not only the Latin-paired high-risk combinations. Inspect the mixed_script / scripts fields for a more permissive policy; the check fails closed by design.
  • Output may change for some inputs: the security-preset path-separator fix (#248), rag_ingest confusables canonicalisation (#258), stateful-slugifier lang validation (#257), and a few correctness edge cases (#249, #253, #255).
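
To make the "every mixed-script label" rule concrete, here is a rough stdlib-only approximation. It is not the library's algorithm: it guesses a script from the first word of each character's Unicode name, which is a crude heuristic, and the function names are hypothetical.

```python
import unicodedata


def label_scripts(label):
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_labels(hostname):
    """Return the dot-separated labels that contain letters from more than one script."""
    return [label for label in hostname.split(".") if len(label_scripts(label)) > 1]


# "\u0430" is CYRILLIC SMALL LETTER A, which looks like Latin "a".
print(mixed_script_labels("\u0430pple.com"))
```

A fail-closed policy treats any non-empty result as unsafe; a more permissive policy would inspect which scripts were mixed before deciding.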

See the full changelog for the complete list.

0.7.0

09 Jun 23:15
7a14394


A feature and architecture release. Headlines: a unified, catchable exception
hierarchy; terminal column-width measurement (terminal_width /
grapheme_width); native errors="strict" transliteration; LLM/RAG
guardrail pipeline presets; and a substantial push of validation and
configuration logic down into the Rust core, so the upcoming multi-language
bindings inherit one behaviour instead of reimplementing it. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.

Upgrade notes

  • Exceptions now form a hierarchy. Every library error subclasses
    TranslitError, with InvalidArgumentError, ResourceLimitError, and
    UnsupportedError beneath it. TranslitError remains a ValueError
    subclass, so existing except ValueError keeps working. Several error
    message strings were enriched/standardised (#186, #187) — code matching
    exact message text may need updating; code matching exception types is
    unaffected.
  • lang= is validated even for ASCII input (#197). A binding-side ASCII
    fast path previously skipped language validation, so
    transliterate("abc", lang="zz") silently returned the input; it now raises
    InvalidArgumentError, matching how non-ASCII input always behaved.
  • slugify_filename / Slugify(safe_chars=…) output corrected (see Fixed):
    slugify_filename("My Report.pdf") now returns "My_Report.pdf", not
    "My.Report_pdf". Output for inputs that use safe_chars may change.
  • New modes: errors="strict" for transliterate (#184) and
    decode_to_utf8(strict=True) (#189).
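
The shape of the hierarchy, and why existing except ValueError handlers keep working, can be sketched with stand-in classes. These are local definitions that mirror the names above for illustration only; they are not imported from the library, and check_max_length is a hypothetical validator.

```python
class TranslitError(ValueError):
    """Stand-in for the library's base error (a ValueError subclass)."""


class InvalidArgumentError(TranslitError):
    """Stand-in: a caller passed a bad argument."""


class ResourceLimitError(TranslitError):
    """Stand-in: a configured limit was exceeded."""


class UnsupportedError(TranslitError):
    """Stand-in: the requested operation is not supported."""


def check_max_length(max_length):
    """Hypothetical validator that raises the categorised error."""
    if max_length < 0:
        raise InvalidArgumentError(f"max_length must be >= 0, got {max_length}")
    return max_length


def classify(max_length):
    """Legacy-style handler: catches ValueError, still sees the specific type."""
    try:
        check_max_length(max_length)
    except ValueError as exc:
        return type(exc).__name__
    return "ok"
```

New code can catch the narrower categories, while older call sites that only know about ValueError continue to work unchanged.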

Added

  • terminal_width / grapheme_width (#224): terminal column width per
    grapheme cluster (UAX #11 East Asian Width). Wide/fullwidth and
    emoji-presented clusters are 2 columns; combining marks, controls, and
    zero-width characters are 0. Ambiguous characters are 1 by default, or 2 with
    ambiguous_wide=True. Width data is generated at build time from the pinned
    UCD (no runtime data, no unsafe). Measures cells, not pixels; tabs are not
    expanded.
  • errors="strict" + find_untranslatable (#184): strict transliteration
    raises on the first untranslatable character (reporting it and its byte
    offset); find_untranslatable returns all of them without raising.
  • Guardrail pipeline presets (#139): TextPipeline gains strip_bidi and
    strip_zalgo steps and the llm_guardrail / rag_ingest named profiles for
    LLM/RAG input sanitisation.
  • get_pipeline / list_profiles (#229): the named policy-profile registry
    now lives in the Rust core; the Python helpers are thin wrappers over it.
  • decode_to_utf8(strict=True) (#189): raise on lossy/replacement decoding
    instead of silently substituting U+FFFD.
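
As a rough illustration of the width rules listed for terminal_width, the sketch below applies them per code point using Python's unicodedata. It is a simplification under stated assumptions: the library works per grapheme cluster and handles emoji presentation, which this hypothetical approx_terminal_width does not.

```python
import unicodedata


def approx_terminal_width(text, ambiguous_wide=False):
    """Approximate column width per code point (no grapheme clustering)."""
    total = 0
    for ch in text:
        # Combining marks, controls, and format (zero-width) characters take no cells.
        if unicodedata.category(ch) in ("Mn", "Me", "Cc", "Cf"):
            continue
        width_class = unicodedata.east_asian_width(ch)
        if width_class in ("W", "F"):
            total += 2  # wide / fullwidth
        elif width_class == "A":
            total += 2 if ambiguous_wide else 1  # ambiguous
        else:
            total += 1
    return total


print(approx_terminal_width("日本"))      # two wide characters
print(approx_terminal_width("e\u0301"))  # base letter plus a combining acute
```

As in the notes above, this measures cells rather than pixels, and a tab counts as a control character rather than being expanded.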

Changed

  • Unified exception hierarchy (#183): the Python error surface is a
    TranslitError base with categorised subclasses; sites that previously raised
    bare ValueError are unified (foundation laid in 0.6.3 via #181).
  • Validation moved into the Rust core (#185, #217, #229, #230, #231): enum
    validation, the transliterate() argument-conflict matrix, non-negative
    max_length / max_graphemes checks, safe_chars, and min_confidence
    range-checking now live in the core, so other bindings enforce the identical
    contract without reimplementing it. The Python layer keeps only type guards.
  • Actionable error messages (#186, #187): weak messages now name the
    offending value, list valid options, and suggest a "did you mean…?" where
    applicable; message style is standardised across the surface.
  • Error cause chains (#188): wrapped errors surface the underlying cause via
    __cause__ rather than flattening it into the message.
  • TextPipeline step ordering (#174) is derived from a single source of
    truth, removing drift between configuration and execution order.
  • All-ASCII preset fast path (#198): presets skip the NFKC pass for pure-ASCII
    input (behaviour-preserving).
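
The cause-chain behaviour corresponds to Python's raise ... from ... idiom. A minimal sketch, using a hypothetical ToyDecodeError rather than any library class:

```python
class ToyDecodeError(ValueError):
    """Hypothetical wrapper error used only for this illustration."""


def decode_strict(raw):
    """Decode UTF-8 bytes, wrapping failures while preserving the original cause."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # "from exc" stores the original exception on __cause__ instead of
        # flattening its text into the new message.
        raise ToyDecodeError("input is not valid UTF-8") from exc


try:
    decode_strict(b"\xff")
except ToyDecodeError as err:
    print(type(err.__cause__).__name__)
```

Callers that need the low-level detail (for example the failing byte offset) can read it from __cause__, while the top-level message stays short.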

Fixed

  • slugify_filename / Slugify(safe_chars=…) preserved safe characters at
    the wrong positions — slugify_filename("My Report.pdf") returned
    "My.Report_pdf" instead of the awesome-slugify-correct "My_Report.pdf".
    safe_chars are now handled natively in the Rust core: kept verbatim and
    treated as word characters so they hold their position (#156, #230). The prior
    test only covered a dot-free input, so the bug was uncaught; regression tests
    now cover filenames with extensions, multiple dots, and UniqueSlugify +
    max_length.
  • slugify(default=…) is now sanitised through the same slug pipeline (so a
    caller-supplied fallback cannot smuggle path-traversal or URL metacharacters
    into output documented as URL-safe), threads through the stateful Slugifier /
    UniqueSlugifier forms, and a negative max_length now raises a catchable
    InvalidArgumentError on both the scalar and batch paths instead of an
    uncatchable OverflowError (#193, #169).
  • Low-severity hardening bundle (#200): eight small robustness fixes
    (bounds, overflow, and edge-case handling) gathered into one pass.
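
The safe_chars fix amounts to treating safe characters as word characters so they stay where they are. A toy regex version of that idea follows; toy_slugify is a hypothetical stand-in and does none of the library's transliteration or normalisation.

```python
import re


def toy_slugify(text, separator="_", safe_chars=""):
    """Replace runs of non-word characters with separator, keeping safe_chars verbatim."""
    # Safe characters join the "word" class, so they are never treated as boundaries.
    pattern = rf"[^\w{re.escape(safe_chars)}]+"
    return re.sub(pattern, separator, text).strip(separator)


print(toy_slugify("My Report.pdf", safe_chars="."))
print(toy_slugify("My Report.pdf"))
```

With "." marked safe, the dot keeps its position between the stem and the extension; without it, the dot is just another boundary and collapses into the separator.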

Security

  • The RustSec advisory audit (cargo-audit) now blocks merge via the
    required "Rust checks passed" gate on every PR — an advisory can land on a
    dependency without any code change here (#195).

Removed

  • Docker image build/publish and its Trivy CVE scan (#138). translit is a
    pip install-first library; previously published images remain as historical
    artifacts, but no new ones are produced. Install the CLI via
    pip install translit-rs.

Documentation

  • Executable cookbook (#154, #91, #140, #156, #172): a Sybil doc-test harness
    with a CI gate, unidecode→translit migration recipes, an "LLM pipelines" page,
    a tokenizer-preprocessing page, and an anti-rot lint that turned 307 decorative
    # => claims into checked assertions.
  • normalize-first canonicalisation recipe (#174) and a formal-verification
    assurance taxonomy (#223: proof-by-exhaustion / structural / property-tested,
    tagging each I1–I7 invariant), plus grapheme-integrity property tests (#174).
  • The project adopted the Developer Certificate of Origin (#165); all commits
    are signed off. The custom-emoji-provider 9-codepoint window cap is now
    documented (#199).

v0.6.3

08 Jun 13:51
9bb192e


A correctness, maintenance, and architecture-foundation release. No output-affecting changes — behaviour-preserving throughout; the one new public behaviour (slugify(default=...)) is opt-in.


Highlights

  • Error-model foundation (#181): a pure-Rust Error enum + stable code() + a single From<Error> for PyErr boundary — decouples the core from PyO3 and lays the groundwork for the multi-language bindings roadmap.
  • slugify(text, default="…") — opt-in fallback for inputs that would slug to "" (#97).
  • Fixed: PRESETS["strip_obfuscation"] order (#141), lock-poison Python UserWarning (#117), docs/api/exceptions.md (#182).
  • Dependencies (migrated + verified behaviour-preserving): phf 0.13, criterion 0.8, chardetng 1.0 (#146 / #153 / #164).
  • Maintenance: __init__.py split (#73), build.rs language auto-discovery (#74), stub/binary drift-check (#76), integration-test split (#75), the "Conversations resolved" merge gate (#55), and a documented dependency-upgrade methodology.

Pre-release verified: full Tier-1 CI + Tier-3 exhaustive (all Hangul/BMP/CJK/Indic) + formal invariants I1–I7.

Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md

Correction: this release originally listed a Trivy image-scan fix (#138). That fix did not work, and the Docker build/publish pipeline has since been removed entirely.

pip install -U translit-rs

v0.6.2 — correctness, security, performance & maintenance

07 Jun 22:26
e035402


A correctness, security, performance and maintenance release triaged from a
post-0.6.1 issue sweep (#101–#132). No public API removed; one small new public
behaviour (slugify(save_order=True) now functions). Two output-affecting
fixes; see Upgrade notes.

Upgrade notes (output-affecting)

  • slugify(save_order=True) was an accepted no-op; it now strips only
    leading/trailing stopwords (preserving interior word order), matching
    python-slugify (#118). If you passed save_order=True, slug output changes.
  • decode_to_utf8 default min_confidence 0.5 → 0.95 (#103). The old
    default was inert (the detector only reports 0.50/0.95, and 0.50 < 0.50
    is false), so it never rejected. It now requires high confidence by default;
    pass min_confidence=0.0 to accept any guess. (No practical change today —
    the detector currently always reports 0.95.)
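
The "inert default" argument is plain arithmetic and can be checked in a few lines. The sketch assumes, as stated above, that the detector reports only 0.50 or 0.95 and that a guess is rejected when its confidence is strictly below the threshold; the function names are hypothetical.

```python
# The only confidence values the detector is said to report.
REPORTED_CONFIDENCES = (0.50, 0.95)


def is_rejected(confidence, min_confidence):
    """A guess is rejected only when it falls strictly below the threshold."""
    return confidence < min_confidence


def rejected_levels(min_confidence):
    """Which reported confidence levels a given threshold would reject."""
    return [c for c in REPORTED_CONFIDENCES if is_rejected(c, min_confidence)]


print(rejected_levels(0.5))   # old default: nothing is ever rejected
print(rejected_levels(0.95))  # new default: the low-confidence level is rejected
```

With a threshold of 0.5 neither reported level is strictly lower, so the gate never fires; raising it to 0.95 is the smallest change that makes the low level rejectable.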

Fixed

  • #102: UniqueSlugify no longer panics across the FFI boundary on a
    multibyte separator + small max_length (byte slice landed mid-codepoint;
    now uses floor_char_boundary).
  • #101: context bigram disambiguation tier was unreachable (it reset on
    every inter-word space); it now resets only on hard boundaries, so the tier
    fires in normal prose.
  • #104: set_emoji_provider now obeys seal_registrations() (the provider
    swap previously defeated the seal).
  • #103: decode_to_utf8 default confidence now actually gates (see notes).
  • #107: a corrupt context dictionary now reports a distinct "corrupt" error
    instead of the misleading "not found" remedy (DictState enum).
  • #121: PRESETS["sanitize_user_input"] now reflects the real pipeline
    order (strip invisibles before zalgo); Python registry and Rust doc aligned.
  • #129: Text.transliterate() stub now declares the tones/context
    parameters the implementation accepts.
  • #131: Slugify(uids=...) emits a correct wrong-class warning rather than
    a spurious deprecation warning.
  • #122: disambiguated the _compat should_warn nested ternary.
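
The #102 failure mode (a byte slice landing in the middle of a code point) and the floor-to-boundary remedy can be shown on UTF-8 bytes in Python. This is an illustrative re-implementation of the idea, not the Rust code; truncate_utf8 is a hypothetical helper.

```python
def floor_char_boundary(data, index):
    """Largest position <= index that does not split a UTF-8 encoded character."""
    if index >= len(data):
        return len(data)
    index = max(index, 0)
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def truncate_utf8(text, max_bytes):
    """Truncate text to at most max_bytes of UTF-8 without cutting a character."""
    data = text.encode("utf-8")
    return data[:floor_char_boundary(data, max_bytes)].decode("utf-8")


print(truncate_utf8("日本", 4))  # each character is 3 bytes, so only one fits
```

A naive data[:4].decode("utf-8") on the same input raises UnicodeDecodeError, which is the Python analogue of the panic described above.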

Security

  • #105 — added a cargo audit (RustSec advisory) CI job and a cargo
    Dependabot ecosystem.
  • #132 — added a Trivy CVE scan of the published image to the release
    workflow (SARIF → Security tab, fails on fixable HIGH/CRITICAL) + .trivyignore.
  • #106 — Rust diagnostics now route through Python warnings instead of
    bare eprintln!, so applications can capture/suppress them.

Performance (output-preserving)

  • #108 codepoint-range diacritic checks in tokenize(); #109 mem::take
    per token boundary; #110 single ch.nfkc() pass on the NFKC fallback;
    #111 lowered MAX_CAPACITY_HINT 256 MiB → 8 MiB; #112/#113 emoji
    matching uses stack buffers + a fixed sliding window (no per-char Vec/String);
    #114 slugify uses Cow (no eager to_owned); #115 context tokenize()
    returns borrowed (Cow) slices of the input — zero per-token allocation
    (Rust API: the crate-internal context::Token.text changed from String
    to Cow<'_, str>; no effect on the Python API); #116 clamped the
    ContextDict capacity hint.

Maintenance

  • #118 implemented slugify(save_order=True); #119 SlugConfig::from_pyargs
    dedupes the four slugify PyO3 entrypoints; #120 _build_slug_kwargs helper;
    #123 seal-enforcement docs on each tables:: mutator; #124
    infallibility comments; #125 typed _CallableModule.__call__ kwargs;
    #126 corrected recover_lock doc; #127 documented the lazy-import
    workaround; #128 renamed _mutation_generation → _registration_generation;
    #130 annotated the defence-in-depth conflict check.

v0.6.1 — bug-fix, correctness & performance

07 Jun 17:40
97743a0


A bug-fix and test-hardening release. No public API was removed and no new
public names were added. One fix changes key output for inputs containing
invisible characters; see Upgrade notes.

Upgrade notes (output-affecting fix)

  • search_key / catalog_key / sort_key now strip bidi overrides and
    soft-hyphen / format characters (#93). Previously a value stored with an
    invisible character (e.g. "pass\u00adword" with a soft hyphen, or
    "user\u202etxt" with a right-to-left override) produced a
    different key from its clean equivalent, so dedup and lookup silently
    missed. The new key is the correct one; if you persist these keys, regenerate
    any that were computed over text that could contain invisible characters.
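
Why invisible characters break dedup, and how stripping them restores key equality, can be seen with a toy key function. toy_search_key is a hypothetical stdlib approximation (NFKC, drop format characters, casefold); the library's key functions do more than this.

```python
import unicodedata


def toy_search_key(text):
    """NFKC-normalise, drop format (Cf) characters, then casefold."""
    folded = unicodedata.normalize("NFKC", text)
    # Soft hyphen (U+00AD) and bidi overrides such as U+202E are category Cf.
    visible = "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
    return visible.casefold()


stored = "pass\u00adword"   # contains an invisible soft hyphen
typed = "password"
print(stored == typed)                                   # raw strings differ
print(toy_search_key(stored) == toy_search_key(typed))   # keys collide as intended
```

Any persisted key computed without the stripping step keeps the invisible character baked in, which is why regeneration is recommended above.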

Fixed

  • #93: key functions (search_key/catalog_key/sort_key) leaked bidi
    and soft-hyphen characters, so visually-identical inputs produced
    non-colliding keys. They now apply strip_bidi after NFKC, matching the other
    canonicalization presets.
  • #82: Greek reverse transliteration (transliterate(text, target="el"))
    left literal Latin letters in the output ("psychi" → "ψyχη"). The forward
    direction romanizes Υ/υ as Y/y (including the ου/αυ/ευ diphthongs), so the
    el reverse table now maps Y/y back to Greek; round-trips no longer leak
    Latin letters.
  • #69: transliterate() resolved conflicting kwargs differently for str
    vs list input (one path silently dropped target, the other context).
    Conflicts are now checked once, before the dispatch, so both raise identically:
    context+target and context+tones raise ValueError.
  • #72: translit.unidecode() now mirrors the Unidecode 1.3 signature
    unidecode(string, errors="ignore", replace_str="?"), mapping Unidecode's
    errors modes (ignore/replace/preserve/strict) onto the native error
    handling, instead of raising TypeError on those kwargs.
  • #95: Greek Extended polytonic capitals for omicron/upsilon/omega/rho
    were corrupted, emitting unrelated Latin letters (Ὅμηρος → Xmiros,
    Ὑγίεια → Pgieia). Corrected all 50 affected entries to the proper base
    romanization, consistent with the monotonic forms (Ὅμηρος → Omiros).
  • #99.3: a typo'd form=/errors= value now raises even for pure-ASCII
    input. Previously the ASCII fast-path returned before reaching Rust, so the
    bad enum silently no-opped on ASCII and only raised on the first non-ASCII
    string. Validation now runs before the fast-path in normalize() and
    transliterate().
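
The ordering issue behind #99.3 (validate first, take the fast path second) is easy to reproduce in miniature. toy_normalize below is a hypothetical stand-in built on unicodedata, not the library's normalize().

```python
import unicodedata

VALID_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def toy_normalize(text, form="NFC"):
    """Validate arguments before the ASCII fast path, so typos never pass silently."""
    if form not in VALID_FORMS:
        raise ValueError(f"invalid form {form!r}; expected one of {', '.join(VALID_FORMS)}")
    # Fast path: pure-ASCII text is already in every normalisation form.
    if text.isascii():
        return text
    return unicodedata.normalize(form, text)
```

If the isascii() check came first, toy_normalize("abc", form="NFX") would return "abc" and the typo would surface only on the first non-ASCII input.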

Performance

  • #70 — the batch entry points (transliterate, slugify, normalize,
    strip_accents on list[str]) now release the GIL around their pure-Rust
    compute loop via py.allow_threads. Multi-threaded callers processing large
    batches now get real parallelism (~1.8× wall-clock with two threads) instead
    of serialising on the interpreter lock. Output is unchanged. Documented in the
    new "Concurrency (GIL)" section of docs/performance.md.
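
A caller-side pattern that can benefit from a GIL-releasing batch function is to split a large list into chunks and submit each chunk from its own thread. The sketch below is generic: run_in_chunks is a hypothetical helper, batch_fn stands for any list-in/list-out function, and real speedup depends on that function actually releasing the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_chunks(batch_fn, items, workers=2):
    """Apply a list -> list function to items in `workers` chunks, preserving order."""
    if not items:
        return []
    chunk_size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, so output order matches input.
        results = pool.map(batch_fn, chunks)
        return [out for chunk_out in results for out in chunk_out]


def upper_all(texts):
    """Placeholder batch function used only to demonstrate the pattern."""
    return [t.upper() for t in texts]


print(run_in_chunks(upper_all, ["a", "b", "c"]))
```

With a pure-Python batch_fn such as upper_all the threads still serialise on the interpreter lock; the pattern only pays off when the batch function does its work outside the GIL.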

Documentation

  • #94: strict_iso9 is no longer described as "ISO 9:1995". It emits ASCII
    digraphs (ж→zh, ч→ch, ш→sh), not the standard's diacritics (ž/č/š); translit
    tables are ASCII-only by design. Docstrings, the data-file header, and the docs
    now describe it as a scholarly ASCII (ISO 9-style) transliteration and warn it
    is not ISO 9-conformant. No behavior change.
  • #98: docs/user-guide/transliteration.md no longer instructs users to
    pip install translit-rs[arabic|hebrew|context] (those empty extras were
    removed in 0.6.0); it now documents the bootstrap_dicts.sh / TRANSLIT_DICT_DIR
    path, matching the README and the runtime error message.
  • #99.1 / #99.2: fixed two false docstrings: sort_key no longer claims to
    preserve accents (it folds them via transliteration, coinciding with
    search_key), and slugify no longer documents a pretranslate kwarg it
    never had.
  • #84: corrected the README throughput table (Cyrillic ~106M chars/sec,
    slugify ~712K slugs/sec on commodity 4-vCPU hardware) and added a
    hardware/methodology footnote; added a matching variance note to
    docs/performance.md.
  • #77: fixed the Text fluent-builder docstring example (normalize is
    keyword-only: .normalize(form="NFC")), reconciled the language-profile count
    (README now agrees with the docs at 83), and documented the context kwarg in
    the transliterate() docstring.

Internal / tests

  • #78 — added adversarial coverage for the raw-bytes decode path
    (detect_encoding / decode_to_utf8): deterministic hostile-byte cases in
    CI plus a Hypothesis st.binary() fuzz suite proving no-panic and
    invariant-preservation. Documented in THREAT_MODEL.md that the decode path
    has no input-size cap (caller's responsibility, per the 0.6.0 cap removal).
  • #79 — added a single-vs-batch kwarg parity regression test across the full
    kwarg matrix and a multi-script corpus (the tones batch drop fixed in 0.6.0
    can no longer recur silently).

v0.6.0 — security hardening

07 Jun 14:11
dd25cf8


[0.6.0] — 2026-06-07

A hardening and bug-fix release. Two new opt-in helpers (dedup_batch,
make_cached_transliterator) make this a minor bump; no public API was
removed. Several fixes change output for specific inputs; read Upgrade
notes before upgrading if you cache or persist transliterator/normalizer output.

Upgrade notes (output-affecting fixes)

Each of these was a bug; the new output is the correct one. If you store or cache
results that were keyed on the old (buggy) behaviour, regenerate them:

  • register_replacements() now actually applies. It was a silent no-op — the
    registered table was never consulted. Registered replacements now take effect
    across transliterate() (scalar, list, and context=True). If you registered
    replacements and (knowingly or not) relied on them being ignored, output changes.
  • transliterate(list, tones=True) now returns toned pinyin (was silently
    toneless on the list path); transliterate(list, target=…, tones=True) now
    raises ValueError for the forward-only parameter (was silently ignored).
  • normalize_confusables(text, target="cyrillic") no longer maps characters
    onto invisible combining marks (28 such mappings removed).
  • strip_obfuscation now folds intra-Latin ASCII homoglyphs (þ→p, ſ→f,
    ı→i, …) and is idempotent; sanitize_user_input is idempotent for
    control/invisible characters between combining marks; demojize no longer
    inserts a stray space after a tab/newline that precedes an emoji.
  • Context-aware transliteration (context=True, ar/fa/he) distribution
    changed. The empty arabic/hebrew/context pip extras have been removed
    (they never installed anything). The ~37 MB dictionaries are no longer tracked
    in git, and are not shipped in the wheel. Context mode now loads dictionaries
    from $TRANSLIT_DICT_DIR (build them with scripts/bootstrap_dicts.sh), or use
    the embed-dicts Cargo feature for a self-contained build. A packaged
    pip-installable distribution is tracked in #56/#60.
  • decode_to_utf8 default min_confidence changed 0.0 → 0.5. Low-confidence
    encoding guesses are now rejected by default instead of silently accepted; pass
    min_confidence=0.0 to restore the old behaviour. (#66)
  • Unknown lang codes now raise instead of silently falling back (#68). A
    typo'd code (lang="RU", lang="russian") used to behave exactly like
    lang=None — quietly-wrong output — while errors=/form= rejected bad
    values. transliterate, slugify, sanitize_filename, catalog_key,
    search_key, sort_key, and ml_normalize now raise TranslitError listing
    the valid codes. "auto", the nb/nn/da aliases, and register_lang()
    codes are accepted. (target= already validated.)
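
The move from silent fallback to an explicit error for unknown lang codes follows a simple pattern, sketched here with a placeholder code set. KNOWN_LANGS and validate_lang are hypothetical; the real registry is much larger and also accepts aliases and registered codes.

```python
# Placeholder subset for illustration only; not the library's registry.
KNOWN_LANGS = frozenset({"de", "el", "ru"})


def validate_lang(lang, known=KNOWN_LANGS):
    """Return lang unchanged if acceptable, otherwise raise listing the valid codes."""
    if lang is None or lang == "auto":
        return lang
    if lang not in known:
        valid = ", ".join(sorted(known))
        raise ValueError(f"unknown lang {lang!r}; valid codes: {valid}")
    return lang


try:
    validate_lang("RU")  # wrong case: previously this would have acted like lang=None
except ValueError as exc:
    print(exc)
```

Failing loudly here is what turns a quietly wrong output into an error the caller can see and fix at the call site.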

Changed

  • No library-imposed input-size limit (#80, #65). The 10 MiB input cap on
    transliterate, normalize, fold_case, and the preset pipelines has been
    removed — it was paternalistic, inconsistently applied (the ASCII fast
    path bypassed it; slugify/normalize_confusables/strip_zalgo never had it),
    and the threat model already disclaims DoS. All operations are linear time and
    memory; bounding untrusted input is the caller's responsibility, documented
    in the threat model and docstrings. The single retained size guard is the
    register_replacements output amplification bound (a tiny input can expand to
    an enormous string via a caller-registered value — an amplification a caller's
    own input check cannot foresee). Backward-compatible: only previously-rejected
    large inputs now succeed.
  • External wording: capability, not promise. Security-relevant features are now
    described as mechanisms (TR39 confusable mapping, bidi/zalgo stripping, hostname
    analysis) rather than outcome guarantees. Package descriptions, README, and docs no
    longer claim to "prevent"/"neutralize" attacks or achieve "perfect" recovery; the XMR
    benchmark figure is always stated with its tested-pairs scope. Engineering rigor is held
    to a high internal bar (see below); the external surface promises nothing it cannot
    measure.

Added

  • dedup_batch(texts, …) — transliterate a list, processing each distinct
    value once and mapping back (large win for repeated/categorical data; ~146× on a
    high-locality column). Stateless — no cache to invalidate; unique values are chunked
    at the 100k batch cap. (#31)
  • make_cached_transliterator(maxsize=…, …) — opt-in LRU-cached single-string
    transliterator with options fixed at construction. Self-invalidating: the next
    call after any register_lang/register_replacements/remove_replacement/
    clear_replacements clears the cache (via an internal table-generation counter), so
    it never serves stale results. Never enabled by default. (#31)
  • THREAT_MODEL.md — defines in-scope mechanisms, explicit out-of-scope items
    (confusables outside the bundled TR39 table, whole-script and multi-character
    confusables, Unicode-version skew, semantic attacks, DoS), and a vulnerability-vs-
    known-limitation policy, grounded in the literature (Holgers 2006, Deng 2020,
    BitAbuse 2025).
  • SECURITY.md rewritten on real footing: supported-version policy stated, triage
    scope defined, and linked to the threat model.
  • Security-invariant property tests + fuzzing. proptest invariants in Rust
    (src/presets.rs) assert no-panic, idempotence, and "no bidi/format control
    survives" for strip_obfuscation / security_clean / sanitize_user_input /
    strip_bidi across the Unicode input space; a deterministic, CI-gating
    adversarial attack-corpus regression (tests/test_attack_corpus.py:
    homoglyph / zalgo / invisible / bidi / combined, XMR-style); and a cargo-fuzz
    harness (fuzz/) for continuous coverage-guided fuzzing of the defense
    pipelines.
  • Confusable coverage for intra-Latin homoglyphs of basic ASCII letters
    (e.g. þ→p, ſ→f, ı→i, ƒ→f, Ɩ→l, ꜱ→s). The TR39 generator previously
    skipped all Latin-script sources for the Latin target, dropping ~83 genuine
    homoglyphs of A–Z/a–z; normalize_confusables/strip_obfuscation now fold
    them. Single-letter Latin confusable coverage of UTS#39 is now complete.
  • Pinned data/confusables.txt (UTS#39 17.0.0) as the reproducible, version-
    controlled input for scripts/gen_confusables.py (--download refreshes it),
    and a tests/test_confusable_coverage.py gate against Unicode-version drift.
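
The idea behind dedup_batch (process each distinct value once, then map the results back) can be sketched in a few lines. This is an illustration of the technique under stated assumptions, not the library's implementation; dedup_apply and the placeholder batch function are hypothetical, and chunking at a batch cap is omitted.

```python
def dedup_apply(batch_fn, texts):
    """Run batch_fn once over the distinct values of texts and map results back."""
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    unique = list(dict.fromkeys(texts))
    results = dict(zip(unique, batch_fn(unique)))
    return [results[text] for text in texts]


seen_sizes = []


def counting_upper(values):
    """Placeholder batch function that records how many items it was given."""
    seen_sizes.append(len(values))
    return [v.upper() for v in values]


print(dedup_apply(counting_upper, ["a", "b", "a", "a"]))
print(seen_sizes)  # the batch function saw only the distinct values
```

Because nothing is kept between calls, there is no cache to invalidate; the saving comes entirely from repetition within a single batch.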

Fixed

  • register_replacements() was a silent no-op — the global table was stored
    but never consulted by transliterate(). It now applies as a longest-match
    pre-pass (no cascade) across the scalar, list, and context=True forward paths,
    including ASCII-keyed replacements that previously bypassed Rust via the Python
    fast path. (#51)
  • tones= on the list/batch path was dropped: transliterate(["北京"], tones=True) returned toneless pinyin while the scalar path returned toned, and
    transliterate([...], target=…, tones=True) silently ignored the forward-only
    parameter instead of raising. Both now match the scalar path. (#14, #15)
  • normalize_confusables(target="cyrillic") emitted invisible combining marks:
    28 mappings folded a visible character onto a combining Cyrillic-Extended mark (an
    obfuscation vector). The generator now excludes combining-mark targets. (#24)
  • script_info("CanadianAboriginal")["context_aware"] raised KeyError — the
    entry omitted a required ScriptMeta field; a completeness guard now prevents
    recurrence. (#18)
  • Context path skipped strict_iso9/gost7034 mutual-exclusion validation:
    transliterate(text, context=True, strict_iso9=True, gost7034=True) now raises
    ValueError like the non-context path; the missing-dictionary error hint is now
    language-specific (he → hebrew). (#18)
  • demojize inserted a stray space after a tab/newline preceding an emoji
    ("a\t😀" → "a\t grinning face"); it now checks for any whitespace. (#12)
  • Compatibility digit variants fold to digits, not letters (#89). The
    confusables table mapped Mathematical Alphanumeric digits 𝟎/𝟏 (and the
    other four families, plus superscripts) to the look-alike letters O/l, so
    normalize_confusables("𝟏𝟎") gave "lO" and strip_obfuscation corrupted
    digit runs. The generator now folds any character whose NFKC form is an ASCII
    digit to that digit. They remain detected as confusable (is_confusable),
    but canonicalize to the correct number. (ASCII 0/1 were already unaffected.)
  • NFKC-compatible Latin is recovered instead of dropped to [?] (#81).
    Mathematical Alphanumeric Symbols (𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛 → Hello 123), presentation
    ligatures (ﬁ/ﬂ → fi/fl), and superscripts (x² → x2) now
    transliterate: an unmapped non-ASCII char is NFKC-decomposed and re-tried
    before the error fallback. This matches unidecode/anyascii and closes a
    filter-evasion ("fancy text") gap. Purely additive — only chars that were
    previously [?] are affected; emoji (no ASCII decomposition) still map to [?].
  • Defense pipelines are now idempotent (bugs found by the property tests):
    • strip_obfuscation: emoji whose CLDR name contains typographic punctuation
      (e.g. 👒 → woman’s hat, which contains U+2019) weren't folded because
      confusables ran before demojize; a second pass then folded ’ to '.
      Confusables now runs after demojize.
    • sanitize_user_input: an invisible or control character between combining
      marks (e.g. soft-hyphen, NUL) split a mark-run, so removing it after
      zalgo-capping merged runs that a second pass then capped differently. Bidi,
      zero-width, and control characters are now stripped before zalgo-capping.
  • Build-time and doc corrections: build.rs now rejects malformed \u{…} escapes
    in TSV data; embedded-dictionary parse errors are logged (not silently dropped...

translit 0.5.0

06 Jun 13:14
12c7417



This release sharpens what translit is: Unicode adversarial-text defense and canonicalization, powered by Rust — TR39 visual confusable mapping, homoglyph / bidi / zalgo / invisible-character stripping, and standards-based Latin/Cyrillic/Greek transliteration. It also adds context-aware transliteration for abjad scripts and fixes a long-standing Linux packaging bug.

Highlights

Adversarial-text defense, front and center. translit maps confusables by appearance (TR39: Cyrillic р → Latin p), the mapping that actually reverses a homoglyph attack — unlike unidecode/anyascii/ftfy, which map phonetically and can't. The new Adversarial-Text Defense guide covers the phonetic-vs-visual distinction and the XMR benchmark evidence.

from translit import strip_obfuscation, normalize_confusables, is_safe_hostname

strip_obfuscation("рroduсt")          # → "product"   (Cyrillic р→p, с→c via TR39)
normalize_confusables("раypal")        # → "paypal"
safe, details = is_safe_hostname("аpple.com")   # → (False, …)  leading Cyrillic а

Context-aware transliteration for Arabic, Persian, and Hebrew. transliterate(text, context=True) uses dictionary-based vowel restoration (bigram → unigram → context-free) to produce readable romanization instead of consonant skeletons. Opt in with pip install translit-rs[arabic] / [hebrew] / [context].

Fixed

  • Linux x86_64 wheels are now built as cp39-abi3. Earlier releases only shipped a cp38-cp38 x86_64 Linux wheel, forcing a source build (Rust toolchain) on Python 3.9+. pip install translit-rs now gets a prebuilt wheel on Linux x86_64 like every other platform. (#26)
  • Documentation corrections (consistent language-profile count; verified homoglyph examples).

Security

  • All third-party GitHub Actions pinned to commit SHAs across CI and the release pipeline; added Dependabot to keep them current. Dev/docs dependency bumps (Pygments 2.20.0, pytest 9.0.3).

Compatibility

No breaking changes. No public API, language codes, or script coverage were removed — translit-rs still has zero runtime dependencies. CJK/Indic/other scripts remain available as best-effort, unidecode-compatible coverage.

Install

pip install translit-rs

Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md

v0.4.0

29 Mar 19:43
dc54260



Breaking changes

  • Batch functions removed. transliterate_batch(), slugify_batch(), normalize_batch(), and strip_accents_batch() are gone. The base functions now accept both str and list[str] via @typing.overload:

    transliterate("café")              # → "cafe"
    transliterate(["café", "naïve"])   # → ["cafe", "naive"]
  • strip_obfuscation() no longer transliterates. Uses TR39 confusable mapping (visual similarity) instead of phonetic transliteration. lang= parameter removed. Chain with transliterate() if romanization is also needed.
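
The str / list[str] dual signature described for the base functions can be expressed with typing.overload roughly as follows. toy_strip_accents is a hypothetical stdlib stand-in that only removes combining marks; it is not the library's function.

```python
import unicodedata
from typing import overload


@overload
def toy_strip_accents(text: str) -> str: ...
@overload
def toy_strip_accents(text: list[str]) -> list[str]: ...


def toy_strip_accents(text):
    """Accept one string or a list of strings and return the matching shape."""
    if isinstance(text, list):
        return [toy_strip_accents(item) for item in text]
    # Decompose, then drop nonspacing combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(toy_strip_accents("café"))
print(toy_strip_accents(["café", "naïve"]))
```

The overloads exist only for type checkers: they let a str argument infer a str result and a list argument infer a list result, while a single runtime function handles both.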

New features

  • strip_obfuscation() — maximum-strength deobfuscation preset. Resolves homoglyph spoofing (Cyrillic р→p, с→c), strips zalgo, invisible chars, bidi attacks, expands emoji.
  • lang_info() / script_info() — structured metadata for all 83 languages and 57 scripts, with import-time drift assertions.
  • 18 new languages (Balinese, Bamum, Buginese, Cherokee, Cham, Coptic, Tai Lue, Lisu, Meitei, Northern Thai, N'Ko, Santali, Sundanese, Syriac, Tai Le, Tagalog, Tamazight, Vai) and 10 new Script enum members.

Bug fixes

  • Combining marks and zero-width characters no longer produce [?] (283 new TSV mappings)
  • TextPipeline confusable ordering fixed (transliterate before confusables)
  • demojize() spaces adjacent emoji replacements ("🔥🔥" → "fire fire")
  • SCRIPT_RANGES sort order fix + invariant test
  • Tibetan documentation corrected (Indic-phonetic, not Wylie)

Infrastructure

  • API stability tests (133), mutation testing killers (92)
  • CI restructured: 10× faster Python tests, path-filtered CodeQL, no duplicate runs
  • Transliteration provenance documentation
  • docs/index.md generated from README.md (single source of truth)

See CHANGELOG.md for full details.