Releases: raeq/disarm
v0.9.0 — first disarm release
The first release under the disarm name — the continuation of translit-rs (last released as 0.8.1). See #264 for the rename rationale.
disarm unifies the distribution and import names: pip install disarm then import disarm.
Breaking changes (migrating from translit-rs)
- Install/import: pip install translit-rs → pip install disarm; import translit → import disarm.
- Exception: the base exception TranslitError → DisarmError (the subclasses InvalidArgumentError / ResourceLimitError / UnsupportedError keep their names). DisarmError is still a ValueError subclass, so except ValueError keeps working.
- Env var: context-dictionary path TRANSLIT_DICT_DIR → DISARM_DICT_DIR.
- Native module translit._translit → disarm._disarm; console script translit → disarm.
The public transform API is otherwise unchanged: transliterate(), normalize(), slugify(), the security/pipeline helpers all keep their names and behaviour.
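For deployments that still export the old environment variable during the migration, a small fallback can bridge the rename. This is a minimal sketch: resolve_dict_dir is a hypothetical helper, not part of disarm, and it assumes only the two variable names listed above.

```python
import os


def resolve_dict_dir(environ=None):
    """Return the context-dictionary directory, or None if neither variable is set.

    Hypothetical helper (not a disarm API): prefers the new DISARM_DICT_DIR
    and falls back to the pre-rename TRANSLIT_DICT_DIR.
    """
    env = os.environ if environ is None else environ
    return env.get("DISARM_DICT_DIR") or env.get("TRANSLIT_DICT_DIR")


if __name__ == "__main__":
    print(resolve_dict_dir({"TRANSLIT_DICT_DIR": "/srv/dicts"}))
```

A deployment script could export the resolved path as DISARM_DICT_DIR before starting the application, so the old name can be retired on its own schedule.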
Install: pip install disarm==0.9.0
Full changelog: https://github.com/raeq/disarm/blob/main/CHANGELOG.md
v0.8.1 — honest benchmark numbers (final translit-rs release)
The final translit-rs release and the close of the 0.8 performance-hardening arc.
The project continues as disarm from 0.9.0 (#264). 0.8.1 ships honest, production-true benchmark numbers before the rename.
Changed
- Fresh-string benchmark regime (#277, #302): every timed call now receives a newly constructed str, the way production traffic always does. The prior cached-object measurement let CPython's per-object AsUTF8 cache hide ~105–137 ns/call of UTF-8 encode cost that only translit pays (pure-Python comparators never call AsUTF8), flattering it. JSON records now carry regime: fresh-string/v2; pre-flip history is the cached v1 regime and must not be compared across regimes.
- README short-string figures updated to the measured fresh-regime values: ~17× vs Unidecode (Latin), ~14× (mixed scripts), ~13× (Cyrillic/Greek); ~65 ns ASCII passthrough; the four-cell Unidecode-own sweep still holds (~1.3× on Unidecode's strongest case to ~25×).
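The difference between the two regimes can be sketched with a tiny harness. This is an illustration only, not the project's benchmark code: fresh_copies and time_per_call_ns are hypothetical names, and str.casefold merely stands in for the function under test (a pure-Python or built-in stand-in will not show the UTF-8 cache effect, which only matters for extensions that request the UTF-8 buffer).

```python
import time


def fresh_copies(text, n):
    """Return n distinct str objects that all compare equal to text.

    Concatenating and then slicing builds a new object each time, so no copy
    carries state cached on the original. Assumes len(text) >= 2, because
    CPython may share objects for empty and one-character strings.
    """
    return [(text + "\0")[:-1] for _ in range(n)]


def time_per_call_ns(func, inputs):
    """Mean wall-clock nanoseconds per call of func over inputs."""
    start = time.perf_counter_ns()
    for item in inputs:
        func(item)
    return (time.perf_counter_ns() - start) / len(inputs)


if __name__ == "__main__":
    sample = "Ünïcödé text"
    fresh = fresh_copies(sample, 10_000)   # fresh-string regime: new object per call
    cached = [sample] * 10_000             # cached-object regime: one object reused
    # str.casefold is a placeholder for the function being benchmarked.
    print("fresh :", time_per_call_ns(str.casefold, fresh))
    print("cached:", time_per_call_ns(str.casefold, cached))
```

Building the inputs before the timed loop keeps the construction cost out of the measurement while still handing each call an object it has never seen.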
Install: pip install translit-rs==0.8.1
Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md
v0.8.0 — performance & hardening
A performance and hardening release. The headline is a benchmark-gated optimisation programme (#233) that makes short-string transliterate roughly 15–21× faster than Unidecode (up from ~7–9×) and beats Unidecode on its own benchmark, while shrinking the library's static and resident memory.
Highlights
- Faster per call: a transliterate call now crosses the Python→Rust boundary exactly once and returns already-ASCII input as the original str object — roughly 70 ns with no allocation. Short strings are ~15–21× faster than Unidecode, and translit wins all four cells of Unidecode's own benchmark (#277, #281).
- Smaller footprint: the default BMP table is a page-table + interned-blob trie (~1 MB → ~58 KB), hanzi→pinyin a dense interned array (~600 KB → ~50 KB), Hangul a single packed blob (#237); context dictionaries are now zero-copy, roughly halving their resident memory (#238); replacement and slug scanning use Aho-Corasick automata and emoji match through a code-point trie (#242).
- Security hardening: is_safe_hostname flags every mixed-script label (#254); security presets no longer synthesise path separators from confusables (#248); rag_ingest runs the confusables step (#258); the stateful slugifiers validate lang (#257).
Upgrade notes
- Minimum Python is now 3.10 (was 3.9). The extension targets the stable-ABI floor abi3-py310 (#277); Python 3.9 wheels are no longer produced.
- is_safe_hostname now flags every mixed-script label as unsafe (#254), not only the Latin-paired high-risk combinations. Inspect the mixed_script / scripts fields for a more permissive policy; the check fails closed by design.
- Output may change for some inputs: the security-preset path-separator fix (#248), rag_ingest confusables canonicalisation (#258), stateful-slugifier lang validation (#257), and a few correctness edge cases (#249, #253, #255).
See the full changelog for the complete list.
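The idea of a "mixed-script label" can be approximated with the standard library alone. This is a rough sketch, not the library's implementation: it assumes the first word of a character's Unicode name ("LATIN", "CYRILLIC", "GREEK", ...) is a usable stand-in for the Script property, which Python's unicodedata does not expose, and both function names are hypothetical.

```python
import unicodedata


def label_scripts(label):
    """Approximate the set of scripts used in one hostname label.

    Assumption: the first word of the Unicode character name stands in for
    the Script property. Digits and hyphens are treated as script-neutral.
    """
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def has_mixed_script_label(hostname):
    """True if any dot-separated label mixes two or more scripts."""
    return any(len(label_scripts(label)) > 1 for label in hostname.split("."))


if __name__ == "__main__":
    print(has_mixed_script_label("\u0430pple.com"))  # Cyrillic а + Latin "pple"
    print(has_mixed_script_label("apple.com"))
```

A hostname whose labels are each single-script (for example an all-Cyrillic label followed by a Latin TLD) is not flagged by this sketch, because the check runs per label.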
0.7.0
A feature and architecture release. Headlines: a unified, catchable exception
hierarchy; terminal column-width measurement (terminal_width /
grapheme_width); native errors="strict" transliteration; LLM/RAG
guardrail pipeline presets; and a substantial push of validation and
configuration logic down into the Rust core, so the upcoming multi-language
bindings inherit one behaviour instead of reimplementing it. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.
Upgrade notes
- Exceptions now form a hierarchy. Every library error subclasses
TranslitError, with InvalidArgumentError, ResourceLimitError, and
UnsupportedError beneath it. TranslitError remains a ValueError
subclass, so existing except ValueError keeps working. Several error
message strings were enriched/standardised (#186, #187) — code matching
exact message text may need updating; code matching exception types is
unaffected.
- lang= is validated even for ASCII input (#197). A binding-side ASCII
fast path previously skipped language validation, so
transliterate("abc", lang="zz") silently returned the input; it now raises
InvalidArgumentError, matching how non-ASCII input always behaved.
- slugify_filename / Slugify(safe_chars=…) output corrected (see Fixed):
slugify_filename("My Report.pdf") now returns "My_Report.pdf", not
"My.Report_pdf". Output for inputs that use safe_chars may change.
- New modes: errors="strict" for transliterate (#184) and
decode_to_utf8(strict=True) (#189).
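The compatibility claim in the first note follows from ordinary subclassing. The sketch below uses stand-in classes that mirror the names above rather than importing the library, and check_lang is a hypothetical validator with a made-up set of codes.

```python
class TranslitError(ValueError):
    """Stand-in for the library's base exception (a ValueError subclass)."""


class InvalidArgumentError(TranslitError):
    """Stand-in for the argument-validation subclass."""


class ResourceLimitError(TranslitError):
    """Stand-in for the resource-limit subclass."""


class UnsupportedError(TranslitError):
    """Stand-in for the unsupported-operation subclass."""


def check_lang(lang, known=("de", "el", "ru")):
    """Hypothetical validator: reject unknown codes even for ASCII input."""
    if lang is not None and lang not in known:
        raise InvalidArgumentError(
            f"unknown lang {lang!r}; valid codes: {', '.join(known)}"
        )
    return lang


def caught_by(exc_type, lang):
    """Report whether an except clause for exc_type would catch the failure."""
    try:
        check_lang(lang)
    except exc_type:
        return True
    return False


if __name__ == "__main__":
    print(caught_by(ValueError, "zz"), caught_by(TranslitError, "zz"))
```

Code that catches ValueError keeps working, while new code can catch the narrower subclass it actually cares about.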
Added
- terminal_width / grapheme_width (#224): terminal column width per
grapheme cluster (UAX #11 East Asian Width). Wide/fullwidth and
emoji-presented clusters are 2 columns; combining marks, controls, and
zero-width characters are 0. Ambiguous characters are 1 by default, or 2 with
ambiguous_wide=True. Width data is generated at build time from the pinned
UCD (no runtime data, no unsafe). Measures cells, not pixels; tabs are not
expanded.
- errors="strict" + find_untranslatable (#184): strict transliteration
raises on the first untranslatable character (reporting it and its byte
offset); find_untranslatable returns all of them without raising.
- Guardrail pipeline presets (#139): TextPipeline gains strip_bidi and
strip_zalgo steps and the llm_guardrail / rag_ingest named profiles for
LLM/RAG input sanitisation.
- get_pipeline / list_profiles (#229): the named policy-profile registry
now lives in the Rust core; the Python helpers are thin wrappers over it.
- decode_to_utf8(strict=True) (#189): raise on lossy/replacement decoding
instead of silently substituting U+FFFD.
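The width rules listed for terminal_width can be approximated per code point with the standard library. This is a simplified sketch, not the library's algorithm: it works on code points rather than grapheme clusters, ignores emoji presentation sequences, and uses whatever Unicode version the running Python ships. approx_width is a hypothetical name.

```python
import unicodedata

# General categories counted as zero columns in this sketch:
# controls, format characters, and nonspacing/enclosing marks.
ZERO_WIDTH_CATEGORIES = {"Cc", "Cf", "Mn", "Me"}


def approx_width(text, ambiguous_wide=False):
    """Approximate terminal columns using East Asian Width per code point."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):            # wide and fullwidth
            width += 2
        elif eaw == "A" and ambiguous_wide:
            width += 2                   # ambiguous, treated as wide on request
        else:
            width += 1
    return width


if __name__ == "__main__":
    for sample in ("abc", "日本語", "e\u0301", "\u00a7"):
        print(repr(sample), approx_width(sample))
```

Tabs are not expanded here either: a tab is a control character and contributes zero columns in this sketch.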
Changed
- Unified exception hierarchy (#183): the Python error surface is a
TranslitError base with categorised subclasses; sites that previously raised
bare ValueError are unified (foundation laid in 0.6.3 via #181).
- Validation moved into the Rust core (#185, #217, #229, #230, #231): enum
validation, the transliterate() argument-conflict matrix, non-negative
max_length / max_graphemes checks, safe_chars, and min_confidence
range-checking now live in the core, so other bindings enforce the identical
contract without reimplementing it. The Python layer keeps only type guards.
- Actionable error messages (#186, #187): weak messages now name the
offending value, list valid options, and suggest a "did you mean…?" where
applicable; message style is standardised across the surface.
- Error cause chains (#188): wrapped errors surface the underlying cause via
__cause__ rather than flattening it into the message.
- TextPipeline step ordering (#174) is derived from a single source of
truth, removing drift between configuration and execution order.
- All-ASCII preset fast path (#198): presets skip the NFKC pass for pure-ASCII
input (behaviour-preserving).
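The two error-reporting changes above (actionable messages and cause chains) follow well-known Python patterns. The sketch below is an assumption about the general shape, not the library's code: the exception class is a stand-in, the option list is illustrative, and difflib is swapped in for whatever suggestion logic the Rust core uses.

```python
import difflib

VALID_ERRORS = ("ignore", "replace", "preserve", "strict")  # illustrative options


class InvalidArgumentError(ValueError):
    """Stand-in for the library exception of the same name."""


def check_errors_mode(value):
    """Name the offending value, list valid options, and suggest a near match."""
    if value in VALID_ERRORS:
        return value
    message = f"invalid errors={value!r}; valid options: {', '.join(VALID_ERRORS)}"
    guess = difflib.get_close_matches(str(value), VALID_ERRORS, n=1)
    if guess:
        message += f". Did you mean {guess[0]!r}?"
    raise InvalidArgumentError(message)


def parse_limit(raw):
    """Wrap a low-level failure while keeping it reachable via __cause__."""
    try:
        return int(raw)
    except ValueError as exc:
        raise InvalidArgumentError(
            f"max_length must be an integer, got {raw!r}"
        ) from exc


if __name__ == "__main__":
    try:
        check_errors_mode("strcit")
    except InvalidArgumentError as err:
        print(err)
```

Using raise ... from keeps the original exception attached, so callers and tracebacks can inspect it without parsing the message text.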
Fixed
- slugify_filename / Slugify(safe_chars=…) preserved safe characters at
the wrong positions — slugify_filename("My Report.pdf") returned
"My.Report_pdf" instead of the awesome-slugify-correct "My_Report.pdf".
safe_chars are now handled natively in the Rust core: kept verbatim and
treated as word characters so they hold their position (#156, #230). The prior
test only covered a dot-free input, so the bug was uncaught; regression tests
now cover filenames with extensions, multiple dots, and UniqueSlugify +
max_length.
- slugify(default=…) is now sanitised through the same slug pipeline (so a
caller-supplied fallback cannot smuggle path-traversal or URL metacharacters
into output documented as URL-safe), threads through the stateful Slugifier /
UniqueSlugifier forms, and a negative max_length now raises a catchable
InvalidArgumentError on both the scalar and batch paths instead of an
uncatchable OverflowError (#193, #169).
- Low-severity hardening bundle (#200): eight small robustness fixes
(bounds, overflow, and edge-case handling) gathered into one pass.
Security
- The RustSec advisory audit (cargo-audit) now blocks merge via the
required "Rust checks passed" gate on every PR — an advisory can land on a
dependency without any code change here (#195).
Removed
- Docker image build/publish and its Trivy CVE scan (#138). translit is a
pip install-first library; previously published images remain as historical
artifacts, but no new ones are produced. Install the CLI via
pip install translit-rs.
Documentation
- Executable cookbook (#154, #91, #140, #156, #172): a Sybil doc-test harness
with a CI gate, unidecode→translit migration recipes, an "LLM pipelines" page,
a tokenizer-preprocessing page, and an anti-rot lint that turned 307 decorative
# => claims into checked assertions.
- normalize-first canonicalisation recipe (#174) and a formal-verification
assurance taxonomy (#223 — proof-by-exhaustion / structural / property-tested,
tagging each I1–I7 invariant), plus grapheme-integrity property tests (#174).
- The project adopted the Developer Certificate of Origin (#165); all commits
are signed off. The custom-emoji-provider 9-codepoint window cap is now
documented (#199).
v0.6.3
A correctness, maintenance, and architecture-foundation release. No output-affecting changes — behaviour-preserving throughout; the one new public behaviour (slugify(default=...)) is opt-in.
Highlights
- Error-model foundation (#181): a pure-Rust Error enum + stable code() + a single From<Error> for PyErr boundary — decouples the core from PyO3 and lays the groundwork for the multi-language bindings roadmap.
- slugify(text, default="…") — opt-in fallback for inputs that would slug to "" (#97).
- Fixed: PRESETS["strip_obfuscation"] order (#141), lock-poison Python UserWarning (#117), docs/api/exceptions.md (#182).
- Dependencies (migrated + verified behaviour-preserving): phf 0.13, criterion 0.8, chardetng 1.0 (#146 / #153 / #164).
- Maintenance: __init__.py split (#73), build.rs language auto-discovery (#74), stub/binary drift-check (#76), integration-test split (#75), the "Conversations resolved" merge gate (#55), and a documented dependency-upgrade methodology.
Pre-release verified: full Tier-1 CI + Tier-3 exhaustive (all Hangul/BMP/CJK/Indic) + formal invariants I1–I7.
Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md
Correction: this release originally listed a Trivy image-scan fix (#138). That fix did not work, and the Docker build/publish pipeline has since been removed entirely.
pip install -U translit-rs
v0.6.2 — correctness, security, performance & maintenance
A correctness, security, performance and maintenance release triaged from a
post-0.6.1 issue sweep (#101–#132). No public API removed; one small new public
behaviour (slugify(save_order=True) now functions). Two output-affecting
fixes — see Upgrade notes.
Upgrade notes (output-affecting)
- slugify(save_order=True) was an accepted no-op; it now strips only
leading/trailing stopwords (preserving interior word order), matching
python-slugify (#118). If you passed save_order=True, slug output changes.
- decode_to_utf8 default min_confidence 0.5 → 0.95 (#103). The old
default was inert (the detector only reports 0.50 / 0.95, and 0.50 < 0.50
is false), so it never rejected. It now requires high confidence by default;
pass min_confidence=0.0 to accept any guess. (No practical change today —
the detector currently always reports 0.95.)
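Why the old default never rejected anything comes down to a strict comparison against a value the detector can equal but never fall below. The snippet assumes the rejection rule is exactly "confidence < min_confidence", as the note implies; is_rejected is a hypothetical name.

```python
def is_rejected(confidence, min_confidence):
    """Rejection rule as described in the note: strictly below the threshold."""
    return confidence < min_confidence


# The only two confidence values the detector is said to report.
REPORTED = (0.50, 0.95)

old_default = [is_rejected(c, 0.5) for c in REPORTED]    # nothing is rejected
new_default = [is_rejected(c, 0.95) for c in REPORTED]   # only 0.50 is rejected
accept_any = [is_rejected(c, 0.0) for c in REPORTED]     # nothing is rejected

if __name__ == "__main__":
    print(old_default, new_default, accept_any)
```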
Fixed
- #102 — UniqueSlugify no longer panics across the FFI boundary on a
multibyte separator + small max_length (byte slice landed mid-codepoint;
now uses floor_char_boundary).
- #101 — context bigram disambiguation tier was unreachable (it reset on
every inter-word space); it now resets only on hard boundaries, so the tier
fires in normal prose.
- #104 — set_emoji_provider now obeys seal_registrations() (the provider
swap previously defeated the seal).
- #103 — decode_to_utf8 default confidence now actually gates (see notes).
- #107 — a corrupt context dictionary now reports a distinct "corrupt" error
instead of the misleading "not found" remedy (DictState enum).
- #121 — PRESETS["sanitize_user_input"] now reflects the real pipeline
order (strip invisibles before zalgo); Python registry and Rust doc aligned.
- #129 — Text.transliterate() stub now declares the tones / context
parameters the implementation accepts.
- #131 — Slugify(uids=...) emits a correct wrong-class warning rather than
a spurious deprecation warning.
- #122 — disambiguated the _compat should_warn nested ternary.
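The #102 fix relies on moving a byte index back to the start of a UTF-8 sequence before slicing. The Python sketch below mirrors that idea on bytes; it is an illustration of the technique, not the Rust code, and the function names are chosen here for the example.

```python
def floor_char_boundary(data: bytes, index: int) -> int:
    """Largest position <= index that does not split a UTF-8 sequence.

    UTF-8 continuation bytes have the bit pattern 10xxxxxx, so stepping back
    while the byte at `index` is a continuation byte lands on a boundary.
    """
    if index >= len(data):
        return len(data)
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text to at most max_bytes of UTF-8 without cutting a character."""
    encoded = text.encode("utf-8")
    cut = floor_char_boundary(encoded, max(max_bytes, 0))
    return encoded[:cut].decode("utf-8")


if __name__ == "__main__":
    # "→" is three bytes in UTF-8, so limits of 2 and 3 both fall inside it.
    print([truncate_utf8("a→b", n) for n in range(6)])
```

A naive encoded[:max_bytes].decode("utf-8") would raise UnicodeDecodeError for the limits that land inside the arrow, which is the Python analogue of the panic described above.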
Security
- #105 — added a cargo audit (RustSec advisory) CI job and a cargo
Dependabot ecosystem.
- #132 — added a Trivy CVE scan of the published image to the release
workflow (SARIF → Security tab, fails on fixable HIGH/CRITICAL) + .trivyignore.
- #106 — Rust diagnostics now route through Python warnings instead of
bare eprintln!, so applications can capture/suppress them.
Performance (output-preserving)
- #108 codepoint-range diacritic checks in tokenize(); #109 mem::take
per token boundary; #110 single ch.nfkc() pass on the NFKC fallback;
#111 lowered MAX_CAPACITY_HINT 256 MiB → 8 MiB; #112/#113 emoji
matching uses stack buffers + a fixed sliding window (no per-char Vec / String);
#114 slugify uses Cow (no eager to_owned); #115 context tokenize()
returns borrowed (Cow) slices of the input — zero per-token allocation
(Rust API: the crate-internal context::Token.text changed from String
to Cow<'_, str>; no effect on the Python API); #116 clamped the
ContextDict capacity hint.
Maintenance
- #118 implemented slugify(save_order=True); #119 SlugConfig::from_pyargs
dedupes the four slugify PyO3 entrypoints; #120 _build_slug_kwargs helper;
#123 seal-enforcement docs on each tables:: mutator; #124
infallibility comments; #125 typed _CallableModule.__call__ kwargs;
#126 corrected recover_lock doc; #127 documented the lazy-import
workaround; #128 renamed _mutation_generation → _registration_generation;
#130 annotated the defence-in-depth conflict check.
v0.6.1 — bug-fix, correctness & performance
A bug-fix and test-hardening release. No public API was removed and no new
public names were added. One fix changes key output for inputs containing
invisible characters — see Upgrade notes.
Upgrade notes (output-affecting fix)
- search_key / catalog_key / sort_key now strip bidi overrides and
soft-hyphen / format characters (#93). Previously a value stored with an
invisible character (e.g."password","usertxt") produced a
different key from its clean equivalent, so dedup and lookup silently
missed. The new key is the correct one; if you persist these keys, regenerate
any that were computed over text that could contain invisible characters.
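The effect of the fix can be illustrated with a simplified key function. This sketch is an assumption about the general approach (NFKC, then drop format characters) and is far smaller than the real key functions, which also fold case and transliterate; strip_format_chars is a hypothetical name.

```python
import unicodedata


def strip_format_chars(text):
    """NFKC-normalise, then drop characters in general category Cf.

    Category Cf (format) covers the soft hyphen, zero-width characters, and
    the bidi override/embedding controls.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    stored = "pass\u00adword"      # contains a soft hyphen (U+00AD)
    spoofed = "user\u202etxt"      # contains a right-to-left override (U+202E)
    print(strip_format_chars(stored) == "password")
    print(strip_format_chars(spoofed) == "usertxt")
```

Once both the stored value and the lookup value pass through the same stripping step, visually identical inputs collapse to the same key.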
Fixed
- #93 — key functions (search_key / catalog_key / sort_key) leaked bidi
and soft-hyphen characters, so visually-identical inputs produced
non-colliding keys. They now strip_bidi after NFKC, matching the other
canonicalization presets.
- #82 — Greek reverse transliteration (transliterate(text, target="el"))
left literal Latin letters in the output ("psychi" → "ψyχη"). The forward
direction romanizes Υ/υ as Y/y (including the ου/αυ/ευ diphthongs), so the
el reverse table now maps Y/y back to Greek; round-trips no longer leak
Latin letters.
- #69 — transliterate() resolved conflicting kwargs differently for str
vs list input (one path silently dropped target, the other context).
Conflicts are now checked once, before the dispatch, so both raise identically:
context + target and context + tones raise ValueError.
- #72 — translit.unidecode() now mirrors the Unidecode 1.3 signature
unidecode(string, errors="ignore", replace_str="?"), mapping Unidecode's
errors modes (ignore / replace / preserve / strict) onto the native error
handling, instead of raising TypeError on those kwargs.
- #95 — Greek Extended polytonic capitals for omicron/upsilon/omega/rho
were corrupted, emitting unrelated Latin letters (Ὅμηρος → Xmiros,
Ὑγίεια → Pgieia). Corrected all 50 affected entries to the proper base
romanization, consistent with the monotonic forms (Ὅμηρος → Omiros).
- #99.3 — a typo'd form= / errors= value now raises even for pure-ASCII
input. Previously the ASCII fast-path returned before reaching Rust, so the
bad enum silently no-opped on ASCII and only raised on the first non-ASCII
string. Validation now runs before the fast-path in normalize() and
transliterate().
Performance
- #70 — the batch entry points (transliterate, slugify, normalize,
strip_accents on list[str]) now release the GIL around their pure-Rust
compute loop via py.allow_threads. Multi-threaded callers processing large
batches now get real parallelism (~1.8× wall-clock with two threads) instead
of serialising on the interpreter lock. Output is unchanged. Documented in the
new "Concurrency (GIL)" section of docs/performance.md.
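A caller can take advantage of a GIL-releasing batch function by splitting a large list across threads. The sketch below shows one way to do that under stated assumptions: process_in_threads is a hypothetical helper, and the upper-casing stand-in is only there to keep the example self-contained (it does not release the GIL, so it shows the structure, not the speed-up).

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_threads(batch_func, items, workers=2):
    """Split items into contiguous chunks and run batch_func on each in a thread.

    batch_func must accept a list and return a list of the same length.
    Chunk order is preserved, so the output lines up with the input.
    """
    if not items:
        return []
    size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(batch_func, chunks))
    return [out for chunk in results for out in chunk]


def upper_batch(chunk):
    """Placeholder batch function standing in for a list-accepting library call."""
    return [s.upper() for s in chunk]


if __name__ == "__main__":
    print(process_in_threads(upper_batch, ["a", "b", "c", "d", "e"], workers=2))
```

Threads only help when the function called inside them releases the GIL for most of its run time, which is what this release changes for the list entry points.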
Documentation
- #94 — strict_iso9 is no longer described as "ISO 9:1995". It emits ASCII
digraphs (ж→zh, ч→ch, ш→sh), not the standard's diacritics (ž/č/š) — translit
tables are ASCII-only by design. Docstrings, the data-file header, and the docs
now describe it as a scholarly ASCII (ISO 9-style) transliteration and warn it
is not ISO 9-conformant. No behavior change.
- #98 — docs/user-guide/transliteration.md no longer instructs users to
pip install translit-rs[arabic|hebrew|context] (those empty extras were
removed in 0.6.0); it now documents the bootstrap_dicts.sh / TRANSLIT_DICT_DIR
path, matching the README and the runtime error message.
- #99.1 / #99.2 — fixed two false docstrings: sort_key no longer claims to
preserve accents (it folds them via transliteration, coinciding with
search_key), and slugify no longer documents a pretranslate kwarg it
never had.
- #84 — corrected the README throughput table (Cyrillic ~106M chars/sec,
slugify ~712K slugs/sec on commodity 4-vCPU hardware) and added a
hardware/methodology footnote; added a matching variance note to
docs/performance.md.
- #77 — fixed the Text fluent-builder docstring example (normalize is
keyword-only: .normalize(form="NFC")), reconciled the language-profile count
(README now agrees with the docs at 83), and documented the context kwarg in
the transliterate() docstring.
Internal / tests
- #78 — added adversarial coverage for the raw-bytes decode path
(detect_encoding / decode_to_utf8): deterministic hostile-byte cases in
CI plus a Hypothesis st.binary() fuzz suite proving no-panic and
invariant-preservation. Documented in THREAT_MODEL.md that the decode path
has no input-size cap (caller's responsibility, per the 0.6.0 cap removal).
- #79 — added a single-vs-batch kwarg parity regression test across the full
kwarg matrix and a multi-script corpus (the tones batch drop fixed in 0.6.0
can no longer recur silently).
v0.6.0 — security hardening
[0.6.0] — 2026-06-07
A hardening and bug-fix release. Two new opt-in helpers (dedup_batch,
make_cached_transliterator) make this a minor bump; no public API was
removed. Several fixes change output for specific inputs — read Upgrade
notes before upgrading if you cache or persist transliterator/normalizer output.
Upgrade notes (output-affecting fixes)
Each of these was a bug; the new output is the correct one. If you store or cache
results that were keyed on the old (buggy) behaviour, regenerate them:
- register_replacements() now actually applies. It was a silent no-op — the
registered table was never consulted. Registered replacements now take effect
across transliterate() (scalar, list, and context=True). If you registered
replacements and (knowingly or not) relied on them being ignored, output changes.
- transliterate(list, tones=True) now returns toned pinyin (was silently
toneless on the list path); transliterate(list, target=…, tones=True) now
raises ValueError for the forward-only parameter (was silently ignored).
- normalize_confusables(text, target="cyrillic") no longer maps characters
onto invisible combining marks (28 such mappings removed).
- strip_obfuscation now folds intra-Latin ASCII homoglyphs (þ→p, ſ→f,
ı→i, …) and is idempotent; sanitize_user_input is idempotent for
control/invisible characters between combining marks; demojize no longer
inserts a stray space after a tab/newline that precedes an emoji.
- Context-aware transliteration (context=True, ar/fa/he) distribution
changed. The empty arabic / hebrew / context pip extras have been removed
(they never installed anything). The ~37 MB dictionaries are no longer tracked
in git, and are not shipped in the wheel. Context mode now loads dictionaries
from $TRANSLIT_DICT_DIR (build them with scripts/bootstrap_dicts.sh), or use
the embed-dicts Cargo feature for a self-contained build. A packaged
pip-installable distribution is tracked in #56/#60.
- decode_to_utf8 default min_confidence changed 0.0 → 0.5. Low-confidence
encoding guesses are now rejected by default instead of silently accepted; pass
min_confidence=0.0 to restore the old behaviour. (#66)
- Unknown lang codes now raise instead of silently falling back (#68). A
typo'd code (lang="RU", lang="russian") used to behave exactly like
lang=None — quietly-wrong output — while errors= / form= rejected bad
values. transliterate, slugify, sanitize_filename, catalog_key,
search_key, sort_key, and ml_normalize now raise TranslitError listing
the valid codes. "auto", the nb / nn / da aliases, and register_lang()
codes are accepted. (target= already validated.)
Changed
- No library-imposed input-size limit (#80, #65). The 10 MiB input cap on
transliterate, normalize, fold_case, and the preset pipelines has been
removed — it was paternalistic, inconsistently applied (the ASCII fast
path bypassed it; slugify / normalize_confusables / strip_zalgo never had it),
and the threat model already disclaims DoS. All operations are linear time and
memory; bounding untrusted input is the caller's responsibility, documented
in the threat model and docstrings. The single retained size guard is the
register_replacements output amplification bound (a tiny input can expand to
an enormous string via a caller-registered value — an amplification a caller's
own input check cannot foresee). Backward-compatible: only previously-rejected
large inputs now succeed.
- External wording: capability, not promise. Security-relevant features are now
described as mechanisms (TR39 confusable mapping, bidi/zalgo stripping, hostname
analysis) rather than outcome guarantees. Package descriptions, README, and docs no
longer claim to "prevent"/"neutralize" attacks or achieve "perfect" recovery; the XMR
benchmark figure is always stated with its tested-pairs scope. Engineering rigor is held
to a high internal bar (see below); the external surface promises nothing it cannot
measure.
Added
- dedup_batch(texts, …) — transliterate a list, processing each distinct
value once and mapping back (large win for repeated/categorical data; ~146× on a
high-locality column). Stateless — no cache to invalidate; unique values are chunked
at the 100k batch cap. (#31)
- make_cached_transliterator(maxsize=…, …) — opt-in LRU-cached single-string
transliterator with options fixed at construction. Self-invalidating: the next
call after any register_lang / register_replacements / remove_replacement /
clear_replacements clears the cache (via an internal table-generation counter), so
it never serves stale results. Never enabled by default. (#31)
- THREAT_MODEL.md — defines in-scope mechanisms, explicit out-of-scope items
(confusables outside the bundled TR39 table, whole-script and multi-character
confusables, Unicode-version skew, semantic attacks, DoS), and a
vulnerability-vs-known-limitation policy, grounded in the literature (Holgers 2006,
Deng 2020, BitAbuse 2025).
- SECURITY.md rewritten on real footing: supported-version policy stated, triage
scope defined, and linked to the threat model.
- Security-invariant property tests + fuzzing. proptest invariants in Rust
(src/presets.rs) assert no-panic, idempotence, and "no bidi/format control
survives" for strip_obfuscation / security_clean / sanitize_user_input /
strip_bidi across the Unicode input space; a deterministic, CI-gating
adversarial attack-corpus regression (tests/test_attack_corpus.py:
homoglyph / zalgo / invisible / bidi / combined, XMR-style); and a cargo-fuzz
harness (fuzz/) for continuous coverage-guided fuzzing of the defense
pipelines.
- Confusable coverage for intra-Latin homoglyphs of basic ASCII letters
(e.g. þ→p, ſ→f, ı→i, ƒ→f, Ɩ→l, ꜱ→s). The TR39 generator previously
skipped all Latin-script sources for the Latin target, dropping ~83 genuine
homoglyphs of A–Z/a–z; normalize_confusables / strip_obfuscation now fold
them. Single-letter Latin confusable coverage of UTS#39 is now complete.
- Pinned data/confusables.txt (UTS#39 17.0.0) as the reproducible,
version-controlled input for scripts/gen_confusables.py (--download refreshes it),
and a tests/test_confusable_coverage.py gate against Unicode-version drift.
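The two helpers at the top of this list rest on simple mechanisms that can be sketched without the library. Both dedup_apply and GenerationCache below are hypothetical illustrations of the described behaviour (process each distinct value once; clear an LRU cache when a generation counter changes), not the actual implementations.

```python
from functools import lru_cache


def dedup_apply(func, texts):
    """Apply func once per distinct value, then map results back in input order."""
    unique = list(dict.fromkeys(texts))            # distinct values, first-seen order
    results = {value: func(value) for value in unique}
    return [results[value] for value in texts]


class GenerationCache:
    """LRU-cached wrapper that clears itself when a generation counter changes.

    current_generation is any zero-argument callable returning an int; a
    change in its value signals that cached results may be stale.
    """

    def __init__(self, func, current_generation, maxsize=1024):
        self._cached = lru_cache(maxsize=maxsize)(func)
        self._current_generation = current_generation
        self._seen = current_generation()

    def __call__(self, text):
        generation = self._current_generation()
        if generation != self._seen:
            self._cached.cache_clear()             # tables changed: drop stale results
            self._seen = generation
        return self._cached(text)


if __name__ == "__main__":
    print(dedup_apply(str.upper, ["de", "fr", "de", "de"]))
```

The stateless variant suits one-off batches with many repeats; the cached variant suits long-running single-string callers, at the cost of a counter check on every call.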
Fixed
- register_replacements() was a silent no-op — the global table was stored
but never consulted by transliterate(). It now applies as a longest-match
pre-pass (no cascade) across the scalar, list, and context=True forward paths,
including ASCII-keyed replacements that previously bypassed Rust via the Python
fast path. (#51)
- tones= on the list/batch path was dropped: transliterate(["北京"], tones=True)
returned toneless pinyin while the scalar path returned toned, and
transliterate([...], target=…, tones=True) silently ignored the forward-only
parameter instead of raising. Both now match the scalar path. (#14, #15)
- normalize_confusables(target="cyrillic") emitted invisible combining marks —
28 mappings folded a visible character onto a combining Cyrillic-Extended mark (an
obfuscation vector). The generator now excludes combining-mark targets. (#24)
- script_info("CanadianAboriginal")["context_aware"] raised KeyError — the
entry omitted a required ScriptMeta field; a completeness guard now prevents
recurrence. (#18)
- Context path skipped strict_iso9 / gost7034 mutual-exclusion validation —
transliterate(text, context=True, strict_iso9=True, gost7034=True) now raises
ValueError like the non-context path; the missing-dictionary error hint is now
language-specific (he→hebrew). (#18)
- demojize inserted a stray space after a tab/newline preceding an emoji
("a\t😀" → "a\t grinning face"); it now checks for any whitespace. (#12)
- Compatibility digit variants fold to digits, not letters (#89). The
confusables table mapped Mathematical Alphanumeric digits 𝟎/𝟏 (and the
other four families, plus superscripts) to the look-alike letters O/l, so
normalize_confusables("𝟏𝟎") gave "lO" and strip_obfuscation corrupted
digit runs. The generator now folds any character whose NFKC form is an ASCII
digit to that digit. They remain detected as confusable (is_confusable),
but canonicalize to the correct number. (ASCII 0/1 were already unaffected.)
- NFKC-compatible Latin is recovered instead of dropped to [?] (#81).
Mathematical Alphanumeric Symbols (𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛 → Hello 123), presentation
ligatures (ﬁ/ﬂ → fi/fl), and superscripts (x² → x2) now
transliterate: an unmapped non-ASCII char is NFKC-decomposed and re-tried
before the error fallback. This matches unidecode/anyascii and closes a
filter-evasion ("fancy text") gap. Purely additive — only chars that were
previously [?] are affected; emoji (no ASCII decomposition) still map to [?].
- Defense pipelines are now idempotent (bugs found by the property tests):
  - strip_obfuscation: emoji whose CLDR name contains typographic punctuation
  (e.g. 👒 → woman’s hat, U+2019 ’) weren't folded because confusables ran
  before demojize; a second pass folded ’ → '. Confusables now runs after demojize.
  - sanitize_user_input: an invisible or control character between combining
  marks (e.g. soft-hyphen, NUL) split a mark-run, so removing it after
  zalgo-capping merged runs that a second pass then capped differently. Bidi,
  zero-width, and control characters are now stripped before zalgo-capping.
- Build-time and doc corrections: build.rs now rejects malformed \u{…} escapes
in TSV data; embedded-dictionary parse errors are logged (not silently dropped...
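The NFKC retry described under #81 can be demonstrated with the standard library. This is a simplified sketch rather than the library's logic: to_ascii is a hypothetical function, the mapping table is a plain dict, and the retry only accepts an NFKC form that is already pure ASCII instead of feeding it back through the table.

```python
import unicodedata


def to_ascii(text, table=None, fallback="[?]"):
    """Map text to ASCII: table lookup first, then an NFKC retry, then fallback."""
    table = table or {}
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        elif ch in table:
            out.append(table[ch])
        else:
            folded = unicodedata.normalize("NFKC", ch)
            if folded != ch and folded.isascii():
                out.append(folded)        # compatibility form is plain ASCII
            else:
                out.append(fallback)      # nothing usable: error fallback
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii("𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛"))
    print(to_ascii("x\u00b2 \ufb01le"))
```

Characters with no ASCII compatibility form, such as emoji, still reach the fallback, which matches the "purely additive" behaviour described above.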
translit 0.5.0
This release sharpens what translit is: Unicode adversarial-text defense and canonicalization, powered by Rust — TR39 visual confusable mapping, homoglyph / bidi / zalgo / invisible-character stripping, and standards-based Latin/Cyrillic/Greek transliteration. It also adds context-aware transliteration for abjad scripts and fixes a long-standing Linux packaging bug.
Highlights
Adversarial-text defense, front and center. translit maps confusables by appearance (TR39: Cyrillic р → Latin p), the mapping that actually reverses a homoglyph attack — unlike unidecode/anyascii/ftfy, which map phonetically and can't. The new Adversarial-Text Defense guide covers the phonetic-vs-visual distinction and the XMR benchmark evidence.
from translit import strip_obfuscation, normalize_confusables, is_safe_hostname
strip_obfuscation("рroduсt") # → "product" (Cyrillic р→p, с→c via TR39)
normalize_confusables("раypal") # → "paypal"
safe, details = is_safe_hostname("аpple.com") # → (False, …) leading Cyrillic а
Context-aware transliteration for Arabic, Persian, and Hebrew. transliterate(text, context=True) uses dictionary-based vowel restoration (bigram → unigram → context-free) to produce readable romanization instead of consonant skeletons. Opt in with pip install translit-rs[arabic] / [hebrew] / [context].
Fixed
- Linux x86_64 wheels are now built as cp39-abi3. Earlier releases only shipped a cp38-cp38 x86_64 Linux wheel, forcing a source build (Rust toolchain) on Python 3.9+. pip install translit-rs now gets a prebuilt wheel on Linux x86_64 like every other platform. (#26)
- Documentation corrections (consistent language-profile count; verified homoglyph examples).
Security
- All third-party GitHub Actions pinned to commit SHAs across CI and the release pipeline; added Dependabot to keep them current. Dev/docs dependency bumps (Pygments 2.20.0, pytest 9.0.3).
Compatibility
No breaking changes. No public API, language codes, or script coverage were removed — translit-rs still has zero runtime dependencies. CJK/Indic/other scripts remain available as best-effort, unidecode-compatible coverage.
Install
pip install translit-rs
Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md
v0.4.0
Breaking changes
- Batch functions removed. transliterate_batch(), slugify_batch(), normalize_batch(), and strip_accents_batch() are gone. The base functions now accept both str and list[str] via @typing.overload:
  transliterate("café") # → "cafe"
  transliterate(["café", "naïve"]) # → ["cafe", "naive"]
- strip_obfuscation() no longer transliterates. Uses TR39 confusable mapping (visual similarity) instead of phonetic transliteration. lang= parameter removed. Chain with transliterate() if romanization is also needed.
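The str-or-list signature pattern can be sketched with typing.overload. The function below is a stand-in, not the library's transliterate: fold is a hypothetical name and it only strips combining accents via NFD, which is enough to show how one function serves both input shapes while type checkers still see precise return types.

```python
import unicodedata
from typing import overload


def _fold_one(text: str) -> str:
    """Strip combining accents from one string (NFD, then drop marks)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


@overload
def fold(text: str) -> str: ...
@overload
def fold(text: list[str]) -> list[str]: ...


def fold(text):
    """Accept a single string or a list of strings, mirroring the input shape."""
    if isinstance(text, str):
        return _fold_one(text)
    return [_fold_one(item) for item in text]


if __name__ == "__main__":
    print(fold("café"))
    print(fold(["café", "naïve"]))
```

The overload stubs exist only for static analysis; the final undecorated definition is the one that runs.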
New features
- strip_obfuscation() — maximum-strength deobfuscation preset. Resolves homoglyph spoofing (Cyrillic р→p, с→c), strips zalgo, invisible chars, bidi attacks, expands emoji.
- lang_info() / script_info() — structured metadata for all 83 languages and 57 scripts, with import-time drift assertions.
- 18 new languages (Balinese, Bamum, Buginese, Cherokee, Cham, Coptic, Tai Lue, Lisu, Meitei, Northern Thai, N'Ko, Santali, Sundanese, Syriac, Tai Le, Tagalog, Tamazight, Vai) and 10 new Script enum members.
Bug fixes
- Combining marks and zero-width characters no longer produce [?] (283 new TSV mappings)
- TextPipeline confusable ordering fixed (transliterate before confusables)
- demojize() spaces adjacent emoji replacements ("🔥🔥" → "fire fire")
- SCRIPT_RANGES sort order fix + invariant test
- Tibetan documentation corrected (Indic-phonetic, not Wylie)
Infrastructure
- API stability tests (133), mutation testing killers (92)
- CI restructured: 10× faster Python tests, path-filtered CodeQL, no duplicate runs
- Transliteration provenance documentation
- docs/index.md generated from README.md (single source of truth)
See CHANGELOG.md for full details.