Releases · raeq/disarm

11 Jun 14:26

raeq

v0.9.0

25d638d

Latest

The first release under the disarm name — the continuation of translit-rs (last released as 0.8.1). See #264 for the rename rationale.

disarm unifies the distribution and import names: pip install disarm then import disarm.

Breaking changes (migrating from `translit-rs`)

Install/import: pip install translit-rs → pip install disarm; import translit → import disarm.
Exception: the base exception TranslitError → DisarmError (the subclasses InvalidArgumentError / ResourceLimitError / UnsupportedError keep their names). DisarmError is still a ValueError subclass, so except ValueError keeps working.
Env var: context-dictionary path TRANSLIT_DICT_DIR → DISARM_DICT_DIR.
Native module translit._translit → disarm._disarm; console script translit → disarm.

The public transform API is otherwise unchanged: transliterate(), normalize(), slugify(), the security/pipeline helpers all keep their names and behaviour.
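
During a staged migration, some deployments may still export the old environment variable. The following is a minimal sketch, assuming a hypothetical helper named resolve_dict_dir (it is not part of disarm), that prefers DISARM_DICT_DIR and falls back to TRANSLIT_DICT_DIR:

```python
import os
from typing import Mapping, Optional


def resolve_dict_dir(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper: prefer the new variable, fall back to the old one."""
    env = os.environ if env is None else env
    # DISARM_DICT_DIR is the 0.9.0 name; TRANSLIT_DICT_DIR is the pre-rename name.
    return env.get("DISARM_DICT_DIR") or env.get("TRANSLIT_DICT_DIR")
```

A deployment script could export the returned path as DISARM_DICT_DIR before starting the application, then drop the old variable once every host is upgraded.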

Install: pip install disarm==0.9.0

Full changelog: https://github.com/raeq/disarm/blob/main/CHANGELOG.md

11 Jun 13:14

raeq

v0.8.1

778b969

v0.8.1 — honest benchmark numbers (final translit-rs release)

The final translit-rs release and the close of the 0.8 performance-hardening arc.

The project continues as disarm from 0.9.0 (#264). 0.8.1 ships honest, production-true benchmark numbers before the rename.

Changed

Fresh-string benchmark regime (#277, #302): every timed call now receives a newly constructed str, as production traffic does. The prior cached-object measurement let CPython's per-object AsUTF8 cache hide ~105–137 ns/call of UTF-8 encode cost that only translit pays (pure-Python comparators never call AsUTF8), which flattered translit. JSON records now carry regime: fresh-string/v2; pre-flip history uses the cached v1 regime, and numbers must not be compared across regimes.
README short-string figures updated to the measured fresh-regime values: ~17× vs Unidecode (Latin), ~14× (mixed scripts), ~13× (Cyrillic/Greek); ~65 ns ASCII passthrough; the four-cell Unidecode-own sweep still holds (~1.3× on Unidecode's strongest case to ~25×).
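
The difference between the two regimes comes down to how the timed inputs are built. The sketch below is not the project's benchmark harness; the helper names (fresh_inputs, cached_inputs, per_call_seconds) are assumptions made for illustration, and only the idea of handing each call a newly constructed str comes from the note above.

```python
import timeit
from typing import Callable, List


def fresh_inputs(text: str, n: int) -> List[str]:
    """Build n equal but distinct str objects (fresh-string regime)."""
    raw = text.encode("utf-8")
    # Decoding creates a new str each time for text longer than one character,
    # so no per-object cache from an earlier call can be reused.
    return [raw.decode("utf-8") for _ in range(n)]


def cached_inputs(text: str, n: int) -> List[str]:
    """Repeat the same str object n times (cached-object regime)."""
    return [text] * n


def per_call_seconds(fn: Callable[[str], object], inputs: List[str]) -> float:
    """Average seconds per call; both regimes pay the same iterator overhead."""
    it = iter(inputs)
    return timeit.timeit(lambda: fn(next(it)), number=len(inputs)) / len(inputs)
```

Timing the same function with both input lists shows how much a per-object cache hides; results from the two lists belong to different regimes and should be reported separately.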

Install: pip install translit-rs==0.8.1

Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md

10 Jun 23:44

raeq

v0.8.0

1776f89

v0.8.0 — performance & hardening

A performance and hardening release. The headline is a benchmark-gated optimisation programme (#233) that makes short-string transliterate roughly 15–21× faster than Unidecode (up from ~7–9×) and beats Unidecode on its own benchmark, while shrinking the library's static and resident memory.

Highlights

Faster per call: a transliterate call now crosses the Python→Rust boundary exactly once and returns already-ASCII input as the original str object — roughly 70 ns with no allocation. Short strings are ~15–21× faster than Unidecode, and translit wins all four cells of Unidecode's own benchmark (#277, #281).
Smaller footprint: the default BMP table is a page-table + interned-blob trie (~1 MB → ~58 KB), hanzi→pinyin a dense interned array (~600 KB → ~50 KB), Hangul a single packed blob (#237); context dictionaries are now zero-copy, roughly halving their resident memory (#238); replacement and slug scanning use Aho-Corasick automata and emoji match through a code-point trie (#242).
Security hardening: is_safe_hostname flags every mixed-script label (#254); security presets no longer synthesise path separators from confusables (#248); rag_ingest runs the confusables step (#258); the stateful slugifiers validate lang (#257).

Upgrade notes

Minimum Python is now 3.10 (was 3.9). The extension targets the stable-ABI floor abi3-py310 (#277); Python 3.9 wheels are no longer produced.
is_safe_hostname now flags every mixed-script label as unsafe (#254), not only the Latin-paired high-risk combinations. The check fails closed by design; to apply a more permissive policy, inspect the mixed_script / scripts fields.
Output may change for some inputs: the security-preset path-separator fix (#248), rag_ingest confusables canonicalisation (#258), stateful-slugifier lang validation (#257), and a few correctness edge cases (#249, #253, #255).
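
To make the mixed-script rule concrete, here is a rough standard-library sketch. It is not the library's implementation: label_scripts and mixed_script_labels are hypothetical names, and the first word of each character's Unicode name is used as a crude stand-in for the Script property, which the Python stdlib does not expose.

```python
import unicodedata


def label_scripts(label: str) -> set[str]:
    """Approximate the scripts used in one hostname label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # "LATIN", "CYRILLIC", "GREEK", ... taken from the character name.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_labels(hostname: str) -> list[str]:
    """Return every label that mixes more than one (approximate) script."""
    return [lbl for lbl in hostname.split(".") if len(label_scripts(lbl)) > 1]


# "\u0430" is CYRILLIC SMALL LETTER A, visually identical to Latin "a".
suspicious = mixed_script_labels("\u0430pple.com")
```

A fail-closed policy treats any non-empty result as unsafe; a more permissive policy would look at which scripts are mixed before deciding.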

See the full changelog for the complete list.

09 Jun 23:15

raeq

v0.7.0

7a14394

0.7.0

A feature and architecture release. Headlines: a unified, catchable exception
hierarchy; terminal column-width measurement (terminal_width /
grapheme_width); native errors="strict" transliteration; LLM/RAG
guardrail pipeline presets; and a substantial push of validation and
configuration logic down into the Rust core, so the upcoming multi-language
bindings inherit one behaviour instead of reimplementing it. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.

Upgrade notes

Exceptions now form a hierarchy. Every library error subclasses
TranslitError, with InvalidArgumentError, ResourceLimitError, and
UnsupportedError beneath it. TranslitError remains a ValueError
subclass, so existing except ValueError keeps working. Several error
message strings were enriched/standardised (#186, #187) — code matching
exact message text may need updating; code matching exception types is
unaffected.
lang= is validated even for ASCII input (#197). A binding-side ASCII
fast path previously skipped language validation, so
transliterate("abc", lang="zz") silently returned the input; it now raises
InvalidArgumentError, matching how non-ASCII input always behaved.
slugify_filename / Slugify(safe_chars=…) output corrected (see Fixed):
slugify_filename("My Report.pdf") now returns "My_Report.pdf", not
"My.Report_pdf". Output for inputs that use safe_chars may change.
New modes: errors="strict" for transliterate (#184) and
decode_to_utf8(strict=True) (#189).
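
The hierarchy described in the first note can be modelled with stand-in
classes. This sketch defines its own classes with the same names so that it
runs without the library; check_lang and its three-code set are hypothetical
and only mimic the "unknown lang raises InvalidArgumentError" behaviour.

```python
class TranslitError(ValueError):
    """Stand-in for the library's base exception (a ValueError subclass)."""


class InvalidArgumentError(TranslitError):
    """Stand-in: a bad argument value."""


class ResourceLimitError(TranslitError):
    """Stand-in: a size or resource bound was exceeded."""


class UnsupportedError(TranslitError):
    """Stand-in: the requested operation is not supported."""


def check_lang(lang: str) -> str:
    """Hypothetical validator; the valid set is illustrative only."""
    valid = {"de", "el", "ru"}
    if lang not in valid:
        raise InvalidArgumentError(f"unknown lang {lang!r}; valid: {sorted(valid)}")
    return lang


def caught_by_legacy_handler(lang: str) -> bool:
    """Pre-0.7.0 style handler: still catches the new, more specific error."""
    try:
        check_lang(lang)
    except ValueError:
        return True
    return False
```

New code can catch InvalidArgumentError (or TranslitError) directly, while
handlers written against ValueError keep working unchanged.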

Added

terminal_width / grapheme_width (#224): terminal column width per
grapheme cluster (UAX #11 East Asian Width). Wide/fullwidth and
emoji-presented clusters are 2 columns; combining marks, controls, and
zero-width characters are 0. Ambiguous characters are 1 by default, or 2 with
ambiguous_wide=True. Width data is generated at build time from the pinned
UCD (no runtime data, no unsafe). Measures cells, not pixels; tabs are not
expanded.
errors="strict" + find_untranslatable (#184): strict transliteration
raises on the first untranslatable character (reporting it and its byte
offset); find_untranslatable returns all of them without raising.
Guardrail pipeline presets (#139): TextPipeline gains strip_bidi and
strip_zalgo steps and the llm_guardrail / rag_ingest named profiles for
LLM/RAG input sanitisation.
get_pipeline / list_profiles (#229): the named policy-profile registry
now lives in the Rust core; the Python helpers are thin wrappers over it.
decode_to_utf8(strict=True) (#189): raise on lossy/replacement decoding
instead of silently substituting U+FFFD.
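
The width rules listed for terminal_width can be approximated with the
standard library. The sketch below is an assumption-laden simplification, not
the library's algorithm: approx_terminal_width is a hypothetical name, and it
works per code point, so it ignores grapheme clusters and emoji presentation
sequences that the real functions handle.

```python
import unicodedata


def approx_terminal_width(text: str, ambiguous_wide: bool = False) -> int:
    """Sum approximate column widths, one code point at a time."""
    width = 0
    for ch in text:
        category = unicodedata.category(ch)
        # Combining marks, controls (including tab) and format characters
        # such as zero-width space occupy no columns.
        if unicodedata.combining(ch) or category in ("Mn", "Me", "Cc", "Cf"):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            width += 2  # wide / fullwidth
        elif eaw == "A" and ambiguous_wide:
            width += 2  # ambiguous, treated as wide on request
        else:
            width += 1
    return width
```

Because tabs count as 0 columns here, as in the note above, callers that need
tab stops must expand them before measuring.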

Changed

Unified exception hierarchy (#183): the Python error surface is a
TranslitError base with categorised subclasses; sites that previously raised
bare ValueError are unified (foundation laid in 0.6.3 via #181).
Validation moved into the Rust core (#185, #217, #229, #230, #231): enum
validation, the transliterate() argument-conflict matrix, non-negative
max_length / max_graphemes checks, safe_chars, and min_confidence
range-checking now live in the core, so other bindings enforce the identical
contract without reimplementing it. The Python layer keeps only type guards.
Actionable error messages (#186, #187): weak messages now name the
offending value, list valid options, and suggest a "did you mean…?" where
applicable; message style is standardised across the surface.
Error cause chains (#188): wrapped errors surface the underlying cause via
__cause__ rather than flattening it into the message.
TextPipeline step ordering (#174) is derived from a single source of
truth, removing drift between configuration and execution order.
All-ASCII preset fast path (#198): presets skip the NFKC pass for pure-ASCII
input (behaviour-preserving).
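
The cause-chain behaviour (#188) corresponds to Python's standard
"raise ... from" mechanism. A minimal sketch, using a stand-in TranslitError
class and a hypothetical load_table function rather than any real library
call:

```python
class TranslitError(ValueError):
    """Stand-in for the library's base exception."""


def load_table(path: str) -> bytes:
    """Hypothetical loader that wraps low-level I/O failures."""
    try:
        with open(path, "rb") as fh:
            return fh.read()
    except OSError as exc:
        # "from exc" stores the original error on __cause__ instead of
        # flattening its text into the new message.
        raise TranslitError(f"cannot load table {path!r}") from exc
```

A caller can then inspect err.__cause__ to distinguish, for example, a
missing file from a permissions problem.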

Fixed

slugify_filename / Slugify(safe_chars=…) preserved safe characters at
the wrong positions: slugify_filename("My Report.pdf") returned
"My.Report_pdf" instead of "My_Report.pdf", the result awesome-slugify gives.
safe_chars are now handled natively in the Rust core: kept verbatim and
treated as word characters so they hold their position (#156, #230). The prior
test only covered a dot-free input, so the bug was uncaught; regression tests
now cover filenames with extensions, multiple dots, and UniqueSlugify +
max_length.
slugify(default=…) is now sanitised through the same slug pipeline (so a
caller-supplied fallback cannot smuggle path-traversal or URL metacharacters
into output documented as URL-safe), threads through the stateful Slugifier /
UniqueSlugifier forms, and a negative max_length now raises a catchable
InvalidArgumentError on both the scalar and batch paths instead of an
uncatchable OverflowError (#193, #169).
Low-severity hardening bundle (#200): eight small robustness fixes
(bounds, overflow, and edge-case handling) gathered into one pass.

Security

The RustSec advisory audit (cargo-audit) now blocks merge via the
required "Rust checks passed" gate on every PR — an advisory can land on a
dependency without any code change here (#195).

Removed

Docker image build/publish and its Trivy CVE scan (#138). translit is a
pip install-first library; previously published images remain as historical
artifacts, but no new ones are produced. Install the CLI via
pip install translit-rs.

Documentation

Executable cookbook (#154, #91, #140, #156, #172): a Sybil doc-test harness
with a CI gate, unidecode→translit migration recipes, an "LLM pipelines" page,
a tokenizer-preprocessing page, and an anti-rot lint that turned 307 decorative
# => claims into checked assertions.
normalize-first canonicalisation recipe (#174) and a formal-verification
assurance taxonomy (#223 — proof-by-exhaustion / structural / property-tested,
tagging each I1–I7 invariant), plus grapheme-integrity property tests (#174).
The project adopted the Developer Certificate of Origin (#165); all commits
are signed off. The custom-emoji-provider 9-codepoint window cap is now
documented (#199).

08 Jun 13:51

raeq

v0.6.3

9bb192e

v0.6.3

A correctness, maintenance, and architecture-foundation release. No output-affecting changes — behaviour-preserving throughout; the one new public behaviour (slugify(default=...)) is opt-in.

Highlights

Error-model foundation (#181): a pure-Rust Error enum + stable code() + a single From<Error> for PyErr boundary — decouples the core from PyO3 and lays the groundwork for the multi-language bindings roadmap.
slugify(text, default="…") — opt-in fallback for inputs that would slug to "" (#97).
Fixed: PRESETS["strip_obfuscation"] order (#141), lock-poison Python UserWarning (#117), docs/api/exceptions.md (#182).
Dependencies (migrated + verified behaviour-preserving): phf 0.13, criterion 0.8, chardetng 1.0 (#146 / #153 / #164).
Maintenance: __init__.py split (#73), build.rs language auto-discovery (#74), stub/binary drift-check (#76), integration-test split (#75), the "Conversations resolved" merge gate (#55), and a documented dependency-upgrade methodology.

Pre-release verified: full Tier-1 CI + Tier-3 exhaustive (all Hangul/BMP/CJK/Indic) + formal invariants I1–I7.

Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md

Correction: this release originally listed a Trivy image-scan fix (#138). That fix did not work, and the Docker build/publish pipeline has since been removed entirely.

pip install -U translit-rs

07 Jun 22:26

raeq

v0.6.2

e035402

v0.6.2 — correctness, security, performance & maintenance

A correctness, security, performance and maintenance release triaged from a
post-0.6.1 issue sweep (#101–#132). No public API removed; one small new public
behaviour (slugify(save_order=True) now functions). Two output-affecting
fixes — see Upgrade notes.

Upgrade notes (output-affecting)

slugify(save_order=True) was an accepted no-op; it now strips only
leading/trailing stopwords (preserving interior word order), matching
python-slugify (#118). If you passed save_order=True, slug output changes.
decode_to_utf8 default min_confidence 0.5 → 0.95 (#103). The old
default was inert (the detector only reports 0.50/0.95, and 0.50 < 0.50
is false), so it never rejected. It now requires high confidence by default;
pass min_confidence=0.0 to accept any guess. (No practical change today —
the detector currently always reports 0.95.)
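
The reason the old 0.5 default never rejected anything follows from the
strict comparison quoted above. A worked sketch, with accept_guess as a
hypothetical name for the gate:

```python
def accept_guess(confidence: float, min_confidence: float) -> bool:
    """A guess is rejected only when it is strictly below the threshold."""
    return not (confidence < min_confidence)


# The only two values the detector is described as reporting.
REPORTED = (0.50, 0.95)

old_default = [accept_guess(c, 0.5) for c in REPORTED]   # 0.5 never rejects
new_default = [accept_guess(c, 0.95) for c in REPORTED]  # 0.95 rejects 0.50
```

With the threshold at 0.5, both reportable values pass; at 0.95, only the
high-confidence value does.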

Fixed

#102 — UniqueSlugify no longer panics across the FFI boundary on a
multibyte separator + small max_length (byte slice landed mid-codepoint;
now uses floor_char_boundary).
#101 — context bigram disambiguation tier was unreachable (it reset on
every inter-word space); it now resets only on hard boundaries, so the tier
fires in normal prose.
#104 — set_emoji_provider now obeys seal_registrations() (the provider
swap previously defeated the seal).
#103 — decode_to_utf8 default confidence now actually gates (see notes).
#107 — a corrupt context dictionary now reports a distinct "corrupt" error
instead of the misleading "not found" remedy (DictState enum).
#121 — PRESETS["sanitize_user_input"] now reflects the real pipeline
order (strip invisibles before zalgo); Python registry and Rust doc aligned.
#129 — Text.transliterate() stub now declares the tones/context
parameters the implementation accepts.
#131 — Slugify(uids=...) emits a correct wrong-class warning rather than
a spurious deprecation warning.
#122 — disambiguated the _compat should_warn nested ternary.
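
The #102 fix concerns cutting a UTF-8 byte sequence without landing inside a
code point. The Python sketch below shows the same idea as Rust's
floor_char_boundary; truncate_utf8 is a hypothetical name and this is not the
library's code.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a code point."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max(max_bytes, 0)
    # Continuation bytes have the form 0b10xxxxxx. If the first excluded byte
    # is one, the cut is mid-code-point, so step back to the lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")
```

Slicing at the raw byte offset instead would fail to decode whenever the
offset falls inside a multibyte character, which is the failure mode the fix
removes.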

Security

#105 — added a cargo audit (RustSec advisory) CI job and a cargo
Dependabot ecosystem.
#132 — added a Trivy CVE scan of the published image to the release
workflow (SARIF → Security tab, fails on fixable HIGH/CRITICAL) + .trivyignore.
#106 — Rust diagnostics now route through Python warnings instead of
bare eprintln!, so applications can capture/suppress them.

Performance (output-preserving)

#108 codepoint-range diacritic checks in tokenize(); #109 mem::take
per token boundary; #110 single ch.nfkc() pass on the NFKC fallback;
#111 lowered MAX_CAPACITY_HINT 256 MiB → 8 MiB; #112/#113 emoji
matching uses stack buffers + a fixed sliding window (no per-char Vec/String);
#114 slugify uses Cow (no eager to_owned); #115 context tokenize()
returns borrowed (Cow) slices of the input — zero per-token allocation
(Rust API: the crate-internal context::Token.text changed from String
to Cow<'_, str>; no effect on the Python API); #116 clamped the
ContextDict capacity hint.

Maintenance

#118 implemented slugify(save_order=True); #119 SlugConfig::from_pyargs
dedupes the four slugify PyO3 entrypoints; #120 _build_slug_kwargs helper;
#123 seal-enforcement docs on each tables:: mutator; #124
infallibility comments; #125 typed _CallableModule.__call__ kwargs;
#126 corrected recover_lock doc; #127 documented the lazy-import
workaround; #128 renamed _mutation_generation → _registration_generation;
#130 annotated the defence-in-depth conflict check.

07 Jun 17:40

raeq

v0.6.1

97743a0

v0.6.1 — bug-fix, correctness & performance

A bug-fix and test-hardening release. No public API was removed and no new
public names were added. One fix changes key output for inputs containing
invisible characters — see Upgrade notes.

Upgrade notes (output-affecting fix)

search_key / catalog_key / sort_key now strip bidi overrides and
soft-hyphen / format characters (#93). Previously a value stored with an
invisible character (e.g. "password", "user‮txt") produced a
different key from its clean equivalent, so dedup and lookup silently
missed. The new key is the correct one; if you persist these keys, regenerate
any that were computed over text that could contain invisible characters.
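
The effect of the fix can be illustrated with a simplified key function. This
is a sketch only: sketch_key is a hypothetical name, and the real key
functions do more (for example, folding accents via transliteration). It
applies NFKC, drops format (Cf) characters such as soft hyphens and bidi
overrides, then casefolds.

```python
import unicodedata


def sketch_key(text: str) -> str:
    """Illustrative lookup key: NFKC, strip format characters, casefold."""
    normalized = unicodedata.normalize("NFKC", text)
    # Category Cf covers soft hyphen (U+00AD) and bidi controls (e.g. U+202E).
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.casefold()


clean = sketch_key("password")
with_soft_hyphen = sketch_key("pass\u00adword")
with_bidi_override = sketch_key("user\u202etxt")
```

Without the stripping step, the soft-hyphen variant would produce a different
key from the clean value, which is the dedup and lookup miss described above.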

Fixed

#93 — key functions (search_key/catalog_key/sort_key) leaked bidi
and soft-hyphen characters, so visually-identical inputs produced
non-colliding keys. They now strip_bidi after NFKC, matching the other
canonicalization presets.
#82 — Greek reverse transliteration (transliterate(text, target="el"))
left literal Latin letters in the output ("psychi" → "ψyχη"). The forward
direction romanizes Υ/υ as Y/y (including the ου/αυ/ευ diphthongs), so the
el reverse table now maps Y/y back to Greek; round-trips no longer leak
Latin letters.
#69 — transliterate() resolved conflicting kwargs differently for str
vs list input (one path silently dropped target, the other context).
Conflicts are now checked once, before the dispatch, so both raise identically:
context+target and context+tones raise ValueError.
#72 — translit.unidecode() now mirrors the Unidecode 1.3 signature
unidecode(string, errors="ignore", replace_str="?"), mapping Unidecode's
errors modes (ignore/replace/preserve/strict) onto the native error
handling, instead of raising TypeError on those kwargs.
#95 — Greek Extended polytonic capitals for omicron/upsilon/omega/rho
were corrupted, emitting unrelated Latin letters (Ὅμηρος → Xmiros,
Ὑγίεια → Pgieia). Corrected all 50 affected entries to the proper base
romanization, consistent with the monotonic forms (Ὅμηρος → Omiros).
#99.3 — a typo'd form=/errors= value now raises even for pure-ASCII
input. Previously the ASCII fast-path returned before reaching Rust, so the
bad enum silently no-opped on ASCII and only raised on the first non-ASCII
string. Validation now runs before the fast-path in normalize() and
transliterate().

Performance

#70 — the batch entry points (transliterate, slugify, normalize,
strip_accents on list[str]) now release the GIL around their pure-Rust
compute loop via py.allow_threads. Multi-threaded callers processing large
batches now get real parallelism (~1.8× wall-clock with two threads) instead
of serialising on the interpreter lock. Output is unchanged. Documented in the
new "Concurrency (GIL)" section of docs/performance.md.

Documentation

#94 — strict_iso9 is no longer described as "ISO 9:1995". It emits ASCII
digraphs (ж→zh, ч→ch, ш→sh), not the standard's diacritics (ž/č/š) — translit
tables are ASCII-only by design. Docstrings, the data-file header, and the docs
now describe it as a scholarly ASCII (ISO 9-style) transliteration and warn it
is not ISO 9-conformant. No behavior change.
#98 — docs/user-guide/transliteration.md no longer instructs users to
pip install translit-rs[arabic|hebrew|context] (those empty extras were
removed in 0.6.0); it now documents the bootstrap_dicts.sh / TRANSLIT_DICT_DIR
path, matching the README and the runtime error message.
#99.1 / #99.2 — fixed two false docstrings: sort_key no longer claims to
preserve accents (it folds them via transliteration, coinciding with
search_key), and slugify no longer documents a pretranslate kwarg it
never had.
#84 — corrected the README throughput table (Cyrillic ~106M chars/sec,
slugify ~712K slugs/sec on commodity 4-vCPU hardware) and added a
hardware/methodology footnote; added a matching variance note to
docs/performance.md.
#77 — fixed the Text fluent-builder docstring example (normalize is
keyword-only: .normalize(form="NFC")), reconciled the language-profile count
(README now agrees with the docs at 83), and documented the context kwarg in
the transliterate() docstring.

Internal / tests

#78 — added adversarial coverage for the raw-bytes decode path
(detect_encoding / decode_to_utf8): deterministic hostile-byte cases in
CI plus a Hypothesis st.binary() fuzz suite proving no-panic and
invariant-preservation. Documented in THREAT_MODEL.md that the decode path
has no input-size cap (caller's responsibility, per the 0.6.0 cap removal).
#79 — added a single-vs-batch kwarg parity regression test across the full
kwarg matrix and a multi-script corpus (the tones batch drop fixed in 0.6.0
can no longer recur silently).

07 Jun 14:11

raeq

v0.6.0

dd25cf8

v0.6.0 — security hardening

[0.6.0] — 2026-06-07

A hardening and bug-fix release. Two new opt-in helpers (dedup_batch,
make_cached_transliterator) make this a minor bump; no public API was
removed. Several fixes change output for specific inputs — read Upgrade
notes before upgrading if you cache or persist transliterator/normalizer output.

Upgrade notes (output-affecting fixes)

Each of these was a bug; the new output is the correct one. If you store or cache
results that were keyed on the old (buggy) behaviour, regenerate them:

register_replacements() now actually applies. It was a silent no-op — the
registered table was never consulted. Registered replacements now take effect
across transliterate() (scalar, list, and context=True). If you registered
replacements and (knowingly or not) relied on them being ignored, output changes.
transliterate(list, tones=True) now returns toned pinyin (was silently
toneless on the list path); transliterate(list, target=…, tones=True) now
raises ValueError for the forward-only parameter (was silently ignored).
normalize_confusables(text, target="cyrillic") no longer maps characters
onto invisible combining marks (28 such mappings removed).
strip_obfuscation now folds intra-Latin ASCII homoglyphs (þ→p, ſ→f,
ı→i, …) and is idempotent; sanitize_user_input is idempotent for
control/invisible characters between combining marks; demojize no longer
inserts a stray space after a tab/newline that precedes an emoji.
Context-aware transliteration (context=True, ar/fa/he) distribution
changed. The empty arabic/hebrew/context pip extras have been removed
(they never installed anything). The ~37 MB dictionaries are no longer tracked
in git, and are not shipped in the wheel. Context mode now loads dictionaries
from $TRANSLIT_DICT_DIR (build them with scripts/bootstrap_dicts.sh), or use
the embed-dicts Cargo feature for a self-contained build. A packaged
pip-installable distribution is tracked in #56/#60.
decode_to_utf8 default min_confidence changed 0.0 → 0.5. Low-confidence
encoding guesses are now rejected by default instead of silently accepted; pass
min_confidence=0.0 to restore the old behaviour. (#66)
Unknown lang codes now raise instead of silently falling back (#68). A
typo'd code (lang="RU", lang="russian") used to behave exactly like
lang=None — quietly-wrong output — while errors=/form= rejected bad
values. transliterate, slugify, sanitize_filename, catalog_key,
search_key, sort_key, and ml_normalize now raise TranslitError listing
the valid codes. "auto", the nb/nn/da aliases, and register_lang()
codes are accepted. (target= already validated.)

Changed

No library-imposed input-size limit (#80, #65). The 10 MiB input cap on
transliterate, normalize, fold_case, and the preset pipelines has been
removed — it was paternalistic, inconsistently applied (the ASCII fast
path bypassed it; slugify/normalize_confusables/strip_zalgo never had it),
and the threat model already disclaims DoS. All operations are linear time and
memory; bounding untrusted input is the caller's responsibility, documented
in the threat model and docstrings. The single retained size guard is the
register_replacements output amplification bound (a tiny input can expand to
an enormous string via a caller-registered value — an amplification a caller's
own input check cannot foresee). Backward-compatible: only previously-rejected
large inputs now succeed.
External wording: capability, not promise. Security-relevant features are now
described as mechanisms (TR39 confusable mapping, bidi/zalgo stripping, hostname
analysis) rather than outcome guarantees. Package descriptions, README, and docs no
longer claim to "prevent"/"neutralize" attacks or achieve "perfect" recovery; the XMR
benchmark figure is always stated with its tested-pairs scope. Engineering rigor is held
to a high internal bar (see below); the external surface promises nothing it cannot
measure.

Added

dedup_batch(texts, …) — transliterate a list, processing each distinct
value once and mapping back (large win for repeated/categorical data; ~146× on a
high-locality column). Stateless — no cache to invalidate; unique values are chunked
at the 100k batch cap. (#31)
make_cached_transliterator(maxsize=…, …) — opt-in LRU-cached single-string
transliterator with options fixed at construction. Self-invalidating: the next
call after any register_lang/register_replacements/remove_replacement/
clear_replacements clears the cache (via an internal table-generation counter), so
it never serves stale results. Never enabled by default. (#31)
THREAT_MODEL.md — defines in-scope mechanisms, explicit out-of-scope items
(confusables outside the bundled TR39 table, whole-script and multi-character
confusables, Unicode-version skew, semantic attacks, DoS), and a vulnerability-vs-
known-limitation policy, grounded in the literature (Holgers 2006, Deng 2020,
BitAbuse 2025).
SECURITY.md rewritten on real footing: supported-version policy stated, triage
scope defined, and linked to the threat model.
Security-invariant property tests + fuzzing. proptest invariants in Rust
(src/presets.rs) assert no-panic, idempotence, and "no bidi/format control
survives" for strip_obfuscation / security_clean / sanitize_user_input /
strip_bidi across the Unicode input space; a deterministic, CI-gating
adversarial attack-corpus regression (tests/test_attack_corpus.py:
homoglyph / zalgo / invisible / bidi / combined, XMR-style); and a cargo-fuzz
harness (fuzz/) for continuous coverage-guided fuzzing of the defense
pipelines.
Confusable coverage for intra-Latin homoglyphs of basic ASCII letters
(e.g. þ→p, ſ→f, ı→i, ƒ→f, Ɩ→l, ꜱ→s). The TR39 generator previously
skipped all Latin-script sources for the Latin target, dropping ~83 genuine
homoglyphs of A–Z/a–z; normalize_confusables/strip_obfuscation now fold
them. Single-letter Latin confusable coverage of UTS#39 is now complete.
Pinned data/confusables.txt (UTS#39 17.0.0) as the reproducible, version-
controlled input for scripts/gen_confusables.py (--download refreshes it),
and a tests/test_confusable_coverage.py gate against Unicode-version drift.
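
The idea behind dedup_batch (process each distinct value once, then map the
results back) can be sketched in a few lines. dedup_apply is a hypothetical
name, and the sketch omits the chunking at the 100k batch cap mentioned above.

```python
from typing import Callable, Iterable, List


def dedup_apply(texts: Iterable[str], fn: Callable[[str], str]) -> List[str]:
    """Apply fn once per distinct value and return results in input order."""
    items = list(texts)
    # dict.fromkeys keeps first-seen order while collapsing duplicates.
    results = {value: fn(value) for value in dict.fromkeys(items)}
    return [results[value] for value in items]
```

The approach is stateless: nothing survives the call, so there is no cache to
invalidate. The gain depends on how repetitive the column is.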

Fixed

register_replacements() was a silent no-op — the global table was stored
but never consulted by transliterate(). It now applies as a longest-match
pre-pass (no cascade) across the scalar, list, and context=True forward paths,
including ASCII-keyed replacements that previously bypassed Rust via the Python
fast path. (#51)
tones= on the list/batch path was dropped: transliterate(["北京"], tones=True) returned toneless pinyin while the scalar path returned toned, and
transliterate([...], target=…, tones=True) silently ignored the forward-only
parameter instead of raising. Both now match the scalar path. (#14, #15)
normalize_confusables(target="cyrillic") emitted invisible combining marks —
28 mappings folded a visible character onto a combining Cyrillic-Extended mark (an
obfuscation vector). The generator now excludes combining-mark targets. (#24)
script_info("CanadianAboriginal")["context_aware"] raised KeyError — the
entry omitted a required ScriptMeta field; a completeness guard now prevents
recurrence. (#18)
Context path skipped strict_iso9/gost7034 mutual-exclusion validation —
transliterate(text, context=True, strict_iso9=True, gost7034=True) now raises
ValueError like the non-context path; the missing-dictionary error hint is now
language-specific (he→hebrew). (#18)
demojize inserted a stray space after a tab/newline preceding an emoji
("a\t😀" → "a\t grinning face"); it now checks for any whitespace. (#12)
Compatibility digit variants fold to digits, not letters (#89). The
confusables table mapped Mathematical Alphanumeric digits 𝟎/𝟏 (and the
other four families, plus superscripts) to the look-alike letters O/l, so
normalize_confusables("𝟏𝟎") gave "lO" and strip_obfuscation corrupted
digit runs. The generator now folds any character whose NFKC form is an ASCII
digit to that digit. They remain detected as confusable (is_confusable),
but canonicalize to the correct number. (ASCII 0/1 were already unaffected.)
NFKC-compatible Latin is recovered instead of dropped to [?] (#81).
Mathematical Alphanumeric Symbols (𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛 → Hello 123), presentation
ligatures (ﬁ/ﬂ → fi/fl), and superscripts (x² → x2) now
transliterate: an unmapped non-ASCII char is NFKC-decomposed and re-tried
before the error fallback. This matches unidecode/anyascii and closes a
filter-evasion ("fancy text") gap. Purely additive — only chars that were
previously [?] are affected; emoji (no ASCII decomposition) still map to [?].
Defense pipelines are now idempotent (bugs found by the property tests):
- strip_obfuscation: emoji whose CLDR name contains typographic punctuation
  (e.g. 👒 → woman’s hat, U+2019 ’) weren't folded because confusables ran
  before demojize; a second pass folded ’→'. Confusables now runs after demojize.
- sanitize_user_input: an invisible or control character between combining
  marks (e.g. soft-hyphen, NUL) split a mark-run, so removing it after
  zalgo-capping merged runs that a second pass then capped differently. Bidi,
  zero-width, and control characters are now stripped before zalgo-capping.
Build-time and doc corrections: build.rs now rejects malformed \u{…} escapes
in TSV data; embedded-dictionary parse errors are logged (not silently dropped...

06 Jun 13:14

raeq

v0.5.0

12c7417

translit 0.5.0

This release sharpens what translit is: Unicode adversarial-text defense and canonicalization, powered by Rust — TR39 visual confusable mapping, homoglyph / bidi / zalgo / invisible-character stripping, and standards-based Latin/Cyrillic/Greek transliteration. It also adds context-aware transliteration for abjad scripts and fixes a long-standing Linux packaging bug.

Highlights

Adversarial-text defense, front and center. translit maps confusables by appearance (TR39: Cyrillic р → Latin p), the mapping that actually reverses a homoglyph attack — unlike unidecode/anyascii/ftfy, which map phonetically and can't. The new Adversarial-Text Defense guide covers the phonetic-vs-visual distinction and the XMR benchmark evidence.

```
from translit import strip_obfuscation, normalize_confusables, is_safe_hostname

strip_obfuscation("рroduсt")          # → "product"   (Cyrillic р→p, с→c via TR39)
normalize_confusables("раypal")        # → "paypal"
safe, details = is_safe_hostname("аpple.com")   # → (False, …)  leading Cyrillic а
```

Context-aware transliteration for Arabic, Persian, and Hebrew. transliterate(text, context=True) uses dictionary-based vowel restoration (bigram → unigram → context-free) to produce readable romanization instead of consonant skeletons. Opt in with pip install translit-rs[arabic] / [hebrew] / [context].

Fixed

Linux x86_64 wheels are now built as cp39-abi3. Earlier releases only shipped a cp38-cp38 x86_64 Linux wheel, forcing a source build (Rust toolchain) on Python 3.9+. pip install translit-rs now gets a prebuilt wheel on Linux x86_64 like every other platform. (#26)
Documentation corrections (consistent language-profile count; verified homoglyph examples).

Security

All third-party GitHub Actions pinned to commit SHAs across CI and the release pipeline; added Dependabot to keep them current. Dev/docs dependency bumps (Pygments 2.20.0, pytest 9.0.3).

Compatibility

No breaking changes. No public API, language codes, or script coverage were removed — translit-rs still has zero runtime dependencies. CJK/Indic/other scripts remain available as best-effort, unidecode-compatible coverage.

Install

pip install translit-rs

Full changelog: https://github.com/raeq/translit/blob/main/CHANGELOG.md

29 Mar 19:43

raeq

v0.4.0

dc54260

v0.4.0

Breaking changes

Batch functions removed. transliterate_batch(), slugify_batch(), normalize_batch(), and strip_accents_batch() are gone. The base functions now accept both str and list[str] via @typing.overload:
```
transliterate("café")              # → "cafe"
transliterate(["café", "naïve"])   # → ["cafe", "naive"]
```
strip_obfuscation() no longer transliterates. Uses TR39 confusable mapping (visual similarity) instead of phonetic transliteration. lang= parameter removed. Chain with transliterate() if romanization is also needed.
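
The str / list[str] dispatch via @typing.overload mentioned in the first item can be sketched as follows. strip_marks is a hypothetical stand-in transform (it only removes combining marks after NFD decomposition); it is not how transliterate() is implemented.

```python
import unicodedata
from typing import overload


@overload
def strip_marks(text: str) -> str: ...
@overload
def strip_marks(text: list[str]) -> list[str]: ...


def strip_marks(text):
    """Accept one string or a list of strings and return the same shape."""
    if isinstance(text, list):
        return [strip_marks(item) for item in text]
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks so "é" (e + U+0301) becomes "e".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

The overloads let a type checker infer str for a str argument and list[str] for a list, while a single runtime function handles both.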

New features

strip_obfuscation() — maximum-strength deobfuscation preset. Resolves homoglyph spoofing (Cyrillic р→p, с→c), strips zalgo, invisible chars, bidi attacks, expands emoji.
lang_info() / script_info() — structured metadata for all 83 languages and 57 scripts, with import-time drift assertions.
18 new languages (Balinese, Bamum, Buginese, Cherokee, Cham, Coptic, Tai Lue, Lisu, Meitei, Northern Thai, N'Ko, Santali, Sundanese, Syriac, Tai Le, Tagalog, Tamazight, Vai) and 10 new Script enum members.

Bug fixes

Combining marks and zero-width characters no longer produce [?] (283 new TSV mappings)
TextPipeline confusable ordering fixed (transliterate before confusables)
demojize() spaces adjacent emoji replacements ("🔥🔥" → "fire fire")
SCRIPT_RANGES sort order fix + invariant test
Tibetan documentation corrected (Indic-phonetic, not Wylie)

Infrastructure

API stability tests (133), mutation-killing tests (92)
CI restructured: 10× faster Python tests, path-filtered CodeQL, no duplicate runs
Transliteration provenance documentation
docs/index.md generated from README.md (single source of truth)

See CHANGELOG.md for full details.
