Skip to content

v0.4.0

Choose a tag to compare

@shaik-abdul-thouhid shaik-abdul-thouhid released this 11 Jun 14:22
· 12 commits to main since this release

[0.4.0] - 2026-06-11

Changed

  • The CodePoint contract is now documented as such, on the type itself
    (encoding.CodePoint) and in the README: values of this type are presumed
    valid Unicode scalars; producers uphold the contract, consumers rely on it
    and skip decoding/validation (which is what makes the []const CodePoint
    API variants the cheap path for already-decoded text). The README quick-look
    examples were also brought back in sync with the real signatures
    (initUTF8View out-param, view.iter(), nfc over code points).
  • Breaking: "unchecked" now means one thing everywhere — the caller
    guarantees the documented preconditions; violations are asserted /
    safety-checked (trap in Debug/ReleaseSafe, undefined in
    ReleaseFast/ReleaseSmall); unchecked functions never return errors or
    panic.
    Accordingly:
    • encoding.utf8.codePointLenReverseUnchecked returns plain u3
      (was UTF8ValidationError!u3).
    • encoding.utf16.utf16SequenceLenReverseUnchecked returns plain u2
      (was UTF16ValidationError!u2) and asserts end_index < buf.len
      instead of returning ZeroLengthUnits.
    • encoding.utf8.decodeCodePointReverseUnchecked no longer documents (or
      contains) a panic path; its preconditions are asserted.

Removed

  • Breaking: the Undefined member is gone from all six UTF-16/UTF-32
    error sets (UTF16ValidationError, UTF16ValidationLossyError,
    UTF16EncodeError, UTF32ValidationError, UTF32ValidationLossyError,
    UTF32EncodeError). It was never returned by any code path; exhaustive
    switches over these error sets can drop their dead error.Undefined arm.

Fixed

  • unicode.segmentation byte-level iterators (grapheme / word / sentence /
    line) and step helpers no longer contain catch @panic shims around lossy
    decoding. They decode through encoding.utf8.decodeCodePointLossy, so the
    "lossy never errors" promise now holds structurally: malformed UTF-8 yields
    U+FFFD segments and can never trap. lineStepBytes documents its
    byte_pos < bytes.len contract (asserted, safety-checked).

Added

  • Early-exit collation compare (@stable-since: v0.4.0):
    Collator.compareCodePointsIncremental / compareUtf8Incremental generate
    collation elements lazily and stop at the first differing weight of the
    shallowest differing level — no sort keys are materialized. Identical
    results to compareCodePoints / compareUtf8 for every input and option
    set (verified across the strength × variable-weighting matrix). Strings that
    differ early at the primary level — the common case — pay for only a few
    collation elements; the weighting logic is shared with buildKey via an
    internal resolveCE so the two paths cannot diverge.
  • BOM utilities — new encoding.bom module (@stable-since: v0.4.0):
    Bom enum (utf8 / utf16_le / utf16_be / utf32_le / utf32_be) with
    bytes / len / endian / match, plus detect (longest-match: the
    ambiguous FF FE 00 00 reports UTF-32 LE) and zero-copy strip. The
    codecs themselves still never consume or produce BOMs; this is the
    explicit seam, re-exported at the package root as bom.
  • unicode.normalization.nfkcCaseFold — sequence-level NFKC_Casefold
    (@stable-since: v0.4.0), per the UCD definition
    NFKC_CF(X) = NFC(toNFKC_Casefold(X)), built on the already-shipped
    per-scalar nfkcCaseFoldMap table. The identifier-caseless form used by
    UAX #31 and security profiles: fullwidth compatibility variants fold
    (ABCabc), Default_Ignorables map away, and the result is
    idempotent (verified by a BMP-wide sweep).
  • String-level titlecase in unicode.casing
    (@stable-since: v0.4.0): titlecaseAlloc ([]const CodePoint) and
    titlecaseUtf8Alloc (UTF-8), implementing the Unicode default algorithm
    (R3) — UAX #29 word segmentation, full titlecase mapping on each word's
    first cased scalar, full lowercase on the rest, with Final_Sigma context
    for U+03A3 ("ΜΕΓΑΣ""Μεγας"). Default root-locale mappings; no
    Turkic/Lithuanian tailoring.
  • Case-insensitive search in unicode.casing
    (@stable-since: v0.4.0): indexOfFold / containsFold over UTF-8 bytes
    and indexOfFoldCodePoints / containsFoldCodePoints over scalar slices.
    Both sides fold lazily during the scan — no allocation — honoring expanding
    folds in .full mode ("STRASSE" matches "Straße"), with a whole-scalar
    boundary rule (needle "s" never matches inside "ß"'s expansion). The
    CodePoint variants skip decoding/validation entirely per the CodePoint
    contract.
  • SIMD chunked scanners for UTF-16, mirroring the v0.3.0 UTF-8 set
    (@stable-since: v0.4.0, portable @Vector compares with scalar tails, no
    target intrinsics): utf16.nonSurrogateRunLength (length of the leading
    run of standalone scalars — the UTF-16 analogue of asciiRunLength) and
    utf16.countScalarsSimd (unchecked scalar count via the high-surrogate
    rule). utf16.validate now skips surrogate-free runs in SIMD strides and
    falls back to scalar pair checks only at actual surrogates.
  • Bulk encode-direction APIs taking []const CodePoint
    (@stable-since: v0.4.0): encodeCodePoints{Len,Buffer,Alloc} on all three
    codecs, the inverse of bytesToUTF8String / bufToUTF16String /
    bufToUTF32String. Callers holding already-decoded scalars encode without
    any decoding or validation, per the CodePoint contract.
  • encoding.utf8.StreamingValidator — incremental, resumable validation over
    arbitrarily-chunked input (@stable-since: v0.4.0). The Höhrmann DFA state
    carries across chunk boundaries (no buffering, no copies); update reports
    the absolute offset of the first malformed sequence, finish distinguishes
    a truncated trailing scalar from valid end-of-input, and ASCII runs are
    skipped in SIMD strides.
  • Error position reporting (@stable-since: v0.4.0): invalidIndex on
    all three codecs returns the unit/byte offset where the first malformed
    sequence starts (null when valid), so diagnostics no longer require a
    re-scan. Decoding strictly at the reported offset recovers the fine-grained
    error. The UTF-8 variant skips ASCII runs in SIMD strides.
  • Unchecked forward decode entry points, completing the strict / unchecked
    / lossy matrix in the forward direction (@stable-since: v0.4.0): callers
    holding already-validated text can decode without re-validating and without
    wrapping the input in a View.
    • encoding.utf8.decodeCodePointUnchecked
    • encoding.utf16.decodeU16CodePointUnchecked
    • encoding.utf32.decodeU32CodePointUnchecked
  • encoding.utf8.decodeCodePointLossy — infallible lossy decode primitive
    (@stable-since: v0.4.0). Malformed sequences yield U+FFFD and are never
    reported as errors; the only precondition (offset < bytes.len) is asserted
    (safety-checked), not error-returned, and the returned len is always >= 1
    so forward scans are guaranteed to make progress. UTF8LossyIterator and
    UTF8SimdLossyIterator now decode through it, removing their internal
    catch unreachable/catch break shims.