v0.3.0

@shaik-abdul-thouhid released this 09 Jun 18:14
· 30 commits to main since this release

[0.3.0] - 2026-06-09

Added

  • SIMD chunked scanners in encoding.utf8 (additive — every pre-existing API
    keeps its exact signature and behaviour). All are portable @Vector compares
    and reductions with a scalar tail (no target intrinsics, no dynamic shuffles),
    striding std.simd.suggestVectorLength(u8) bytes at a time. @stable-since: v0.3.0:
    • asciiRunLength — length of the leading ASCII run (<= 0x7F), the shared
      primitive behind the others and usable directly for an ASCII fast path.
    • countScalarsSimd — unchecked scalar count via the non-continuation-byte
      rule ((b & 0xC0) != 0x80); equals countScalars on valid input.
    • simdLossyIterator / UTF8SimdLossyIterator — a buffered lossy decode
      iterator that widens ASCII runs in bulk; output is identical to
      lossyIterator (malformed → U+FFFD, orphaned continuation runs collapse to a
      single replacement).
  • Enumerable code-point range tables for Unicode properties, so consumers
    can resolve property classes into sorted ranges at comptime (the per-code-point
    page tables cannot be enumerated without walking all 1.1M code points). New
    zig build generate-ranges step (no network; reuses the committed page tables)
    emits:
    • properties.category_runs (CategoryRun{ start, end, category }) — a full
      partition of 0..=0x10FFFF by General_Category, including unassigned runs.
    • properties.derived_runs (DerivedRun{ start, end, mask }) —
      DerivedCoreProperties runs keyed by the same bitmask as derivedPropertyMask.
    • properties.white_space_ranges and properties.join_control_ranges
      (CodePointRange{ start, end }) — PropList bases for \s and \w.
    • scripts.script_runs (ScriptRun{ start, end, script }) — Script runs for
      assigned code points.
  • properties.isWord — Perl \w / word-boundary predicate
    (Alphabetic ∪ Mark ∪ Decimal_Number ∪ Connector_Punctuation ∪ Join_Control).
    Resolved from the enumerable range tables (with an ASCII fast path), not the
    per-code-point page tries, so a consumer that needs only isWord never links
    the page tables. @stable-since: v0.3.0.
  • Range-table-backed per-code-point queries — equivalent to the page-table
    predicates but linking only the enumerable range tables (no two-level page
    tries), so a size-sensitive consumer can drop the tries entirely. Each is
    proven equal to its page-table twin for every code point. @stable-since: v0.3.0:
    • properties.categoryFromRuns — General_Category via binary search over
      category_runs (twin of generalCategory).
    • properties.derivedMaskFromRuns — DerivedCoreProperties bitmask via binary
      search over derived_runs (twin of derivedPropertyMask).
    • properties.isIdentifierStartByRanges / isIdentifierContinueByRanges —
      twins of isIdentifierStart / isIdentifierContinue.
  • A dedicated unicode.emoji module for the UTS #51 emoji character
    properties (emoji-data.txt), promoting the six emoji predicates out of
    unicode.segmentation into a first-class property module alongside scripts,
    blocks, etc. The generated page/range tables (emoji.generated, regenerated
    by zig build generate) now live under unicode/emoji/generated/. All
    @stable-since: v0.3.0:
    • Per-code-point predicates emoji.isEmoji, isEmojiPresentation,
      isEmojiModifier, isEmojiModifierBase, isEmojiComponent, and
      isExtendedPictographic (also surfaced as unicode.isEmoji, … ).
    • emoji.EmojiProperty (enum of the six properties), emoji.EmojiProperties
      (a packed struct of all six bools with .any()), emoji.emojiProperties
      (resolve all six at once), emoji.hasEmojiProperty (runtime-selected
      dispatch), and emoji.hasAnyEmojiProperty.
    • Enumerable code-point range tables so consumers can resolve \p{Emoji},
      \p{Extended_Pictographic}, etc. into sorted ranges at comptime without
      walking all 1.1M code points (same rationale as scripts.script_runs).
      Emitted by an extended zig build generate-ranges into
      unicode/emoji/generated/emoji_ranges.zig and re-exported as
      emoji.emoji_ranges, emoji.emoji_presentation_ranges,
      emoji.emoji_modifier_ranges, emoji.emoji_modifier_base_ranges,
      emoji.emoji_component_ranges, and emoji.extended_pictographic_ranges
      (EmojiRange{ start, end }), with emoji.rangesFor(property) for
      runtime selection. Each table is verified by a test to enumerate exactly
      its predicate over the whole code space.
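
The range-table lookups listed above (categoryFromRuns, derivedMaskFromRuns) are described as binary searches over sorted runs. A minimal sketch of that technique follows. The Run type, its field types, the toy table, and categoryFromRunsSketch are all hypothetical illustrations, not the library's declarations or real Unicode data.

```zig
const std = @import("std");

// Hypothetical run type modelled on the CategoryRun{ start, end, category }
// shape named in the notes; the field types here are assumptions.
const Run = struct { start: u21, end: u21, category: u8 };

// Toy partition of 0..=0x10FFFF (NOT real Unicode data): sorted, contiguous,
// inclusive ranges, which is the property binary search relies on.
const toy_runs = [_]Run{
    .{ .start = 0x000000, .end = 0x00002F, .category = 0 },
    .{ .start = 0x000030, .end = 0x000039, .category = 1 },
    .{ .start = 0x00003A, .end = 0x10FFFF, .category = 2 },
};

// Returns the category of the run containing `cp`, or null if no run covers it
// (which cannot happen for a full partition, but can for sparse tables).
fn categoryFromRunsSketch(runs: []const Run, cp: u21) ?u8 {
    var lo: usize = 0;
    var hi: usize = runs.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const run = runs[mid];
        if (cp < run.start) {
            hi = mid; // target lies in an earlier run
        } else if (cp > run.end) {
            lo = mid + 1; // target lies in a later run
        } else {
            return run.category; // start <= cp <= end
        }
    }
    return null;
}

pub fn main() void {
    const cat = categoryFromRunsSketch(&toy_runs, '5') orelse return;
    std.debug.print("category index: {d}\n", .{cat});
}
```

Because the lookup touches only the run array, a consumer built this way needs just the sorted table in its binary, which matches the size rationale given in the notes for the range-backed twins.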

Changed

  • The emoji predicates moved from unicode.segmentation to the new
    unicode.emoji module (see Added). segmentation.isEmoji,
    isEmojiPresentation, isEmojiModifier, isEmojiModifierBase,
    isEmojiComponent, and isExtendedPictographic remain as deprecated
    re-export aliases (so segmentation keeps compiling and UAX #29 grapheme
    clustering still resolves Extended_Pictographic); prefer unicode.emoji.*.
    unicode.emoji_data now points at emoji.generated rather than
    segmentation.emoji_data. All still v0.3.0-unreleased.

  • The Unicode range-table re-exports (properties.category_runs,
    properties.derived_runs, properties.white_space_ranges,
    properties.join_control_ranges, scripts.script_runs,
    casing.case_folding.common_simple_table) are now []const T slices over a
    single backing array instead of by-value array re-exports. Iteration,
    indexing, slicing and .len are unchanged; this removes a duplicate copy of
    each table that the by-value alias materialized in consumer binaries (and the
    extra comptime-materialized copy). Still all v0.3.0-unreleased.

  • Performance: encoding.utf8.validate now skips ASCII runs in bulk via SIMD
    (asciiRunLength) while the Höhrmann DFA is on a scalar boundary, instead of
    feeding every byte through the DFA. ASCII bytes always keep the DFA in accept,
    so the verdict is identical; only the dominant ASCII case is faster. Signature
    and result are unchanged.

  • Performance: the UAX #14 line-break steppers (lineStep, lineStepBytes,
    and the LineBreakIterator / CodePointLineBoundaryIterator they drive) now
    compute the forward look-ahead only when a look-ahead-dependent rule
    (LB15b, LB15c, LB19a, LB25, LB28a) can actually fire, instead of on every
    code point. Roughly 25–37% faster line iteration on the benchmark corpora.

  • Performance: the streaming sentence iterators (SentenceIterator,
    CodePointSentenceIterator) memoise the SB8 look-ahead across an
    ATerm Close* Sp* window, eliminating repeated forward rescans (~10–15%
    faster CodePointSentenceIterator) and bounding a previously quadratic
    worst case for long ATerm runs.

  • Performance: deduplicated the General_Category lookup shared by LB15b and
    LB19 within the line-break rule scan.

  • The Regional_Indicator run trackers BoundaryState.ri_run and
    WordStepState.ri_count are now a single parity bit (u1) rather than a
    full usize; only the run parity was ever consulted, so the per-step state
    structs are smaller. No behavioural change.
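
As a rough illustration of the ASCII fast path described above for asciiRunLength and encoding.utf8.validate, the sketch below scans with a portable @Vector compare plus a scalar tail, then reuses that scan to count scalars with the non-continuation-byte rule. The names asciiRunLengthSketch and countScalarsSketch, the fallback lane count of 16, and the loop structure are assumptions for illustration; the library's actual implementation may differ.

```zig
const std = @import("std");

// The notes say the scanners stride std.simd.suggestVectorLength(u8) bytes;
// falling back to 16 lanes when no suggestion exists is an assumption.
const lanes = std.simd.suggestVectorLength(u8) orelse 16;
const Chunk = @Vector(lanes, u8);

// Length of the leading run of ASCII bytes (<= 0x7F).
fn asciiRunLengthSketch(bytes: []const u8) usize {
    const high_bit: Chunk = @splat(0x80);
    var i: usize = 0;
    // Vector loop: stop at the first chunk that has any high bit set.
    while (i + lanes <= bytes.len) : (i += lanes) {
        const chunk: Chunk = bytes[i..][0..lanes].*;
        if (@reduce(.Or, chunk & high_bit) != 0) break;
    }
    // Scalar tail: finds the exact position inside the stopping chunk and
    // handles the final bytes that do not fill a whole chunk.
    while (i < bytes.len and bytes[i] < 0x80) : (i += 1) {}
    return i;
}

// Unchecked scalar count: ASCII runs are added in bulk, every other byte is
// counted only if it is not a continuation byte ((b & 0xC0) != 0x80).
fn countScalarsSketch(bytes: []const u8) usize {
    var count: usize = 0;
    var i: usize = 0;
    while (i < bytes.len) {
        const run = asciiRunLengthSketch(bytes[i..]);
        count += run; // each ASCII byte is exactly one scalar
        i += run;
        if (i < bytes.len) {
            if ((bytes[i] & 0xC0) != 0x80) count += 1;
            i += 1;
        }
    }
    return count;
}

pub fn main() void {
    const text = "hello wörld";
    std.debug.print("ascii prefix: {d}, scalars: {d}\n", .{
        asciiRunLengthSketch(text),
        countScalarsSketch(text),
    });
}
```

The sketch restarts the ASCII scan after every non-ASCII byte, which is simple but not tuned; it is meant only to show why skipping ASCII in bulk cannot change the result, since an ASCII byte is always a complete scalar on its own.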