v0.3.0

shaik-abdul-thouhid released this 09 Jun 18:14

· 30 commits to main since this release

daa2670

[0.3.0] - 2026-06-09

Added

SIMD chunked scanners in encoding.utf8 (additive — every pre-existing API
keeps its exact signature and behaviour). All are portable @Vector compares
and reductions with a scalar tail (no target intrinsics, no dynamic shuffles),
striding std.simd.suggestVectorLength(u8) bytes at a time. @stable-since: v0.3.0:
- asciiRunLength — length of the leading ASCII run (<= 0x7F), the shared
  primitive behind the others and usable directly for an ASCII fast path.
- countScalarsSimd — unchecked scalar count via the non-continuation-byte
  rule ((b & 0xC0) != 0x80); equals countScalars on valid input.
- simdLossyIterator / UTF8SimdLossyIterator — a buffered lossy decode
  iterator that widens ASCII runs in bulk; output is identical to
  lossyIterator (malformed → U+FFFD, orphaned continuation runs collapse to a
  single replacement).
Enumerable code-point range tables for Unicode properties, so consumers
can resolve property classes into sorted ranges at comptime (the per-code-point
page tables cannot be enumerated without walking all 1.1M code points). New
zig build generate-ranges step (no network; reuses the committed page tables)
emits:
- properties.category_runs (CategoryRun{ start, end, category }) — a full
  partition of 0..=0x10FFFF by General_Category, including unassigned runs.
- properties.derived_runs (DerivedRun{ start, end, mask }) —
  DerivedCoreProperties runs keyed by the same bitmask as derivedPropertyMask.
- properties.white_space_ranges and properties.join_control_ranges
  (CodePointRange{ start, end }) — PropList bases for \s and \w.
- scripts.script_runs (ScriptRun{ start, end, script }) — Script runs for
  assigned code points.
properties.isWord — Perl \w / word-boundary predicate
(Alphabetic ∪ Mark ∪ Decimal_Number ∪ Connector_Punctuation ∪ Join_Control).
Resolved from the enumerable range tables (with an ASCII fast path), not the
per-code-point page tries, so a consumer that needs only isWord never links
the page tables. @stable-since: v0.3.0.
Range-table-backed per-code-point queries — equivalent to the page-table
predicates but linking only the enumerable range tables (no two-level page
tries), so a size-sensitive consumer can drop the tries entirely. Each is
proven equal to its page-table twin for every code point. @stable-since: v0.3.0:
- properties.categoryFromRuns — General_Category via binary search over
  category_runs (twin of generalCategory).
- properties.derivedMaskFromRuns — DerivedCoreProperties bitmask via binary
  search over derived_runs (twin of derivedPropertyMask).
- properties.isIdentifierStartByRanges / isIdentifierContinueByRanges —
  twins of isIdentifierStart / isIdentifierContinue.
A dedicated unicode.emoji module for the UTS #51 emoji character
properties (emoji-data.txt), promoting the six emoji predicates out of
unicode.segmentation into a first-class property module alongside scripts,
blocks, etc. The generated page/range tables (emoji.generated, regenerated
by zig build generate) now live under unicode/emoji/generated/. All
@stable-since: v0.3.0:
- Per-code-point predicates emoji.isEmoji, isEmojiPresentation,
  isEmojiModifier, isEmojiModifierBase, isEmojiComponent, and
  isExtendedPictographic (also surfaced as unicode.isEmoji, … ).
- emoji.EmojiProperty (enum of the six properties), emoji.EmojiProperties
  (a packed struct of all six bools with .any()), emoji.emojiProperties
  (resolve all six at once), emoji.hasEmojiProperty (runtime-selected
  dispatch), and emoji.hasAnyEmojiProperty.
- Enumerable code-point range tables so consumers can resolve \p{Emoji},
  \p{Extended_Pictographic}, etc. into sorted ranges at comptime without
  walking all 1.1M code points (same rationale as scripts.script_runs).
  Emitted by an extended zig build generate-ranges into
  unicode/emoji/generated/emoji_ranges.zig and re-exported as
  emoji.emoji_ranges, emoji.emoji_presentation_ranges,
  emoji.emoji_modifier_ranges, emoji.emoji_modifier_base_ranges,
  emoji.emoji_component_ranges, and emoji.extended_pictographic_ranges
  (EmojiRange{ start, end }), with emoji.rangesFor(property) for
  runtime selection. Each table is proven (test) to enumerate exactly its
  predicate over the whole code space.

Changed

The emoji predicates moved from unicode.segmentation to the new
unicode.emoji module (see Added). segmentation.isEmoji,
isEmojiPresentation, isEmojiModifier, isEmojiModifierBase,
isEmojiComponent, and isExtendedPictographic remain as deprecated
re-export aliases (so segmentation keeps compiling and UAX #29 grapheme
clustering still resolves Extended_Pictographic); prefer unicode.emoji.*.
unicode.emoji_data now points at emoji.generated rather than
segmentation.emoji_data. All still v0.3.0-unreleased.
The Unicode range-table re-exports (properties.category_runs,
properties.derived_runs, properties.white_space_ranges,
properties.join_control_ranges, scripts.script_runs,
casing.case_folding.common_simple_table) are now []const T slices over a
single backing array instead of by-value array re-exports. Iteration,
indexing, slicing and .len are unchanged; this removes a duplicate copy of
each table that the by-value alias materialized in consumer binaries (and the
extra comptime-materialized copy). Still all v0.3.0-unreleased.
Performance: encoding.utf8.validate now skips ASCII runs in bulk via SIMD
(asciiRunLength) while the Höhrmann DFA is on a scalar boundary, instead of
feeding every byte through the DFA. ASCII bytes always keep the DFA in accept,
so the verdict is identical; only the dominant ASCII case is faster. Signature
and result are unchanged.
Performance: the UAX #14 line-break steppers (lineStep, lineStepBytes,
and the LineBreakIterator / CodePointLineBoundaryIterator they drive) now
compute the forward look-ahead only when a look-ahead-dependent rule
(LB15b, LB15c, LB19a, LB25, LB28a) can actually fire, instead of on every
code point. Roughly 25–37% faster line iteration on the benchmark corpora.
Performance: the streaming sentence iterators (SentenceIterator,
CodePointSentenceIterator) memoise the SB8 look-ahead across an
ATerm Close* Sp* window, eliminating repeated forward rescans (~10–15%
faster CodePointSentenceIterator) and bounding a previously quadratic
worst case for long ATerm runs.
Performance: deduplicated the General_Category lookup shared by LB15b and
LB19 within the line-break rule scan.
The Regional_Indicator run trackers BoundaryState.ri_run and
WordStepState.ri_count are now a single parity bit (u1) rather than a
full usize; only the run parity was ever consulted, so the per-step state
structs are smaller. No behavioural change.

Assets 2