Releases: speedyk-005/yasbd-lib

yasbd-lib v0.6.0

16 Jun 21:08

What's Changed

🚀 Added

Full Changelog: v0.5.0...v0.6.0

yasbd-lib v0.5.0

13 Jun 22:44

What's Changed

🚀 Added

⚙️ Changed

📝 Documentation

  • Update benchmarks with 92-case golden and per-language links by @speedyk-005 in #94

Full Changelog: v0.4.0...v0.5.0

yasbd-lib v0.4.0

10 Jun 21:56

What's Changed

🚀 Added

  • Enhance error handling with custom exceptions and suggestions by @speedyk-005 in #78
  • Implement radicli-based CLI with segment, detect, and langs commands by @speedyk-005 in #79
  • Enhance cleaner with extra steps, logging, and CLI options by @speedyk-005 in #82
  • Expose utils submodules at package root level by @speedyk-005 in #84
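
As an illustration of the "custom exceptions and suggestions" idea from #78, here is a minimal sketch of an exception that carries a "did you mean" hint. The class name, the `resolve` helper, and the list of language names are hypothetical and are not taken from the library; the suggestion comes from the standard library's `difflib.get_close_matches`.

```python
import difflib


class UnknownLanguageError(ValueError):
    """Hypothetical exception that carries a 'did you mean' suggestion."""

    def __init__(self, requested, known):
        self.requested = requested
        # Pick the single closest known name, if any is similar enough.
        matches = difflib.get_close_matches(requested, known, n=1)
        self.suggestion = matches[0] if matches else None
        message = f"Unknown language {requested!r}."
        if self.suggestion:
            message += f" Did you mean {self.suggestion!r}?"
        super().__init__(message)


# Illustrative names only; the real library identifies languages by code.
KNOWN = ["english", "french", "spanish", "german"]


def resolve(name):
    """Return the name if it is known, otherwise raise with a suggestion."""
    if name not in KNOWN:
        raise UnknownLanguageError(name, KNOWN)
    return name


try:
    resolve("germn")
except UnknownLanguageError as exc:
    print(exc)
```

Keeping the suggestion on the exception object (rather than only in the message) lets a CLI layer format it however it likes.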

⚙️ Changed

  • API doc generation: replace python_docstring_markdown with pydoc-markdown by @speedyk-005 in #83
  • Extract log_info helper to yasbd.utils.logger
  • Rename variables for clarity: MID_SENTENCE_ABBRVS => INLINE_ONLY_ABBRVS, HEADING_TOKENS => SECTION_MARKERS, GEOPOLITICAL_ABBRVS => DOTTED_GEOPOL_ABBRVS
  • Clean Spanish COMMON_SENT_STARTERS: remove 15 prepositions and 15 verbs that caused false boundaries after Ud./Vd.

Full Changelog: v0.3.0...v0.4.0

yasbd-lib v0.3.0

08 Jun 20:12
12fb326

What's Changed

🚀 Added

  • Add language support for Russian, Arabic, Chinese, and Portuguese by @speedyk-005 in #48
  • Add German (de) language support by @speedyk-005 in #54
  • Base abbreviation expansion: Added diag to REFERENCE_ABBRVS by @speedyk-005.
  • Rules.apply early return: Added guard for empty/whitespace-only input to skip processing by @speedyk-005.

⚙️ Changed

  • NAIVE_BOUNDARY_FINDER cluster logic unification: Merged contiguous terminator handling into the lookahead assertion by @speedyk-005.
  • FULLWIDTH_GEOPOLITICAL_ABBRVS moved to class-level attribute with dynamic regex matching by @speedyk-005.
  • COMMON_SENT_STARTERS expanded with time-related adverbs across all 9 languages by @speedyk-005.
  • BoundaryDetector.detect refactored to reduce cognitive complexity by @speedyk-005.

🐛 Fixed

  • Fix newline boundary handling in NAIVE_BOUNDARY_FINDER by @speedyk-005 in #64
  • Prevent single-letter markers from being treated as list items by @speedyk-005 in #71
  • Full-width geopolitical abbreviation over-matching fixed with dynamic regex by @speedyk-005.
  • Acronym/initialism boundary constraint simplified to reduce false positives by @speedyk-005.
  • Superscript indicator false splits prevented after ordinal markers by @speedyk-005.
  • Em-dash quoted text splitting fixed by adding pattern to QUOTE_AND_PAREN_FINDER by @speedyk-005.

📝 Documentation

  • Update documentation for language support and changelog by @speedyk-005 in #55

Full Changelog: v0.2.0...v0.3.0

yasbd-lib v0.2.0

04 Jun 23:59

What's Changed

🚀 Added

  • Add configurable StreamCleaner cleanup stages by @speedyk-005 in #41
  • _post_process_boundaries hook: Added language-aware sentence boundary correction without modifying the regex core pipeline (PR #39).

⚙️ Changed

  • Regex architecture refactor in base.py: Promoted local regex patterns into class-level attributes for consistency and reuse by @speedyk-005.
  • STREET_ABBRVS merged into MID_SENTENCE_ABBRVS: Now strictly non-splitting; English restores boundary logic via post-processing hook by @speedyk-005.
  • COMMON_ORG_NOUNS renamed to ORG_PROPER_NOUNS and restricted to proper nouns only by @speedyk-005.
  • Geopolitical abbreviations normalization: Standardized casing across languages for consistent detection behavior by @speedyk-005.

🐛 Fixed

  • Fix Spanish sentence boundaries by @JheanLL in #31
  • Add opening bracket to reference abbreviation lookahead by @Jah-yee in #35
  • Fix false negative for Spanish 'ave' due to street abbreviation inheritance by @JheanLL in #37
  • Fix sentence splitting after a.m./p.m. before date tokens by @Rajesh270712 in #40
  • Fix sentence splitting after mixed-case scientific units by @Rajesh270712 in #42
  • Fix heading-aware SBD by @speedyk-005 in #44
  • Japanese over-matching boundary logic: Removed invalid \b dependency in CJK context by @speedyk-005.
  • Time-date pipeline cleanup (English-specific logic): Ensures time/date handling is isolated to English rules by @Rajesh270712.

New Contributors

Full Changelog: v0.1.3...v0.2.0

yasbd v0.1.3 - Bugfix release

01 Jun 21:37

pip install --upgrade yasbd-lib

Fixed

  • HORIZONTAL_LIST_FINDER over-match: Single-letter abbreviations (p., h., s.) no longer treated as alphabetic list markers. Restricted marker range to [a-eA-E].
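
A rough sketch of the effect of narrowing the marker range. Both patterns are simplified stand-ins written for this example; the real `HORIZONTAL_LIST_FINDER` is more involved and is not reproduced here.

```python
import re

# Simplified stand-ins, not the library's pattern.
# Before: any single letter followed by "." or ")" counts as a list marker.
WIDE_MARKER = re.compile(r"(?<!\w)[a-zA-Z][.)](?=\s)")
# After: only a-e / A-E qualify, so p., h., s. are left alone.
NARROW_MARKER = re.compile(r"(?<!\w)[a-eA-E][.)](?=\s)")


def find_markers(pattern, text):
    """Return every list-marker candidate the pattern finds."""
    return [m.group() for m in pattern.finditer(text)]


text = "See p. 12 and s. 4 for options a. red b. green c. blue"
print(find_markers(WIDE_MARKER, text))    # ['p.', 's.', 'a.', 'b.', 'c.']
print(find_markers(NARROW_MARKER, text))  # ['a.', 'b.', 'c.']
```

The trade-off is that lists running past item "e" are no longer recognized by the narrower range.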

yasbd v0.1.2 - Bugfix release

01 Jun 21:40

Accuracy-focused release: 84-case golden benchmark, expanded abbreviations, faster regex compilation.

pip install --upgrade yasbd-lib

Added

  • 84-case golden benchmark suite (EN_GOLDEN_DATA.py): Covers abbreviations, ellipsis, contiguous terminators, parentheses, quotes, mixed CJK, decimal times, list markers, and exclamation-safe words. Used to compare all 7 libraries side-by-side.
  • Expanded abbreviations: Dozens of new abbreviations across all categories — reference (eq, ex, pp), date (Tue, Fri, Feb), street (Hwy, Ave, Blvd), title (Prof, Dr, Mr), and more.

Changed

  • Trie-based pattern building: Replaced "|".join() sorting with retrie.Trie for faster, more consistent abbreviation regex generation.
  • Abbreviation redistribution: Shared abbreviations (fr, ing, messrs, mlle, mme, etc.) moved to base class. Language-specific rules now only add their unique abbreviations.
  • Benchmarks rewritten: Cold/warm timing tables updated with real measured values; accuracy table and conclusion added.
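
To show what trie-based pattern building buys over a flat `"|".join()`, here is a small hand-rolled trie-to-regex builder. It is a stand-in written for this example and does not use or reproduce `retrie`; the function names and the sample abbreviations are assumptions.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks end of word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def trie_to_pattern(node):
    """Turn a trie node into a regex fragment with shared prefixes factored out."""
    if set(node) == {""}:
        return ""  # nothing left to match below this node
    optional = "" in node  # a word ends here, so the rest is optional
    alternatives = [
        re.escape(ch) + trie_to_pattern(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    return body + "?" if optional else body


abbreviations = ["Prof", "Pr", "Dr", "Mr"]
pattern = trie_to_pattern(build_trie(abbreviations))
print(pattern)  # (?:Dr|Mr|Pr(?:of)?)
```

Because shared prefixes are factored out, the result does not depend on the order of the input list, and the regex engine has fewer alternatives to try at each position.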

Fixed

  • ModuleNotFoundError masking: boundary_detector.py no longer masks unrelated import errors when a language module exists but a sub-dependency is missing.
  • P.M. false positive: All-caps P.M. no longer caught by the acronym pattern (p\.m and a\.m explicitly excluded).
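
The import-masking fix can be sketched with the standard library alone. The idea is to inspect `ModuleNotFoundError.name` and translate the error only when the missing module is the language module itself (or one of its parent packages). The function and exception names below are hypothetical and are not the contents of `boundary_detector.py`.

```python
import importlib


class UnsupportedLanguageError(Exception):
    """Hypothetical error type used only in this sketch."""


def load_language_module(module_name):
    """Import a language module without hiding unrelated import failures."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # exc.name is the module that could not be found. Translate the error
        # only if that is the requested module or one of its parent packages.
        if exc.name == module_name or module_name.startswith(f"{exc.name}."):
            raise UnsupportedLanguageError(module_name) from exc
        # Otherwise a sub-dependency is missing: let the real error surface.
        raise
```

With this check, a missing third-party dependency inside an existing language module is reported under its own name instead of as an unsupported language.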

yasbd v0.1.1 - Bugfix release

01 Jun 21:41

Dialog, ellipsis, initialism, and list marker improvements.

pip install --upgrade yasbd-lib

Fixed

  • Single-quote dialog: No longer splits before the dialogue tag (e.g., 'Is this great?' she said.).
  • Ellipsis mid-thought: Three-dot ellipsis (...) no longer splits mid-sentence. Only a run of four dots is treated as a sentence boundary.
  • Initialism detection: Pronoun I no longer triggers false splits in names like Albert I. Jones.
  • N° reference: Added to reference abbreviations to prevent split in N°. 1026.253.553.
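
The ellipsis rule can be pictured with a deliberately naive splitter that looks only at runs of dots. This is a sketch of the rule as described in the notes, not the library's implementation: it ignores abbreviations, decimals, and other terminators, and `split_sentences` is a hypothetical name.

```python
import re


def split_sentences(text):
    """Split on dot runs, except that exactly three dots never end a sentence."""
    sentences, start = [], 0
    for match in re.finditer(r"\.+", text):
        if len(match.group()) == 3:
            continue  # three-dot ellipsis: the thought continues
        end = match.end()
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without a terminator
    return sentences


print(split_sentences("I thought... maybe not. Wait.... Then he left."))
# ['I thought... maybe not.', 'Wait....', 'Then he left.']
```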

Changed

  • HORIZONTAL_LIST_FINDER: Switched to re2 for lookbehind support. Uses \b + negative lookbehind for capitalized words instead of requiring a terminator prefix. Supports other scripts via \p{Ll}.

yasbd v0.1.0 (First public release) 🎉

29 May 21:48

yasbd-lib is now on PyPI. Pure Python sentence boundary detection, 5 languages, drop-in pysbd adapter.

pip install yasbd-lib

What's inside

  • 2-pass pointer-based engine: abbreviation safe list + main splitter. No ML, no models, no bloat.
  • 5 languages: en, fr, es, ht, ja. Add yours by copying a template.
  • pysbd adapter: swap without changing a line of pipeline code.
  • Streaming: detect() yields integer offsets, segment() yields strings. Lazy generators, zero materialization.
  • Benchmarked against 6 competitors across 7 edge cases. #1 in accuracy.
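
To make the two-pass and streaming points concrete, here is a minimal sketch under stated assumptions: pass 1 records the offsets of periods that belong to safe-listed abbreviations, pass 2 walks terminators and skips those offsets. The function names mirror the `detect()` / `segment()` description above, but the bodies, the tiny abbreviation list, and the regexes are invented for illustration and are not the library's engine.

```python
import re
from typing import Iterator

# Hypothetical, tiny safe list; real per-language abbreviation sets are far larger.
ABBRV = re.compile(r"\b(?:dr|mr|prof|etc)\.", re.IGNORECASE)
TERMINATOR = re.compile(r"[.!?]+(?=\s|$)")


def detect(text: str) -> Iterator[int]:
    """Yield the end offset of each sentence, one at a time."""
    # Pass 1: remember where safe-listed abbreviations put their period.
    protected = {m.end() - 1 for m in ABBRV.finditer(text)}
    # Pass 2: walk terminators, skipping the protected ones.
    last = 0
    for m in TERMINATOR.finditer(text):
        if m.start() in protected:
            continue
        last = m.end()
        yield last
    if text[last:].strip():
        yield len(text)  # trailing text without a terminator


def segment(text: str) -> Iterator[str]:
    """Yield sentences as strings, built lazily from detect() offsets."""
    start = 0
    for end in detect(text):
        yield text[start:end].strip()
        start = end


for sentence in segment("Dr. Smith arrived. He was late! Why"):
    print(sentence)
```

Because both functions are generators, a caller can stop after the first few sentences without the rest of the output ever being built.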

What's next

  • More languages
  • spaCy pipeline component
  • StreamCleaner skip flags (issue #19)
  • Stabilize API for v0.2.0

Full changelog at CHANGELOG.md.