Releases: speedyk-005/yasbd-lib
yasbd-lib v0.6.0
What's Changed
🚀 Added
- Add spaCy component with tests and documentation by @speedyk-005 in #95
- Expand base TERMINATORS across major scripts by @speedyk-005 in #96
- Add book benchmarking script and update related documentation by @speedyk-005 in #97
- Add Italian language support by @speedyk-005 in #99
- Add Thai language support and refine related features by @speedyk-005 in #100
- Add Greek language support and update documentation by @speedyk-005 in #101
- Expand DOTTED_GEOPOL_ABBRVS with major organizations and states by @speedyk-005 in #102
Full Changelog: v0.5.0...v0.6.0
yasbd-lib v0.5.0
What's Changed
🚀 Added
- Add auto language detection to BoundaryDetector (default) by @speedyk-005 in #86
- Add Amharic (am) language support by @speedyk-005 in #91
⚙️ Changed
- Make lang required, auto is now opt-in by @speedyk-005 in #93
📝 Documentation
- Update benchmarks with 92-case golden and per-language links by @speedyk-005 in #94
Full Changelog: v0.4.0...v0.5.0
yasbd-lib v0.4.0
What's Changed
🚀 Added
- Enhance error handling with custom exceptions and suggestions by @speedyk-005 in #78
- Implement radicli-based CLI with segment, detect, and langs commands by @speedyk-005 in #79
- Enhance cleaner with extra steps, logging, and CLI options by @speedyk-005 in #82
- Expose utils submodules at package root level by @speedyk-005 in #84
⚙️ Changed
- API doc generation: replace python_docstring_markdown with pydoc-markdown by @speedyk-005 in #83
- Enhance error handling with custom exceptions and suggestions by @speedyk-005 in #78
- Extract `log_info` helper to `yasbd.utils.logger`
- Rename variables for clarity: `MID_SENTENCE_ABBRVS` => `INLINE_ONLY_ABBRVS`, `HEADING_TOKENS` => `SECTION_MARKERS`, `GEOPOLITICAL_ABBRVS` => `DOTTED_GEOPOL_ABBRVS`
- Clean Spanish `COMMON_SENT_STARTERS`: remove 15 prepositions and 15 verbs that caused false boundaries after `Ud.`/`Vd.`
Full Changelog: v0.3.0...v0.4.0
yasbd-lib v0.3.0
What's Changed
🚀 Added
- Add language support for Russian, Arabic, Chinese, and Portuguese by @speedyk-005 in #48
- Add German (de) language support by @speedyk-005 in #54
- Base abbreviation expansion: Added `diag` to `REFERENCE_ABBRVS` by @speedyk-005.
- `Rules.apply` early return: Added guard for empty/whitespace-only input to skip processing by @speedyk-005.
⚙️ Changed
- `NAIVE_BOUNDARY_FINDER` cluster logic unification: Merged contiguous terminator handling into the lookahead assertion by @speedyk-005.
- `FULLWIDTH_GEOPOLITICAL_ABBRVS` moved to class-level attribute with dynamic regex matching by @speedyk-005.
- `COMMON_SENT_STARTERS` expanded with time-related adverbs across all 9 languages by @speedyk-005.
- `BoundaryDetector.detect` refactored to reduce cognitive complexity by @speedyk-005.
🐛 Fixed
- Fix newline boundary handling in NAIVE_BOUNDARY_FINDER by @speedyk-005 in #64
- Prevent single-letter markers from being treated as list items by @speedyk-005 in #71
- Full-width geopolitical abbreviation over-matching fixed with dynamic regex by @speedyk-005.
- Acronym/initialism boundary constraint simplified to reduce false positives by @speedyk-005.
- Superscript indicator false splits prevented after ordinal markers by @speedyk-005.
- Em-dash quoted text splitting fixed by adding pattern to `QUOTE_AND_PAREN_FINDER` by @speedyk-005.
📝 Documentation
- Update documentation for language support and changelog by @speedyk-005 in #55
Full Changelog: v0.2.0...v0.3.0
yasbd-lib v0.2.0
What's Changed
🚀 Added
- Add configurable StreamCleaner cleanup stages by @speedyk-005 in #41
- `_post_process_boundaries` hook: Added language-aware sentence boundary correction without modifying the regex core pipeline (PR #39).
⚙️ Changed
- Regex architecture refactor in `base.py`: Promoted local regex patterns into class-level attributes for consistency and reuse by @speedyk-005.
- `STREET_ABBRVS` merged into `MID_SENTENCE_ABBRVS`: Now strictly non-splitting; English restores boundary logic via post-processing hook by @speedyk-005.
- `COMMON_ORG_NOUNS` renamed to `ORG_PROPER_NOUNS` and restricted to proper nouns only by @speedyk-005.
- Geopolitical abbreviations normalization: Standardized casing across languages for consistent detection behavior by @speedyk-005.
🐛 Fixed
- Fix Spanish sentence boundaries (#31) by @JheanLL in #31
- Add opening bracket to reference abbreviation lookahead (#35) by @Jah-yee in #35
- Fix false negative for Spanish 'ave' due to street abbrv inheritance (#37) by @JheanLL in #37
- Fix sentence splitting after a.m./p.m. before date tokens (#40) by @Rajesh270712 in #40
- Fix sentence splitting after mixed-case scientific units (#42) by @Rajesh270712 in #42
- Fix/heading aware sbd (#44) by @speedyk-005 in #44
- Japanese over-matching boundary logic: Removed invalid `\b` dependency in CJK context by @speedyk-005.
- Time-date pipeline cleanup (English-specific logic): Ensures time/date handling is isolated to English rules by @Rajesh270712.
New Contributors
- @JheanLL made their first contribution in #31
- @Jah-yee made their first contribution in #35
- @Rajesh270712 made their first contribution in #40
Full Changelog: v0.1.3...v0.2.0
yasbd v0.1.3 - Bugfix release
pip install --upgrade yasbd-lib
Fixed
- HORIZONTAL_LIST_FINDER over-match: Single-letter abbreviations (`p.`, `h.`, `s.`) no longer treated as alphabetic list markers. Restricted marker range to `[a-eA-E]`.
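The effect of restricting the marker range can be illustrated with a minimal sketch. The pattern below is an assumption written for illustration only; `LIST_MARKER` and `find_list_markers` are hypothetical names, not the library's actual `HORIZONTAL_LIST_FINDER`.

```python
import re

# Hypothetical pattern (not the library's): a single letter in a-e / A-E,
# not preceded by a word character, followed by "." or ")" and whitespace.
LIST_MARKER = re.compile(r"(?<!\w)([a-eA-E])[.)](?=\s)")


def find_list_markers(text: str) -> list[str]:
    """Return the letters that look like alphabetic list markers."""
    return [match.group(1) for match in LIST_MARKER.finditer(text)]


if __name__ == "__main__":
    # Letters inside the restricted range are picked up as markers.
    print(find_list_markers("a. apples b. pears c. plums"))
    # Abbreviations such as "p." and "h." fall outside the range.
    print(find_list_markers("See p. 12 and h. 3 for details"))
```

With a range this narrow, lists longer than five items would not be recognized past `e`, which is the trade-off for avoiding matches on common single-letter abbreviations.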
yasbd v0.1.2 - Bugfix release
Accuracy-focused release: 84-case golden benchmark, expanded abbreviations, faster regex compilation.
pip install --upgrade yasbd-lib
Added
- 84-case golden benchmark suite (`EN_GOLDEN_DATA.py`): Covers abbreviations, ellipsis, contiguous terminators, parentheses, quotes, mixed CJK, decimal times, list markers, and exclamation-safe words. Used to compare all 7 libraries side-by-side.
- Expanded abbreviations: Dozens of new abbreviations across all categories: reference (`eq`, `ex`, `pp`), date (`Tue`, `Fri`, `Feb`), street (`Hwy`, `Ave`, `Blvd`), title (`Prof`, `Dr`, `Mr`), and more.
Changed
- Trie-based pattern building: Replaced `"|".join()` sorting with `retrie.Trie` for faster, more consistent abbreviation regex generation.
- Abbreviation redistribution: Shared abbreviations (`fr`, `ing`, `messrs`, `mlle`, `mme`, etc.) moved to base class. Language-specific rules now only add their unique abbreviations.
- Benchmarks rewritten: Cold/warm timing tables updated with real measured values; accuracy table and conclusion added.
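To show why a trie helps here, the following is a minimal standard-library sketch of turning a word list into a prefix-sharing regex. It does not use `retrie` and is not the library's code; `build_trie` and `trie_to_pattern` are hypothetical helpers written for this illustration.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks end of word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def trie_to_pattern(node):
    """Convert a trie node into a regex fragment that shares prefixes."""
    # A node holding only the end marker contributes nothing further.
    if "" in node and len(node) == 1:
        return ""
    branches = [
        re.escape(ch) + trie_to_pattern(node[ch])
        for ch in sorted(key for key in node if key)
    ]
    optional = "" in node  # a word ends here, so the continuation is optional
    if len(branches) == 1 and not optional:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body


if __name__ == "__main__":
    pattern = trie_to_pattern(build_trie(["Mr", "Mrs", "Dr", "Prof"]))
    print(pattern)
```

Because shared prefixes are merged, the result no longer depends on sorting alternatives by length to make the longer abbreviation win, which is one plausible reason for the consistency gain mentioned above.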
Fixed
- ModuleNotFoundError masking: `boundary_detector.py` no longer masks unrelated import errors when a language module exists but a sub-dependency is missing.
- P.M. false positive: All-caps `P.M.` no longer caught by the acronym pattern (`p\.m` and `a\.m` explicitly excluded).
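The import-masking fix can be pictured with a common pattern: compare the `name` attribute of the `ModuleNotFoundError` against the module you tried to load, and translate only that case. This is a sketch under assumptions; `load_language_module` and the `ValueError` are hypothetical and not taken from `boundary_detector.py`.

```python
import importlib
from types import ModuleType


def load_language_module(package: str, lang: str) -> ModuleType:
    """Import `package.lang`, translating only a missing language module."""
    target = f"{package}.{lang}"
    try:
        return importlib.import_module(target)
    except ModuleNotFoundError as exc:
        # exc.name is the module that could not be found. Only treat the
        # error as "unsupported language" when it is the target itself.
        if exc.name == target:
            raise ValueError(f"Unsupported language: {lang!r}") from exc
        # Anything else (for example a missing sub-dependency) is re-raised
        # unchanged so the real cause stays visible.
        raise
```

For example, `load_language_module("json", "decoder")` returns the standard-library module, while `load_language_module("json", "nope")` raises the translated `ValueError`.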
yasbd v0.1.1 - Bugfix release
Dialog, ellipsis, initialism, and list marker improvements.
pip install --upgrade yasbd-lib
Fixed
- Single-quote dialog: No longer splits before the dialogue tag (e.g., `'Is this great?' she said.`).
- Ellipsis mid-thought: Three-dot ellipsis (`...`) no longer splits mid-sentence. Only four dots are sentence boundaries.
- Initialism detection: Pronoun `I` no longer triggers false splits in names like `Albert I. Jones`.
- N° reference: Added to reference abbreviations to prevent split in `N°. 1026.253.553`.
Changed
- HORIZONTAL_LIST_FINDER: Switched to `re2` for lookbehind support. Uses `\b` + negative lookbehind for capitalized words instead of requiring a terminator prefix. Supports other scripts via `\p{Ll}`.
yasbd v0.1.0 (First public release) 🎉
yasbd-lib is now on PyPI. Pure Python sentence boundary detection, 5 languages, drop-in pysbd adapter.
pip install yasbd-lib
What's inside
- 2-pass pointer-based engine: abbreviation safe list + main splitter. No ML, no models, no bloat.
- 5 languages: en, fr, es, ht, ja. Add yours by copying a template.
- pysbd adapter: swap without changing a line of pipeline code.
- Streaming: `detect()` yields integer offsets, `segment()` yields strings. Lazy generators, zero materialization.
- Benchmarked against 6 competitors across 7 edge cases. #1 in accuracy.
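The offsets-versus-strings split in the streaming item can be sketched with two plain generators. This is a deliberately naive illustration, not yasbd's engine: `naive_detect`, `naive_segment`, and the terminator regex are assumptions, and the real `detect()`/`segment()` signatures are not shown in these notes.

```python
import re
from typing import Iterator

# Naive terminator rule for illustration only: a run of . ! ? followed by
# whitespace or end of text. No abbreviation handling at all.
_NAIVE_END = re.compile(r"[.!?]+(?=\s|$)")


def naive_detect(text: str) -> Iterator[int]:
    """Lazily yield the end offset of each naive sentence."""
    for match in _NAIVE_END.finditer(text):
        yield match.end()


def naive_segment(text: str) -> Iterator[str]:
    """Lazily yield sentence strings, built on top of the offsets."""
    start = 0
    for end in naive_detect(text):
        yield text[start:end].strip()
        start = end
    tail = text[start:].strip()
    if tail:  # trailing text without a terminator
        yield tail
```

Because both functions are generators, a caller can stop after the first few sentences without the rest of the text ever being scanned.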
What's next
- More languages
- spaCy pipeline component
- `StreamCleaner` skip flags (issue #19)
- Stabilize API for v0.2.0
Full changelog at CHANGELOG.md.