fix(parser): text signal hardening — noise-prefix strip + joined-prefix lookahead#70

Merged
thewrz merged 2 commits into main from feat/issue-67-text-signal-hardening
May 18, 2026

Conversation

@thewrz
Contributor

@thewrz thewrz commented May 17, 2026

Summary

  • Strips leading ], ]], and [_____] noise prefixes from lines before the signal cascade. Fixes UFGS specifier-note bracket bleed, where ] PART 2 PRODUCTS was silently classified as a continuation.
  • Loosens ARTICLE_RE and PR_SIGNALS (pr1, pr2) to accept joined prefixes with no space. Fixes the CPI export pattern where 1.3QUALITY, B.Included, and 1.Manufacturers fell through to continuation.
  • Uses a (?=\S) lookahead for ARTICLE_RE (a digit-letter boundary has no \b) and (?=[^\d\s]) for pr2, so that a bare 1.1 does not match as pr2.
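
For readers who want to see the lookahead reasoning in isolation, here is a minimal sketch of joined-prefix matching. The regexes, the JoinedSignal type, and classifyJoined are hypothetical stand-ins written for illustration; the actual ARTICLE_RE and PR_SIGNALS in src/parser/text/signals.ts are not shown in this PR page and may differ.

```typescript
// Hypothetical stand-ins for the patterns described above; the real
// ARTICLE_RE and PR_SIGNALS in src/parser/text/signals.ts may differ.
type JoinedSignal = {
  type: "article" | "pr1" | "pr2" | "continuation";
  text: string;
};

// "1.3 QUALITY" or "1.3QUALITY". \b cannot close the number here because
// "3" and "Q" are both word characters, so (?=\S) is used instead.
const ARTICLE_SKETCH = /^\d+\.\d+\s*(?=\S)/;
// "B. Included" or "B.Included".
const PR1_SKETCH = /^[A-Z]\.\s*(?=\S)/;
// "1. Manufacturers" or "1.Manufacturers". The lookahead rejects a digit so
// that a bare "1.1" is not read as item "1." followed by the text "1".
const PR2_SKETCH = /^\d+\.\s*(?=[^\d\s])/;

const JOINED_CASCADE: Array<[JoinedSignal["type"], RegExp]> = [
  ["article", ARTICLE_SKETCH],
  ["pr1", PR1_SKETCH],
  ["pr2", PR2_SKETCH],
];

function classifyJoined(line: string): JoinedSignal {
  for (const [type, re] of JOINED_CASCADE) {
    const match = re.exec(line);
    // Lookaheads are zero-width, so match[0] covers only the prefix.
    if (match) return { type, text: line.slice(match[0].length) };
  }
  return { type: "continuation", text: line };
}

// A word-boundary version of the article pattern misses the joined form.
const wordBoundaryMisses = !/^\d+\.\d+\b/.test("1.3QUALITY");
```

Because the lookahead consumes nothing, slicing off match[0] removes only the numbering, which is one way the stripping helpers could avoid requiring a whitespace suffix.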

Test plan

pnpm test src/parser/text/signals.test.ts   # 9 new tests + all prior tests (31 total)
pnpm test                                    # full suite, no regressions (432 tests)
pnpm lint                                    # clean

Closes #67

Summary by CodeRabbit

  • Tests

    • Added comprehensive test coverage for text line classification with edge cases, including noise-prefix handling and joined-prefix detection.
  • Bug Fixes

    • Improved parser robustness for handling unusual formatting patterns, bracketed prefixes, and ambiguous punctuation in document structure detection.

Test plan results

  • pnpm test src/parser/text/signals.test.ts — 31 tests (9 new + 22 prior) passed
  • pnpm test — 432 tests passed, no regressions
  • pnpm lint — ESLint + tsc + prettier clean

@coderabbitai

coderabbitai Bot commented May 17, 2026

Walkthrough

This PR hardens the plaintext parser's text signal classification with two distinct fixes: (1) stripping noise prefixes (closing brackets, placeholders) that prevent structural keyword matching, and (2) loosening regex patterns and prefix stripping to handle joined prefixes where numbering or letters attach directly to content without whitespace.
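
The noise-prefix half of the change can be illustrated with a short sketch. This is an assumption-laden approximation, not the merged code: NOISE_PREFIX_SKETCH, stripNoisePrefixesSketch, classifyAfterStrip, and the simplified PART/SECTION checks are all hypothetical names and patterns chosen for the example.

```typescript
// Hypothetical sketch; the real NOISE_PREFIX_RE and stripNoisePrefixes in
// src/parser/text/signals.ts may be written differently.
type StructuralSignal = {
  type: "part" | "section" | "continuation";
  text: string;
};

// One or more leading "]" or "[____]" tokens, each optionally preceded by
// whitespace, followed by any trailing whitespace.
const NOISE_PREFIX_SKETCH = /^(?:\s*(?:\]|\[_+\]))+\s*/;

function stripNoisePrefixesSketch(line: string): string {
  return line.replace(NOISE_PREFIX_SKETCH, "");
}

// Strip first, then run (simplified, assumed) structural keyword checks so
// that a stray bracket no longer hides PART or SECTION.
function classifyAfterStrip(rawLine: string): StructuralSignal {
  const text = stripNoisePrefixesSketch(rawLine);
  if (/^PART\s+\d+\b/.test(text)) return { type: "part", text };
  if (/^SECTION\b/.test(text)) return { type: "section", text };
  return { type: "continuation", text };
}
```

Lines without a noise prefix pass through unchanged, since the pattern requires at least one bracket or placeholder token at the start of the line.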

Changes

Text signal hardening

| Layer / File(s) | Summary |
|---|---|
| Noise prefix stripping for bracket and placeholder tokens (src/parser/text/signals.ts, src/parser/text/signals.test.ts) | Introduces NOISE_PREFIX_RE regex and stripNoisePrefixes helper to remove leading ], ]], and [_____] patterns. Integrates stripping into classifyLine before signal detection so structural keywords like PART and SECTION are recognized after noise removal. Tests validate that noise-prefixed lines (e.g., ] PART 2 PRODUCTS, ]] PART 3 EXECUTION) classify with correct type and text. |
| Joined-prefix regex and stripping updates (src/parser/text/signals.ts, src/parser/text/signals.test.ts) | Updates ARTICLE_RE and PR_SIGNALS (pr1, pr2) to match article and requirement prefixes immediately adjacent to content using word-boundary and lookahead patterns. Adjusts stripArticlePrefix and stripPrPrefix to remove prefixes without mandatory whitespace suffixes. Tests validate that joined patterns (1.3QUALITY, B.Included, 1.Manufacturers) classify correctly and return text with prefix stripped. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

A parser once stumbled on brackets that clung,
And prefixes merged into words without space—
Now noise fades away, and the signals ring true,
From ] PART to 1.3QUALITY, each finds its place. 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly and clearly describes the main changes: noise-prefix stripping and joined-prefix lookahead handling in the text signal parser. |
| Linked Issues check | ✅ Passed | The PR fully implements both scope items from #67: noise-prefix stripping (NOISE_PREFIX_RE regex and stripNoisePrefixes helper in classifyLine) and joined-prefix lookahead (updated ARTICLE_RE and PR signal regexes), with comprehensive test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes are tightly scoped to the two stated failure modes: noise-prefix handling and joined-prefix regex loosening; test additions directly validate these fixes. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@thewrz thewrz merged commit 1c4bdb8 into main May 18, 2026
5 checks passed
@thewrz thewrz deleted the feat/issue-67-text-signal-hardening branch May 18, 2026 04:09
thewrz added a commit that referenced this pull request May 18, 2026
Status table: add 1c-iii..1c-viii and 1c-sec-i/ii rows covering PRs
#69 #70 #71 #72 #74 #75 #76. Updates 'Active development' subtitle
to reflect Phase 1c being complete.

Parsing section: add plaintext signal hardening (#70), parse-anomaly
warnings (#75), and DOCX resilience suite (#72) bullets.

MCP section: note POST /mcp rate limiting (#69).

Not Yet Built: strike completed items (DOCX cross-ref extraction in
PR #76, parse worker concurrency cap in PR #71). Add new known gap:
REST persistTree ignores extracted refs (follow-up to #53).

Development

Successfully merging this pull request may close these issues.

fix(parser): text signal hardening — specifier-note strip, en-dash PART, joined prefixes, anomaly surfacing
