test(parser): DOCX resilience — LibreOffice fixture + integration tests#72
Adds libreoffice.integration.test.ts with 6 structural assertions covering PART/article/pr1 classification of a committed LibreOffice-generated DOCX. Discovery revealed Signal 1 misclassifies <ol><li> items as PART nodes when LibreOffice assigns numId > 0, ilvl=0 (same level as CSI PART headings). Fix: add isPartHeading() guard in heuristics.ts; trySignal1 in inference.ts now requires PART text confirmation when ilvl=0. Regression tests added for both the guard and the LibreOffice ol pattern. Existing tests updated to use PART-conformant text for ilvl=0 paragraphs.
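The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in `heuristics.ts` or `inference.ts`: `classifyByNumbering`, `NumberedParagraph`, and `NODE_TYPE_BY_ILVL` are hypothetical names, and the ilvl-to-node-type mapping is an assumption based on the levels mentioned in this PR. Only `isPartHeading` and the PART regex come from the PR itself.

```typescript
// Hypothetical sketch of the Signal 1 guard; only isPartHeading and the
// PART regex are taken from the PR, everything else is illustrative.
type NodeType = 'part' | 'article' | 'pr1' | 'pr2' | 'pr3' | 'pr4' | 'pr5';

const PART_HEADING_PATTERN = /^PART\s+\d+/i;

// True when the text looks like a CSI PART heading, e.g. "PART 1 - GENERAL".
function isPartHeading(text: string): boolean {
  return PART_HEADING_PATTERN.test(text.trim());
}

// Assumed mapping from numbering level to node type (ilvl 0 = part, 1 = article, ...).
const NODE_TYPE_BY_ILVL: readonly NodeType[] = [
  'part', 'article', 'pr1', 'pr2', 'pr3', 'pr4', 'pr5',
];

interface NumberedParagraph {
  text: string;
  numId: number; // 0 or less means "no numbering applied"
  ilvl: number;
}

// Simplified stand-in for trySignal1: returns null when the signal should not fire.
function classifyByNumbering(p: NumberedParagraph): NodeType | null {
  if (p.numId <= 0) return null;
  if (p.ilvl < 0 || p.ilvl >= NODE_TYPE_BY_ILVL.length) return null;
  // The guard: numbering alone is not enough at ilvl=0, the text must confirm a PART.
  if (p.ilvl === 0 && !isPartHeading(p.text)) return null;
  return NODE_TYPE_BY_ILVL[p.ilvl];
}
```

With this shape, a LibreOffice list item such as `{ text: '1. Submit product data', numId: 1, ilvl: 0 }` falls through Signal 1 and is left for the text-based signal to classify, while a real `PART 1` heading at the same level is still accepted.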
Walkthrough
Adds a PART-heading regex and exported …
Changes
PART heading pattern and Signal 1 classification guard
Estimated code review effort: 3 (Moderate) | ~20 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/parser/docx/heuristics.ts (1)
25-37: Quick win: avoid regex drift between the Signal 1 guard and Signal 4 matching.
The PART pattern is now duplicated in two places. Reusing one shared constant for both `isPartHeading` and `TEXT_SIGNALS` prevents future divergence.
Proposed refactor
    +const PART_HEADING_PATTERN = /^PART\s+\d+/i;
    +
     // All patterns anchored to ^ — prevents mid-word matches (e.g. "3i)" in product codes).
     // Ordered most-specific first.
     const TEXT_SIGNALS: readonly TextSignalEntry[] = [
    -  { pattern: /^PART\s+\d+/i, nodeType: 'part', normalizedIlvl: 0 },
    +  { pattern: PART_HEADING_PATTERN, nodeType: 'part', normalizedIlvl: 0 },
       { pattern: /^\d+\.\d+\s+/, nodeType: 'article', normalizedIlvl: 1 },
       { pattern: /^[A-Z]\.\s/, nodeType: 'pr1', normalizedIlvl: 2 },
       { pattern: /^\d+\.\s/, nodeType: 'pr2', normalizedIlvl: 3 },
       { pattern: /^[a-z]\.\s/, nodeType: 'pr3', normalizedIlvl: 4 },
       { pattern: /^\d+\)\s/, nodeType: 'pr4', normalizedIlvl: 5 },
       { pattern: /^[a-z]\)\s/, nodeType: 'pr5', normalizedIlvl: 6 },
     ];
    @@
    -const PART_HEADING_PATTERN = /^PART\s+\d+/i;

Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/parser/docx/heuristics.ts` around lines 25 - 37, The PART heading regex is duplicated; centralize it and reuse it in both isPartHeading and TEXT_SIGNALS by exporting the shared constant PART_HEADING_PATTERN (or a newly exported name like EXPORTED_PART_HEADING_PATTERN) and replacing the duplicate literal in TEXT_SIGNALS with that constant; update the isPartHeading function to use the exported constant and remove the other copy so both Signal 1 (isPartHeading) and Signal 4 (TEXT_SIGNALS) reference the single shared pattern (symbols: PART_HEADING_PATTERN, isPartHeading, TEXT_SIGNALS).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/parser/docx/libreoffice.integration.test.ts`:
- Around line 45-50: The existing test comment is stale: update the explanatory
comment above the test to state that the isPartHeading guard now blocks Signal 1
from misclassifying non-PART text, so ordered-list (<ol><li>) items with "1. "
are handled by the non-part pattern (Signal 4) and not treated as 'part';
reference the guards and signals by name (isPartHeading, Signal 1, Signal 4) and
remove or rephrase the sentence that claims "Signal 1 wins in the hit array" so
the note reflects current guard behavior and avoids confusion during future
regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 2bb54f44-9b1b-4b1c-a363-f558a2bd9b76
⛔ Files ignored due to path filters (1)
`tests/fixtures/libreoffice/csi-spec-sample.docx` is excluded by `!**/*.docx`
📒 Files selected for processing (6)
- src/parser/docx/heuristics.ts
- src/parser/docx/index.test.ts
- src/parser/docx/inference.test.ts
- src/parser/docx/inference.ts
- src/parser/docx/libreoffice.integration.test.ts
- tests/fixtures/libreoffice/csi-spec-sample.html
…ding guard behavior
Status table: add 1c-iii..1c-viii and 1c-sec-i/ii rows covering PRs #69 #70 #71 #72 #74 #75 #76. Updates 'Active development' subtitle to reflect Phase 1c being complete. Parsing section: add plaintext signal hardening (#70), parse-anomaly warnings (#75), and DOCX resilience suite (#72) bullets. MCP section: note POST /mcp rate limiting (#69). Not Yet Built: strike completed items (DOCX cross-ref extraction in PR #76, parse worker concurrency cap in PR #71). Add new known gap: REST persistTree ignores extracted refs (follow-up to #53).
Summary
- `tests/fixtures/libreoffice/csi-spec-sample.html`: minimal CSI spec HTML source (3 PARTs, 2 articles each, pr1 + pr2 content)
- Generates `tests/fixtures/libreoffice/csi-spec-sample.docx` via LibreOffice headless CLI and commits it (synthetic content, no copyright), so CI always runs the tests
- `src/parser/docx/libreoffice.integration.test.ts`: 6 structural assertions following the ARCAT/CPI pattern
- LibreOffice emits `<ol><li>` items with `numId > 0, ilvl=0` (the same level as CSI PART headings), causing misclassification as `part` nodes; the fix adds an `isPartHeading()` guard in `heuristics.ts`, applied in `trySignal1` in `inference.ts`

Google Docs fixtures are out of scope and tracked separately (requires browser/manual export).
Discovery findings
Running the parser against the LibreOffice DOCX revealed:
- `numId=4` at `ilvl=0/1` for `<h1>`/`<h2>` headings: Signal 1 correctly fires as `part`/`article`
- `<p>` paragraphs get no numId: Signal 4 (text) fires correctly via the `A.` pattern
- `<ol><li>` items (FIXED): LibreOffice assigns `numId=1/2/3` at `ilvl=0` for list items, so Signal 1 was firing as `part`. Fix: require PART text confirmation when `ilvl=0`

No known-ambiguous conflicts remain after the fix. All 6 integration tests pass.
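To illustrate how the text-based signal (Signal 4) would pick up paragraphs that Signal 1 no longer claims, here is a small sketch built on the `TEXT_SIGNALS` patterns quoted in the review diff. The `matchTextSignal` helper and the shape of `TextSignalEntry` are assumptions made for this example; the real matching logic in the parser may differ.

```typescript
// Hypothetical sketch of first-match text classification. The patterns are
// copied from the review diff; matchTextSignal is an illustrative helper.
type NodeType = 'part' | 'article' | 'pr1' | 'pr2' | 'pr3' | 'pr4' | 'pr5';

interface TextSignalEntry {
  pattern: RegExp;
  nodeType: NodeType;
  normalizedIlvl: number;
}

// All patterns are anchored to ^ and ordered most-specific first.
const TEXT_SIGNALS: readonly TextSignalEntry[] = [
  { pattern: /^PART\s+\d+/i, nodeType: 'part', normalizedIlvl: 0 },
  { pattern: /^\d+\.\d+\s+/, nodeType: 'article', normalizedIlvl: 1 },
  { pattern: /^[A-Z]\.\s/, nodeType: 'pr1', normalizedIlvl: 2 },
  { pattern: /^\d+\.\s/, nodeType: 'pr2', normalizedIlvl: 3 },
  { pattern: /^[a-z]\.\s/, nodeType: 'pr3', normalizedIlvl: 4 },
  { pattern: /^\d+\)\s/, nodeType: 'pr4', normalizedIlvl: 5 },
  { pattern: /^[a-z]\)\s/, nodeType: 'pr5', normalizedIlvl: 6 },
];

// Returns the first entry whose pattern matches the start of the text, or null.
function matchTextSignal(text: string): TextSignalEntry | null {
  const trimmed = text.trimStart();
  for (const entry of TEXT_SIGNALS) {
    if (entry.pattern.test(trimmed)) return entry;
  }
  return null;
}

// Sample lines resembling the fixture content (the wording is made up).
for (const line of ['PART 1 - GENERAL', '1.1 SUMMARY', 'A. Section includes', '1. Product data']) {
  console.log(line, '->', matchTextSignal(line)?.nodeType ?? 'no match');
}
```

Because the patterns are tried in order, `1.1 SUMMARY` matches the article pattern before the `pr2` pattern is considered, and a list item beginning with `1. ` resolves to `pr2` rather than `part`.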
Test plan
- `pnpm test:integration src/parser/docx/libreoffice.integration.test.ts`
- `pnpm test`
- `pnpm lint`

Closes #57
Test plan results
- `pnpm test:integration src/parser/docx/libreoffice.integration.test.ts`: 6/6 passed
- `pnpm test`: 424 tests passed, no regressions
- `pnpm lint`: ESLint + tsc + prettier clean