Skip to content

test(parser): DOCX resilience — LibreOffice fixture + integration tests#72

Merged
thewrz merged 3 commits into
mainfrom
feat/issue-57-docx-resilience
May 18, 2026
Merged

test(parser): DOCX resilience — LibreOffice fixture + integration tests#72
thewrz merged 3 commits into
mainfrom
feat/issue-57-docx-resilience

Conversation

@thewrz
Copy link
Copy Markdown
Contributor

@thewrz thewrz commented May 17, 2026

Summary

  • Adds tests/fixtures/libreoffice/csi-spec-sample.html — minimal CSI spec HTML source (3 PARTs, 2 articles each, pr1 + pr2 content)
  • Generates tests/fixtures/libreoffice/csi-spec-sample.docx via LibreOffice headless CLI and commits it (synthetic content, no copyright) — CI always runs the tests
  • Adds src/parser/docx/libreoffice.integration.test.ts — 6 structural assertions following ARCAT/CPI pattern
  • Fixes Signal 1 false-positive: LibreOffice exports <ol><li> items with numId > 0, ilvl=0 — same level as CSI PART headings — causing misclassification as part nodes; fix adds isPartHeading() guard in heuristics.ts, applied in trySignal1 in inference.ts
  • Adds regression test for the LibreOffice ol pattern and updates existing synthetic tests to use PART-conformant text at ilvl=0

Google Docs fixtures are out of scope — tracked separately (requires browser/manual export).

Discovery findings

Running the parser against the LibreOffice DOCX revealed:

  • Pattern A (PART/article): LibreOffice assigns numId=4 at ilvl=0/1 for <h1>/<h2> headings — Signal 1 correctly fires as part/article
  • Pattern B (pr1): <p> paragraphs get no numId — Signal 4 (text) fires correctly via A. pattern
  • Pattern C (<ol><li>) — FIXED: LibreOffice assigns numId=1/2/3 at ilvl=0 for list items — Signal 1 was firing as part. Fix: require PART text confirmation when ilvl=0

No known-ambiguous conflicts remain after the fix. All 6 integration tests pass.

Test plan

pnpm test:integration src/parser/docx/libreoffice.integration.test.ts
pnpm test
pnpm lint

Closes #57

Summary by CodeRabbit

  • Bug Fixes

    • DOCX parsing now more accurately recognizes PART headings, avoiding misclassification of numbered lists as parts (including LibreOffice exports).
    • Part heading detection strengthened to require the expected PART pattern before classifying a node as a part.
  • Tests

    • Added regression and integration tests covering part classification and vanish-paragraph behavior.
    • Added a sample fixture to validate real-world DOCX/LibreOffice parsing scenarios.

Review Change Stack

Test plan results

  • pnpm test:integration src/parser/docx/libreoffice.integration.test.ts — 6/6 passed
  • pnpm test — 424 tests passed, no regressions
  • pnpm lint — ESLint + tsc + prettier clean

thewrz added 2 commits May 17, 2026 11:35
Adds libreoffice.integration.test.ts with 6 structural assertions covering
PART/article/pr1 classification of a committed LibreOffice-generated DOCX.

Discovery revealed Signal 1 misclassifies <ol><li> items as PART nodes when
LibreOffice assigns numId > 0, ilvl=0 (same level as CSI PART headings).
Fix: add isPartHeading() guard in heuristics.ts; trySignal1 in inference.ts
now requires PART text confirmation when ilvl=0.

Regression tests added for both the guard and the LibreOffice ol pattern.
Existing tests updated to use PART-conformant text for ilvl=0 paragraphs.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

📝 Walkthrough

Walkthrough

Adds a PART-heading regex and exported isPartHeading helper, uses it in Signal 1 paragraph classification to require PART text for part nodes, and adds unit + conditional LibreOffice integration tests plus an HTML fixture exercising real-world cases.

Changes

PART heading pattern and Signal 1 classification guard

Layer / File(s) Summary
PART heading pattern definition
src/parser/docx/heuristics.ts
PART_HEADING_PATTERN constant (^PART\s+\d+ case-insensitive regex) and exported isPartHeading() function provide a shared guard for validating PART heading text.
Signal 1 classification guard
src/parser/docx/inference.ts
Imports isPartHeading and adds a validation check in trySignal1: when indent-based classification maps to part, the paragraph is only accepted if text matches the PART pattern, preventing misclassification of numbered list items.
Unit test coverage
src/parser/docx/index.test.ts, src/parser/docx/inference.test.ts
Updates existing signal-1 test to include PART text, adds regression test verifying rejection of numbered items without PART pattern, updates test fixture to include "PART 1 –" prefix, and adjusts vanish-paragraph test to include PART text while verifying classification consistency.
LibreOffice integration test and fixtures
src/parser/docx/libreoffice.integration.test.ts, tests/fixtures/libreoffice/csi-spec-sample.html
Adds conditional integration test that parses a LibreOffice DOCX fixture (when present), validates tree structure (3 PART nodes, 2+ articles per PART, at least one pr1 node), includes regression checks for ordered-list misclassification, and provides HTML fixture with three-part sample document structure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • wrzonance/SpecR#21: Implements the 5-signal hierarchy inference engine that includes the trySignal1 logic now guarded by the PART heading pattern validation added in this PR.

Poem

🐰 I sniffed the headings, sharp and spry,
PARTs that hide in numbered sky.
A regex hop, a tidy cheer,
LibreOffice mischief disappears.
Now parts and lists stay clear — huzzah!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding LibreOffice fixture and integration tests for DOCX parser resilience.
Linked Issues check ✅ Passed PR implements key objectives from #57: collects LibreOffice DOCX sample, adds regression tests, documents signal failures, hardens inference heuristics with isPartHeading guard, and provides clear CSI PART heading validation.
Out of Scope Changes check ✅ Passed All changes are scoped to #57 objectives: LibreOffice fixture/integration tests, isPartHeading guard for Signal 1 false-positive fix, and test updates. Google Docs explicitly noted as out of scope.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/issue-57-docx-resilience

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/parser/docx/heuristics.ts (1)

25-37: ⚡ Quick win

Avoid regex drift between Signal 1 guard and Signal 4 matching.

The PART pattern is now duplicated in two places. Reusing one shared constant for both isPartHeading and TEXT_SIGNALS prevents future divergence.

♻️ Proposed refactor
+const PART_HEADING_PATTERN = /^PART\s+\d+/i;
+
 // All patterns anchored to ^ — prevents mid-word matches (e.g. "3i)" in product codes).
 // Ordered most-specific first.
 const TEXT_SIGNALS: readonly TextSignalEntry[] = [
-  { pattern: /^PART\s+\d+/i, nodeType: 'part', normalizedIlvl: 0 },
+  { pattern: PART_HEADING_PATTERN, nodeType: 'part', normalizedIlvl: 0 },
   { pattern: /^\d+\.\d+\s+/, nodeType: 'article', normalizedIlvl: 1 },
   { pattern: /^[A-Z]\.\s/, nodeType: 'pr1', normalizedIlvl: 2 },
   { pattern: /^\d+\.\s/, nodeType: 'pr2', normalizedIlvl: 3 },
   { pattern: /^[a-z]\.\s/, nodeType: 'pr3', normalizedIlvl: 4 },
   { pattern: /^\d+\)\s/, nodeType: 'pr4', normalizedIlvl: 5 },
   { pattern: /^[a-z]\)\s/, nodeType: 'pr5', normalizedIlvl: 6 },
 ];
@@
-const PART_HEADING_PATTERN = /^PART\s+\d+/i;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/parser/docx/heuristics.ts` around lines 25 - 37, The PART heading regex
is duplicated; centralize it and reuse it in both isPartHeading and TEXT_SIGNALS
by exporting the shared constant PART_HEADING_PATTERN (or a newly exported name
like EXPORTED_PART_HEADING_PATTERN) and replacing the duplicate literal in
TEXT_SIGNALS with that constant; update the isPartHeading function to use the
exported constant and remove the other copy so both Signal 1 (isPartHeading) and
Signal 4 (TEXT_SIGNALS) reference the single shared pattern (symbols:
PART_HEADING_PATTERN, isPartHeading, TEXT_SIGNALS).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/parser/docx/libreoffice.integration.test.ts`:
- Around line 45-50: The existing test comment is stale: update the explanatory
comment above the test to state that the isPartHeading guard now blocks Signal 1
from misclassifying non-PART text, so ordered-list (<ol><li>) items with "1. "
are handled by the non-part pattern (Signal 4) and not treated as 'part';
reference the guards and signals by name (isPartHeading, Signal 1, Signal 4) and
remove or rephrase the sentence that claims "Signal 1 wins in the hit array" so
the note reflects current guard behavior and avoids confusion during future
regressions.

---

Nitpick comments:
In `@src/parser/docx/heuristics.ts`:
- Around line 25-37: The PART heading regex is duplicated; centralize it and
reuse it in both isPartHeading and TEXT_SIGNALS by exporting the shared constant
PART_HEADING_PATTERN (or a newly exported name like
EXPORTED_PART_HEADING_PATTERN) and replacing the duplicate literal in
TEXT_SIGNALS with that constant; update the isPartHeading function to use the
exported constant and remove the other copy so both Signal 1 (isPartHeading) and
Signal 4 (TEXT_SIGNALS) reference the single shared pattern (symbols:
PART_HEADING_PATTERN, isPartHeading, TEXT_SIGNALS).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2bb54f44-9b1b-4b1c-a363-f558a2bd9b76

📥 Commits

Reviewing files that changed from the base of the PR and between 85ef92a and e8f13de.

⛔ Files ignored due to path filters (1)
  • tests/fixtures/libreoffice/csi-spec-sample.docx is excluded by !**/*.docx
📒 Files selected for processing (6)
  • src/parser/docx/heuristics.ts
  • src/parser/docx/index.test.ts
  • src/parser/docx/inference.test.ts
  • src/parser/docx/inference.ts
  • src/parser/docx/libreoffice.integration.test.ts
  • tests/fixtures/libreoffice/csi-spec-sample.html

Comment thread src/parser/docx/libreoffice.integration.test.ts
@thewrz thewrz merged commit fee597a into main May 18, 2026
5 checks passed
@thewrz thewrz deleted the feat/issue-57-docx-resilience branch May 18, 2026 04:10
thewrz added a commit that referenced this pull request May 18, 2026
Status table: add 1c-iii..1c-viii and 1c-sec-i/ii rows covering PRs
#69 #70 #71 #72 #74 #75 #76. Updates 'Active development' subtitle
to reflect Phase 1c being complete.

Parsing section: add plaintext signal hardening (#70), parse-anomaly
warnings (#75), and DOCX resilience suite (#72) bullets.

MCP section: note POST /mcp rate limiting (#69).

Not Yet Built: strike completed items (DOCX cross-ref extraction in
PR #76, parse worker concurrency cap in PR #71). Add new known gap:
REST persistTree ignores extracted refs (follow-up to #53).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test(parser): DOCX resilience — LibreOffice, Google Docs, older Word exports

1 participant