
feat(parser): text anomaly warnings — meta.warnings on CsiTree #75

Merged
thewrz merged 4 commits into main from feat/issue-68 on May 18, 2026
Conversation


@thewrz thewrz commented May 18, 2026

Summary

Adds structured anomaly warnings to the text parser, giving spec writers observability into silent failure modes (analogous to meta.conflicts in the DOCX parser).

Three warning types surfaced on CsiTree.warnings:

  • root-continuation — continuation text dropped before the first structural heading (capped at 5 warnings so that pathological PDFs cannot produce massive warning arrays)
  • empty-part — a part node with zero article children
  • no-structure-found — tree has zero parts (entire doc parsed as continuations)

When any warning is emitted, parseText adds 'parse-warnings' to capabilities. Warnings ride sparsely on CsiTree.warnings (undefined when none). The REST job result envelope exposes warnings[] alongside the existing nodeCount / capabilities fields. Warnings are not persisted to the database — they live in the parse job result envelope and disappear once consumed.
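
As a rough illustration of the behaviour described above (not the code in this PR), the three checks and the sparse `warnings` field could be sketched as follows. The `SketchBuildResult` shape, the `applyWarnings` helper, the `ROOT_CONTINUATION_CAP` name, and the choice of `string` for `lineHint` are assumptions made for the example; only the three warning type names, the cap of 5, and the `'parse-warnings'` capability come from the description.

```typescript
// Sketch only: simplified stand-ins for the PR's types, not the real definitions.
type ParseWarningType = 'root-continuation' | 'empty-part' | 'no-structure-found';

interface ParseWarning {
  type: ParseWarningType;
  lineHint?: string; // assumed to be a string for this example
  suggestion?: string;
}

// Hypothetical, reduced view of what the tree builder hands to the detector.
interface SketchBuildResult {
  parts: { title: string; articleCount: number }[];
  droppedAtRoot: string[]; // continuation lines seen before the first heading
}

const ROOT_CONTINUATION_CAP = 5; // value taken from the PR description

function detectWarnings(build: SketchBuildResult): ParseWarning[] {
  const warnings: ParseWarning[] = [];
  // Report at most ROOT_CONTINUATION_CAP dropped lines.
  for (const line of build.droppedAtRoot.slice(0, ROOT_CONTINUATION_CAP)) {
    warnings.push({ type: 'root-continuation', lineHint: line });
  }
  // A tree with zero parts means no structure was recognised at all.
  if (build.parts.length === 0) {
    warnings.push({ type: 'no-structure-found' });
  }
  // A part with no article children is reported individually.
  for (const part of build.parts) {
    if (part.articleCount === 0) {
      warnings.push({ type: 'empty-part', lineHint: part.title });
    }
  }
  return warnings;
}

// Hypothetical helper: keeps `warnings` undefined when there is nothing to
// report and adds the capability only when at least one warning exists.
function applyWarnings(build: SketchBuildResult, baseCapabilities: string[]) {
  const warnings = detectWarnings(build);
  const hasWarnings = warnings.length > 0;
  return {
    warnings: hasWarnings ? warnings : undefined,
    capabilities: hasWarnings ? [...baseCapabilities, 'parse-warnings'] : baseCapabilities,
  };
}
```

Leaving `warnings` undefined on the clean path means a well-formed document produces the same output as before the change, which is what the clean-path unit tests in this PR assert.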

Implementation

  • src/ast/types.ts: ParseWarning, ParseWarningType, CsiTree.warnings? (optional)
  • src/ast/schemas.ts: ParseWarningSchema, ParseWarningTypeSchema, extended CsiTreeSchema
  • src/parser/text/index.ts: buildTree refactor returns BuildResult { parts, droppedAtRoot, partLineIndex }; new detectWarnings, WARNING_SUGGESTIONS constant map, and root-continuation cap constant
  • src/api/parse.ts: extends workerOutputSchema to allow warnings; surfaces warnings on job result

Test plan

  • pnpm test src/parser/text/warnings.test.ts — 7 unit tests pass (RED → GREEN)
  • pnpm test — all 442 unit tests pass (no regressions)
  • pnpm test:integration src/api/parse-warnings.integration.test.ts — integration test asserts result.warnings, result.capabilities.includes('parse-warnings'), and other parsing still works (nodeCount > 3)
  • pnpm test:integration src/api/parse.integration.test.ts — existing parse integration tests still pass
  • pnpm lint — clean
  • pnpm build — clean

Design

See docs/superpowers/specs/2026-05-18-issue-068-design.md (approved).

Out of scope

Closes #68

Summary by CodeRabbit

  • New Features

    • Parse-warnings capability added to detect and report anomalies (empty parts, missing structure, root continuations) and surface them in parse results.
  • Tests

    • Added integration test covering parse-warnings end-to-end.
    • Added unit tests covering warning detection and capping behavior.
  • Documentation

    • Updated design docs and test-run configuration guidance.
    • Updated fixture content used by warning tests.

Review Change Stack


coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: cd281e66-e47f-460d-a245-176ebbc2dd30

📥 Commits

Reviewing files that changed from the base of the PR and between 7690f8e and 5327f72.

📒 Files selected for processing (1)
  • src/api/parse.ts

📝 Walkthrough


This PR implements parse-time anomaly detection for the text parser, surfacing structural failures (root continuations, empty parts, missing structure) as typed warnings in the CsiTree, propagating them through the REST API, and validating end-to-end with unit and integration tests. It also includes minor updates to design specifications for two related issues.

Changes

Parse-time anomaly surfacing (Issue #68)

| Layer | File(s) | Summary |
| --- | --- | --- |
| Warning type contracts and schemas | src/ast/types.ts, src/ast/schemas.ts | ParseWarningType and ParseWarning interface with type, optional lineHint, and optional suggestion; CsiTree extended with optional warnings array; Zod schemas added for validation. |
| Text parser anomaly detection | src/parser/text/index.ts | Tree builder refactored to return BuildResult tracking dropped root continuations and per-part line indices. Adds detectWarnings aggregating three anomaly types: root-continuation, empty-part, no-structure-found. parseText computes warnings and conditionally includes them in CsiTree; sets capabilities to include 'parse-warnings' when warnings present. |
| API warning propagation | src/api/parse.ts | Worker output Zod schema extended to allow optional tree.warnings. Job completion result conditionally includes warnings from finalTree in payload. |
| Warning detection unit tests | src/parser/text/warnings.test.ts | Vitest coverage for all three warning types: no-structure-found on empty/prose-only documents, empty-part with line hints on consecutive parts, root-continuation with prose hints before first part, capping logic limiting root-continuation to 5 warnings, and clean-path verification that well-formed input has no warnings. |
| API integration test | src/api/parse-warnings.integration.test.ts, tests/fixtures/text/anomaly-empty-part.txt | Integration test booting Express server, submitting fixture via POST /parse, polling /parse/jobs/:jobId until completion, and asserting empty-part warning, parse-warnings capability, and node count in result. |
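
The submit-then-poll flow that the integration test exercises might look roughly like this from a client's point of view. This is a sketch under assumptions: the `status` values, the `ParseJob` envelope, and the `waitForJob` and `warningsOf` helpers are hypothetical, and the fetcher is injected so that no particular HTTP client is implied. Only the `/parse/jobs/:jobId` polling endpoint and the `warnings`, `capabilities`, and `nodeCount` result fields come from the PR.

```typescript
// Result fields named in the PR description; the element shape is simplified.
interface ParseJobResult {
  nodeCount: number;
  capabilities: string[];
  warnings?: { type: string; lineHint?: string; suggestion?: string }[];
}

// Assumed job envelope: the status names are illustrative, not the real API.
interface ParseJob {
  status: 'pending' | 'completed' | 'failed';
  result?: ParseJobResult;
}

// Injected so the caller decides how GET /parse/jobs/:jobId is performed.
type JobFetcher = (jobId: string) => Promise<ParseJob>;

async function waitForJob(
  fetchJob: JobFetcher,
  jobId: string,
  maxAttempts = 20,
  delayMs = 250,
): Promise<ParseJobResult> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(jobId);
    if (job.status === 'completed' && job.result) return job.result;
    if (job.status === 'failed') throw new Error(`parse job ${jobId} failed`);
    // Wait before the next poll.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`parse job ${jobId} did not complete after ${maxAttempts} attempts`);
}

// Use the capability as the signal; fall back to an empty list otherwise.
function warningsOf(result: ParseJobResult) {
  return result.capabilities.includes('parse-warnings') ? result.warnings ?? [] : [];
}
```

Because the description states that warnings are not persisted, a client that cares about them would need to read them from this job result rather than expect them on a later read of the stored spec.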

Related design documentation updates

| Layer | File(s) | Summary |
| --- | --- | --- |
| Issue #27 DOCX refs extraction design | docs/superpowers/specs/2026-05-18-issue-027-design.md | Minor markdown adjustment in architecture section describing format-agnostic refs module wiring into DOCX parse path. |
| Issue #73 Vitest serialization design | docs/superpowers/specs/2026-05-18-issue-073-design.md | Design spec updated from fileParallelism: false to pool: 'forks' with poolOptions: { forks: { singleFork: true } } for integration-test serialization, with updated rationale, config diff, verification plan, and acceptance criteria. |
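
For reference, the serialization setting named in the Issue #73 row would sit in a Vitest config roughly as below. This is a sketch assuming a Vitest version that supports `poolOptions`; the `include` glob is a hypothetical placeholder, and the actual config file in the repository is not shown in this PR.

```typescript
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Hypothetical glob; the repository's real integration-test pattern may differ.
    include: ['src/**/*.integration.test.ts'],
    // Run test files in forked child processes...
    pool: 'forks',
    // ...and reuse a single fork so test files execute one after another.
    poolOptions: { forks: { singleFork: true } },
  },
});
```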

Sequence Diagram

sequenceDiagram
  participant Client
  participant ExpressServer
  participant ParseWorker
  participant Database

  Client->>ExpressServer: POST /parse (multipart text fixture)
  ExpressServer->>ParseWorker: enqueue job / invoke worker
  Client->>ExpressServer: poll GET /parse/jobs/:jobId
  ParseWorker->>ExpressServer: job complete (finalTree with optional warnings)
  ExpressServer->>Database: write spec row (specId) / update job result
  ExpressServer->>Client: 202 then final job result (includes warnings, capabilities, nodeCount)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

A rabbit read the spec with care,
Found parts misplaced and prose in air—
Now warnings hop where mysteries hide,
Line-hints show the path inside. 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(parser): text anomaly warnings — meta.warnings on CsiTree' directly and concisely summarizes the main change: adding typed anomaly warnings to the text parser and surfacing them on the CsiTree. |
| Linked Issues check | ✅ Passed | The PR fully implements all coding requirements from issue #68: ParseWarning and ParseWarningType types, CsiTree.warnings optional field, three warning types (root-continuation, empty-part, no-structure-found), anomaly detection in parseText, parse-warnings capability, and REST API integration with warnings in job results. |
| Out of Scope Changes check | ✅ Passed | All code changes are scope-aligned with issue #68; the two design-spec documents are supporting documentation, the integration test validates the feature, and test fixtures support the test suite. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/api/parse.ts (1)

108-115: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate warnings against ParseWarningSchema in worker output (Line 114).

warnings: z.array(z.unknown()).optional() accepts any payload shape, allowing malformed worker warnings to pass validation. Use the existing ParseWarningSchema contract instead—it's already defined in src/ast/schemas.ts and correctly used in CsiNodeSchema.

Proposed fix
 import { z } from 'zod';
+import { ParseWarningSchema } from '../ast/schemas.js';
@@
 const workerOutputSchema = z.object({
   tree: z.object({
@@
-    warnings: z.array(z.unknown()).optional(),
+    warnings: z.array(ParseWarningSchema).optional(),
   }),
   capabilities: z.array(z.string()).optional(),
 });

Per coding guidelines: "All external input validation uses Zod: request bodies, env vars, parsed XML/OOXML, and database results".
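
To illustrate why the permissive schema matters, here is a dependency-free stand-in for the stricter check. It is a hand-written guard rather than the Zod `ParseWarningSchema` the review refers to, and `isParseWarning`, `validateWarnings`, and the `string` type for `lineHint` are assumptions made for the example. An array of unknown values would let every payload below through, whereas a concrete element check rejects the malformed ones.

```typescript
// The three warning types named in the PR description.
const WARNING_TYPES = ['root-continuation', 'empty-part', 'no-structure-found'] as const;
type ParseWarningType = (typeof WARNING_TYPES)[number];

interface ParseWarning {
  type: ParseWarningType;
  lineHint?: string; // assumed string for this sketch
  suggestion?: string;
}

// Hypothetical guard standing in for a concrete element schema.
function isParseWarning(value: unknown): value is ParseWarning {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const typeOk =
    typeof v.type === 'string' && (WARNING_TYPES as readonly string[]).includes(v.type);
  const hintOk = v.lineHint === undefined || typeof v.lineHint === 'string';
  const suggestionOk = v.suggestion === undefined || typeof v.suggestion === 'string';
  return typeOk && hintOk && suggestionOk;
}

// Mirrors "optional array of warnings": undefined passes, anything else must
// be an array whose every element is a well-formed warning.
function validateWarnings(payload: unknown): ParseWarning[] | undefined {
  if (payload === undefined) return undefined;
  if (!Array.isArray(payload) || !payload.every(isParseWarning)) {
    throw new Error('worker output: malformed warnings');
  }
  return payload;
}
```

With this guard, `validateWarnings([{ type: 'empty-part' }])` returns the array unchanged, while `validateWarnings([{ type: 'bogus' }])` and `validateWarnings(['oops'])` throw.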

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/parse.ts` around lines 108 - 115, Replace the overly-permissive
warnings schema in workerOutputSchema with the canonical ParseWarningSchema so
worker outputs are strictly validated: update the definition inside
workerOutputSchema (the tree.warnings field) to use ParseWarningSchema imported
from src/ast/schemas.ts (the same schema used by CsiNodeSchema) and keep it
optional/array-typed as appropriate; ensure to add the import for
ParseWarningSchema at the top of src/api/parse.ts and run existing
tests/validation to confirm compatibility.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b0250529-b793-4956-83b2-8b2a67cd8338

📥 Commits

Reviewing files that changed from the base of the PR and between fee597a and 33b2af1.

📒 Files selected for processing (11)
  • .gitignore
  • docs/superpowers/specs/2026-05-18-issue-027-design.md
  • docs/superpowers/specs/2026-05-18-issue-068-design.md
  • docs/superpowers/specs/2026-05-18-issue-073-design.md
  • src/api/parse-warnings.integration.test.ts
  • src/api/parse.ts
  • src/ast/schemas.ts
  • src/ast/types.ts
  • src/parser/text/index.ts
  • src/parser/text/warnings.test.ts
  • tests/fixtures/text/anomaly-empty-part.txt

thewrz added 3 commits May 18, 2026 11:09
…-warnings capability

Surface three structural anomalies as typed warnings on the parsed text tree:
- root-continuation: continuation text dropped before first structural heading (cap at 5)
- empty-part: PART node with zero article children
- no-structure-found: tree has zero parts

Warnings ride sparsely on CsiTree.warnings (undefined when none). When present,
parseText adds 'parse-warnings' to capabilities and the REST job result envelope
exposes warnings[] alongside the existing nodeCount / capabilities fields.

This is the observability layer for the text parser — analogous to meta.conflicts
in the DOCX parser. Warnings are not persisted; they live in the parse job result
envelope and disappear once consumed.

Closes #68

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/superpowers/specs/2026-05-18-issue-027-design.md (1)

25-39: ⚡ Quick win

Add language specifier to fenced code block.

The architecture diagram code block should include a language identifier for proper rendering and consistency with other code blocks in the document. As per static analysis tools, fenced code blocks should have a language specified.

📝 Proposed fix
-```
+```text
 src/parser/refs/                ← new format-agnostic module
 ├── index.ts                    barrel
 ├── rules.ts                    SECTION_REF_RULES + STANDARD_ORG_PATTERNS + buildStandardRefRules
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-18-issue-027-design.md` around lines 25 - 39,
The fenced architecture diagram block in
docs/superpowers/specs/2026-05-18-issue-027-design.md is missing a language
specifier; update the opening fence to include an appropriate language (e.g.,
```text) so static analysis and renderers treat it consistently, leaving the
diagram content unchanged and keeping the block around the src/parser/refs/ ...
src/parser/index.ts snippet intact.
src/parser/text/warnings.test.ts (1)

72-105: 💤 Low value

Consider consolidating duplicate clean-path tests.

Tests 6 and 7 use nearly identical input fixtures and both verify the clean path (no anomalies). Test 7 adds a specific assertion that capabilities equals ['read-only'], but it omits the tree.warnings check from test 6. Combining them into a single comprehensive test would reduce duplication and improve maintainability.

♻️ Suggested consolidation
- it('no anomalies: well-formed UFGS structure → tree.warnings undefined, capabilities does NOT include parse-warnings', () => {
+ it('no anomalies: well-formed UFGS structure → tree.warnings undefined, capabilities remain ["read-only"]', () => {
    const text = [
      'SECTION 09 91 00 - PAINTING',
      'PART 1 - GENERAL',
      '1.1 SCOPE',
      'Hello.',
      'PART 2 - PRODUCTS',
      '2.1 MATERIALS',
-     'Product description.',
+     'Materials.',
      'PART 3 - EXECUTION',
      '3.1 INSTALLATION',
-     'Installation steps.',
+     'Install.',
    ].join('\n');
    const result = parseText(text);
    expect(result.tree.warnings).toBeUndefined();
    expect(result.capabilities).not.toContain('parse-warnings');
+   expect(result.capabilities).toEqual(['read-only']);
  });
-
- it('capabilities array unchanged on no-anomaly path (still ["read-only"])', () => {
-   const text = [
-     'SECTION 09 91 00 - PAINTING',
-     'PART 1 - GENERAL',
-     '1.1 SCOPE',
-     'Hello.',
-     'PART 2 - PRODUCTS',
-     '2.1 MATERIALS',
-     'Materials.',
-     'PART 3 - EXECUTION',
-     '3.1 INSTALLATION',
-     'Install.',
-   ].join('\n');
-   const result = parseText(text);
-   expect(result.capabilities).toEqual(['read-only']);
- });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/parser/text/warnings.test.ts` around lines 72 - 105, Two tests repeat the
clean/no-anomaly path; combine them into one comprehensive test: replace the two
it(...) blocks ('no anomalies: well-formed UFGS structure → tree.warnings
undefined, capabilities does NOT include parse-warnings' and 'capabilities array
unchanged on no-anomaly path (still ["read-only"])') with a single test that
calls parseText(...) once and asserts both that result.tree.warnings is
undefined and that result.capabilities strictly equals ['read-only'] (and does
not contain 'parse-warnings'); locate usages by the test descriptions and the
parseText call to update the spec accordingly and remove the duplicate
fixture/test block.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/parse.ts`:
- Line 114: Replace the unsafe z.array(z.unknown()).optional() for worker
warnings with the proper ParseWarningSchema: import ParseWarningSchema from
src/ast/schemas.ts (or the named export) at the top of src/api/parse.ts and
change the schema entry for "warnings" to use
z.array(ParseWarningSchema).optional() so all worker output is validated against
the concrete warning schema (refer to the warnings field and the
ParseWarningSchema symbol to locate the change).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0c5cc3a6-6ec2-4cc1-9449-0e9438f75ec6

📥 Commits

Reviewing files that changed from the base of the PR and between 33b2af1 and 7690f8e.

📒 Files selected for processing (9)
  • docs/superpowers/specs/2026-05-18-issue-027-design.md
  • docs/superpowers/specs/2026-05-18-issue-073-design.md
  • src/api/parse-warnings.integration.test.ts
  • src/api/parse.ts
  • src/ast/schemas.ts
  • src/ast/types.ts
  • src/parser/text/index.ts
  • src/parser/text/warnings.test.ts
  • tests/fixtures/text/anomaly-empty-part.txt
✅ Files skipped from review due to trivial changes (1)
  • tests/fixtures/text/anomaly-empty-part.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/ast/types.ts
  • src/ast/schemas.ts
  • src/parser/text/index.ts
  • src/api/parse-warnings.integration.test.ts

Comment thread src/api/parse.ts Outdated
…nown()

Worker output is external input — per error-handling guidelines, all
external input is Zod-validated with concrete schemas. The earlier
z.array(z.unknown()) defeated validation and allowed arbitrary warning
structures into the job-result envelope.

Per CodeRabbit review on PR #75.
@thewrz thewrz merged commit 92444a8 into main May 18, 2026
5 checks passed
@thewrz thewrz deleted the feat/issue-68 branch May 18, 2026 18:28
thewrz added a commit that referenced this pull request May 18, 2026
Status table: add 1c-iii..1c-viii and 1c-sec-i/ii rows covering PRs
#69 #70 #71 #72 #74 #75 #76. Updates 'Active development' subtitle
to reflect Phase 1c being complete.

Parsing section: add plaintext signal hardening (#70), parse-anomaly
warnings (#75), and DOCX resilience suite (#72) bullets.

MCP section: note POST /mcp rate limiting (#69).

Not Yet Built: strike completed items (DOCX cross-ref extraction in
PR #76, parse worker concurrency cap in PR #71). Add new known gap:
REST persistTree ignores extracted refs (follow-up to #53).

Development

Successfully merging this pull request may close these issues.

feat(parser): parse anomaly surfacing — meta.warnings on CsiTree, parse-warnings capability

1 participant