fix(extract): return tables as paragraph-granular blocks (SD-2672) #2925
doc.extract() was flattening tables into one joined string, which broke RAG chunking and made table citations unreachable via scrollToElement. Walk tables directly and emit one block per paragraph-like descendant of each origin cell, tagged with tableContext so consumers can group back to cell, row, or whole table.

- gridBefore/gridAfter placeholder cells are skipped via the __placeholder attr; they are layout artifacts with no user content.
- Block SDTs (structuredContentBlock) are transparent, so tables wrapped in content controls are not re-flattened through the wrapper's textContent.
- Cell paths use physical row-and-cell child indexes so deterministic fallback nodeIds agree with buildBlockIndex, keeping the scrollToElement round-trip stable for paragraphs that lack paraId and sdBlockId inside horizontally merged tables.

Tested: 13 behavior tests (7 existing SD-2525 + 6 new SD-2672), 5 new adapter unit tests, plus the full document-api-adapters suite (3105 tests) and document-api bun suite (1362 tests).
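The walk described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the adapter's actual code: `SketchCell`, `SketchBlock`, `walkTableSketch`, and the `path` shape are hypothetical names, the table is modeled as plain arrays rather than a ProseMirror document, placeholder cells are assumed to occupy one grid column each, and vertical merges are ignored for brevity.

```typescript
// Hypothetical, simplified cell model (not the real ProseMirror node).
interface SketchCell {
  placeholder?: boolean; // stands in for the __placeholder attr
  colspan?: number;      // horizontal merge width, defaults to 1
  paragraphs: string[];  // paragraph-like descendants of the cell
}

interface SketchBlock {
  text: string;
  // Physical indexes: [row child index, cell child index, paragraph index].
  path: number[];
  // Logical grid coordinates, using the field names from the PR text.
  tableContext: { tableOrdinal: number; rowIndex: number; columnIndex: number };
}

function walkTableSketch(rows: SketchCell[][], tableOrdinal: number): SketchBlock[] {
  const blocks: SketchBlock[] = [];
  rows.forEach((row, rowIndex) => {
    let columnIndex = 0; // logical grid column, advanced by each cell's span
    row.forEach((cell, cellChildIndex) => {
      if (!cell.placeholder) {
        // One block per paragraph; all paragraphs of a cell share coordinates.
        cell.paragraphs.forEach((text, paragraphIndex) => {
          blocks.push({
            text,
            path: [rowIndex, cellChildIndex, paragraphIndex],
            tableContext: { tableOrdinal, rowIndex, columnIndex },
          });
        });
      }
      // Placeholders emit nothing but still consume grid columns (assumption).
      columnIndex += cell.colspan || 1;
    });
  });
  return blocks;
}
```

The point of keeping `path` physical while `tableContext` is logical is that the two can diverge as soon as a row contains a placeholder or a horizontally merged cell, which is the case the fallback nodeId fix targets.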
Codecov Report: ✅ All modified and coverable lines are covered by tests.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 28bfa07b98
The new table walker only emitted blocks for recognized types and silently dropped anything else, including their block children. That regressed coverage versus the old textContent walk for `documentSection`, `documentPartObject`, and `shapeContainer`, which all declare block-level content but aren't in EMITTABLE_BLOCK_TYPES. Treat any unrecognized block with block-level children as transparent and recurse into it, so paragraphs nested inside these wrappers still surface with their enclosing tableContext. Adds a unit test covering a `documentSection` inside a table cell.
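A minimal sketch of the "transparent wrapper" rule from this commit, assuming a simplified node model. `SketchNode`, `EMITTABLE_SKETCH_TYPES`, and `collectBlocksSketch` are hypothetical stand-ins; the real EMITTABLE_BLOCK_TYPES contents and node API are not shown in this PR text.

```typescript
// Hypothetical, simplified node model (not the real ProseMirror API).
interface SketchNode {
  type: string;
  isBlock: boolean;
  text?: string;
  children?: SketchNode[];
}

// Stand-in for EMITTABLE_BLOCK_TYPES; the real list is an assumption here.
const EMITTABLE_SKETCH_TYPES: string[] = ["paragraph"];

function collectBlocksSketch(node: SketchNode, out: string[] = []): string[] {
  if (EMITTABLE_SKETCH_TYPES.indexOf(node.type) !== -1) {
    // Recognized type: emit it and stop descending.
    out.push(node.text || "");
    return out;
  }
  const children = node.children || [];
  const hasBlockChildren = children.some((child) => child.isBlock);
  if (node.isBlock && hasBlockChildren) {
    // Unrecognized block with block-level children: treat it as transparent
    // and recurse, so nested paragraphs still surface.
    for (const child of children) {
      collectBlocksSketch(child, out);
    }
  }
  // Anything else (e.g. an unrecognized leaf block) emits nothing.
  return out;
}
```

With this rule, a paragraph nested in a `documentSection` inside a cell is reached by recursion instead of being dropped along with its wrapper.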
…SD-2672)

The adapter unit tests hit the algorithm via schema-constructed PM docs, which skips the importer entirely. This adds a second layer of tests that load real Word-authored .docx files, run them through the full import pipeline, and assert extract output. Closes the gap the code review flagged for a customer-facing legal RAG contract.

Fixtures authored via Word COM + local OOXML patching:

- sd-2672-plain-3x3.docx: baseline table, no merges or placeholders
- sd-2672-merged-table.docx: colspan=2 and rowspan=2 anchors
- sd-2672-rtl-table.docx: bidiVisual RTL table
- sd-2672-gridbefore-vmerge.docx: w:gridBefore + w:vMerge=restart/continue
- sd-2672-sdt-table.docx: table wrapped in a w:sdt block (content control)
- sd-2672-nested-table.docx: 2x2 table inside cell (1,1) of outer table
- sd-2672-multipara-cell.docx: cell (0,0) with two paragraphs

The build-sd-2672-fixtures.mjs script regenerates the patched variants from the Word-authored base, using JSZip + regex/XmlDocument surgery.

Tests assert:

- per-cell content lands at correct logical grid coords
- merged anchors carry rowspan/colspan
- RTL tables still report columns 0..N-1
- gridBefore placeholders don't emit phantom blocks
- SDT wrappers are transparent
- nested tables get a fresh tableOrdinal with parent coordinates
- multi-paragraph cells emit one block per paragraph with shared tableContext
- scrollToElement round-trips a merged-cell paragraph nodeId
The script was added alongside the fixtures to regenerate the OOXML-patched variants from a Word-authored base. It isn't carrying its weight: fixtures are committed as static binaries, the regex-based XML patching is fragile to Word COM output changes, and the commit history already documents how each fixture was constructed. If we need a new edge-case fixture later, hand-authoring it once is simpler than maintaining a generator.
…ntent-from-extract-is-returned-as-one-joined-string

# Conflicts:
#	apps/docs/document-api/reference/_generated-manifest.json
#	apps/docs/document-api/reference/extract.mdx
d1f0ba0 to 6873edd
…eturned-as-one-joined-string
🎉 This PR is included in vscode-ext v2.3.0-next.48
🎉 This PR is included in esign v2.3.0-next.48
🎉 This PR is included in @superdoc-dev/react v1.2.0-next.46
🎉 This PR is included in template-builder v1.6.0-next.11
🎉 This PR is included in superdoc v1.28.0-next.11
🎉 This PR is included in superdoc-cli v0.8.0-next.11
🎉 This PR is included in superdoc-sdk v1.6.0-next.47
🎉 This PR is included in superdoc v1.29.0
🎉 This PR is included in @superdoc-dev/mcp v0.3.0-next.1
🎉 This PR is included in superdoc-sdk v1.8.0-next.1
doc.extract() was flattening tables into one joined string, so table content could not be chunked for RAG and its block IDs did not work with scrollToElement. This walks tables directly and emits one block per paragraph-like descendant of each origin cell, tagged with a tableContext so callers can group back to cell, row, or whole table.
One consumer-facing change worth naming: extraction is paragraph-granular inside tables rather than one object per cell. To regroup, key on tableOrdinal for a whole table, tableOrdinal + rowIndex for a row, and tableOrdinal + rowIndex + columnIndex for a cell. Block SDTs no longer emit a wrapper block of their own; their children emit individually.
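For consumers, the regrouping could look something like the sketch below. The field names tableOrdinal, rowIndex, and columnIndex come from this PR; the rest of the block shape (`nodeId`, `text`, an optional `tableContext`) and the helper names `groupKeySketch` and `groupTableBlocksSketch` are assumptions for illustration, not the documented extract() return type.

```typescript
// Assumed block shape; only the three tableContext field names are from the PR.
interface TableContextSketch {
  tableOrdinal: number;
  rowIndex: number;
  columnIndex: number;
}

interface ExtractedBlockSketch {
  nodeId: string;
  text: string;
  tableContext?: TableContextSketch; // assumed absent for non-table blocks
}

type GranularitySketch = "table" | "row" | "cell";

// Build a composite key: each finer granularity appends one more coordinate.
function groupKeySketch(ctx: TableContextSketch, granularity: GranularitySketch): string {
  if (granularity === "table") return String(ctx.tableOrdinal);
  if (granularity === "row") return ctx.tableOrdinal + ":" + ctx.rowIndex;
  return ctx.tableOrdinal + ":" + ctx.rowIndex + ":" + ctx.columnIndex;
}

function groupTableBlocksSketch(
  blocks: ExtractedBlockSketch[],
  granularity: GranularitySketch,
): Record<string, ExtractedBlockSketch[]> {
  const groups: Record<string, ExtractedBlockSketch[]> = {};
  for (const block of blocks) {
    if (!block.tableContext) continue; // skip blocks outside any table
    const key = groupKeySketch(block.tableContext, granularity);
    (groups[key] = groups[key] || []).push(block);
  }
  return groups;
}
```

A RAG pipeline might chunk at "cell" granularity for citation precision and fall back to "row" when individual cells are too short to embed usefully; each block keeps its own nodeId either way.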
Verified: 13 behavior tests pass (7 existing SD-2525 + 6 new SD-2672); 5 new adapter unit tests cover placeholder skip, grid coords across merges, SDT transparency, and fallback-path consistency with buildBlockIndex. The full document-api-adapters suite (3105 tests) and document-api bun suite (1362 tests) still pass.
Closes SD-2672. Blocks IT-962 (customer intake).