fix(extract): return tables as paragraph-granular blocks (SD-2672) #2925

Merged
caio-pizzol merged 7 commits into main from
caio/sd-2672-table-content-from-extract-is-returned-as-one-joined-string
Apr 23, 2026

@caio-pizzol
Contributor

doc.extract() was flattening tables into one joined string, so table content could not be chunked for RAG and its block IDs did not work with scrollToElement. This walks tables directly and emits one block per paragraph-like descendant of each origin cell, tagged with a tableContext so callers can group back to cell, row, or whole table.

  • Block SDTs are transparent in the extract walk, so tables wrapped in content controls no longer re-flatten through the wrapper's textContent.
  • gridBefore/gridAfter placeholder cells are skipped via the __placeholder attr, which avoids phantom empty blocks for any row that doesn't span the full grid.
  • Cell paths use physical row and cell child indexes so deterministic fallback nodeIds agree with buildBlockIndex, keeping scrollToElement stable for paragraphs that lack both paraId and sdBlockId inside horizontally merged tables.

One consumer-facing change worth naming: extraction is paragraph-granular inside tables rather than one object per cell. Group by tableOrdinal for a whole table, add rowIndex for a row, and add columnIndex for a cell. Block SDTs no longer emit a wrapper block of their own; their children emit individually.
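
A minimal sketch of that grouping, assuming each extracted block carries a `tableContext` with the `tableOrdinal`, `rowIndex`, and `columnIndex` fields named above. The `ExtractedBlock` shape, its `nodeId` and `text` fields, and the `groupTableBlocks` helper are hypothetical and are not taken from the actual document-api types.

```typescript
// Hypothetical block shape: only tableOrdinal, rowIndex, and columnIndex
// are named in this PR; the other fields are assumptions for illustration.
interface TableContext {
  tableOrdinal: number;
  rowIndex: number;
  columnIndex: number;
}

interface ExtractedBlock {
  nodeId: string;
  text: string;
  tableContext?: TableContext;
}

type Level = 'table' | 'row' | 'cell';

// Build a grouping key at the requested granularity.
function groupKey(ctx: TableContext, level: Level): string {
  const parts: number[] = [ctx.tableOrdinal];
  if (level !== 'table') parts.push(ctx.rowIndex);
  if (level === 'cell') parts.push(ctx.columnIndex);
  return parts.join(':');
}

// Group table blocks by table, row, or cell. Blocks outside tables are skipped.
function groupTableBlocks(blocks: ExtractedBlock[], level: Level): Map<string, ExtractedBlock[]> {
  const groups = new Map<string, ExtractedBlock[]>();
  for (const block of blocks) {
    if (!block.tableContext) continue;
    const key = groupKey(block.tableContext, level);
    const bucket = groups.get(key) ?? [];
    bucket.push(block);
    groups.set(key, bucket);
  }
  return groups;
}

// Made-up sample: cell (0,0) holds two paragraphs, so it yields two blocks.
const sampleBlocks: ExtractedBlock[] = [
  { nodeId: 'p0', text: 'Intro paragraph' },
  { nodeId: 'p1', text: 'Term', tableContext: { tableOrdinal: 0, rowIndex: 0, columnIndex: 0 } },
  { nodeId: 'p2', text: 'Definition', tableContext: { tableOrdinal: 0, rowIndex: 0, columnIndex: 0 } },
  { nodeId: 'p3', text: 'Value', tableContext: { tableOrdinal: 0, rowIndex: 0, columnIndex: 1 } },
  { nodeId: 'p4', text: 'Other', tableContext: { tableOrdinal: 0, rowIndex: 1, columnIndex: 0 } },
];

const byTable = groupTableBlocks(sampleBlocks, 'table');
const byRow = groupTableBlocks(sampleBlocks, 'row');
const byCell = groupTableBlocks(sampleBlocks, 'cell');
```

For RAG chunking, joining the `text` of each `byCell` or `byRow` bucket would give one chunk per cell or per row, while each block keeps its own `nodeId` for citation.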

Verified: 13 behavior tests pass (7 existing SD-2525 + 6 new SD-2672), 5 new adapter unit tests cover placeholder skip, grid coords across merges, SDT transparency, and fallback-path consistency with buildBlockIndex. Full document-api-adapters suite (3105 tests) and document-api bun suite (1362 tests) still pass.

Closes SD-2672. Blocks IT-962 (customer intake).

doc.extract() was flattening tables into one joined string, which broke
RAG chunking and made table citations unreachable via scrollToElement.
Walk tables directly and emit one block per paragraph-like descendant
of each origin cell, tagged with tableContext so consumers can group
back to cell, row, or whole table.

- gridBefore/gridAfter placeholder cells are skipped via the
  __placeholder attr; they are layout artifacts with no user content.
- Block SDTs (structuredContentBlock) are transparent, so tables
  wrapped in content controls are not re-flattened through the
  wrapper's textContent.
- Cell paths use physical row-and-cell child indexes so deterministic
  fallback nodeIds agree with buildBlockIndex, keeping the
  scrollToElement round-trip stable for paragraphs that lack paraId
  and sdBlockId inside horizontally merged tables.

Tested: 13 behavior tests (7 existing SD-2525 + 6 new SD-2672),
5 new adapter unit tests, plus the full document-api-adapters suite
(3105 tests) and document-api bun suite (1362 tests).

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28bfa07b98

Comment thread packages/super-editor/src/editors/v1/document-api-adapters/extract-adapter.ts Outdated
The new table walker only emitted blocks for recognized types and
silently dropped anything else, including their block children. That
regressed coverage versus the old textContent walk for `documentSection`,
`documentPartObject`, and `shapeContainer`, which all declare
block-level content but aren't in EMITTABLE_BLOCK_TYPES. Treat any
unrecognized block with block-level children as transparent and recurse
into it, so paragraphs nested inside these wrappers still surface with
their enclosing tableContext. Adds a unit test covering a
`documentSection` inside a table cell.
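
The recursion rule described in this commit can be sketched as follows. This is a simplified model rather than the code in extract-adapter.ts: the `SketchNode` shape, the `isBlock` flag, the contents of `EMITTABLE_BLOCK_TYPES`, and the `collectCellBlocks` helper are all assumptions made for illustration.

```typescript
// Simplified stand-in for a ProseMirror-like node; not the real PM API.
interface SketchNode {
  type: string;
  isBlock: boolean;
  text?: string;
  children?: SketchNode[];
}

// Assumed contents; the real EMITTABLE_BLOCK_TYPES set is not shown in this PR.
const EMITTABLE_BLOCK_TYPES = new Set<string>(['paragraph', 'heading', 'listItem']);

// Concatenate the text of a node and all of its descendants.
function textOf(node: SketchNode): string {
  if (node.text !== undefined) return node.text;
  return (node.children ?? []).map(textOf).join('');
}

// Collect one entry per emittable block under a cell. Any other block that
// has block-level children is treated as transparent and recursed into.
function collectCellBlocks(node: SketchNode, out: string[] = []): string[] {
  for (const child of node.children ?? []) {
    if (EMITTABLE_BLOCK_TYPES.has(child.type)) {
      out.push(textOf(child));
    } else if ((child.children ?? []).some((c) => c.isBlock)) {
      collectCellBlocks(child, out);
    }
    // Unrecognized nodes without block children are still dropped in this sketch.
  }
  return out;
}

// Helper to build a paragraph with a single text child.
function para(text: string): SketchNode {
  return { type: 'paragraph', isBlock: true, children: [{ type: 'text', isBlock: false, text }] };
}

// A cell holding a plain paragraph, a documentSection wrapper, and a block SDT.
const sampleCell: SketchNode = {
  type: 'tableCell',
  isBlock: true,
  children: [
    para('A'),
    { type: 'documentSection', isBlock: true, children: [para('B'), para('C')] },
    { type: 'structuredContentBlock', isBlock: true, children: [para('D')] },
  ],
};

const cellTexts = collectCellBlocks(sampleCell);
```

In this model the wrappers contribute no block of their own, and the paragraphs inside them surface in document order alongside the cell's direct paragraph.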
…SD-2672)

The adapter unit tests hit the algorithm via schema-constructed PM docs,
which skips the importer entirely. This adds a second layer of tests
that load real Word-authored .docx files, run them through the full
import pipeline, and assert extract output. Closes the gap the code
review flagged for a customer-facing legal RAG contract.

Fixtures authored via Word COM + local OOXML patching:
- sd-2672-plain-3x3.docx: baseline table, no merges or placeholders
- sd-2672-merged-table.docx: colspan=2 and rowspan=2 anchors
- sd-2672-rtl-table.docx: bidiVisual RTL table
- sd-2672-gridbefore-vmerge.docx: w:gridBefore + w:vMerge=restart/continue
- sd-2672-sdt-table.docx: table wrapped in a w:sdt block (content control)
- sd-2672-nested-table.docx: 2x2 table inside cell (1,1) of outer table
- sd-2672-multipara-cell.docx: cell (0,0) with two paragraphs

The build-sd-2672-fixtures.mjs script regenerates the patched variants
from the Word-authored base, using JSZip + regex/XmlDocument surgery.

Tests assert: per-cell content lands at correct logical grid coords,
merged anchors carry rowspan/colspan, RTL tables still report columns
0..N-1, gridBefore placeholders don't emit phantom blocks, SDT wrappers
are transparent, nested tables get a fresh tableOrdinal with parent
coordinates, multi-paragraph cells emit one block per paragraph with
shared tableContext, and scrollToElement round-trips a merged-cell
paragraph nodeId.
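
The distinction between physical child indexes and logical grid coordinates that these tests exercise can be illustrated with a small sketch. It assumes a model where a spanning cell appears once, at its anchor position, and later rows simply omit the covered slots; the `PhysicalCell` shape, its `placeholder` flag (standing in for the `__placeholder` attr), and the `mapGridCoords` helper are hypothetical.

```typescript
// Hypothetical physical cell as it sits in a row's child list.
interface PhysicalCell {
  colspan?: number;
  rowspan?: number;
  placeholder?: boolean; // stands in for the __placeholder attr
}

interface CellCoords {
  physicalRow: number;   // child index of the row in the table
  physicalCell: number;  // child index of the cell in the row
  rowIndex: number;      // logical grid row
  columnIndex: number;   // logical grid column
}

// Walk rows in order, tracking which grid slots are already covered by
// earlier spans, and report coordinates for every non-placeholder cell.
function mapGridCoords(rows: PhysicalCell[][]): CellCoords[] {
  const occupied = new Set<string>();
  const result: CellCoords[] = [];
  rows.forEach((cells, r) => {
    let col = 0;
    cells.forEach((cell, c) => {
      // Skip slots covered by a rowspan from a previous row.
      while (occupied.has(`${r}:${col}`)) col++;
      const colspan = cell.colspan ?? 1;
      const rowspan = cell.rowspan ?? 1;
      for (let dr = 0; dr < rowspan; dr++) {
        for (let dc = 0; dc < colspan; dc++) {
          occupied.add(`${r + dr}:${col + dc}`);
        }
      }
      // Placeholders take up grid space but emit nothing.
      if (!cell.placeholder) {
        result.push({ physicalRow: r, physicalCell: c, rowIndex: r, columnIndex: col });
      }
      col += colspan;
    });
  });
  return result;
}

// Row 0: a colspan=2 anchor, then a rowspan=2 anchor in grid column 2.
// Row 1: a gridBefore-style placeholder, then one ordinary cell.
const coords = mapGridCoords([
  [{ colspan: 2 }, { rowspan: 2 }],
  [{ placeholder: true }, {}],
]);
```

Here the second cell of row 0 has physical index 1 but logical column 2, which is why a path built from physical indexes and a `tableContext` built from grid coordinates have to be computed separately.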
The script was added alongside the fixtures to regenerate the OOXML-patched
variants from a Word-authored base. It isn't carrying its weight: fixtures
are committed as static binaries, the regex-based XML patching is fragile
to Word COM output changes, and the commit history already documents how
each fixture was constructed. If we need a new edge-case fixture later,
hand-authoring it once is simpler than maintaining a generator.
…ntent-from-extract-is-returned-as-one-joined-string

# Conflicts:
#	apps/docs/document-api/reference/_generated-manifest.json
#	apps/docs/document-api/reference/extract.mdx
@caio-pizzol caio-pizzol force-pushed the caio/sd-2672-table-content-from-extract-is-returned-as-one-joined-string branch 2 times, most recently from d1f0ba0 to 6873edd April 23, 2026 17:10
@caio-pizzol caio-pizzol enabled auto-merge (squash) April 23, 2026 17:28
@caio-pizzol caio-pizzol disabled auto-merge April 23, 2026 17:28
@caio-pizzol caio-pizzol merged commit d0a36c2 into main Apr 23, 2026
68 checks passed
@caio-pizzol caio-pizzol deleted the caio/sd-2672-table-content-from-extract-is-returned-as-one-joined-string branch April 23, 2026 17:28
@superdoc-bot
Contributor

superdoc-bot Bot commented Apr 23, 2026

🎉 This PR is included in:

  • vscode-ext v2.3.0-next.48
  • esign v2.3.0-next.48
  • @superdoc-dev/react v1.2.0-next.46
  • template-builder v1.6.0-next.11
  • superdoc v1.28.0-next.11
  • superdoc-cli v0.8.0-next.11
  • superdoc-sdk v1.6.0-next.47

superdoc-bot Bot commented Apr 24, 2026

🎉 This PR is included in superdoc v1.29.0

superdoc-bot Bot commented Apr 25, 2026

🎉 This PR is included in:

  • @superdoc-dev/mcp v0.3.0-next.1
  • superdoc-sdk v1.8.0-next.1
