feat(mcp): add ooxml_package_part for OPC part metadata #7

Merged
caio-pizzol merged 3 commits into main from caio/ooxml-package-part on May 12, 2026

@caio-pizzol

The XSD schema graph and the prose corpus don't answer "what kind of OPC part is /customXml/item1.xml?" That's package metadata: content type, source relationship type, root namespace, typical path. Agents working with .docx / .xlsx / .pptx packages need it constantly and currently have to reconstruct it from prose search.

Adds ooxml_package_part backed by a curated static dataset of 25 OPC part types in apps/mcp-server/src/opc-parts.ts. Covers Word (document, styles, settings, numbering, comments, footnotes, endnotes, header, footer), Excel (workbook, worksheet, shared strings), PowerPoint (presentation, slide, slide layout, slide master), and cross-cutting (core / extended / custom properties, theme, image, custom XML data storage and its properties part).

Four lookup modes: exact content_type, exact relationship_type, query substring, or no args → list-all. Where the spec prose and XSD target namespace disagree (custom XML data storage properties part is named .../customXmlDataProps in §15.2.6 but the XSD targets .../customXml), rootNamespace pins the XSD URI so the value composes cleanly with ooxml_element.
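
A minimal sketch of how the four modes could dispatch, for reviewers who want a concrete picture. The `OpcPart` and `LookupArgs` shapes, the `lookupParts` name, the two sample entries, and the precedence order (content type, then relationship type, then query) are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical record shape; the real OpcPart type in opc-parts.ts may differ.
interface OpcPart {
  name: string;
  contentType: string;
  relationshipType: string;
  rootNamespace: string;
}

// Hypothetical argument shape mirroring the three optional inputs.
interface LookupArgs {
  content_type?: string;
  relationship_type?: string;
  query?: string;
}

const REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
const WML_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Two sample entries only; the dataset described above has 25.
const SAMPLE_PARTS: OpcPart[] = [
  {
    name: "Main Document",
    contentType:
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
    relationshipType: `${REL}/officeDocument`,
    rootNamespace: WML_NS,
  },
  {
    name: "Style Definitions",
    contentType:
      "application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml",
    relationshipType: `${REL}/styles`,
    rootNamespace: WML_NS,
  },
];

// Assumed precedence: content_type, then relationship_type, then query.
function lookupParts(args: LookupArgs, parts: OpcPart[] = SAMPLE_PARTS): OpcPart[] {
  if (args.content_type !== undefined) {
    // Mode 1: exact content type match.
    return parts.filter((p) => p.contentType === args.content_type);
  }
  if (args.relationship_type !== undefined) {
    // Mode 2: exact source relationship type match.
    return parts.filter((p) => p.relationshipType === args.relationship_type);
  }
  if (args.query !== undefined) {
    // Mode 3: case-insensitive substring over the text fields.
    const q = args.query.toLowerCase();
    return parts.filter((p) =>
      [p.name, p.contentType, p.relationshipType].some((field) =>
        field.toLowerCase().includes(q),
      ),
    );
  }
  // Mode 4: no arguments, list everything.
  return parts;
}
```

Calling `lookupParts({})` lists every entry, while `lookupParts({ query: "styles" })` narrows to the styles part through its content type.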

Static typed data, no DB. The set is small, static across ECMA editions, and curated; the PR diff is the audit primitive. Adding a new entry is appending to OPC_PARTS — the lookup index rebuilds lazily on first access.
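
To illustrate the lazy index, here is a rough sketch assuming a module-level cache that is filled on the first lookup. The names `contentTypeIndex` and `findByContentType`, the `key` field, and the single sample entry are hypothetical; the real index in opc-parts.ts may be organized differently.

```typescript
// Hypothetical minimal record; the real OpcPart carries more fields.
interface OpcPart {
  key: string;
  contentType: string;
}

// Stand-in for the curated array; a new part type is one more element here.
const OPC_PARTS: OpcPart[] = [
  {
    key: "wml-main-document",
    contentType:
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
  },
];

// Cache starts empty so that importing the module does no indexing work.
let cachedIndex: Map<string, OpcPart> | undefined;

// Builds the index on first access and reuses it afterwards.
function contentTypeIndex(): Map<string, OpcPart> {
  if (cachedIndex === undefined) {
    cachedIndex = new Map(OPC_PARTS.map((p) => [p.contentType, p] as const));
  }
  return cachedIndex;
}

// Exact lookup that goes through the lazily built index.
function findByContentType(contentType: string): OpcPart | undefined {
  return contentTypeIndex().get(contentType);
}
```

Because the map is derived from `OPC_PARTS` when first needed, appending an entry requires no separate index maintenance in this sketch.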

Hyperlinks are intentionally out of scope: they're a relationship type, not a package part. If needed later they'd warrant a different model.

Review: confirm the curated set covers your common cases; flag any wrong content type / relationship URI / namespace pins (these were transcribed from Part 1 §11.3.x / §12.3.x / §13.3.x / §15.x). Ignore the rest of ooxml-tools.ts — additive only.

Verified: 71 pass / 3 skip / 0 fail. Format / lint / typecheck / build all clean. (The 3 skips are the xsd-cache-gated smoke tests in tests/ingest-xsd/, unrelated to this PR.)

The XSD schema graph answers "what's legal inside this XML body?"
The prose corpus answers "what does this spec section say?" Neither
answers "what kind of OPC part is /customXml/item1.xml?" That's a
package-level concern: content type, source relationship type, root
namespace, typical path. Agents working with .docx / .xlsx / .pptx
packages reach for this constantly and have nowhere structural to land.

Adds `ooxml_package_part` backed by a curated static dataset of 25 OPC
part types from ECMA-376 Part 1 §11.3.x (WML), §12.3.x (SML), §13.3.x
(PML), §14.2.7.10 (theme), and §15.x (cross-cutting). Word covers
document, styles, settings, numbering, comments, footnotes, endnotes,
header, footer; Excel covers workbook, worksheet, shared strings;
PowerPoint covers presentation, slide, slide layout, slide master;
cross-cutting covers core / extended / custom properties, theme, image,
custom XML data storage, custom XML data storage properties.

Four lookup modes: exact content_type, exact relationship_type, query
substring, or no args → list-all. Where the spec prose and the XSD
target namespace disagree (the custom XML data storage properties part
is named .../customXmlDataProps in §15.2.6 but the shipped XSD targets
.../customXml), rootNamespace pins the XSD URI so the value composes
cleanly with ooxml_element.

Static typed data in apps/mcp-server/src/opc-parts.ts, no DB. The set
is small, static across ECMA editions, and curated; the PR diff is the
audit primitive. Add a new entry by appending to OPC_PARTS; the lookup
index rebuilds lazily.

Tests cover dataset consistency (unique keys, non-empty required
fields, every family represented), exact and substring lookups, and
the four tool dispatch modes. No DB needed for any of them.
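
The consistency checks listed above could look roughly like the following. The `Family` labels, the `key` and `family` fields, and the `checkDataset` helper are assumptions made for this sketch; the PR's tests may structure the checks differently.

```typescript
// Hypothetical family labels for WML, SML, PML, and cross-cutting parts.
type Family = "wml" | "sml" | "pml" | "shared";

// Hypothetical record shape with just the fields the checks need.
interface OpcPart {
  key: string;
  family: Family;
  contentType: string;
  relationshipType: string;
}

const ALL_FAMILIES: Family[] = ["wml", "sml", "pml", "shared"];

// Returns human-readable problems; an empty array means the dataset is consistent.
function checkDataset(parts: OpcPart[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const part of parts) {
    // Unique keys.
    if (seen.has(part.key)) problems.push(`duplicate key: ${part.key}`);
    seen.add(part.key);
    // Non-empty required fields.
    if (part.contentType.trim() === "") problems.push(`empty contentType: ${part.key}`);
    if (part.relationshipType.trim() === "") {
      problems.push(`empty relationshipType: ${part.key}`);
    }
  }
  // Every family represented.
  for (const family of ALL_FAMILIES) {
    if (!parts.some((p) => p.family === family)) {
      problems.push(`no entries for family: ${family}`);
    }
  }
  return problems;
}
```

A test can then assert that `checkDataset` returns an empty array for the real dataset, which keeps the failure message specific when an entry is mistyped.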

Three issues from PR review:

- relationship_type lookup collapsed shared rels. The .../relationships/
  officeDocument URI points at the main part for WML, SML, and PML, but
  the Map<string, OpcPart> index let later entries overwrite earlier
  ones, so a lookup returned only the Presentation part. Index is now
  Map<string, OpcPart[]>; the dispatcher renders multi-match as a list
  with a note that the relationship is shared across families and the
  caller has to disambiguate by the source part.
- Image content type was a wildcard display string. Real
  [Content_Types].xml entries record a specific media type per image
  (image/png, image/jpeg, ...) so an exact lookup against the display
  string never matched. contentType is now `string | string[]`; the
  Image Part enumerates the spec-§15.2.13 set (png, jpeg, gif, tiff,
  x-emf, x-wmf, bmp). Each entry is indexed; the formatter renders
  multi-content-type records under a plural label with a "+N more"
  indicator in the list view.
- initialize handler and apps/mcp-server/README.md still advertised
  two tool families and omitted ooxml_package_part, hurting agent
  discoverability. Both updated to list three tool families and
  describe the package-metadata corpus.
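
The first two fixes can be sketched together: a multi-valued index keeps every part that shares a key, and a `string | string[]` content type is normalized before indexing. The `buildIndexes` and `pushInto` helpers and the sample entries are hypothetical, so treat this as an illustration of the idea rather than the merged code.

```typescript
// Hypothetical record shape reflecting the `string | string[]` contract.
interface OpcPart {
  name: string;
  contentType: string | string[];
  relationshipType: string;
}

const REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

// Sample entries: three main parts share one relationship type.
const PARTS: OpcPart[] = [
  {
    name: "Main Document",
    contentType:
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
    relationshipType: `${REL}/officeDocument`,
  },
  {
    name: "Workbook",
    contentType:
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml",
    relationshipType: `${REL}/officeDocument`,
  },
  {
    name: "Presentation",
    contentType:
      "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml",
    relationshipType: `${REL}/officeDocument`,
  },
  {
    name: "Image",
    contentType: [
      "image/png", "image/jpeg", "image/gif", "image/tiff",
      "image/x-emf", "image/x-wmf", "image/bmp",
    ],
    relationshipType: `${REL}/image`,
  },
];

// Appends to the bucket for `key` instead of overwriting an earlier entry.
function pushInto(index: Map<string, OpcPart[]>, key: string, part: OpcPart): void {
  const bucket = index.get(key);
  if (bucket) {
    bucket.push(part);
  } else {
    index.set(key, [part]);
  }
}

function buildIndexes(parts: OpcPart[]) {
  const byRelationship = new Map<string, OpcPart[]>();
  const byContentType = new Map<string, OpcPart[]>();
  for (const part of parts) {
    pushInto(byRelationship, part.relationshipType, part);
    // Normalize the single-string case so every media type gets its own key.
    const types = Array.isArray(part.contentType) ? part.contentType : [part.contentType];
    for (const type of types) {
      pushInto(byContentType, type, part);
    }
  }
  return { byRelationship, byContentType };
}
```

With a plain `Map<string, OpcPart>`, the last of the three `officeDocument` entries would win; with array buckets, the lookup returns all three and the caller can disambiguate by source part.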

New tests cover (a) every enumerated image media type resolving exactly,
(b) the shared officeDocument relationship returning all three main
parts, and (c) the tool's multi-match rendering for shared rels.
Existing tests updated for the new helper name / array contract.

The previous fix updated the server README and initialize text but
missed the web-facing surfaces, which still advertised two tool
families. Bringing every surface in sync:

- apps/web/src/pages/Mcp.tsx: hero copy updated, added a Package
  metadata section, refreshed the trailing "what is MCP" paragraph.
- apps/web/public/llms.txt: feeds llms.txt/llms-full.txt that AI
  crawlers and the build-time SEO pipeline consume.
- apps/mcp-server/src/index.ts: header comment in the worker entry.
- README.md + CLAUDE.md: project-level docs.
- brand.md: brand-voice copy that lists the MCP as an AI-native
  differentiator.

No behavior change; everything in this commit is documentation /
agent-discoverability surface.

caio-pizzol merged commit 47b61c9 into main on May 12, 2026
1 check passed