fix(math): split multi-char m:r text per character to match Word (SD-2632)#2875
Conversation
…2632) Word's OMML2MML.XSL classifies each character in an m:r run individually — digits group into a single <mn> (with one optional decimal point between digits), operator characters each become their own <mo>, and everything else becomes its own <mi>. SuperDoc was emitting the entire run text as one element, so runs like "→∞" or "x+1" rendered as a single <mi>, losing operator spacing and semantics. tokenizeMathText implements the per-character classification. convertMathRun returns a single Element for one-atom runs and a DocumentFragment when multiple atoms are emitted, so siblings flow directly into the parent's <mrow> without an extra wrapper. m:fName is the documented exception — Word keeps multi-letter function names like "sin" or "lim" as one <mi> inside the function-name slot. convertFunction routes m:r children through convertMathRunWhole (no splitting), and a new collapseFunctionNameBases pass re-merges the base slot of any structural MathML element (munder/mover/msub/…) that Word nested inside m:fName — without this, "lim" inside m:limLow would incorrectly split to three <mi>. Also drops U+221E (∞) from OPERATOR_CHARS — it's a mathematical constant per Word's XSL, not an operator. MathObjectConverter's return type widens from Element|null to Node|null so convertMathRun can return a DocumentFragment. All other converters already return Element, which is assignable to Node — no other changes. Verified against real Word-native fixtures: `→∞` in the limit-tests fixture case 1 now renders as <mi>n</mi><mo>→</mo><mi>∞</mi> (matches Word OMML2MML.XSL byte-for-byte), and nested limits keep their function names intact. Ref ECMA-376 §22.1.2.116, Annex L.6.1.13, §22.1.2.58.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9ceec55257
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…pts (SD-2632) Follow-up to /review feedback backed by fresh Word OMML2MML.XSL evidence: - `convertMathRunAsFunctionName` (renamed from `convertMathRunWhole` since "whole" no longer fits) now groups consecutive non-digit / non-operator characters into one <mi> while still splitting digits and operators. Word's XSL for `<m:fName><m:r>log_2</m:r></m:fName>` produces `<mi>log</mi><mo>_</mo><mn>2</mn>` — not `<mi>l</mi><mi>o</mi><mi>g</mi>…`. - `BASE_BEARING_ELEMENTS` gains `mmultiscripts` — Word emits it when an `m:sPre` sits inside `m:fName`; our base-collapse pass needs to know to merge the first-child <mi> run. - CONTRIBUTING.md now documents the widened `Node | null` return type. Tests added: - Direct `tokenizeMathText` edge cases: `.5` / `5.` / `1.2.3` / `2x+1` / consecutive operators / empty / standalone ∞. - m:fName mixed-content: `log_2` stays `<mi>log</mi><mo>_</mo><mn>2</mn>`. - Base collapse inside nested `m:sSub` under `m:fName`. - Base collapse inside nested `m:sPre` (mmultiscripts) under `m:fName`. - Behavior test tightened to pin the full 3-atom sequence for `n→∞`. Disputed during review and deferred with evidence: - Opus claim that standalone `m:limLow` with "lim" base regresses to italic: Word XSL itself splits "lim" per-char in that shape (with or without m:sty=p), so our output matches Word. - Codex claim that Arabic-Indic digits should be `<mn>`: Word XSL also classifies them as `<mi>`, so our behavior matches. - Non-BMP surrogate-pair support: edge case in extreme mathematical alphanumerics; Word XSL itself errored on U+1D465. Separate ticket worth.
Addresses Codex bot review on PR #2875. Astral-plane mathematical alphanumerics (e.g. U+1D465 mathematical italic x, U+1D7D9 mathematical double-struck 1) are UTF-16 surrogate pairs. Walking text by code unit split them into two half-pair <mi> atoms with invalid content. `codePointUnitLength` returns 2 when the current position starts a surrogate pair so tokenizeMathText and tokenizeFunctionNameText step across the full code point.
…s-render-as-mi-instead-of
…s-render-as-mi-instead-of
|
🎉 This PR is included in vscode-ext v2.3.0-next.31 |
|
🎉 This PR is included in @superdoc-dev/react v1.2.0-next.28 The release is available on GitHub release |
|
🎉 This PR is included in superdoc-cli v0.7.0-next.32 The release is available on GitHub release |
|
🎉 This PR is included in superdoc v1.26.0-next.31 The release is available on GitHub release |
|
🎉 This PR is included in superdoc-sdk v1.6.0-next.30 |
Word's OMML→MathML converter classifies each character in an
m:rrun individually — consecutive digits group into one<mn>, each operator character becomes its own<mo>, everything else becomes its own<mi>. SuperDoc was emitting the full run text as a single element, so→∞came out as<mi>→∞</mi>and lost operator spacing and semantics.tokenizeMathTextdoes per-char classification with digit grouping (one optional decimal point).convertMathRunreturns aDocumentFragmentwhen a run splits, so siblings flow directly into the parent<mrow>with no extra wrapper.m:fNameis the documented exception — multi-letter function names likesin/limstay whole.convertFunctionroutesm:rchildren through a newconvertMathRunWholehelper, andcollapseFunctionNameBasesre-merges the base slot of any structural element Word nested insidem:fName(e.g.m:limLow→<munder>).MathObjectConverter's return widens fromElement|nulltoNode|nullso the fragment return works without casting. All 19 other converters already returnElement, which is assignable.U+221E(∞) fromOPERATOR_CHARS— it's a constant per Word, not an operator.Verified against Word's own
OMML2MML.XSLacross 15+ input shapes — byte-identical output for bare math, and matching whole-name behavior insidem:fName.Rejected: narrowing the split to "only when the run contains an operator" (Option B in investigation). It would have fixed the reported
→∞case without cascading, but left SuperDoc rendering<mi>sin</mi>vs Word<mi>s</mi><mi>i</mi><mi>n</mi>for bare function names — and downstream rendering issues (operator spacing inside the run) would keep surfacing. Chose full parity.Review: the
collapseFunctionNameBasespass is the subtlest part — it walks the post-split tree and re-merges<mi>siblings in structural bases. Confirm theBASE_BEARING_ELEMENTSset covers every shape Word nests underm:fNamethat we've actually seen. Also worth checking that a same-variant requirement is correct (currently only merges when all siblings share the same mathvariant).Verified: 172/172 unit tests, 79/79 behavior tests, type-check clean. Pre-existing 3
m:groupChrtests updated — theirmunder.querySelector('mo')descendant-search now picks up split operators inside the base expression, so they were re-scoped to:scope > mo(the group character, which is what they were actually testing).