fix(web): strip CJK punctuation from auto-linked URLs #479
remark-gfm auto-links bare URLs but only handles ASCII trailing punctuation. When a URL is followed by CJK punctuation like ,or 。 without whitespace, the punctuation gets included in the link. Add a remark plugin that walks the MDAST after GFM and moves any trailing CJK/fullwidth punctuation out of the link node into a sibling text node. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
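The core move described above (peel trailing CJK punctuation off a link node and put it in a sibling text node) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the node types are declared locally rather than imported from `mdast`, the function name `stripTrailingCjkPunct` is hypothetical, and the strip set is an assumed subset limited to prose punctuation.

```typescript
// Minimal MDAST-like node shapes, declared locally so the sketch is self-contained.
// (Assumption: the real plugin would use the types from the `mdast` package.)
type TextNode = { type: "text"; value: string };
type LinkNode = { type: "link"; url: string; children: MdNode[] };
type ParentNode = { type: "paragraph" | "root"; children: MdNode[] };
type MdNode = TextNode | LinkNode | ParentNode;

// Assumed strip set: trailing prose punctuation only.
const TRAILING_CJK_PUNCT = /[,。、;:!?]+$/;

// Hypothetical helper: walks the tree and moves trailing CJK punctuation
// from each link's URL (and visible text) into a new sibling text node.
function stripTrailingCjkPunct(parent: ParentNode | LinkNode): void {
  for (let i = 0; i < parent.children.length; i++) {
    const child = parent.children[i];
    if (!child || child.type === "text") continue;
    if (child.type !== "link") {
      stripTrailingCjkPunct(child); // recurse into paragraphs and other parents
      continue;
    }

    const match = TRAILING_CJK_PUNCT.exec(child.url);
    if (!match) continue;
    const tail = match[0];

    // Shorten the URL and, if it carries the same tail, the link's visible text.
    child.url = child.url.slice(0, -tail.length);
    const last = child.children[child.children.length - 1];
    if (last && last.type === "text" && last.value.endsWith(tail)) {
      last.value = last.value.slice(0, -tail.length);
    }

    // Re-insert the punctuation as plain text right after the link.
    parent.children.splice(i + 1, 0, { type: "text", value: tail });
    i++; // skip the text node that was just inserted
  }
}
```

Note that this sketch touches every link node it finds; it makes no attempt to tell auto-links apart from explicitly authored `[text](url)` links.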
Deploying hapi with
| Latest commit: | b92ecee |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://893acafb.hapi-bqd.pages.dev |
| Branch Preview URL: | https://fix-cjk-autolink-punctuation.hapi-bqd.pages.dev |
Findings
- [Major] Transformer rewrites non-autolink markdown links and can truncate legitimate CJK URL suffixes. It currently runs on every `link` node and the strip set includes paired suffixes like )】」》, so an explicitly-authored link ending with one of those characters will be silently mutated. Evidence: web/src/lib/remark-strip-cjk-autolink.ts:14, web/src/lib/remark-strip-cjk-autolink.ts:29.
Summary
- Review mode: initial
- 1 major regression risk found in the new markdown autolink transformer.
Testing
- Not run (automation). Suggested additions: explicit markdown-link regression coverage and a case where the URL legitimately ends with a fullwidth closing bracket/parenthesis.
HAPI Bot
for (let i = 0; i < node.children.length; i++) {
  const child = node.children[i]

  if (child.type === 'link' && typeof child.url === 'string') {
[MAJOR] This rewrites every link node, not just GFM autolinks, and the strip set includes paired suffixes like )】」》. That means an explicitly-authored markdown link ending with one of those characters will be silently truncated and get an extra text node appended. Please make the transform opt-in to autolink-shaped nodes before mutating, and avoid stripping paired brackets unless you can prove they are delimiters.
Suggested fix:
const TRAILING_CJK_PUNCT = /[,。、;:!?\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F]+$/
const textChild = child.children?.length === 1 ? child.children[0] : undefined
if (
child.type !== "link" ||
typeof child.url !== "string" ||
textChild?.type !== "text" ||
textChild.value !== child.url
) {
visitLinks(child)
continue
}
…inks
Only process links where the text content matches the URL (auto-links). Explicit markdown links like [text](url) are left untouched, preventing unintended mutation of deliberately authored URLs. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
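The "text content matches the URL" test from this commit can be isolated as a small predicate. The sketch below is illustrative only: the type names and the function name `isAutolinkShaped` are hypothetical, and the node shapes are simplified stand-ins for the real MDAST types.

```typescript
// Simplified stand-ins for MDAST nodes (assumption: not the real `mdast` types).
type TextLike = { type: "text"; value: string };
type LinkLike = { type: "link"; url: string; children: Array<TextLike | LinkLike> };

// Hypothetical predicate: a link is treated as an auto-link only when its sole
// child is a text node whose value is exactly the link's URL.
function isAutolinkShaped(link: LinkLike): boolean {
  if (link.children.length !== 1) return false;
  const only = link.children[0];
  return only !== undefined && only.type === "text" && only.value === link.url;
}
```

An explicit link such as `[docs](https://example.com/。)` has visible text that differs from its URL, so the predicate returns `false` and the node would be left alone. One case this comparison may miss: a bare `www.` auto-link, where the generated URL is likely to carry a scheme prefix that the visible text lacks.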
Findings
- [Major] The new strip set still removes paired fullwidth delimiters from any autolink-shaped URL, so valid URLs that legitimately end with ), 】, 」, or similar characters will be silently truncated. Evidence: web/src/lib/remark-strip-cjk-autolink.ts:14, web/src/lib/remark-strip-cjk-autolink.ts:39.
Summary
- Review mode: follow-up after new commits
- 1 major regression risk remains in the autolink punctuation stripper.
Testing
- Not run (automation; PR code not executed). Suggested addition: cover an autolink whose real path ends with a fullwidth closing delimiter.
HAPI Bot
 */

// Common CJK / fullwidth punctuation that should never be part of a URL.
const TRAILING_CJK_PUNCT = /[,。、;:!?()【】「」『』《》〈〉\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F\uFF08\uFF09]+$/
[MAJOR] TRAILING_CJK_PUNCT still includes paired fullwidth delimiters such as )】」』》〉. For bare autolinks, those characters can be legitimate path suffixes, so this transform will silently truncate valid URLs and move the last character into plain text. Please narrow the strip set to punctuation that is unambiguously trailing prose punctuation, and add a regression test for an autolink that legitimately ends with one of the removed delimiters.
Suggested fix:
const TRAILING_CJK_PUNCT = /[,。、;:!?\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F]+$/
it('does not strip fullwidth delimiters that are part of the URL', () => {
const tree = makeAutolink('https://example.com/路径)')
transform(tree)
expect(tree.children[0].children[1].url).toBe('https://example.com/路径)')
})
7 characters were listed twice (once as literals, once as \uXXXX escapes). Keep only the literals and the 2 unique escapes (\u3000 ideographic space, \uFF0E fullwidth full stop). via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
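The duplicate count in this commit message can be checked mechanically. The sketch below lists the members of the earlier character class (the literals plus the `\uXXXX` escapes) and reports which code points appear more than once; the helper name `duplicatedChars` is hypothetical.

```typescript
// Hypothetical helper: returns each character that occurs more than once.
function duplicatedChars(chars: string[]): string[] {
  const dups: string[] = [];
  chars.forEach((c, i) => {
    // A character is a duplicate if its first occurrence is at an earlier index.
    if (chars.indexOf(c) !== i && dups.indexOf(c) === -1) dups.push(c);
  });
  return dups;
}

// Members of the earlier character class: literals first, then the escapes.
const literals = ",。、;:!?()【】「」『』《》〈〉".split("");
const escapes = [
  "\u3000", "\uFF0C", "\uFF0E", "\uFF1B", "\uFF1A",
  "\uFF01", "\uFF1F", "\uFF08", "\uFF09",
];

const duplicates = duplicatedChars(literals.concat(escapes));
console.log(duplicates.length); // 7
```

Only `\u3000` and `\uFF0E` have no literal counterpart in the class, which is why those two escapes are the ones worth keeping.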
Findings
- [Major] Valid autolinks ending with fullwidth closing delimiters are still truncated. `TRAILING_CJK_PUNCT` still includes characters like ), 】, and 」, so a real URL such as `https://example.com/路径)` will be rewritten even when that last character is part of the path. Evidence: web/src/lib/remark-strip-cjk-autolink.ts:16, web/src/lib/remark-strip-cjk-autolink.ts:41, missing regression coverage in web/src/lib/remark-strip-cjk-autolink.test.ts:86.
Suggested fix:
const TRAILING_CJK_PUNCT = /[,。、;:!?\u3000\uFF0E]+$/
it('does not strip fullwidth closing delimiters that are part of the URL', () => {
  const tree = makeAutolink('https://example.com/路径)')
  transform(tree)
  expect(tree.children[0].children[1].url).toBe('https://example.com/路径)')
  expect(tree.children[0].children.length).toBe(2)
})
Summary
- Review mode: follow-up after new commits
- 1 major issue remains on the current head.
Testing
- Not run (automation). Suggested addition: autolinks that legitimately end with ), 】, or 」.
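The difference between the current strip set and the narrowed one suggested in this review can be seen on plain strings, without building an MDAST tree. In the sketch below the two patterns are copied from the review; the constant names `WIDE` and `NARROW` and the helper `stripTail` are hypothetical.

```typescript
// Strip set currently in the PR: includes paired fullwidth delimiters.
const WIDE = /[,。、;:!?()【】「」『』《》〈〉\u3000\uFF0E]+$/;
// Narrowed set from the suggested fix: trailing prose punctuation only.
const NARROW = /[,。、;:!?\u3000\uFF0E]+$/;

// Hypothetical helper: removes whatever the pattern matches at the end of the URL.
function stripTail(url: string, pattern: RegExp): string {
  return url.replace(pattern, "");
}

const pathUrl = "https://example.com/路径)";
const proseUrl = "https://github.com/issues/477,";

console.log(stripTail(pathUrl, WIDE));    // closing parenthesis is removed
console.log(stripTail(pathUrl, NARROW));  // URL is left intact
console.log(stripTail(proseUrl, NARROW)); // trailing comma is still removed
```

The narrowed set trades recall for safety: a URL followed directly by a fullwidth closing bracket in prose would keep that bracket inside the link.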
HAPI Bot
// Common CJK / fullwidth punctuation that should never be part of a URL.
// Includes: fullwidth comma/period/semicolon/colon/exclamation/question/parens,
// ideographic comma/period, CJK brackets, ideographic space, fullwidth full stop.
const TRAILING_CJK_PUNCT = /[,。、;:!?()【】「」『』《》〈〉\u3000\uFF0E]+$/
[MAJOR] Valid autolinks ending with fullwidth closing delimiters are still truncated here. TRAILING_CJK_PUNCT still includes characters like ), 】, and 」, so a real URL such as https://example.com/路径) will be rewritten even when that last character is part of the path.
Suggested fix:
const TRAILING_CJK_PUNCT = /[,。、;:!?\u3000\uFF0E]+$/
it('does not strip fullwidth closing delimiters that are part of the URL', () => {
const tree = makeAutolink('https://example.com/路径)')
transform(tree)
expect(tree.children[0].children[1].url).toBe('https://example.com/路径)')
expect(tree.children[0].children.length).toBe(2)
})
* fix(web): strip CJK punctuation from auto-linked URLs (tiann#478)
  remark-gfm auto-links bare URLs but only handles ASCII trailing punctuation. When a URL is followed by CJK punctuation like ,or 。 without whitespace, the punctuation gets included in the link. Add a remark plugin that walks the MDAST after GFM and moves any trailing CJK/fullwidth punctuation out of the link node into a sibling text node.
  via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
* fix: add non-null assertion for link.children in test
  via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
* fix: only strip CJK punctuation from auto-linked URLs, not explicit links
  Only process links where the text content matches the URL (auto-links). Explicit markdown links like [text](url) are left untouched, preventing unintended mutation of deliberately authored URLs.
  via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
* fix: add non-null assertion for textChild.value
  via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
* fix: remove duplicate unicode escapes from CJK punctuation regex
  7 characters were listed twice (once as literals, once as \uXXXX escapes). Keep only the literals and the 2 unique escapes (\u3000 ideographic space, \uFF0E fullwidth full stop).
  via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
---------
Co-authored-by: HAPI <noreply@hapi.run>
Summary
- Adds a `remark-strip-cjk-autolink` plugin that walks the MDAST after `remark-gfm` and moves trailing CJK/fullwidth punctuation (,, 。, !, etc.) out of auto-linked URL nodes into sibling text nodes
- Registered in `MARKDOWN_PLUGINS` right after `remarkGfm`

Closes #478
Example
Before:
`https://github.com/issues/477,涵盖了` → the entire `477,` is part of the link
After:
`https://github.com/issues/477` is the link, `,涵盖了` is plain text
Test plan