fix(web): strip CJK punctuation from auto-linked URLs#479

Merged
hqhq1025 merged 5 commits into main from fix/cjk-autolink-punctuation
Apr 17, 2026

Conversation

@hqhq1025
Collaborator

Summary

  • Add remark-strip-cjk-autolink plugin that walks the MDAST after remark-gfm and moves trailing CJK/fullwidth punctuation (such as , and 。) out of auto-linked URL nodes into sibling text nodes
  • Register the plugin in MARKDOWN_PLUGINS right after remarkGfm
  • No external dependencies — uses a simple recursive tree walker with typed MDAST interfaces

Closes #478
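
The approach in the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `MdNode` interface, the name `stripTrailingPunct`, and the exact strip set are assumptions, and merging the punctuation into an existing following text node is only one possible choice.

```typescript
// Minimal MDAST-like shape, declared locally so the sketch has no dependencies.
interface MdNode {
    type: string
    value?: string
    url?: string
    children?: MdNode[]
}

// Assumed strip set: fullwidth comma, ideographic comma and full stop, fullwidth
// full stop, semicolon, colon, exclamation mark and question mark.
const TRAILING_PUNCT = /[\uFF0C\u3001\u3002\uFF0E\uFF1B\uFF1A\uFF01\uFF1F]+$/

function stripTrailingPunct(parent: MdNode): void {
    const children = parent.children
    if (!children) return

    for (let i = 0; i < children.length; i++) {
        const child = children[i]
        const url = child.type === 'link' ? child.url : undefined
        const label = child.children?.length === 1 ? child.children[0] : undefined
        const match = typeof url === 'string' ? TRAILING_PUNCT.exec(url) : null

        // Anything that is not an autolink-shaped link with trailing punctuation
        // is left as-is, and the walk continues into its children.
        if (typeof url !== 'string' || !match || !label || label.type !== 'text' || label.value !== url) {
            stripTrailingPunct(child)
            continue
        }

        // Shorten both the URL and the visible label.
        const trimmed = url.slice(0, match.index)
        child.url = trimmed
        label.value = trimmed

        // Move the punctuation into the text that follows the link.
        const next = children[i + 1]
        if (next && next.type === 'text') {
            next.value = match[0] + (next.value ?? '')
        } else {
            children.splice(i + 1, 0, { type: 'text', value: match[0] })
        }
    }
}

// Usage: a link node whose URL ends with a fullwidth comma (U+FF0C).
const linkNode: MdNode = {
    type: 'link',
    url: 'https://github.com/issues/477\uFF0C',
    children: [{ type: 'text', value: 'https://github.com/issues/477\uFF0C' }],
}
const paragraph: MdNode = {
    type: 'paragraph',
    children: [linkNode, { type: 'text', value: '涵盖了' }],
}
stripTrailingPunct({ type: 'root', children: [paragraph] })
// linkNode.url is now 'https://github.com/issues/477'; the comma leads the next text node.
```

A real remark plugin would wrap such a function in a transformer and be listed after `remarkGfm`, so that the link nodes already exist when it runs.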

Example

Before: https://github.com/issues/477,涵盖了 → entire 477, is part of the link

After: https://github.com/issues/477 is the link; ,涵盖了 is plain text
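
At the string level, the before/after split amounts to cutting the URL at the first character of its trailing punctuation run. A small sketch, assuming a hypothetical helper `splitTrailingPunct` and an illustrative strip set (written with `\u` escapes so the characters survive copy and paste):

```typescript
// Illustrative strip set: , 。 、 ; : ! ? and the fullwidth full stop.
const TRAILING_CJK = /[\uFF0C\u3002\u3001\uFF1B\uFF1A\uFF01\uFF1F\uFF0E]+$/

function splitTrailingPunct(url: string): { url: string; trailing: string } {
    const match = TRAILING_CJK.exec(url)
    if (!match) return { url, trailing: '' }
    return { url: url.slice(0, match.index), trailing: match[0] }
}

const splitResult = splitTrailingPunct('https://github.com/issues/477\uFF0C')
// splitResult.url      === 'https://github.com/issues/477'
// splitResult.trailing === '\uFF0C' (the fullwidth comma)
```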

Test plan

  • Unit tests cover: trailing fullwidth comma, ideographic full stop, multiple trailing CJK chars, no false positives on clean URLs, no stripping of CJK path segments
  • Verify in the web UI that URLs followed by Chinese punctuation render with correct link boundaries
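
The unit-test cases listed above could be expressed as a small table. This sketch does not reproduce the PR's test file: `stripTrailing`, the strip set, and the sample URLs are all assumptions, and a plain filter stands in for a test runner.

```typescript
const TRAILING = /[\uFF0C\u3002\u3001\uFF1B\uFF1A\uFF01\uFF1F\uFF0E]+$/

function stripTrailing(url: string): string {
    return url.replace(TRAILING, '')
}

// [description, input, expected]
const cases: Array<[string, string, string]> = [
    ['trailing fullwidth comma', 'https://example.com/a\uFF0C', 'https://example.com/a'],
    ['ideographic full stop', 'https://example.com/a\u3002', 'https://example.com/a'],
    ['multiple trailing characters', 'https://example.com/a\uFF01\u3002', 'https://example.com/a'],
    ['clean URL is untouched', 'https://example.com/a?b=1', 'https://example.com/a?b=1'],
    ['CJK path segment is kept', 'https://example.com/路径/页面', 'https://example.com/路径/页面'],
]

// Collect the descriptions of any cases whose output differs from the expectation.
const failures = cases
    .filter(([, input, expected]) => stripTrailing(input) !== expected)
    .map(([description]) => description)
```

An empty `failures` array means every listed case behaves as described.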

remark-gfm auto-links bare URLs but only handles ASCII trailing
punctuation. When a URL is followed by CJK punctuation like , or 。
without whitespace, the punctuation gets included in the link. Add a
remark plugin that walks the MDAST after GFM and moves any trailing
CJK/fullwidth punctuation out of the link node into a sibling text node.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Apr 16, 2026

Deploying hapi with Cloudflare Pages

Latest commit: b92ecee
Status: ✅  Deploy successful!
Preview URL: https://893acafb.hapi-bqd.pages.dev
Branch Preview URL: https://fix-cjk-autolink-punctuation.hapi-bqd.pages.dev

@github-actions github-actions bot left a comment

Findings

  • [Major] Transformer rewrites non-autolink markdown links and can truncate legitimate CJK URL suffixes. It currently runs on every link node and the strip set includes paired suffixes like )】」》, so an explicitly-authored link ending with one of those characters will be silently mutated. Evidence web/src/lib/remark-strip-cjk-autolink.ts:14, web/src/lib/remark-strip-cjk-autolink.ts:29.

Summary

  • Review mode: initial
  • 1 major regression risk found in the new markdown autolink transformer.

Testing

  • Not run (automation). Suggested additions: explicit markdown-link regression coverage and a case where the URL legitimately ends with a fullwidth closing bracket/parenthesis.

HAPI Bot

for (let i = 0; i < node.children.length; i++) {
    const child = node.children[i]

    if (child.type === 'link' && typeof child.url === 'string') {

[MAJOR] This rewrites every link node, not just GFM autolinks, and the strip set includes paired suffixes like )】」》. That means an explicitly-authored markdown link ending with one of those characters will be silently truncated and get an extra text node appended. Please make the transform opt-in to autolink-shaped nodes before mutating, and avoid stripping paired brackets unless you can prove they are delimiters.

Suggested fix:

const TRAILING_CJK_PUNCT = /[\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F]+$/

const textChild = child.children?.length === 1 ? child.children[0] : undefined
if (
    child.type !== "link" ||
    typeof child.url !== "string" ||
    textChild?.type !== "text" ||
    textChild.value !== child.url
) {
    visitLinks(child)
    continue
}
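
The guard in the suggested fix can also be read as a standalone predicate. The sketch below assumes a locally declared `MdNode` shape and a hypothetical name `isAutolinkShaped`; it is not taken from the PR.

```typescript
interface MdNode {
    type: string
    value?: string
    url?: string
    children?: MdNode[]
}

// True only when the node is a link whose single child is a text node
// with exactly the same value as the link's URL.
function isAutolinkShaped(node: MdNode): boolean {
    if (node.type !== 'link' || typeof node.url !== 'string') return false
    const only = node.children?.length === 1 ? node.children[0] : undefined
    return only !== undefined && only.type === 'text' && only.value === node.url
}

// A bare URL turned into a link: label and URL are identical.
const bareLink: MdNode = {
    type: 'link',
    url: 'https://example.com/a',
    children: [{ type: 'text', value: 'https://example.com/a' }],
}

// An explicit [docs](https://example.com/a) link: the label differs from the URL.
const explicitLink: MdNode = {
    type: 'link',
    url: 'https://example.com/a',
    children: [{ type: 'text', value: 'docs' }],
}
```

The check is a heuristic: an explicit link whose label repeats its URL verbatim would also pass it.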

hqhq1025 and others added 2 commits April 17, 2026 01:30
…inks

Only process links where the text content matches the URL (auto-links).
Explicit markdown links like [text](url) are left untouched, preventing
unintended mutation of deliberately authored URLs.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions bot left a comment

Findings

  • [Major] The new strip set still removes paired fullwidth delimiters from any autolink-shaped URL, so valid URLs that legitimately end with one of those delimiters will be silently truncated. Evidence web/src/lib/remark-strip-cjk-autolink.ts:14, web/src/lib/remark-strip-cjk-autolink.ts:39.

Summary

  • Review mode: follow-up after new commits
  • 1 major regression risk remains in the autolink punctuation stripper.

Testing

  • Not run (automation; PR code not executed). Suggested addition: cover an autolink whose real path ends with a fullwidth closing delimiter.

HAPI Bot

*/

// Common CJK / fullwidth punctuation that should never be part of a URL.
const TRAILING_CJK_PUNCT = /[,。、;:!?()【】「」『』《》〈〉\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F\uFF08\uFF09]+$/

[MAJOR] TRAILING_CJK_PUNCT still includes paired fullwidth delimiters such as )】」』》〉. For bare autolinks, those characters can be legitimate path suffixes, so this transform will silently truncate valid URLs and move the last character into plain text. Please narrow the strip set to punctuation that is unambiguously trailing prose punctuation, and add a regression test for an autolink that legitimately ends with one of the removed delimiters.

Suggested fix:

const TRAILING_CJK_PUNCT = /[\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F]+$/

it('does not strip fullwidth delimiters that are part of the URL', () => {
    const tree = makeAutolink('https://example.com/路径)')
    transform(tree)

    expect(tree.children[0].children[1].url).toBe('https://example.com/路径)')
})
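
The difference between the two strip sets can be seen by applying both to the URL from the suggested test. In this sketch `BROAD` mirrors the character class under review, `NARROW` is the class from the suggested fix, and the variable names are illustrative only.

```typescript
// Broad set: prose punctuation plus paired fullwidth and CJK brackets.
const BROAD = /[\uFF0C\u3002\u3001\uFF1B\uFF1A\uFF01\uFF1F\uFF08\uFF09\u3010\u3011\u300C\u300D\u300E\u300F\u300A\u300B\u3008\u3009\u3000\uFF0E]+$/

// Narrow set from the suggested fix: no brackets.
const NARROW = /[\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F]+$/

// URL whose path ends with a fullwidth closing parenthesis (U+FF09).
const sampleUrl = 'https://example.com/路径\uFF09'

const broadResult = sampleUrl.replace(BROAD, '')   // 'https://example.com/路径'
const narrowResult = sampleUrl.replace(NARROW, '') // unchanged
```

As written, the narrow class also omits the ideographic full stop (U+3002) and ideographic comma (U+3001), so a trailing 。 would no longer be stripped by it.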

7 characters were listed twice (once as literals, once as \uXXXX
escapes). Keep only the literals and the 2 unique escapes (\u3000
ideographic space, \uFF0E fullwidth full stop).

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
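
The duplication described in this commit message can be checked mechanically. A small sketch, assuming the character-class body of the earlier revision split into its literal and escaped parts; `duplicateChars` is a hypothetical helper.

```typescript
// Returns every character that has already appeared earlier in the string.
function duplicateChars(chars: string): string[] {
    const all = chars.split('')
    return all.filter((ch, i) => all.indexOf(ch) !== i)
}

// Literal part of the class, followed by the escaped part.
const literalPart = ',。、;:!?()【】「」『』《》〈〉'
const escapedPart = '\u3000\uFF0C\uFF0E\uFF1B\uFF1A\uFF01\uFF1F\uFF08\uFF09'

const repeated = duplicateChars(literalPart + escapedPart)
// repeated holds 7 characters; \u3000 and \uFF0E are not among them.
```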

@github-actions github-actions bot left a comment

Findings

  • [Major] Valid autolinks ending with fullwidth closing delimiters are still truncated. TRAILING_CJK_PUNCT still includes fullwidth closing delimiters such as ), so a real URL such as https://example.com/路径) will be rewritten even when that last character is part of the path. Evidence web/src/lib/remark-strip-cjk-autolink.ts:16, web/src/lib/remark-strip-cjk-autolink.ts:41, missing regression coverage in web/src/lib/remark-strip-cjk-autolink.test.ts:86.
    Suggested fix:
    const TRAILING_CJK_PUNCT = /[\u3000\uFF0E]+$/
    
    it('does not strip fullwidth closing delimiters that are part of the URL', () => {
        const tree = makeAutolink('https://example.com/路径)')
        transform(tree)
    
        expect(tree.children[0].children[1].url).toBe('https://example.com/路径)')
        expect(tree.children[0].children.length).toBe(2)
    })

Summary

  • Review mode: follow-up after new commits
  • 1 major issue remains on the current head.

Testing

  • Not run (automation). Suggested addition: autolinks that legitimately end with a fullwidth closing delimiter such as ).

HAPI Bot

// Common CJK / fullwidth punctuation that should never be part of a URL.
// Includes: fullwidth comma/period/semicolon/colon/exclamation/question/parens,
// ideographic comma/period, CJK brackets, ideographic space, fullwidth full stop.
const TRAILING_CJK_PUNCT = /[,。、;:!?()【】「」『』《》〈〉\u3000\uFF0E]+$/

[MAJOR] Valid autolinks ending with fullwidth closing delimiters are still truncated here. TRAILING_CJK_PUNCT still includes fullwidth closing delimiters such as ), so a real URL such as https://example.com/路径) will be rewritten even when that last character is part of the path.

Suggested fix:

const TRAILING_CJK_PUNCT = /[\u3000\uFF0E]+$/

it('does not strip fullwidth closing delimiters that are part of the URL', () => {
    const tree = makeAutolink('https://example.com/路径)')
    transform(tree)

    expect(tree.children[0].children[1].url).toBe('https://example.com/路径)')
    expect(tree.children[0].children.length).toBe(2)
})
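
An earlier review comment asked that paired brackets not be stripped "unless you can prove they are delimiters". One way to attempt such a proof, which is not part of this PR, is to look at the text before the link: a trailing ) is moved out only when that text contains an unclosed (. The helper names below are hypothetical and the rule is an assumption made for illustration.

```typescript
const FW_OPEN = '\uFF08'  // fullwidth opening parenthesis
const FW_CLOSE = '\uFF09' // fullwidth closing parenthesis

function countChar(text: string, ch: string): number {
    return text.split(ch).length - 1
}

// Strips one trailing fullwidth ")" only when the preceding text has an
// unclosed fullwidth "(" and the URL itself has more closers than openers.
function splitProvenCloser(before: string, url: string): { url: string; trailing: string } {
    const unclosedBefore = countChar(before, FW_OPEN) - countChar(before, FW_CLOSE)
    const endsWithCloser = url.charAt(url.length - 1) === FW_CLOSE
    const balancedInside = countChar(url, FW_OPEN) >= countChar(url, FW_CLOSE)
    if (unclosedBefore > 0 && endsWithCloser && !balancedInside) {
        return { url: url.slice(0, -1), trailing: FW_CLOSE }
    }
    return { url, trailing: '' }
}

// Opener in the surrounding prose: the closer is treated as a delimiter.
const wrapped = splitProvenCloser('\uFF08see ', 'https://example.com/路径\uFF09')
// No opener before the link: the closer stays in the URL, as in the test above.
const bare = splitProvenCloser('', 'https://example.com/路径\uFF09')
```

With this rule the regression test above still holds, because an autolink with no opening parenthesis before it keeps its final character.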

@hqhq1025 hqhq1025 merged commit 5d1e616 into main Apr 17, 2026
4 of 5 checks passed
@hqhq1025 hqhq1025 deleted the fix/cjk-autolink-punctuation branch April 17, 2026 03:02
Hwwwww-dev pushed a commit to Hwwwww-dev/hapi that referenced this pull request Apr 17, 2026
* fix(web): strip CJK punctuation from auto-linked URLs (tiann#478)

remark-gfm auto-links bare URLs but only handles ASCII trailing
punctuation. When a URL is followed by CJK punctuation like , or 。
without whitespace, the punctuation gets included in the link. Add a
remark plugin that walks the MDAST after GFM and moves any trailing
CJK/fullwidth punctuation out of the link node into a sibling text node.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

* fix: add non-null assertion for link.children in test

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

* fix: only strip CJK punctuation from auto-linked URLs, not explicit links

Only process links where the text content matches the URL (auto-links).
Explicit markdown links like [text](url) are left untouched, preventing
unintended mutation of deliberately authored URLs.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

* fix: add non-null assertion for textChild.value

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

* fix: remove duplicate unicode escapes from CJK punctuation regex

7 characters were listed twice (once as literals, once as \uXXXX
escapes). Keep only the literals and the 2 unique escapes (\u3000
ideographic space, \uFF0E fullwidth full stop).

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

---------

Co-authored-by: HAPI <noreply@hapi.run>

Development

Successfully merging this pull request may close these issues.

bug(web): markdown auto-link includes trailing CJK punctuation in URL
