Fix/escape XML characters in sitemap url #77422

Open
wants to merge 2 commits into
base: canary

Austin1serb

What?

Fixes an issue where sitemap.[ts/js] generates invalid XML when URLs contain special characters like &, <, >, ", and '. While the source file is JavaScript or TypeScript, the final output is XML — which requires strict character escaping.

See: #77340

Why?

Unescaped characters break XML parsing, causing search engines to reject the sitemap or ignore URLs. While some users manually escape these (& → &amp;), this leads to brittle workarounds and potential double-escaping.
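To illustrate the double-escaping risk, here is a minimal sketch. `naiveEscape` and the example URLs are hypothetical and not taken from the Next.js codebase; the function simply replaces every `&`, which is what a straightforward escaper would do.

```typescript
// Hypothetical naive escaper: replaces every "&", including ones that
// already start an entity such as "&amp;".
function naiveEscape(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

// A raw URL is escaped correctly.
const rawUrl = 'https://example.com/search?q=a&b=1'
console.log(naiveEscape(rawUrl))
// https://example.com/search?q=a&amp;b=1

// A URL the user already escaped by hand gets escaped a second time.
const preEscapedUrl = 'https://example.com/search?q=a&amp;b=1'
console.log(naiveEscape(preEscapedUrl))
// https://example.com/search?q=a&amp;amp;b=1
```

The second result decodes to a URL containing the literal text `&amp;`, which is presumably not the URL the user intended.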

How?

  • Added a new utility: escapeXmlValue()
    • Escapes special XML characters
    • Prevents double-escaping by detecting existing valid entities (e.g., &amp;)
  • Applied this utility in resolveSitemap() for <loc> values
  • Added tests to ensure:
    • Proper escaping of raw characters
    • No double-escaping of already-escaped entities

This fix maintains backward compatibility for users who have already manually escaped their URLs.
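For readers who want a concrete picture, the behavior described above might look roughly like the sketch below. This is not the code from the PR: the regular expression, the choice to recognize only the five predefined XML named entities, and the example URLs are all assumptions made for illustration.

```typescript
// Sketch only; the PR's actual escapeXmlValue() may differ.
// Matches an "&" that does NOT begin one of the five predefined XML entities.
const UNESCAPED_AMPERSAND = /&(?!(?:amp|lt|gt|quot|apos);)/g

function escapeXmlValue(value: string): string {
  return value
    // Escape bare ampersands first; existing XML entities are left alone.
    .replace(UNESCAPED_AMPERSAND, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

// Raw and manually escaped inputs produce the same <loc> value.
const fromRaw = `<loc>${escapeXmlValue('https://example.com/?a=1&b=2')}</loc>`
const fromEscaped = `<loc>${escapeXmlValue('https://example.com/?a=1&amp;b=2')}</loc>`
console.log(fromRaw === fromEscaped) // true
```

Under these assumptions, HTML-only entities such as `&copy;` are not recognized, so their leading `&` is escaped like any other bare ampersand.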

Fixes #77340

@ijjk
Member

ijjk commented Mar 22, 2025

Allow CI Workflow Run

  • approve CI run for commit: 8030db3

Note: this should only be enabled once the PR is ready to go and can only be enabled by a maintainer

* Others like &copy; or &nbsp; are escaped to prevent invalid XML.

* - Prevents double-escaping of known entities like &amp;
*/
Author

@Austin1serb Austin1serb Mar 22, 2025

I added the function to this utils file because the file was already imported in resolve-route-data.ts.

Author

@Austin1serb Austin1serb left a comment

💡 Note on unrecognized HTML entities
This fix only escapes the five XML special characters (&, <, >, ", ') and prevents double-escaping of valid entities like &amp;.

It does not warn if the input contains unsupported HTML entities like &copy;, &nbsp;, etc. Those will be escaped as &amp;copy;, which renders literally in XML.

If needed, a warning or validation layer could be added in the future to detect and notify about non-XML entities. For now, this is intentionally left silent to avoid unnecessary noise or breaking existing behavior.
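As a rough idea of what such a validation layer could look like, here is a hypothetical sketch. `findNonXmlEntities` is not part of this PR or of Next.js; the name, the regular expression, and the decision to look only at named entities are assumptions.

```typescript
// Hypothetical helper: lists named entities that XML does not predefine.
const XML_NAMED_ENTITIES = ['amp', 'lt', 'gt', 'quot', 'apos']

function findNonXmlEntities(value: string): string[] {
  // Matches named references such as "&copy;"; numeric references are ignored.
  const pattern = /&([a-zA-Z][a-zA-Z0-9]*);/g
  const found: string[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(value)) !== null) {
    const name = match[1] ?? ''
    if (XML_NAMED_ENTITIES.indexOf(name) === -1) {
      found.push(match[0])
    }
  }
  return found
}

const suspicious = findNonXmlEntities('https://example.com/?a=1&amp;b=&copy;&nbsp;')
if (suspicious.length > 0) {
  // A real integration would decide where and how loudly to report this.
  console.warn(`Sitemap URL contains non-XML entities: ${suspicious.join(', ')}`)
}
```

A check like this only reports; it leaves the escaping behavior untouched, which keeps it compatible with the silent default described above.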

@@ -99,6 +99,43 @@ describe('resolveRouteData', () => {
"
Author

I added the tests to the existing sitemap generation test block.

@ijjk ijjk requested a review from huozhi March 23, 2025 06:31

Successfully merging this pull request may close these issues.

Generated sitemap is not escaped and is invalid if a URL has the character &
2 participants