Add utility functions to test for legal XML characters and names #139489

@serhiy-storchaka

Feature or enhancement

It is a well-known fact that some characters (such as < or &) must be escaped in XML and HTML, and that attribute values must be quoted. It is less well known (unless you have specifically looked for it) that not all characters can be included in XML, even if escaped. For example, XML cannot contain the null character.

There are also restrictions on the names of elements and attributes. They cannot contain <, >, /, !, ?, spaces and many other characters. But unlike in Python identifiers, -, :, ., etc. are acceptable. The list of valid and invalid characters is pretty long. When user input is used for element or attribute names without validation, this can even lead to an XML injection vulnerability (CVE-2025-9375).

So I think it would be useful to provide standard functions to validate XML characters and names in the stdlib. xml.sax.saxutils looks like an appropriate place; it already has the escape() and quoteattr() utilities.
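For context, here is a small illustration of what the existing helpers do and do not cover. It only uses escape() and quoteattr() from xml.sax.saxutils; the variable names are illustrative.

```python
from xml.sax.saxutils import escape, quoteattr

# escape() handles the markup-significant characters &, < and >.
escaped = escape('a < b & c')
print(escaped)  # a &lt; b &amp; c

# quoteattr() escapes the value and wraps it in suitable quotes.
quoted = quoteattr('say "hi"')
print(quoted)  # 'say "hi"'

# Neither helper rejects or removes characters that are illegal in XML:
# the null character passes through unchanged.
unchanged = escape('bad\x00char')
print(unchanged == 'bad\x00char')  # True
```

So escaping alone does not guarantee that the output is well-formed XML.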

We can also provide functions to "sanitize" XML characters, similar to sanitize_xml() in Lib/test/libregrtest/utils.py, but more general. This is similar to the old issue #63014.
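A minimal sketch of what such a sanitizing helper might look like, assuming XML 1.0 rules and code points in the Basic Multilingual Plane. The name sanitize_xml_text() and its replacement parameter are hypothetical; this is not the libregrtest implementation and not a proposed API.

```python
import re

# Characters that are not legal in XML 1.0 (assumption: 1.0 rules only):
# C0 controls except TAB, LF and CR, lone surrogates, U+FFFE and U+FFFF.
_ILLEGAL_XML10_RE = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]'
)

def sanitize_xml_text(text, replacement='\ufffd'):
    """Hypothetical helper: replace each illegal character in text."""
    return _ILLEGAL_XML10_RE.sub(replacement, text)

cleaned = sanitize_xml_text('a\x00b\x0bc')
print(cleaned == 'a\ufffdb\ufffdc')  # True
```

Whether illegal characters should be replaced, dropped, or shown in an escaped form is a design choice that the real function would probably need to expose.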

Now, the problem is that there are two versions of the XML standard: 1.0 and 1.1. The former is much more popular, and the two define legal characters differently. XML 1.1 also has restricted characters, which cannot be used in "well-formed" documents and parsed entities. There is also a version-dependent set of characters whose use is legal but discouraged. Should we have several functions, or several parameters to specify the XML version and other options?

https://www.w3.org/TR/xml/#charsets
https://www.w3.org/TR/xml11/#charsets
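To make the difference concrete, here is a sketch of the Char productions from the two specifications linked above, plus the XML 1.1 RestrictedChar production. The function names and the version parameter are hypothetical and only show one possible shape of the API.

```python
def is_xml_char(ch, version='1.0'):
    """Hypothetical check: does ch match the Char production?"""
    cp = ord(ch)
    if version == '1.0':
        # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    if version == '1.1':
        # [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        return (0x1 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    raise ValueError(f'unsupported XML version: {version!r}')

def is_xml11_restricted_char(ch):
    """Hypothetical check for the XML 1.1 RestrictedChar production."""
    cp = ord(ch)
    return (0x1 <= cp <= 0x8 or 0xB <= cp <= 0xC or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)

print(is_xml_char('\x01', '1.0'))        # False
print(is_xml_char('\x01', '1.1'))        # True
print(is_xml11_restricted_char('\x01'))  # True
print(is_xml_char('\x00', '1.1'))        # False
```

A single function with a version parameter is only one option; separate functions per version would work equally well for this sketch.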

Fortunately, the syntax for names is the same in XML 1.0 and 1.1.

https://www.w3.org/TR/xml/#NT-Name
https://www.w3.org/TR/xml11/#NT-Name
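A name check could be sketched with a regular expression built from the NameStartChar and NameChar productions in the specifications linked above. The function name is_xml_name() is hypothetical, and the sketch does not consider namespace rules (for example, how many colons a qualified name may contain).

```python
import re

# NameStartChar ranges from the XML specification.
_NAME_START = (
    ':A-Z_a-z'
    '\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF'
    '\u0370-\u037D\u037F-\u1FFF\u200C-\u200D'
    '\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
    '\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF'
)
# NameChar adds "-", ".", digits, #xB7 and two more ranges.
_NAME_CHAR = _NAME_START + r'\-.0-9' + '\u00B7\u0300-\u036F\u203F-\u2040'
_NAME_RE = re.compile(f'[{_NAME_START}][{_NAME_CHAR}]*')

def is_xml_name(name):
    """Hypothetical check: does name match the Name production?"""
    return _NAME_RE.fullmatch(name) is not None

print(is_xml_name('xml:lang'))    # True
print(is_xml_name('data-id.v2'))  # True
print(is_xml_name('-bad'))        # False
print(is_xml_name('a<b'))         # False
```

Note that - and . are accepted inside a name but not as its first character, which is one of the ways XML names differ from Python identifiers.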

We can also add a similar set of functions for HTML.

Labels

stdlib (Standard Library Python modules in the Lib/ directory), topic-XML, type-feature (A feature request or enhancement)