Description
Feature or enhancement
It is a well-known fact that some characters (such as `<` or `&`) must be escaped in XML and HTML, and that attribute values must be quoted. It is a less well-known fact (unless you have specifically looked for it) that not all characters can be included in XML, even if escaped. For example, XML cannot contain the null character.
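As a small illustration of the point above, the existing `xml.sax.saxutils` helpers handle markup-significant characters but do nothing about characters that XML forbids outright. The `parses()` helper is a hypothetical name used only for this sketch; it relies on `xml.etree.ElementTree` rejecting a character reference to the null character.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Escaping takes care of characters that are significant to the markup...
escaped = escape("a < b & c")        # 'a &lt; b &amp; c'
quoted = quoteattr('say "hi"')       # wrapped in single quotes

# ...but a forbidden character passes through escape() unchanged.
still_bad = escape("bad\x00char")


def parses(document):
    """Hypothetical helper: return True if ElementTree accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Even written as a character reference, the null character is rejected.
null_reference_ok = parses("<a>&#0;</a>")
ordinary_escape_ok = parses("<a>&lt;</a>")
```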
There are also restrictions on the names of elements and attributes. They cannot contain `<`, `>`, `/`, `!`, `?`, spaces and many other characters. But unlike Python identifiers, they may contain `-`, `:`, `.`, etc. The list of valid and invalid characters is pretty long. When user input is used for element or attribute names without validation, this can even lead to an XML injection vulnerability (CVE-2025-9375).
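A minimal sketch of how an unvalidated name can change the structure of the output. `xml.etree.ElementTree` is used here only as a convenient stdlib serializer; this is not a reproduction of the CVE mentioned above, and `user_supplied_name` is a made-up input.

```python
import xml.etree.ElementTree as ET

# Pretend this string came from an untrusted source.
user_supplied_name = "a><injected"

# The tag name is written to the output as-is, without validation.
element = ET.Element(user_supplied_name)
serialized = ET.tostring(element, encoding="unicode")
```

The serialized string is no longer a single element named by the user: the `>` in the name closes the first start tag and begins a second one.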
So, I think that it would be useful to provide standard functions to validate XML characters and names in the stdlib. `xml.sax.saxutils` looks like an appropriate place; it already has the `escape()` and `quoteattr()` utilities.
We can also provide functions to "sanitize" XML characters, similar to `sanitize_xml()` in `Lib/test/libregrtest/utils.py`, but more general. This is similar to the old issue #63014.
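One possible shape for such a function, as a rough sketch only. The name `sanitize_xml10()` and its `replacement` parameter are hypothetical, and this is not a copy of the libregrtest helper. It replaces every character outside the XML 1.0 `Char` production with U+FFFD by default.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    "[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml10(text, replacement="\uFFFD"):
    """Hypothetical helper: replace characters that XML 1.0 cannot contain."""
    return _ILLEGAL_XML10.sub(replacement, text)
```

The result still has to be escaped with `escape()` or `quoteattr()` before it is written into a document; sanitizing and escaping are separate steps.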
Now, the problem is that there are two XML standards: 1.0 and 1.1. The former is much more popular, and they define legal characters differently. XML 1.1 also has restricted characters, which cannot be used in "well-formed" documents and parsed entities. There is also a version-dependent set of characters whose use is legal but discouraged. Should we have several functions, or several parameters to specify the XML version and other options?
https://www.w3.org/TR/xml/#charsets
https://www.w3.org/TR/xml11/#charsets
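To make the design question concrete, here is a sketch of the "single function with parameters" variant. The function name and the `version` and `allow_restricted` parameters are assumptions made up for this example, not a proposed final API. The character ranges are taken from the `Char` and `RestrictedChar` productions linked above; discouraged characters are not handled.

```python
import re

# XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_CHARS_10 = re.compile("[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*")
# XML 1.1: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_CHARS_11 = re.compile("[\x01-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*")
# XML 1.1 RestrictedChar
_RESTRICTED_11 = re.compile("[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")


def is_valid_xml_text(text, version="1.0", allow_restricted=False):
    """Hypothetical helper: check text against the Char production.

    allow_restricted only matters for XML 1.1 and models text that will
    be written using character references.
    """
    if version == "1.0":
        return _CHARS_10.fullmatch(text) is not None
    if version == "1.1":
        if _CHARS_11.fullmatch(text) is None:
            return False
        return allow_restricted or _RESTRICTED_11.search(text) is None
    raise ValueError(f"unsupported XML version: {version!r}")
```

The alternative would be separate functions per version, which avoids the `allow_restricted` flag being meaningless for XML 1.0.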
Fortunately, the syntax for names is the same in XML 1.0 and 1.1.
https://www.w3.org/TR/xml/#NT-Name
https://www.w3.org/TR/xml11/#NT-Name
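Because the `Name` production is shared, a single validator could cover both versions. The following is a sketch under that assumption, with the ranges transcribed from the `NameStartChar` and `NameChar` productions linked above; `is_valid_xml_name()` is a hypothetical name.

```python
import re

# NameStartChar ranges from the Name production.
_NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, #xB7 and two more ranges.
_NAME_REST = _NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"
_NAME_RE = re.compile(f"[{_NAME_START}][{_NAME_REST}]*")


def is_valid_xml_name(name):
    """Hypothetical helper: return True if name matches the XML Name production."""
    return _NAME_RE.fullmatch(name) is not None
```

Note that this differs from `str.isidentifier()` in both directions: `my-element.v2` is a valid XML name but not a Python identifier.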
We can also add a similar set of functions for HTML.