bin_xml_escape: supplementary plane characters (U+10000+) incorrectly escaped due to wrong unicode escape in regex #14483

@EternalRights

Was looking at the JUnit XML output for a test suite that has emoji in test names and noticed the names were getting mangled. Traced it to bin_xml_escape in src/_pytest/junitxml.py line 59:

illegal_xml_re = (
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\u10000-\u10ffff]"
)

The \u10000-\u10ffff part is wrong. Python's \u escape takes exactly 4 hex digits, so \u10000 gets parsed as \u1000 (Myanmar letter KA, U+1000) followed by the literal character 0, and \u10ffff becomes \u10ff followed by ff. Inside the character class that leaves U+1000, a range from 0 to U+10FF, and the letter f, none of which reach U+10000. The intent was to cover the supplementary planes, U+10000 to U+10FFFF, which are valid characters per the XML spec. But since \U (8-digit) wasn't used, that whole range is missing from the "valid" set.
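A quick way to confirm the parsing, as a standalone sketch that doesn't import pytest (the variable names are only for illustration):

```python
# Illustrative only: how Python parses the two escape forms.
broken = "\u10000-\u10ffff"      # \u consumes exactly 4 hex digits
fixed = "\U00010000-\U0010ffff"  # \U consumes exactly 8 hex digits

# List the code points each literal actually contains.
broken_points = [f"U+{ord(c):04X}" for c in broken]
fixed_points = [f"U+{ord(c):04X}" for c in fixed]

print(len(broken), broken_points)
print(len(fixed), fixed_points)
```

The first literal comes out as six characters (U+1000, 0, -, U+10FF, f, f), while the second is the three characters you'd expect for a range.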

Result: every supplementary plane character (U+10000 and above) gets incorrectly escaped. For example:

from _pytest.junitxml import bin_xml_escape

bin_xml_escape("test_😀_passes")  # returns 'test_#x1F600_passes' instead of 'test_😀_passes'
bin_xml_escape("test_𠀀")         # returns 'test_#x20000' instead of 'test_𠀀'

That's 1,048,576 valid XML code points being treated as illegal, including most emoji, CJK Extensions B-F, mathematical alphanumeric symbols, musical symbols, etc.

This is a regression from commit 1653c49. The old code built the regex dynamically with chr() and worked fine:

_legal_ranges = ((0x20, 0x7E), (0x80, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF))
_legal_xml_re = ["{}-{}".format(chr(low), chr(high)) for (low, high) in _legal_ranges]

The refactoring to a static string accidentally broke the supplementary plane range. The fix is changing \u10000-\u10ffff to \U00010000-\U0010ffff.
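To compare the two patterns side by side without touching pytest, something like the following should work. The patterns are copied locally, and has_illegal is a hypothetical helper for this sketch, not part of _pytest.junitxml:

```python
import re

# Local copies of the current and proposed patterns, so this runs without pytest.
broken_re = re.compile(
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\u10000-\u10ffff]"
)
fixed_re = re.compile(
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def has_illegal(pattern, text):
    """Hypothetical helper: True if the pattern flags any character in text."""
    return pattern.search(text) is not None


emoji_name = "test_\U0001F600_passes"  # U+1F600, the emoji from the example above

print("broken flags emoji:", has_illegal(broken_re, emoji_name))
print("fixed flags emoji: ", has_illegal(fixed_re, emoji_name))
print("fixed flags NUL:   ", has_illegal(fixed_re, "a\x00b"))
```

With the \U form the emoji is left alone, and genuinely illegal characters such as NUL are still flagged.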

Also noticed the existing test in test_junitxml.py has the supplementary plane code points commented out:

valid = (0x9, 0xA, 0x20)
# 0xD, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF)

So the bug was never caught by tests.
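A rough sketch of what a boundary check could look like once those code points are enabled again. This tests a local copy of the proposed pattern directly rather than going through bin_xml_escape, and the invalid tuple is my own pick of code points outside the pattern's ranges, so the real test in test_junitxml.py would presumably look different:

```python
import re

# Local copy of the proposed pattern (assumption: this is what the fix would use).
fixed_re = re.compile(
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

# Boundary code points, including the ones currently commented out in the test.
valid = (0x9, 0xA, 0xD, 0x20, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF)
# Code points just outside the legal ranges (chosen for this sketch).
invalid = (0x0, 0x8, 0xB, 0xC, 0xE, 0x1F, 0xD800, 0xDFFF, 0xFFFE, 0xFFFF)

# Valid code points that the pattern wrongly flags as illegal.
flagged_valid = [hex(cp) for cp in valid if fixed_re.search(chr(cp))]
# Invalid code points that the pattern wrongly lets through.
missed_invalid = [hex(cp) for cp in invalid if not fixed_re.search(chr(cp))]

print("valid but flagged:", flagged_valid)
print("invalid but missed:", missed_invalid)
```

Both lists should come back empty with the \U version of the pattern.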
