Was looking at the JUnit XML output for a test suite that has emoji in test names and noticed the names were getting mangled. Traced it to bin_xml_escape in src/_pytest/junitxml.py line 59:
illegal_xml_re = (
"[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\u10000-\u10ffff]"
)
The \u10000-\u10ffff part is wrong. Python's \u escape takes exactly 4 hex digits, so \u10000 is parsed as \u1000 (Myanmar letter KA, U+1000) followed by the literal character 0, and \u10ffff becomes \u10ff followed by the literal characters ff. The intent was to cover the supplementary planes, U+10000 to U+10FFFF, which the XML spec allows as valid characters. But since \U (8 hex digits) wasn't used, that whole range is missing from the "valid" set.
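A quick way to see the parsing described above is to print the code points Python actually stores for each spelling. This is only an illustration: the names `broken` and `fixed` are made up here and don't appear in pytest.

```python
# \u consumes exactly four hex digits; whatever follows is ordinary text.
broken = "\u10000-\u10ffff"
# \U consumes eight hex digits, so each bound is a single code point.
fixed = "\U00010000-\U0010ffff"

print([hex(ord(c)) for c in broken])
# ['0x1000', '0x30', '0x2d', '0x10ff', '0x66', '0x66']
print([hex(ord(c)) for c in fixed])
# ['0x10000', '0x2d', '0x10ffff']
```

So inside the character class the broken spelling contributes U+1000, the range "0" to U+10FF, and the letter f, rather than the intended range.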
Result: every supplementary plane character (U+10000 and above) gets incorrectly escaped. For example:
from _pytest.junitxml import bin_xml_escape
bin_xml_escape("test_😀_passes") # returns 'test_#x1F600_passes' instead of 'test_😀_passes'
bin_xml_escape("test_𠀀") # returns 'test_#x20000' instead of 'test_𠀀'
That's 1,048,576 code points that XML allows being treated as illegal -- most emoji, CJK Extension B-F, mathematical alphanumeric symbols, musical symbols, etc.
This is a regression from commit 1653c49. The old code built the regex dynamically with chr() and worked fine:
_legal_ranges = ((0x20, 0x7E), (0x80, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF))
_legal_xml_re = ["{}-{}".format(chr(low), chr(high)) for (low, high) in _legal_ranges]
The refactoring to a static string accidentally broke the supplementary plane range. The fix is changing \u10000-\u10ffff to \U00010000-\U0010ffff.
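Here's a rough standalone sketch of the before/after behaviour, using only `re` so it runs without pytest installed. The two pattern strings are copied from this report. `escape_illegal` is a hypothetical stand-in written for this example, not pytest's actual `bin_xml_escape`; it just reproduces the `#x...` output format shown above.

```python
import re

# Pattern as it is quoted in this report (supplementary range written with \u).
BROKEN = "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\u10000-\u10ffff]"
# Same pattern with the proposed fix (\U plus eight hex digits).
FIXED = "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"


def escape_illegal(text, pattern):
    """Hypothetical helper: replace every character matched by pattern with #xHEX."""

    def repl(match):
        code = ord(match.group())
        return "#x%02X" % code if code <= 0xFF else "#x%04X" % code

    return re.sub(pattern, repl, text)


name = "test_\U0001F600_passes"  # U+1F600 is the grinning face emoji
print(escape_illegal(name, BROKEN))          # test_#x1F600_passes
print(escape_illegal(name, FIXED) == name)   # True
print(escape_illegal("a\x00b", FIXED))       # a#x00b
```

Control characters such as NUL are still escaped with the fixed pattern, so the change only widens the valid set by the supplementary range.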
Also noticed the existing test in test_junitxml.py has the supplementary plane code points commented out:
valid = (0x9, 0xA, 0x20)
# 0xD, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF)
So the bug was never caught by tests.
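A boundary check along these lines would have caught it. This is a sketch under assumptions, not the existing test. It checks the two pattern strings from this report directly with `re`, whereas a real test would go through `bin_xml_escape`. The `VALID` and `INVALID` tuples and the `misclassified` helper are names invented for this example.

```python
import re

BROKEN = "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\u10000-\u10ffff]"
FIXED = "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"

# Boundary code points of the ranges the pattern is meant to accept,
# including the ones currently commented out in the test.
VALID = (0x9, 0xA, 0xD, 0x20, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF)
# A few code points just outside those ranges.
INVALID = (0x0, 0x8, 0xB, 0xC, 0xE, 0x1F, 0xD800, 0xDFFF, 0xFFFE, 0xFFFF)


def misclassified(pattern):
    """Return the boundary code points that pattern puts on the wrong side."""
    illegal = re.compile(pattern)
    wrong = [cp for cp in VALID if illegal.search(chr(cp))]
    wrong += [cp for cp in INVALID if not illegal.search(chr(cp))]
    return wrong


print([hex(cp) for cp in misclassified(BROKEN)])  # ['0x10000', '0x10ffff']
print(misclassified(FIXED))                       # []
```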