bin_xml_escape: supplementary plane characters (U+10000+) incorrectly escaped due to wrong unicode escape in regex #14483

Was looking at the JUnit XML output for a test suite that has emoji in test names and noticed the names were getting mangled. Traced it to `bin_xml_escape` in `src/_pytest/junitxml.py` line 59:

```python
illegal_xml_re = (
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\u10000-\u10ffff]"
)
```

The `\u10000-\u10ffff` part is wrong. Python's `\u` escape takes exactly 4 hex digits, so `\u10000` is parsed as `\u1000` (Myanmar letter KA, U+1000) followed by the literal character `0`, and `\u10ffff` becomes `\u10ff` followed by the literal `ff`. The intent was to cover the supplementary planes, U+10000 to U+10FFFF, which the XML spec allows. Because the 8-digit `\U` escape wasn't used, that whole range is missing from the "valid" set.
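A quick way to see the parsing difference in isolation (the variable names here are just for illustration):

```python
# "\u" consumes exactly four hex digits; "\U" consumes exactly eight.
short_escape = "\u10000"    # U+1000 followed by the literal character "0"
long_escape = "\U00010000"  # the single code point U+10000

print(len(short_escape), [hex(ord(c)) for c in short_escape])
# 2 ['0x1000', '0x30']
print(len(long_escape), [hex(ord(c)) for c in long_escape])
# 1 ['0x10000']
```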

Result: every supplementary plane character (U+10000 and above) gets incorrectly escaped. For example:

```python
from _pytest.junitxml import bin_xml_escape

bin_xml_escape("test_😀_passes")  # returns 'test_#x1F600_passes' instead of 'test_😀_passes'
bin_xml_escape("test_𠀀")         # returns 'test_#x20000' instead of 'test_𠀀'
```

That's 1,048,576 code points (0x10FFFF - 0x10000 + 1) that XML allows being treated as illegal -- most emoji, CJK Extensions B-F, Mathematical Alphanumeric Symbols, Musical Symbols, etc.

This is a regression from commit 1653c49b1b. The old code built the regex dynamically with `chr()` and worked fine:

```python
_legal_ranges = ((0x20, 0x7E), (0x80, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF))
_legal_xml_re = ["{}-{}".format(chr(low), chr(high)) for (low, high) in _legal_ranges]
```

The refactoring to a static string accidentally broke the supplementary plane range. The fix is changing `\u10000-\u10ffff` to `\U00010000-\U0010ffff`.
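The two patterns can be compared side by side without importing pytest. This is only a sketch: `buggy` and `fixed` are local names, and the strings are the pattern quoted above and the same pattern with the proposed `\U` escapes.

```python
import re

# Pattern as currently written, with the 4-digit escapes
buggy = re.compile(
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\u10000-\u10ffff]"
)
# Same pattern with the supplementary range written using 8-digit escapes
fixed = re.compile(
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

emoji = "\U0001F600"  # GRINNING FACE, a supplementary plane character

# A match means the character is considered illegal and would be escaped.
print(bool(buggy.search(emoji)))  # True
print(bool(fixed.search(emoji)))  # False
```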

Also noticed the existing test in `test_junitxml.py` has most of the valid edge code points, including the supplementary plane ones, commented out:

```python
valid = (0x9, 0xA, 0x20)
# 0xD, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF)
```

So the bug was never caught by tests.
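Something along these lines would have caught it. This is a rough sketch, not the pytest test: `escape_illegal` is a hypothetical stand-in for `bin_xml_escape` that uses the corrected pattern and a simplified `#xNNNN` replacement, and the edge lists are my own choice.

```python
import re

# Hypothetical stand-in for bin_xml_escape, using the corrected \U escapes.
# It is not the pytest implementation, only enough to exercise the edges.
_ILLEGAL = re.compile(
    "[^\u0009\u000a\u000d\u0020-\u007e\u0080-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def escape_illegal(text: str) -> str:
    # Replace each illegal character with a "#xNNNN"-style marker.
    return _ILLEGAL.sub(lambda m: "#x%04X" % ord(m.group()), text)


# Edge code points of the ranges the pattern treats as legal
VALID_EDGES = (0x9, 0xA, 0xD, 0x20, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF)
# Code points just outside those ranges
INVALID_EDGES = (0x0, 0xB, 0xFFFE, 0xFFFF)


def check_edges() -> None:
    for cp in VALID_EDGES:
        # Legal characters must pass through unchanged.
        assert escape_illegal(chr(cp)) == chr(cp), hex(cp)
    for cp in INVALID_EDGES:
        # Illegal characters must be replaced by their marker.
        assert escape_illegal(chr(cp)) == "#x%04X" % cp, hex(cp)


check_edges()
```

Swapping the 4-digit escapes back into `_ILLEGAL` makes the `0x10000` and `0x10FFFF` cases fail, which is the coverage the commented-out values were meant to provide.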
