HtmlUtils.htmlUnescape() incorrect for numeric character references >= &#x10000; / &#65536;

`HtmlUtils.htmlUnescape()` returns incorrect values for numeric character references >= `&#x10000;` / `&#65536;`. Only code points in the Basic Multilingual Plane (0x0000 - 0xFFFF) are decoded correctly.

The reason is that `HtmlCharacterEntityDecoder.processNumberedReference()` parses an integer from a decimal or hex String and then casts it to a `char`. A Java `char` is 16 bits wide, so the cast keeps only the low 16 bits of any larger value.
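To illustrate the truncation, here is a minimal sketch. The class and method names are hypothetical and only mimic the cast described above; this is not the Spring code itself.

```java
class CharCastTruncation {

    /** Mimics the cast described above: only the low 16 bits of the value survive. */
    static String decodeWithCast(int value) {
        return new StringBuilder().append((char) value).toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a code point outside the BMP
        String decoded = decodeWithCast(codePoint);
        System.out.println(decoded.length());                        // 1
        System.out.println(Integer.toHexString(decoded.charAt(0)));  // f600
    }
}
```

The result is a single, unrelated BMP character (U+F600) instead of the requested U+1F600.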

That seemed wrong to me, but I've been working with HTML 5 recently, so perhaps my expectation was wrong.

The Javadoc says `htmlUnescape()` is compliant with HTML 4.01, and links to the spec.

> Handles complete character set defined in HTML 4.01 recommendation and all reference types (decimal, hex, and entity).

`Section 5.3.1 Numeric character reference` talks about the "code position" and refers to "ISO 10646 character numbers". None of the examples are larger than 0xFFFF.

The link from "code position" leads to section `5.1 The Document Character Set`. From there:

> HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]

> The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE])


`Section 20 SGML Declaration of HTML 4` says

> The total number of codepoints allowed in the document character set of this SGML declaration includes the first 17 planes of [[ISO10646]](https://www.w3.org/TR/html4/references.html#ref-ISO10646) (17 times 65536)


So HTML 4 numeric character references do correspond to Unicode code points. Therefore, numeric character references >= `&#x10000;` should be mapped to a surrogate pair of Java chars.
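For reference, this is roughly what the expected mapping looks like using only the standard `Character` and `String` APIs. The class and helper names are made up for the example.

```java
class SurrogatePairDemo {

    /** Converts a Unicode code point to its UTF-16 representation (one or two chars). */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x10000); // first code point above the BMP
        System.out.println(s.length());                                     // 2
        System.out.println(Integer.toHexString(s.charAt(0)));               // d800
        System.out.println(Integer.toHexString(s.charAt(1)));               // dc00
        System.out.println(s.codePointAt(0) == 0x10000);                    // true
    }
}
```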

Fixing the code should be straightforward, in `HtmlCharacterEntityDecoder.processNumberedReference()`:

```Java
this.decodedMessage.append((char) value);
```

becomes something like:

```Java
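// Reject anything outside the Unicode range (negative or > 0x10FFFF);
// appendCodePoint() would otherwise throw IllegalArgumentException.
if (!Character.isValidCodePoint(value)) {
    return false;
}
// Appends one char for BMP code points, a surrogate pair for supplementary ones.
this.decodedMessage.appendCodePoint(value);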
```
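A self-contained sketch of the same idea, runnable outside Spring, might look like the following. The class `NumericReferenceSketch` and its `decode` method are hypothetical, not part of `HtmlCharacterEntityDecoder`. It uses `Character.isValidCodePoint` so that negative values are rejected as well, which is an assumption about the desired behaviour.

```java
class NumericReferenceSketch {

    /**
     * Hypothetical helper, not Spring's implementation: decodes a single
     * "&#NNN;" or "&#xHHH;" reference. Returns null if the reference is
     * malformed or the value is outside the Unicode range.
     */
    static String decode(String reference) {
        if (reference == null || !reference.startsWith("&#") || !reference.endsWith(";")) {
            return null;
        }
        String body = reference.substring(2, reference.length() - 1);
        boolean hex = body.startsWith("x") || body.startsWith("X");
        try {
            int value = hex ? Integer.parseInt(body.substring(1), 16) : Integer.parseInt(body);
            if (!Character.isValidCodePoint(value)) {
                return null;
            }
            // One char for BMP code points, a surrogate pair above 0xFFFF.
            return new StringBuilder().appendCodePoint(value).toString();
        }
        catch (NumberFormatException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("&#65;"));                // A
        System.out.println(decode("&#x10000;").length());   // 2
        System.out.println(decode("&#x110000;"));           // null
    }
}
```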

See also
https://www.w3.org/TR/html4/charset.html#h-5.1
https://www.w3.org/TR/html4/sgml/sgmldecl.html
https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html

HtmlUtils.htmlUnescape() incorrect for numeric character references >= &#x10000; / &#65536; #35426