HtmlUtils.htmlUnescape() incorrect for numeric character references >= &#x10000; / &#65536; #35426

@magictractor

Description

HtmlUtils.htmlUnescape() returns incorrect values for numeric character references >= &#x10000; / &#65536;. Only code points in the Basic Multilingual Plane (0x0000 to 0xFFFF) are decoded correctly.

The reason for this is that HtmlCharacterEntityDecoder.processNumberedReference() parses an integer from a decimal or hex String and then casts it to a char. A Java char is only 16 bits wide, so the cast discards the upper bits of any larger value.
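The effect of that cast can be reproduced in isolation. The following is a minimal sketch, not Spring's code: the class name CharCastDemo and the helper castToChar are hypothetical and exist only to show what a narrowing cast from int to char does.

```java
class CharCastDemo {

    // Hypothetical helper: applies the same narrowing cast as the decoder.
    // Only the low 16 bits of the value survive the cast.
    static int castToChar(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        // Inside the BMP the value is preserved.
        System.out.println(Integer.toHexString(castToChar(0xE9)));    // e9
        // Outside the BMP the upper bits are silently dropped.
        System.out.println(Integer.toHexString(castToChar(0x10000))); // 0
        System.out.println(Integer.toHexString(castToChar(0x1F600))); // f600
    }
}
```

So a reference to code point 0x10000 ends up as the NUL character, and other supplementary code points end up as unrelated BMP characters.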

That seemed wrong to me, but I've been working with HTML 5 recently, so perhaps my expectation was wrong.

The Javadoc says htmlUnescape() is compliant with HTML 4.01, and links to the spec.

Handles complete character set defined in HTML 4.01 recommendation and all reference types (decimal, hex, and entity).

Section 5.3.1 Numeric character reference talks about the "code position" and refers to "ISO 10646 character numbers". None of the examples are larger than 0xFFFF.

The "code position" link there leads to section 5.1 The Document Character Set, which says:

HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]

The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE])

Section 20 SGML Declaration of HTML 4 says

The total number of codepoints allowed in the document character set of this SGML declaration includes the first 17 planes of [ISO10646] (17 times 65536)

So HTML 4 numeric character references do correspond to Unicode code points. Therefore, numeric character references >= &#x10000; should be mapped to a surrogate pair of Java chars.
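As a rough check of that conclusion, the sketch below uses only standard java.lang.Character methods. The class name SurrogatePairDemo and the helper toUtf16 are hypothetical names chosen for illustration.

```java
class SurrogatePairDemo {

    // Hypothetical helper: converts a code point to its UTF-16 code units.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // 17 planes of 65536 code points each end exactly at Java's maximum code point.
        System.out.println(17 * 65536 - 1 == Character.MAX_CODE_POINT); // true

        // The first code point beyond the BMP needs two Java chars.
        char[] units = toUtf16(0x10000);
        System.out.println(Character.isSupplementaryCodePoint(0x10000)); // true
        System.out.println(units.length);                                // 2
        System.out.println(Integer.toHexString(units[0]) + " "
                + Integer.toHexString(units[1]));                        // d800 dc00
    }
}
```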

Fixing the code should be straightforward, in HtmlCharacterEntityDecoder.processNumberedReference():

this.decodedMessage.append((char) value);

becomes something like:

if (value > Character.MAX_CODE_POINT) {
    return false;
}
this.decodedMessage.appendCodePoint(value);
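To try the idea outside Spring, here is a self-contained sketch. It is not the actual HtmlCharacterEntityDecoder code: the class NumericReferenceSketch, the method appendNumericReference and its parameters are hypothetical. It also assumes that negative values should be rejected, so it uses Character.isValidCodePoint() instead of only comparing against Character.MAX_CODE_POINT, because Integer.parseInt() accepts a leading minus sign and appendCodePoint() throws IllegalArgumentException for invalid code points.

```java
class NumericReferenceSketch {

    // Hypothetical helper, not Spring's method: parses the digits of a numeric
    // character reference and appends the decoded code point to the builder.
    // Returns false, leaving the builder untouched, if the digits are not a
    // valid Unicode code point.
    static boolean appendNumericReference(StringBuilder out, String digits, boolean hex) {
        try {
            int value = Integer.parseInt(digits, hex ? 16 : 10);
            // Rejects negative values and values above Character.MAX_CODE_POINT.
            if (!Character.isValidCodePoint(value)) {
                return false;
            }
            // Appends one char for BMP code points, a surrogate pair otherwise.
            out.appendCodePoint(value);
            return true;
        }
        catch (NumberFormatException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        System.out.println(appendNumericReference(sb, "10000", true));  // true
        System.out.println(sb.length());                                // 2
        System.out.println(sb.codePointAt(0) == 0x10000);               // true
        System.out.println(appendNumericReference(sb, "110000", true)); // false
    }
}
```

With this shape, the hex and decimal forms of the same reference decode to the same surrogate pair, and out-of-range input is reported to the caller instead of being truncated.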

See also
https://www.w3.org/TR/html4/charset.html#h-5.1
https://www.w3.org/TR/html4/sgml/sgmldecl.html
https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
