Description
HtmlUtils.htmlUnescape() returns incorrect values for numeric character references >= &#x10000; / &#65536;. Only code points in the Basic Multilingual Plane (0x0000 - 0xFFFF) are decoded correctly.
The reason for this is that HtmlCharacterEntityDecoder.processNumberedReference() parses an integer from a decimal or hex String and then casts it to a char, which keeps only the low 16 bits of the value.
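To illustrate the effect of that cast, here is a minimal standalone sketch. The class and the decodeWithCast helper are hypothetical and are not Spring code; they only reproduce the int-to-char narrowing described above.

```java
// Sketch: narrowing an int code point to char keeps only the low 16 bits.
class CharCastTruncation {

    // Hypothetical helper that mimics appending (char) value to a buffer.
    static String decodeWithCast(int codePoint) {
        return new StringBuilder().append((char) codePoint).toString();
    }

    public static void main(String[] args) {
        // A BMP code point (U+00E9) survives the cast unchanged.
        System.out.println(decodeWithCast(0x00E9).equals("\u00E9")); // true

        // A supplementary code point loses its high bits: 0x1F600 becomes U+F600.
        System.out.println(Integer.toHexString(decodeWithCast(0x1F600).charAt(0))); // f600

        // The first supplementary code point, 0x10000, collapses to U+0000.
        System.out.println((int) decodeWithCast(0x10000).charAt(0)); // 0
    }
}
```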
That seemed wrong to me, but I've been working with HTML 5 recently, so perhaps my expectation was wrong.
The Javadoc says htmlUnescape() is compliant with HTML 4.01, and links to the spec:
Handles complete character set defined in HTML 4.01 recommendation and all reference types (decimal, hex, and entity).
Section 5.3.1 Numeric character reference talks about the "code position" and refers to "ISO 10646 character numbers". None of the examples are larger than 0xFFFF.
The link from "code position" there leads to section 5.1 The Document Character Set. From there:
HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]
The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE])
Section 20 SGML Declaration of HTML 4 says:
The total number of codepoints allowed in the document character set of this SGML declaration includes the first 17 planes of [ISO10646] (17 times 65536)
So HTML 4 numeric character references do correspond to Unicode. Therefore, numeric character references >= &#x10000; should be mapped to a surrogate pair of Java chars.
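As a quick check of this reasoning, the sketch below uses only standard java.lang.Character methods. The class and the fromCodePoint helper are hypothetical names chosen for illustration. It shows that 17 planes of 65536 code points match Java's code point range, and that U+10000 is represented as two chars.

```java
// Sketch: supplementary code points occupy two Java chars (a surrogate pair).
class SupplementaryCodePoints {

    // Hypothetical helper: builds a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // 17 planes of 65536 code points each cover 0x0000 through 0x10FFFF.
        System.out.println(17 * 65536 == Character.MAX_CODE_POINT + 1); // true

        String s = fromCodePoint(0x10000);
        System.out.println(s.length());                       // 2
        System.out.println(Integer.toHexString(s.charAt(0))); // d800 (high surrogate)
        System.out.println(Integer.toHexString(s.charAt(1))); // dc00 (low surrogate)
        System.out.println(s.codePointCount(0, s.length()));  // 1
    }
}
```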
Fixing the code should be straightforward, in HtmlCharacterEntityDecoder.processNumberedReference():
this.decodedMessage.append((char) value);
becomes something like:
if (value > Character.MAX_CODE_POINT) {
    return false;
}
this.decodedMessage.appendCodePoint(value);
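For anyone who wants to try the idea in isolation, here is a self-contained sketch. It is not the actual Spring implementation: the class name, the appendNumericReference method, and its parameter shape (the text between "&#" and ";") are assumptions made for illustration. It uses Character.isValidCodePoint() rather than a bare upper-bound check, since StringBuilder.appendCodePoint() throws IllegalArgumentException for any invalid value, including a negative one.

```java
// Sketch: decode the body of a numeric character reference into a StringBuilder.
class NumericReferenceSketch {

    // Hypothetical stand-in for processNumberedReference().
    // 'body' is the text between "&#" and ";", e.g. "65536" or "x10000".
    // Returns false (and appends nothing) if the reference is not decodable.
    static boolean appendNumericReference(String body, StringBuilder out) {
        try {
            boolean hex = body.startsWith("x") || body.startsWith("X");
            int value = hex
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body);
            // Rejects values above Character.MAX_CODE_POINT and negative values.
            if (!Character.isValidCodePoint(value)) {
                return false;
            }
            // Appends one char for BMP values, a surrogate pair otherwise.
            out.appendCodePoint(value);
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        System.out.println(appendNumericReference("x10000", sb));  // true
        System.out.println(sb.length());                           // 2
        System.out.println(appendNumericReference("1114112", sb)); // false
    }
}
```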
See also
https://www.w3.org/TR/html4/charset.html#h-5.1
https://www.w3.org/TR/html4/sgml/sgmldecl.html
https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html