`PrettyPrintWriter` fails to serialize characters in the Unicode Supplementary Multilingual Plane in XML 1.0 mode and XML 1.1 mode #337

basil · 2023-05-02T23:36:08Z

PrettyPrintWriter fails to properly serialize characters in the Unicode Supplementary Multilingual Plane (SMP) in XML 1.0 mode and XML 1.1 mode (quirks mode works) with the following exception:

com.thoughtworks.xstream.io.StreamException: Invalid character 0xd83e in XML stream
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriter.writeText(PrettyPrintWriter.java:250)
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriter.writeText(PrettyPrintWriter.java:205)
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriter.setValue(PrettyPrintWriter.java:187)
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriterTest.testSupportsSupplementaryMultilingualPlaneInXml1_0Mode(PrettyPrintWriterTest.java:310)

The root cause of the problem is incorrect iteration over Unicode code points. The current implementation iterates over the UTF-16 representation of the characters rather than iterating over each code point. Characters in the Supplementary Multilingual Plane are encoded in UTF-16 as two 16-bit code units (a surrogate pair). For example, U+1F98A is encoded in UTF-16 as 0xD83E 0xDD8A. Java provides a dedicated API to iterate over code points, but XStream makes the erroneous assumption that a code point and a `char` are equivalent, likely because it was never tested outside of quirks mode with characters in the Supplementary Multilingual Plane. This PR fixes the problem by using the Java API for iterating over code points, thus removing that faulty assumption.
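
To make the difference concrete, here is a minimal sketch that iterates the same one-character string both ways. The class and method names are hypothetical and are not part of XStream; only the JDK calls (`String.charAt`, `String.codePoints`, `Character.toChars`) are real.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class for illustration only; not part of XStream.
class CodePointIterationDemo {

    // Walks the string one UTF-16 code unit at a time, like the old loop.
    static List<Integer> codeUnits(String text) {
        List<Integer> units = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            units.add((int) text.charAt(i));
        }
        return units;
    }

    // Walks the string one Unicode code point at a time, like the fix.
    static List<Integer> codePoints(String text) {
        List<Integer> points = new ArrayList<>();
        text.codePoints().forEach(cp -> points.add(cp));
        return points;
    }

    public static void main(String[] args) {
        String fox = new String(Character.toChars(0x1F98A));
        for (int unit : codeUnits(fox)) {
            System.out.printf("code unit  0x%X%n", unit);   // 0xD83E, then 0xDD8A
        }
        for (int cp : codePoints(fox)) {
            System.out.printf("code point 0x%X%n", cp);     // 0x1F98A
        }
    }
}
```

A validity check applied to each code unit sees the lone surrogate 0xD83E, which matches the value reported in the exception above, whereas a check applied to each code point sees 0x1F98A.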

The new quirks mode test passes before and after the changes to PrettyPrintWriter. The new XML 1.0 mode and XML 1.1 mode tests fail before the changes to PrettyPrintWriter with the exception given above. The new XML 1.0 mode and XML 1.1 mode tests pass after the changes to PrettyPrintWriter.

Fixes #336


jglick · 2023-05-03T13:58:21Z

xstream/src/java/com/thoughtworks/xstream/io/xml/PrettyPrintWriter.java

-        final int length = text.length();
-        for (int i = 0; i < length; i++) {
-            final char c = text.charAt(i);
+        text.codePoints().forEach(c -> {


I guess 1fcfa0b makes this (@since 9) safe.

I have no idea what you are talking about in this review comment. The method is present in Java 8.

Perhaps. Was just going by https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#codePoints() which says 9. At any rate I would hope the CI build would fail if this were not permitted.

> Perhaps

Do you have any evidence for this claim which is casting doubt on the correctness of this change and potentially making it harder for subsequent reviewers to approve? If you do not, I would suggest that you refrain from making such review comments.

https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--

The above link. It seems the @since tags are contradictory, unless the JDK team has a policy of noting when an override of a default method was added (which would seem strange to me since that should not change the API surface).

https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints-- is present in Java 8 and this code compiles successfully on Java 8. As far as I can tell there is no action item here, and this whole review comment was unnecessary and served only to chew up some of my time to refute an unverified claim as well as potentially confusing future reviewers.

XStream 1.5.x will target Java 11. No point any longer to use Java 8 as minimum.

But XStream 1.4 still uses Java 8, and we want this critical bug fix in that line. Anyway, this change works in Java 8, so this whole thread is pointless. I have no idea why this review feedback was left in the first place.

kdebski85 · 2023-07-14

codePoints() was added to the CharSequence interface as a default method in Java 8.
In Java 9, an override of this method was added to String (which implements CharSequence).

So it should work on both Java 8 and 9, but it can be slightly faster for Strings on Java 9+ thanks to the optimised override added to String in Java 9.
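
As a small illustration of that point, the sketch below calls `codePoints()` through the `CharSequence` interface, so the same code works for a `String` and for a `StringBuilder`. The class and method names are hypothetical.

```java
// Hypothetical demo class for illustration only; not part of XStream.
class CharSequenceCodePointsDemo {

    // Accepts any CharSequence, so the call goes through the interface
    // method that has been available as a default method since Java 8.
    static long countCodePoints(CharSequence text) {
        return text.codePoints().count();
    }

    public static void main(String[] args) {
        String fox = new String(Character.toChars(0x1F98A));
        System.out.println(countCodePoints(fox));                                // 1
        System.out.println(countCodePoints(new StringBuilder("a").append(fox))); // 2
    }
}
```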

jglick · 2023-05-03T13:59:35Z

xstream/src/java/com/thoughtworks/xstream/io/xml/PrettyPrintWriter.java

@@ -238,7 +236,7 @@ private void writeText(final String text, final boolean isAttribute) {
            case '\t':
            case '\n':
                if (!isAttribute) {
-                    writer.write(c);
+                    writer.write(Character.toChars(c));


(Unnecessary in this case I think.)

> Unnecessary in this case I think.

How would it compile without this hunk?

Right, I just meant in this case we know the character will be a single char. Not important.

Right, and I knew that when deciding to use Character.toChars(c) in this case and the case below rather than prematurely optimizing by casting the int to a char.

This review comment was unnecessary in this case I think.
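
For readers following this thread, a minimal sketch of the two options being discussed: casting the `int` to a `char` versus calling `Character.toChars`. The class and method names are hypothetical; the sketch builds plain strings and does not use XStream's writer.

```java
// Hypothetical demo class for illustration only; not part of XStream.
class ToCharsVersusCastDemo {

    // Narrowing cast: keeps only the low 16 bits of the code point.
    static String viaCast(int codePoint) {
        return String.valueOf((char) codePoint);
    }

    // Character.toChars: one char for BMP code points, a surrogate pair otherwise.
    static String viaToChars(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // For '\t' and '\n' both approaches produce the same single char.
        System.out.println(viaCast('\n').equals(viaToChars('\n')));   // true
        // For a supplementary code point the cast silently truncates.
        System.out.println(viaCast(0x1F98A).length());                // 1
        System.out.println(viaToChars(0x1F98A).length());             // 2
    }
}
```

In the `'\t'` and `'\n'` branch the two are equivalent, which is the reviewer's point; using `Character.toChars` everywhere keeps the code uniform and correct if a branch is ever reached with a supplementary code point.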

jglick · 2023-05-03T14:00:40Z

xstream/src/java/com/thoughtworks/xstream/io/xml/PrettyPrintWriter.java

@@ -251,7 +249,7 @@ private void writeText(final String text, final boolean isAttribute) {
                                + " in XML stream");
                        }
                    }
-                    writer.write(c);
+                    writer.write(Character.toChars(c));


Note that this could be slightly less efficient since it allocates a char[]. It does not seem that the method overall is optimized.

Is there an action item here? If not, then what is the purpose of this comment?

It is not an action item, solely to note for any other reviewers that this change could affect performance, if that is even a consideration.
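
If the `char[]` allocation ever turned out to matter, one possible alternative is to write the surrogates individually. The following is only a sketch under assumptions: it uses `java.io.Writer` as a stand-in for XStream's writer, the class and method names are hypothetical, and `Character.isBmpCodePoint`, `Character.highSurrogate` and `Character.lowSurrogate` require Java 7 or newer.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical demo class for illustration only; not part of XStream.
class SurrogateWriteDemo {

    // Writes one code point without allocating a temporary char[].
    static void writeCodePoint(Writer out, int codePoint) throws IOException {
        if (Character.isBmpCodePoint(codePoint)) {
            out.write((char) codePoint);
        } else {
            out.write(Character.highSurrogate(codePoint));
            out.write(Character.lowSurrogate(codePoint));
        }
    }

    // Convenience wrapper so the result can be inspected as a String.
    static String render(int... codePoints) {
        StringWriter out = new StringWriter();
        try {
            for (int cp : codePoints) {
                writeCodePoint(out, cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = render('a', 0x1F98A, 'b');
        System.out.println(text.length());                           // 4
        System.out.println(text.codePointCount(0, text.length()));   // 3
    }
}
```

Whether this is worth the extra branch is a judgment call; as noted above, the method as a whole does not appear to be optimized.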

basil · 2023-05-03T20:41:47Z

Why is this not assigned to the 1.4 milestone? This is a critical bug fix that we want in 1.4.

joehni · 2023-05-13T21:39:34Z

Because 1.5.x is dropping compatibility with Java 1.4 through Java 10.

basil · 2023-05-15T20:16:54Z

I think it would make more sense for the 1.4.x line to require Java 8 or newer or to backport this fix to the 1.4.x line with a for-loop based implementation that can run on Java 7 or earlier.
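
A for-loop based variant could look roughly like the sketch below. It is an illustration, not the proposed backport: the class and method names are hypothetical, and it relies on `String.codePointAt`, `String.codePointCount` and `Character.charCount`, which were added in Java 5, so it would cover Java 5 through 7 but not Java 1.4 itself.

```java
// Hypothetical demo class for illustration only; not part of XStream.
// Uses no lambdas or streams, only APIs available since Java 5.
class LegacyCodePointLoopDemo {

    static int[] codePointsOf(String text) {
        int[] result = new int[text.codePointCount(0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            result[n++] = codePoint;
            // Advance by 1 for BMP characters and by 2 for surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x1F98A)) + "b";
        int[] codePoints = codePointsOf(text);
        for (int i = 0; i < codePoints.length; i++) {
            System.out.println(Integer.toHexString(codePoints[i]));   // 61, 1f98a, 62
        }
    }
}
```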

PrettyPrintWriter fails to serialize characters in the Unicode Supplementary Multilingual Plane in XML 1.0 mode and XML 1.1 mode (32e52a6)

basil mentioned this pull request May 2, 2023

[JENKINS-71182] Correct Unicode behavior of XML_1_1 jenkinsci/jenkins#7924

Merged

6 tasks

jglick approved these changes May 3, 2023


joehni self-assigned this May 3, 2023

joehni added this to the 1.5.x milestone May 3, 2023

joehni force-pushed the master branch from 8df20b7 to bcc0a9f May 15, 2024 23:10
