Deserialize invalid UTF-8 into byte bufs as WTF-8 #877

lucacasonato · 2022-04-12T00:10:49Z

Previously, #828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends that support to a leading surrogate
followed by a code unit that is not a trailing surrogate. This allows
for deserialization of "\ud83c\ud83c" (two leading surrogates), or
"\ud83c\u0061" (a leading surrogate followed by "a").
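
For illustration, here is a minimal sketch of how a single surrogate code unit maps to WTF-8 bytes (the three-byte generalized UTF-8 form). The function name `encode_surrogate_wtf8` is hypothetical and is not part of this PR or of serde_json; it only shows the bit layout.

```rust
/// Encode one UTF-16 surrogate code unit (0xD800..=0xDFFF) as the three
/// bytes that generalized UTF-8 (and therefore WTF-8) uses for it.
/// Returns `None` if `unit` is not a surrogate.
/// Hypothetical helper for illustration only.
fn encode_surrogate_wtf8(unit: u16) -> Option<[u8; 3]> {
    if !(0xD800..=0xDFFF).contains(&unit) {
        return None;
    }
    Some([
        // 1110xxxx: top 4 bits of the code unit
        0xE0 | (unit >> 12) as u8,
        // 10xxxxxx: middle 6 bits
        0x80 | ((unit >> 6) & 0x3F) as u8,
        // 10xxxxxx: low 6 bits
        0x80 | (unit & 0x3F) as u8,
    ])
}
```

Under this layout, the leading surrogate `\ud83c` becomes the bytes `ED A0 BC`. The input "\ud83c\ud83c" would then correspond to those three bytes twice, and "\ud83c\u0061" to those three bytes followed by `61`.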

The docs now also make it clear that the invalid code points are encoded
as WTF-8. The reference to WTF-8 signals to users that they can run a
WTF-8 parser over the bytes to construct a valid UTF-8 string.
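
One way a consumer might turn such bytes back into a valid UTF-8 string is a lossy conversion that replaces each encoded surrogate with U+FFFD. The sketch below assumes the input is well-formed WTF-8; the name `wtf8_to_string_lossy` is hypothetical, and a dedicated WTF-8 library may be a better fit in practice.

```rust
/// Convert WTF-8 bytes to a `String`, replacing each encoded surrogate
/// (ED A0..BF 80..BF) with U+FFFD. Any other invalid byte is also
/// replaced, one byte at a time. Hypothetical helper for illustration.
fn wtf8_to_string_lossy(bytes: &[u8]) -> String {
    let mut out = String::new();
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                break;
            }
            Err(err) => {
                // Copy the prefix that is already valid UTF-8.
                let valid_len = err.valid_up_to();
                out.push_str(std::str::from_utf8(&rest[..valid_len]).unwrap());
                out.push('\u{FFFD}');
                let tail = &rest[valid_len..];
                // Skip a whole three-byte encoded surrogate if present,
                // otherwise skip a single offending byte.
                let is_surrogate = tail.len() >= 3
                    && tail[0] == 0xED
                    && (0xA0..=0xBF).contains(&tail[1])
                    && (0x80..=0xBF).contains(&tail[2]);
                rest = &tail[if is_surrogate { 3 } else { 1 }..];
            }
        }
    }
    out
}
```

Replacing with U+FFFD is a design choice made for this sketch; a caller that needs to preserve the original code units would instead decode the surrogates back to UTF-16.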

Follow up to #830 (review).

lucacasonato · 2022-04-12T00:12:20Z

src/read.rs

+                        // TODO: the error message is wrong, this is a lone
+                        // _trailing_ surrogate
                        error(read, ErrorCode::LoneLeadingSurrogateInHexEscape)


Worth adding another error code? Would doing so even be semver compatible? (unrelated to this PR)

lucacasonato · 2022-04-12T00:13:53Z

tests/test.rs

+}
+
+#[test]
+fn test_byte_buf_de_surrogate_pair() {


I could not find an existing test for parsing valid surrogate pairs into byte bufs, so I added one.
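
For context, a valid surrogate pair is combined into one supplementary code point, which is then written as ordinary four-byte UTF-8. The sketch below shows that arithmetic; `surrogate_pair_to_utf8` is a hypothetical name, and the pair used in the note after the code is an illustrative choice, not the data from the test added in this PR.

```rust
/// Combine a leading and trailing surrogate into a code point and return
/// its UTF-8 bytes. Returns `None` if the two units do not form a pair.
/// Hypothetical helper for illustration only.
fn surrogate_pair_to_utf8(lead: u16, trail: u16) -> Option<Vec<u8>> {
    if !(0xD800..=0xDBFF).contains(&lead) || !(0xDC00..=0xDFFF).contains(&trail) {
        return None;
    }
    // Each surrogate contributes 10 bits; the result is offset by 0x10000.
    let code_point =
        0x10000 + (((lead as u32 - 0xD800) << 10) | (trail as u32 - 0xDC00));
    let c = char::from_u32(code_point)?;
    let mut buf = [0u8; 4];
    Some(c.encode_utf8(&mut buf).as_bytes().to_vec())
}
```

For example, the pair `\ud83c\udf89` combines to U+1F389, whose UTF-8 encoding is `F0 9F 8E 89`. Two leading surrogates in a row, or a leading surrogate followed by "a", do not form a pair, so the function returns `None` for them.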

lucacasonato · 2022-05-18T19:28:28Z

@dtolnay Have you had a chance to look into this? It'd be great to get your review.

Closes serde-rs#877.

This is a good time to make ByteBuf parsing more consistent as I'm rewriting it anyway. This commit integrates the changes from serde-rs#877 and also handles a leading surrogate followed by a surrogate pair correctly. This does not affect performance significantly.

Co-authored-by: Luca Casonato <hello@lcas.dev>
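
The case of a leading surrogate followed by a surrogate pair can be sketched with a one-unit lookahead over already-decoded UTF-16 code units. This is an assumption-laden illustration, not the code from either PR: `units_to_wtf8` and `push_generalized_utf8` are hypothetical names, and the real parser works on `\uXXXX` escapes in the input rather than on a `&[u16]` slice.

```rust
/// Append `cp` in generalized UTF-8, which also accepts surrogates.
/// Hypothetical helper for illustration only.
fn push_generalized_utf8(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Convert UTF-16 code units to WTF-8: valid pairs are combined, and any
/// unpaired surrogate is kept as its own three-byte sequence.
fn units_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < units.len() {
        let unit = units[i] as u32;
        if (0xD800..=0xDBFF).contains(&unit) && i + 1 < units.len() {
            let next = units[i + 1] as u32;
            if (0xDC00..=0xDFFF).contains(&next) {
                // Lead + trail: combine into one supplementary code point.
                let cp = 0x10000 + (((unit - 0xD800) << 10) | (next - 0xDC00));
                push_generalized_utf8(&mut out, cp);
                i += 2;
                continue;
            }
        }
        // Not the start of a pair: emit this unit alone and re-examine
        // the next unit on the following iteration.
        push_generalized_utf8(&mut out, unit);
        i += 1;
    }
    out
}
```

The key point is that a lone leading surrogate consumes only itself, so the unit after it still gets the chance to start a pair. With the units `D83C D83C DF89`, the first `D83C` is emitted alone as `ED A0 BC` and the remaining pair becomes `F0 9F 8E 89`.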

lucacasonato force-pushed the wtf8_encoding branch from c3bfa51 to fbd1d68 Compare April 12, 2022 00:14

lucacasonato force-pushed the wtf8_encoding branch from fbd1d68 to f50e296 Compare May 18, 2022 19:28

purplesyringa mentioned this pull request Aug 12, 2024

Speed up \uXXXX parsing and improve WTF-8 handling #1175

Merged

dtolnay closed this in 96ae604 Aug 15, 2024

dtolnay merged commit 0f942e5 into serde-rs:master Aug 15, 2024

lucacasonato deleted the wtf8_encoding branch August 15, 2024 06:31
