Deserialize lone surrogates into byte bufs #828

lucacasonato · 2021-11-24T13:51:08Z

This commit deserializes lone surrogates in strings that are encoded in
escape sequences instead of erroring on them.

As per #827 (comment)

Note the implementation here is ugly & unoptimized. I'll go fix this after
an initial review of the functionality.

dtolnay

Thanks!

src/de.rs

src/read.rs

lucacasonato

I have addressed your review comments, and have added some test cases for the "non \u escape code after lone surrogate" cases.

dtolnay

Looks good — thanks.

Previously, serde-rs#828 added support for deserializing lone leading and trailing surrogates into WTF-8 encoded bytes when deserializing a string as bytes. This commit extends that support to a leading surrogate followed by code units that are not trailing surrogates. This allows deserialization of "\ud83c\ud83c" (two leading surrogates) or "\ud83c\u0061" (a leading surrogate followed by "a").

The docs now also make it clear that the invalid code points are encoded as WTF-8. This reference to WTF-8 signals to users that they can run a WTF-8 parser over the bytes to construct a valid UTF-8 string.
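To illustrate the byte layout described above, here is a minimal, standalone sketch of how UTF-16 code units from `\uXXXX` escapes could be turned into WTF-8 bytes. It is not the code from this PR: the helper names `push_wtf8` and `code_units_to_wtf8` are hypothetical and it uses only the standard library. A leading surrogate is combined with the next unit only when that unit is a trailing surrogate. Otherwise each surrogate is encoded on its own as a three-byte sequence.

```rust
/// Append the generalized UTF-8 (WTF-8) encoding of `cp` to `out`.
/// Unlike `char::encode_utf8`, this also accepts surrogate code points
/// (U+D800..=U+DFFF). Hypothetical helper, not part of serde_json.
fn push_wtf8(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            // Lone surrogates land here and become three bytes.
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Convert UTF-16 code units (as written in `\uXXXX` escapes) to WTF-8.
/// Hypothetical helper, assumed for illustration only.
fn code_units_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < units.len() {
        let unit = units[i] as u32;
        let is_leading = (0xD800..=0xDBFF).contains(&unit);
        if is_leading && i + 1 < units.len() {
            let next = units[i + 1] as u32;
            if (0xDC00..=0xDFFF).contains(&next) {
                // Valid surrogate pair: combine into one supplementary code point.
                let cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                push_wtf8(&mut out, cp);
                i += 2;
                continue;
            }
        }
        // Ordinary code unit or lone surrogate: encode it by itself.
        push_wtf8(&mut out, unit);
        i += 1;
    }
    out
}
```

Under these assumptions, `code_units_to_wtf8(&[0xD83C])` yields the bytes `ED A0 BC`, and `code_units_to_wtf8(&[0xD83C, 0x0061])` yields those three bytes followed by `61`. A well-formed pair such as `[0xD83C, 0xDF89]` produces ordinary four-byte UTF-8.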

Deserialize lone surrogates into byte bufs

849c684

lucacasonato mentioned this pull request Nov 24, 2021

Logging strings with surrogate code points crashes deno denoland/deno#12226

Closed

dtolnay requested changes Nov 24, 2021

src/de.rs (outdated, resolved)

src/read.rs (outdated, resolved)

src/read.rs (resolved)

src/read.rs (resolved)

lucacasonato added 2 commits November 24, 2021 23:29

fix wording

4c28c57

fix parsing escape sequences after lone surrogates

07c740c

lucacasonato commented Nov 24, 2021

dtolnay approved these changes Nov 25, 2021

dtolnay merged commit 691466c into serde-rs:master Nov 25, 2021

lucacasonato deleted the lone_surrogate branch November 25, 2021 11:39

lucacasonato mentioned this pull request Apr 12, 2022

Deserialize invalid UTF-8 into byte bufs as WTF-8 #877

Open

lucacasonato mentioned this pull request May 18, 2022

Support serializing strings containing lone surrogates with deserialize_any #890

Open

helixbass mentioned this pull request Dec 10, 2023

Deserializing lone surrogate to ByteBuf fails when it's nested in an enum #1089

Open
