Deserialize invalid UTF-8 into byte bufs as WTF-8 #877

Merged 1 commit on Aug 15, 2024

Conversation

lucacasonato
Contributor

@lucacasonato lucacasonato commented Apr 12, 2022

Previously, #828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends that support to a leading surrogate
followed by code units that are not trailing surrogates, which allows
deserialization of "\ud83c\ud83c" (two leading surrogates) or
"\ud83c\u0061" (a leading surrogate followed by "a").
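
To make the target byte layout concrete, here is a minimal std-only sketch of WTF-8 encoding for a sequence of UTF-16 code units. It is not the serde_json implementation; `encode_wtf8` and `push_generalized_utf8` are hypothetical names used only for illustration.

```rust
/// Illustrative sketch (not serde_json's code): encode UTF-16 code units as
/// WTF-8. A leading surrogate followed by a trailing surrogate becomes one
/// 4-byte sequence; any other surrogate is encoded on its own as 3 bytes.
fn encode_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < units.len() {
        let unit = units[i] as u32;
        let mut code_point = unit;
        let is_leading = (0xD800..=0xDBFF).contains(&unit);
        if is_leading && i + 1 < units.len() {
            let next = units[i + 1] as u32;
            if (0xDC00..=0xDFFF).contains(&next) {
                // Well-formed pair: combine into a supplementary code point.
                code_point = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                i += 1;
            }
        }
        i += 1;
        push_generalized_utf8(&mut out, code_point);
    }
    out
}

/// Hypothetical helper: UTF-8 bit layout without the usual surrogate check.
fn push_generalized_utf8(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}
```

Under this sketch, the code units for "\ud83c\u0061" encode to the bytes ED A0 BC 61, and "\ud83c\ud83c" encodes to ED A0 BC ED A0 BC.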

The docs also now make it clear that the invalid code points are encoded
as WTF-8. The reference to WTF-8 signals to users that they can run a
WTF-8 parser over the bytes to construct a valid UTF-8 string.
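
As one possible illustration of that last step, the sketch below turns WTF-8 bytes into a valid UTF-8 `String` by replacing each encoded surrogate with U+FFFD. The lossy replacement policy and the name `wtf8_to_utf8_lossy` are assumptions made for this example; a real WTF-8 library may offer different conversions.

```rust
/// Illustrative sketch: convert WTF-8 bytes to a valid UTF-8 string by
/// replacing every 3-byte encoded surrogate (ED A0..BF 80..BF) with U+FFFD.
/// Returns an error if the input is malformed in any other way.
fn wtf8_to_utf8_lossy(bytes: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // 0xED is never a continuation byte, so a match here always sits at
        // the start of an encoded sequence.
        let is_surrogate = bytes[i] == 0xED
            && i + 2 < bytes.len()
            && (0xA0..=0xBF).contains(&bytes[i + 1])
            && (0x80..=0xBF).contains(&bytes[i + 2]);
        if is_surrogate {
            out.extend_from_slice("\u{FFFD}".as_bytes());
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out)
}
```

Input that is already valid UTF-8 passes through unchanged, since valid UTF-8 never contains the ED A0..BF prefix.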

Follow up to #830 (review).

Comment on lines +886 to 888
// TODO: the error message is wrong, this is a lone
// _trailing_ surrogate
error(read, ErrorCode::LoneLeadingSurrogateInHexEscape)
Contributor Author

@lucacasonato lucacasonato Apr 12, 2022

Worth adding another error code? Would doing so even be semver compatible? (unrelated to this PR)

}

#[test]
fn test_byte_buf_de_surrogate_pair() {
Contributor Author

I could not find an existing test for parsing valid surrogate pairs into byte bufs, so I added one.

Contributor Author

lucacasonato commented May 18, 2022

@dtolnay Have you had a chance to look into this? It'd be great to get your review.

purplesyringa added a commit to iex-rs/serde-json that referenced this pull request Aug 12, 2024
Closes serde-rs#877.

This is a good time to make ByteBuf parsing more consistent as I'm
rewriting it anyway. This commit integrates the changes from serde-rs#877 and
also handles a leading surrogate followed by a surrogate pair correctly.

This does not affect performance significantly.

Co-authored-by: Luca Casonato <hello@lcas.dev>
@dtolnay dtolnay closed this in 96ae604 Aug 15, 2024
@dtolnay dtolnay merged commit 0f942e5 into serde-rs:master Aug 15, 2024
@lucacasonato lucacasonato deleted the wtf8_encoding branch August 15, 2024 06:31