-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow lone surrogates in raw values #830
Allow lone surrogates in raw values #830
Conversation
This allows for a proper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow why #828 is not suitable for your use case. How come all the string processing logic has to be duplicated instead of deserializing to bytebuf and patching out all the leftover surrogates with fffd?
fn main() {
println!("{}", String::from_utf8_lossy(&[b'a', 237, 160, 188, b'b']));
} Because UTF-8 decoders will treat this sequence that is meant to encode the lone surrogate as a sequence of three invalid codepoints, thus emitting 3 U+FFFD chars instead of just one. This is the case for both the UTF8 parser in Rust's stdlib, and the one in encoding_rs, and the one specified by https://encoding.spec.whatwg.org (browser ones). As such, I don't think this is a valid way of representing lone surrogates. The PR is useful even if no lone surrogates are used, because right now Ideally |
You're fully in control of the interpretation of bytes to utf-8 though. It doesn't really seem relevant how other logic that interprets bytes to utf-8 does it if those aren't the logic you need. My expectation would be something like: let mut bytes: ByteBuf = /* ... */;
for i in 2..bytes.len() {
// surrogates are codepoints 1101_1xxxxx_xxxxxx
if bytes[i - 2] == 0b1110_1101 && bytes[i - 1] >= 0b1010_0000 {
bytes[i - 2..=i].copy_from_slice("\u{fffd}".as_bytes());
}
} |
To elaborate a bit in response to:
If |
That would work, but it still seems pretty unfortunate that the
EDIT: looks like we were both typing at once :-) - this comment does not incorporate a response to your previous comment yet. |
I don't really see how this is feasible to do with the current implementation. Let's say you have the string // NOTE: implementation below likely not optimal.
fn main() {
let str = r#"["\ud83c","\udf0e"]"#;
let b1: Vec<Box<serde_json::value::RawValue>> = serde_json::from_str(str).unwrap();
let mut intermediary = String::from('"');
for buf in b1 {
let str = buf.get();
intermediary.push_str(&str[1..str.len() - 1]);
}
intermediary.push('"');
let recombined: String = serde_json::from_str(&intermediary).unwrap();
assert_eq!(recombined, "🌎");
} |
I just want to reiterate: whatever we settle on for |
Is this because we're not in agreement that the choice of representation is lossless, or because the algorithm for performing the stitching would be infeasible to implement in practice?
Yep that's what I had in mind: fn stitch(mut a: ByteBuf, b: ByteBuf) -> ByteBuf {
let a_len = a.len();
if let Some(&[0b1110_1101, a1 @ 0b1010_0000..=0b1010_1111, a2]) = a.get(a_len - 3..) {
if let Some(&[0b1110_1101, b1 @ 0b1011_0000..=0b1011_1111, b2]) = b.get(..3) {
let n = 1 + ((a1 & 0b0000_1111) as u32) << 16
| ((a2 & 0b0011_1111) as u32) << 10
| ((b1 & 0b0000_1111) as u32) << 6
| (b2 & 0b0011_1111) as u32;
if let Some(ch) = char::from_u32(n) {
a.truncate(a_len - 3);
a.reserve(ch.len_utf8() + b.len() - 3);
a.extend_from_slice(ch.encode_utf8(&mut [0; 4]).as_bytes());
a.extend_from_slice(&b[3..]);
return a;
}
}
}
a.extend_from_slice(&b);
a
} Forcing this kind of manipulation to be done using RawValue instead is not a compelling solution because enabling RawValue anywhere in a dependency graph adds performance overhead into all JSON deserializations that don't even end up deserializing a RawValue, so it's often not an option for codebases, or mean that there are dependencies they won't be able to use. So whatever RawValue does or doesn't do, I feel it's important for there to be an approach that does not involve RawValue. I'm also not planning on building lossy strings into serde_json, and not keen about large pieces of our string processing code being duplicated downstream. For these reasons I'm pushing back because it makes sense to keep RawValue the least useful possible, subject to addressing all the use cases for which it actually is necessary. |
I was not aware of the performance impact. Currently there is no way to do a partial deserialization of JSON that contains unpaired surrogates. I don't really see how that could be made to work with
I am not generally against a lossless representation, but the current choice of lossless representation is not great. I did figure out that this representation is not completely bespoke - it is called WTF-8. There is a Rust crate for it but it, but since WTF-8 must not be used for interchange there is no way to use that crate to get a WTF-8 view on an existing byte buffer containing WTF-8 bytes (this can be hacked around with some unsafe Let's properly document the use of WTF-8 here though: a
Yeah - I understand, I can use the If the decision is made that the encoding we are going for is really WTF-8, we should follow the encoding spec properly: https://simonsapin.github.io/wtf-8/#encode-to-wtf-8. The case |
5ec8141
to
51e9616
Compare
@dtolnay Hey, quick poke - have you had a chance to look into / think about my comment above yet? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Thanks!
The case
\ud83c\uaaaa
is not handled correctly right now for example.
I would accept a PR to address this case.
See https://github.com/serde-rs/json/issues/827#issuecomment-979208987