Handle invalid UTF-8 sequences [INFRA-196] #805
jbangelo
left a comment
I think we can get rid of that extra copy.
By the way, do we know why we're getting non-UTF-8 strings? I know that SBP doesn't technically specify the string encoding, but I thought the Piksi basically sends ASCII and thought UTF-8 was a strict superset of ASCII.
rust/sbp/src/parser/mod.rs
Outdated
```rust
    Ok(s)
pub(crate) fn read_string(buf: &mut dyn Read) -> Result<String> {
    let mut string_buffer = [0u8; crate::SBP_MAX_PAYLOAD];
    let len = buf.read(&mut string_buffer)?; // TODO: figure out how to get rid of this copy
```
To get rid of the extra copy we could modify the signature of this function to take in a `&mut &[u8]` like the other read functions. I think we would then need to find the null terminator and pass that subslice into `String::from_utf8_lossy`. Something similar could be done for `read_string_limit()`.
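A minimal sketch of what that suggestion could look like, assuming the string is terminated by a NUL byte or by the end of the payload. The name `read_string_sketch` and the choice to return a `String` directly are illustrative assumptions, not the code in this PR:

```rust
/// Hypothetical variant of `read_string` that borrows from the payload slice
/// instead of copying it into a stack buffer first.
///
/// Consumes bytes up to and including the first NUL (or the rest of the slice
/// if there is no NUL) and converts them lossily, so invalid UTF-8 never panics.
fn read_string_sketch(buf: &mut &[u8]) -> String {
    // Copy the inner slice reference out so we can re-point `buf` afterwards.
    let data: &[u8] = *buf;

    // Index of the terminator, or the end of the data if there is none.
    let end = data.iter().position(|&b| b == 0).unwrap_or(data.len());

    // Invalid sequences become U+FFFD rather than causing an error.
    let s = String::from_utf8_lossy(&data[..end]).into_owned();

    // Skip the terminator itself when one was found.
    let consumed = if end < data.len() { end + 1 } else { end };
    *buf = &data[consumed..];
    s
}
```

Advancing the caller's slice in place mirrors how `Read for &[u8]` behaves, so the function can be chained with other slice-based readers.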
I tried `&[u8]` too but had issues with "short" reads during testing. This version preserves the existing semantics but gets rid of the panic on invalid UTF-8 data.
I didn't fully understand what you wrote at first, but I looked at the implementation of `Read` for `&[u8]` to understand how to use `&mut &[u8]` instead of this copy.
It's a weird situation, and it actually points to a larger issue. Since we're using … This causes us to size the SBP message incorrectly, because the size of the … This is an example message from the bug that causes an issue (this is index 8 of an SBP settings "read by index" response): … This is obviously a garbage message, and I'm not sure why the device would be generating this... but the last byte before the CRC … To fix this we need to not use … *: We can force Rust to store a …
jbangelo
left a comment
Overall looks good. Just a couple of non-critical suggestions. 👍
```rust
impl Into<String> for SbpString {
    fn into(self) -> String {
        String::from_utf8_lossy(&self.0).into()
    }
}
```
It looks like the orphan rules changed in v1.41 to allow `impl From<SbpString> for String`: https://doc.rust-lang.org/stable/std/convert/trait.Into.html#implementing-into-for-conversions-to-external-types-in-old-versions-of-rust
I'm not sure if we want to restrict the version of rust we require.
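For comparison, here is a sketch of the `From` form that Rust 1.41 and later accept. The `SbpString` definition below is a stand-in written for this example (a public tuple struct around `Vec<u8>`), not the PR's exact type:

```rust
/// Stand-in for the PR's wrapper type; the real definition may differ.
pub struct SbpString(pub Vec<u8>);

// Implementing `From` on the foreign type `String` is permitted by the
// orphan rules from Rust 1.41 onward, because `SbpString` is a local type.
impl From<SbpString> for String {
    fn from(s: SbpString) -> String {
        // Lossy conversion: invalid sequences become U+FFFD.
        String::from_utf8_lossy(&s.0).into_owned()
    }
}
```

Implementing `From` also provides `Into<String>` through the standard blanket impl, so existing `.into()` call sites would keep working.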
I don't have many thoughts on this at the moment, but 1.41 feels too new since it came out this year. That said, it seems like it's supported on all recent Ubuntu releases: https://packages.ubuntu.com/search?keywords=rustc (as a security update though)
```rust
let mut slice = &v[..];

let string: String = read_string(&mut slice).unwrap().into();
assert_eq!(string, "hi, imma string".to_string());
```
🤣
rust/sbp/src/serialize.rs
Outdated
```rust
impl SbpSerialize for SbpString {
    fn append_to_sbp_buffer(&self, buf: &mut Vec<u8>) {
        buf.extend(self.as_bytes());
        buf.extend(self.0.clone());
```
Could we add an `as_bytes()` to `SbpString` as well? I think cloning here is going to make another copy that we could avoid.
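A rough sketch of that idea, assuming `SbpString` is a tuple struct around `Vec<u8>`. The free function below stands in for the `SbpSerialize::append_to_sbp_buffer` method so the example is self-contained:

```rust
/// Stand-in for the PR's wrapper type; the real definition may differ.
pub struct SbpString(pub Vec<u8>);

impl SbpString {
    /// Borrow the raw bytes without allocating.
    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}

/// Stand-in for `SbpSerialize::append_to_sbp_buffer`.
pub fn append_to_sbp_buffer(s: &SbpString, buf: &mut Vec<u8>) {
    // Copies straight from the borrowed slice into `buf`; no temporary
    // `Vec` is created, unlike `buf.extend(self.0.clone())`.
    buf.extend_from_slice(s.as_bytes());
}
```

The bytes are written exactly as stored, so invalid UTF-8 round-trips unchanged.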
rust/sbp/src/sbp2json.rs
Outdated
```rust
let sender_id = msg.get_sender_id();
let size = msg.sbp_size();
let size = payload[MSG_HEADER_SIZE_OFFSET] as usize;
```
Why change this?
I should put this back; this was originally where the `String` sizing issue manifested.
I'm still curious why we would ever be generating a string with … On a side note, should we update the SBP spec to note that strings might not be strictly UTF-8?
I think this would be prudent. I think it only happens when there's a bug and the device spits out some uninitialized memory, but since we don't have any mechanism to prevent invalid UTF-8 from being emitted, it's possible for it to happen.
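To make the failure mode concrete, here is a small sketch of what happens to a stray `\xb6` byte under strict and lossy conversion. The helper name `lossy_lengths` and the sample input are made up for illustration:

```rust
/// Returns (raw byte length, byte length after lossy UTF-8 conversion).
fn lossy_lengths(raw: &[u8]) -> (usize, usize) {
    // Each invalid sequence is replaced with U+FFFD, which is
    // three bytes long when encoded as UTF-8.
    let converted = String::from_utf8_lossy(raw);
    (raw.len(), converted.len())
}
```

For the input `b"ok\xb6"`, strict validation with `std::str::from_utf8` returns an error, while the lossy conversion succeeds but grows the data from 3 bytes to 5. Any size computed from the converted string would therefore no longer match the bytes on the wire.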
Any SBP message with a string member can contain data that is not a valid UTF-8 sequence. Since we're using Rust's `String` type, which requires valid UTF-8, we need to do a lossy conversion in the presence of invalid UTF-8 data.

High level changes:

- Use `SbpString` (which wraps `Vec<u8>`) to store SBP string data instead of `String` directly, and convert to `String` or `Vec<u8>` as needed.
- Use `String::from_utf8_lossy` to deal with invalid UTF-8 sequences when converting; this is identical to Haskell's behavior when converting to a JSON string.
- Call `lock()` on the stdin/stdout objects upfront so we don't have to do it within each read/write call (which happens implicitly in the stdlib); this bumps up performance on some benchmarks by 20-30%.
- Add `pub(crate)` to functions that probably don't need to be part of our public API.

Details: since we were previously using `String` to store strings from SBP messages, there's a mismatch between what SBP strings allow and what `String` in Rust allows. In Rust you either panic or do a lossy conversion for invalid UTF-8 data *, so when we avoid the panic using `from_utf8_lossy` we get a `String` that's 2 bytes too large (the one invalid byte gets converted to 3 bytes to encode the Unicode replacement ("unknown") character, U+FFFD).

This causes us to size the SBP message incorrectly: the size of the `String` member object doesn't match what was in the SBP message, so our `sbp_size()` method reports the wrong value, which breaks the encoding of the message.

This is an example message from the bug that causes an issue (this is index 8 of an SBP settings "read by index" response):

This is obviously a garbage message, and it's not clear why the device would be generating this... but the last byte before the CRC, `\xb6`, is invalid because it starts with `10xx xxxx` and there's no preceding `110x xxxx` pattern to indicate that it's part of a multi-byte UTF-8 sequence. So, again, this gets replaced with the Unicode replacement character.

To fix this we need to not use `String`, but (as implemented here) a wrapper type around `Vec<u8>` which converts to `String` or `Vec<u8>` when needed.

*: We can force Rust to store a `&[u8]` as UTF-8 without validating it with the unsafe method `String::from_utf8_unchecked`, but this panics when you attempt to access the last byte of the buffer from a message like the one above.
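The upfront `lock()` change listed in the high level changes can be sketched roughly as follows. The function names `pump` and `run` are hypothetical, and this only illustrates the locking pattern rather than the actual `sbp2json` loop:

```rust
use std::io::{self, Read, Write};

/// Copies all bytes from `input` to `output`. Generic over the reader and
/// writer so that already-locked handles (or in-memory buffers) can be passed.
fn pump<R: Read, W: Write>(mut input: R, mut output: W) -> io::Result<u64> {
    let n = io::copy(&mut input, &mut output)?;
    output.flush()?;
    Ok(n)
}

/// Acquires each lock once and reuses it for every read and write, instead
/// of letting `Stdin`/`Stdout` re-lock implicitly on each call.
fn run() -> io::Result<u64> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    pump(stdin.lock(), stdout.lock())
}
```

Keeping the I/O code generic over `Read` and `Write` also makes it straightforward to exercise with byte slices and `Vec<u8>` in tests.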