Skip to content

Conversation

@silverjam
Copy link
Contributor

@silverjam silverjam commented May 19, 2020

Any SBP message with a string member can contain data that is not a valid UTF-8 sequence, since we're using Rust's String type which requires valid UTF-8 we need to do a lossy conversion in the presence of invalid UTF-8 data.

High level changes:

  • Use SbpString (which wraps Vec<u8>) to store SBP string data instead of String directly and convert to String or Vec<u8> as needed

  • Use String::from_utf8_lossyto deal with invalid UTF-8 sequences when converting-- this is identical to Haskell's behavior when converting to a JSON string

  • Call lock() on stdin/stdout objects upfront so we don't have to do it within each read/write call (which happens implicitly in the stdlib)-- this bumps up performance on some benchmarks by 20-30%.

  • Add pub(crate) to functions that probably don’t need to be part of our public API


Details: since we were previously using String to store strings from SBP messages there's a mismatch between what SBP strings allow and what String in Rust allows. In Rust you either panic or do a lossy conversion for invalid UTF-8 data *, so when we avoid the panic using from_ut8_lossy we get a String that's 2 bytes too large (the one invalid byte gets converted to 3 bytes to encode the Unicode "unknown" character).

This causes us to size the SBP message incorrectly, because the size of the String member object doesn't match what was in the SBP message, so our sbp_size() method reports the wrong value, which obviously messes up the encoding of the message.

This is an example message from bug that causes an issue (this is index 8 of a sbp settings "read by index" response):

❯ cat invalid_utf8.bin | od -tx1
0000000 55 a7 00 00 10 48 08 00 73 6f 6c 75 74 69 6f 6e
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 b6 e8 ab
0000120

This is obviously a garbage message, and it's not clear why the device would be generating this... but the last byte before the crc \xb6 is invalid because it starts with 10xx xxxx and there's no preceding 110 x xxxx pattern to indicate that it's part of a multi-byte UTF-8 sequence. So, again this gets replaced with the Unicode unknown character.

To fix this we need to not use String, but (as implemented here) a wrapper type around Vec<u8> which converts to String or Vec<u8> when needed.


*: We can force Rust to store a &[u8] as UTF-8 without validating it with the unsafe method String::from_utf8_unchecked but this panics when you attempt to access the last byte of the buffer from a message like the one above.

@silverjam silverjam requested a review from jbangelo May 19, 2020 05:37
Copy link
Contributor

@jbangelo jbangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get rid of that extra copy.

By the way, do we know why we're getting non-UTF-8 strings? I know that SBP doesn't technically specify the string encoding, but I thought the Piksi basically sends ASCII and thought UTF-8 was a strict superset of ASCII.

Ok(s)
pub(crate) fn read_string(buf: &mut dyn Read) -> Result<String> {
let mut string_buffer = [0u8; crate::SBP_MAX_PAYLOAD];
let len = buf.read(&mut string_buffer)?; // TODO: figure out how to get rid of this copy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get rid of the extra copy we could modify the signature of this function to take in a &mut &[u8] like the other read functions. I think we would need to then find the null terminator and then pass that subslice into String::from_utf8_lossy. And something similar could be done to read_string_limit().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried &[u8] too but had issues with "short" reads during testing, this version preserved existing semantics but got rid of the panic on invalid UTF-8 data.

Copy link
Contributor Author

@silverjam silverjam May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't fully understand what you wrote at first, but I looked at the implementation of Read for &[u8] to understand how to use &mut &[u8] instead of this copy.

@silverjam
Copy link
Contributor Author

silverjam commented May 19, 2020

By the way, do we know why we're getting non-UTF-8 strings? I know that SBP doesn't technically specify the string encoding, but I thought the Piksi basically sends ASCII and thought UTF-8 was a strict superset of ASCII.

It's a weird situation, and it actually points to a larger issue. Since we're using String to store strings from SBP messages there's a mismatch between what SBP strings allow and what String in Rust allows. In Rust you either panic or do a lossy conversion for invalid UTF-8 data *, so when we avoid the panic using from_ut8_lossy we get a String that's 2 bytes too large (the one invalid byte gets converted to 3 bytes to encode the Unicode "unknown" character).

This causes us to size the SBP message incorrectly, because the size of the String member object doesn't match what was in the SBP message, so our sbp_size() method reports the wrong value, which obviously messes up the encoding of the message.

This is an example message from bug that causes an issue (this is index 8 of a sbp settings "read by index" response):

❯ cat invalid_utf8.bin | od -tx1                                                                       (libsbp)
0000000 55 a7 00 00 10 48 08 00 73 6f 6c 75 74 69 6f 6e
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 b6 e8 ab
0000120

This is obviously a garbage message, and I'm not sure why the device would be generating this... but the last byte before the crc \xb6 is invalid because it starts with 10xx xxxx and there's no preceding 110 x xxxx pattern to indicate that it's part of a multi-byte sequence. So, again this gets replaced with the Unicode unknown character.

To fix this we need to not use String, but (as I've implemented here) a wrapper type around Vec<u8> which converts to String or Vec<u8> when needed.


*: We can force Rust to store a &[u8] as UTF-8 without validating it with the unsafe method String::from_utf8_unchecked but this panics when you attempt to access the last byte of the buffer from a message like the one above.

@silverjam silverjam requested a review from jbangelo May 20, 2020 00:01
@silverjam silverjam requested a review from dgburr May 20, 2020 21:43
Copy link
Contributor

@jbangelo jbangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Just a couple of non-critical suggestions. 👍

Comment on lines +55 to +59
impl Into<String> for SbpString {
fn into(self) -> String {
String::from_utf8_lossy(&self.0).into()
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the orphaning rules changed in v1.41 to allow to impl From<SbpString> for String: https://doc.rust-lang.org/stable/std/convert/trait.Into.html#implementing-into-for-conversions-to-external-types-in-old-versions-of-rust

I'm not sure if we want to restrict the version of rust we require.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't many thoughts on this at the moment, but 1.41 feels too new since it came out this year-- despite that it seems like it's supported on all recent Ubuntu releases: https://packages.ubuntu.com/search?keywords=rustc (as a security update though)

let mut slice = &v[..];

let string: String = read_string(&mut slice).unwrap().into();
assert_eq!(string, "hi, imma string".to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤣

impl SbpSerialize for SbpString {
fn append_to_sbp_buffer(&self, buf: &mut Vec<u8>) {
buf.extend(self.as_bytes());
buf.extend(self.0.clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a as_bytes() to SbpString as well? I think cloning here is going to make another copy that we could avoid.


let sender_id = msg.get_sender_id();
let size = msg.sbp_size();
let size = payload[MSG_HEADER_SIZE_OFFSET] as usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should put this back, this was originally where the String sizing issue manifested

@jbangelo
Copy link
Contributor

I'm still curious why we would ever be generating a string with \x86 at the end, but that's not really critical for merging this PR.

On a side note, should we update the SBP spec to note that strings might not be strictly UTF-8?

@silverjam
Copy link
Contributor Author

On a side note, should we update the SBP spec to note that strings might not be strictly UTF-8?

I think this would be prudent, but I think it only happens when there's a bug and the device spits out some uninitialized memory, but since we don't have any mechanism to prevent invalid UTF-8 from being emitted it's possible for it to happen.

@silverjam silverjam merged commit 5a096ea into master May 22, 2020
@silverjam silverjam deleted the silverjam/win32-stdin branch May 22, 2020 00:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants