Handle invalid UTF-8 sequences [INFRA-196] #805

silverjam · 2020-05-19T02:03:38Z

Any SBP message with a string member can contain data that is not a valid UTF-8 sequence, since we're using Rust's String type which requires valid UTF-8 we need to do a lossy conversion in the presence of invalid UTF-8 data.

High level changes:

Use SbpString (which wraps Vec<u8>) to store SBP string data instead of String directly and convert to String or Vec<u8> as needed
Use String::from_utf8_lossyto deal with invalid UTF-8 sequences when converting-- this is identical to Haskell's behavior when converting to a JSON string
Call lock() on stdin/stdout objects upfront so we don't have to do it within each read/write call (which happens implicitly in the stdlib)-- this bumps up performance on some benchmarks by 20-30%.
Add pub(crate) to functions that probably don’t need to be part of our public API

Details: since we were previously using String to store strings from SBP messages there's a mismatch between what SBP strings allow and what String in Rust allows. In Rust you either panic or do a lossy conversion for invalid UTF-8 data *, so when we avoid the panic using from_ut8_lossy we get a String that's 2 bytes too large (the one invalid byte gets converted to 3 bytes to encode the Unicode "unknown" character).

This causes us to size the SBP message incorrectly, because the size of the String member object doesn't match what was in the SBP message, so our sbp_size() method reports the wrong value, which obviously messes up the encoding of the message.

This is an example message from bug that causes an issue (this is index 8 of a sbp settings "read by index" response):

❯ cat invalid_utf8.bin | od -tx1
0000000 55 a7 00 00 10 48 08 00 73 6f 6c 75 74 69 6f 6e
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 b6 e8 ab
0000120

This is obviously a garbage message, and it's not clear why the device would be generating this... but the last byte before the crc \xb6 is invalid because it starts with 10xx xxxx and there's no preceding 110 x xxxx pattern to indicate that it's part of a multi-byte UTF-8 sequence. So, again this gets replaced with the Unicode unknown character.

To fix this we need to not use String, but (as implemented here) a wrapper type around Vec<u8> which converts to String or Vec<u8> when needed.

*: We can force Rust to store a &[u8] as UTF-8 without validating it with the unsafe method String::from_utf8_unchecked but this panics when you attempt to access the last byte of the buffer from a message like the one above.

jbangelo

I think we can get rid of that extra copy.

By the way, do we know why we're getting non-UTF-8 strings? I know that SBP doesn't technically specify the string encoding, but I thought the Piksi basically sends ASCII and thought UTF-8 was a strict superset of ASCII.

jbangelo · 2020-05-19T18:55:59Z

rust/sbp/src/parser/mod.rs

-    Ok(s)
+pub(crate) fn read_string(buf: &mut dyn Read) -> Result<String> {
+    let mut string_buffer = [0u8; crate::SBP_MAX_PAYLOAD];
+    let len = buf.read(&mut string_buffer)?; // TODO: figure out how to get rid of this copy


To get rid of the extra copy we could modify the signature of this function to take in a &mut &[u8] like the other read functions. I think we would need to then find the null terminator and then pass that subslice into String::from_utf8_lossy. And something similar could be done to read_string_limit().

I tried &[u8] too but had issues with "short" reads during testing, this version preserved existing semantics but got rid of the panic on invalid UTF-8 data.

I didn't fully understand what you wrote at first, but I looked at the implementation of Read for &[u8] to understand how to use &mut &[u8] instead of this copy.

silverjam · 2020-05-19T23:45:10Z

By the way, do we know why we're getting non-UTF-8 strings? I know that SBP doesn't technically specify the string encoding, but I thought the Piksi basically sends ASCII and thought UTF-8 was a strict superset of ASCII.

It's a weird situation, and it actually points to a larger issue. Since we're using String to store strings from SBP messages there's a mismatch between what SBP strings allow and what String in Rust allows. In Rust you either panic or do a lossy conversion for invalid UTF-8 data *, so when we avoid the panic using from_ut8_lossy we get a String that's 2 bytes too large (the one invalid byte gets converted to 3 bytes to encode the Unicode "unknown" character).

This causes us to size the SBP message incorrectly, because the size of the String member object doesn't match what was in the SBP message, so our sbp_size() method reports the wrong value, which obviously messes up the encoding of the message.

This is an example message from bug that causes an issue (this is index 8 of a sbp settings "read by index" response):

❯ cat invalid_utf8.bin | od -tx1                                                                       (libsbp)
0000000 55 a7 00 00 10 48 08 00 73 6f 6c 75 74 69 6f 6e
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 b6 e8 ab
0000120

This is obviously a garbage message, and I'm not sure why the device would be generating this... but the last byte before the crc \xb6 is invalid because it starts with 10xx xxxx and there's no preceding 110 x xxxx pattern to indicate that it's part of a multi-byte sequence. So, again this gets replaced with the Unicode unknown character.

To fix this we need to not use String, but (as I've implemented here) a wrapper type around Vec<u8> which converts to String or Vec<u8> when needed.

*: We can force Rust to store a &[u8] as UTF-8 without validating it with the unsafe method String::from_utf8_unchecked but this panics when you attempt to access the last byte of the buffer from a message like the one above.

jbangelo

Overall looks good. Just a couple of non-critical suggestions. 👍

jbangelo · 2020-05-21T21:43:07Z

rust/sbp/src/lib.rs

+impl Into<String> for SbpString {
+    fn into(self) -> String {
+        String::from_utf8_lossy(&self.0).into()
+    }
+}


It looks like the orphaning rules changed in v1.41 to allow to impl From<SbpString> for String: https://doc.rust-lang.org/stable/std/convert/trait.Into.html#implementing-into-for-conversions-to-external-types-in-old-versions-of-rust

I'm not sure if we want to restrict the version of rust we require.

I don't many thoughts on this at the moment, but 1.41 feels too new since it came out this year-- despite that it seems like it's supported on all recent Ubuntu releases: https://packages.ubuntu.com/search?keywords=rustc (as a security update though)

jbangelo · 2020-05-21T21:45:37Z

rust/sbp/src/parser/mod.rs

+    let mut slice = &v[..];
+
+    let string: String = read_string(&mut slice).unwrap().into();
+    assert_eq!(string, "hi, imma string".to_string());


jbangelo · 2020-05-21T21:48:38Z

rust/sbp/src/serialize.rs

+impl SbpSerialize for SbpString {
    fn append_to_sbp_buffer(&self, buf: &mut Vec<u8>) {
-        buf.extend(self.as_bytes());
+        buf.extend(self.0.clone());


Could we add a as_bytes() to SbpString as well? I think cloning here is going to make another copy that we could avoid.

jbangelo · 2020-05-21T21:54:19Z

rust/sbp/src/sbp2json.rs


    let sender_id = msg.get_sender_id();
-    let size = msg.sbp_size();
+    let size = payload[MSG_HEADER_SIZE_OFFSET] as usize;


Why change this?

I should put this back, this was originally where the String sizing issue manifested

jbangelo · 2020-05-21T22:02:11Z

I'm still curious why we would ever be generating a string with \x86 at the end, but that's not really critical for merging this PR.

On a side note, should we update the SBP spec to note that strings might not be strictly UTF-8?

silverjam · 2020-05-21T22:09:20Z

On a side note, should we update the SBP spec to note that strings might not be strictly UTF-8?

I think this would be prudent, but I think it only happens when there's a bug and the device spits out some uninitialized memory, but since we don't have any mechanism to prevent invalid UTF-8 from being emitted it's possible for it to happen.

…p semantics of constant

Jason Mobarak added 6 commits May 18, 2020 17:47

win32 stdin

ae511eb

Fix issue with buffer reads

f59edec

Lock stdin/stdout up front

fd70305

add constant

4d92a9d

mark more things as pub(crate)

bdfc2bf

Adjust benchmark ratios

90f2c0a

silverjam requested a review from jbangelo May 19, 2020 05:37

jbangelo reviewed May 19, 2020

View reviewed changes

Jason Mobarak added 2 commits May 19, 2020 16:30

Implement SbpString so that we size messages correctly

4c80ec9

Use SbpString instead of String

873a100

silverjam requested a review from jbangelo May 20, 2020 00:01

Jason Mobarak added 3 commits May 19, 2020 17:14

remove slightly ugly match statement

61f1372

Use new constant, type inference

52bd77c

Reduce verbosity of SbpString impls

2b86d89

silverjam requested a review from dgburr May 20, 2020 21:43

jbangelo approved these changes May 21, 2020

View reviewed changes

Jason Mobarak added 3 commits May 21, 2020 16:04

review comments: remove unnecessary clone, add back sbp_size, clean-u…

d0be851

…p semantics of constant

try stripping released binaries

b2ea213

try slightly different forumlation

184bec7

silverjam merged commit 5a096ea into master May 22, 2020

silverjam deleted the silverjam/win32-stdin branch May 22, 2020 00:26

Handle invalid UTF-8 sequences [INFRA-196] #805

Handle invalid UTF-8 sequences [INFRA-196] #805

Uh oh!

Conversation

silverjam commented May 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jbangelo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

silverjam May 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

silverjam commented May 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jbangelo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jbangelo commented May 21, 2020

Uh oh!

silverjam commented May 21, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

silverjam commented May 19, 2020 •

edited

Loading

silverjam May 19, 2020 •

edited

Loading

silverjam commented May 19, 2020 •

edited

Loading