Handling of strings with nul bytes in them #5

timando · 2022-07-27T00:52:02Z

When I try to round-trip the following JSON, it doesn't work properly:
["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]
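To make the failure concrete, here is a minimal sketch of why a purely NUL-terminated encoding cannot round-trip this input. The functions `encode_nul_terminated` and `decode_nul_terminated` are hypothetical helpers written for illustration; they are not the Muon encoder and omit everything else in the format.

```python
import json

def encode_nul_terminated(strings):
    # Naive scheme (assumption): UTF-8 bytes of each string plus one 0x00 terminator.
    return b"".join(s.encode("utf-8") + b"\x00" for s in strings)

def decode_nul_terminated(data):
    # Split on 0x00 and drop the empty chunk that follows the last terminator.
    chunks = data.split(b"\x00")[:-1]
    return [c.decode("utf-8") for c in chunks]

# Raw string so that JSON, not Python, interprets the \u0000 escapes.
original = json.loads(
    r'["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]'
)
decoded = decode_nul_terminated(encode_nul_terminated(original))

print(decoded == original)  # False
print(decoded)  # the embedded NULs were read as terminators, giving 8 strings
```

The decoder has no way to tell an embedded NUL from a terminator, so the four input strings come back as eight.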


waterfountain1996 · 2022-07-27T07:23:09Z

It works if we use length-prefixed strings instead of null-terminated ones.
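A rough sketch of the length-prefixed idea, assuming an unsigned LEB128 varint for the length. The helper names are hypothetical and the layout is simplified: the real format's tag bytes are not discussed in this thread and are left out here.

```python
def encode_varint(n):
    # Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_length_prefixed(strings):
    out = bytearray()
    for s in strings:
        raw = s.encode("utf-8")
        out += encode_varint(len(raw)) + raw  # length first, then the bytes as-is
    return bytes(out)

def decode_length_prefixed(data):
    result, pos = [], 0
    while pos < len(data):
        length, shift = 0, 0
        while True:  # read the varint length
            byte = data[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        result.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return result

strings = ["stuff", "things", "zero\x00things", "multi\x00\x00zero\x00things"]
print(decode_length_prefixed(encode_length_prefixed(strings)) == strings)  # True
```

Because the decoder knows the byte count up front, it never inspects the payload for a terminator, so embedded NULs survive.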

vshymanskyy · 2022-07-27T07:59:14Z

Currently the encoder doesn't take this into account, but the format itself is capable of representing such strings as fixed-length strings. Will implement soon.

vshymanskyy · 2022-07-27T08:49:53Z

Fixed. This is also related to #3

vshymanskyy · 2022-07-27T11:25:25Z

In this simple example, the resulting Muon file is 48 bytes, compared to 74 bytes of minified JSON, so about 35% smaller.
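For reference, the percentage follows directly from the two sizes quoted above. This is only the arithmetic; the variable names are illustrative.

```python
muon_size = 48   # bytes, as reported in the comment above
json_size = 74   # bytes, as reported in the comment above

saving = 1 - muon_size / json_size
print(f"{saving:.1%}")  # 35.1%
```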

dgl · 2022-07-29T01:23:06Z

This is interesting to consider for someone wanting to write a high-performance (but safe, validating) encoder. Using NUL as the terminator adds an edge case, so there are three cases to deal with: the string is valid UTF-8, the string is valid UTF-8 but contains a NUL, or it's not valid UTF-8.
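The three cases can be sketched as a small classifier. The function name `classify` and its return labels are made up for illustration, and a streaming encoder would do this check incrementally rather than on a whole buffer.

```python
def classify(raw: bytes) -> str:
    # Case 3: not valid UTF-8 at all (Python's decoder is strict by default).
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    # Case 2: valid UTF-8, but an embedded NUL rules out NUL termination.
    if b"\x00" in raw:
        return "valid-with-nul"
    # Case 1: valid UTF-8 that can be written NUL-terminated as-is.
    return "valid"

print(classify(b"stuff"))            # valid
print(classify(b"zero\x00things"))   # valid-with-nul
print(classify(b"\xff"))             # invalid
```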

Using, say, 0xFF as both the tag pad and the string terminator would mean there are only two cases: either the string is valid UTF-8 (so the encoder can stream it on the fly and only needs to look at the byte stream once), or it is not valid and the encoder errors out. It does mean strings wouldn't be consistent with typed arrays. (Maybe that is another route to a high-performance encoder: allow chunked strings via the idea implied in #3, which has the benefit of keeping C-string compatibility, which I assume is the point of the non-length encoding?)

One slightly crazy case here is Perl's extended UTF-8, which will actually use 0xFF on the wire, but I'm perfectly fine with that needing to be encoded as binary instead.

Edited to add: The more I think about this, the more I think it's fine as it is (both this and #3). I wasn't thinking about C-string compatibility, and keeping that seems valuable. The length-based encoding (without the NUL termination idea) only has an overhead for longer strings anyway. I also don't think chunked encoding for strings is a good idea, as it loses the nice property that strings currently have of being as-is on the wire.
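A minimal sketch of that 0xFF alternative, for comparison only: it is the idea floated in this comment, not what Muon does. It relies on the fact that the byte 0xFF never occurs in well-formed UTF-8; the helper names are hypothetical.

```python
TERMINATOR = b"\xff"  # never appears in well-formed UTF-8

def encode_ff_terminated(strings):
    out = bytearray()
    for s in strings:
        # str.encode raises UnicodeEncodeError for lone surrogates,
        # which corresponds to the "not valid, error out" case.
        out += s.encode("utf-8") + TERMINATOR
    return bytes(out)

def decode_ff_terminated(data):
    # Split on 0xFF and drop the empty chunk after the last terminator.
    chunks = data.split(TERMINATOR)[:-1]
    return [c.decode("utf-8") for c in chunks]

strings = ["stuff", "things", "zero\x00things", "multi\x00\x00zero\x00things"]
print(decode_ff_terminated(encode_ff_terminated(strings)) == strings)  # True
```

Embedded NULs pass through untouched here, but the result is no longer a C string, which is the trade-off raised in this comment.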

vshymanskyy · 2022-07-30T08:53:31Z

@dgl Zero termination is used specifically for the convenience of languages that use zero-terminated strings. Overall, Unicode strings containing zeroes are (or should be) very rare. When optimizing Muon for speed, large strings should always be encoded as fixed-size. For small strings, it's easy to check whether they contain a null byte.
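The selection rule described above could look roughly like this. The threshold `SMALL_LIMIT` and the function `choose_encoding` are assumptions for illustration; the thread does not say what counts as a "large" string.

```python
SMALL_LIMIT = 64  # hypothetical cutoff in bytes; not specified in this thread

def choose_encoding(s: str) -> str:
    raw = s.encode("utf-8")
    # Large strings: always fixed-size, so they are never scanned for NUL.
    if len(raw) > SMALL_LIMIT:
        return "fixed-length"
    # Small strings: the scan is cheap, so check for an embedded NUL.
    if b"\x00" in raw:
        return "fixed-length"
    return "nul-terminated"

print(choose_encoding("stuff"))          # nul-terminated
print(choose_encoding("zero\x00things")) # fixed-length
print(choose_encoding("x" * 1000))       # fixed-length
```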

vshymanskyy · 2022-07-31T16:14:47Z

@dumblob see #17

vshymanskyy closed this as completed in 5e7554d Jul 27, 2022

vshymanskyy added the bug label Jul 27, 2022

JobLeonard mentioned this issue Jul 28, 2022

Are 8C tagged strings lenght encoded? #12

Closed

vshymanskyy mentioned this issue Jul 30, 2022

Could we chain muon on-the-wire data? #17

Closed
