Handling of strings with nul bytes in them #5

Closed
timando opened this issue Jul 27, 2022 · 8 comments

@timando

timando commented Jul 27, 2022

When I try to round-trip the following JSON, it doesn't work properly.
["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]

@waterfountain1996
Contributor

It works if we use length-prefixed strings instead of null-terminated ones.
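To illustrate why the round-trip breaks, here is a minimal sketch of a reader for zero-terminated strings. The `read_cstring` helper is hypothetical and is not taken from the Muon codebase; it only shows that a naive "bytes plus terminator" encoding of a string with an embedded NUL is cut short on decode.

```python
def read_cstring(buf: bytes, pos: int = 0) -> tuple[str, int]:
    """Read one zero-terminated UTF-8 string starting at pos.

    Returns the decoded string and the offset just past the terminator.
    Hypothetical helper for illustration only.
    """
    end = buf.index(b"\x00", pos)  # first NUL at or after pos
    return buf[pos:end].decode("utf-8"), end + 1


original = "zero\u0000things"
# Naive encoding: UTF-8 bytes followed by a single terminating NUL.
naive = original.encode("utf-8") + b"\x00"

decoded, next_pos = read_cstring(naive)
# The reader stops at the embedded NUL, so only the first part comes back
# and the remaining bytes would be misread as the next value.
leftover = naive[next_pos:]
```

Here `decoded` is only the part before the embedded NUL, which is why an explicit length is needed for such strings.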

@vshymanskyy
Owner

Currently the encoder doesn't take this into account, but the format itself is capable of representing it with fixed-length strings. Will implement soon.

@vshymanskyy
Owner

Fixed. This is also related to #3

@vshymanskyy
Owner

In this simple example, the resulting Muon file is 48 bytes, compared to 74 bytes of minified JSON, so it is about 35% smaller.

@dgl

dgl commented Jul 29, 2022

This is interesting to consider for someone wanting to write a high-performance (but safe, validating) encoder. Using NUL as the terminator creates an extra edge case, since there are three outcomes to handle: the string is valid UTF-8, the string is valid UTF-8 but contains a NUL, or it's not valid UTF-8.

Using, say, 0xFF as both the tag pad and the string terminator would mean there are only two cases: either the string is valid UTF-8 (the encoder can stream it on the fly and only needs to look at the byte stream once), or it is not valid and the encoder errors out. It does mean strings wouldn't be consistent with typed arrays. (Maybe that is another route to a high-performance encoder: allow chunked strings via the idea implied in #3, which has the benefit of keeping C-string compatibility, which I assume is the point of the non-length encoding?)

One slightly crazy case here is Perl's extended UTF-8, which will actually use 0xFF on the wire, but I'm perfectly fine with that needing to be encoded as binary instead.

Edited to add: The more I think about this, the more I think it's fine as it is (both this and #3). I wasn't thinking about C-string compatibility, and keeping that seems valuable. The length-based encoding (without the NUL-termination idea) only has an overhead for longer strings anyway. I also don't think chunked encoding for strings is a good idea, as it loses the nice property that strings currently have of being as-is on the wire.
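The three outcomes a validating encoder has to distinguish can be sketched roughly as follows. The `classify` function and its return labels are made up for illustration. This version also makes two passes over the data (one to validate, one to look for NUL), so it is not the single-pass streaming encoder discussed above.

```python
def classify(raw: bytes) -> str:
    """Sort a candidate string payload into one of three cases.

    Hypothetical helper: the labels are illustrative, not part of Muon.
    """
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "invalid"  # encoder should error out (or fall back to binary)
    if b"\x00" in raw:
        return "valid-with-nul"  # cannot be zero-terminated as-is
    return "valid"  # safe to write followed by a terminator


cases = [classify(b"stuff"), classify(b"zero\x00things"), classify(b"\xff")]

# The byte 0xFF never occurs in well-formed UTF-8, which is why using it as
# a terminator would remove the middle case.
sample = "any text \u00ff \U0001F600".encode("utf-8")
has_ff = b"\xff" in sample
```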

@dumblob

This comment was marked as off-topic.

@vshymanskyy
Owner

vshymanskyy commented Jul 30, 2022

@dgl Zero termination is used specifically for the convenience of languages that use zero-terminated strings. Overall, Unicode strings containing zeroes are (or should be) very rare. When optimizing Muon for speed, large strings should always be encoded as fixed-size. For small strings, it's easy to check whether they contain a null byte.
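As a rough sketch of the strategy described above: large strings always get a length prefix, and small strings are scanned for a null byte first. The tag value `0x82`, the LEB128 length encoding and the 512-byte threshold are assumptions made for this example and should be checked against the Muon specification. This is not the reference encoder.

```python
LEN_PREFIX_TAG = 0x82      # assumed tag for a length-prefixed string
SMALL_STRING_LIMIT = 512   # hypothetical cut-off between "small" and "large"


def leb128(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Pick zero-terminated or length-prefixed form for one string."""
    data = s.encode("utf-8")
    # Large strings skip the scan; small ones are cheap to check for NUL.
    if len(data) >= SMALL_STRING_LIMIT or b"\x00" in data:
        return bytes([LEN_PREFIX_TAG]) + leb128(len(data)) + data
    return data + b"\x00"


plain = encode_string("stuff")
with_nul = encode_string("zero\u0000things")
```

With these assumptions, `plain` stays C-string compatible, while `with_nul` carries an explicit length so that the embedded zero survives a round-trip.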

@vshymanskyy
Owner

@dumblob see #17
