Handling of strings with nul bytes in them #5
Comments
It works if we use length-prefixed strings instead of NUL-terminated ones.
Currently the encoder doesn't take this into account, but the format itself can represent such strings using fixed-length strings. Will implement soon.
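As a rough illustration of why a length prefix sidesteps the problem, here is a minimal sketch of a fixed-length string encoder and decoder. The tag value `0x82`, the LEB128 length, and the helper names `leb128`, `encode_fixed_str` and `decode_fixed_str` are assumptions made for this example; check the Muon spec and reference encoder for the actual layout.

```python
FIXED_STR_TAG = 0x82  # assumed tag value for illustration; verify against the Muon spec


def leb128(n: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_fixed_str(s: str) -> bytes:
    """Tag + byte length + raw UTF-8 bytes; NUL bytes need no special handling."""
    data = s.encode("utf-8")
    return bytes([FIXED_STR_TAG]) + leb128(len(data)) + data


def decode_fixed_str(buf: bytes) -> str:
    """Inverse of encode_fixed_str for a buffer that starts with one string."""
    if buf[0] != FIXED_STR_TAG:
        raise ValueError("not a fixed-length string")
    length, shift, pos = 0, 0, 1
    while True:
        byte = buf[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return buf[pos:pos + length].decode("utf-8")
```

Because the decoder reads exactly `length` bytes, an embedded NUL is just another byte of payload.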
Fixed. This is also related to #3.
In this simple example, the resulting Muon file is 48 bytes, compared to 74 bytes of minified JSON.
This is interesting to consider for someone wanting to write a high-performance (but safe, validating) encoder. Using NUL as the terminator creates an edge case, because there are three cases to handle: the string is valid UTF-8, the string is valid UTF-8 but contains a NUL, or it is not valid UTF-8. Using, say, 0xFF as both the tag pad and the string terminator would leave only two cases: either the string is valid UTF-8 (the encoder can stream on the fly and only needs to look at the byte stream once), or it is not valid and the encoder errors out.

It does mean strings wouldn't be consistent with typed arrays. (Maybe that is another route to a high-performance encoder: allow chunked strings via the idea implied in #3, which has the benefit of keeping C-string compatibility, which I assume is the point of the non-length encoding?) One slightly crazy case here is Perl's extended UTF-8, which will actually use 0xFF on the wire, but I'm perfectly fine with that needing to be encoded as binary instead.

Edited to add: The more I think about this, the more I think it's fine as it is (both this and #3). I wasn't thinking about C-string compatibility, and keeping that seems valuable. The length-based encoding (without the NUL termination idea) only has an overhead for longer strings anyway. I also don't think chunked encoding for strings is a good idea, as it loses the nice property that strings currently have of being as-is on the wire.
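The three cases described above can be sketched as a small classifier. `StrKind` and `classify` are hypothetical names for this example, not part of any Muon implementation. The sketch also shows why a 0xFF terminator would collapse the cases to two: the byte 0xFF never occurs in valid UTF-8, so a strict decoder already rejects it.

```python
from enum import Enum


class StrKind(Enum):
    CLEAN = "valid UTF-8, no NUL"
    HAS_NUL = "valid UTF-8, contains NUL"
    INVALID = "not valid UTF-8"


def classify(data: bytes) -> StrKind:
    """Sort a candidate string payload into one of the three cases."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return StrKind.INVALID
    # With a NUL terminator, a second check is still needed on valid input.
    return StrKind.HAS_NUL if b"\x00" in data else StrKind.CLEAN
```

Note that this version scans the data twice (once to validate, once to look for NUL); a streaming encoder would have to fold both checks into a single pass or buffer the string.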
@dgl Zero termination is used specifically for the convenience of languages that use zero-terminated strings. Overall, Unicode strings containing zeroes are (or should be) very rare. When optimizing Muon for speed, large strings should always be encoded as fixed-size. For small strings, it's easy to check whether they contain a NUL byte.
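One way to read that policy is as a simple selection rule. This is only a sketch: the threshold `SMALL_STRING_LIMIT` and the function `pick_string_form` are hypothetical, and the cutoff value is arbitrary rather than anything Muon prescribes.

```python
SMALL_STRING_LIMIT = 64  # hypothetical cutoff in bytes, chosen only for this example


def pick_string_form(data: bytes) -> str:
    """Return "fixed" or "terminated" for an already-validated UTF-8 payload."""
    if len(data) >= SMALL_STRING_LIMIT:
        # Large strings: skip the scan entirely and always use the length prefix.
        return "fixed"
    # Small strings: scanning for NUL is cheap, so keep C-string compatibility
    # whenever the payload allows it.
    return "fixed" if b"\x00" in data else "terminated"
```

With this rule, the cost of scanning for NUL is bounded by the threshold, regardless of how long the input strings get.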
When I try to round-trip the following JSON, it doesn't work properly.
["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]
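To make the failure mode concrete, here is a small sketch using a deliberately naive NUL-terminated codec. `encode_naive` and `decode_naive` are hypothetical stand-ins, not Muon's actual encoder; they only illustrate how embedded NULs get mistaken for terminators.

```python
import json


def encode_naive(strings):
    """Write each string as UTF-8 followed by a NUL terminator."""
    return b"".join(s.encode("utf-8") + b"\x00" for s in strings)


def decode_naive(buf):
    """Split on NUL; the final empty piece after the last terminator is dropped."""
    parts = buf.split(b"\x00")
    return [p.decode("utf-8") for p in parts[:-1]]


original = json.loads(
    r'["stuff", "things", "zero\u0000things", "multi\u0000\u0000zero\u0000things"]'
)
decoded = decode_naive(encode_naive(original))

print(len(original), len(decoded))  # 4 strings in, 8 strings out
print(decoded == original)          # False: embedded NULs split the strings
```

The decoder has no way to tell a terminator from a NUL that was part of the data, so the four input strings come back as eight fragments.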