JSON logger uses \x escape unnecessarily on valid unicode code points #741
Consider this script:
The string to be logged is properly represented in JSON as
I'm not sure that
We chose the current encoding scheme to avoid any potential problems with invalid byte sequences.
Summarizing an offline discussion with @sethhall: he indicated that there have been instances in the past of consumers of Zeek JSON logs that don't correctly interpret \u sequences in JSON strings, so the solution has been to use the Zeek-specific \x escapes in this case to avoid that problem.
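For reference, here is a rough sketch of how a strict JSON parser treats the two escape styles. It uses Python's standard `json` module as a stand-in consumer, which is an assumption on my part: the consumers that mishandled \u are not identified in this thread. The two sample strings are made up for illustration and are not actual Zeek output.

```python
import json

# Hypothetical log fragments (illustrative only, not real Zeek output).
u_escaped = r'{"s": "\u00e9"}'      # standard JSON escape for U+00E9
x_escaped = r'{"s": "\xc3\xa9"}'    # \x-style escapes of the UTF-8 bytes


def parses(text):
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


u_ok = parses(u_escaped)              # \u is part of the JSON grammar
x_ok = parses(x_escaped)              # \x is not, so a strict parser rejects it
decoded = json.loads(u_escaped)["s"]  # the single character U+00E9
```

So with a compliant parser the trade-off runs the other way: \u decodes to the intended character, while \x is rejected as an invalid escape.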
However, there's still some confusion here. Given the issue above with non-compliant JSON parsers, it's probably irrelevant at this point, but nevertheless:
A \u escape represents a Unicode code point, regardless of how that code point is ultimately encoded (whether by UTF-8 or some other encoding). As for why the mechanism exists, I presume it is an alternative way to represent characters that are otherwise illegal in string literals, as well as a convenience when the entity writing JSON either can't directly generate the encoded code point or finds it clearer to write it out in escaped form.
I don't understand this comment. \u0001 unambiguously represents the code point U+0001. In UTF-8, that code point is encoded in one byte.
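A small sketch of the distinction between a code point and its encoding, again using Python's standard `json` module purely as an illustration (the variable names are mine):

```python
import json

# \u0001 names the code point U+0001, independent of any byte encoding.
ctrl = json.loads(r'"\u0001"')
ctrl_code_point = ord(ctrl)           # 1
ctrl_utf8 = ctrl.encode("utf-8")      # one byte: 0x01

# The same holds for a non-ASCII code point: one escape, one code point,
# but a different number of bytes depending on the encoding chosen later.
e_acute = json.loads(r'"\u00e9"')
e_utf8 = e_acute.encode("utf-8")      # two bytes: 0xC3 0xA9
e_latin1 = e_acute.encode("latin-1")  # one byte: 0xE9
```

The escape identifies the character. How many bytes it occupies is decided only when the decoded string is written out in a particular encoding.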