JSON logger uses \x escape unnecessarily on valid unicode code points #741

Open
aswan opened this issue Jan 17, 2020 · 2 comments

aswan commented Jan 17, 2020

Consider this script:

module LowEscape;

export {
    redef enum Log::ID += { LOG };

    type Info: record {
        my_str: string &log;
    };
}

event bro_init()
    {
    Log::create_stream(LowEscape::LOG, [$columns=LowEscape::Info, $path="hello"]);
    }

event file_new(f: fa_file)
    {
    Log::write(LowEscape::LOG, [$my_str="low\x01escape"]);
    }

The string to be logged would properly be represented in JSON as "low\u0001escape", but the Zeek JSON logger renders it as:

{"my_str":"low\\x01escape"}
sethhall (Member) commented Jan 22, 2020

I'm not sure that \u0001 is the correct representation. According to all of the JSON specs I've read, you can only encode valid UTF-8 values in that encoding (which confuses me about why that escaping mechanism even exists). There is even a question about whether that encodes a single byte or two bytes, because you must encode two bytes in that output.

We chose the current encoding scheme to avoid any potential problems with invalid byte sequences.
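
To illustrate the concern with a small Python sketch (not how Zeek is implemented): a Zeek string is a byte sequence that may not be valid UTF-8, and such bytes have no \uXXXX spelling at all, whereas an \x-style escape can name any byte:

# An invalid UTF-8 sequence: 0xC3 expects a continuation byte, 0x28 is not one.
raw = b"\xc3\x28"
try:
    raw.decode("utf-8")      # there is no code point here to escape as \uXXXX
except UnicodeDecodeError as err:
    print(err)

# An \x-style escape simply names each byte, valid UTF-8 or not.
print("".join("\\x%02x" % b for b in raw))   # \xc3\x28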

aswan (Author) commented Jan 22, 2020

Summarizing an offline discussion with @sethhall: he indicated to me that there have been instances in the past of consumers of Zeek JSON logs that don't correctly interpret \u sequences in JSON strings, so the solution has been to use the Zeek-specific \x escapes in this case to avoid that problem.

However, there's still some confusion here. Given the issue above with non-compliant JSON parsers, it's probably irrelevant at this point, but nevertheless:

I'm not sure that \u0001 is the correct representation. According to all of the JSON specs I've read, you can only encode valid UTF-8 values in that encoding (which confuses me about why that escaping mechanism even exists).

A \u escape represents a Unicode code point, regardless of how that code point is ultimately encoded (whether in UTF-8 or some other encoding). As for why the mechanism exists, I presume it is there as an alternative way to represent characters that are otherwise illegal in string literals, as well as for convenience when the entity writing JSON either can't directly generate the encoded code point or finds it clearer to write it out in escaped form.
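
A quick illustration with Python's json module (standing in here for any compliant parser): a raw control character is not legal inside a JSON string literal, but its \u spelling is:

import json

try:
    json.loads('"\x01"')              # a literal U+0001 inside the string
except json.JSONDecodeError as err:
    print(err)                        # Invalid control character ...

print(repr(json.loads('"\\u0001"')))  # '\x01' -- the escaped spelling is accepted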

There is even a question about whether that encodes a single byte or two bytes, because you must encode two bytes in that output.

I don't understand this comment. \u0001 unambiguously represents the code point U+0001. In UTF-8, that code point is encoded in one byte.
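
A small check with Python's json module makes the byte-count point concrete (again just a sketch with a generic JSON parser): the escape names a code point, and UTF-8 determines how many bytes that code point occupies:

import json

s = json.loads('"\\u0001"')
print(len(s))                   # 1 -- a single code point, U+0001
print(s.encode("utf-8"))        # b'\x01' -- one byte in UTF-8

# A code point above U+007F still takes one \u escape but more UTF-8 bytes.
print(json.loads('"\\u00e9"').encode("utf-8"))   # b'\xc3\xa9' -- two bytes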
