Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

print() improperly formats UTF-8 output #596

Open
grigorescu opened this issue Sep 25, 2019 · 13 comments

Comments

@grigorescu
Copy link
Member

commented Sep 25, 2019

Been tracking down a JSON formatting problem in to_json. The simplest test case I can find is a string from one of the values in SSL::ct_logs, "\x07\xd4\xb7o"

In Bro 2.6.3, the result of print to_json(l) is "\\x07\\xd4\\xb7o". Each backslash preceding the hex-escape is escaped.

http://try.bro.org/#/trybro/saved/355506

In Zeek master, the result of print to_json(l) is "\\x07\xd4\xb7o". The backslashes stop being double-escaped.

http://try.bro.org/#/trybro/saved/355504

This is invalid JSON (not 100% sure why, but checked it with jsonlint.com) and causing some parsers to choke.

@grigorescu grigorescu changed the title to_json formatting regression to_json formatting bug introduced in 3.0 Sep 25, 2019
@jsiwek

This comment has been minimized.

Copy link
Member

commented Sep 25, 2019

In Zeek 3.0, we wanted JSON formatting to not escape valid UTF-8 encodings by default. And that's basically what happens here: to_json() is returning a valid JSON string, but print is adding back in escapes.

Can see an example of what I mean with this script:

local j = to_json("\x07\xd4\xb7o");

local f = open("myfile");
enable_raw_output(f);
local o = open("/dev/stdout");
enable_raw_output(o);

print j;
print o, j;
print f, j;

The raw string data returned by to_json is actually "\\x07Էo", and that's ok since JSON spec says:

A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).

So I'd suggest we change print to pass through valid UTF-8 encodings instead of escape them.

@grigorescu do you have a more specific example/context in which you're using to_json other than just print in case there's actually more going on or other places to change ?

@jsiwek jsiwek added this to Unassigned / Todo in Release 3.0.1 via automation Sep 25, 2019
@jsiwek jsiwek added this to the 3.0.1 milestone Sep 25, 2019
@sethhall

This comment has been minimized.

Copy link
Member

commented Sep 25, 2019

Modifying print to support utf-8 is totally the right answer! Most terminals these days can handle UTF-8 and as we move toward supporting it, it's a little more surprising to have print not handle it correctly than anything.

@grigorescu

This comment has been minimized.

Copy link
Member Author

commented Sep 25, 2019

@grigorescu do you have a more specific example/context in which you're using to_json other than just print in case there's actually more going on or other places to change ?

I'm using to_json in combination with ActiveHTTP; the print was just for debugging. I believe that ActiveHTTP works as intended.

I think UTF-8 makes sense.

@timwoj timwoj self-assigned this Sep 25, 2019
@timwoj

This comment has been minimized.

Copy link
Contributor

commented Sep 25, 2019

So just so I'm clear, this is an issue with print not outputting utf-8 correctly? to_json is fine and does the right thing?

@jsiwek

This comment has been minimized.

Copy link
Member

commented Sep 26, 2019

@timwoj yes, looked like to_json() returned proper sequence of bytes, but print adds escapes where really it doesn't have to

@timwoj timwoj changed the title to_json formatting bug introduced in 3.0 print() improperly formats UTF-8 output Sep 26, 2019
@timwoj

This comment has been minimized.

Copy link
Contributor

commented Sep 26, 2019

I changed the title of this GHI to match the actual issue.

@jsiwek jsiwek moved this from Unassigned / Todo to Assigned / In Progress in Release 3.0.1 Sep 27, 2019
@jsiwek jsiwek referenced this issue Sep 27, 2019
13 of 14 tasks complete
@timwoj

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

Should control characters continue to be escaped? I'm thinking in the case of a few of our tests such as https://github.com/zeek/zeek/blob/master/testing/btest/Baseline/bifs.fmt/out (has newlines and tabs being escaped), https://github.com/zeek/zeek/blob/master/testing/btest/Baseline/bifs.hexdump/out (has a newline being escaped), and a number of others with similar issues.

@timwoj

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

The "number of others" above are mostly places where we print out something like a signature. Printing UTF-8 characters in those cases seems like a step backwards. We're already printing out simple ASCII characters in those cases, but printing UTF-8 actually makes it less readable.

timwoj added a commit that referenced this issue Sep 27, 2019
@jsiwek

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

Should control characters continue to be escaped?

I initially want to say "no" to be more similar to other languages: just output the bytes and I'll explicitly escape things when I want. But generally hard to say what all the expectations/usages of print there are out there and what we'll break by changing it.

In any case, I'm starting to think any change to the default print behavior (even just passing through UTF-8) may best target 3.1 rather than 3.0.1 on grounds it's potentially a breaking change (e.g. external plugin btests failing between 3.0.0 and 3.0.1 because of a change in print behavior violates stability expectations).

Alternative idea to changing default print behavior could be adding in new print modes, configurable either by global option or explicit syntax like print/<mode>.

Interested to hear other opinions/approaches.

@timwoj

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

I agree with this not being a 3.0.1 issue. It's not breaking anything for print to be over-escaping the UTF-8 characters, and there's clearly some other factors in question.

@timwoj

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

I think the primary problem is that the change modifies BroString::Render which is called from lots of other places than just print (via ODesc::Add). How about adding another flag to this block https://github.com/zeek/zeek/blob/master/src/BroString.h#L75-L84 that controls whether UTF-8 gets escaped, using that as the default, and then overriding it for print? ODesc already has a flag to enable utf8 support which is set to false by default, but it doesn't affect the call to BroString::Render.

@jsiwek jsiwek added this to Unassigned / Todo in Release 3.1.0 via automation Sep 27, 2019
@jsiwek jsiwek removed this from Assigned / In Progress in Release 3.0.1 Sep 27, 2019
@jsiwek jsiwek modified the milestones: 3.0.1, 3.1.0 Sep 27, 2019
@jsiwek jsiwek moved this from Unassigned / Todo to Assigned / In Progress in Release 3.1.0 Sep 27, 2019
@jsiwek

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

How about adding another flag to this block https://github.com/zeek/zeek/blob/master/src/BroString.h#L75-L84 that controls whether UTF-8 gets escaped, using that as the default, and then overriding it for print?

Yeah, either that or maybe just having ODesc do its own formatting logic with underlying data instead of even calling BroString::Render. No strong opinion/intuition though.

@sethhall

This comment has been minimized.

Copy link
Member

commented Oct 1, 2019

I think the argument to target 3.1 with this change makes sense.

jsiwek added a commit that referenced this issue Oct 2, 2019
This is a convenience function to make it easier to print literal byte
sequences to stdout without additional escaping like what may be added
by the default `print` statement behavior.

For example, related to GH-596, `print` currently escapes even valid
UTF-8 byte sequences and makes it difficult to output valid JSON strings
containing such.
jsiwek added a commit that referenced this issue Oct 14, 2019
This is a convenience function to make it easier to print literal byte
sequences to stdout without additional escaping like what may be added
by the default `print` statement behavior.

For example, related to GH-596, `print` currently escapes even valid
UTF-8 byte sequences and makes it difficult to output valid JSON strings
containing such.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Release 3.1.0
  
Assigned / In Progress
4 participants
You can’t perform that action at this time.