consider moving to a binary format #30
Talking about this with @nibanks, he would primarily like this for larger traces (he has logs of several hundreds of megabytes) and for integration with other tools (like https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer). He suggests https://diamon.org/ctf/ as one possible format (though, at first glance, this doesn't seem to have a JavaScript parser anywhere).
There is related experience with DNS log formats. In particular, look at the CBOR encoding of DNS logs proposed in RFC 8618, https://datatracker.ietf.org/doc/rfc8618/. They started from PCAP, but there was a practical issue with managing huge PCAP files. The first attempt was to just compress the binary files, but they ended up with a more structured approach. The logical syntax follows the "natural" repetitions in the data, so that, for example, DNS names are encoded just once and are then represented by indices into a table of names. They then encode the "syntactically organized" data in CBOR (binary JSON) and apply compression on top of that. The main value of the logical syntax comes when processing logs. For example, I observed a 50x performance gain between doing DNS statistics directly on the PCAP and doing the same statistics on the logical CBOR data, due both to reduced I/O from the smaller data and to more compact code following logical references. I suspect there is something similar hiding in the QUIC traces.
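A minimal sketch of that "encode names once, reference by index" idea, using hypothetical record and field names (the real RFC 8618 C-DNS layout is more elaborate):

```python
import json  # a CBOR library (e.g. cbor2) could replace this for the final encoding step

# Raw records repeat the same DNS names over and over.
raw_records = [
    {"name": "example.com", "type": "A"},
    {"name": "example.org", "type": "AAAA"},
    {"name": "example.com", "type": "AAAA"},
]

# Build a table of unique names once...
name_table = []
name_index = {}
for rec in raw_records:
    if rec["name"] not in name_index:
        name_index[rec["name"]] = len(name_table)
        name_table.append(rec["name"])

# ...and let each record carry only an index into that table.
block = {
    "names": name_table,
    "records": [{"name": name_index[r["name"]], "type": r["type"]} for r in raw_records],
}

print(json.dumps(block))  # in C-DNS this block would then be CBOR-encoded and compressed
```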
@huitema that's some very interesting stuff that I wasn't aware of yet, thanks!
Talking about it some more with @nibanks, he states:
@LPardue did some initial tests with CBOR and found that the file-size gains don't really outweigh those of compressed JSON. I am currently experimenting with a few binary scheme options to get a first feel for the potential file size and (de)serialization gains. That should give us some additional data to work from.
To be clear, I am no CBOR expert. All I did for my serializing code was substitute serde_cbor for serde_json and compare the resulting output. CBOR shaved off about 10% compared to the identity encoding; gzipped JSON shaved off about 40%. AFAIK it is possible to profile CBOR to be more efficient (e.g. https://tools.ietf.org/html/draft-raza-ace-cbor-certificates-04), but that is beyond my skillset.
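Roughly the kind of comparison being described, sketched here with the third-party cbor2 package standing in as a CBOR encoder (an assumption; the actual tests used Rust's serde_cbor, and exact numbers depend on the trace):

```python
import gzip
import json

import cbor2  # third-party CBOR library, standing in for serde_cbor here

# Toy stand-in for a qlog trace; real traces are far larger and more repetitive.
trace = {"events": [{"time": i, "name": "packet_sent", "size": 1200} for i in range(1000)]}

as_json = json.dumps(trace).encode()
as_cbor = cbor2.dumps(trace)
as_gzipped_json = gzip.compress(as_json)

print(len(as_json), len(as_cbor), len(as_gzipped_json))
```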
I am quite familiar with the work on using CBOR to record DNS traces in RFC 8618. The captures were originally in PCAP, but PCAP produces very large files. They looked at a set of variations:
You can see that there are some differences between the various algorithms. JSON clearly produces bigger sizes than the binary alternatives, even after compression. But the biggest differences come from switching from what they call "simple" to what they call "block". The simple alternative is pretty similar to the current qlog: each DNS transaction is represented by a corresponding record in JSON, CBOR, Avro or protobuf. In contrast, the "block" format starts by building tables of objects seen across multiple records: a table of DNS names, a table of record values, etc. The individual PCAP records are then represented by "block records" which, instead of listing DNS names, simply list the index of the name in the table of names. You can think of that as a "logical compression", and it does reduce the size of the recording by a factor of 10x. After that, they can still apply compression. The real beauty of the block format comes when processing the data in back-end programs. Compare:
To:
In the cbor alternative, about 10 times less data is piped into the analysis program than in the pcap alternative. That's a much lower I/O load. On top of that, since the cbor data is structured in blocks, parsing and processing are much easier, resulting in a much lower CPU load. In a project that I was involved with, replacing process-pcap with process-cbor made us run 40 times faster! Also note that there are no practical differences between the various binary alternatives. Yes, ±10% here or there, but compared to a factor of 40 that's really in the noise.
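To get a feel for why processing the block format is so much cheaper, here is a small sketch continuing the hypothetical block layout from the earlier example: the statistics only touch small integer indices, and the name strings are resolved once at the end.

```python
from collections import Counter

# A block as produced by the earlier sketch: a name table plus index-based records.
block = {
    "names": ["example.com", "example.org"],
    "records": [
        {"name": 0, "type": "A"},
        {"name": 1, "type": "AAAA"},
        {"name": 0, "type": "AAAA"},
    ],
}

# The statistics run over small integer indices instead of repeated strings...
counts = Counter(rec["name"] for rec in block["records"])

# ...and the name strings are only looked up once, when reporting.
for index, count in counts.most_common():
    print(block["names"][index], count)
```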
Thanks a lot for that @huitema. Doing something similar to the "block format" would be trivial for qlog as well. However, it doesn't match my mental model of how general-purpose compression works... don't those algorithms also build that type of lookup table on the fly? I will test with manual block formats as well and see what that gives. Another interesting reference from @martinthomson: https://tools.ietf.org/html/draft-mattsson-tls-cbor-cert-compress-00
So I've been doing some tests of my own to figure out the best approach to this for qlog. I've created converter scripts (see https://github.com/quiclog/pcap2qlog/tree/binary/src/converters): one that uses a lookup table/dictionary instead of repeating values, one that CBOR-encodes the files, and a (rudimentary) protobuf schema. The dictionary is currently fully dynamic and stored inside the resulting file, but this can obviously be improved by having a static shared dictionary with a dynamic part for just the field values (much like QPACK and Chrome's NetLog). I've then also looked at various compression schemes (https://github.com/quiclog/pcap2qlog/blob/binary/src/scripts/comparisons/compare.sh) (xz, gzip, brotli, zstd, lz4), focusing mainly on the schemes most often seen on the web for on-the-fly compression (gzip 6 and brotli 4). Full results can be found at https://gist.github.com/rmarx/49bb14f83157d9fe59fb40e7c05b1f3f, with a somewhat nicer representation in the following image (sizes are for traces in which a 500MB or 100MB file was downloaded from the lsquic public endpoint). The blue value is the reference point for the percentages; green is the "best in class" for that row. Main takeaways for me:
Next to these tests, we also ran a survey among QUIC experts (implementers and researchers), and we got replies from 28 participants (thanks everyone!). Part of the survey asked how important they felt features like "fast (de)serialization, small file size, flexibility (e.g., easily adding new event types), grep-ability" were. The full results will be posted soon (they are part of a publication we're preparing), but the gist of it is:

My interpretation:
Finally, we also talked to Facebook (cc @mjoras), who have been deploying qlog at scale, logging over 30 billion qlog events per day. Compared to their earlier binary format, qlog is about 2-3x larger and takes 50% longer to serialize. Yet this is quite manageable on the server side, where they log full-string JSON events to a centralized service. On the client, they do find the file size prohibitive for uploading granular full qlogs (containing all the events they'd like). Still, Matt was also adamant that they'd rather keep the flexibility of the JSON format than move to a less flexible binary one. They were considering using compression and writing a custom JSON (de)serializer, optimized for qlog, to help deal with some of the overhead.

So, presented with those results, my standpoint today is still to keep using JSON as the basis for qlog. I would propose adding the "dictionary" setup to the spec, though, as an optional optimized mode, and also recommend that tools support it (I'm not sure about a default static dictionary at this point though). Furthermore, I'd recommend using CBOR if file size is important. Companies that need more optimization can write their own protobuf (or equivalent) schema (which I've shown is possible) and then write a post-processor to convert to proper JSON qlog for shared tooling. Still, feedback on all this is more than welcome of course! @marten-seemann, @martinthomson, @huitema, @LPardue, @nibanks, @mjoras
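For concreteness, a rough sketch of what such an optional "dictionary" mode could look like (hypothetical event layout and key names, not the actual pcap2qlog or spec format): repeated string values are replaced by indices into a dictionary stored alongside the events.

```python
import json

# Toy qlog-like events with heavily repeated string values (hypothetical layout).
events = [
    {"category": "transport", "event": "packet_sent", "packet_type": "1RTT"},
    {"category": "transport", "event": "packet_received", "packet_type": "1RTT"},
    {"category": "transport", "event": "packet_sent", "packet_type": "1RTT"},
]

dictionary = []
lookup = {}

def index_of(value):
    """Return the dictionary index for a value, adding it on first use."""
    if value not in lookup:
        lookup[value] = len(dictionary)
        dictionary.append(value)
    return lookup[value]

# Replace every repeated string value with its index into the dictionary.
encoded_events = [{key: index_of(value) for key, value in event.items()} for event in events]

# The dynamic dictionary is stored inside the resulting file, next to the events.
encoded = {"dictionary": dictionary, "events": encoded_events}

print(json.dumps(encoded))
```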
If we use cbor, does that mean that we can get rid of the `event_fields` optimization?
- Removes the `event_fields` optimization. Relates to #30. Fixes #89.
- Removes several points of complexity wrt the `group_id` field, as they were not being used in practice.
- Makes JSON the default serialization option. Fixes #101.
- Adds a proper streaming option with NDJSON. Relates to #106. Fixes #109, #2.
- Generally tightens the text and adds more examples.
- Closes #30.
- Also added some placeholders for the privacy/security section.
With the latest commit linked above (eb59e69), I feel this issue has been resolved. qlog has not moved to a binary format by default, but it is now much easier to serialize it as one / to define a binary schema for it. Some of the reasoning behind that has also been included in the qlog document.
I just started working on implementing qlog in quic-go. Maybe it's because I'm still fairly unfamiliar with qlog, but I feel like encoding things in JSON leads to some awkward hacks. Examples of these are `stream_side` ("sending" or "receiving") and `stream_type` ("unidirectional" or "bidirectional"), which are both string fields.

I'm not sure if I like the trick to save bytes on the `events` by first defining the `event_fields` and then using a list instead of an object to encode the `events`. To me, this feels more like a hack to work around the shortcomings of JSON, namely the repetition of the field labels when using objects. As far as I can see, a binary encoding scheme would be able to provide a type-safe representation here without repeating the field labels (and blowing up the file size), as long as it's possible to define some `common_fields` for a connection.

A protobuf-based logging format (this is just a suggestion; protobufs are the thing I'm most familiar with, maybe there are better choices out there) would resolve the encoding ambiguities I listed above, because we'd be able to make use of a strong typing system, which would allow us to completely eliminate the use of `string`s (except for places where things actually are strings, e.g. CONNECTION_CLOSE reason phrases). Furthermore, it would greatly simplify implementing qlog: just fill in the corresponding fields in the protobuf messages, call `Marshal()`, and you're done. No need to manually define dozens of logging structs and make sure they're correctly serialized into qlog's flavor of JSON.
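For readers unfamiliar with the `event_fields` trick being criticized here, a rough illustration (the exact field names are an assumption based on draft-era qlog, not quoted from the spec): the field labels are declared once, and each event becomes a positional array instead of an object.

```python
import json

# Hypothetical draft-era qlog layout: the field labels are declared once...
trace = {
    "event_fields": ["relative_time", "category", "event", "data"],
    # ...and each event is a positional array instead of an object,
    # so the labels are not repeated for every single event.
    "events": [
        [0, "transport", "packet_sent", {"packet_size": 1200}],
        [15, "transport", "packet_received", {"packet_size": 1200}],
    ],
}

print(json.dumps(trace))
```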