consider moving to a binary format #30

Closed
marten-seemann opened this issue Nov 19, 2019 · 10 comments

@marten-seemann
Collaborator

I just started working on implementing qlog in quic-go. Maybe it's because I'm still fairly unfamiliar with qlog, but I feel like encoding things in JSON leads to some awkward hacks. Examples of these are:

  • A lot of numbers are encoded as strings, e.g. stream offsets or packet numbers. I assume this is because JSON doesn't properly handle uint64s (or does it?); see the sketch after this list.
  • IP addresses are encoded as strings. If that means they're supposed to be encoded in the human-readable encoding (with . and :), that's ambiguous for IPv6 addresses. Really, IP addresses should be a byte array.
  • (Raw) packet data is supposed to be hex encoded, which greatly increases the log size.
  • Some fields are defined as enums, whereas other fields that just have a few options are encoded as strings. Examples are the stream_side ("sending" or "receiving") and stream_type ("unidirectional" or "bidirectional"), which are both string fields.
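To make the uint64 point concrete, a minimal TypeScript sketch (values made up): JSON itself places no limit on integer size, but JavaScript-based tooling parses numbers as IEEE 754 doubles, which are only exact up to 2^53 - 1, so larger values silently lose precision unless they travel as strings.

```ts
const pn = 2n ** 62n + 1n; // a packet number near the top of the uint64 range

// Logged as a bare JSON number: parsed back as a double and silently rounded.
const lossy = JSON.parse(`{"packet_number": ${pn}}`);
console.log(BigInt(lossy.packet_number) === pn); // false: precision was lost

// Logged as a string (the current qlog approach): survives the round trip exactly.
const exact = JSON.parse(`{"packet_number": "${pn}"}`);
console.log(BigInt(exact.packet_number) === pn); // true
```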

I'm not sure I like the trick of saving bytes on the events by first defining the event_fields and then using a list instead of an object to encode each event. To me, this feels like a hack to work around a shortcoming of JSON, namely the repetition of the field labels when using objects.
As far as I can see, a binary encoding scheme would be able to provide a type-safe representation here without repeating the field labels (and blowing up the file size), as long as it's possible to define some common_fields for a connection.

A protobuf-based logging format (just a suggestion; protobuf is what I'm most familiar with, and maybe there are better choices out there) would resolve the encoding ambiguities I listed above, because we'd be able to make use of a strong typing system, which would allow us to completely eliminate the use of strings (except for places where things actually are strings, e.g. CONNECTION_CLOSE reason phrases). Furthermore, it would greatly simplify implementing qlog: just fill in the corresponding fields in the protobuf messages, call Marshal(), and you're done. No need to manually define dozens of logging structs and make sure they're correctly serialized into qlog's flavor of JSON.
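To make the comparison concrete, a rough TypeScript sketch of the two shapes being discussed, per my reading of the current draft (field names and values are purely illustrative):

```ts
// Per-event objects: self-describing, but every event repeats the field labels.
const verbose = [
  { relative_time: 0, category: "transport", event: "packet_sent",     data: { packet_size: 1280 } },
  { relative_time: 5, category: "transport", event: "packet_received", data: { packet_size: 1248 } },
];

// The current event_fields trick: labels are declared once, events become
// positional arrays, and connection-wide values move into common_fields.
const compact = {
  common_fields: { ODCID: "abcd1234", reference_time: 1564776513572 },
  event_fields:  ["relative_time", "category", "event", "data"],
  events: [
    [0, "transport", "packet_sent",     { packet_size: 1280 }],
    [5, "transport", "packet_received", { packet_size: 1248 }],
  ],
};

// A schema-based binary encoding would keep the "labels appear once" property
// and add real types (uint64, byte arrays) without this positional indirection.
```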

@rmarx
Contributor

rmarx commented Jan 7, 2020

Talking about this with @nibanks: he would primarily like this for larger traces (he has logs of several hundred megabytes) and for integration with other tools (like https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer).

He suggests https://diamon.org/ctf/ as one possible format (though, at first glance, it doesn't seem to have a JavaScript parser anywhere).

@huitema

huitema commented Feb 26, 2020

There is related experience with DNS log formats. In particular, look at the CBOR encoding of DNS logs proposed in RFC 8618, https://datatracker.ietf.org/doc/rfc8618/. They started from PCAP, but there was a practical issue with managing huge PCAP files. The first attempt was to simply compress the binary, but they ended up with a more structured approach. The logical syntax follows the "natural" repetitions in the data, so that, for example, DNS names are encoded just once and then represented by indices into tables of names. They then encode the "syntactically organized" data in CBOR (binary JSON) and apply compression on top of that.

The main value of the logical syntax comes when processing logs. For example, I observed a 50x performance gain between doing DNS statistics directly on the PCAP and doing the same statistics on the logical CBOR data, due to both reduced IO from the shorter data and more compact code following logical references.

I suspect there is something similar hiding in the QUIC traces.

@rmarx
Contributor

rmarx commented Feb 26, 2020

@huitema that's some very interesting stuff that I wasn't aware of yet, thanks!

@rmarx
Contributor

rmarx commented Mar 17, 2020

Talking about it some more with @nibanks, he states:

I'd prefer something that is light-weight and doesn't depend on yet another thing (protobuf). Or something that exists and is light weight to implement a parser for from scratch

@LPardue did some initial tests with CBOR and found that the file size gains don't really outweigh those of compressed JSON.

I am currently experimenting with a few binary scheme options to get a first feel for potential file size and (de)serialization gains. That should give us some additional data to work from.

@LPardue
Member

LPardue commented Mar 17, 2020

To be clear, I am no CBOR expert. All I did for my serializing code was substitute serde_cbor for serde_json and compare the resulting output. CBOR shaved about 10% off the identity encoding, while gzipped JSON shaved off about 40%.

AFAIK it is possible to profile CBOR to be more efficient (e.g. https://tools.ietf.org/html/draft-raza-ace-cbor-certificates-04), but that is beyond my skillset.
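For anyone wanting to repeat this kind of comparison outside Rust, a Node/TypeScript sketch of the same experiment (assuming the third-party cbor npm package; the document shape below is made up):

```ts
import { gzipSync } from "zlib";
import { encode as cborEncode } from "cbor"; // third-party "cbor" npm package (assumed)

// A stand-in qlog-like document; a real trace would be far larger.
const qlogDoc = {
  qlog_version: "draft-01",
  traces: [{
    events: Array.from({ length: 1000 }, (_, i) =>
      [i, "transport", "packet_sent", { packet_number: String(i), packet_size: 1280 }]),
  }],
};

const asJson = Buffer.from(JSON.stringify(qlogDoc));
const asCbor = cborEncode(qlogDoc); // same data model, binary framing
const gzJson = gzipSync(asJson);    // general-purpose compression of the JSON text

console.log({ json: asJson.length, cbor: asCbor.length, gzipped_json: gzJson.length });
```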

@huitema

huitema commented Mar 18, 2020

I am quite familiar with the work on using CBOR to record DNS traces in RFC 8618. The captures were originally in PCAP, but PCAP produces very large files. They looked at a set of variations:

| Format      | File size (MB) | Comp. | Comp. size (MB) | RSS (kB) | User time (s) |
|-------------|---------------:|-------|----------------:|---------:|--------------:|
| PCAP        | 661.87         | snzip | 212.48          | 2696     | 1.26          |
|             |                | lz4   | 181.58          | 6336     | 1.35          |
|             |                | gzip  | 153.46          | 1428     | 18.20         |
|             |                | zstd  | 87.07           | 3544     | 4.27          |
|             |                | xz    | 49.09           | 97416    | 160.79        |
| JSON simple | 4113.92        | snzip | 603.78          | 2656     | 5.72          |
|             |                | lz4   | 386.42          | 5636     | 5.25          |
|             |                | gzip  | 271.11          | 1492     | 73.00         |
|             |                | zstd  | 133.43          | 3284     | 8.68          |
|             |                | xz    | 51.98           | 97412    | 600.74        |
| Avro simple | 640.45         | snzip | 148.98          | 2656     | 0.90          |
|             |                | lz4   | 111.92          | 5828     | 0.99          |
|             |                | gzip  | 103.07          | 1540     | 11.52         |
|             |                | zstd  | 49.08           | 3524     | 2.50          |
|             |                | xz    | 22.87           | 97308    | 90.34         |
| CBOR simple | 764.82         | snzip | 164.57          | 2664     | 1.11          |
|             |                | lz4   | 120.98          | 5892     | 1.13          |
|             |                | gzip  | 110.61          | 1428     | 12.88         |
|             |                | zstd  | 54.14           | 3224     | 2.77          |
|             |                | xz    | 23.43           | 97276    | 111.48        |
| PBuf simple | 749.51         | snzip | 167.16          | 2660     | 1.08          |
|             |                | lz4   | 123.09          | 5824     | 1.14          |
|             |                | gzip  | 112.05          | 1424     | 12.75         |
|             |                | zstd  | 53.39           | 3388     | 2.76          |
|             |                | xz    | 23.99           | 97348    | 106.47        |
| JSON block  | 519.77         | snzip | 106.12          | 2812     | 0.93          |
|             |                | lz4   | 104.34          | 6080     | 0.97          |
|             |                | gzip  | 57.97           | 1604     | 12.70         |
|             |                | zstd  | 61.51           | 3396     | 3.45          |
|             |                | xz    | 27.67           | 97524    | 169.10        |
| Avro block  | 60.45          | snzip | 48.38           | 2688     | 0.20          |
|             |                | lz4   | 48.78           | 8540     | 0.22          |
|             |                | gzip  | 39.62           | 1576     | 2.92          |
|             |                | zstd  | 29.63           | 3612     | 1.25          |
|             |                | xz    | 18.28           | 97564    | 25.81         |
| CBOR block  | 75.25          | snzip | 53.27           | 2684     | 0.24          |
|             |                | lz4   | 51.88           | 8008     | 0.28          |
|             |                | gzip  | 41.17           | 1548     | 4.36          |
|             |                | zstd  | 30.61           | 3476     | 1.48          |
|             |                | xz    | 18.15           | 97556    | 38.78         |
| PBuf block  | 67.98          | snzip | 51.10           | 2636     | 0.24          |
|             |                | lz4   | 52.39           | 8304     | 0.24          |
|             |                | gzip  | 40.19           | 1520     | 3.63          |
|             |                | zstd  | 31.61           | 3576     | 1.40          |
|             |                | xz    | 17.94           | 97440    | 33.99         |

You can see that there are some differences between the various algorithms. JSON clearly produces bigger files than the binary alternatives, even after compression. But the biggest differences come from switching from what they call "simple" to what they call "block".

The simple alternative is pretty similar to the current qlog. Each DNS transaction is represented by a corresponding record in JSON, CBOR, Avro or protobuf. In contrast, the "block" format starts by building tables of objects seen in multiple records: a table of DNS names, a table of record values, etc. The individual PCAP records are then represented by "block records" which, instead of listing DNS names, simply list the index of the name in the table of names. You can think of this as a "logical compression", and it reduces the size of the recording by a factor of 10. After that, they can still apply compression.
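As a rough sketch of that "block" idea applied to a generic event stream (this illustrates the technique, not the actual C-DNS wire format): repeated values are collected into a per-block table once, and the rows then carry small indices instead of the values themselves.

```ts
// Sketch of "logical compression" in the C-DNS style, applied to generic events.
type Event = { category: string; name: string; trigger: string };

function toBlock(events: Event[]) {
  const table: string[] = [];
  const index = new Map<string, number>();
  const intern = (s: string): number => {
    let i = index.get(s);
    if (i === undefined) { i = table.length; table.push(s); index.set(s, i); }
    return i;
  };
  // Each row is now three small integers pointing into the shared table.
  const rows = events.map(e => [intern(e.category), intern(e.name), intern(e.trigger)]);
  return { table, rows };
}

// The block still round-trips losslessly back to the original events.
function fromBlock(block: { table: string[]; rows: number[][] }): Event[] {
  return block.rows.map(([c, n, t]) =>
    ({ category: block.table[c], name: block.table[n], trigger: block.table[t] }));
}
```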

The real beauty of the block format comes when processing the data in back end programs. Compare:

uncompress < pcap.xz | process-pcap

To:

uncompress < cbor.xz | process-cbor

In the CBOR alternative, about 10 times less data is piped into the analysis program than in the PCAP alternative. That's a much lower IO load. On top of that, since the CBOR data is structured in blocks, parsing and processing are much easier, resulting in a much lower CPU load. In a project I was involved with, replacing process-pcap with process-cbor made us run 40 times faster!

Also note that there are no practical differences between the various binary alternatives. Yes, ±10% here or there, but compared to a factor of 40 that's really in the noise.

@rmarx
Contributor

rmarx commented Mar 18, 2020

Thanks a lot for that, @huitema. Doing something similar to the "block format" would be trivial for qlog as well. However, it doesn't match my mental model of how general-purpose compression works... don't those algorithms also build that type of lookup table on the fly? I will test with manual block formats as well and see what that gives.

Another interesting ref from @martinthomson https://tools.ietf.org/html/draft-mattsson-tls-cbor-cert-compress-00

@rmarx
Contributor

rmarx commented Apr 22, 2020

So I've been doing some tests of my own to figure out the best approach to this for qlog.

I've created converter scripts (see https://github.com/quiclog/pcap2qlog/tree/binary/src/converters): one that uses a lookup table/dictionary instead of repeating values, one that CBOR-encodes the files, and a (rudimentary) protobuf schema. The dictionary is currently fully dynamic and stored inside the resulting file, but this can obviously be improved by having a static shared dictionary with a dynamic part for just the field values (much like QPACK and Chrome's NetLog).

I've then also looked at various compression schemes (xz, gzip, brotli, zstd, lz4; see https://github.com/quiclog/pcap2qlog/blob/binary/src/scripts/comparisons/compare.sh), focusing mainly on the schemes most often seen on the web for on-the-fly compression (gzip 6 and brotli 4).
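For reference, those on-the-fly settings map onto Node's zlib roughly as follows (a sketch only; the actual scripts are in the compare.sh linked above, and the input path here is just an example):

```ts
import { gzipSync, brotliCompressSync, constants } from "zlib";
import { readFileSync } from "fs";

// "gzip 6" and "brotli 4" as typically used for on-the-fly web compression.
const input = readFileSync("trace.qlog"); // example path

const gz = gzipSync(input, { level: 6 });
const br = brotliCompressSync(input, {
  params: { [constants.BROTLI_PARAM_QUALITY]: 4 },
});

console.log({ original: input.length, gzip6: gz.length, brotli4: br.length });
```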

Full results can be found at https://gist.github.com/rmarx/49bb14f83157d9fe59fb40e7c05b1f3f, with a somewhat nicer representation in the following image (sizes for traces in which a 500MB or 100MB file was downloaded from the lsquic public endpoint). The blue value is the reference point for the percentages; green is the "best in class" for that row:

[Image: results.xlsx spreadsheet comparing file sizes per format and compression scheme]

Main takeaways for me:

  1. protobuf is the smallest, but not by a huge margin compared to dictionary+cbor, especially not when compression is used.
  2. compression alone saves massively, even on the original JSON file or direct CBOR version of that
  3. protobuf without compression is still quite large (23% of original), so I'd reckon you'd always use compression for storage/transmission anyway?

In addition to these tests, we also ran a survey among QUIC experts (implementers and researchers), to which we got replies from 28 participants (thanks everyone!). Part of the survey asked how important they felt features like "fast (de)serialization, small file size, flexibility (e.g., easily add new event types), grep-ability" were. The full results will be posted soon (they are part of a publication we're preparing), but the gist of it is:

[Image: survey results chart from the QUIC and HTTP/3 Debugging Survey, March 2020]

My interpretation:

  1. Flexibility is the major selling point across the board. I personally believe moving to something like protobuf would rob us of that (it is much more difficult to add new event types, as you have to update the schema)
  2. Most don't care too much about (de)serialization performance
  3. Small file size was regarded as important, but again not as much as flexibility.
  4. grep-ability was also considered an important feature by many
  5. easy integration is also a major point and this would be easier with something like protobuf

Finally, we also talked to Facebook (cc @mjoras), who have been deploying qlog at scale, logging over 30 billion qlog events per day. Compared to their earlier binary format, qlog is about 2-3x larger and takes 50% longer to serialize. Still, this is quite manageable on the server side, where they log full-string JSON events to a centralized service. On the client, they do find the file size prohibitive for uploading granular, full qlogs (containing all the events they'd like). Even so, Matt was adamant that they'd rather keep the flexibility of the JSON format than move to a more inflexible binary one. They were considering using compression and writing a custom JSON (de)serializer, optimized for qlog, to help deal with some of the overhead.


So, presented with those results, my standpoint today is still to keep using JSON as the basis for qlog. I would propose adding the "dictionary" setup to the spec as an optional optimized mode, and also recommend that tools support it (I'm not sure about a default static dictionary at this point though). Furthermore, I'd recommend using CBOR if file size is important.

Companies that do need more optimization can write their own protobuf (or equivalent) schema (which I've shown is possible) and then write a post-processor that converts back to proper JSON qlog for shared tooling.

Still, feedback on all this is more than welcome of course! @marten-seemann, @martinthomson, @huitema, @LPardue, @nibanks, @mjoras

@marten-seemann
Collaborator Author

If we use CBOR, does that mean we can get rid of the event_fields? Having implemented both an encoder and a parsing tool, this complicated things quite a bit for me (compared to just encoding an event as a normal JSON object).

rmarx pushed a commit that referenced this issue Jul 11, 2020
- Updated examples

- Started to decouple JSON from the main schema and describe mapping

- Removed option to specify time_units (fixed #95)

- Overall (partially) relates to: #10, #36, #39, #89, #102, #30
rmarx pushed a commit that referenced this issue Aug 19, 2020
8594230.

- Partially relates to #10, #36, #39, #89, #102, #30

- TODO: not at all sure about the QPACK data types, need outside input

- Last commit before switching to separate draft02 branch
rmarx pushed a commit that referenced this issue Aug 31, 2020
- Removes the event_fields optimization. Relates to #30. Fixes #89.

- Removes several points of complexity wrt the group_id field, as they were not being used in practice.

- Makes JSON the default serialization option. Fixes #101.

- Adds a proper streaming option with NDJSON. Relates to #106. Fixes #109, #2.

- Generally tightens the text and adds more examples.
rmarx pushed a commit that referenced this issue Sep 5, 2020
- Closes #30.

- Also added some placeholders for privacy/security section
@rmarx
Contributor

rmarx commented Sep 5, 2020

With the latest commit linked above (eb59e69), I feel this issue has been resolved.

qlog has not moved to a binary format by default, but it is now much easier to serialize it as one, or to define a binary schema for it. Some of the reasoning behind that has also been included in the qlog document.
