consider moving to a binary format #30
Talking about this with @nibanks, he would primarily like this for larger traces (he has logs of several hundreds of megabytes) and for integration with other tools (like https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer). He suggests https://diamon.org/ctf/ as one possible format (though, at first glance, this doesn't seem to have a JavaScript parser anywhere).
There is related experience with DNS log formats. In particular, look at the CBOR encoding of DNS logs proposed in RFC 8618, https://datatracker.ietf.org/doc/rfc8618/. They started from PCAP, but there was a practical issue with managing huge PCAP files. The first attempt was to just compress the binary files, but they ended up with a more structured approach. The logical syntax follows the "natural" repetitions in the data, so that, for example, DNS names are encoded just once and are then represented by indices into a table of names. They then encode the "syntactically organized" data in CBOR (binary JSON) and apply compression on top of that. The main value of the logical syntax comes when processing logs. For example, I observed a 50x performance gain between doing DNS statistics directly on the PCAP and doing the same statistics on the logical CBOR data, due both to reduced I/O from the smaller data and to more compact code following logical references. I suspect there is something similar hiding in the QUIC traces.
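A minimal sketch of that "encode names once, reference by index" idea, using hypothetical record and field names (the real RFC 8618 C-DNS layout is more elaborate):

```python
import json  # a CBOR library (e.g. cbor2) could replace this for the final encoding step

# Raw records repeat the same DNS names over and over.
raw_records = [
    {"name": "example.com", "type": "A"},
    {"name": "example.org", "type": "AAAA"},
    {"name": "example.com", "type": "AAAA"},
]

# Build a table of unique names once...
name_table = []
name_index = {}
for rec in raw_records:
    if rec["name"] not in name_index:
        name_index[rec["name"]] = len(name_table)
        name_table.append(rec["name"])

# ...and let each record carry only an index into that table.
block = {
    "names": name_table,
    "records": [{"name": name_index[r["name"]], "type": r["type"]} for r in raw_records],
}

print(json.dumps(block))  # in C-DNS this block would then be CBOR-encoded and compressed
```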
@huitema that's some very interesting stuff that I wasn't aware of yet, thanks!
Talking about it some more with @nibanks, he states:
@LPardue did some initial tests with CBOR and found that the file-size gains don't really outweigh those of compressed JSON. I am currently experimenting with a few binary scheme options to get a first feel for the potential file size and (de)serialization gains. That should give us some additional data to work from.
To be clear, I am no CBOR expert. All I did for my serializing code was substitute serde_cbor for serde_json and compare the resulting output. CBOR shaved off about 10% compared to the identity encoding; gzipped JSON shaved off about 40%. AFAIK it is possible to profile CBOR to be more efficient (e.g. https://tools.ietf.org/html/draft-raza-ace-cbor-certificates-04), but that is beyond my skillset.
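Roughly the kind of comparison being described, sketched here with the third-party cbor2 package standing in as a CBOR encoder (an assumption; the actual tests used Rust's serde_cbor, and exact numbers depend on the trace):

```python
import gzip
import json

import cbor2  # third-party CBOR library, standing in for serde_cbor here

# Toy stand-in for a qlog trace; real traces are far larger and more repetitive.
trace = {"events": [{"time": i, "name": "packet_sent", "size": 1200} for i in range(1000)]}

as_json = json.dumps(trace).encode()
as_cbor = cbor2.dumps(trace)
as_gzipped_json = gzip.compress(as_json)

print(len(as_json), len(as_cbor), len(as_gzipped_json))
```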
I am quite familiar with the work on using CBOR to record DNS traces in RFC 8618. The captures were originally in PCAP, but PCAP produces very large files. They looked at a set of variations:
You can see that there are some differences between the various algorithms. JSON clearly produces bigger sizes than the binary alternatives, even after compression. But the biggest differences come from switching from what they call "simple" to what they call "block". The simple alternative is pretty similar to the current qlog: each DNS transaction is represented by a corresponding record in JSON, CBOR, Avro or protobuf. In contrast, the "block" format starts by building tables of objects seen across multiple records: a table of DNS names, a table of record values, etc. The individual PCAP records are then represented by "block records" which, instead of listing DNS names, simply list the index of the name in the table of names. You can think of that as a "logical compression", and it does reduce the size of the recording by a factor of 10x. After that, they can still apply compression. The real beauty of the block format comes when processing the data in back-end programs. Compare:
To:
In the cbor alternative, about 10 times less data is piped into the analysis program than in the pcap alternative. That's a much lower I/O load. On top of that, since the cbor data is structured in blocks, parsing and processing are much easier, resulting in a much lower CPU load. In a project that I was involved with, replacing process-pcap with process-cbor made us run 40 times faster! Also note that there are no practical differences between the various binary alternatives. Yes, ±10% here or there, but compared to a factor of 40 that's really in the noise.
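To get a feel for why processing the block format is so much cheaper, here is a small sketch continuing the hypothetical block layout from the earlier example: the statistics only touch small integer indices, and the name strings are resolved once at the end.

```python
from collections import Counter

# A block as produced by the earlier sketch: a name table plus index-based records.
block = {
    "names": ["example.com", "example.org"],
    "records": [
        {"name": 0, "type": "A"},
        {"name": 1, "type": "AAAA"},
        {"name": 0, "type": "AAAA"},
    ],
}

# The statistics run over small integer indices instead of repeated strings...
counts = Counter(rec["name"] for rec in block["records"])

# ...and the name strings are only looked up once, when reporting.
for index, count in counts.most_common():
    print(block["names"][index], count)
```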
Thanks a lot for that @huitema. Doing something similar to the "block format" would be trivial for qlog as well. However, it doesn't match my mental model of how general-purpose compression works... don't those algorithms also build that type of lookup table on the fly? I will test with manual block formats as well and see what that gives. Another interesting reference from @martinthomson: https://tools.ietf.org/html/draft-mattsson-tls-cbor-cert-compress-00
So I've been doing some tests of my own to figure out the best approach to this for qlog. I've created converter scripts (see https://github.com/quiclog/pcap2qlog/tree/binary/src/converters): one that uses a lookup table/dictionary instead of repeating values, one that CBOR-encodes the files, and a (rudimentary) protobuf schema. The dictionary is currently fully dynamic and stored inside the resulting file, but this can obviously be improved by having a static shared dictionary with a dynamic part for just the field values (much like QPACK and Chrome's NetLog). I've then also looked at various compression schemes (https://github.com/quiclog/pcap2qlog/blob/binary/src/scripts/comparisons/compare.sh) (xz, gzip, brotli, zstd, lz4), focusing mainly on the schemes most often seen on the web for on-the-fly compression (gzip 6 and brotli 4). Full results can be found at https://gist.github.com/rmarx/49bb14f83157d9fe59fb40e7c05b1f3f, with a somewhat nicer representation in the following image (sizes are for traces in which a 500MB or 100MB file was downloaded from the lsquic public endpoint). The blue value is the reference point for the percentages; green is the "best in class" for that row. Main takeaways for me:
Next to these tests, we also ran a survey among QUIC experts (implementers and researchers), and we got replies from 28 participants (thanks everyone!). Part of the survey asked how important they felt features like "fast (de)serialization, small file size, flexibility (e.g., easily adding new event types), grep-ability" were. The full results will be posted soon (they are part of a publication we're preparing), but the gist of it is:

My interpretation:
Finally, we also talked to Facebook (cc @mjoras), who have been deploying qlog at scale, logging over 30 billion qlog events per day. Compared to their earlier binary format, qlog is about 2-3x larger and takes 50% longer to serialize. Yet this is quite manageable on the server side, where they log full-string JSON events to a centralized service. On the client, they do find the file size prohibitive for uploading granular full qlogs (containing all the events they'd like). Still, Matt was also adamant that they'd rather keep the flexibility of the JSON format than move to a less flexible binary one. They were considering using compression and writing a custom JSON (de)serializer, optimized for qlog, to help deal with some of the overhead.

So, presented with those results, my standpoint today is still to keep using JSON as the basis for qlog. I would propose adding the "dictionary" setup to the spec, though, as an optional optimized mode, and also recommend that tools support it (I'm not sure about a default static dictionary at this point though). Furthermore, I'd recommend using CBOR if file size is important. Companies that need more optimization can write their own protobuf (or equivalent) schema (which I've shown is possible) and then write a post-processor to convert to proper JSON qlog for shared tooling. Still, feedback on all this is more than welcome of course! @marten-seemann, @martinthomson, @huitema, @LPardue, @nibanks, @mjoras
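For concreteness, a rough sketch of what such an optional "dictionary" mode could look like (hypothetical event layout and key names, not the actual pcap2qlog or spec format): repeated string values are replaced by indices into a dictionary stored alongside the events.

```python
import json

# Toy qlog-like events with heavily repeated string values (hypothetical layout).
events = [
    {"category": "transport", "event": "packet_sent", "packet_type": "1RTT"},
    {"category": "transport", "event": "packet_received", "packet_type": "1RTT"},
    {"category": "transport", "event": "packet_sent", "packet_type": "1RTT"},
]

dictionary = []
lookup = {}

def index_of(value):
    """Return the dictionary index for a value, adding it on first use."""
    if value not in lookup:
        lookup[value] = len(dictionary)
        dictionary.append(value)
    return lookup[value]

# Replace every repeated string value with its index into the dictionary.
encoded_events = [{key: index_of(value) for key, value in event.items()} for event in events]

# The dynamic dictionary is stored inside the resulting file, next to the events.
encoded = {"dictionary": dictionary, "events": encoded_events}

print(json.dumps(encoded))
```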
If we use cbor, does that mean that we can get rid of the `event_fields` optimization?
- Removes the `event_fields` optimization. Relates to #30. Fixes #89.
- Removes several points of complexity wrt the `group_id` field, as they were not being used in practice.
- Makes JSON the default serialization option. Fixes #101.
- Adds a proper streaming option with NDJSON. Relates to #106. Fixes #109, #2.
- Generally tightens the text and adds more examples.
- Closes #30.
- Also added some placeholders for the privacy/security section.
With the latest commit linked above (eb59e69), I feel this issue has been resolved. qlog has not moved to a binary format by default, but it is now much easier to serialize it as one / to define a binary schema for it. Some of the reasoning behind that has also been included in the qlog document.
I just started working on implementing qlog in quic-go. Maybe it's because I'm still fairly unfamiliar with qlog, but I feel like encoding things in JSON leads to some awkward hacks. Examples of these are `stream_side` ("sending" or "receiving") and `stream_type` ("unidirectional" or "bidirectional"), which are both string fields.

I'm not sure if I like the trick to save bytes on the `events` by first defining the `event_fields` and then using a list instead of an object to encode the `events`. To me, this feels more like a hack to work around the shortcomings of JSON, namely the repetition of the field labels when using objects. As far as I can see, a binary encoding scheme would be able to provide a type-safe representation here without repeating the field labels (and blowing up the file size), as long as it's possible to define some `common_fields` for a connection.

A protobuf-based logging format (this is just a suggestion; protobufs are the thing I'm most familiar with, maybe there are better choices out there) would resolve the encoding ambiguities I listed above, because we'd be able to make use of a strong typing system, which would allow us to completely eliminate the use of `string`s (except for places where things actually are strings, e.g. CONNECTION_CLOSE reason phrases). Furthermore, it would greatly simplify implementing qlog: just fill in the corresponding fields in the protobuf messages, call `Marshal()`, and you're done. No need to manually define dozens of logging structs and make sure they're correctly serialized into qlog's flavor of JSON.
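For readers unfamiliar with the `event_fields` trick being criticized here, a rough illustration (the exact field names are an assumption based on draft-era qlog, not quoted from the spec): the field labels are declared once, and each event becomes a positional array instead of an object.

```python
import json

# Hypothetical draft-era qlog layout: the field labels are declared once...
trace = {
    "event_fields": ["relative_time", "category", "event", "data"],
    # ...and each event is a positional array instead of an object,
    # so the labels are not repeated for every single event.
    "events": [
        [0, "transport", "packet_sent", {"packet_size": 1200}],
        [15, "transport", "packet_received", {"packet_size": 1200}],
    ],
}

print(json.dumps(trace))
```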