Difference between CLP and Parquet? #117

Closed
emschwartz opened this issue May 2, 2023 · 2 comments

Comments

@emschwartz

(Sorry if this isn't the right place for this question or if the question doesn't really make sense.) I read about CLP in the Uber blog post, and I'm curious how CLP differs from storing logs in Parquet files. Is there a write-up of comparisons anywhere?

Thank you!

@kirkrodrigues
Member

Hi @emschwartz,

So sorry for the delayed response.

We don't currently have a detailed write-up, but perhaps I can provide some info here.

As a frame of reference, the simplest approach to storing (unstructured) logs in Parquet files would be to store timestamps (as offsets from the UNIX epoch), message content, and other fields (e.g., log level) in separate columns. This should give you some compression thanks to Parquet's built-in encodings; for instance, you could use delta encoding for the timestamps and dictionary encoding for the other fields.
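
To make that concrete, here's a minimal sketch of that layout using pyarrow (the column names, sample log lines, and file name are just for illustration; newer pyarrow releases also let you pick per-column encodings, e.g., delta encoding for the timestamp column, via the `column_encoding` argument to `pq.write_table`):

```python
# Minimal sketch: unstructured logs as three Parquet columns
# (timestamp, log level, raw message). Assumes pyarrow is installed;
# the column names and sample log lines are made up for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    # Timestamps as milliseconds since the UNIX epoch
    "timestamp": pa.array([1683000000000, 1683000000123, 1683000000456],
                          type=pa.int64()),
    "level": ["INFO", "INFO", "ERROR"],
    "message": [
        "Task 12 finished in 340 ms",
        "Task 13 finished in 512 ms",
        "Task 14 failed after 3 retries",
    ],
})

# Dictionary encoding is applied per column by default; a low-cardinality
# field like `level` deduplicates well, while the free-text `message`
# column benefits far less (more on that below).
pq.write_table(table, "logs.parquet", compression="zstd", use_dictionary=True)
```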

Similarities to CLP:

  • Both Parquet and CLP store logs in a columnar format where the columns are split into row-groups (such that you don't end up with extremely large columns).
  • Both apply specialized encodings per data-type to improve compression ratio.
  • On a similar note, both can use dictionaries for data deduplication.

Differences with CLP:

  • With Parquet's dictionary-encoding, the message content likely wouldn't compress very well (at least not as well as CLP) since the content contains variable values. In contrast, CLP's parsing separates the variable values from the rest of the message, which benefits compression since each column is less random. That said, if you use CLP's parsing, you could store the parsed log messages in Parquet (see the sketch after this list).

  • Parquet's dictionary is limited to a certain size in each column chunk, so the level of deduplication is more localized compared to CLP where we maintain separate dictionary files per archive.

  • Parquet's on-disk layout seems optimized for sequential writes but not necessarily sequential reads. For instance, the metadata is stored at the end of the file, so using it requires seeking to the end of the file, then seeking backwards to read the metadata and eventually the desired columns. In contrast, CLP's layout is optimized for both sequential writes and reads (so that we can take advantage of low-cost storage like hard drives and block storage): CLP stores its metadata in files separate from the actual data, and it avoids seeks when writing and reading the data. That said, I must admit I'm not too familiar with the latest innovations in the Parquet space; I would imagine they have some solutions for this problem.
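
To illustrate the first point above, here's a rough, hypothetical sketch of that kind of parsing; the regex below is just a stand-in for CLP's actual parser, but it shows how the repetitive log types and the variable values end up in separate, more compressible columns:

```python
# Hypothetical illustration (not CLP's actual parser): split each message into
# a static "log type" template and its variable values, then store the
# resulting columns in Parquet.
import re
import pyarrow as pa
import pyarrow.parquet as pq

messages = [
    "Task 12 finished in 340 ms",
    "Task 13 finished in 512 ms",
    "Task 14 failed after 3 retries",
]

log_types = []
variables = []
for msg in messages:
    # Replace each integer with a placeholder; what remains is the log type,
    # which repeats heavily across messages and dictionary-encodes well.
    log_types.append(re.sub(r"\d+", "<int>", msg))
    variables.append([int(v) for v in re.findall(r"\d+", msg)])

table = pa.table({
    "log_type": log_types,   # e.g., "Task <int> finished in <int> ms"
    "variables": pa.array(variables, type=pa.list_(pa.int64())),
})
pq.write_table(table, "parsed_logs.parquet", use_dictionary=True)
```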

Overall, I would say Parquet and CLP have similar storage formats (at least at the level of storing log events), which is part of what motivated using it at Uber. That said, we anticipate some potential compression and performance bottlenecks (including for search) when using Parquet, since we wouldn't have as much flexibility to choose the precise layout of the data.

@emschwartz
Author

Thanks @kirkrodrigues, that's exactly what I was looking for!
