Difference between CLP and Parquet? #117

Closed
emschwartz opened this issue May 2, 2023 · 2 comments

Comments

@emschwartz

(Sorry if this isn't the right place for this question or if the question doesn't really make sense.) I read about CLP in the Uber blog post, and I'm curious how CLP differs from storing logs in Parquet files. Is there a write-up of comparisons anywhere?

Thank you!

@kirkrodrigues
Member

Hi @emschwartz,

So sorry for the delayed response.

We don't currently have a detailed write-up, but perhaps I can provide some info here.

As a frame of reference, the simplest approach to storing (unstructured) logs in Parquet files would be to store timestamps (as offsets from the UNIX epoch), message content, and other fields (e.g., log level) in separate columns. This should give you some compression thanks to Parquet's built-in encodings; for instance, you could use delta encoding for the timestamps and dictionary encoding for the other fields.
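
To make that concrete, here's a minimal sketch of that layout using pyarrow (the column names, sample log lines, and file name are just for illustration; newer pyarrow releases also let you pick per-column encodings, e.g., delta encoding for the timestamp column, via the `column_encoding` argument to `pq.write_table`):

```python
# Minimal sketch: unstructured logs as three Parquet columns
# (timestamp, log level, raw message). Assumes pyarrow is installed;
# the column names and sample log lines are made up for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    # Timestamps as milliseconds since the UNIX epoch
    "timestamp": pa.array([1683000000000, 1683000000123, 1683000000456],
                          type=pa.int64()),
    "level": ["INFO", "INFO", "ERROR"],
    "message": [
        "Task 12 finished in 340 ms",
        "Task 13 finished in 512 ms",
        "Task 14 failed after 3 retries",
    ],
})

# Dictionary encoding is applied per column by default; a low-cardinality
# field like `level` deduplicates well, while the free-text `message`
# column benefits far less (more on that below).
pq.write_table(table, "logs.parquet", compression="zstd", use_dictionary=True)
```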

Similarities to CLP:

  • Both Parquet and CLP store logs in a columnar format where the columns are split into row-groups (such that you don't end up with extremely large columns).
  • Both apply specialized encodings per data-type to improve compression ratio.
  • On a similar note, both can use dictionaries for data deduplication.

Differences with CLP:

  • With Parquet's dictionary-encoding, the message content likely wouldn't compress very well (at least not as well as CLP) since the content contains variable values. In contrast, CLP's parsing separates the variable values from the rest of the message, which benefits compression since each column is less random. That said, if you use CLP's parsing, you could store the parsed log messages in Parquet (see the sketch after this list).

  • Parquet's dictionary is limited to a certain size in each column chunk, so the level of deduplication is more localized compared to CLP where we maintain separate dictionary files per archive.

  • Parquet's on-disk layout seems optimized for sequential writes but not necessarily sequential reads. For instance, the metadata is stored at the end of the file, so using it requires seeking to the end of the file, then seeking backwards to read the metadata and eventually the desired columns. In contrast, CLP's layout is optimized for both sequential writes and reads (so that we can take advantage of low-cost storage like hard drives and block storage): CLP stores its metadata in files separate from the actual data, and it avoids seeks when writing and reading the data. That said, I must admit I'm not too familiar with the latest innovations in the Parquet space; I would imagine they have some solutions for this problem.
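
To illustrate the first point above, here's a rough, hypothetical sketch of that kind of parsing; the regex below is just a stand-in for CLP's actual parser, but it shows how the repetitive log types and the variable values end up in separate, more compressible columns:

```python
# Hypothetical illustration (not CLP's actual parser): split each message into
# a static "log type" template and its variable values, then store the
# resulting columns in Parquet.
import re
import pyarrow as pa
import pyarrow.parquet as pq

messages = [
    "Task 12 finished in 340 ms",
    "Task 13 finished in 512 ms",
    "Task 14 failed after 3 retries",
]

log_types = []
variables = []
for msg in messages:
    # Replace each integer with a placeholder; what remains is the log type,
    # which repeats heavily across messages and dictionary-encodes well.
    log_types.append(re.sub(r"\d+", "<int>", msg))
    variables.append([int(v) for v in re.findall(r"\d+", msg)])

table = pa.table({
    "log_type": log_types,   # e.g., "Task <int> finished in <int> ms"
    "variables": pa.array(variables, type=pa.list_(pa.int64())),
})
pq.write_table(table, "parsed_logs.parquet", use_dictionary=True)
```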

Overall, I would say Parquet and CLP have similar storage formats (at least at the level of storing log events), which is part of what motivated using it at Uber. That said, we anticipate some potential compression and performance bottlenecks (including for search) when using Parquet, since we wouldn't have as much flexibility to choose the precise layout of the data.

@emschwartz
Author

Thanks @kirkrodrigues, that's exactly what I was looking for!
