Difference between CLP and Parquet? #117
Comments
Hi @emschwartz,

So sorry for the delayed response. We don't currently have a detailed write-up, but perhaps I can provide some info here.

For some frame of reference, to store (unstructured) logs in Parquet files, the simplest approach would be to store timestamps (as an offset from the UNIX epoch), message content, and other fields (e.g., log-level) in separate columns. This should give you some compression as a result of Parquet's built-in encodings. For instance, you could use delta-encoding for the timestamps and dictionary-encoding for the other fields.

Similarities to CLP:
Differences with CLP:
Overall, I would say Parquet and CLP have similar storage formats (at least at the level of storing log events), which is part of what motivated using it at Uber. That said, we anticipate some potential compression and performance bottlenecks (including for search) with using Parquet, since we wouldn't have as much flexibility to choose the precise layout of the data.
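To make the simple columnar layout described above a bit more concrete, here is a minimal pure-Python sketch of the two encodings mentioned: delta-encoding for a timestamp column and dictionary-encoding for a log-level column. It only illustrates the ideas; it is not Parquet's or CLP's actual implementation, and the helper names and sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum over the deltas."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each distinct string with a small integer ID into a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    ids: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids


def dictionary_decode(dictionary: List[str], ids: List[int]) -> List[str]:
    """Look each ID back up in the dictionary."""
    return [dictionary[i] for i in ids]


# Hypothetical log events, stored column-wise (one list per column).
timestamps_ms = [1700000000000, 1700000000005, 1700000000005, 1700000000120]
levels = ["INFO", "INFO", "WARN", "INFO"]

ts_deltas = delta_encode(timestamps_ms)
level_dict, level_ids = dictionary_encode(levels)

print(ts_deltas)              # [1700000000000, 5, 0, 115]
print(level_dict, level_ids)  # ['INFO', 'WARN'] [0, 0, 1, 0]
```

The small deltas and repeated IDs are what make these columns compress well once a general-purpose compressor is applied on top.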
Thanks @kirkrodrigues, that's exactly what I was looking for!
(Sorry if this isn't the right place for this question or if the question doesn't really make sense.) I read about CLP in the Uber blog post and I'm curious how CLP differs from storing logs in Parquet files. Is there a write-up of comparisons anywhere?
Thank you!