What is the status of LowCardinality feature, where is it documented ? #4074
Comments
The LowCardinality data type builds dictionaries automatically. It can use multiple different dictionaries if necessary.
It works independently of (generic) column compression. Columns with the LowCardinality data type are subsequently compressed as usual. The compression ratio of LowCardinality columns, compared to data in text format, may be significantly better than without LowCardinality, but sometimes it is the same. The main benefit is data processing speed, because data is processed in dictionary-encoded form (you can think of it as the data not being fully decompressed before processing: only generic decompression is applied, and the data remains in LowCardinality form).
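To illustrate the idea (a minimal Python sketch, not ClickHouse's actual C++ implementation): dictionary encoding replaces each string with a small integer code, and a predicate such as `city = 'moscow'` can then be evaluated on the codes alone, without materializing the full strings.

```python
# Hypothetical illustration of dictionary encoding, not ClickHouse internals.
def dict_encode(values):
    """Build a dictionary and replace each value with its integer code."""
    dictionary = []   # code -> original value
    index = {}        # original value -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

values = ["moscow", "paris", "moscow", "moscow", "paris", "berlin"]
dictionary, codes = dict_encode(values)

# The column is now a short dictionary plus small integer codes.
assert dictionary == ["moscow", "paris", "berlin"]
assert codes == [0, 1, 0, 0, 1, 2]

# A filter runs one string comparison per distinct value, then only
# cheap integer comparisons over the (possibly huge) code array.
target = dictionary.index("moscow")
matches = [i for i, c in enumerate(codes) if c == target]
assert matches == [0, 2, 3]
```

This is why the speedup grows with column length: the expensive per-string work is proportional to the number of distinct values, not the number of rows.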
Rule of thumb: it should be beneficial if the number of distinct values is less than a few million. But the implementation is actually smarter than that. If the number of distinct values is fairly large, the dictionaries become local: several different dictionaries are used for different ranges of data. For example, if you have too many distinct values in total, but fewer than about a million values each day, then queries by day will be processed efficiently, and queries over larger ranges will still be processed rather efficiently.
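The "local dictionaries" behavior can be sketched like this (again a hypothetical Python illustration under the assumption of fixed-size ranges, not ClickHouse's actual logic): each range of the column gets its own small dictionary, so no single dictionary has to hold every distinct value in the table.

```python
# Hypothetical sketch of per-range ("local") dictionaries.
def encode_in_ranges(values, range_size):
    """Encode each fixed-size range of the column with its own dictionary."""
    ranges = []
    for start in range(0, len(values), range_size):
        chunk = values[start:start + range_size]
        index, dictionary, codes = {}, [], []
        for v in chunk:
            if v not in index:
                index[v] = len(dictionary)
                dictionary.append(v)
            codes.append(index[v])
        ranges.append((dictionary, codes))
    return ranges

# Two "days" of data: many distinct values overall, but few per day.
day1 = ["a", "b", "a", "b"]
day2 = ["c", "d", "c", "d"]
ranges = encode_in_ranges(day1 + day2, range_size=4)

# Each range carries a small dictionary of only its own values,
# so a query touching one day loads only that day's dictionary.
assert ranges[0] == (["a", "b"], [0, 1, 0, 1])
assert ranges[1] == (["c", "d"], [0, 1, 0, 1])
```

A query over a larger range simply works with several small dictionaries instead of one huge one, which is why it is still processed rather efficiently.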
It is supported for all table types and for all query types. All functions and aggregate functions are also supported automatically. I don't remember any remaining restrictions; I will ask @KochetovNicolai.
The feature is almost production ready. Only a single known issue remains: #4038, and it is going to be fixed in 19.0. That said, we have not yet started to use the LowCardinality feature in production at Yandex.
There are no restrictions on LowCardinality usage. You can find more information in this presentation.
@alexey-milovidov and @KochetovNicolai, thank you for the briefing; dictionary-encoded processing is an extremely important topic in database technology. I foresee significant progress on techniques based on it. It's all about the power of abstraction and representation, which is deeply rooted in semiotics.

TriaClick, the next release of TRIADB with ClickHouse, the project I am working on, is based on similar encoding, where among other things it solves, to a large extent (and better than RDF/Property/Topic Maps), the problems of data modeling, namespaces, joins, missing data, and output representation (tuples-table vs associations-graph). This is achieved by assigning numerical dimensions to data models, entities, attributes, and instances of them (e.g. tuple PKs and values). In that context I discovered that the ClickHouse partitioning system is a perfect fit: the partition key and primary key match a 3D dimensional key. I will use the LowCardinality feature and let you know about its performance. I think this will also make TriaClick's associative, semiotic, hypergraph technology run faster. I had a quick look at Nikolai Kochetov's presentation, which helps, but I will also search for code examples covering various cases in the test repositories of the latest ClickHouse distribution. Then I may continue our discussion here.

PS: @alexey-milovidov, speaking about the smart implementation of LowCardinality in the case of large volumes of data, I faced a similar problem with smart ad hoc filtering by various ranges (user selections) of data. Generally speaking, one is never interested in looking at all the data but only a part of it, i.e. a view of the data. Therefore, each time there is a new user request, i.e. an update of the filters, the TriaClick system reconstructs temporary engines (ClickHouse MergeTree tables and Sets) under the hood for this view. So I think I understand you well here.
Hi, I have searched for official documentation of this feature; it is quite important for the project I am working on, but I have not found any yet. It is not described in data types; please correct me if I am wrong.
There was a similar question here, #2903, half a year ago, but the answer from @alexey-milovidov is not very specific. I want to find out the details of this feature; specifically, I want to know.
Can somebody answer these quickly?
Thank you
PS: The change log is often updated with information about LowCardinality. That shows, I think, how important this feature is. It also seems to have matured, but unfortunately it has not been documented.