What is the status of the LowCardinality feature, and where is it documented? #4074

Closed
healiseu opened this issue Jan 16, 2019 · 3 comments


healiseu commented Jan 16, 2019

Hi, I have searched for official documentation of this feature, which is quite important for the project I am working on, but I have not found any yet; it is not described under data types. Please correct me if I am wrong.

There was a similar question here half a year ago, #2903, but the answer from @alexey-milovidov is not very specific. I want to find out the details of this feature; specifically, I want to know:

  1. What are the differences from ENUM types?
  2. Can you give a brief explanation of how it supplements column compression?
  3. When is it used? How low, in numbers, does the cardinality of a field have to be to qualify for its use?
  4. What are the current restrictions on its use, e.g. INSERT, engines, etc.?
  5. Is it safe to use the latest version, 18.16.1, and include this feature?

Can somebody answer these quickly?

Thank you

PS: The changelog is often updated with information about LowCardinality, which I think shows how important this feature is. It also seems to have matured, but unfortunately it has not been documented.

healiseu changed the title from "Status of LowCardinality feature" to "What is the status of the LowCardinality feature, and where is it documented?" on Jan 16, 2019

alexey-milovidov (Member) commented Jan 17, 2019

What are the differences from ENUM types?

The LowCardinality data type builds dictionaries automatically. It can use multiple different dictionaries if necessary.
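
For illustration, here is a minimal sketch contrasting the two declarations (the table and column names are hypothetical):

```sql
-- Enum8 fixes the value set in the schema; adding a new value
-- requires an ALTER of the column type.
CREATE TABLE events_enum
(
    device Enum8('desktop' = 1, 'mobile' = 2, 'tablet' = 3)
)
ENGINE = MergeTree ORDER BY tuple();

-- LowCardinality builds and maintains its dictionary automatically;
-- any string value can be inserted without a schema change.
CREATE TABLE events_lc
(
    device LowCardinality(String)
)
ENGINE = MergeTree ORDER BY tuple();
```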

Can you give a brief explanation of how it supplements column compression?

It works independently of (generic) column compression. Columns with the LowCardinality data type are subsequently compressed as usual. The compression ratio of LowCardinality columns, compared to the data in text format, may be significantly better than without LowCardinality, but sometimes it is the same. The main benefit is data processing speed, because the data is processed in dictionary-encoded form (you can think of it as the data never being fully decompressed before processing: only generic decompression is applied, and the data remains in LowCardinality form).
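
To observe the effect on disk, one option is to compare the compressed and uncompressed column sizes recorded in system.columns (a sketch, reusing the hypothetical events_lc table from above):

```sql
-- Compare on-disk (compressed) and in-memory (uncompressed) sizes
-- per column for the hypothetical events_lc table.
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'events_lc';
```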

When is it used? How low, in numbers, does the cardinality of a field have to be to qualify for its use?

Rule of thumb: it should be beneficial if the number of distinct values is less than a few million.

But actually, the implementation is smarter than that. If the number of distinct values is fairly large, the dictionaries become local: several different dictionaries will be used for different ranges of the data. For example, if you have too many distinct values in total, but fewer than about a million each day, then queries by day will be processed efficiently, and queries over larger ranges will still be processed rather efficiently.
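
A quick way to check whether a column qualifies, both overall and per range, might look like this (the event_time and device column names are hypothetical):

```sql
-- Overall distinct-value count for a candidate column
-- (uniq() gives an approximate count, which is enough here).
SELECT uniq(device) FROM events_lc;

-- Distinct values per day: even when the total is large,
-- LowCardinality can pay off if each day stays around a million or fewer.
SELECT toDate(event_time) AS day, uniq(device) AS distinct_values
FROM events_lc
GROUP BY day
ORDER BY day;
```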

What are the current restrictions on its use, e.g. INSERT, engines, etc.?

It is supported for all table types and for all query types. All functions and aggregate functions are also supported automatically. I don't remember any remaining restrictions; I will ask @KochetovNicolai.
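
For example, ordinary filters and aggregates should work on a LowCardinality column with no special syntax (a sketch against the hypothetical events_lc table):

```sql
-- The column behaves like a plain String in queries; internally,
-- filters and aggregates operate on the dictionary-encoded form.
SELECT device, count() AS hits
FROM events_lc
WHERE device != 'tablet'
GROUP BY device
ORDER BY hits DESC;
```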

Is it safe to use the latest version, 18.16.1, and include this feature?

The feature is almost production-ready. Only a single known issue remains, #4038, and it is going to be fixed in 19.0. That said, we have not yet started to use the LowCardinality feature in production at Yandex.

alexey-milovidov added the question label Jan 17, 2019
KochetovNicolai (Member) commented

There are no restrictions on LowCardinality usage. You can find more information in this presentation.


healiseu commented Jan 18, 2019

@alexey-milovidov and @KochetovNicolai, thank you for the briefing. Dictionary-encoded processing is an extremely important topic in database technology, and I foresee significant progress on techniques based on it. It's all about the power of abstraction and representation, which is deeply rooted in semiotics.

TriaClick, the next release of TRIADB on top of ClickHouse and the project I am working on, is based on a similar encoding, which among other things solves to a large extent (better than RDF/Property/Topic Maps) the problems of data modeling, namespaces, joins, missing data, and output representation (tuples-table vs. associations-graph). This is achieved by assigning numerical dimensions to data models, entities, attributes, and their instances (e.g. tuple PKs and values). Here I discovered that the ClickHouse partitioning system is a perfect fit: the partition key and primary key match a 3D dimensional key. I will use the LowCardinality feature and let you know about its performance. I think this will also make TriaClick's associative, semiotic, hypergraph technology run faster.

I had a quick look at Nikolai Kochetov's presentation, which helps, but I will also search for code examples in the test repositories of the latest ClickHouse distribution to study various cases. Then I may continue our discussion about LowCardinality here.

PS: @alexey-milovidov, speaking of the smart implementation of LowCardinality for large volumes of data, I faced a similar problem with smart ad-hoc filtering over various ranges (user selections) of data. Generally speaking, one is never interested in looking at all the data, only at part of it, i.e. a view of the data. Therefore, each time there is a new user request, i.e. an update of the filters, the TriaClick system reconstructs temporary engines (ClickHouse MergeTree tables and Sets) under the hood for this view. So I think I understand you well here.
