What is the status of the LowCardinality feature, and where is it documented? #4074

Closed
healiseu opened this issue Jan 16, 2019 · 3 comments


healiseu commented Jan 16, 2019

Hi, I have searched for official documentation of this feature, which is quite important for the project I am working on, but I have not found any yet; it is not described under data types. Please correct me if I am wrong.

There was a similar question here half a year ago, #2903, but the answer from @alexey-milovidov is not very specific. I want to find out the details of this feature; specifically, I want to know:

  1. What are the differences from ENUM types?
  2. Can you give a brief explanation of how it supplements column compression?
  3. When is it used? How low, in numbers, does the cardinality of a field have to be to qualify for its use?
  4. What are the current restrictions on its use, e.g. INSERT, engines, etc.?
  5. Is it safe to use the latest version, 18.16.1, and include this feature?

Can somebody answer these quickly?

Thank you

PS: The changelog is often updated with information about LowCardinality, which I think shows how important this feature is. It also seems to have matured, but unfortunately it has not been documented.

healiseu changed the title from "Status of LowCardinality feature" to "What is the status of the LowCardinality feature, and where is it documented?" on Jan 16, 2019

alexey-milovidov (Member) commented Jan 17, 2019

What are the differences from ENUM types?

The LowCardinality data type builds dictionaries automatically. It can use multiple different dictionaries if necessary.
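
For illustration, here is a minimal sketch contrasting the two declarations (the table and column names are hypothetical):

```sql
-- Enum8 fixes the value set in the schema; adding a new value
-- requires an ALTER of the column type.
CREATE TABLE events_enum
(
    device Enum8('desktop' = 1, 'mobile' = 2, 'tablet' = 3)
)
ENGINE = MergeTree ORDER BY tuple();

-- LowCardinality builds and maintains its dictionary automatically;
-- any string value can be inserted without a schema change.
CREATE TABLE events_lc
(
    device LowCardinality(String)
)
ENGINE = MergeTree ORDER BY tuple();
```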

Can you give a brief explanation of how it supplements column compression?

It works independently of (generic) column compression. Columns with the LowCardinality data type are subsequently compressed as usual. The compression ratio of LowCardinality columns, compared to the data in text format, may be significantly better than without LowCardinality, but sometimes it is the same. The main benefit is data processing speed, because the data is processed in dictionary-encoded form (you can think of it as the data never being fully decompressed before processing: only generic decompression is applied, and the data remains in LowCardinality form).
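
To observe the effect on disk, one option is to compare the compressed and uncompressed column sizes recorded in system.columns (a sketch, reusing the hypothetical events_lc table from above):

```sql
-- Compare on-disk (compressed) and in-memory (uncompressed) sizes
-- per column for the hypothetical events_lc table.
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'events_lc';
```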

When is it used? How low, in numbers, does the cardinality of a field have to be to qualify for its use?

Rule of thumb: it should be beneficial if the number of distinct values is less than a few million.

But actually, the implementation is smarter than that. If the number of distinct values is fairly large, the dictionaries become local: several different dictionaries will be used for different ranges of the data. For example, if you have too many distinct values in total, but fewer than about a million each day, then queries by day will be processed efficiently, and queries over larger ranges will still be processed rather efficiently.
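
A quick way to check whether a column qualifies, both overall and per range, might look like this (the event_time and device column names are hypothetical):

```sql
-- Overall distinct-value count for a candidate column
-- (uniq() gives an approximate count, which is enough here).
SELECT uniq(device) FROM events_lc;

-- Distinct values per day: even when the total is large,
-- LowCardinality can pay off if each day stays around a million or fewer.
SELECT toDate(event_time) AS day, uniq(device) AS distinct_values
FROM events_lc
GROUP BY day
ORDER BY day;
```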

What are the current restrictions on its use, e.g. INSERT, engines, etc.?

It is supported for all table types and for all query types. All functions and aggregate functions are also supported automatically. I don't remember any remaining restrictions; I will ask @KochetovNicolai.
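
For example, ordinary filters and aggregates should work on a LowCardinality column with no special syntax (a sketch against the hypothetical events_lc table):

```sql
-- The column behaves like a plain String in queries; internally,
-- filters and aggregates operate on the dictionary-encoded form.
SELECT device, count() AS hits
FROM events_lc
WHERE device != 'tablet'
GROUP BY device
ORDER BY hits DESC;
```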

Is it safe to use the latest version, 18.16.1, and include this feature?

The feature is almost production-ready. Only a single known issue remains, #4038, and it is going to be fixed in 19.0. That said, we have not yet started to use the LowCardinality feature in production at Yandex.

alexey-milovidov added the question label Jan 17, 2019
KochetovNicolai (Member) commented

There are no restrictions on LowCardinality usage. You can find more information in this presentation.


healiseu commented Jan 18, 2019

@alexey-milovidov and @KochetovNicolai, thank you for the briefing. Dictionary-encoded processing is an extremely important topic in database technology, and I foresee significant progress on techniques based on it. It's all about the power of abstraction and representation, which is deeply rooted in semiotics.

TriaClick, the next release of TRIADB on top of ClickHouse and the project I am working on, is based on a similar encoding, which among other things solves to a large extent (better than RDF/Property/Topic Maps) the problems of data modeling, namespaces, joins, missing data, and output representation (tuples-table vs. associations-graph). This is achieved by assigning numerical dimensions to data models, entities, attributes, and their instances (e.g. tuple PKs and values). Here I discovered that the ClickHouse partitioning system is a perfect fit: the partition key and primary key match a 3D dimensional key. I will use the LowCardinality feature and let you know about its performance. I think this will also make TriaClick's associative, semiotic, hypergraph technology run faster.

I had a quick look at Nikolai Kochetov's presentation, which helps, but I will also search for code examples in the test repositories of the latest ClickHouse distribution to study various cases. Then I may continue our discussion about LowCardinality here.

PS: @alexey-milovidov, speaking of the smart implementation of LowCardinality for large volumes of data, I faced a similar problem with smart ad-hoc filtering over various ranges (user selections) of data. Generally speaking, one is never interested in looking at all the data, only at part of it, i.e. a view of the data. Therefore, each time there is a new user request, i.e. an update of the filters, the TriaClick system reconstructs temporary engines (ClickHouse MergeTree tables and Sets) under the hood for this view. So I think I understand you well here.
