What is a "granule"? #4414

arctica · 2019-02-15T21:31:37Z

The documentation on data skipping indexes states:

These indices aggregate some information about the specified expression on blocks, which consist of granularity_value granules, then these aggregates are used in SELECT queries for reducing the amount of data to read from the disk by skipping big blocks of data where the query cannot be satisfied.

What exactly is a granule? Is it a row?

As a related question: are there plans for an index type similar to btree/hash secondary indexes of traditional RDBMS so a WHERE could efficiently look up rows without needing to be part of a prefix of the primary key or scanning all rows for the given column?
As I understand it, the current data skipping indexes basically allow only to answer the question "does this block of rows contain the value that I am looking for?" instead of "which rows in this block contain the value that I am looking for".

alesapin · 2019-02-18T16:32:26Z

What exactly is a granule? Is it a row?

A granule is a fixed-size batch of rows that is addressed by one primary key entry. The term only makes sense for the MergeTree* engine family. The size is set with the index_granularity=N setting; the default is 8192 rows per batch. So with the default value, you get one index entry per 8192 rows.
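
To make the row-to-granule mapping concrete, here is a minimal Python sketch. It is a toy model, not ClickHouse code: the helper names granule_of and granule_count are made up for illustration, and it assumes fixed-size granules within a single data part.

```python
# Toy model of fixed-size granules in one data part (not ClickHouse internals).
INDEX_GRANULARITY = 8192  # default rows per granule, as stated above


def granule_of(row_number: int, index_granularity: int = INDEX_GRANULARITY) -> int:
    """Return the zero-based granule that holds the given zero-based row."""
    return row_number // index_granularity


def granule_count(total_rows: int, index_granularity: int = INDEX_GRANULARITY) -> int:
    """Return how many granules are needed to cover total_rows rows."""
    return -(-total_rows // index_granularity)  # ceiling division
```

With the default setting, rows 0 through 8191 fall into granule 0 and row 8192 starts granule 1, so a part with one million rows would be covered by 123 granules under this model.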

As I understand it, the current data skipping indexes basically allow only to answer the question "does this block of rows contain the value that I am looking for?" instead of "which rows in this block contain the value that I am looking for".

Yes, you understood correctly. This way of indexing (a sparse index) is very efficient. The index is very small, so it fits in memory. Sequential processing of a group of small granules is also very fast.
You can set index_granularity=1 (one primary key entry per row) and also GRANULARITY=1 if you want an index entry per row, but this will require a lot of memory.
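
The "does this block contain the value?" behaviour can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: one in-memory column, a minmax-style aggregate, and hypothetical helper names (build_minmax_index, blocks_to_read) that are not part of ClickHouse.

```python
from typing import List, Tuple


def build_minmax_index(
    values: List[int], index_granularity: int, granularity: int
) -> List[Tuple[int, int]]:
    """Store one (min, max) pair per block of index_granularity * granularity rows."""
    block_rows = index_granularity * granularity
    return [
        (min(values[i:i + block_rows]), max(values[i:i + block_rows]))
        for i in range(0, len(values), block_rows)
    ]


def blocks_to_read(index: List[Tuple[int, int]], needle: int) -> List[int]:
    """Return the blocks whose [min, max] range could contain needle.

    Every other block is skipped. A returned block may still turn out
    not to contain the value, so its rows must be scanned.
    """
    return [n for n, (lo, hi) in enumerate(index) if lo <= needle <= hi]
```

The index only narrows the search down to candidate blocks. For the values [5, 1, 9, 100, 50, 70] with three rows per block, a lookup for 60 still selects the second block even though 60 is not stored there, and the rows in that block have to be read to find that out.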

arctica · 2019-02-19T10:43:11Z

Thank you for the explanation. Maybe a small piece of text could be added to the documentation, like "(a granule is one block of the primary key containing index_granularity rows)"?

I see now how this index can be properly used. It only makes sense when the value being filtered for is very sparse or one needs very fine-grained primary keys.

As I now understand it, the data skipping index is tied to the primary key. E.g. if I have index_granularity=8192 and GRANULARITY=1, then for every 8192 rows the index contains, say, the minmax for the Nth primary key entry.

Is there an advantage to tying the data skipping index to the primary key, or would it make sense to make it a stand-alone index with its own granularity defined in rows? If I had a data skipping index with GRANULARITY=4096 rows, one could easily compute which primary key entry the current data skipping index batch belongs to, since the number of rows is always fixed. That way one could have a finer-grained data skipping index when filtering just by that column. It would also make the index easier to understand.

alexey-milovidov · 2019-02-19T16:51:33Z

As I now understand it, the data skipping index is tied to the primary key. E.g. if I have index_granularity=8192 and GRANULARITY=1, then for every 8192 rows the index contains, say, the minmax for the Nth primary key entry.

Correct.

Is there an advantage to tying the data skipping index to the primary key, or would it make sense to make it a stand-alone index with its own granularity defined in rows?

Every column has a .mrk file alongside its .bin (data) file. These files store "marks": offsets into the data file that make it possible to read or skip the data for specific granules. These marks have the granularity of the primary key index.

If you have a different granularity for secondary keys, you either:

cannot skip data efficiently (you'll have to read and throw away data instead of seeking); or
have to store secondary .mrk files for every column.
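
The first option can be illustrated with a small Python sketch. It assumes a simplified model in which a reader can only seek to a mark, with marks placed every mark_every rows; the function names are hypothetical and details such as compressed blocks are ignored.

```python
from typing import Tuple


def rows_read_via_marks(first_row: int, last_row: int, mark_every: int) -> Tuple[int, int]:
    """Return the half-open row range [start, end) that has to be read.

    first_row and last_row are inclusive, zero-based row numbers that a
    hypothetical finer-grained index reports as possibly matching.
    Reading can only start and stop at mark boundaries.
    """
    start = (first_row // mark_every) * mark_every    # nearest mark at or before first_row
    end = (last_row // mark_every + 1) * mark_every   # first mark after last_row
    return start, end


def wasted_rows(first_row: int, last_row: int, mark_every: int) -> int:
    """Count rows that are read and then thrown away because of mark alignment."""
    start, end = rows_read_via_marks(first_row, last_row, mark_every)
    return (end - start) - (last_row - first_row + 1)
```

In this model, if an index with 4096-row batches points at rows 4096 to 8191 while marks exist only every 8192 rows, the reader starts at row 0 and discards 4096 rows. With marks every 4096 rows nothing is wasted, which corresponds to the second option of storing additional, finer .mrk files.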

stale · 2019-10-20T18:25:23Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

arctica added the question Question? label Feb 15, 2019

alexey-milovidov added the comp-documentation Documentation label Feb 15, 2019

alexey-milovidov assigned alesapin Feb 15, 2019

filimonov added the comp-skipidx Data skipping indices label May 11, 2019

stale bot added the stale label Oct 20, 2019

alexey-milovidov removed the stale label Oct 23, 2019

alexey-milovidov closed this as completed Nov 1, 2019
