What is a "granule"? #4414

Closed
arctica opened this issue Feb 15, 2019 · 4 comments
Labels
comp-documentation Documentation comp-skipidx Data skipping indices question Question?

@arctica

arctica commented Feb 15, 2019

The documentation on data skipping indexes states:

These indices aggregate some information about the specified expression on blocks, which consist of granularity_value granules, then these aggregates are used in SELECT queries for reducing the amount of data to read from the disk by skipping big blocks of data where the query cannot be satisfied.

What exactly is a granule? Is it a row?

As a related question: are there plans for an index type similar to btree/hash secondary indexes of traditional RDBMS so a WHERE could efficiently look up rows without needing to be part of a prefix of the primary key or scanning all rows for the given column?
As I understand it, the current data skipping indexes basically only answer the question "does this block of rows contain the value that I am looking for?" rather than "which rows in this block contain the value that I am looking for?".

@alesapin
Member

What exactly is a granule? Is it a row?

A granule is a fixed-size batch of rows that is addressed by the primary key. The term only makes sense for the MergeTree* engine family. The size can be set with the setting index_granularity=N; the default value is 8192 rows per batch. So if you use the default value, you will have one index entry per 8192 rows.
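
As a rough illustration of this definition, the following Python sketch maps row numbers to granules for a fixed index_granularity. The function names are hypothetical and the model only captures the fixed row count per granule; it is not ClickHouse's implementation.

```python
# Toy model: a granule is a fixed-size batch of consecutive rows.
INDEX_GRANULARITY = 8192  # the default value mentioned above


def granule_of_row(row_number: int, index_granularity: int = INDEX_GRANULARITY) -> int:
    """Return the zero-based granule number containing the zero-based row."""
    return row_number // index_granularity


def granule_count(total_rows: int, index_granularity: int = INDEX_GRANULARITY) -> int:
    """Number of granules needed to hold total_rows rows."""
    # Ceiling division: the last granule may be only partly filled.
    return (total_rows + index_granularity - 1) // index_granularity


print(granule_of_row(8191))      # 0
print(granule_of_row(8192))      # 1
print(granule_count(1_000_000))  # 123
```

With the default setting, a million rows are covered by only 123 granules in this model, which is why the index stays small.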

As I understand it, the current data skipping indexes basically only answer the question "does this block of rows contain the value that I am looking for?" rather than "which rows in this block contain the value that I am looking for?".

Yes, you understood correctly. This way of indexing (a sparse index) is very efficient. The index is very small, so it can be kept in memory. Sequential processing of a group of small granules is also very fast.
You can set index_granularity=1 (one primary key entry per row) and also set GRANULARITY=1 if you want an index entry for every row, but this will require a lot of memory.
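
To make the "does this block contain the value?" behaviour concrete, here is a minimal Python sketch of a minmax-style skipping index. It assumes an in-memory list of integers and hypothetical helper names, and it only models the idea of one aggregate per block of GRANULARITY granules, not the real storage format.

```python
from typing import List, Tuple


def build_minmax_index(values: List[int], index_granularity: int,
                       granularity: int) -> List[Tuple[int, int]]:
    """Store one (min, max) pair per block of `granularity` granules."""
    rows_per_block = index_granularity * granularity
    index = []
    for start in range(0, len(values), rows_per_block):
        block = values[start:start + rows_per_block]
        index.append((min(block), max(block)))
    return index


def blocks_to_read(index: List[Tuple[int, int]], needle: int) -> List[int]:
    """Blocks whose [min, max] range may contain needle; the rest are skipped."""
    return [n for n, (lo, hi) in enumerate(index) if lo <= needle <= hi]


values = list(range(100))  # 100 rows; tiny sizes chosen for readability
index = build_minmax_index(values, index_granularity=10, granularity=2)
print(index)                      # [(0, 19), (20, 39), (40, 59), (60, 79), (80, 99)]
print(blocks_to_read(index, 42))  # [2]
```

The index only says that block 2 may contain the value 42; the rows inside that block still have to be scanned to find the matching ones.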

@arctica
Author

arctica commented Feb 19, 2019

Thank you for the explanation. Maybe a small piece of text could be added to the documentation, like "(a granule is one block of the primary key containing index_granularity rows)"?

I see now how this index can be properly used. It only makes sense when the value being filtered for is very sparse or when one needs very fine-grained primary keys.

As I now understand it, the data skipping index is tied to the primary key. E.g. if I have index_granularity=8192 and GRANULARITY=1, then for every 8192 rows the index contains, say, the minmax for the Nth primary key.

Is there an advantage to tying the data skipping index to the primary key, or would it make sense to make it a stand-alone index with its own granularity defined in rows? If I had a data skipping index with GRANULARITY=4096rows, one could easily compute which primary key the current data skipping index batch belongs to, since the number of rows is always fixed. That way one could have a finer-grained data skipping index when filtering just by that column. It would also make the index easier to understand.

@alexey-milovidov
Member

As I now understand it, the data skipping index is tied to the primary key. E.g. if I have index_granularity=8192 and GRANULARITY=1, then for every 8192 rows the index contains, say, the minmax for the Nth primary key.

Correct.
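
The arithmetic behind this can be sketched in a few lines of Python. The function names are hypothetical; the only assumption is the relationship confirmed above, namely that one skipping index entry covers GRANULARITY granules of index_granularity rows each.

```python
def rows_per_skip_index_entry(index_granularity: int, granularity: int) -> int:
    """Rows summarized by a single data skipping index entry."""
    return index_granularity * granularity


def skip_index_entries(total_rows: int, index_granularity: int, granularity: int) -> int:
    """Number of skipping index entries needed for total_rows rows (ceiling division)."""
    rows = rows_per_skip_index_entry(index_granularity, granularity)
    return (total_rows + rows - 1) // rows


print(rows_per_skip_index_entry(8192, 1))      # 8192
print(rows_per_skip_index_entry(8192, 4))      # 32768
print(skip_index_entries(1_000_000, 8192, 1))  # 123
print(skip_index_entries(1_000_000, 1, 1))     # 1000000
```

The last line shows why a per-row index is expensive: the number of entries grows to the number of rows.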

Is there an advantage to tying the data skipping index to the primary key, or would it make sense to make it a stand-alone index with its own granularity defined in rows?

Every column has a .mrk file along with its .bin (data) file. These files store "marks": offsets in the data file that make it possible to read or skip the data of specific granules. These marks have the granularity of the primary key index.

If you have different granularity for secondary keys, you either:

  • cannot skip data efficiently (you'll have to read and discard data instead of seeking);
  • have to store secondary .mrk files for every column.
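
The trade-off described above can be sketched with a toy model in Python. It assumes uncompressed, fixed-width 8-byte values and hypothetical function names; real .bin and .mrk files are more involved, so treat this only as an illustration of why seeking is possible at mark boundaries and nowhere else.

```python
import io
import struct

ROW_SIZE = 8  # assumption: fixed-width 8-byte integers, no compression


def write_column(values, index_granularity):
    """Return (data_bytes, marks), where marks[g] is the byte offset of granule g."""
    data = io.BytesIO()
    marks = []
    for row, value in enumerate(values):
        if row % index_granularity == 0:
            marks.append(data.tell())  # a new granule starts here
        data.write(struct.pack("<q", value))
    return data.getvalue(), marks


def read_rows(data, marks, index_granularity, first_row, row_count):
    """Read rows [first_row, first_row + row_count); return (values, rows_discarded)."""
    granule = first_row // index_granularity
    stream = io.BytesIO(data)
    stream.seek(marks[granule])  # seeking is only possible to a mark
    discard = first_row - granule * index_granularity
    stream.read(discard * ROW_SIZE)  # rows before first_row are read and thrown away
    raw = stream.read(row_count * ROW_SIZE)
    return [v for (v,) in struct.iter_unpack("<q", raw)], discard


data, marks = write_column(list(range(32)), index_granularity=8)
print(marks)                             # [0, 64, 128, 192]
print(read_rows(data, marks, 8, 16, 4))  # ([16, 17, 18, 19], 0)
print(read_rows(data, marks, 8, 20, 4))  # ([20, 21, 22, 23], 4)
```

A read that starts on a granule boundary (row 16) needs only a seek, while a read that starts at a finer boundary (row 20) has to read and discard four rows first, unless a second set of marks is stored for that finer granularity.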

@filimonov added the comp-skipidx (Data skipping indices) label May 11, 2019
@stale

stale bot commented Oct 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
