Dictionary-encoded columns in MemCS #5630

@TarantoolBot

Since: Tarantool EE 3.7

The MemCS engine now supports dictionary-encoded columns. They can be
created with the layout option (a link to the layout doc page should go here):

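-- `format` and `field_count` are assumed to be defined earlier.
local s = box.schema.create_space('test', {
    engine = 'memcs', format = format, field_count = field_count,
})
-- The layout is requested through the index options.
local pk = s:create_index('pk', {layout = 'dict'})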

Only non-key string columns can use this layout. For any other column
the option is silently ignored.

The maximum number of unique values in such a column (in other words,
the maximum dictionary size) is UINT16_MAX (65535). Once the dictionary
is full, writes of new unique values fail with an error. Each dictionary
index occupies 2 bytes (uint16 is used for indexes under the hood).
Hence, such a column occupies 2 * space_size + dict_size bytes, where
space_size is the number of tuples in the space and dict_size is the
size of the dictionary in bytes.
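
The limit and the size formula above can be illustrated with a small toy model in plain Lua. This is only a sketch, not the MemCS implementation: the names `new_column`, `append` and `column_bytes` are hypothetical, the limit is taken as UINT16_MAX = 65535, indexes are 1-based because Lua arrays are, and dict_size is assumed to count only the string payload (the real per-entry overhead is not specified here).

```lua
-- Toy model of a dictionary-encoded string column (not the MemCS code).
local DICT_LIMIT = 65535 -- UINT16_MAX

local function new_column()
    -- dict: index -> value, lookup: value -> index,
    -- dict_bytes: total payload of the dictionary, indexes: one per row.
    return {dict = {}, lookup = {}, dict_bytes = 0, indexes = {}}
end

-- Append one row. A new unique value is added to the dictionary,
-- unless the dictionary is already full, in which case the write fails.
local function append(col, value)
    local idx = col.lookup[value]
    if idx == nil then
        if #col.dict >= DICT_LIMIT then
            error('dictionary is full')
        end
        col.dict[#col.dict + 1] = value
        idx = #col.dict
        col.lookup[value] = idx
        col.dict_bytes = col.dict_bytes + #value
    end
    col.indexes[#col.indexes + 1] = idx
end

-- 2 bytes per row for the uint16 index plus the dictionary itself.
local function column_bytes(col)
    return 2 * #col.indexes + col.dict_bytes
end

-- 1000 rows that alternate between two 6-byte strings.
local col = new_column()
for i = 1, 1000 do
    append(col, (i % 2 == 0) and 'Moscow' or 'Berlin')
end
print(#col.dict, column_bytes(col)) -- 2 unique values, 2012 bytes
```

Under these assumptions the column takes 2 * 1000 + 12 = 2012 bytes, whereas storing the 6-byte strings directly would take 6000 bytes of payload. The saving grows with the number of repeated values and shrinks as the dictionary approaches its limit.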

The dictionary is accounted for in the space:bsize() statistics.

An Arrow stream over dictionary-encoded columns always returns values in
the dictionary-encoded Arrow layout. The dictionary is returned in the
string-view layout, and the indexes have the uint16 type. When an Arrow
stream is used, the following guarantees hold for dictionaries:

  1. Unless a new unique value has been written to the space, all batches
    have the same dictionary.
  2. Dictionaries are not copied, so dumping them to an ArrowArray is cheap.
  3. A dictionary can only grow, so values in the middle of the dictionary
    are never deleted. Hence, after the dictionary has changed, the new
    dictionary can still be used for batches that were produced with an
    older one.
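
The third guarantee can be sketched with a toy model in plain Lua. It is not the Arrow or MemCS API: `snapshot` and `decode` are hypothetical helpers, indexes are 1-based because Lua arrays are, and the copy made by `snapshot` exists only to simulate "the dictionary as it looked when the batch was produced" (the real dictionaries are not copied).

```lua
-- Toy model: a batch is a list of indexes plus the dictionary it saw.
local dict = {'red', 'green'}

-- Copy the dictionary to freeze its state at batch time (toy only).
local function snapshot(d)
    local copy = {}
    for i, v in ipairs(d) do copy[i] = v end
    return copy
end

-- Replace every index with the value it refers to.
local function decode(indexes, d)
    local out = {}
    for i, idx in ipairs(indexes) do out[i] = d[idx] end
    return out
end

local old_batch = {indexes = {1, 2, 2, 1}, dict = snapshot(dict)}

-- A new unique value is only ever appended; existing entries keep
-- their positions.
dict[#dict + 1] = 'blue'
local new_batch = {indexes = {3, 1}, dict = snapshot(dict)}

-- The old batch decodes identically with the old and the new dictionary.
local with_old = decode(old_batch.indexes, old_batch.dict)
local with_new = decode(old_batch.indexes, new_batch.dict)
for i = 1, #with_old do
    assert(with_old[i] == with_new[i])
end
```

Because old indexes stay valid, a consumer in this model can keep only the most recent dictionary and still decode every batch it received earlier.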

Requested by @drewdzzz in https://github.com/tarantool/tarantool-ee/commit/cd7bd1ca233b1db4211f57439ef08364d4d27c6b.
