-
Notifications
You must be signed in to change notification settings - Fork 109
Description
Problem: Currently, Vortex has poor support for nested data types.
Consider the following schema:
D describe events;
┌─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ type │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ actor │ STRUCT(id BIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) │ YES │ NULL │ NULL │ NULL │
│ repo │ STRUCT(id BIGINT, "name" VARCHAR, url VARCHAR) │ YES │ NULL │ NULL │ NULL │
│ payload │ STRUCT("action" VARCHAR, issue STRUCT(url VARCHAR, repository_url VARCHAR, labels_url VARCHAR, comments_url VARCHAR, events_url VARCHAR, … │ YES │ NULL │ NULL │ NULL │
│ public │ BOOLEAN │ YES │ NULL │ NULL │ NULL │
│ created_at │ TIMESTAMP │ YES │ NULL │ NULL │ NULL │
│ org │ STRUCT(id BIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
Currently the way our StructStrategy works, this is written with the following layout tree:
StructLayout
|__ "payload" child (Zoned) dtype=struct{id, name,}
|__ data: FlatLayout
|__ zones: ZoneMap(null_count)
... (other columns)
Putting the entire Struct column inside of a FlatLayout has many downsides
- No stats propagation. Currently we can't pushdown operations over the nested fields
- Terrible random access performance. All nested fields get read together when the FlatLayout is deserialized. This is really bad, and the problems gets worse the larger and bushier the schema
We end up having a similar problem for schemas with LIST fields, both nested or even top-level. Our layout strategies are not aware of complex types and it causes pain where users have to work around it by flattening the schema themselves.
Solution Space
RepDef
Parquet and Lance solve the this problem using repetition and definition levels. There's a lot of prior art here so I won't go into detail, but there are good links covering the basics here and here.
RepDef exists to solve the dual problem of wasted space and multiple buffer access
Weston's article makes a good case for repdef: namely, that it reduces the amount of data that needs to be scanned for a random cell access. You only need 1-2 IOs max to read a specific cell value, and then some linear amount of CPU decoding to find the N-th value and its validity bit.
What RepDef is not as good at is zero-copy deserialization. Converting back into Arrow requires you to unravel the repdef back into the hierarchical format.
Component Shredding
A more simple solution involves a layout that performs full component shredding, and holds onto metadata required to unshred the nested fields back together at scan time.
For example, the layout would be responsible for taking something like the following (JSON)
{
"a": {
"b": [
{ "c": true, "d": 1 },
null,
{ "c": false, "d": 2 },
{ "c": null, "d": 3 },
]
}
} Into the following components
ShreddedLayout
|__ a.b.validity
|__ a.b.offsets
|__ a.b.c.values
|__ a.b.c.validity
|__ a.b.d.values
Projecting specific components would be pretty simple here, and we can do it zero-copy directly from the file, for example reading the array SELECT d FROM unnest(a.b) involves
- Decode
a.b.validity - Decode
a.b.d.values
Random access requires extra steps, decoding potentially several buffers to access a single cell. For example, say we wanted to access the value at a.b[0].c, this would require:
- Decode
a.b.validity, check ifa.b[0]is null - If not, decode
a.b.offsets. Find offset for 0-th list - Decode
a.b.c.validityat the 0-th offset position, to check if null - Decode
a.b.c.valuesat the 0-th offset position to retrieve the value
In all, this takes 4 random IOs, on top of the initial IO to retrieve the footer