Skip to content

Better data layout for nested/repeated schemas #4889

@a10y

Description

@a10y

Problem: Currently, Vortex has poor support for nested data types.

Consider the following schema:

D describe events;
┌─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │                                                                 column_type                                                                 │  null   │   key   │ default │  extra  │
│   varchar   │                                                                   varchar                                                                   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ VARCHAR                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ type        │ VARCHAR                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ actor       │ STRUCT(id BIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)                               │ YES     │ NULL    │ NULL    │ NULL    │
│ repo        │ STRUCT(id BIGINT, "name" VARCHAR, url VARCHAR)                                                                                              │ YES     │ NULL    │ NULL    │ NULL    │
│ payload     │ STRUCT("action" VARCHAR, issue STRUCT(url VARCHAR, repository_url VARCHAR, labels_url VARCHAR, comments_url VARCHAR, events_url VARCHAR, …  │ YES     │ NULL    │ NULL    │ NULL    │
│ public      │ BOOLEAN                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ created_at  │ TIMESTAMP                                                                                                                                   │ YES     │ NULL    │ NULL    │ NULL    │
│ org         │ STRUCT(id BIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)                                                      │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘

Currently the way our StructStrategy works, this is written with the following layout tree:

StructLayout
|__ "payload" child (Zoned) dtype=struct{id, name,}
	|__ data: FlatLayout
	|__ zones: ZoneMap(null_count)
... (other columns)

Putting the entire Struct column inside of a FlatLayout has many downsides

  • No stats propagation. Currently we can't pushdown operations over the nested fields
  • Terrible random access performance. All nested fields get read together when the FlatLayout is deserialized. This is really bad, and the problems gets worse the larger and bushier the schema
Image

We end up having a similar problem for schemas with LIST fields, both nested or even top-level. Our layout strategies are not aware of complex types and it causes pain where users have to work around it by flattening the schema themselves.

Solution Space

RepDef

Parquet and Lance solve the this problem using repetition and definition levels. There's a lot of prior art here so I won't go into detail, but there are good links covering the basics here and here.

RepDef exists to solve the dual problem of wasted space and multiple buffer access

Weston's article makes a good case for repdef: namely, that it reduces the amount of data that needs to be scanned for a random cell access. You only need 1-2 IOs max to read a specific cell value, and then some linear amount of CPU decoding to find the N-th value and its validity bit.

What RepDef is not as good at is zero-copy deserialization. Converting back into Arrow requires you to unravel the repdef back into the hierarchical format.

Component Shredding

A more simple solution involves a layout that performs full component shredding, and holds onto metadata required to unshred the nested fields back together at scan time.

For example, the layout would be responsible for taking something like the following (JSON)

{
  "a": {
      "b": [
         { "c": true, "d": 1 },
         null,
         { "c": false, "d": 2 },
         { "c": null, "d": 3 },
      ]
   }
}      

Into the following components

ShreddedLayout
  |__ a.b.validity
  |__ a.b.offsets
  |__ a.b.c.values
  |__ a.b.c.validity
  |__ a.b.d.values

Projecting specific components would be pretty simple here, and we can do it zero-copy directly from the file, for example reading the array SELECT d FROM unnest(a.b) involves

  • Decode a.b.validity
  • Decode a.b.d.values

Random access requires extra steps, decoding potentially several buffers to access a single cell. For example, say we wanted to access the value at a.b[0].c, this would require:

  • Decode a.b.validity, check if a.b[0] is null
  • If not, decode a.b.offsets. Find offset for 0-th list
  • Decode a.b.c.validity at the 0-th offset position, to check if null
  • Decode a.b.c.values at the 0-th offset position to retrieve the value

In all, this takes 4 random IOs, on top of the initial IO to retrieve the footer

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions