Better data layout for nested/repeated schemas

**Problem**: Currently, Vortex has poor support for nested data types.

Consider the following schema:

```
D describe events;
┌─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │                                                                 column_type                                                                 │  null   │   key   │ default │  extra  │
│   varchar   │                                                                   varchar                                                                   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ VARCHAR                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ type        │ VARCHAR                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ actor       │ STRUCT(id BIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)                               │ YES     │ NULL    │ NULL    │ NULL    │
│ repo        │ STRUCT(id BIGINT, "name" VARCHAR, url VARCHAR)                                                                                              │ YES     │ NULL    │ NULL    │ NULL    │
│ payload     │ STRUCT("action" VARCHAR, issue STRUCT(url VARCHAR, repository_url VARCHAR, labels_url VARCHAR, comments_url VARCHAR, events_url VARCHAR, …  │ YES     │ NULL    │ NULL    │ NULL    │
│ public      │ BOOLEAN                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ created_at  │ TIMESTAMP                                                                                                                                   │ YES     │ NULL    │ NULL    │ NULL    │
│ org         │ STRUCT(id BIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)                                                      │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
```

Currently the way our StructStrategy works, this is written with the following layout tree:

```
StructLayout
|__ "payload" child (Zoned) dtype=struct{id, name,}
	|__ data: FlatLayout
	|__ zones: ZoneMap(null_count)
... (other columns)
```

Putting the entire Struct column inside of a FlatLayout has many downsides

* **No stats propagation**. Currently we can't pushdown operations over the nested fields
* **Terrible random access performance**. All nested fields get read together when the FlatLayout is deserialized. This is really bad, and the problems gets worse the larger and bushier the schema

<img width="1962" height="1638" alt="Image" src="https://github.com/user-attachments/assets/9c265ff4-5dfd-41c7-bee2-fbacedf5bf21" />

We end up having a similar problem for schemas with LIST fields, both nested or even top-level. Our layout strategies are not aware of complex types and it causes pain where users have to work around it by flattening the schema themselves.

## Solution Space

### RepDef

Parquet and Lance solve the this problem using repetition and definition levels. There's a lot of prior art here so I won't go into detail, but there are good links covering the basics [here](https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper) and [here](https://lancedb.com/blog/columnar-file-readers-in-depth-repetition-definition-levels/).

RepDef exists to solve the dual problem of wasted space and multiple buffer access 

Weston's article makes a good case for repdef: namely, that it reduces the amount of data that needs to be scanned for a random cell access. You only need 1-2 IOs max to read a specific cell value, and then some linear amount of CPU decoding to find the N-th value and its validity bit.

What RepDef is not as good at is zero-copy deserialization. Converting back into Arrow requires you to unravel the repdef back into the hierarchical format.

### Component Shredding

A more simple solution involves a layout that performs full component shredding, and holds onto metadata required to unshred the nested fields back together at scan time.

For example, the layout would be responsible for taking something like the following (JSON)

```json
{
  "a": {
      "b": [
         { "c": true, "d": 1 },
         null,
         { "c": false, "d": 2 },
         { "c": null, "d": 3 },
      ]
   }
}      
```

Into the following components

```
ShreddedLayout
  |__ a.b.validity
  |__ a.b.offsets
  |__ a.b.c.values
  |__ a.b.c.validity
  |__ a.b.d.values
```

Projecting specific components would be pretty simple here, and we can do it zero-copy directly from the file, for example reading the array `SELECT d FROM unnest(a.b)` involves

* Decode `a.b.validity`
* Decode `a.b.d.values`


Random access requires extra steps, decoding potentially several buffers to access a single cell. For example, say we wanted to access the value at `a.b[0].c`, this would require:

* Decode `a.b.validity`, check if `a.b[0]` is null
* If not, decode `a.b.offsets`. Find offset for 0-th list
* Decode `a.b.c.validity` at the 0-th offset position, to check if null
* Decode `a.b.c.values` at the 0-th offset position to retrieve the value

In all, this takes 4 random IOs, on top of the initial IO to retrieve the footer

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Better data layout for nested/repeated schemas #4889

Solution Space

RepDef

Component Shredding

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Better data layout for nested/repeated schemas #4889

Description

Solution Space

RepDef

Component Shredding

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions