
Tracking Issue: Add DType::Union #7882

@connortsui20

Description

This is a tracking issue for adding the DType::Union variant and canonical encoding for Union.

Motivation

See the parent epic issue: #7705

Design

We will be following the Arrow specification of Union, though we have an important decision to make here. Arrow specifies both a dense and a sparse union layout. Both layouts (encodings in Vortex) would have child arrays for each type that a value could be, as well as a types child array that specifies which type in the Union array a given row is (and thus, which child array to look at to fetch the correct value).

Dense vs Sparse

Dense encoding: each child array contains only the values that actually belong to that type, packed contiguously. An additional offsets buffer tells us which slot of the chosen child holds the value for that row.

Slicing and filtering require walking type_ids to recompute per-child offsets.

type_ids: [0, 1, 0, 1, 0]
offsets:  [0, 0, 1, 1, 2]
child 0 (int):    [10, 20, 30]
child 1 (string): ["a", "b"]
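The two indirections of the dense layout can be sketched in plain Python (illustrative only, not Vortex's actual API): row i's value lives at slot offsets[i] of child type_ids[i].

```python
# Dense-union lookup sketch: each child holds only its own values, packed
# contiguously, and `offsets[i]` says which slot of the chosen child holds
# row i's value.
type_ids = [0, 1, 0, 1, 0]
offsets = [0, 0, 1, 1, 2]
children = {0: [10, 20, 30], 1: ["a", "b"]}

def dense_value(i):
    # Two indirections: pick the child by type id, then the slot by offset.
    return children[type_ids[i]][offsets[i]]

assert [dense_value(i) for i in range(5)] == [10, "a", 20, "b", 30]
```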

Sparse encoding: every child array has the same length as the union itself, so the value at row i is simply found at position i of child type_ids[i].

The "inactive" slots of each child are unused but must still be addressable. There is no offsets buffer.

type_ids: [0, 1, 0, 1, 0]
child 0 (int):    [10, _, 20, _, 30]   // _ slots are inactive
child 1 (string): [_, "a", _, "b", _]
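The sparse lookup needs only one indirection, which can be sketched the same way (illustrative Python, not Vortex code):

```python
# Sparse-union lookup sketch: every child is union-length, so row i's value
# sits at position i of child `type_ids[i]`; no offsets buffer is needed.
type_ids = [0, 1, 0, 1, 0]
children = {
    0: [10, None, 20, None, 30],   # None marks inactive slots
    1: [None, "a", None, "b", None],
}

def sparse_value(i):
    return children[type_ids[i]][i]

assert [sparse_value(i) for i in range(5)] == [10, "a", 20, "b", 30]
```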

Why Sparse should be Canonical (and not Dense)

We want operations like slice and filter to be fast on our canonical arrays, and sparse is clearly superior to dense here: dense must perform compute just to figure out how to slice its child arrays.
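To make that cost difference concrete, here is a hypothetical sketch (plain Python, not Vortex code): slicing a sparse union is a uniform slice of type_ids and every child, while slicing a dense union must walk the rows to repack values and recompute per-child offsets.

```python
# Sparse slice: apply the same range to type_ids and every child. O(1) per
# buffer if the underlying slices are zero-copy views.
def slice_sparse(type_ids, children, start, stop):
    return type_ids[start:stop], {t: c[start:stop] for t, c in children.items()}

# Dense slice: walk the sliced rows to repack each child's values and rebuild
# the offsets buffer. This is the extra compute a canonical dense union would
# pay on every slice.
def slice_dense(type_ids, offsets, children, start, stop):
    new_children = {t: [] for t in children}
    new_type_ids, new_offsets = [], []
    for i in range(start, stop):
        t = type_ids[i]
        new_offsets.append(len(new_children[t]))
        new_children[t].append(children[t][offsets[i]])
        new_type_ids.append(t)
    return new_type_ids, new_offsets, new_children

# Using the example data from above:
type_ids = [0, 1, 0, 1, 0]
offsets = [0, 0, 1, 1, 2]
dense_children = {0: [10, 20, 30], 1: ["a", "b"]}
sparse_children = {0: [10, None, 20, None, 30], 1: [None, "a", None, "b", None]}

tids, kids = slice_sparse(type_ids, sparse_children, 1, 4)
assert tids == [1, 0, 1] and kids[1] == ["a", None, "b"]

tids, offs, kids = slice_dense(type_ids, offsets, dense_children, 1, 4)
assert tids == [1, 0, 1] and offs == [0, 0, 1]
assert kids == {0: [20], 1: ["a", "b"]}
```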

Additionally, the space loss of sparse in Arrow does not necessarily translate to Vortex, because our canonical types do not require the children to also be canonical. This means that the (mostly inactive) children under a sparse Union (for example, if I have 99000 integers and 1000 strings all separated) can themselves be encoded with a SparseArray, so we would not lose much space compared to the dense union encoding.

It then follows that DenseUnionArray can just be a different, non-canonical physical encoding of Union. The compressor might have to be smarter about when to compress into Dense, though, since a sparse Union is almost always more performant for the compute we want to do.

Type IDs

Per Arrow's Schema.fbs: "By default ids in the type vector refer to the offsets in the children. Optionally typeIds provides an indirection between the child offset and the type id for each child."

So the per-row type tag defaults to consecutive 0..N matching child offsets, but an optional typeIds: [i8] array can provide indirection. For example, children [a, b, c] with typeIds = [0, 5, 7] means the data buffer uses tags 0, 5, and 7 to select them. (This matters for schema evolution: removing a child doesn't renumber the others.)
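A toy illustration of that indirection (hypothetical Python, with strings standing in for child arrays):

```python
# typeIds indirection sketch: children [a, b, c] are selected in the data by
# tags 0, 5, 7 rather than by their positions 0, 1, 2.
children = ["a_child", "b_child", "c_child"]   # placeholder child arrays
type_ids = [0, 5, 7]                            # tag -> child-offset mapping

def child_for_tag(tag):
    # Resolve a per-row tag to a child via the typeIds table.
    return children[type_ids.index(tag)]

assert child_for_tag(5) == "b_child"

# Dropping child b removes only its entry; tags 0 and 7 still resolve
# unchanged, which is the schema-evolution benefit.
del children[1]
del type_ids[1]
assert child_for_tag(7) == "c_child"
```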

UnionVariants will have to carry this information.

Also note that UnionVariants will enforce that variant names are unique. The Arrow specification says nothing about this, but we can enforce it, since allowing duplicate variant names would be unworkable.

Unresolved questions

  • The Arrow specification states that a union array's nullability is determined solely by its child arrays and the types array (a row is null if and only if the child selected by the types array is null at that row). Is this something we need to follow? It would differ from the rest of our canonical types, which all track their validity in a dedicated validity mask.
    • I think the answer is that a dedicated mask doesn't convey any information we can't already express. So we will follow the Arrow spec, and validity will be lazily computed: essentially a gather from the children's validity masks.
  • TODO?
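The lazily-computed validity described above amounts to a gather, which can be sketched like so (illustrative Python; the names are not Vortex's API):

```python
# Validity-as-gather sketch for a sparse union: row i is valid iff the chosen
# child's validity mask is true at position i. Inactive slots in a child's
# mask are never consulted, so their contents don't matter.
type_ids = [0, 1, 0, 1, 0]
child_validity = {
    0: [True, False, True, False, False],
    1: [False, True, False, False, False],
}

def union_validity(n):
    # Gather one bit per row from the selected child's mask.
    return [child_validity[type_ids[i]][i] for i in range(n)]

assert union_validity(5) == [True, True, True, False, False]
```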

Steps

  • Add DType::Union variant + UnionVariants (flatbuffer/proto schema updates, round-trip tests): Add DType::Union and UnionVariants #7890
  • Canonical sparse UnionArray (vtable, validation, Canonical::Union, Canonical::empty)
  • Compute kernels: slice, take, filter, mask, scalar accessor, cast
  • Arrow to_arrow / from_arrow round-trip (handles both dense and sparse Arrow unions → canonical sparse Vortex)
  • UnionBuilder + UnionScalar (+ scalar.proto update)
  • DenseUnionArray as a separate non-canonical physical encoding + compressor changes?

Implementation history

Labels: tracking-issue (Shared implementation context for work likely to span multiple PRs.)