Support file stats on nested fields #6389

@asubiotto

Discussed in #5917

Problem

Currently, only stats for top-level fields are stored in the Vortex file footer:

/// Note: for now this only collects top-level struct fields.

Chunk-level pruning on nested fields already works, but file-level stats for nested fields are missing. This prevents whole-file pruning on nested fields, as well as other optimizations. For example, we have some queries that could be answered entirely from nested file stats, which would remove extra object storage reads.
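To illustrate what whole-file pruning on a nested field could look like, here is a minimal sketch. `FieldStats` and `can_skip_file_gt` are hypothetical names made up for this example, not Vortex APIs, and the sketch assumes the stored min/max are exact bounds for a nested integer field such as `user.age`.

```rust
/// Hypothetical file-level stats for one nested field, e.g. `user.age`.
/// Assumption: `min` and `max` are exact bounds over all non-null values in the file.
pub struct FieldStats {
    pub min: i64,
    pub max: i64,
    pub null_count: u64,
}

/// Returns true when no row in the file can satisfy `field > threshold`,
/// so the whole file can be skipped without reading any chunk.
pub fn can_skip_file_gt(stats: &FieldStats, threshold: i64) -> bool {
    // If even the largest value is not above the threshold, nothing matches.
    stats.max <= threshold
}

/// A query like `SELECT max(field)` can be answered from the footer alone.
pub fn answer_max(stats: &FieldStats) -> i64 {
    stats.max
}
```

With stats `{min: 18, max: 65}`, the predicate `user.age > 70` lets the reader skip the file, while `user.age > 30` does not.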

Proposed approach

Store the stats sets in the order produced by a post-order walk of the DType. Each leaf and each nullable struct gets a stats set entry:

{name=utf8?, age=i32} → [name stats_set, age stats_set]
{name=utf8?, age=i32}? → [name stats_set, age stats_set, struct stats_set (null count)]

Extend the stats set flatbuffer with an is_nested: bool = false field for backward compatibility. When false, existing behavior is preserved; when true, entries follow the post-order walk.

Lists: For list(T), store stats only for the element column.
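The ordering described above, including the list rule, can be sketched as follows. This is only an illustration under stated assumptions: the `DType` enum here is a simplified stand-in rather than the real Vortex type, and `stats_slots`, the dotted paths, the `[]` suffix for list elements, and the `<root>` label are all hypothetical conventions chosen for the example.

```rust
/// Simplified stand-in for a Vortex DType (hypothetical, not the real type).
#[derive(Debug, Clone)]
pub enum DType {
    /// Any non-nested type such as i32 or utf8.
    Leaf { nullable: bool },
    Struct { fields: Vec<(String, DType)>, nullable: bool },
    List { element: Box<DType>, nullable: bool },
}

/// Returns one label per stats set entry, in post-order.
pub fn stats_slots(dtype: &DType) -> Vec<String> {
    let mut out = Vec::new();
    walk(dtype, String::new(), &mut out);
    out
}

fn walk(dtype: &DType, path: String, out: &mut Vec<String>) {
    match dtype {
        // Every leaf gets an entry.
        DType::Leaf { .. } => out.push(label(&path)),
        DType::Struct { fields, nullable } => {
            // Children first (post-order) ...
            for (name, child) in fields {
                let child_path = if path.is_empty() {
                    name.clone()
                } else {
                    format!("{path}.{name}")
                };
                walk(child, child_path, out);
            }
            // ... then the struct itself, but only if it is nullable
            // (its entry carries the null count).
            if *nullable {
                out.push(label(&path));
            }
        }
        // Lists get no entry of their own; only the element column does.
        DType::List { element, .. } => walk(element, format!("{path}[]"), out),
    }
}

/// Hypothetical display label; the top-level value has an empty path.
fn label(path: &str) -> String {
    if path.is_empty() {
        "<root>".to_string()
    } else {
        path.to_string()
    }
}
```

For `{name=utf8?, age=i32}` this yields `["name", "age"]`, and for the nullable variant `["name", "age", "<root>"]`, matching the two examples above. A struct `{tags=list(utf8), age=i32}` yields `["tags[]", "age"]`.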
