[Parquet][C++] Add experimental VECTOR repetition level for Arrow FixedSizeList #50

Draft
rok wants to merge 6 commits into main from vector_repetition_level

@rok rok commented May 15, 2026

This PR prototypes a new experimental Parquet repetition type, VECTOR, mainly for Arrow's FixedSizeList<T, N> as proposed in Option B here.

Today FixedSizeList is written through Parquet LIST, which means we pay repetition/definition-level costs for an inner shape that is already fixed. VECTOR stores the fixed length in the Parquet schema as vector_length and avoids the inner repetition level.
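To make that cost concrete, here is a minimal sketch of the levels a LIST-encoded column carries when every row has exactly N elements. It assumes the standard three-level LIST layout with a required outer group and required elements (so max repetition level 1 and max definition level 1). `ListRepLevels` and `ListLevelCount` are hypothetical names for illustration, not functions from this PR or from Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Repetition levels for `num_rows` lists of exactly `n` elements each,
// assuming a required LIST with required elements (max rep level = 1).
// 0 marks the first element of a new row, 1 continues the current list.
std::vector<int16_t> ListRepLevels(std::size_t num_rows, std::size_t n) {
  assert(n > 0);
  std::vector<int16_t> levels;
  levels.reserve(num_rows * n);
  for (std::size_t row = 0; row < num_rows; ++row) {
    for (std::size_t i = 0; i < n; ++i) {
      levels.push_back(i == 0 ? 0 : 1);
    }
  }
  return levels;
}

// Total level entries under the same assumption: one repetition level and
// one definition level per element.
std::size_t ListLevelCount(std::size_t num_rows, std::size_t n) {
  assert(n > 0);
  return 2 * num_rows * n;
}
```

For two rows of length 3, `ListRepLevels(2, 3)` yields `0 1 1 0 1 1`. The pattern is fully determined by N, which is the redundancy that recording vector_length in the schema is meant to remove.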

Writing FixedSizeList<T, N> as VECTOR is opt-in via ArrowWriterProperties::Builder::enable_experimental_vector_encoding(). Without it, the writer defaults to LIST.

Implemented in this prototype

  • Parquet schema/thrift addition of VECTOR
  • Arrow schema conversion:
    • FixedSizeList -> VECTOR when explicitly enabled
    • VECTOR -> FixedSizeList
  • required and nullable rows in VECTOR
  • nullable primitive elements in VECTOR
  • primitive type vectors
  • limited composability proof of concept: FixedSizeList<struct<x: float, y: int32>, N>
  • roundtrip/schema/path tests
  • read/write/roundtrip benchmarks

Deferred for now:

  • dictionary / non-PLAIN encodings
  • statistics
  • page index
  • general nested VECTOR children (non-fixed-size children will not be possible)

Compatibility

We expect non-VECTOR-aware readers to fail when encountering VECTOR repetition level.
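As a rough illustration of why older readers are expected to fail, the sketch below models a reader that only knows the three existing Parquet repetition types (REQUIRED = 0, OPTIONAL = 1, REPEATED = 2). The enum value 3 used for VECTOR in the note that follows is a placeholder assumption, since the thrift value is not stated here, and `ParseRepetition` is a hypothetical helper. Actual readers may fail in a different way or at a different point.

```cpp
#include <stdexcept>
#include <string>

// The repetition types a pre-VECTOR reader knows about.
enum class Repetition : int { REQUIRED = 0, OPTIONAL = 1, REPEATED = 2 };

// Maps a raw schema value to a known repetition type and rejects
// anything else, which is how an unaware reader would treat VECTOR.
Repetition ParseRepetition(int raw) {
  switch (raw) {
    case 0:
      return Repetition::REQUIRED;
    case 1:
      return Repetition::OPTIONAL;
    case 2:
      return Repetition::REPEATED;
    default:
      throw std::invalid_argument("unknown repetition type: " +
                                  std::to_string(raw));
  }
}
```

Under these assumptions, `ParseRepetition(3)` throws instead of silently misreading the column. Failing loudly is the safer outcome for a format change like this one.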

Benchmark snapshot

All benchmarks use FixedSizeList<float, {80,768,10k,100k}>.
Numbers below are from a local debug build, so directional only.

Non-nullable vector

| Vector length | LIST write | VECTOR write | Write speedup | LIST read | VECTOR read | Read speedup | LIST roundtrip | VECTOR roundtrip | Roundtrip speedup |
|---|---|---|---|---|---|---|---|---|---|
| 80 | 80.29k rows/s | 13.58M rows/s | 169x | 108.94k rows/s | 110.26M rows/s | 1012x | 50.25k rows/s | 12.52M rows/s | 249x |
| 768 | 9.05k rows/s | 1.50M rows/s | 166x | 13.18k rows/s | 11.38M rows/s | 863x | 5.32k rows/s | 1.23M rows/s | 230x |
| 10,000 | 701.42 rows/s | 140.62k rows/s | 200x | 1.01k rows/s | 871.62k rows/s | 861x | 411.96 rows/s | 131.99k rows/s | 320x |
| 100,000 | 70.45 rows/s | 16.08k rows/s | 228x | 101.06 rows/s | 87.30k rows/s | 864x | 41.29 rows/s | 12.51k rows/s | 303x |
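For readers who want to check the figures, the speedup columns are the ratio of VECTOR to LIST throughput, and multiplying rows/s by the vector length gives a per-element rate. The helpers below are a small sketch of that arithmetic. `Speedup` and `ElementsPerSecond` are illustrative names only, not part of the benchmark code.

```cpp
#include <cassert>

// Ratio of VECTOR throughput to LIST throughput, both in rows/s.
double Speedup(double list_rows_per_s, double vector_rows_per_s) {
  assert(list_rows_per_s > 0.0);
  return vector_rows_per_s / list_rows_per_s;
}

// Converts a rows/s figure into elements/s for a given vector length.
double ElementsPerSecond(double rows_per_s, double vector_length) {
  assert(vector_length > 0.0);
  return rows_per_s * vector_length;
}
```

For the length-80 write case, `Speedup(80.29e3, 13.58e6)` is about 169, matching the table. `ElementsPerSecond` puts LIST write at roughly 6.4M elements/s for length 80 and roughly 7.0M elements/s for length 100,000, so the LIST write cost in the non-nullable case is close to constant per element.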

Nullable vector

| Vector length | LIST write | VECTOR write | Write speedup | LIST read | VECTOR read | Read speedup | LIST roundtrip | VECTOR roundtrip | Roundtrip speedup |
|---|---|---|---|---|---|---|---|---|---|
| 80 | 144.77k rows/s | 246.10k rows/s | 1.7x | 236.82k rows/s | 1.07M rows/s | 4.5x | 95.86k rows/s | 230.77k rows/s | 2.4x |
| 768 | 17.76k rows/s | 71.77k rows/s | 4.0x | 25.98k rows/s | 128.66k rows/s | 5.0x | 10.57k rows/s | 46.66k rows/s | 4.4x |
| 10,000 | 1.39k rows/s | 7.18k rows/s | 5.2x | 2.01k rows/s | 10.39k rows/s | 5.2x | 819.23 rows/s | 4.27k rows/s | 5.2x |
| 100,000 | 139.88 rows/s | 707.80 rows/s | 5.1x | 201.00 rows/s | 1.03k rows/s | 5.1x | 81.94 rows/s | 421.25 rows/s | 5.1x |

@rok rok changed the title from "Vector repetition level proposal" to "Draft: [Parquet][C++] Add experimental VECTOR repetition support for Arrow FixedSizeList" May 15, 2026
@rok rok changed the title from "Draft: [Parquet][C++] Add experimental VECTOR repetition support for Arrow FixedSizeList" to "[Parquet][C++] Add experimental VECTOR repetition level for Arrow FixedSizeList" May 15, 2026
Repository owner deleted a comment from github-actions Bot May 15, 2026
@rok rok marked this pull request as draft May 15, 2026 19:22
Repository owner deleted a comment from github-actions Bot May 15, 2026
Repository owner deleted a comment from github-actions Bot May 15, 2026