Skip to content

Allow running reduce_parent operations on stack allocated parents#7751

Open
robert3005 wants to merge 5 commits intodevelopfrom
refactor/parent-ref-stack-allocated
Open

Allow running reduce_parent operations on stack allocated parents#7751
robert3005 wants to merge 5 commits intodevelopfrom
refactor/parent-ref-stack-allocated

Conversation

@robert3005
Copy link
Copy Markdown
Contributor

Add indirection logic to support stack allocated array parents in reduce rules.

This lets us avoid heap allocation for commonly used operations that are often
immediately optimised away

robert3005 added 2 commits May 1, 2026 09:31
`ArrayRef::slice/filter/take` previously allocated a wrapper array
(`SliceArray`, `FilterArray`, `DictArray`) and called `.optimize()` on
the resulting `ArrayRef`, relying on a `reduce_parent` rule to throw the
wrapper away. The wrapper allocation was always paid even when reduction
ran in one shot.

This change establishes the API surface for moving the wrapper onto the
stack:

- `ParentRef<'a>`: a parent representation that optionally borrows an
  `&ArrayRef` and otherwise carries the encoding-specific data, dtype,
  length, slots, and encoding id directly. `into_array_ref(self)` clones
  the underlying `Arc` when the parent is heap-backed.
- `ParentView<'a, V>`: a typed view of a parent that derefs to
  `V::ArrayData` without holding an `&ArrayRef`. Used by the upcoming
  matcher path that accepts stack-allocated parents.
- `DynArray::data_any` exposes the encoding-specific data so a matcher
  can downcast to `V::ArrayData` from a `ParentRef` regardless of
  whether the parent is heap- or stack-backed.
- `ArrayParts<V>::optimize`, `optimize_ctx(session)`, and `into_array`,
  plus `Optimized<V>` with its own `into_array`. Callers chain
  `parts.optimize()?.into_array()` so reduction is an explicit,
  orthogonal step from materialization.
- `ArrayRef::slice / filter / take` now build an `ArrayParts<V>` and
  drive it through the chain.

The internals of `ArrayParts::optimize` still materialize before
running the existing `reduce_parent` chain, so this PR does not yet
remove the `Arc<ArrayInner<V>>` allocation. Wiring `ParentRef` through
`DynArray::reduce_parent`, `VTable::reduce_parent`, `ParentRuleSet`,
`DynArrayParentReduceRule`, `ReduceParentFn`, the `Matcher` API, and
the per-encoding rule bodies is the follow-up that delivers the
allocation savings.

### Tests
- `cargo build --workspace`
- `cargo nextest run -p vortex-array -p vortex-fastlanes -p vortex-fsst -p vortex-alp -p vortex-runend -p vortex-zigzag` (all 3078 pass; the 21 skipped are timezone-dependent and unrelated to this change)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all -- --check`
- `./scripts/public-api.sh`

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
…r slice

Builds on the previous commit's `ArrayParts::optimize` chain by flipping the
`reduce_parent` dispatch signatures to take `&ParentRef<'_>` and wiring the
slice path through the stack-allocated parent so `ArrayRef::slice` never
allocates a wrapper `Arc<ArrayInner<Slice>>` when the child encoding has a
`SliceReduce` impl.

### Signature changes

The following all flip `parent: &ArrayRef` -> `parent: &ParentRef<'_>`:

- `DynArray::reduce_parent`, `VTable::reduce_parent`,
  `ArrayRef::reduce_parent` delegate.
- `DynArrayParentReduceRule::matches` / `::reduce_parent`,
  `ParentRuleSet::evaluate`, `ParentReduceRuleAdapter`.
- `ReduceParentFn` (session-registered kernels). The lookup keys
  `(parent_id, child_id)` are unchanged.
- All 34 `impl VTable::reduce_parent` overrides — mechanical change because
  every impl just delegates to `PARENT_RULES.evaluate(array, parent, idx)`.

The optimizer's `try_optimize` and the executor's `walk` build a
`ParentRef::from_array_ref(&current_array)` once and reuse it across the
slot loop.

### How existing rules keep working

- `Matcher` gains a `try_match_parent` default that delegates to the existing
  `try_match` when the parent is heap-allocated and returns `None` otherwise.
  So existing matchers (`AnyArray`, `AnyCanonical`, `AnyTemporal`,
  `AnyScalarFn`, `ExactScalarFn`, plus the blanket `impl<V: VTable> Matcher
  for V`) work unchanged for heap-backed parents.
- `ArrayParentReduceRule` gains a `reduce_parent_ref` default that uses
  `try_match_parent` and delegates to `reduce_parent`. All 19 existing rule
  bodies keep their `Match<'_>` signature; the dispatch from
  `ParentReduceRuleAdapter` now goes through `reduce_parent_ref`.
- `ArrayParts::optimize_inner` walks slots with a stack `ParentRef::from_parts`
  first; if no rule fires (e.g. the existing matchers don't match a stack
  parent), it falls back to materializing into an `Arc<ArrayInner<V>>` and
  running `ArrayRef::optimize`, preserving today's behaviour for filter,
  take, and any other rule that has not been migrated to the stack path.

### Stack path for slice

`SliceReduceAdaptor::reduce_parent_ref` is overridden to extract the parent
via `ParentRef::try_view::<Slice>` (which works for both heap- and
stack-allocated parents). When `ArrayRef::slice` is called:

1. Builds `ArrayParts<Slice>` on the stack with a `SliceData` and the
   single child slot.
2. `ArrayParts::optimize` walks the slot, calling the child's
   `reduce_parent` with a stack `ParentRef`.
3. `SliceReduceAdaptor::reduce_parent_ref` downcasts via `try_view::<Slice>`,
   bypasses the default `try_match_parent` (which would have returned `None`
   for the stack parent), runs `precondition` and `<V as SliceReduce>::slice`,
   and returns the reduced array.

No wrapper `Arc<ArrayInner<Slice>>` is allocated for the slice itself; the
fallback only fires when the child encoding has no `SliceReduce` impl.

### Tests

- `cargo build --workspace`
- `cargo nextest run -p vortex-array` (2487 pass, 21 skipped — same set of
  timezone-dependent skips as before; unrelated to this change)
- `cargo nextest run -p vortex-fastlanes -p vortex-fsst -p vortex-alp -p
  vortex-runend -p vortex-zigzag -p vortex-pco -p vortex-datetime-parts -p
  vortex-decimal-byte-parts -p vortex-bytebool -p vortex-zstd -p
  vortex-sparse -p vortex-sequence` (893 pass)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all`
- `./scripts/public-api.sh`

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 robert3005 force-pushed the refactor/parent-ref-stack-allocated branch from 3751910 to 08c9f4a Compare May 1, 2026 14:51
Comment thread vortex-array/src/array/erased.rs Outdated
Comment thread vortex-array/src/matcher.rs Outdated
Comment thread vortex-array/src/array/parent.rs
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 1, 2026

Merging this PR will degrade performance by 24.95%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 7 regressed benchmarks
✅ 1161 untouched benchmarks
⏩ 138 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation filter_all_true[100000] 7.6 µs 9.5 µs -19.79%
Simulation filter_all_true[10000] 8.6 µs 10.4 µs -18%
Simulation filter_all_true[250000] 7.6 µs 9.5 µs -19.79%
Simulation filter_powerlaw_by_correlated_runs[1000] 28.8 µs 32.7 µs -11.9%
Simulation varbinview_zip_block_mask 2.9 ms 3.7 ms -21.65%
Simulation varbinview_zip_fragmented_mask 6.5 ms 7.3 ms -10.29%
Simulation new_bp_prim_test_between[i64, 32768] 177.3 µs 236.2 µs -24.95%
Simulation bitwise_not_vortex_buffer_mut[128] 275.3 ns 246.1 ns +11.85%

Comparing refactor/parent-ref-stack-allocated (af552a8) with develop (5e5572b)

Open in CodSpeed

Footnotes

  1. 138 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Comment thread vortex-array/src/array/typed.rs
Comment thread encodings/fastlanes/src/bitpacking/vtable/operations.rs
robert3005 and others added 3 commits May 1, 2026 11:30
- `ArrayParts::optimize` and `optimize_ctx` now take `&self` and return
  `Option<ArrayRef>` instead of consuming `self` and returning a custom
  `Optimized<V>` enum. This drops `Optimized<V>` from the public API and
  makes `optimize` follow the usual Rust convention of borrowing what it
  doesn't need to consume.
- The materialize-and-then-heap-optimize fallback that used to live inside
  `optimize_inner` now lives at the call sites for `filter` and `take`
  (which haven't migrated their adaptors to the stack path yet). For
  `slice`, the fallback is unreachable because `SliceReduceAdaptor` already
  overrides `reduce_parent_ref`.
- `Matcher::matches_parent` is overridden for the blanket `impl<V: VTable>
  Matcher for V` to test by encoding-specific data downcast via
  `ParentRef::try_view`. Without this, `ParentRuleSet` skips the rule for
  stack-backed parents because the default `try_match_parent` returns
  `None`, so `SliceReduceAdaptor::reduce_parent_ref` was never invoked on
  the slice fast path. The override matches both heap- and stack-backed
  parents uniformly.
- `SliceArray::try_new_parts` / `new_parts`,
  `FilterArray::try_new_parts` / `new_parts`, and
  `DictArray::try_new_parts` build the corresponding `ArrayParts<V>`
  without materializing. The existing `try_new` / `new` constructors
  delegate. Call sites in `ArrayRef::slice` / `filter` / `take` use them
  directly:

  ```rust
  let parts = SliceArray::try_new_parts(self.clone(), range)?;
  let sliced = match parts.optimize()? {
      Some(reduced) => reduced,
      None => parts.into_array()?,
  };
  ```

### Tests
- `cargo build --workspace`
- `cargo nextest run -p vortex-array` (2487 pass, 21 timezone-dependent
  skips)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all`
- `./scripts/public-api.sh`

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
Signed-off-by: Robert Kruszewski <github@robertk.io>
Signed-off-by: Robert Kruszewski <github@robertk.io>
@joseph-isaacs joseph-isaacs added the action/benchmark Trigger full benchmarks to run on this PR label May 1, 2026
@robert3005 robert3005 added changelog/break A breaking API change changelog/performance A performance improvement and removed action/benchmark Trigger full benchmarks to run on this PR changelog/performance A performance improvement labels May 1, 2026
{
for plugin in plugins.as_ref() {
if let Some(reduced) = plugin(child, &parent_ref, slot_idx)? {
return Ok(Some(cascade(reduced, session)?));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check here if slot_idx equals slots length and not call cascade if it is

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/break A breaking API change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants