Allow running reduce_parent operations on stack allocated parents #7751
robert3005 wants to merge 5 commits into `develop` from `refactor/parent-ref-stack-allocated`
Conversation
`ArrayRef::slice/filter/take` previously allocated a wrapper array (`SliceArray`, `FilterArray`, `DictArray`) and called `.optimize()` on the resulting `ArrayRef`, relying on a `reduce_parent` rule to throw the wrapper away. The wrapper allocation was always paid, even when reduction ran in one shot. This change establishes the API surface for moving the wrapper onto the stack:

- `ParentRef<'a>`: a parent representation that optionally borrows an `&ArrayRef` and otherwise carries the encoding-specific data, dtype, length, slots, and encoding id directly. `into_array_ref(self)` clones the underlying `Arc` when the parent is heap-backed.
- `ParentView<'a, V>`: a typed view of a parent that derefs to `V::ArrayData` without holding an `&ArrayRef`. Used by the upcoming matcher path that accepts stack-allocated parents.
- `DynArray::data_any` exposes the encoding-specific data so a matcher can downcast to `V::ArrayData` from a `ParentRef` regardless of whether the parent is heap- or stack-backed.
- `ArrayParts<V>::optimize`, `optimize_ctx(session)`, and `into_array`, plus `Optimized<V>` with its own `into_array`. Callers chain `parts.optimize()?.into_array()` so reduction is an explicit step, orthogonal to materialization.
- `ArrayRef::slice / filter / take` now build an `ArrayParts<V>` and drive it through the chain.

The internals of `ArrayParts::optimize` still materialize before running the existing `reduce_parent` chain, so this PR does not yet remove the `Arc<ArrayInner<V>>` allocation. Wiring `ParentRef` through `DynArray::reduce_parent`, `VTable::reduce_parent`, `ParentRuleSet`, `DynArrayParentReduceRule`, `ReduceParentFn`, the `Matcher` API, and the per-encoding rule bodies is the follow-up that delivers the allocation savings.
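The shape of the two-variant parent can be sketched with a self-contained mock. Everything here is a stand-in, not the vortex API: `ArrayInner` is a hypothetical simplification carrying only two of the fields listed above, and the point is just the heap/stack split and the `Arc`-clone-vs-allocate behaviour of `into_array_ref`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real array payload; only two fields modelled.
#[derive(Debug)]
struct ArrayInner {
    encoding_id: &'static str,
    len: usize,
}

type ArrayRef = Arc<ArrayInner>;

// A parent either borrows a heap-backed ArrayRef or carries the
// encoding-specific data directly on the stack.
enum ParentRef<'a> {
    Heap(&'a ArrayRef),
    Stack { encoding_id: &'static str, len: usize },
}

impl<'a> ParentRef<'a> {
    fn from_array_ref(array: &'a ArrayRef) -> Self {
        ParentRef::Heap(array)
    }

    // Heap-backed parents just clone the Arc; stack-backed parents must
    // materialize a fresh allocation.
    fn into_array_ref(self) -> ArrayRef {
        match self {
            ParentRef::Heap(array) => Arc::clone(array),
            ParentRef::Stack { encoding_id, len } => {
                Arc::new(ArrayInner { encoding_id, len })
            }
        }
    }
}

fn main() {
    let heap: ArrayRef = Arc::new(ArrayInner { encoding_id: "slice", len: 8 });
    let borrowed = ParentRef::from_array_ref(&heap);
    let cloned = borrowed.into_array_ref();
    // The heap path bumps the refcount instead of allocating.
    assert_eq!(Arc::strong_count(&heap), 2);

    let stacked = ParentRef::Stack { encoding_id: "slice", len: 8 };
    assert_eq!(stacked.into_array_ref().len, cloned.len);
}
```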
### Tests

- `cargo build --workspace`
- `cargo nextest run -p vortex-array -p vortex-fastlanes -p vortex-fsst -p vortex-alp -p vortex-runend -p vortex-zigzag` (all 3078 pass; the 21 skipped are timezone-dependent and unrelated to this change)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all -- --check`
- `./scripts/public-api.sh`

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
…r slice

Builds on the previous commit's `ArrayParts::optimize` chain by flipping the `reduce_parent` dispatch signatures to take `&ParentRef<'_>` and wiring the slice path through the stack-allocated parent, so `ArrayRef::slice` never allocates a wrapper `Arc<ArrayInner<Slice>>` when the child encoding has a `SliceReduce` impl.

### Signature changes

The following all flip `parent: &ArrayRef` -> `parent: &ParentRef<'_>`:

- `DynArray::reduce_parent`, `VTable::reduce_parent`, and the `ArrayRef::reduce_parent` delegate.
- `DynArrayParentReduceRule::matches` / `::reduce_parent`, `ParentRuleSet::evaluate`, `ParentReduceRuleAdapter`.
- `ReduceParentFn` (session-registered kernels). The lookup keys `(parent_id, child_id)` are unchanged.
- All 34 `impl VTable::reduce_parent` overrides — a mechanical change, because every impl just delegates to `PARENT_RULES.evaluate(array, parent, idx)`.

The optimizer's `try_optimize` and the executor's `walk` build a `ParentRef::from_array_ref(&current_array)` once and reuse it across the slot loop.

### How existing rules keep working

- `Matcher` gains a `try_match_parent` default that delegates to the existing `try_match` when the parent is heap-allocated and returns `None` otherwise. Existing matchers (`AnyArray`, `AnyCanonical`, `AnyTemporal`, `AnyScalarFn`, `ExactScalarFn`, plus the blanket `impl<V: VTable> Matcher for V`) therefore work unchanged for heap-backed parents.
- `ArrayParentReduceRule` gains a `reduce_parent_ref` default that uses `try_match_parent` and delegates to `reduce_parent`. All 19 existing rule bodies keep their `Match<'_>` signature; dispatch from `ParentReduceRuleAdapter` now goes through `reduce_parent_ref`.
- `ArrayParts::optimize_inner` walks slots with a stack `ParentRef::from_parts` first; if no rule fires (e.g. the existing matchers don't match a stack parent), it falls back to materializing into an `Arc<ArrayInner<V>>` and running `ArrayRef::optimize`, preserving today's behaviour for filter, take, and any other rule that has not been migrated to the stack path.

### Stack path for slice

`SliceReduceAdaptor::reduce_parent_ref` is overridden to extract the parent via `ParentRef::try_view::<Slice>` (which works for both heap- and stack-allocated parents). When `ArrayRef::slice` is called:

1. Build an `ArrayParts<Slice>` on the stack with a `SliceData` and the single child slot.
2. `ArrayParts::optimize` walks the slot, calling the child's `reduce_parent` with a stack `ParentRef`.
3. `SliceReduceAdaptor::reduce_parent_ref` downcasts via `try_view::<Slice>`, bypasses the default `try_match_parent` (which would have returned `None` for the stack parent), runs `precondition` and `<V as SliceReduce>::slice`, and returns the reduced array.

No wrapper `Arc<ArrayInner<Slice>>` is allocated for the slice itself; the fallback only fires when the child encoding has no `SliceReduce` impl.

### Tests

- `cargo build --workspace`
- `cargo nextest run -p vortex-array` (2487 pass, 21 skipped — same set of timezone-dependent skips as before; unrelated to this change)
- `cargo nextest run -p vortex-fastlanes -p vortex-fsst -p vortex-alp -p vortex-runend -p vortex-zigzag -p vortex-pco -p vortex-datetime-parts -p vortex-decimal-byte-parts -p vortex-bytebool -p vortex-zstd -p vortex-sparse -p vortex-sequence` (893 pass)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all`
- `./scripts/public-api.sh`

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
Signed-off-by: Robert Kruszewski <github@robertk.io>
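The default-method compatibility trick described above can be illustrated with a simplified, self-contained mock. The names mirror the commit message, but `ParentRef`, `Matcher`, `AnyArray`, and `SliceAdaptor` here are hypothetical stand-ins that use `&str` in place of real arrays — not the vortex API.

```rust
// Stand-in parent: either a borrowed heap array or stack-carried data.
#[derive(Clone, Copy)]
enum ParentRef<'a> {
    Heap(&'a str),
    Stack(&'a str),
}

trait Matcher {
    // The pre-existing heap-only entry point.
    fn try_match<'a>(&self, parent: &'a str) -> Option<&'a str>;

    // New default: delegate for heap parents, decline stack parents, so
    // unmigrated matchers keep today's behaviour unchanged.
    fn try_match_parent<'a>(&self, parent: &ParentRef<'a>) -> Option<&'a str> {
        match *parent {
            ParentRef::Heap(array) => self.try_match(array),
            ParentRef::Stack(_) => None,
        }
    }
}

// An unmigrated matcher: only implements the heap entry point.
struct AnyArray;

impl Matcher for AnyArray {
    fn try_match<'a>(&self, parent: &'a str) -> Option<&'a str> {
        Some(parent)
    }
}

// A migrated adaptor overrides the default and handles both variants.
struct SliceAdaptor;

impl Matcher for SliceAdaptor {
    fn try_match<'a>(&self, parent: &'a str) -> Option<&'a str> {
        Some(parent)
    }

    fn try_match_parent<'a>(&self, parent: &ParentRef<'a>) -> Option<&'a str> {
        match *parent {
            ParentRef::Heap(array) => Some(array),
            ParentRef::Stack(data) => Some(data),
        }
    }
}

fn main() {
    let stack = ParentRef::Stack("slice-data");
    // Unmigrated matchers decline stack parents via the default method.
    assert_eq!(AnyArray.try_match_parent(&stack), None);
    // The migrated adaptor matches the stack parent.
    assert_eq!(SliceAdaptor.try_match_parent(&stack), Some("slice-data"));
    // Heap parents still flow through the old entry point.
    assert_eq!(AnyArray.try_match_parent(&ParentRef::Heap("arr")), Some("arr"));
}
```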
robert3005 force-pushed from 3751910 to 08c9f4a
Merging this PR will degrade performance by 24.95%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | filter_all_true[100000] | 7.6 µs | 9.5 µs | -19.79% |
| ❌ | Simulation | filter_all_true[10000] | 8.6 µs | 10.4 µs | -18% |
| ❌ | Simulation | filter_all_true[250000] | 7.6 µs | 9.5 µs | -19.79% |
| ❌ | Simulation | filter_powerlaw_by_correlated_runs[1000] | 28.8 µs | 32.7 µs | -11.9% |
| ❌ | Simulation | varbinview_zip_block_mask | 2.9 ms | 3.7 ms | -21.65% |
| ❌ | Simulation | varbinview_zip_fragmented_mask | 6.5 ms | 7.3 ms | -10.29% |
| ❌ | Simulation | new_bp_prim_test_between[i64, 32768] | 177.3 µs | 236.2 µs | -24.95% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] | 275.3 ns | 246.1 ns | +11.85% |
Comparing refactor/parent-ref-stack-allocated (af552a8) with develop (5e5572b)
Footnote: 138 benchmarks were skipped, so the baseline results were used instead.
- `ArrayParts::optimize` and `optimize_ctx` now take `&self` and return
`Option<ArrayRef>` instead of consuming `self` and returning a custom
`Optimized<V>` enum. This drops `Optimized<V>` from the public API and
makes `optimize` follow the usual Rust convention of borrowing what it
doesn't need to consume.
- The materialize-and-then-heap-optimize fallback that used to live inside
`optimize_inner` now lives at the call sites for `filter` and `take`
(which haven't migrated their adaptors to the stack path yet). For
`slice`, the fallback is unreachable because `SliceReduceAdaptor` already
overrides `reduce_parent_ref`.
- `Matcher::matches_parent` is overridden for the blanket `impl<V: VTable>
  Matcher for V` to match by downcasting the encoding-specific data via
  `ParentRef::try_view`. Without this, `ParentRuleSet` skips the rule for
  stack-backed parents because the default `try_match_parent` returns
  `None`, so `SliceReduceAdaptor::reduce_parent_ref` would never be invoked
  on the slice fast path. The override matches both heap- and stack-backed
  parents uniformly.
- `SliceArray::try_new_parts` / `new_parts`,
`FilterArray::try_new_parts` / `new_parts`, and
`DictArray::try_new_parts` build the corresponding `ArrayParts<V>`
without materializing. The existing `try_new` / `new` constructors
delegate. Call sites in `ArrayRef::slice` / `filter` / `take` use them
directly:
```rust
let parts = SliceArray::try_new_parts(self.clone(), range)?;
let sliced = match parts.optimize()? {
Some(reduced) => reduced,
None => parts.into_array()?,
};
```
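The encoding-by-downcast check that the blanket `matches_parent` override relies on can be sketched with `std::any::Any`. This is a minimal illustration, not the real implementation: `SliceData` and `FilterData` are hypothetical stand-ins for encoding-specific data, and the point is only that the same downcast works regardless of whether the data came from a heap `ArrayRef` or a stack-allocated parent.

```rust
use std::any::Any;

// Hypothetical stand-ins for encoding-specific data types.
struct SliceData {
    offset: usize,
}
struct FilterData;

// Matching by downcast: succeeds iff the erased data is SliceData,
// independent of where the parent itself was allocated.
fn matches_slice(data: &dyn Any) -> bool {
    data.downcast_ref::<SliceData>().is_some()
}

fn main() {
    let slice = SliceData { offset: 4 };
    assert!(matches_slice(&slice));
    assert!(!matches_slice(&FilterData));
    let _ = slice.offset; // silence the dead-code lint on the example field
}
```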
### Tests
- `cargo build --workspace`
- `cargo nextest run -p vortex-array` (2487 pass, 21 timezone-dependent
skips)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all`
- `./scripts/public-api.sh`
Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
```rust
{
    for plugin in plugins.as_ref() {
        if let Some(reduced) = plugin(child, &parent_ref, slot_idx)? {
            return Ok(Some(cascade(reduced, session)?));
```
we should check here if slot_idx equals slots length and not call cascade if it is
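One plausible reading of this suggestion, as a self-contained mock: skip the recursive `cascade` when the reduction happened on the last slot, since there is nothing further to walk. The plugin signature, `cascade` body, and the exact last-slot comparison are all assumptions for illustration, not the vortex code.

```rust
type PluginFn = fn(i64, usize) -> Option<i64>;

// Stand-in for the recursive re-reduction; pretend it does further work.
fn cascade(reduced: i64) -> Result<i64, String> {
    Ok(reduced + 1)
}

fn evaluate(
    plugins: &[PluginFn],
    child: i64,
    slot_idx: usize,
    slots_len: usize,
) -> Result<Option<i64>, String> {
    for plugin in plugins {
        if let Some(reduced) = plugin(child, slot_idx) {
            // The suggested guard: on the last slot, return the plugin's
            // result directly instead of cascading.
            if slot_idx + 1 == slots_len {
                return Ok(Some(reduced));
            }
            return Ok(Some(cascade(reduced)?));
        }
    }
    Ok(None)
}

fn main() {
    let plugins: &[PluginFn] = &[|child, _| Some(child * 2)];
    // Last slot: cascade skipped, plugin result returned directly.
    assert_eq!(evaluate(plugins, 3, 1, 2), Ok(Some(6)));
    // Not the last slot: cascade runs on top of the plugin result.
    assert_eq!(evaluate(plugins, 3, 0, 2), Ok(Some(7)));
}
```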
Add indirection logic to support stack-allocated array parents in reduce rules.
This lets us avoid heap allocation for commonly used operations that are often
immediately optimised away.