Allow running reduce_parent operations on stack allocated parents #7751
robert3005 wants to merge 5 commits into `develop` from `refactor/parent-ref-stack-allocated`
Conversation
`ArrayRef::slice/filter/take` previously allocated a wrapper array (`SliceArray`, `FilterArray`, `DictArray`) and called `.optimize()` on the resulting `ArrayRef`, relying on a `reduce_parent` rule to throw the wrapper away. The wrapper allocation was always paid, even when reduction ran in one shot. This change establishes the API surface for moving the wrapper onto the stack:

- `ParentRef<'a>`: a parent representation that optionally borrows an `&ArrayRef` and otherwise carries the encoding-specific data, dtype, length, slots, and encoding id directly. `into_array_ref(self)` clones the underlying `Arc` when the parent is heap-backed.
- `ParentView<'a, V>`: a typed view of a parent that derefs to `V::ArrayData` without holding an `&ArrayRef`. Used by the upcoming matcher path that accepts stack-allocated parents.
- `DynArray::data_any` exposes the encoding-specific data so a matcher can downcast to `V::ArrayData` from a `ParentRef` regardless of whether the parent is heap- or stack-backed.
- `ArrayParts<V>::optimize`, `optimize_ctx(session)`, and `into_array`, plus `Optimized<V>` with its own `into_array`. Callers chain `parts.optimize()?.into_array()` so reduction is an explicit step, orthogonal to materialization.
- `ArrayRef::slice / filter / take` now build an `ArrayParts<V>` and drive it through the chain.

The internals of `ArrayParts::optimize` still materialize before running the existing `reduce_parent` chain, so this PR does not yet remove the `Arc<ArrayInner<V>>` allocation. Wiring `ParentRef` through `DynArray::reduce_parent`, `VTable::reduce_parent`, `ParentRuleSet`, `DynArrayParentReduceRule`, `ReduceParentFn`, the `Matcher` API, and the per-encoding rule bodies is the follow-up that delivers the allocation savings.
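The shape of the two-variant parent can be sketched with a self-contained mock. Everything here is a stand-in, not the vortex API: `ArrayInner` is a hypothetical simplification carrying only two of the fields listed above, and the point is just the heap/stack split and the `Arc`-clone-vs-allocate behaviour of `into_array_ref`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real array payload; only two fields modelled.
#[derive(Debug)]
struct ArrayInner {
    encoding_id: &'static str,
    len: usize,
}

type ArrayRef = Arc<ArrayInner>;

// A parent either borrows a heap-backed ArrayRef or carries the
// encoding-specific data directly on the stack.
enum ParentRef<'a> {
    Heap(&'a ArrayRef),
    Stack { encoding_id: &'static str, len: usize },
}

impl<'a> ParentRef<'a> {
    fn from_array_ref(array: &'a ArrayRef) -> Self {
        ParentRef::Heap(array)
    }

    // Heap-backed parents just clone the Arc; stack-backed parents must
    // materialize a fresh allocation.
    fn into_array_ref(self) -> ArrayRef {
        match self {
            ParentRef::Heap(array) => Arc::clone(array),
            ParentRef::Stack { encoding_id, len } => {
                Arc::new(ArrayInner { encoding_id, len })
            }
        }
    }
}

fn main() {
    let heap: ArrayRef = Arc::new(ArrayInner { encoding_id: "slice", len: 8 });
    let borrowed = ParentRef::from_array_ref(&heap);
    let cloned = borrowed.into_array_ref();
    // The heap path bumps the refcount instead of allocating.
    assert_eq!(Arc::strong_count(&heap), 2);

    let stacked = ParentRef::Stack { encoding_id: "slice", len: 8 };
    assert_eq!(stacked.into_array_ref().len, cloned.len);
}
```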
### Tests

- `cargo build --workspace`
- `cargo nextest run -p vortex-array -p vortex-fastlanes -p vortex-fsst -p vortex-alp -p vortex-runend -p vortex-zigzag` (all 3078 pass; the 21 skipped are timezone-dependent and unrelated to this change)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all -- --check`
- `./scripts/public-api.sh`

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
…r slice

Builds on the previous commit's `ArrayParts::optimize` chain by flipping the `reduce_parent` dispatch signatures to take `&ParentRef<'_>` and wiring the slice path through the stack-allocated parent, so `ArrayRef::slice` never allocates a wrapper `Arc<ArrayInner<Slice>>` when the child encoding has a `SliceReduce` impl.

### Signature changes

The following all flip `parent: &ArrayRef` -> `parent: &ParentRef<'_>`:

- `DynArray::reduce_parent`, `VTable::reduce_parent`, and the `ArrayRef::reduce_parent` delegate.
- `DynArrayParentReduceRule::matches` / `::reduce_parent`, `ParentRuleSet::evaluate`, `ParentReduceRuleAdapter`.
- `ReduceParentFn` (session-registered kernels). The lookup keys `(parent_id, child_id)` are unchanged.
- All 34 `impl VTable::reduce_parent` overrides — a mechanical change, because every impl just delegates to `PARENT_RULES.evaluate(array, parent, idx)`.

The optimizer's `try_optimize` and the executor's `walk` build a `ParentRef::from_array_ref(&current_array)` once and reuse it across the slot loop.

### How existing rules keep working

- `Matcher` gains a `try_match_parent` default that delegates to the existing `try_match` when the parent is heap-allocated and returns `None` otherwise. Existing matchers (`AnyArray`, `AnyCanonical`, `AnyTemporal`, `AnyScalarFn`, `ExactScalarFn`, plus the blanket `impl<V: VTable> Matcher for V`) therefore work unchanged for heap-backed parents.
- `ArrayParentReduceRule` gains a `reduce_parent_ref` default that uses `try_match_parent` and delegates to `reduce_parent`. All 19 existing rule bodies keep their `Match<'_>` signature; dispatch from `ParentReduceRuleAdapter` now goes through `reduce_parent_ref`.
- `ArrayParts::optimize_inner` walks slots with a stack `ParentRef::from_parts` first; if no rule fires (e.g. the existing matchers don't match a stack parent), it falls back to materializing into an `Arc<ArrayInner<V>>` and running `ArrayRef::optimize`, preserving today's behaviour for filter, take, and any other rule that has not been migrated to the stack path.

### Stack path for slice

`SliceReduceAdaptor::reduce_parent_ref` is overridden to extract the parent via `ParentRef::try_view::<Slice>` (which works for both heap- and stack-allocated parents). When `ArrayRef::slice` is called:

1. Build an `ArrayParts<Slice>` on the stack with a `SliceData` and the single child slot.
2. `ArrayParts::optimize` walks the slot, calling the child's `reduce_parent` with a stack `ParentRef`.
3. `SliceReduceAdaptor::reduce_parent_ref` downcasts via `try_view::<Slice>`, bypasses the default `try_match_parent` (which would have returned `None` for the stack parent), runs `precondition` and `<V as SliceReduce>::slice`, and returns the reduced array.

No wrapper `Arc<ArrayInner<Slice>>` is allocated for the slice itself; the fallback only fires when the child encoding has no `SliceReduce` impl.

### Tests

- `cargo build --workspace`
- `cargo nextest run -p vortex-array` (2487 pass, 21 skipped — same set of timezone-dependent skips as before; unrelated to this change)
- `cargo nextest run -p vortex-fastlanes -p vortex-fsst -p vortex-alp -p vortex-runend -p vortex-zigzag -p vortex-pco -p vortex-datetime-parts -p vortex-decimal-byte-parts -p vortex-bytebool -p vortex-zstd -p vortex-sparse -p vortex-sequence` (893 pass)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all`
- `./scripts/public-api.sh`

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
Signed-off-by: Robert Kruszewski <github@robertk.io>
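The default-method compatibility trick described above can be illustrated with a simplified, self-contained mock. The names mirror the commit message, but `ParentRef`, `Matcher`, `AnyArray`, and `SliceAdaptor` here are hypothetical stand-ins that use `&str` in place of real arrays — not the vortex API.

```rust
// Stand-in parent: either a borrowed heap array or stack-carried data.
#[derive(Clone, Copy)]
enum ParentRef<'a> {
    Heap(&'a str),
    Stack(&'a str),
}

trait Matcher {
    // The pre-existing heap-only entry point.
    fn try_match<'a>(&self, parent: &'a str) -> Option<&'a str>;

    // New default: delegate for heap parents, decline stack parents, so
    // unmigrated matchers keep today's behaviour unchanged.
    fn try_match_parent<'a>(&self, parent: &ParentRef<'a>) -> Option<&'a str> {
        match *parent {
            ParentRef::Heap(array) => self.try_match(array),
            ParentRef::Stack(_) => None,
        }
    }
}

// An unmigrated matcher: only implements the heap entry point.
struct AnyArray;

impl Matcher for AnyArray {
    fn try_match<'a>(&self, parent: &'a str) -> Option<&'a str> {
        Some(parent)
    }
}

// A migrated adaptor overrides the default and handles both variants.
struct SliceAdaptor;

impl Matcher for SliceAdaptor {
    fn try_match<'a>(&self, parent: &'a str) -> Option<&'a str> {
        Some(parent)
    }

    fn try_match_parent<'a>(&self, parent: &ParentRef<'a>) -> Option<&'a str> {
        match *parent {
            ParentRef::Heap(array) => Some(array),
            ParentRef::Stack(data) => Some(data),
        }
    }
}

fn main() {
    let stack = ParentRef::Stack("slice-data");
    // Unmigrated matchers decline stack parents via the default method.
    assert_eq!(AnyArray.try_match_parent(&stack), None);
    // The migrated adaptor matches the stack parent.
    assert_eq!(SliceAdaptor.try_match_parent(&stack), Some("slice-data"));
    // Heap parents still flow through the old entry point.
    assert_eq!(AnyArray.try_match_parent(&ParentRef::Heap("arr")), Some("arr"));
}
```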
robert3005 force-pushed from 3751910 to 08c9f4a
Merging this PR will degrade performance by 24.95%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | filter_all_true[100000] | 7.6 µs | 9.5 µs | -19.79% |
| ❌ | Simulation | filter_all_true[10000] | 8.6 µs | 10.4 µs | -18% |
| ❌ | Simulation | filter_all_true[250000] | 7.6 µs | 9.5 µs | -19.79% |
| ❌ | Simulation | filter_powerlaw_by_correlated_runs[1000] | 28.8 µs | 32.7 µs | -11.9% |
| ❌ | Simulation | varbinview_zip_block_mask | 2.9 ms | 3.7 ms | -21.65% |
| ❌ | Simulation | varbinview_zip_fragmented_mask | 6.5 ms | 7.3 ms | -10.29% |
| ❌ | Simulation | new_bp_prim_test_between[i64, 32768] | 177.3 µs | 236.2 µs | -24.95% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] | 275.3 ns | 246.1 ns | +11.85% |
Comparing refactor/parent-ref-stack-allocated (af552a8) with develop (5e5572b)
Footnote: 138 benchmarks were skipped, so the baseline results were used instead.
- `ArrayParts::optimize` and `optimize_ctx` now take `&self` and return
`Option<ArrayRef>` instead of consuming `self` and returning a custom
`Optimized<V>` enum. This drops `Optimized<V>` from the public API and
makes `optimize` follow the usual Rust convention of borrowing what it
doesn't need to consume.
- The materialize-and-then-heap-optimize fallback that used to live inside
`optimize_inner` now lives at the call sites for `filter` and `take`
(which haven't migrated their adaptors to the stack path yet). For
`slice`, the fallback is unreachable because `SliceReduceAdaptor` already
overrides `reduce_parent_ref`.
- `Matcher::matches_parent` is overridden for the blanket `impl<V: VTable>
  Matcher for V` to match by downcasting the encoding-specific data via
  `ParentRef::try_view`. Without this, `ParentRuleSet` skips the rule for
  stack-backed parents because the default `try_match_parent` returns
  `None`, so `SliceReduceAdaptor::reduce_parent_ref` would never be invoked
  on the slice fast path. The override matches both heap- and stack-backed
  parents uniformly.
- `SliceArray::try_new_parts` / `new_parts`,
`FilterArray::try_new_parts` / `new_parts`, and
`DictArray::try_new_parts` build the corresponding `ArrayParts<V>`
without materializing. The existing `try_new` / `new` constructors
delegate. Call sites in `ArrayRef::slice` / `filter` / `take` use them
directly:
```rust
let parts = SliceArray::try_new_parts(self.clone(), range)?;
let sliced = match parts.optimize()? {
Some(reduced) => reduced,
None => parts.into_array()?,
};
```
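The encoding-by-downcast check that the blanket `matches_parent` override relies on can be sketched with `std::any::Any`. This is a minimal illustration, not the real implementation: `SliceData` and `FilterData` are hypothetical stand-ins for encoding-specific data, and the point is only that the same downcast works regardless of whether the data came from a heap `ArrayRef` or a stack-allocated parent.

```rust
use std::any::Any;

// Hypothetical stand-ins for encoding-specific data types.
struct SliceData {
    offset: usize,
}
struct FilterData;

// Matching by downcast: succeeds iff the erased data is SliceData,
// independent of where the parent itself was allocated.
fn matches_slice(data: &dyn Any) -> bool {
    data.downcast_ref::<SliceData>().is_some()
}

fn main() {
    let slice = SliceData { offset: 4 };
    assert!(matches_slice(&slice));
    assert!(!matches_slice(&FilterData));
    let _ = slice.offset; // silence the dead-code lint on the example field
}
```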
### Tests
- `cargo build --workspace`
- `cargo nextest run -p vortex-array` (2487 pass, 21 timezone-dependent
skips)
- `cargo clippy -p vortex-array --all-targets --all-features`
- `cargo +nightly fmt --all`
- `./scripts/public-api.sh`
Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
```rust
{
    for plugin in plugins.as_ref() {
        if let Some(reduced) = plugin(child, &parent_ref, slot_idx)? {
            return Ok(Some(cascade(reduced, session)?));
```
we should check here if slot_idx equals slots length and not call cascade if it is
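One plausible reading of this suggestion, as a self-contained mock: skip the recursive `cascade` when the reduction happened on the last slot, since there is nothing further to walk. The plugin signature, `cascade` body, and the exact last-slot comparison are all assumptions for illustration, not the vortex code.

```rust
type PluginFn = fn(i64, usize) -> Option<i64>;

// Stand-in for the recursive re-reduction; pretend it does further work.
fn cascade(reduced: i64) -> Result<i64, String> {
    Ok(reduced + 1)
}

fn evaluate(
    plugins: &[PluginFn],
    child: i64,
    slot_idx: usize,
    slots_len: usize,
) -> Result<Option<i64>, String> {
    for plugin in plugins {
        if let Some(reduced) = plugin(child, slot_idx) {
            // The suggested guard: on the last slot, return the plugin's
            // result directly instead of cascading.
            if slot_idx + 1 == slots_len {
                return Ok(Some(reduced));
            }
            return Ok(Some(cascade(reduced)?));
        }
    }
    Ok(None)
}

fn main() {
    let plugins: &[PluginFn] = &[|child, _| Some(child * 2)];
    // Last slot: cascade skipped, plugin result returned directly.
    assert_eq!(evaluate(plugins, 3, 1, 2), Ok(Some(6)));
    // Not the last slot: cascade runs on top of the plugin result.
    assert_eq!(evaluate(plugins, 3, 0, 2), Ok(Some(7)));
}
```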
Add indirection logic to support stack-allocated array parents in reduce rules.
This lets us avoid heap allocation for commonly used operations that are often
immediately optimised away.