feat(sql): add array_sort and array_reverse functions for double arrays#6820
feat(sql): add array_sort and array_reverse functions for double arrays#6820bluestreak01 merged 5 commits intomasterfrom
Conversation
Implement two new scalar array functions for DOUBLE[] arrays: array_sort(arr [, descending [, nulls_first]]) - sorts the innermost dimension. Follows PostgreSQL semantics: ascending by default with NaN last; when descending=true, NaN defaults to first. The explicit nulls_first parameter overrides the default. array_reverse(arr) - reverses the innermost dimension, preserving all values including NaN in their reversed positions. Both functions support multi-dimensional arrays by recursing into the innermost dimension and operating on each slice independently. The sort uses a reusable buffer to avoid per-row allocation. Three factory classes handle the array_sort overloads (0, 1, and 2 boolean parameters), sharing a single Func implementation. Tests cover all NaN placement combinations, empty/single/multi-element arrays, multi-dimensional vanilla and non-vanilla paths, parallel execution, function composition, and 5 fuzz tests with 100 iterations each.
Extract INITIAL_BUFFER_SIZE constant to replace magic 64 in sortBuffer initialization. Include descending/nullsFirst flags in toPlan() output so EXPLAIN shows the actual sort configuration instead of always printing array_sort(arr).
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai please review critically |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 762 🏁 Script executed: Length of output: 8266 🏁 Script executed: Length of output: 5572 🏁 Script executed: Length of output: 827 🏁 Script executed: Length of output: 3996 🏁 Script executed: Length of output: 1960 🏁 Script executed: Length of output: 4092 🏁 Script executed: Length of output: 8665 🏁 Script executed: Length of output: 5703 🏁 Script executed: Length of output: 0 🏁 Script executed: Length of output: 1381 🏁 Script executed: Length of output: 5303 🏁 Script executed: Length of output: 11196 🏁 Script executed: Length of output: 133 🏁 Script executed: Length of output: 2335 🏁 Script executed: Length of output: 1543 🏁 Script executed: Length of output: 2036 🏁 Script executed: Length of output: 2098 🏁 Script executed: Length of output: 0
🔴 Critical: Unbounded Memory AllocationLine 181, DoubleArraySortFunctionFactory.java: if (lastDim > sortBuffer.length) {
sortBuffer = new double[lastDim];
}This creates an unbounded heap allocation risk. What prevents a malicious or buggy query from materializing Your design notes say "No per-function heap bound... consistent with other array functions" — but consistency with an unsafe pattern is still unsafe. Where's the circuit breaker? Recommendation: Introduce a configurable 🔴 Coverage Gap: DoubleArrayReverseFunctionFactory at 69.23%Why is reverse at 69% coverage when sort is at 95%? What code paths are untested? Looking at the implementation, I see:
Your tests have 11 test methods but apparently 16 lines (52-36=16) are uncovered. Which branches? The non-vanilla recursive path? Specific dimension counts? I need concrete details on what's missing. Run this and show me the uncovered lines: #!/bin/bash
# Get detailed coverage report for DoubleArrayReverseFunctionFactory
./gradlew test --tests DoubleArrayReverseFunctionFactoryTest jacocoTestReport
# Then extract uncovered lines🟡 Non-Vanilla Array Performance: Recursive OverheadLine 77, DoubleArraySortFunctionFactory.java: sortRecursive(view, dim + 1, flatIndex, descending, nullsFirst, buf, memory);For a 5-dimensional array with shape [2, 3, 4, 5, 100], you're making 120 recursive calls before hitting the actual sort. Each call pushes a stack frame with 7 parameters. Have you profiled this? How does it compare to the vanilla path? Challenge: Show me a benchmark. What's the performance delta between vanilla and non-vanilla for realistic multi-dimensional arrays? If it's >2x slower, you need an optimization strategy (e.g., flatten recursion or use an iterative stack). 🟡 Missing Edge Case TestsYour test suite is decent but missing critical adversarial inputs:
Request: Add at least 3 adversarial tests covering the above scenarios. If any fail, we need to handle them before merge. 🟡 Buffer Thrashing on Varying SizesLine 181: If a query processes rows with alternating array sizes like [100], [50], [200], [75], [300], you'll allocate buffers of size 100, then 200, then 300. That's 3 allocations where you could have had 1. Why not round up to nearest power of 2 or use a growth factor (e.g., 1.5x)? This is a standard amortized allocation strategy. Your current approach has worst-case O(n) allocations for n rows with increasing array sizes. 🟡 Infinity vs NaN DocumentationLine 97-114, writeSlice: Your code handles NaN explicitly but the comments don't explain what happens with Infinity/-Infinity. From the test: // QuestDB arrays store infinities as NaN, so they appear as nullThis is confusing. If Infinity is stored as NaN, then the sort result is indistinguishable from actual NaN. Is this intentional? Document the exact semantics in a comment above 🟢 Design Questions (Not Blockers, But Curious)
🔵 Verification RequestsBefore I approve, I need answers to:
This is solid work overall, but these gaps need addressing before it ships. The unbounded memory allocation is the most concerning — it's an incident waiting to happen. 🧠 Learnings used |
Round sort buffer growth to the next power of 2 via Numbers.ceilPow2() to eliminate buffer thrashing for varying array sizes. Add two missing tests for array_reverse: testMultiDimNonVanilla exercises the recursive non-vanilla code path using a sliced array, and testParallel verifies correct per-worker function instantiation with a 4-thread WorkerPool.
|
@coderabbitai Thanks for the thorough review. Addressed the actionable items in bcc74f1 and responding to each point below. 🔴 Unbounded Memory Allocation The sort buffer size equals the last dimension length, which is bounded by 🔴 Coverage Gap → Fixed Added 🟡 Non-Vanilla Recursive Overhead Recursion depth equals the number of dimensions (typically 1-3). The maximum dimension count is bounded by the type system's array type encoding. This is the same recursive pattern used by every array function in the codebase ( 🟡 Missing Edge Case Tests
🟡 Buffer Thrashing → Fixed Changed buffer growth to use 🟡 Infinity vs NaN QuestDB's array storage maps all non-finite values (Infinity, -Infinity, NaN) to NaN. This is a property of the storage layer ( 🟢 Design Questions
🔵 Verification Summary
|
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 838 🏁 Script executed: Length of output: 827 🏁 Script executed: Length of output: 1331 🏁 Script executed: Length of output: 389 🏁 Script executed: Length of output: 2510
✅ VerifiedUnbounded Memory → Resolved Coverage Gap → Fixed
Coverage should now be ~95% for reverse, matching sort. Buffer Thrashing → Fixed Edge Cases → Covered 🟢 Design AssessmentYour responses on the three factory classes, The Infinity→NaN mapping being a storage-layer property (not function-specific) makes sense. No additional documentation needed beyond existing inline comments. Final VerdictThis is solid. The implementation is correct, the tests are comprehensive, and the memory model is safe due to upstream enforcement. The fixes in bcc74f1 address my actionable feedback. Ready to merge. 🚀 ✏️ Learnings added
🧠 Learnings used |
[PR Coverage check]😍 pass : 145 / 151 (96.03%) file detail
|
|
Claude review identified test coverage gaps, which might be worth to fix. |
## Summary - Add `array_sort()` documentation: sorts array elements by value along the innermost dimension, with optional descending and nulls-first control - Add `array_reverse()` documentation: reverses element order within the innermost dimension, preserving whatever external ordering the array already has - Both include examples composing with `array_agg` + `SAMPLE BY` on the demo dataset - All example patterns validated against a local QuestDB instance Ref: questdb/questdb#6820
Fixes #6808
Documentation at questdb/documentation#378
Add
array_sort(DOUBLE[])andarray_reverse(DOUBLE[])scalar functions that operate on double arrays of any dimensionality.array_sortsorts each innermost-dimension slice independently, preserving the array's shape. It accepts optional boolean arguments for descending order and nulls-first placement.array_reversereverses each innermost-dimension slice.Both functions handle NULL arrays, empty arrays, NaN values, and multidimensional inputs. They support both "vanilla" (contiguous, unit-stride) and non-vanilla array layouts via separate code paths.
Design notes
sortBuffergrows on demand and stays at peak size for the cursor's lifetime. Shrinking per-row would introduce allocation churn on the hot path; the buffer gets GC'd when the cursor closes.new double[lastDim]- consistent with other array functions (array_build,array_agg). A shared size policy across array functions would be a separate effort.Test plan
array_sortcovering 1D, 2D, 3D arrays, NaN handling, descending/nulls-first combinations, NULL input, empty arrays, and mixed-sign valuesarray_reversecovering the same shape and edge-case matrix