fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results#6943
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
bluestreak01
left a comment
There was a problem hiding this comment.
@jovfer — review findings below.
Moderate
1. Code duplication with checkForChildWindowFunctions — SqlOptimiser.java:10876-10892
The new hasNestedWindowExpression is functionally identical to the existing checkForChildWindowFunctions at line 10900. Both walk the AST checking node.windowExpression != null, with the same paramCount < 3 branching logic. The only difference is recursive vs. iterative implementation.
Since the call site at line 9083 is in an instance method of SqlOptimiser, the instance field sqlNodeStack (line 193) is available. The new method should be removed and replaced with a call to the existing one:
if (explicitGroupBy && !qc.isWindowExpression() && checkForChildWindowFunctions(sqlNodeStack, qc.getAst())) {This eliminates duplication and avoids introducing a recursive traversal where the codebase already uses an iterative stack-based approach for the same logic.
2. Method placement violates member ordering — SqlOptimiser.java:10876
The new method is private static but is placed among instance methods (between wrapWithSelectModel at line 10837 and wrapWithSelectWildcard at line 10894). Per CLAUDE.md, static members should be grouped with other statics and sorted alphabetically. The other private static methods live at lines 431–785; hasNestedWindowExpression belongs between extractAndTerms (431) and isCompileTimeConstant (444). Moot if finding 1 is addressed.
Minor
3. node.args.size() vs node.paramCount — SqlOptimiser.java:10886
The new method loops over node.args.size(), while the existing checkForChildWindowFunctions at line 10926 uses node.paramCount. These should match in practice, but paramCount is the canonical bound per the ExpressionNode invariants (line 64–68). Moot if finding 1 is addressed.
4. Test uses multiple INSERT statements — WindowFunctionTest.java:8217-8219
Per coding conventions: "use a single INSERT statement to insert multiple rows." The three separate INSERTs should be consolidated:
execute("INSERT INTO t VALUES (1, 'A', 1000000), (2, 'A', 2000000), (3, 'B', 3000000)");5. Error position points at the root expression, not the window function — SqlOptimiser.java:9084
qc.getAst().position yields position 33 (the - operator), whereas the actual offending character is the window function avg at position 35. The convention is that the position should point at the specific offending character. The existing top-level check at line 9253 uses hardcoded position 0, so position 33 is still an improvement — this is a nit.
6. Limited test coverage — WindowFunctionTest.java:8214-8229
Only one query shape is tested. Consider adding:
- Deeper nesting:
SELECT cat, 1 + (2 + avg(x) OVER ()) FROM t GROUP BY cat— verifies the AST walk recurses correctly - CASE expression:
SELECT cat, CASE WHEN avg(x) > 0 THEN avg(x) OVER () END FROM t GROUP BY cat— exercises theparamCount >= 3branch - Positive regression test: a query with nested window functions WITHOUT GROUP BY still works (e.g.,
SELECT (avg(x) - avg(x) OVER ()) FROM t) — guards against the check accidentally rejecting valid queries
ff4b81e to
f62483e
Compare
150dcd9 to
30deaa7
Compare
59343af to
94e22a4
Compare
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Window functions nested inside arithmetic expressions (e.g., avg(x) - avg(x) OVER ()) bypassed the GROUP BY + window function validation because the top-level column was an operator, not a WindowExpression. The existing check only detected top-level window columns via REWRITE_STATUS flags, so such queries silently produced wrong results. Walk the AST of each SELECT column at the start of the rewrite loop to detect nested window expressions that sit outside any aggregate, and reject the query with the same error as top-level window + GROUP BY mixing. The new traversal skips subtrees rooted at aggregate functions because PR #6955 made max(avg(x) OVER (...)) and similar shapes legal: those window functions are lowered into an inner window model. Only windows that reach the top of the SELECT column without an enclosing aggregate are unrepresentable. The guard fires both for explicit GROUP BY and for implicit grouping (bare aggregates without a GROUP BY clause, e.g. SELECT sum(x) + avg(x) OVER () FROM t). The implicit case is safe because the pre-scan sets REWRITE_STATUS_USE_GROUP_BY_MODEL only when real aggregates are present (checkForChildAggregates skips window functions via windowExpression == null), so queries like SELECT (x - avg(x) OVER ()) FROM t are unaffected. The SqlException position points at the nested window function itself (e.g., avg at position 35) rather than the root operator, matching the convention that error positions identify the specific offending character. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
94e22a4 to
1b01325
Compare
Extend findWindowFunctionOutsideAggregatePos to also match pure window function names (row_number, rank, ...) used without an OVER clause next to an aggregate. The previous form only checked windowExpression, so SELECT sum(x) + row_number() FROM t fell through to a runtime "empty window context" rather than a clean compile-time error. Reword the comment on the aggregate-subtree skip to attribute the windowExpression invariant to the parser (where it is actually set), not to checkForChildAggregates, which only reads it. Correct the adjacent comment describing what the top-level REWRITE_STATUS check catches - only columns whose AST is itself a window expression, not every top-level window reference. Convert the method header to Javadoc to match neighbouring methods. Add underscore thousands separators to the test data, matching the project-wide rule for numbers of five or more digits. Cover two more shapes in the regression test: - SELECT sum(x) + row_number() FROM t, pinning the new pure-window-name path. - SELECT ... GROUP BY category ORDER BY avg(x) - avg(x) OVER (), which moveOrderByFunctionsIntoOuterSelect lifts into the bottom-up column list, so the SELECT-list guard now catches it with a precise position.
PIVOT compiles into a SELECT with implicit GROUP BY and aggregates per pivot value. The aggregator slot accepts whatever the user writes, but windowExpression-bearing nodes (e.g. sum(41) over()) are not aggregates and have no valid plan in that position. Before the new guard, the optimiser silently lowered the window function and produced a structural NULL NULL row from the empty test table - clearly wrong, but not detected. The new guard rejects sum(41) over() in PIVOT with the same "Window function is not allowed in context of aggregation. Use sub-query." message as any other window-outside-aggregate-with-GROUP-BY shape. The neighbouring case sum(sum(41) over()) still passes because the walker skips aggregate subtrees, matching the surrounding pivot test expectations. Convert the entry from "query 2 ... NULL NULL" to "query error ..." in pivot_errors.test, which already documents PIVOT's other non-aggregate rejection (now()).
…w_groupby_validation
e47b6cc to
2f936ec
Compare
The rewriter only handles agg(window(x)) when the aggregate sits at the column root (e.g. SELECT max(avg(x) OVER ()) FROM t GROUP BY cat). Wrapping that aggregate in arithmetic, a function call, or a CASE branch falls back to a code path that raises a misleading "Aggregate function cannot be passed as an argument" error pointing at the inner window function. Add findInvalidAggregateOverWindowPos: walks each SELECT column's AST under the existing GROUP BY context guard. Allows the column when its root is the supported agg(window()) shape. Otherwise descends the tree looking for an aggregate node that contains a window function, and rejects with "Aggregate over window function cannot be combined with other terms. Use a sub-query." pointing at the offending aggregate. The walker shares its outer stack with findWindowFunctionOutsideAggregatePos and reuses the static checkForChildWindowFunctions helper with sqlNodeStack2 to test whether each candidate aggregate's subtree contains a window. Both guards run in the same call site; the bare-window-outside-aggregate case takes precedence as the more obvious bug. Cover the shapes the user-facing error now catches: - max(avg(x) OVER ()) - min(avg(x) OVER ()) under GROUP BY and the same pattern without an explicit GROUP BY (implicit grouping from sibling aggregates). - max(avg(x) OVER ()) + literal and max(avg(x) OVER ()) + agg(col). - Aggregate-wrapped window inside a CASE THEN branch. Pin the supported and adjacent cases as positive regression guards: - Bare max(avg(x) OVER ()) FROM t GROUP BY cat. - Two separate columns each with a bare aggregate-of-window. - Plain aggregate arithmetic (max(x) + sum(x)) without any window. Extend testNestedWindowFunctionInGroupByContextFails with three more shapes: - mixed aggregate-wrapped window plus a sibling bare window in the same expression - the existing walker correctly skips the aggregate subtree and still flags the bare window. - GROUP BY ordinal that resolves to a window-bearing column - caught by the pre-existing validateGroupByExpression flow with the "aggregate functions are not allowed in GROUP BY" message. - sub-query exposing the window column with the outer query aggregating it - positive regression guard for the suggested workaround. Add a one-line ExpressionNode.java:64-68 invariant pointer above the walk loop in findWindowFunctionOutsideAggregatePos so the paramCount/lhs/rhs/args contract the walker depends on is documented at the use site. PR description gains a Known limitations section listing the two cases that stay rejected: window outside any aggregate mixed with GROUP BY (PostgreSQL accepts this) and any aggregate-wrapped window in a compound expression. Both come with workarounds.
2f936ec to
ddac142
Compare
Apply review feedback on the window/group-by validation guard: - findInvalidAggregateOverWindowPos now null-checks variadic args before pushing to the stack, matching the convention of every other AST walker in SqlOptimiser (checkForChildWindowFunctions, checkForWindowFunctionsOrNames, etc.). The variadic-poll also guards on isEmpty() for symmetry with the sibling walker. - findWindowFunctionOrNamePosition now returns -1 for "not found" instead of 0, aligning with the two new sibling helpers and letting position 0 be a valid result. Internal recursive call sites updated from `pos > 0` to `pos >= 0`. The existing top-level callers in validateNoWindowFunctionsInWhereClauses are unaffected because they only invoke the helper after checkForWindowFunctionsOrNames has already confirmed a window function exists. - Add SAMPLE BY coverage to both negative test methods. SAMPLE BY triggers implicit grouping through sibling aggregates, so the guard fires on both `bare window in expression` and `aggregate-over-window in expression` shapes under SAMPLE BY.
Round of polish on the window/group-by validation guard: - Drop the redundant explicitGroupBy disjunct from the SELECT-list guard. REWRITE_STATUS_USE_GROUP_BY_MODEL is set whenever the GROUP BY clause is present and the model is not HORIZON JOIN, which is exactly the explicitGroupBy condition. The single bit-test is enough. - Add an assert in findInvalidAggregateOverWindowPos enforcing that the caller has already invoked findWindowFunctionOutsideAggregatePos. The comment claimed this contract but nothing checked it; a future caller that copies one helper without the other would silently miss bare windows. The assert turns the implicit precondition into a hard failure under -ea. - Rephrase the ExpressionNode.paramCount comment to point at the field's documentation by name instead of citing line 64-68 of ExpressionNode.java, which would rot on any reordering. - Add CTE and JOIN coverage to testNestedWindowFunctionInGroupByContextFails. Both flow through rewriteSelectClause0, so the same SELECT-list guard catches the offending column with a precise position.
…w_groupby_validation
bluestreak01
left a comment
There was a problem hiding this comment.
Verified the doc-comment and PR-description fixes from the review. Approved.
[PR Coverage check]😍 pass : 59 / 61 (96.72%) file detail
|
Brings in 14 upstream commits (master..errand merge-base 5037278 through b6b3b15). Notable upstream changes: * JDK 25 migration (#6980) — drops Java 8/11 reflection helpers (isJava8Or11, is32BitJVM, getOrdinaryObjectPointersCompressionStatus, AccessibleObject_override_fieldOffset, OVERRIDE constant) from Unsafe.java. Our errand branch had inherited them; they are dead code in our tree too (the only call site was the dropped helper itself). Followed upstream and removed them. * feat(http): QWP egress / WebSocket SQL streaming (#6991, #7004) * perf(sql): faster GROUP BY / hash join finalizer (#6997) * fix(sql): nested window inside aggregate (#6943, #6955) * feat(sql): configurable parquet export encodings (#6949) Conflict resolutions: * core/src/main/java/io/questdb/std/Unsafe.java Both sides added different methods between setRssMemLimit() and checkAllocLimit(). Kept all three: storeFence (upstream), chargeExternalRss (ours), and dropped the dead AccessibleObject_override_fieldOffset() helper per upstream's JDK 25 cleanup. * core/src/main/java/io/questdb/std/str/GcUtf8String.java Our errand branch redesigned this class for lazy native allocation (data byte[] field, allocateNative() on first ptr() call) instead of upstream's eager DirectByteBuffer + reflection-fetched address. Upstream's only post-divergence change was a tiny refactor replacing Unsafe.getUnsafe().X with Unsafe.X — irrelevant to our redesign. Took ours. * core/src/main/resources/io/questdb/bin/linux-x86-64/libquestdbr.so Built artifact — both sides changed Rust source. Took ours as placeholder; will refresh via `mvn package -pl core -am -DskipTests` immediately after this commit.
Summary
avg(x) - avg(x) OVER ()) silently produced wrong results when combined with GROUP BY. The existing check inSqlOptimiser.rewriteSelect0only inspectedREWRITE_STATUSflags on top-level window columns and missed window functions buried inside operators, function calls, or CASE branches.findWindowFunctionOutsideAggregatePos: walks each SELECT column's AST and returns the position of the first window function that is not nested inside an aggregate. Skipping aggregate subtrees keeps shapes likemax(avg(x) OVER (...))legal (per fix(sql): fix nested window inside aggregate with explicit GROUP BY #6955) while rejecting unrepresentable windows mixed with GROUP BY. The error points at the offending window (e.g.avgat position 35) instead of the root operator or position 0.windowExpression != null) and pure window function names used without OVER (e.g.row_number()). The latter previously fell through to a cryptic runtime "empty window context".REWRITE_STATUS_USE_GROUP_BY_MODELonly for real aggregates (checkForChildAggregatesskips window functions viawindowExpression == null), so the extended guard fires onSELECT sum(x) + avg(x) OVER () FROM t(sibling aggregate triggers grouping) without producing false positives on valid queries likeSELECT (x - avg(x) OVER ()) FROM t.findInvalidAggregateOverWindowPosfor a related shape: a wrapper around an aggregate that contains a window function (arithmetic, function call, CASE branch). PR fix(sql): fix nested window inside aggregate with explicit GROUP BY #6955 wired up the rewriter for the case where the aggregate sits at the column root; the extractor recurses through any non-window subtree underneath that root, somax(avg(x) OVER ())andmax(abs(avg(x) OVER ()))both work. What does NOT work is wrapping the aggregate itself:max(avg(x) OVER ()) - 1ormax(avg(x) OVER ()) + min(avg(x) OVER ())falls back to a code path that raises a misleading "Aggregate function cannot be passed as an argument" error pointing at the inner window. The new guard rejects those upfront with "Aggregate over window function cannot be combined with other terms. Use a sub-query." pointing at the offending aggregate.findWindowFunctionOrNamePositionfrom0to-1so a hit at position 0 is no longer mistaken for "not found". All current callers gate oncheckForWindowFunctionsOrNamesfirst, so this is a latent-bug fix with no observable change in current call sites.Known limitations
This PR rejects shapes that previously produced silently wrong output or cryptic errors. Two related cases stay rejected and are documented here as known limitations the planner has yet to grow into:
SELECT col - agg(window(x)) FROM t GROUP BY col(window outside any aggregate, mixed with GROUP BY) is rejected with "Window function is not allowed in context of aggregation. Use sub-query." PostgreSQL accepts this and applies the window post-aggregation when the bare argument is a grouped column. QuestDB's planner does not yet support that pattern. Workaround: compute the window in a sub-query.agg(window(x))itself in any operator, function call, or CASE branch (e.g.max(avg(x) OVER ()) + 1,max(avg(x) OVER ()) - min(avg(x) OVER ())) is rejected with "Aggregate over window function cannot be combined with other terms. Use a sub-query." The rewriter recurses through subtrees inside the root aggregate but does not handle wrapping outside it (fix(sql): fix nested window inside aggregate with explicit GROUP BY #6955). Workaround: split into separate columns or compute the window inside a sub-query and aggregate over it.Test plan
New negative tests added to
testNestedWindowFunctionInGroupByContextFails:SELECT cat, avg(x), (avg(x) - avg(x) OVER ()) FROM t GROUP BY cat.SELECT cat, 1 + (2 + avg(x) OVER ()) FROM t GROUP BY cat.SELECT cat, CASE WHEN avg(x) > 0 THEN avg(x) OVER () END FROM t GROUP BY cat.SELECT cat, abs(avg(x) OVER ()) FROM t GROUP BY cat.SELECT sum(x) + avg(x) OVER () FROM t.SELECT sum(x) + row_number() FROM t.SELECT cat, avg(x) FROM t GROUP BY cat ORDER BY avg(x) - avg(x) OVER ().SELECT cat, max(avg(x) OVER ()) - avg(x) OVER () FROM t GROUP BY cat.SELECT x - avg(x) OVER () FROM t GROUP BY 1(caught by pre-existingvalidateGroupByExpression).New positive tests in the same method:
SELECT cat, max(w) FROM (SELECT cat, avg(x) OVER () AS w FROM t) GROUP BY cat.SELECT (x - avg(x) OVER ()) AS diff FROM tstill works.New test method
testAggregateOverWindowInExpressionFailscovering the aggregate-over-window guard:max(agg(window)) - min(agg(window))with GROUP BY and the same shape without GROUP BY (implicit grouping).max(agg(window)) + literal.max(agg(window)) + agg(col).max(avg(x) OVER ()), two separate columns each withagg(window), plain aggregate arithmetic without any window.pivot_errors.testupdated:test PIVOT (sum(41) over() FOR j IN ('a', 'b'))previously returned a structurally-meaninglessNULL NULLfrom the empty test table; converted to aquery errorblock expecting the new compile-time rejection. The neighbouringsum(sum(41) over())case still passes since the walker skips aggregate subtrees.