Skip to content

fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results#6943

Merged
bluestreak01 merged 10 commits into
masterfrom
sm_fix_nested_window_groupby_validation
Apr 26, 2026
Merged

fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results#6943
bluestreak01 merged 10 commits into
masterfrom
sm_fix_nested_window_groupby_validation

Conversation

@jovfer
Copy link
Copy Markdown
Contributor

@jovfer jovfer commented Apr 7, 2026

Summary

  • Window functions nested inside arithmetic expressions (e.g. avg(x) - avg(x) OVER ()) silently produced wrong results when combined with GROUP BY. The existing check in SqlOptimiser.rewriteSelect0 only inspected REWRITE_STATUS flags on top-level window columns and missed window functions buried inside operators, function calls, or CASE branches.
  • Add findWindowFunctionOutsideAggregatePos: walks each SELECT column's AST and returns the position of the first window function that is not nested inside an aggregate. Skipping aggregate subtrees keeps shapes like max(avg(x) OVER (...)) legal (per fix(sql): fix nested window inside aggregate with explicit GROUP BY #6955) while rejecting unrepresentable windows mixed with GROUP BY. The error points at the offending window (e.g. avg at position 35) instead of the root operator or position 0.
  • The walker matches both forms of window function: function nodes carrying an OVER clause (windowExpression != null) and pure window function names used without OVER (e.g. row_number()). The latter previously fell through to a cryptic runtime "empty window context".
  • The check also covers implicit grouping. The pre-scan sets REWRITE_STATUS_USE_GROUP_BY_MODEL only for real aggregates (checkForChildAggregates skips window functions via windowExpression == null), so the extended guard fires on SELECT sum(x) + avg(x) OVER () FROM t (sibling aggregate triggers grouping) without producing false positives on valid queries like SELECT (x - avg(x) OVER ()) FROM t.
  • Add findInvalidAggregateOverWindowPos for a related shape: a wrapper around an aggregate that contains a window function (arithmetic, function call, CASE branch). PR fix(sql): fix nested window inside aggregate with explicit GROUP BY #6955 wired up the rewriter for the case where the aggregate sits at the column root; the extractor recurses through any non-window subtree underneath that root, so max(avg(x) OVER ()) and max(abs(avg(x) OVER ())) both work. What does NOT work is wrapping the aggregate itself: max(avg(x) OVER ()) - 1 or max(avg(x) OVER ()) + min(avg(x) OVER ()) falls back to a code path that raises a misleading "Aggregate function cannot be passed as an argument" error pointing at the inner window. The new guard rejects those upfront with "Aggregate over window function cannot be combined with other terms. Use a sub-query." pointing at the offending aggregate.
  • Tighten the sentinel of the existing helper findWindowFunctionOrNamePosition from 0 to -1 so a hit at position 0 is no longer mistaken for "not found". All current callers gate on checkForWindowFunctionsOrNames first, so this is a latent-bug fix with no observable change in current call sites.

Known limitations

This PR rejects shapes that previously produced silently wrong output or cryptic errors. Two related cases stay rejected and are documented here as known limitations the planner has yet to grow into:

  • SELECT col - agg(window(x)) FROM t GROUP BY col (window outside any aggregate, mixed with GROUP BY) is rejected with "Window function is not allowed in context of aggregation. Use sub-query." PostgreSQL accepts this and applies the window post-aggregation when the bare argument is a grouped column. QuestDB's planner does not yet support that pattern. Workaround: compute the window in a sub-query.
  • Wrapping agg(window(x)) itself in any operator, function call, or CASE branch (e.g. max(avg(x) OVER ()) + 1, max(avg(x) OVER ()) - min(avg(x) OVER ())) is rejected with "Aggregate over window function cannot be combined with other terms. Use a sub-query." The rewriter recurses through subtrees inside the root aggregate but does not handle wrapping outside it (fix(sql): fix nested window inside aggregate with explicit GROUP BY #6955). Workaround: split into separate columns or compute the window inside a sub-query and aggregate over it.

Test plan

New negative tests added to testNestedWindowFunctionInGroupByContextFails:

  • Nested window inside arithmetic: SELECT cat, avg(x), (avg(x) - avg(x) OVER ()) FROM t GROUP BY cat.
  • Deeper nesting: SELECT cat, 1 + (2 + avg(x) OVER ()) FROM t GROUP BY cat.
  • CASE branch: SELECT cat, CASE WHEN avg(x) > 0 THEN avg(x) OVER () END FROM t GROUP BY cat.
  • Single-argument wrapper: SELECT cat, abs(avg(x) OVER ()) FROM t GROUP BY cat.
  • Implicit GROUP BY: SELECT sum(x) + avg(x) OVER () FROM t.
  • Pure window-function name without OVER: SELECT sum(x) + row_number() FROM t.
  • ORDER BY lifted into bottom-up columns: SELECT cat, avg(x) FROM t GROUP BY cat ORDER BY avg(x) - avg(x) OVER ().
  • Mixed agg-wrapped + bare window: SELECT cat, max(avg(x) OVER ()) - avg(x) OVER () FROM t GROUP BY cat.
  • GROUP BY ordinal resolving to a window-bearing column: SELECT x - avg(x) OVER () FROM t GROUP BY 1 (caught by pre-existing validateGroupByExpression).

New positive tests in the same method:

  • Sub-query workaround: SELECT cat, max(w) FROM (SELECT cat, avg(x) OVER () AS w FROM t) GROUP BY cat.
  • No-GROUP-BY regression: SELECT (x - avg(x) OVER ()) AS diff FROM t still works.

New test method testAggregateOverWindowInExpressionFails covering the aggregate-over-window guard:

  • max(agg(window)) - min(agg(window)) with GROUP BY and the same shape without GROUP BY (implicit grouping).
  • max(agg(window)) + literal.
  • max(agg(window)) + agg(col).
  • Aggregate-wrapped window inside a CASE THEN branch.
  • Positive regression guards: bare max(avg(x) OVER ()), two separate columns each with agg(window), plain aggregate arithmetic without any window.

pivot_errors.test updated:

  • test PIVOT (sum(41) over() FOR j IN ('a', 'b')) previously returned a structurally-meaningless NULL NULL from the empty test table; converted to a query error block expecting the new compile-time rejection. The neighbouring sum(sum(41) over()) case still passes since the walker skips aggregate subtrees.

@jovfer jovfer added Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution labels Apr 7, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 7, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 90ad3476-0c07-466e-94fc-26728a9c616e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sm_fix_nested_window_groupby_validation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Member

@bluestreak01 bluestreak01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jovfer — review findings below.

Moderate

1. Code duplication with checkForChildWindowFunctionsSqlOptimiser.java:10876-10892

The new hasNestedWindowExpression is functionally identical to the existing checkForChildWindowFunctions at line 10900. Both walk the AST checking node.windowExpression != null, with the same paramCount < 3 branching logic. The only difference is recursive vs. iterative implementation.

Since the call site at line 9083 is in an instance method of SqlOptimiser, the instance field sqlNodeStack (line 193) is available. The new method should be removed and replaced with a call to the existing one:

if (explicitGroupBy && !qc.isWindowExpression() && checkForChildWindowFunctions(sqlNodeStack, qc.getAst())) {

This eliminates duplication and avoids introducing a recursive traversal where the codebase already uses an iterative stack-based approach for the same logic.

2. Method placement violates member ordering — SqlOptimiser.java:10876

The new method is private static but is placed among instance methods (between wrapWithSelectModel at line 10837 and wrapWithSelectWildcard at line 10894). Per CLAUDE.md, static members should be grouped with other statics and sorted alphabetically. The other private static methods live at lines 431–785; hasNestedWindowExpression belongs between extractAndTerms (431) and isCompileTimeConstant (444). Moot if finding 1 is addressed.

Minor

3. node.args.size() vs node.paramCountSqlOptimiser.java:10886

The new method loops over node.args.size(), while the existing checkForChildWindowFunctions at line 10926 uses node.paramCount. These should match in practice, but paramCount is the canonical bound per the ExpressionNode invariants (line 64–68). Moot if finding 1 is addressed.

4. Test uses multiple INSERT statements — WindowFunctionTest.java:8217-8219

Per coding conventions: "use a single INSERT statement to insert multiple rows." The three separate INSERTs should be consolidated:

execute("INSERT INTO t VALUES (1, 'A', 1000000), (2, 'A', 2000000), (3, 'B', 3000000)");

5. Error position points at the root expression, not the window function — SqlOptimiser.java:9084

qc.getAst().position yields position 33 (the - operator), whereas the actual offending character is the window function avg at position 35. The convention is that the position should point at the specific offending character. The existing top-level check at line 9253 uses hardcoded position 0, so position 33 is still an improvement — this is a nit.

6. Limited test coverage — WindowFunctionTest.java:8214-8229

Only one query shape is tested. Consider adding:

  • Deeper nesting: SELECT cat, 1 + (2 + avg(x) OVER ()) FROM t GROUP BY cat — verifies the AST walk recurses correctly
  • CASE expression: SELECT cat, CASE WHEN avg(x) > 0 THEN avg(x) OVER () END FROM t GROUP BY cat — exercises the paramCount >= 3 branch
  • Positive regression test: a query with nested window functions WITHOUT GROUP BY still works (e.g., SELECT (avg(x) - avg(x) OVER ()) FROM t) — guards against the check accidentally rejecting valid queries

@jovfer jovfer force-pushed the sm_fix_nested_window_groupby_validation branch from ff4b81e to f62483e Compare April 10, 2026 11:43
@jovfer jovfer changed the title fix(sql): reject window functions nested inside GROUP BY expressions fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results Apr 10, 2026
@questdb questdb deleted a comment from glasstiger Apr 10, 2026
@jovfer jovfer force-pushed the sm_fix_nested_window_groupby_validation branch 2 times, most recently from 150dcd9 to 30deaa7 Compare April 10, 2026 14:05
@jovfer jovfer changed the title fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results Apr 10, 2026
@jovfer jovfer force-pushed the sm_fix_nested_window_groupby_validation branch from 59343af to 94e22a4 Compare April 13, 2026 20:33
@jovfer jovfer changed the title fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results [WIP] fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results Apr 13, 2026
@jovfer jovfer changed the base branch from master to sm_fix_window_in_aggregate_groupby April 13, 2026 20:34
@jovfer jovfer changed the title [WIP] fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results Apr 13, 2026
@jovfer jovfer changed the base branch from sm_fix_window_in_aggregate_groupby to master April 14, 2026 10:32
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@jovfer jovfer added the DO NOT MERGE These changes should not be merged to main branch label Apr 14, 2026
@jovfer jovfer changed the title fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results [WIP] fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results Apr 14, 2026
Window functions nested inside arithmetic expressions (e.g.,
avg(x) - avg(x) OVER ()) bypassed the GROUP BY + window function
validation because the top-level column was an operator, not a
WindowExpression. The existing check only detected top-level window
columns via REWRITE_STATUS flags, so such queries silently produced
wrong results.

Walk the AST of each SELECT column at the start of the rewrite
loop to detect nested window expressions that sit outside any
aggregate, and reject the query with the same error as top-level
window + GROUP BY mixing.

The new traversal skips subtrees rooted at aggregate functions
because PR #6955 made max(avg(x) OVER (...)) and similar shapes
legal: those window functions are lowered into an inner window
model. Only windows that reach the top of the SELECT column
without an enclosing aggregate are unrepresentable.

The guard fires both for explicit GROUP BY and for implicit
grouping (bare aggregates without a GROUP BY clause, e.g.
SELECT sum(x) + avg(x) OVER () FROM t). The implicit case is
safe because the pre-scan sets REWRITE_STATUS_USE_GROUP_BY_MODEL
only when real aggregates are present (checkForChildAggregates
skips window functions via windowExpression == null), so queries
like SELECT (x - avg(x) OVER ()) FROM t are unaffected.

The SqlException position points at the nested window function
itself (e.g., avg at position 35) rather than the root operator,
matching the convention that error positions identify the
specific offending character.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jovfer jovfer force-pushed the sm_fix_nested_window_groupby_validation branch from 94e22a4 to 1b01325 Compare April 24, 2026 18:40
@jovfer jovfer changed the title [WIP] fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results fix(sql): fix window functions nested inside GROUP BY expressions producing wrong results Apr 24, 2026
jovfer added 3 commits April 24, 2026 21:49
Extend findWindowFunctionOutsideAggregatePos to also match pure window
function names (row_number, rank, ...) used without an OVER clause next
to an aggregate. The previous form only checked windowExpression, so
SELECT sum(x) + row_number() FROM t fell through to a runtime
"empty window context" rather than a clean compile-time error.

Reword the comment on the aggregate-subtree skip to attribute the
windowExpression invariant to the parser (where it is actually set),
not to checkForChildAggregates, which only reads it.

Correct the adjacent comment describing what the top-level
REWRITE_STATUS check catches - only columns whose AST is itself a
window expression, not every top-level window reference.

Convert the method header to Javadoc to match neighbouring methods.

Add underscore thousands separators to the test data, matching the
project-wide rule for numbers of five or more digits.

Cover two more shapes in the regression test:
- SELECT sum(x) + row_number() FROM t, pinning the new pure-window-name
  path.
- SELECT ... GROUP BY category ORDER BY avg(x) - avg(x) OVER (), which
  moveOrderByFunctionsIntoOuterSelect lifts into the bottom-up column
  list, so the SELECT-list guard now catches it with a precise position.
PIVOT compiles into a SELECT with implicit GROUP BY and aggregates per
pivot value. The aggregator slot accepts whatever the user writes, but
windowExpression-bearing nodes (e.g. sum(41) over()) are not aggregates
and have no valid plan in that position. Before the new guard, the
optimiser silently lowered the window function and produced a
structural NULL NULL row from the empty test table - clearly wrong, but
not detected.

The new guard rejects sum(41) over() in PIVOT with the same
"Window function is not allowed in context of aggregation. Use
sub-query." message as any other window-outside-aggregate-with-GROUP-BY
shape. The neighbouring case sum(sum(41) over()) still passes because
the walker skips aggregate subtrees, matching the surrounding pivot
test expectations.

Convert the entry from "query 2 ... NULL NULL" to "query error ..." in
pivot_errors.test, which already documents PIVOT's other
non-aggregate rejection (now()).
@jovfer jovfer force-pushed the sm_fix_nested_window_groupby_validation branch 2 times, most recently from e47b6cc to 2f936ec Compare April 25, 2026 13:08
@jovfer jovfer removed the DO NOT MERGE These changes should not be merged to main branch label Apr 25, 2026
The rewriter only handles agg(window(x)) when the aggregate sits at the
column root (e.g. SELECT max(avg(x) OVER ()) FROM t GROUP BY cat).
Wrapping that aggregate in arithmetic, a function call, or a CASE branch
falls back to a code path that raises a misleading "Aggregate function
cannot be passed as an argument" error pointing at the inner window
function.

Add findInvalidAggregateOverWindowPos: walks each SELECT column's AST
under the existing GROUP BY context guard. Allows the column when its
root is the supported agg(window()) shape. Otherwise descends the tree
looking for an aggregate node that contains a window function, and
rejects with "Aggregate over window function cannot be combined with
other terms. Use a sub-query." pointing at the offending aggregate.

The walker shares its outer stack with findWindowFunctionOutsideAggregatePos
and reuses the static checkForChildWindowFunctions helper with sqlNodeStack2
to test whether each candidate aggregate's subtree contains a window. Both
guards run in the same call site; the bare-window-outside-aggregate case
takes precedence as the more obvious bug.

Cover the shapes the user-facing error now catches:
- max(avg(x) OVER ()) - min(avg(x) OVER ()) under GROUP BY and the same
  pattern without an explicit GROUP BY (implicit grouping from sibling
  aggregates).
- max(avg(x) OVER ()) + literal and max(avg(x) OVER ()) + agg(col).
- Aggregate-wrapped window inside a CASE THEN branch.

Pin the supported and adjacent cases as positive regression guards:
- Bare max(avg(x) OVER ()) FROM t GROUP BY cat.
- Two separate columns each with a bare aggregate-of-window.
- Plain aggregate arithmetic (max(x) + sum(x)) without any window.

Extend testNestedWindowFunctionInGroupByContextFails with three more
shapes:
- mixed aggregate-wrapped window plus a sibling bare window in the same
  expression - the existing walker correctly skips the aggregate subtree
  and still flags the bare window.
- GROUP BY ordinal that resolves to a window-bearing column - caught by
  the pre-existing validateGroupByExpression flow with the
  "aggregate functions are not allowed in GROUP BY" message.
- sub-query exposing the window column with the outer query aggregating
  it - positive regression guard for the suggested workaround.

Add a one-line ExpressionNode.java:64-68 invariant pointer above the
walk loop in findWindowFunctionOutsideAggregatePos so the
paramCount/lhs/rhs/args contract the walker depends on is documented at
the use site.

PR description gains a Known limitations section listing the two cases
that stay rejected: window outside any aggregate mixed with GROUP BY
(PostgreSQL accepts this) and any aggregate-wrapped window in a
compound expression. Both come with workarounds.
@jovfer jovfer force-pushed the sm_fix_nested_window_groupby_validation branch from 2f936ec to ddac142 Compare April 25, 2026 20:06
jovfer and others added 4 commits April 25, 2026 21:52
Apply review feedback on the window/group-by validation guard:

- findInvalidAggregateOverWindowPos now null-checks variadic args
  before pushing to the stack, matching the convention of every other
  AST walker in SqlOptimiser (checkForChildWindowFunctions,
  checkForWindowFunctionsOrNames, etc.). The variadic-poll also
  guards on isEmpty() for symmetry with the sibling walker.
- findWindowFunctionOrNamePosition now returns -1 for "not found"
  instead of 0, aligning with the two new sibling helpers and
  letting position 0 be a valid result. Internal recursive call
  sites updated from `pos > 0` to `pos >= 0`. The existing top-level
  callers in validateNoWindowFunctionsInWhereClauses are unaffected
  because they only invoke the helper after checkForWindowFunctionsOrNames
  has already confirmed a window function exists.
- Add SAMPLE BY coverage to both negative test methods. SAMPLE BY
  triggers implicit grouping through sibling aggregates, so the
  guard fires on both `bare window in expression` and
  `aggregate-over-window in expression` shapes under SAMPLE BY.
Round of polish on the window/group-by validation guard:

- Drop the redundant explicitGroupBy disjunct from the SELECT-list
  guard. REWRITE_STATUS_USE_GROUP_BY_MODEL is set whenever the GROUP
  BY clause is present and the model is not HORIZON JOIN, which is
  exactly the explicitGroupBy condition. The single bit-test is
  enough.
- Add an assert in findInvalidAggregateOverWindowPos enforcing that
  the caller has already invoked findWindowFunctionOutsideAggregatePos.
  The comment claimed this contract but nothing checked it; a future
  caller that copies one helper without the other would silently miss
  bare windows. The assert turns the implicit precondition into a
  hard failure under -ea.
- Rephrase the ExpressionNode.paramCount comment to point at the
  field's documentation by name instead of citing line 64-68 of
  ExpressionNode.java, which would rot on any reordering.
- Add CTE and JOIN coverage to testNestedWindowFunctionInGroupByContextFails.
  Both flow through rewriteSelectClause0, so the same SELECT-list
  guard catches the offending column with a precise position.
Copy link
Copy Markdown
Member

@bluestreak01 bluestreak01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the doc-comment and PR-description fixes from the review. Approved.

@mtopolnik
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 59 / 61 (96.72%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/SqlOptimiser.java 59 61 96.72%

@bluestreak01 bluestreak01 merged commit 7eec3a8 into master Apr 26, 2026
51 checks passed
@bluestreak01 bluestreak01 deleted the sm_fix_nested_window_groupby_validation branch April 26, 2026 04:39
nwoolmer added a commit that referenced this pull request Apr 28, 2026
Brings in 14 upstream commits (master..errand merge-base 5037278
through b6b3b15). Notable upstream changes:

* JDK 25 migration (#6980) — drops Java 8/11 reflection helpers
  (isJava8Or11, is32BitJVM, getOrdinaryObjectPointersCompressionStatus,
  AccessibleObject_override_fieldOffset, OVERRIDE constant) from
  Unsafe.java. Our errand branch had inherited them; they are dead
  code in our tree too (the only call site was the dropped helper
  itself). Followed upstream and removed them.
* feat(http): QWP egress / WebSocket SQL streaming (#6991, #7004)
* perf(sql): faster GROUP BY / hash join finalizer (#6997)
* fix(sql): nested window inside aggregate (#6943, #6955)
* feat(sql): configurable parquet export encodings (#6949)

Conflict resolutions:

* core/src/main/java/io/questdb/std/Unsafe.java
  Both sides added different methods between setRssMemLimit() and
  checkAllocLimit(). Kept all three: storeFence (upstream),
  chargeExternalRss (ours), and dropped the dead
  AccessibleObject_override_fieldOffset() helper per upstream's
  JDK 25 cleanup.

* core/src/main/java/io/questdb/std/str/GcUtf8String.java
  Our errand branch redesigned this class for lazy native allocation
  (data byte[] field, allocateNative() on first ptr() call) instead
  of upstream's eager DirectByteBuffer + reflection-fetched address.
  Upstream's only post-divergence change was a tiny refactor
  replacing Unsafe.getUnsafe().X with Unsafe.X — irrelevant to our
  redesign. Took ours.

* core/src/main/resources/io/questdb/bin/linux-x86-64/libquestdbr.so
  Built artifact — both sides changed Rust source. Took ours as
  placeholder; will refresh via `mvn package -pl core -am
  -DskipTests` immediately after this commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants