Skip to content

fix: add missing aggregate function recursion in transform_recursive#67

Open
goldmedal wants to merge 1 commit intotobilg:mainfrom
goldmedal:fix/qualify-column-corr
Open

fix: add missing aggregate function recursion in transform_recursive#67
goldmedal wants to merge 1 commit intotobilg:mainfrom
goldmedal:fix/qualify-column-corr

Conversation

@goldmedal
Copy link

@goldmedal goldmedal commented Mar 15, 2026

Summary

transform_recursive was not recursing into aggregate function expressions (Sum, Avg, Min, Max, Count, etc.), treating them as leaf nodes. This meant columns inside aggregates like MAX(a) or AVG(val) were never visited by qualify_columns or other transform passes.

  • Add recurse_agg! macro for AggFunc-based variants (follows existing transform_binary! pattern)
  • Add arms for all 24 missing aggregate functions
  • Handle Count separately (Option<Expression> this field)
  • Add StddevSamp arm (was missing)

Correlated subquery handling has been removed from this PR and tracked separately in #68.

Test plan

  • All 938 existing lib tests pass (cargo test -p polyglot-sql --lib)
  • New test: columns inside MAX(a) and ABS(a) get qualified
  • New test: COUNT(*) (Count { this: None }) doesn't panic

🤖 Generated with Claude Code

@tobilg
Copy link
Owner

tobilg commented Mar 15, 2026

Thanks for the contribution! The two fixes address real gaps. Some issues to address before merge:

Must fix

Regression: BitwiseOrAgg/BitwiseAndAgg/BitwiseXorAgg handlers removed from transform_recursive

The diff replaces the existing three Bitwise*Agg match arms with Sum/Avg/Min without re-adding them. The comment states:

// Note: Stddev, StddevSamp, Variance, ArrayAgg, BitwiseOrAgg, BitwiseAndAgg,
// BitwiseXorAgg are already handled above.

This is incorrect for BitwiseOrAgg, BitwiseAndAgg, and BitwiseXorAgg — they were only handled at this location (the removed arms). After this PR, transform_recursive will no
longer recurse into these three expression types.

Additionally, StddevSamp is listed as "already handled above" but there is no existing Expression::StddevSamp(mut f) arm in transform_recursive. Only Stddev (line 942) and
Variance (line 946) exist. Either add a StddevSamp arm or remove it from the comment.

Should fix

Overly permissive correlated reference check for qualified columns (qualify_columns.rs, first insertion around line 829):

// Column has table qualifier, table NOT in scope, NOT in schema
if resolver.column_exists_in_outer_schema_table(&col.name.name) {
return Ok(());
}

This checks the column name against all outer-scope tables but ignores the table qualifier. So nonexistent_table.real_column passes validation as long as real_column exists in
some outer-scope table — even if nonexistent_table doesn't exist anywhere. This weakens the UnknownTable error.

The second insertion (for unqualified columns, around line 858) is fine — an unqualified column referencing an outer scope is valid correlated subquery behavior.

Suggestions

  • Reduce repetition in transform_recursive: The 22 new match arms are all identical in structure. A macro would reduce ~80 lines to ~25:
  macro_rules! recurse_agg {
      ($variant:ident, $f:ident) => {{
          $f.this = transform_recursive($f.this, transform_fn)?;
          Expression::$variant($f)
      }};
  }
  • table_names() allocates on every call: Returns Vec via .keys().cloned().collect(). In column_exists_in_outer_schema_table this creates a fresh allocation per
    invocation. Consider returning an iterator instead.
  • More test coverage:
    • Correlated EXISTS subquery (the motivating case from the PR description / TPC-H Q4)
    • A negative test verifying truly invalid references still error
    • COUNT(*) (the Count variant has Option — the code handles it correctly but there's no test for the None/star case)

What looks good

  • Fixing transform_recursive to cover the common aggregate functions is the right call — Sum/Avg/Min/Max/Count are by far the most used
  • The Count variant correctly handles Option with take()
  • column_exists_in_outer_schema_table scope-exclusion logic is clean
  • New table_names() trait method + both Schema impls updated
  • Good test for aggregate column qualification (MAX(a)) and scalar functions (ABS(a))

goldmedal added a commit to goldmedal/polyglot that referenced this pull request Mar 16, 2026
Must fix:
- Re-add BitwiseOrAgg/BitwiseAndAgg/BitwiseXorAgg arms to transform_recursive
  (accidentally removed when adding aggregate function coverage)
- Add StddevSamp arm (was incorrectly listed as already handled in comment)
- Fix stale comment that claimed those variants were handled elsewhere

Should fix:
- Remove overly permissive column_exists_in_outer_schema_table fallback for
  qualified columns; nonexistent_table.real_col no longer passes validation
  just because real_col exists in some other schema table

Tests:
- Add correlated EXISTS subquery test (TPC-H Q4 pattern)
- Add negative test: nonexistent_table.id must error even if id exists elsewhere
- Add COUNT(*) test covering the Count { this: None } path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
goldmedal added a commit to goldmedal/polyglot that referenced this pull request Mar 16, 2026
- Add recurse_agg! macro in transform_recursive, collapsing 25 identical
  AggFunc arms (~80 lines) into one-liners; follows the existing
  transform_binary! pattern already in the function
- Change Schema::table_names() to return Box<dyn Iterator<Item = &str>>
  instead of Vec<String>, avoiding an upfront heap allocation on every call
  to column_exists_in_outer_schema_table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@goldmedal
Copy link
Author

Thanks @tobilg for the review. All comment is addressed.

@tobilg
Copy link
Owner

tobilg commented Mar 16, 2026

Thanks again for the update of this PR @goldmedal — solid work on both issues!

Aggregate recursion in transform_recursive — Approved

This was a real gap. All 28 Box variants are now covered. Before this PR, expressions like AVG(val) were hitting the other => other fallthrough, meaning their child
columns were never visited by transforms (including qualify_columns).

The recurse_agg! macro is clean and follows the existing transform_binary! pattern nicely. Count is correctly special-cased for Option to handle COUNT(*).

Issue: AggFunc.filter and AggFunc.order_by are still not recursed into. A column inside AVG(val) FILTER(WHERE status > 0) or ARRAY_AGG(x ORDER BY y) won't have status/y
visited. This is pre-existing behavior (same for ArrayAgg at line 1321), so not a regression — but since you're already touching all these arms, this would be the right time to
fix it. The macro could recurse into f.filter and f.order_by as well. If you'd rather keep this scoped, please at least add a // TODO comment.

Correlated subquery handling — Needs discussion

The column_exists_in_outer_schema_table() heuristic has a false-negative risk: if a user has a typo in a column name that happens to match a column in any other table in the
schema, the UnknownColumn error is silently suppressed. Example:

-- Schema: t1(id, name), t2(id, status), t3(typpo_col)
SELECT id FROM t1 WHERE typpo_col = 1
-- Today: UnknownColumn error (correct)
-- After this PR: silently passes because typpo_col exists in t3

I'd prefer to have it throw the existing error.

Schema::table_names() trait addition

Clean. Using Box<dyn Iterator<Item = &str> + '_> avoids allocation. Both MappingSchema and TypeInferenceSchema implementations are covered.

Tests — Good coverage

  • Agg column qualification (MAX(a), ABS(a))
  • Correlated scalar subquery with AVG(val)
  • Correlated EXISTS (TPC-H Q4 pattern)
  • Negative case: nonexistent_table.id correctly errors
  • COUNT(*) with None this field

Merge conflicts

This PR will have merge conflicts with main due to recent performance work that changed Expression::Column(Column) → Expression::Column(Box) and
Expression::Table(TableRef) → Expression::Table(Box). The conflicts are mechanical — pattern matches on Expression::Table(table) in resolver.rs will need adjustment.
Happy to help resolve if needed.

@goldmedal
Copy link
Author

Thanks for the detailed review — the false-negative concern is valid.

After investigating, I realized the root cause is architectural: qualify_columns uses a bottom-up transform_recursive traversal, so inner SELECTs are qualified before the outer scope is available. There's no way for the inner scope to know whether an unresolvable column is a correlated reference or a genuine typo without access to the outer scope's context.

The column_exists_in_outer_schema_table() heuristic was a workaround to approximate this, but as you pointed out, it's too broad. I explored a retry-based approach (intercept Subquery/Exists nodes on UnknownColumn, clear the error, re-qualify with allow_partial=true, and let the outer scope's pass pick up correlated refs), but this has its own issues:

Requires intercepting every expression type that can wrap a correlated subquery (Subquery, Exists, In, etc.)
Relies on the assumption that a failed qualification doesn't partially mutate the AST
Runs the full qualification pipeline twice for each subquery scope
A proper fix requires refactoring to scope-tree-based top-down qualification — where inner scopes can resolve columns against parent scopes in a single pass, similar to how sqlglot's Python implementation works. That's a significant effort (~40-60 hours, touching ~700-1000 lines) and should be its own initiative.

Given this, I'd like to scope this PR down to just the aggregate function changes (transform_recursive coverage for Sum, Avg, Count, etc.) and remove the correlated subquery handling entirely.

I'll open a separate issue to track the correlated subquery qualification problem with the architectural context above.

What do u think 🤔 ? I'll convert this to a draft first.

@goldmedal goldmedal marked this pull request as draft March 16, 2026 17:44
@tobilg
Copy link
Owner

tobilg commented Mar 16, 2026

Yeah, I think to split this makes absolute sense. Thanks for your support and understanding. I'll think about it as well.

transform_recursive was not recursing into aggregate function expressions
(Sum, Avg, Min, Max, Count, etc.), treating them as leaf nodes. This meant
columns inside aggregates like `MAX(a)` or `AVG(val)` were never visited
by qualify_columns or other transform passes.

- Add recurse_agg! macro for AggFunc-based variants (follows existing
  transform_binary! pattern)
- Add arms for all 24 missing aggregate functions
- Handle Count separately (Option<Expression> `this` field)
- Add StddevSamp arm (was missing)
- Add test coverage for MAX(a), ABS(a), and COUNT(*)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@goldmedal
Copy link
Author

Filed #68 for the correlated subquery issue.

@goldmedal goldmedal changed the title fix: qualify_columns handles correlated subqueries and aggregate functions fix: add missing aggregate function recursion in transform_recursive Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants