Fix qualify_columns failing on correlated scalar subqueries #51

Merged
tobilg merged 3 commits into tobilg:main from karakanb:fix-subquery-scope-lineage
Mar 3, 2026

Conversation

@karakanb
Contributor

@karakanb karakanb commented Mar 3, 2026

Summary

  • lineage_with_schema (and qualify_columns) failed when a query contained a correlated scalar subquery referencing an outer table (e.g., SELECT id, (SELECT AVG(val) FROM t2 WHERE t2.id = t1.id) FROM t1)
  • The inner subquery's scope only contained its own sources (t2), so the reference to outer table t1 triggered an UnknownTable error
  • Fix: before erroring on an unresolved table qualifier, check if the table exists in the schema — if it does, treat it as a correlated outer reference and allow it

This fixes #46, #47, and #48
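The lookup order described in the summary can be sketched with a toy model. This is only an illustration, not the project's code: `check_table_qualifier`, `UnknownTableError`, and the plain sets standing in for the scope and the schema are all hypothetical names chosen for this example.

```python
class UnknownTableError(Exception):
    """Raised when a table qualifier is in neither the scope nor the schema."""


def check_table_qualifier(table, scope_sources, schema_tables):
    """Classify a table qualifier found inside a (sub)query.

    scope_sources: names of the sources visible in the current scope (assumed a set).
    schema_tables: names of all tables known to the schema (assumed a set).
    """
    if table in scope_sources:
        # Normal case: the table is one of the subquery's own sources.
        return "local"
    if table in schema_tables:
        # Not in scope but known to the schema: treat it as a
        # correlated reference to an outer query and allow it.
        return "outer"
    # Unknown everywhere: still an error.
    raise UnknownTableError(table)


# Scope of the inner SELECT in:
#   SELECT id, (SELECT AVG(val) FROM t2 WHERE t2.id = t1.id) FROM t1
inner_scope = {"t2"}
schema_tables = {"t1", "t2"}

print(check_table_qualifier("t2", inner_scope, schema_tables))
print(check_table_qualifier("t1", inner_scope, schema_tables))
```

Note that a schema check is looser than tracking the real outer scope: in this model, any table known to the schema is accepted inside the subquery, whether or not the outer query actually selects from it.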

Test plan

  • New test test_qualify_columns_correlated_scalar_subquery verifies qualification succeeds and both inner/outer columns are resolved
  • New test test_qualify_columns_rejects_unknown_table verifies tables in neither scope nor schema still produce errors
  • New test test_lineage_with_schema_correlated_scalar_subquery verifies end-to-end lineage on the exact failing query
  • All 857 existing tests continue to pass

🤖 Generated with Claude Code

karakanb and others added 3 commits March 2, 2026 22:53
When a query contains a correlated scalar subquery (e.g.,
`SELECT id, (SELECT AVG(val) FROM t2 WHERE t2.id = t1.id) FROM t1`),
qualify_columns built an isolated scope for the inner SELECT that only
contained the subquery's own sources (t2). References to the outer
table (t1) triggered an UnknownTable error because the outer scope
was not visible.

The fix checks whether an unresolved table qualifier exists in the
schema before erroring. If the table is known in the schema but not
in the current scope, it is treated as a correlated outer reference
and left as-is. Tables that exist in neither scope nor schema still
produce an error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When a query uses JOIN USING(col), the shared column exists in both
tables, making it ambiguous for the resolver. qualify_columns had no
awareness of USING columns and failed to resolve them.

The fix registers USING columns with the resolver before qualifying
expressions. Each USING column is mapped to the first FROM-clause
source that contains it (the left side of the join). The resolver
then checks this mapping when its standard unambiguous-column lookup
fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
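The USING handling in the commit message above might look roughly like the following. It is a minimal sketch under stated assumptions: `register_using_columns`, `resolve_column`, and the list-of-tuples representation of FROM-clause sources are hypothetical and do not come from the project.

```python
def register_using_columns(using_columns, from_sources):
    """Map each USING column to the first FROM-clause source that contains it.

    from_sources: ordered list of (source_name, column_names) pairs,
    in FROM-clause order, so the left side of a join comes first.
    """
    mapping = {}
    for column in using_columns:
        for source_name, column_names in from_sources:
            if column in column_names:
                mapping[column] = source_name
                break
    return mapping


def resolve_column(column, from_sources, using_map):
    """Return the source that an unqualified column belongs to."""
    owners = [name for name, cols in from_sources if column in cols]
    if len(owners) == 1:
        # Standard lookup: exactly one source has this column.
        return owners[0]
    if column in using_map:
        # Fallback: the column is shared through JOIN USING(...).
        return using_map[column]
    raise LookupError(f"cannot resolve column {column!r}")


# Models: SELECT id, a, b FROM t1 JOIN t2 USING (id)
sources = [("t1", ["id", "a"]), ("t2", ["id", "b"])]
using_map = register_using_columns(["id"], sources)

print(using_map)
print(resolve_column("id", sources, using_map))
print(resolve_column("b", sources, using_map))
```

Consulting the USING map only after the ordinary lookup fails keeps the behavior for non-USING queries unchanged, which matches the order described in the commit message.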

The resolver's extract_columns_from_source used only the table name
(e.g. "t1") when looking up columns in the schema, ignoring the
schema and catalog qualifiers. When a table was registered as
"raw.t1", the lookup for just "t1" failed because MappingSchema
stores entries hierarchically.

Build the fully qualified name (catalog.schema.table) from the
TableRef before calling schema.column_names().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
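To illustrate why the bare table name missed the entry, here is a small sketch of building the fully qualified name before the lookup. The `TableRef` dataclass below is a hypothetical stand-in with assumed field names, and a flat dict keyed by dotted name replaces the real hierarchical schema; neither reflects the project's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableRef:
    """Hypothetical stand-in for a table reference with optional qualifiers."""
    name: str
    schema: Optional[str] = None
    catalog: Optional[str] = None


def qualified_name(table):
    """Join catalog, schema, and table name, skipping the parts that are unset."""
    parts = [table.catalog, table.schema, table.name]
    return ".".join(part for part in parts if part)


# Toy schema: the table was registered as "raw.t1", not "t1".
columns_by_table = {"raw.t1": ["id", "val"]}

ref = TableRef(name="t1", schema="raw")

# Looking up only the bare name misses the entry ...
print(columns_by_table.get(ref.name))
# ... while the qualified name finds it.
print(columns_by_table.get(qualified_name(ref)))
```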
@tobilg tobilg merged commit ca0cae3 into tobilg:main Mar 3, 2026
10 of 15 checks passed
@tobilg
Owner

tobilg commented Mar 3, 2026

Merged, thanks!

Development

Successfully merging this pull request may close these issues.

lineage_with_schema fails with "Unknown column" when JOIN uses USING instead of ON

2 participants