Skip to content

fix: resolve nested CTE SELECT * in column lineage#69

Open
eitsupi wants to merge 14 commits intotobilg:mainfrom
eitsupi:fix/lineage-cte-select-star
Open

fix: resolve nested CTE SELECT * in column lineage#69
eitsupi wants to merge 14 commits intotobilg:mainfrom
eitsupi:fix/lineage-cte-select-star

Conversation

@eitsupi
Copy link

@eitsupi eitsupi commented Mar 18, 2026

Summary

  • Expand CTE-based SELECT * in lineage() so that columns can be traced through CTEs without requiring external schema metadata
  • Handle nested CTEs (e.g., cte2 AS (SELECT * FROM cte1)) by walking CTE definitions in order and propagating resolved column lists, replacing the previous qualify_columns-based approach which processed each SELECT independently
  • Support edge cases: UNION CTE bodies (left branch), explicit CTE column lists (cte(a, b) AS (...)), qualified stars (cte.*), recursive CTE skip, unknown table graceful fallback
  • Schema-aware expansion: Pass optional schema to expand_cte_stars so that stars from external tables (not CTEs) can also be resolved via schema column lookup
  • Table-qualified star expansion: Expanded columns from SELECT * now carry their source table qualifier, enabling proper lineage tracing through nested CTE chains with JOINs
  • Sibling CTE resolution: resolve_qualified_column now checks ancestor CTE scopes for sibling CTEs in parent WITH clauses, fixing lineage resolution failures when a CTE references another CTE defined in the same WITH block
  • Recursion depth limit: Add MAX_LINEAGE_DEPTH (64) counter to to_node_inner and all recursive callers to prevent stack overflow on circular or deeply nested CTE chains. get_leftmost_select_mut also converted to iterative loop with the same depth guard.
  • SQL identifier case semantics for CTE name matching: Added normalize_cte_name() helper that lowercases unquoted identifiers (case-insensitive per SQL spec) and preserves quoted identifiers as-is (case-sensitive), matching sqlglot normalize_identifiers behavior. Introduced SourceName struct to carry normalization info through star expansion.
  • Make expand_cte_stars public as a building block for consumers who need to pre-process CTE star expansion independently (e.g., with schema information) before running lineage analysis

Known limitations

The star expansion path (expand_cte_stars) correctly handles quoted vs unquoted CTE name matching, but the broader scope/lineage resolution paths (add_table_to_scope in scope.rs, resolve_qualified_column in lineage.rs) still use eq_ignore_ascii_case without respecting the quoted flag on identifiers. Known-bug tests document this for future work.

Test plan

  • Single CTE, nested (2-level, 3-level), UNION body, explicit column lists, qualified stars
  • Schema-based expansion: external tables, 3-part qualified names, nested CTEs with JOINs
  • Backward compatibility without schema
  • CTE alias resolution with and without schema
  • Quoted/unquoted CTE case sensitivity: case-insensitive unquoted, case-preserved quoted, case-mismatch no expansion, mixed chains
  • Known-bug tests for scope/lineage path identifier case sensitivity

🤖 Generated with Claude Code

eitsupi and others added 5 commits March 18, 2026 13:33
lineage() (without schema) failed to resolve columns through CTEs
when the outer query used SELECT *, because find_select_expr() could
not match column names against the Star expression.

Pre-qualify the AST with an empty MappingSchema to expand stars using
CTE column metadata before building the lineage scope. A new
`qualify_columns` option on QualifyColumnsOptions allows skipping
column qualification while still performing star expansion.

Known limitation: nested CTE star expansion (e.g.,
cte2 AS (SELECT * FROM cte1)) is not yet supported because
qualify_columns processes each SELECT independently.
Replace qualify_columns-based star expansion with a dedicated
expand_cte_stars preprocessing step that walks CTEs in definition
order and propagates resolved column lists. This enables nested
CTE patterns like `cte2 AS (SELECT * FROM cte1)` which
qualify_columns could not resolve because it processes each SELECT
independently via transform_recursive.

Also removes the now-unused `qualify_columns` flag from
QualifyColumnsOptions since its only consumer was the replaced
code path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move expand_cte_stars to public callers (lineage, lineage_with_schema)
  so it runs on already-owned expressions, removing a redundant deep
  clone in lineage_from_expression
- Merge extract_and_expand_select_columns and expand_select_stars into
  a single rewrite_stars_in_select function
- Add test for qualified star expansion (SELECT cte.* FROM cte)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow callers to use expand_cte_stars() for pre-processing SQL expressions
before extracting column information, enabling column inference from
compiled SQL in tools that build schemas from dbt manifests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eitsupi eitsupi marked this pull request as draft March 18, 2026 15:02
When CTEs reference external tables via SELECT *, the star cannot be
expanded using CTE-only resolution. By passing the optional schema to
expand_cte_stars, external table columns can be looked up as a fallback,
enabling correct lineage tracing through patterns like:

  WITH orders AS (SELECT * FROM stg_orders)
  SELECT orders.* FROM orders

This is essential for dbt projects where compiled SQL uses fully-qualified
table references (e.g., "db"."schema"."table") wrapped in CTEs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eitsupi eitsupi marked this pull request as ready for review March 18, 2026 15:12
@eitsupi eitsupi marked this pull request as draft March 18, 2026 16:11
When expand_cte_stars expanded SELECT * expressions, the resulting
column references lost their source table qualifier. This caused
lineage resolution to fail for nested CTE chains (e.g.,
base -> with_payments (JOIN) -> final -> outer SELECT) where
resolve_unqualified_column could not determine which source a
column belonged to when there were multiple FROM sources.

Changes:
- expand_star_from_sources now returns (source_alias, column_name)
  pairs instead of just column names
- rewrite_stars_in_select sets the table qualifier on expanded columns
  from unqualified stars, enabling proper lineage tracing
- resolve_qualified_column now checks ancestor CTE scopes in addition
  to current scope's cte_sources for sibling CTE references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eitsupi eitsupi marked this pull request as ready for review March 18, 2026 16:31
@eitsupi eitsupi marked this pull request as draft March 18, 2026 17:17
Add a MAX_LINEAGE_DEPTH (64) counter to to_node_inner and all recursive
callers (handle_set_operation, resolve_qualified_column,
resolve_unqualified_column) to prevent stack overflow on circular or
deeply nested CTE chains. Returns an error instead of crashing when
the depth limit is exceeded.

Fixes stack overflow on queries with CTE+SELECT*+JOIN+CASE patterns
such as jaffle-shop's orders model.
@eitsupi eitsupi marked this pull request as ready for review March 18, 2026 18:07
When a query uses `FROM my_cte AS alias`, the scope's `sources` map
stores the alias name as the key, but `cte_sources` only contains the
original CTE name. This caused `resolve_qualified_column` to fail the
CTE check and fall through to a terminal node, stopping lineage tracing
at the alias instead of tracing through the CTE body.

Add `resolve_cte_alias()` to detect when a source name is an alias for
a CTE by checking if the source's expression is a CTE expression, and
extract the original CTE name for scope lookup.

This mirrors the behavior of Python sqlglot where `scope.sources[alias]`
directly returns the CTE Scope object, making alias resolution implicit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eitsupi
Copy link
Author

eitsupi commented Mar 20, 2026

@tobilg Could you take a look at this?

@tobilg
Copy link
Owner

tobilg commented Mar 20, 2026

Code Review: PR #69

Overview

This PR adds CTE star expansion to the lineage engine, solving a real limitation where SELECT * in CTEs broke column lineage tracing. It rewrites star expressions into explicit column references before lineage resolution runs, handling nested CTE chains, schema-based external table lookup, qualified stars (cte.*), and CTE alias resolution.

Strengths

  • Well-scoped approach: Expanding stars as a pre-processing pass before lineage resolution is clean and avoids deep changes to the core algorithm
  • Excellent test coverage: 18 new tests covering single CTEs, nested chains (2 & 3 levels), UNION bodies, explicit column lists, qualified stars, schema-based expansion, CTE aliases with JOINs, and unknown table graceful fallback
  • Recursion depth guard: MAX_LINEAGE_DEPTH = 64 prevents stack overflow on circular CTEs — returns a proper error instead of crashing
  • Backward compatibility: The no-schema path continues to work; schema is opt-in

Issues & Suggestions

1. Clone in lineage() — unnecessary allocation (minor)

The schema-less lineage() now always clones the expression even when there are no CTEs or no stars. Consider a fast-path check (e.g., does the expression even have a WITH clause?) before cloning.

2. Unqualified star expansion is all-or-nothing (potential correctness issue)

In expand_star_from_sources, if any source in an unqualified SELECT * can't be resolved, the entire expansion is aborted. This means a query like WITH cte AS (SELECT a FROM t) SELECT * FROM cte JOIN unknown_table won't expand at all, even though the CTE columns are known. This is a conservative choice but worth documenting or reconsidering — partial expansion could be useful.

3. UNION column resolution only uses left branch

get_leftmost_select_mut drills into the left branch for UNION/INTERSECT/EXCEPT. This is fine for column names (SQL standard says the left branch determines names), but worth a comment noting this is intentional and matches sqlglot's behavior.

4. Case sensitivity handling

CTE name lookups use .to_lowercase() throughout. Correct for most dialects but could be surprising for case-sensitive dialects (e.g., some Postgres or DuckDB configurations with quoted identifiers). Not a blocker — just worth noting as a known limitation.

5. expand_cte_stars is made pub — API surface

The PR description mentions making this public for downstream use. Consider whether this should be a documented public API or pub(crate) to avoid committing to a stable interface prematurely.

6. Two star AST patterns need handling

The Star variant check and Column with name == "*" check represent two different AST representations of star expressions. A brief comment explaining why both patterns need handling would help future maintainers.

@eitsupi
Copy link
Author

eitsupi commented Mar 20, 2026

Thank you for the thorough review! I have addressed all six points.

Commits

  • 9492636 — address review findings (points 1, 2, 3, 6)
  • a627cee — implement SQL identifier case semantics for CTE name matching (point 4)
  • 6ee6726 — add known-bug tests for remaining identifier case sensitivity issues in scope/lineage
  • a5b0009 — convert get_leftmost_select_mut to iterative loop, fix doc comment placement

Changes made

Point 1: Clone optimization in lineage() (9492636) — Added a fast-path check: if the expression has no WITH clause, we skip the clone entirely and call lineage_from_expression directly.

Point 2: Unqualified star expansion is all-or-nothing (9492636) — This is intentional. Partial expansion (resolving only some sources) would produce an incomplete column list, causing downstream lineage resolution to silently omit columns or produce incorrect results. I confirmed that sqlglot also requires all sources to be resolvable — it raises SqlglotError when schema is missing, while our implementation conservatively returns None. Added a comment explaining the rationale.

Point 3: UNION column resolution uses left branch (9492636) — Added a doc comment noting this follows the SQL standard and matches sqlglot behavior. Also converted get_leftmost_select_mut from recursive to iterative with a MAX_LINEAGE_DEPTH guard to prevent stack overflow on deeply nested set operations (a5b0009).

Point 4: Case sensitivity (a627cee) — Implemented proper SQL identifier case semantics rather than just documenting a limitation. Added a normalize_cte_name() helper that lowercases unquoted identifiers (case-insensitive per SQL spec) and preserves quoted identifiers as-is (case-sensitive). Introduced a SourceName struct and updated expand_star_from_sources to accept Option<&Identifier> for qualifiers so that quoting information flows through the star expansion path. This matches sqlglot normalize_identifiers behavior. Added four test cases validated against sqlglot v30.0.3.

Note: the star expansion path (expand_cte_stars) now handles quoted identifiers correctly, but the broader scope/lineage resolution paths (add_table_to_scope in scope.rs, resolve_qualified_column in lineage.rs) still use eq_ignore_ascii_case without respecting the quoted flag. Known-bug tests are added in 6ee6726 documenting this. Fixing these paths consistently requires changes across multiple modules and is tracked separately.

Point 5: expand_cte_stars visibility — Keeping as pub. This function is a useful building block for consumers who need to pre-process CTE star expansion independently (e.g., with schema information) before running lineage analysis. The higher-level lineage() and lineage_with_schema() call it internally, but direct access enables more flexible usage patterns.

Point 6: Two star AST patterns (9492636) — The parser produces two distinct AST representations for star expressions: SELECT * becomes Expression::Star, while SELECT table.* becomes Expression::Column { name: "*", table: Some(...) }. Both represent "all columns" intent but with different syntactic forms, so the expansion logic must match both patterns. Added a comment in rewrite_stars_in_select explaining this.

- Skip unnecessary clone in lineage() when no WITH clause is present
- Document intentional conservative behavior for partial star expansion
- Add comments for UNION left-branch column resolution (SQL standard)
- Note case-insensitive CTE matching as a known limitation
- Explain dual Star AST representation (Star vs Column{name:"*"})

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eitsupi eitsupi force-pushed the fix/lineage-cte-select-star branch from c93dfec to 9492636 Compare March 20, 2026 13:28
@eitsupi eitsupi marked this pull request as draft March 20, 2026 13:34
eitsupi and others added 3 commits March 20, 2026 13:48
Unquoted CTE names are compared case-insensitively (lowercased),
quoted names preserve their original case. This matches sqlglot's
normalize_identifiers behavior.

- Add normalize_cte_name() helper that respects Identifier.quoted
- Introduce SourceName struct to carry normalized names through
  star expansion pipeline
- Update expand_star_from_sources to accept Identifier qualifiers
- Add tests for quoted/unquoted/mixed CTE name matching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…neage

Add two tests that assert the current buggy behavior where scope.rs
add_table_to_scope uses eq_ignore_ascii_case for all identifiers
including quoted ones. Per SQL semantics (and sqlglot behavior),
quoted identifiers should be case-sensitive: "mycte" should NOT
match CTE "MyCte".

These tests document the bug and will fail when it is fixed,
prompting the assertions to be updated to correct behavior.

The fix requires changes across scope.rs and lineage.rs CTE
resolution, which is broader than the star expansion scope.
…mment placement

- Convert recursive get_leftmost_select_mut to iterative loop with
  MAX_LINEAGE_DEPTH guard to prevent stack overflow on pathologically
  deep set operation nesting.
- Move doc comment from SourceName struct to get_select_source_names
  function where it belongs.
@eitsupi
Copy link
Author

eitsupi commented Mar 20, 2026

@tobilg I apologize for the multiple edits. Please check it again.

You are free to edit this PR as you wish.
Please let me know if this does not meet your needs.

@eitsupi eitsupi marked this pull request as ready for review March 20, 2026 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants