fix: resolve nested CTE SELECT * in column lineage #69
eitsupi wants to merge 14 commits into tobilg:main from …
lineage() (without schema) failed to resolve columns through CTEs when the outer query used SELECT *, because find_select_expr() could not match column names against the Star expression. Pre-qualify the AST with an empty MappingSchema to expand stars using CTE column metadata before building the lineage scope. A new `qualify_columns` option on QualifyColumnsOptions allows skipping column qualification while still performing star expansion. Known limitation: nested CTE star expansion (e.g., cte2 AS (SELECT * FROM cte1)) is not yet supported because qualify_columns processes each SELECT independently.
Replace qualify_columns-based star expansion with a dedicated expand_cte_stars preprocessing step that walks CTEs in definition order and propagates resolved column lists. This enables nested CTE patterns like `cte2 AS (SELECT * FROM cte1)` which qualify_columns could not resolve because it processes each SELECT independently via transform_recursive. Also removes the now-unused `qualify_columns` flag from QualifyColumnsOptions since its only consumer was the replaced code path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
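The in-order walk described in this commit can be illustrated with a minimal sketch. It is not the code from this PR: the `Item` and `Cte` types and the `expand_stars_in_order` function are hypothetical stand-ins for the real AST, and each CTE is assumed to select from exactly one source.

```rust
use std::collections::HashMap;

// Hypothetical, simplified model (not the PR's AST): a projection item is
// either a named column or an unqualified star.
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Column(String),
    Star,
}

#[derive(Debug, Clone)]
struct Cte {
    name: String,
    from: String,     // assumption: a single FROM source per CTE
    items: Vec<Item>, // projection list
}

/// Walk CTEs in definition order. Each fully resolved column list is
/// recorded so that a later CTE selecting `*` from an earlier one can be
/// expanded. Stars over unknown sources are left untouched.
fn expand_stars_in_order(ctes: &mut [Cte]) -> HashMap<String, Vec<String>> {
    let mut resolved: HashMap<String, Vec<String>> = HashMap::new();
    for cte in ctes.iter_mut() {
        let mut expanded = Vec::new();
        let mut fully_resolved = true;
        for item in &cte.items {
            match item {
                Item::Column(c) => expanded.push(Item::Column(c.clone())),
                Item::Star => match resolved.get(&cte.from) {
                    Some(cols) => expanded.extend(cols.iter().cloned().map(Item::Column)),
                    None => {
                        expanded.push(Item::Star);
                        fully_resolved = false;
                    }
                },
            }
        }
        cte.items = expanded;
        if fully_resolved {
            let cols: Vec<String> = cte
                .items
                .iter()
                .filter_map(|i| match i {
                    Item::Column(c) => Some(c.clone()),
                    Item::Star => None,
                })
                .collect();
            resolved.insert(cte.name.clone(), cols);
        }
    }
    resolved
}
```

Because the map of resolved columns is carried across iterations, `cte2 AS (SELECT * FROM cte1)` sees the columns of `cte1`, which a pass that treats each SELECT independently cannot do.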
- Move expand_cte_stars to public callers (lineage, lineage_with_schema) so it runs on already-owned expressions, removing a redundant deep clone in lineage_from_expression
- Merge extract_and_expand_select_columns and expand_select_stars into a single rewrite_stars_in_select function
- Add test for qualified star expansion (SELECT cte.* FROM cte)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow callers to use expand_cte_stars() for pre-processing SQL expressions before extracting column information, enabling column inference from compiled SQL in tools that build schemas from dbt manifests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When CTEs reference external tables via SELECT *, the star cannot be expanded using CTE-only resolution. By passing the optional schema to expand_cte_stars, external table columns can be looked up as a fallback, enabling correct lineage tracing through patterns like `WITH orders AS (SELECT * FROM stg_orders) SELECT orders.* FROM orders`.

This is essential for dbt projects where compiled SQL uses fully-qualified table references (e.g., "db"."schema"."table") wrapped in CTEs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
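The fallback order described here (CTE columns first, external schema second) might look roughly like the sketch below. The `source_columns` function and the plain `HashMap` schema are assumptions made for illustration; the real code works with the crate's own schema type.

```rust
use std::collections::HashMap;

type Columns = Vec<String>;

/// Look up the columns of a source. CTEs resolved so far take precedence;
/// the optional external schema is consulted only as a fallback.
/// (Hypothetical helper: both maps are keyed by plain table name.)
fn source_columns<'a>(
    source: &str,
    resolved_ctes: &'a HashMap<String, Columns>,
    schema: Option<&'a HashMap<String, Columns>>,
) -> Option<&'a Columns> {
    resolved_ctes
        .get(source)
        .or_else(|| schema.and_then(|s| s.get(source)))
}
```

Without a schema, a star over `stg_orders` stays unresolved; with one, its columns become available to the CTE that wraps it.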
When expand_cte_stars expanded SELECT * expressions, the resulting column references lost their source table qualifier. This caused lineage resolution to fail for nested CTE chains (e.g., base -> with_payments (JOIN) -> final -> outer SELECT) where resolve_unqualified_column could not determine which source a column belonged to when there were multiple FROM sources.

Changes:
- expand_star_from_sources now returns (source_alias, column_name) pairs instead of just column names
- rewrite_stars_in_select sets the table qualifier on expanded columns from unqualified stars, enabling proper lineage tracing
- resolve_qualified_column now checks ancestor CTE scopes in addition to current scope's cte_sources for sibling CTE references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
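To make the `(source_alias, column_name)` idea concrete, here is a small sketch under simplifying assumptions. The `expand_star` function and its `HashMap` of known columns are hypothetical, not the signature of `expand_star_from_sources`. The sketch returns nothing unless every source resolves.

```rust
use std::collections::HashMap;

/// Expand an unqualified `*` over several FROM sources into
/// `(source_alias, column_name)` pairs. Returns `None` if any source is
/// unresolved, since a partial list would silently drop columns.
/// (Illustrative only; names and types are assumptions.)
fn expand_star(
    sources: &[&str],
    known: &HashMap<&str, Vec<&str>>,
) -> Option<Vec<(String, String)>> {
    let mut pairs = Vec::new();
    for &alias in sources {
        // `?` aborts the whole expansion on the first unknown source.
        let cols = known.get(alias)?;
        for &col in cols {
            pairs.push((alias.to_string(), col.to_string()));
        }
    }
    Some(pairs)
}
```

Keeping the alias next to each column name is what lets a later step tell `base.id` from a same-named column in another joined source.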
Add a MAX_LINEAGE_DEPTH (64) counter to to_node_inner and all recursive callers (handle_set_operation, resolve_qualified_column, resolve_unqualified_column) to prevent stack overflow on circular or deeply nested CTE chains. Returns an error instead of crashing when the depth limit is exceeded. Fixes stack overflow on queries with CTE+SELECT*+JOIN+CASE patterns such as jaffle-shop's orders model.
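A depth counter of this kind can be sketched as follows. The `trace` function and the index-based node list are hypothetical stand-ins for the recursive lineage functions named above; only the limit of 64 comes from the commit message.

```rust
// Mirrors the limit quoted in the commit message.
const MAX_DEPTH: usize = 64;

/// Follow a chain of "selects from" links, threading a depth counter
/// through every recursive call. Each entry in `nodes` optionally holds
/// the index of its upstream node. (Hypothetical model, not the PR's code.)
fn trace(nodes: &[Option<usize>], start: usize, depth: usize) -> Result<usize, String> {
    if depth > MAX_DEPTH {
        // Report an error instead of recursing until the stack overflows.
        return Err(format!("lineage depth limit ({}) exceeded", MAX_DEPTH));
    }
    match nodes.get(start).copied().flatten() {
        Some(next) => trace(nodes, next, depth + 1),
        None => Ok(start), // no upstream link: treat as terminal
    }
}
```

A cyclic chain such as `[Some(1), Some(0)]` then yields an `Err` after a bounded number of calls rather than crashing the process.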
When a query uses `FROM my_cte AS alias`, the scope's `sources` map stores the alias name as the key, but `cte_sources` only contains the original CTE name. This caused `resolve_qualified_column` to fail the CTE check and fall through to a terminal node, stopping lineage tracing at the alias instead of tracing through the CTE body. Add `resolve_cte_alias()` to detect when a source name is an alias for a CTE by checking if the source's expression is a CTE expression, and extract the original CTE name for scope lookup. This mirrors the behavior of Python sqlglot where `scope.sources[alias]` directly returns the CTE Scope object, making alias resolution implicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
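The alias lookup can be pictured with the sketch below. The `Source` enum and `resolve_cte_name` are simplified assumptions, not the actual `resolve_cte_alias()` implementation, which inspects the source's expression in the real scope structure.

```rust
use std::collections::HashMap;

// Hypothetical model of a scope source: either a reference to a CTE
// (remembering the CTE's original name) or a plain table.
enum Source {
    Cte { name: String },
    Table,
}

/// Map a source name used in the query to the CTE name to look up.
/// A direct CTE name is returned unchanged; an alias whose source is a
/// CTE reference is translated back to the original CTE name.
fn resolve_cte_name<'a>(
    source_name: &'a str,
    sources: &'a HashMap<String, Source>,
    cte_sources: &HashMap<String, Vec<String>>,
) -> Option<&'a str> {
    if cte_sources.contains_key(source_name) {
        return Some(source_name);
    }
    match sources.get(source_name) {
        Some(Source::Cte { name }) => Some(name.as_str()),
        _ => None, // plain table or unknown: lineage stops here
    }
}
```

With `FROM my_cte AS alias`, a lookup for `alias` resolves to `my_cte`, so tracing can continue into the CTE body.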
@tobilg Could you take a look at this?
Code Review: PR #69

Overview

This PR adds CTE star expansion to the lineage engine, solving a real limitation where …

Strengths

Issues & Suggestions

1. Clone in …
Thank you for the thorough review! I have addressed all six points.

Commits

Changes made

Point 1: Clone optimization in …

Point 2: Unqualified star expansion is all-or-nothing (9492636). This is intentional. Partial expansion (resolving only some sources) would produce an incomplete column list, causing downstream lineage resolution to silently omit columns or produce incorrect results. I confirmed that sqlglot also requires all sources to be resolvable: it raises …

Point 3: UNION column resolution uses left branch (9492636). Added a doc comment noting this follows the SQL standard and matches sqlglot behavior. Also converted …

Point 4: Case sensitivity (a627cee). Implemented proper SQL identifier case semantics rather than just documenting a limitation. Added a …

Note: the star expansion path ( …

Point 5: …

Point 6: Two star AST patterns (9492636). The parser produces two distinct AST representations for star expressions: …
- Skip unnecessary clone in lineage() when no WITH clause is present
- Document intentional conservative behavior for partial star expansion
- Add comments for UNION left-branch column resolution (SQL standard)
- Note case-insensitive CTE matching as a known limitation
- Explain dual Star AST representation (Star vs Column{name:"*"})
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c93dfec to 9492636
Unquoted CTE names are compared case-insensitively (lowercased), quoted names preserve their original case. This matches sqlglot's normalize_identifiers behavior.

- Add normalize_cte_name() helper that respects Identifier.quoted
- Introduce SourceName struct to carry normalized names through star expansion pipeline
- Update expand_star_from_sources to accept Identifier qualifiers
- Add tests for quoted/unquoted/mixed CTE name matching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
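The normalization rule is small enough to show in full. The sketch below assumes a pared-down `Identifier` with only `name` and `quoted` fields. `normalize_name` and `same_cte` are illustrative names, not the helpers added in this commit.

```rust
// Simplified stand-in for the crate's identifier type (assumption).
struct Identifier {
    name: String,
    quoted: bool,
}

/// Unquoted identifiers fold to lowercase; quoted ones keep their case.
fn normalize_name(id: &Identifier) -> String {
    if id.quoted {
        id.name.clone()
    } else {
        id.name.to_lowercase()
    }
}

/// Two names refer to the same CTE when their normalized forms are equal.
fn same_cte(a: &Identifier, b: &Identifier) -> bool {
    normalize_name(a) == normalize_name(b)
}
```

Under this rule an unquoted `MyCte` matches an unquoted `mycte`, while a quoted `"mycte"` does not match a quoted `"MyCte"`.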
…neage Add two tests that assert the current buggy behavior where scope.rs add_table_to_scope uses eq_ignore_ascii_case for all identifiers including quoted ones. Per SQL semantics (and sqlglot behavior), quoted identifiers should be case-sensitive: "mycte" should NOT match CTE "MyCte". These tests document the bug and will fail when it is fixed, prompting the assertions to be updated to correct behavior. The fix requires changes across scope.rs and lineage.rs CTE resolution, which is broader than the star expansion scope.
…mment placement

- Convert recursive get_leftmost_select_mut to iterative loop with MAX_LINEAGE_DEPTH guard to prevent stack overflow on pathologically deep set operation nesting.
- Move doc comment from SourceName struct to get_select_source_names function where it belongs.

@tobilg I apologize for the multiple edits. Please check it again. You are free to edit this PR as you wish.
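For illustration, a bounded iterative descent to the leftmost SELECT of a set operation could be written as below. This is a read-only sketch over a hypothetical `Query` enum, not the `get_leftmost_select_mut` from this PR, which works on mutable references into the real AST.

```rust
// Mirrors the depth limit used elsewhere in the PR.
const MAX_DEPTH: usize = 64;

// Hypothetical query model: a plain SELECT with its column names, or a
// set operation with a left and a right branch.
enum Query {
    Select(Vec<String>),
    Union(Box<Query>, Box<Query>),
}

/// Descend into left branches with a loop instead of recursion. Gives up
/// with `None` once MAX_DEPTH levels have been visited.
fn leftmost_select(mut q: &Query) -> Option<&Vec<String>> {
    for _ in 0..MAX_DEPTH {
        match q {
            Query::Select(cols) => return Some(cols),
            // Step from the Box to the inner left-hand query.
            Query::Union(left, _) => q = &**left,
        }
    }
    None
}
```

The loop uses constant stack space, so even pathologically deep nesting ends in a `None` rather than a stack overflow.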
Summary
- … `SELECT *` in `lineage()` so that columns can be traced through CTEs without requiring external schema metadata
- … `cte2 AS (SELECT * FROM cte1)`) by walking CTE definitions in order and propagating resolved column lists, replacing the previous `qualify_columns`-based approach, which processed each SELECT independently
- … `cte(a, b) AS (...)`), qualified stars (`cte.*`), recursive CTE skip, unknown table graceful fallback
- … `expand_cte_stars` so that stars from external tables (not CTEs) can also be resolved via schema column lookup
- … `SELECT *` now carry their source table qualifier, enabling proper lineage tracing through nested CTE chains with JOINs
- `resolve_qualified_column` now checks ancestor CTE scopes for sibling CTEs in parent WITH clauses, fixing lineage resolution failures when a CTE references another CTE defined in the same WITH block
- … `MAX_LINEAGE_DEPTH` (64) counter to `to_node_inner` and all recursive callers to prevent stack overflow on circular or deeply nested CTE chains. `get_leftmost_select_mut` also converted to iterative loop with the same depth guard.
- … `normalize_cte_name()` helper that lowercases unquoted identifiers (case-insensitive per SQL spec) and preserves quoted identifiers as-is (case-sensitive), matching sqlglot `normalize_identifiers` behavior. Introduced `SourceName` struct to carry normalization info through star expansion.
- … `expand_cte_stars` public as a building block for consumers who need to pre-process CTE star expansion independently (e.g., with schema information) before running lineage analysis

Known limitations

The star expansion path (`expand_cte_stars`) correctly handles quoted vs unquoted CTE name matching, but the broader scope/lineage resolution paths (`add_table_to_scope` in scope.rs, `resolve_qualified_column` in lineage.rs) still use `eq_ignore_ascii_case` without respecting the `quoted` flag on identifiers. Known-bug tests document this for future work.

Test plan
🤖 Generated with Claude Code