feat(lineage): support all-columns mode and on_node callback#7575

Merged: georgesittas merged 2 commits into main from jo/top_down_lineage on Apr 29, 2026

@georgesittas (Collaborator) commented Apr 28, 2026

Adds an extension to lineage() so that passing column=None produces a dict[str, Node] mapping every top-level output column name to its lineage Node. The single-column form (str | exp.Column) is unchanged and continues to return a Node. Typing overloads disambiguate the two return shapes for callers.
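As a rough illustration of how overloads can separate the two return shapes, here is a minimal sketch. It does not use sqlglot: `lineage_sketch`, the toy `Node` class and the `outputs` mapping are hypothetical stand-ins, and the real `lineage()` signature takes more parameters than shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional, overload


@dataclass
class Node:
    """Toy stand-in for a lineage node (assumption, not sqlglot's Node)."""

    name: str
    downstream: list["Node"] = field(default_factory=list)


@overload
def lineage_sketch(column: str, outputs: dict[str, list[str]]) -> Node: ...
@overload
def lineage_sketch(column: None, outputs: dict[str, list[str]]) -> dict[str, Node]: ...


def lineage_sketch(
    column: Optional[str], outputs: dict[str, list[str]]
) -> Node | dict[str, Node]:
    """Return one Node for a named column, or a dict of Nodes when column is None."""

    def build(name: str) -> Node:
        # Each output column points at one leaf Node per source column.
        return Node(name, [Node(src) for src in outputs[name]])

    if column is None:
        # All-columns mode: one entry per output column.
        return {name: build(name) for name in outputs}
    return build(column)


# Hypothetical output columns and the source columns they read from.
outputs = {"a": ["t.x"], "b": ["t.x", "t.y"]}
single = lineage_sketch("a", outputs)   # a type checker infers Node
every = lineage_sketch(None, outputs)   # a type checker infers dict[str, Node]
```

With the overloads in place, a type checker can pick the return type from the argument, so callers of the single-column form do not need to narrow a union.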

A new on_node callback is invoked for every Node created during the walk, after its downstream is populated. Combined with Node.payload (a caller-managed dict), this lets callers thread per-node data through the lineage graph during construction without subclassing Node or rewalking it after the fact.
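The callback ordering described above can be sketched without sqlglot. In this hedged example, `Node`, `build`, `record_depth` and the `deps` mapping are all hypothetical; the only property taken from the description is that the callback runs after a node's downstream is populated, so a payload can be derived from the children's payloads.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Node:
    """Toy node with a caller-managed payload dict (assumption, not sqlglot's Node)."""

    name: str
    downstream: list["Node"] = field(default_factory=list)
    payload: dict[str, Any] = field(default_factory=dict)


def build(
    name: str,
    deps: dict[str, list[str]],
    on_node: Optional[Callable[[Node], None]] = None,
) -> Node:
    """Build a node tree and call on_node once the node's downstream is complete."""
    node = Node(name)
    for child in deps.get(name, []):
        node.downstream.append(build(child, deps, on_node))
    if on_node is not None:
        on_node(node)  # children have already been passed to on_node
    return node


order: list[str] = []


def record_depth(node: Node) -> None:
    # Safe to read child payloads: every child was finalized before its parent.
    child_depths = (child.payload["depth"] for child in node.downstream)
    node.payload["depth"] = 1 + max(child_depths, default=0)
    order.append(node.name)


# Hypothetical dependency map: out <- cte <- (t.x, t.y)
deps = {"out": ["cte"], "cte": ["t.x", "t.y"]}
root = build("out", deps, on_node=record_depth)
```

Because the callback runs during construction, the caller gets a fully annotated tree back without a second traversal.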

Performance:

  • Resolving a column to its select expression scanned selectable.selects on every to_node call. Wide queries with many output columns made this O(N^2). Memoize a per-scope {name: select} map and the selectable.is_star bit on first lookup instead.
  • Compile sqlglot/lineage.py via mypyc by listing it in sqlglotc's _source_files. Together with the memoization above, this shrinks end-to-end all-columns lineage cost on large CTE-heavy queries by roughly 2x compared to the unmemoized pure-Python path.
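To make the memoization point concrete, here is a small sketch of the general technique, assuming a simplified scope that stores its select list as (alias, expression) pairs. `ScopeSelects`, `find_linear` and `find_memoized` are hypothetical names and do not reflect how sqlglot stores selects internally.

```python
from __future__ import annotations

from typing import Optional


class ScopeSelects:
    """Hypothetical holder for one scope's select list as (alias, expression) pairs."""

    def __init__(self, selects: list[tuple[str, str]]) -> None:
        self.selects = selects
        self._by_name: Optional[dict[str, str]] = None
        self.builds = 0  # counts how often the lookup map is constructed

    def find_linear(self, name: str) -> Optional[str]:
        # Unmemoized path: O(N) per lookup, so O(N^2) across N output columns.
        for alias, expr in self.selects:
            if alias == name:
                return expr
        return None

    def find_memoized(self, name: str) -> Optional[str]:
        # Build the {name: select} map once, on first lookup, then reuse it.
        if self._by_name is None:
            self.builds += 1
            self._by_name = {}
            for alias, expr in self.selects:
                # setdefault keeps the first match, mirroring the linear scan.
                self._by_name.setdefault(alias, expr)
        return self._by_name.get(name)


# A wide select list: 1000 output columns resolved one after another.
scope = ScopeSelects([(f"c{i}", f"t.c{i}") for i in range(1000)])
resolved = [scope.find_memoized(f"c{i}") for i in range(1000)]
```

The map is built once per scope regardless of how many columns are resolved, which turns the quadratic scan into a single linear pass plus constant-time lookups.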

Adds tests for the column=None form of lineage() and the on_node
callback contract:

  * column=None returns a dict keyed by every top-level output column,
    with each entry shaped like single-column lineage().
  * shared upstream Nodes are deduplicated across output columns by
    the per-call cache (same source column referenced from multiple
    selects yields a single shared downstream Node).
  * UNION CTEs fan out correctly — each output column points at one
    downstream per branch and bottoms out at every branch's base table.
  * passing a pre-built Scope returns the same Node tree as the
    no-scope path, with no second qualify pass.
  * the on_node callback fires children before parents, so callers
    can populate Node.payload bottom-up from already-finalized
    children.
  * on_node fires exactly once per Node, even when a Node is reached
    from multiple parents.
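Two of the contracts listed above (shared upstream Nodes and the once-per-Node callback) can be illustrated with a per-call cache. This is a sketch under stated assumptions: `all_columns`, the toy `Node` class and the acyclic `deps` mapping are hypothetical and are not the tests added in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(eq=False)  # identity-based equality, so shared nodes compare by object
class Node:
    """Toy lineage node (assumption, not sqlglot's Node)."""

    name: str
    downstream: list["Node"] = field(default_factory=list)


def all_columns(
    outputs: list[str],
    deps: dict[str, list[str]],
    on_node: Optional[Callable[[Node], None]] = None,
) -> dict[str, Node]:
    """Build one Node per output column, sharing upstream Nodes via a per-call cache."""
    cache: dict[str, Node] = {}

    def to_node(name: str) -> Node:
        if name in cache:
            return cache[name]  # reuse the Node; on_node is not called again
        node = Node(name)
        node.downstream = [to_node(dep) for dep in deps.get(name, [])]
        cache[name] = node
        if on_node is not None:
            on_node(node)  # fires after downstream is populated
        return node

    return {name: to_node(name) for name in outputs}


# Two output columns that both read the same CTE column (deps must be acyclic).
deps = {"a": ["cte.x"], "b": ["cte.x"], "cte.x": ["t.x"]}
fired: list[str] = []
nodes = all_columns(["a", "b"], deps, on_node=lambda n: fired.append(n.name))
```

Here `nodes["a"]` and `nodes["b"]` point at the same downstream object, and `fired` contains each node name once, with children listed before their parents.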

@georgesittas changed the title from "jo/top down lineage" to "feat(lineage): support all-columns mode and on_node callback" on Apr 28, 2026

@github-actions (Contributor)

SQLGlot Integration Test Results

Comparing:

  • this branch (sqlglot:jo/top_down_lineage, sqlglot version: jo/top_down_lineage)
  • baseline (main, sqlglot version: 0.0.1.dev1)

By Dialect

| dialect | main | sqlglot:jo/top_down_lineage | transitions | links |
| --- | --- | --- | --- | --- |
| bigquery -> bigquery | 24645/24650 passed (100.0%) | 23495/23495 passed (100.0%) | No change | full result / delta |
| bigquery -> duckdb | 867/1154 passed (75.1%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| duckdb -> duckdb | 5823/5823 passed (100.0%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| snowflake -> duckdb | 1063/1961 passed (54.2%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| snowflake -> snowflake | 65133/65133 passed (100.0%) | 63027/63027 passed (100.0%) | No change | full result / delta |
| databricks -> databricks | 1370/1370 passed (100.0%) | 1370/1370 passed (100.0%) | No change | full result / delta |
| postgres -> postgres | 6042/6042 passed (100.0%) | 6042/6042 passed (100.0%) | No change | full result / delta |
| redshift -> redshift | 7101/7101 passed (100.0%) | 7101/7101 passed (100.0%) | No change | full result / delta |

Overall

main: 113234 total, 112044 passed (pass rate: 98.9%), sqlglot version: 0.0.1.dev1

sqlglot:jo/top_down_lineage: 101035 total, 101035 passed (pass rate: 100.0%), sqlglot version: jo/top_down_lineage

Transitions:
No change

Dialect pair changes: 0 previous results not found, 3 current results not found

✅ 37 test(s) passed

@georgesittas merged commit d146dcd into main on Apr 29, 2026 (10 of 12 checks passed)
@georgesittas deleted the jo/top_down_lineage branch on April 29, 2026 12:20