feat(lineage): support all-columns mode and on_node callback #7575
Merged
georgesittas merged 2 commits into main on Apr 29, 2026
Adds an extension to lineage() so that passing column=None produces a
dict[str, Node] mapping every top-level output column name to its
lineage Node. The single-column form (str | exp.Column) is unchanged
and continues to return a Node. Typing overloads disambiguate the two
return shapes for callers.
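A minimal sketch of how overloads can separate the two return shapes. This is not the code from sqlglot/lineage.py: the toy Node class, the fixed _OUTPUT_COLUMNS tuple, and the reduced (column, sql) signature are assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional, overload


@dataclass
class Node:
    # Toy stand-in for a lineage node; not sqlglot's real Node class.
    name: str
    downstream: list["Node"] = field(default_factory=list)


# Hypothetical output columns of an already-analyzed query.
_OUTPUT_COLUMNS = ("a", "b")


@overload
def lineage(column: str, sql: str) -> Node: ...
@overload
def lineage(column: None, sql: str) -> dict[str, Node]: ...


def lineage(column: Optional[str], sql: str):
    """Return one Node for a named column, or a dict of all of them for None."""
    if column is None:
        # All-columns mode: one entry per top-level output column.
        return {name: Node(name) for name in _OUTPUT_COLUMNS}
    if column not in _OUTPUT_COLUMNS:
        raise KeyError(column)
    # Single-column mode: unchanged shape, a single Node.
    return Node(column)
```

With overloads like these, a type checker infers Node for lineage("a", sql) and dict[str, Node] for lineage(None, sql), so callers do not need isinstance checks or casts.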
A new on_node callback is invoked for every Node created during the
walk, after its downstream is populated. Combined with Node.payload (a
caller-managed dict), this lets callers thread per-node data through
the lineage graph during construction without subclassing Node or
rewalking it after the fact.
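The callback ordering described above can be sketched with a toy walker. Everything here is hypothetical (the DEPS map, to_node's signature, the "sources" payload key); only the attribute names downstream and payload and the "fire after downstream is populated" rule come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Node:
    # Toy stand-in; only mirrors the attributes named in the description.
    name: str
    downstream: list["Node"] = field(default_factory=list)
    payload: dict[str, Any] = field(default_factory=dict)


# Hypothetical dependency map: column -> the columns it is derived from.
DEPS = {"total": ["price", "qty"], "price": ["raw.price"], "qty": ["raw.qty"]}


def to_node(name: str, on_node: Optional[Callable[[Node], None]] = None) -> Node:
    node = Node(name)
    for dep in DEPS.get(name, []):
        node.downstream.append(to_node(dep, on_node))
    # Fire only after downstream is fully populated, so children come first.
    if on_node is not None:
        on_node(node)
    return node


def collect_sources(node: Node) -> None:
    # Leaves are their own source; parents merge their children's payloads.
    if not node.downstream:
        node.payload["sources"] = {node.name}
    else:
        node.payload["sources"] = set().union(
            *(child.payload["sources"] for child in node.downstream)
        )


root = to_node("total", on_node=collect_sources)
```

Because each child's payload is final by the time its parent's callback runs, collect_sources never needs a second pass over the graph.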
Performance:
* Resolving a column to its select expression scanned
selectable.selects on every to_node call. Wide queries with many
output columns made this O(N^2). Memoize a per-scope
{name: select} map and the selectable.is_star bit on first lookup
instead.
* Compile sqlglot/lineage.py via mypyc by listing it in sqlglotc's
_source_files. Together with the memoization above, this shrinks
end-to-end all-columns lineage cost on large CTE-heavy queries by
roughly 2x compared to the unmemoized pure-Python path.
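A rough sketch of the memoization idea, assuming a toy Scope that stores selects as (name, expression) pairs. The class, its method names, and the scans counter are illustrative and do not reflect how sqlglot stores this state.

```python
from __future__ import annotations

from typing import Optional


class Scope:
    """Toy scope holding a list of (name, expression) select pairs."""

    def __init__(self, selects: list[tuple[str, str]]) -> None:
        self.selects = selects
        self._select_by_name: Optional[dict[str, str]] = None  # built lazily
        self.scans = 0  # counts full passes over selects, for illustration

    def find_select_linear(self, name: str) -> str:
        # Unmemoized path: every lookup walks the select list.
        self.scans += 1
        for select_name, expression in self.selects:
            if select_name == name:
                return expression
        raise KeyError(name)

    def find_select_memoized(self, name: str) -> str:
        # Memoized path: one pass builds the map, later lookups are O(1).
        if self._select_by_name is None:
            self.scans += 1
            self._select_by_name = {}
            for select_name, expression in self.selects:
                # setdefault keeps the first match, like the linear scan.
                self._select_by_name.setdefault(select_name, expression)
        return self._select_by_name[name]
```

Resolving all N output columns costs N scans of N selects on the linear path, but a single scan plus N dictionary lookups on the memoized path.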
Adds tests for the column=None form of lineage() and the on_node
callback contract:
* column=None returns a dict keyed by every top-level output column,
with each entry shaped like single-column lineage().
* shared upstream Nodes are deduplicated across output columns by
the per-call cache (same source column referenced from multiple
selects yields a single shared downstream Node).
* UNION CTEs fan out correctly: each output column points at one
downstream per branch and bottoms out at every branch's base table.
* passing a pre-built Scope returns the same Node tree as the
no-scope path, with no second qualify pass.
* the on_node callback fires children before parents, so callers
can populate Node.payload bottom-up from already-finalized
children.
* on_node fires exactly once per Node, even when a Node is reached
from multiple parents.
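The deduplication and fire-once properties in the list above can be illustrated with a per-call cache. This is a sketch under assumptions: the DEPS map, the all_columns helper, and keying the cache by column name are hypothetical, and the graph is assumed to be acyclic.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(eq=False)
class Node:
    # Toy stand-in; eq=False keeps identity-based comparison.
    name: str
    downstream: list["Node"] = field(default_factory=list)


# Hypothetical query shape: both output columns read the same source column.
DEPS = {"x": ["t.a"], "y": ["t.a"]}


def all_columns(outputs: list[str], on_node: Callable[[Node], None]) -> dict[str, Node]:
    cache: dict[str, Node] = {}  # lives for one call only

    def to_node(name: str) -> Node:
        if name in cache:
            return cache[name]  # shared Node: no rebuild, no second callback
        node = Node(name)
        node.downstream = [to_node(dep) for dep in DEPS.get(name, [])]
        cache[name] = node
        on_node(node)  # fires once, after downstream is populated
        return node

    return {name: to_node(name) for name in outputs}
```

A test in the spirit of the bullets above would then assert that result["x"].downstream[0] is result["y"].downstream[0] and that the callback saw each Node exactly once.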
Contributor
SQLGlot Integration Test Results
Comparing:
By Dialect
Overall
main: 113234 total, 112044 passed (pass rate: 98.9%), sqlglot version:
sqlglot:jo/top_down_lineage: 101035 total, 101035 passed (pass rate: 100.0%), sqlglot version:
Transitions:
Dialect pair changes: 0 previous results not found, 3 current results not found
✅ 37 test(s) passed
geooo109
approved these changes
Apr 29, 2026