# Table-Valued Functions Lineage

**Example: Table-Valued Functions (TVF) Support for Column Lineage**


This notebook demonstrates how clgraph tracks column lineage through table-valued functions (TVFs):
- Generator TVFs (GENERATE_SERIES, GENERATE_DATE_ARRAY)
- External data TVFs (READ_CSV)
- Synthetic column marking
- TVF parameters in lineage
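
To make "synthetic column marking" concrete before the examples, here is a minimal sketch of the metadata a TVF-generated column carries. `SketchColumn` is a hypothetical stand-in written for this illustration, not clgraph's actual node class; only the three attribute names (`is_synthetic`, `synthetic_source`, `tvf_parameters`) are taken from the examples in this notebook.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchColumn:
    """Hypothetical stand-in for a lineage node (not clgraph's real class)."""

    full_name: str
    is_synthetic: bool = False               # True when a TVF produced the column
    synthetic_source: Optional[str] = None   # name of the generating function
    tvf_parameters: dict = field(default_factory=dict)


# A column produced by GENERATE_SERIES(1, 10) AS t(num)
num = SketchColumn(
    "t.num",
    is_synthetic=True,
    synthetic_source="generate_series",
    tvf_parameters={"start": "1", "end": "10"},
)
# An ordinary column derived from it
doubled = SketchColumn("output.doubled")

for col in (num, doubled):
    origin = f"synthetic, from {col.synthetic_source}" if col.is_synthetic else "derived"
    print(f"{col.full_name}: {origin}")
```

A synthetic column has no upstream table column, so the source function and its parameters are the only provenance available for it.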

### Imports

In [1]:
from clgraph import JSONExporter, Pipeline, RecursiveLineageBuilder

### Example 1: GENERATE_SERIES (PostgreSQL/DuckDB)

In [2]:
print("Example 1: GENERATE_SERIES - Generator TVF")

sql_generate_series = """
SELECT num * 2 AS doubled
FROM GENERATE_SERIES(1, 10) AS t(num)
"""

builder = RecursiveLineageBuilder(sql_generate_series, dialect="postgres")
graph = builder.build()

print("\nSQL:")
print(sql_generate_series)

print("\nColumn Lineage:")
for name, node in graph.nodes.items():
    if node.is_synthetic:
        print(f"  {name} [SYNTHETIC from {node.synthetic_source}]")
    else:
        print(f"  {name}")

print("\nEdges:")
for edge in graph.edges:
    tvf_marker = " [TVF output]" if edge.is_tvf_output else ""
    print(f"  {edge.from_node.full_name} -> {edge.to_node.full_name}{tvf_marker}")

Example 1: GENERATE_SERIES - Generator TVF

SQL:

SELECT num * 2 AS doubled
FROM GENERATE_SERIES(1, 10) AS t(num)


Column Lineage:
  output.doubled
  t.num [SYNTHETIC from generate_series]

Edges:
  t.num -> output.doubled [TVF output]
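
The edge list above can be walked backwards to find which synthetic columns feed a given output. The following is a sketch on plain `(source, target)` tuples rather than clgraph edge objects; `synthetic_origins` is a hypothetical helper, and the sample data simply mirrors the output of Example 1.

```python
from collections import defaultdict


def synthetic_origins(edges, synthetic, target):
    """Return the synthetic columns that reach `target` through `edges`.

    The target itself is included if it is synthetic.
    """
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    found, seen, stack = set(), set(), [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against revisiting shared ancestors
        seen.add(node)
        if node in synthetic:
            found.add(node)
        stack.extend(parents[node])
    return found


# Stand-in data mirroring Example 1
edges = [("t.num", "output.doubled")]
synthetic = {"t.num"}

origins = synthetic_origins(edges, synthetic, "output.doubled")
print(sorted(origins))
```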


### Example 2: Multiple TVFs in JOIN

In [3]:
print("Example 2: Multiple TVFs with JOIN")

sql_multi_tvf = """
SELECT
    d.num AS day_num,
    h.num AS hour_num
FROM GENERATE_SERIES(1, 31) AS d(num)
CROSS JOIN GENERATE_SERIES(0, 23) AS h(num)
"""

builder2 = RecursiveLineageBuilder(sql_multi_tvf, dialect="postgres")
graph2 = builder2.build()

print("\nSQL:")
print(sql_multi_tvf)

print("\nSynthetic columns from TVFs:")
for name, node in graph2.nodes.items():
    if node.is_synthetic:
        print(f"  {name}: generated by {node.synthetic_source}")

Example 2: Multiple TVFs with JOIN

SQL:

SELECT
    d.num AS day_num,
    h.num AS hour_num
FROM GENERATE_SERIES(1, 31) AS d(num)
CROSS JOIN GENERATE_SERIES(0, 23) AS h(num)


Synthetic columns from TVFs:
  d.num: generated by generate_series
  h.num: generated by generate_series
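
When several TVFs appear in one query, it can help to group the synthetic columns by the function that produced them. The sketch below uses `SimpleNamespace` objects as stand-ins for the values of `graph2.nodes`; the two `output.*` names are assumed for illustration, and `group_synthetic_by_source` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-in nodes exposing the same two attributes used in the loop above
nodes = {
    "d.num": SimpleNamespace(is_synthetic=True, synthetic_source="generate_series"),
    "h.num": SimpleNamespace(is_synthetic=True, synthetic_source="generate_series"),
    "output.day_num": SimpleNamespace(is_synthetic=False, synthetic_source=None),
    "output.hour_num": SimpleNamespace(is_synthetic=False, synthetic_source=None),
}


def group_synthetic_by_source(nodes):
    """Map each TVF name to the sorted synthetic column names it produced."""
    grouped = defaultdict(list)
    for name, node in nodes.items():
        if node.is_synthetic:
            grouped[node.synthetic_source].append(name)
    return {source: sorted(names) for source, names in grouped.items()}


by_source = group_synthetic_by_source(nodes)
print(by_source)  # {'generate_series': ['d.num', 'h.num']}
```

Both columns come from the same function here, so the table alias (`d` or `h`) is what distinguishes the two TVF invocations.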


### Example 3: TVF in Pipeline with CREATE TABLE

In [4]:
print("Example 3: TVF in Pipeline")

sql_pipeline = """
CREATE TABLE date_range AS
SELECT num AS day_number
FROM GENERATE_SERIES(1, 365) AS t(num)
"""

pipeline = Pipeline([("create_dates", sql_pipeline)], dialect="postgres")

print("\nSQL:")
print(sql_pipeline)

print("\nPipeline columns:")
for name, col in pipeline.column_graph.columns.items():
    synth = f" [from {col.synthetic_source}]" if col.is_synthetic else ""
    print(f"  {name}{synth}")

print("\nPipeline edges:")
for edge in pipeline.column_graph.edges:
    print(f"  {edge.from_node.full_name} -> {edge.to_node.full_name} [{edge.edge_type}]")

Example 3: TVF in Pipeline

SQL:

CREATE TABLE date_range AS
SELECT num AS day_number
FROM GENERATE_SERIES(1, 365) AS t(num)


Pipeline columns:
  date_range.day_number
  create_dates:unknown.num [from generate_series]

Pipeline edges:
  create_dates:unknown.num -> date_range.day_number [tvf_output]
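
The pipeline names above appear to follow a `query_id:table.column` pattern for intermediate columns and `table.column` for materialized ones. That format is inferred from this output alone, not from clgraph documentation, so treat the parser below as a sketch; `ColumnRef` and `split_full_name` are hypothetical names.

```python
from typing import NamedTuple, Optional


class ColumnRef(NamedTuple):
    query_id: Optional[str]
    table: str
    column: str


def split_full_name(full_name: str) -> ColumnRef:
    """Split 'query_id:table.column' or 'table.column' into its parts."""
    query_id, sep, rest = full_name.partition(":")
    if not sep:
        # No query prefix: the whole string is 'table.column'
        query_id, rest = None, full_name
    table, _, column = rest.rpartition(".")
    return ColumnRef(query_id, table, column)


print(split_full_name("create_dates:unknown.num"))
print(split_full_name("date_range.day_number"))
```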


### Example 4: JSON Export with TVF metadata

In [5]:
import json

print("Example 4: JSON Export with TVF Metadata")

exporter = JSONExporter()
export_data = exporter.export(pipeline)

print("\nExported columns with synthetic info:")
for col in export_data.get("columns", []):
    if col.get("is_synthetic"):
        print(json.dumps(col, indent=2))

print("\nExported edges with TVF info:")
for edge in export_data.get("edges", []):
    if edge.get("is_tvf_output"):
        print(json.dumps(edge, indent=2))
        break

Example 4: JSON Export with TVF Metadata

Exported columns with synthetic info:
{
  "full_name": "create_dates:unknown.num",
  "column_name": "num",
  "table_name": "t",
  "query_id": "create_dates",
  "node_type": "tvf_synthetic",
  "expression": "generate_series(...)",
  "operation": "tvf_synthetic",
  "is_synthetic": true,
  "synthetic_source": "generate_series",
  "tvf_parameters": {
    "start": "1",
    "end": "365"
  },
  "description": null,
  "description_source": null,
  "owner": null,
  "pii": false,
  "tags": [],
  "custom_metadata": {}
}

Exported edges with TVF info:
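
Nothing is printed under this heading, which suggests that no exported edge carries a truthy `is_tvf_output` key. Example 3 reports the edge type as `tvf_output`, so one possible workaround is a filter that accepts either signal. This is a sketch on hand-written dictionaries: the `from`, `to`, and `edge_type` keys are assumptions about the export schema, and `is_tvf_edge` is a hypothetical helper.

```python
def is_tvf_edge(edge: dict) -> bool:
    """Treat an edge as TVF output if either marker is present."""
    return bool(edge.get("is_tvf_output")) or edge.get("edge_type") == "tvf_output"


# Hand-written sample; the key names are assumed, not taken from the real export
sample_edges = [
    {
        "from": "create_dates:unknown.num",
        "to": "date_range.day_number",
        "edge_type": "tvf_output",
    },
    {"from": "a.x", "to": "b.y", "edge_type": "direct"},
]

tvf_edges = [edge for edge in sample_edges if is_tvf_edge(edge)]
for edge in tvf_edges:
    print(f"{edge['from']} -> {edge['to']}")
```

Inspecting the keys of one real exported edge would confirm which marker the exporter actually emits.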


### Example 5: READ_CSV (DuckDB) - External TVF

In [6]:
from clgraph.query_parser import RecursiveQueryParser

print("Example 5: READ_CSV - External Data TVF")

sql_read_csv = """
SELECT *
FROM READ_CSV('s3://bucket/data.csv') AS t
"""

parser = RecursiveQueryParser(sql_read_csv, dialect="duckdb")
unit_graph = parser.parse()

print("\nSQL:")
print(sql_read_csv)

print("\nTVF sources detected:")
for unit in unit_graph.units.values():
    for alias, tvf in unit.tvf_sources.items():
        print(f"  {alias}: {tvf.function_name} ({tvf.tvf_type.value})")
        if tvf.external_source:
            print(f"    External source: {tvf.external_source}")

# Summary
print("Summary")
print(
    """
Table-Valued Functions support captures:
  - Generator TVFs (GENERATE_SERIES, GENERATE_DATE_ARRAY)
  - External data TVFs (READ_CSV, READ_PARQUET)
  - Column aliases from TVF AS t(col1, col2) syntax
  - TVF parameters (start, end, step values)
  - External source paths for file-reading TVFs

Synthetic columns are marked with:
  - is_synthetic: True
  - synthetic_source: function name (e.g., "generate_series")
  - tvf_parameters: parameter dictionary

TVF metadata is preserved through:
  - RecursiveLineageBuilder (single query analysis)
  - Pipeline (multi-query analysis)
  - JSON export
"""
)

Example 5: READ_CSV - External Data TVF

SQL:

SELECT *
FROM READ_CSV('s3://bucket/data.csv') AS t


TVF sources detected:
  t: read_csv (external)
    External source: s3://bucket/data.csv
Summary

Table-Valued Functions support captures:
  - Generator TVFs (GENERATE_SERIES, GENERATE_DATE_ARRAY)
  - External data TVFs (READ_CSV, READ_PARQUET)
  - Column aliases from TVF AS t(col1, col2) syntax
  - TVF parameters (start, end, step values)
  - External source paths for file-reading TVFs

Synthetic columns are marked with:
  - is_synthetic: True
  - synthetic_source: function name (e.g., "generate_series")
  - tvf_parameters: parameter dictionary

TVF metadata is preserved through:
  - RecursiveLineageBuilder (single query analysis)
  - Pipeline (multi-query analysis)
  - JSON export
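
The generator versus external distinction reported by `tvf.tvf_type.value` in Example 5 can be pictured as a lookup on the function name. The sketch below is illustrative only and is not how clgraph necessarily implements it: `TVFKind`, `classify_tvf`, and the `"generator"` and `"unknown"` values are assumptions, while `"external"` matches the output above.

```python
from enum import Enum


class TVFKind(Enum):
    GENERATOR = "generator"  # assumed value
    EXTERNAL = "external"    # matches the value printed in Example 5
    UNKNOWN = "unknown"      # assumed value


# Function names taken from the summary in this notebook
GENERATORS = {"generate_series", "generate_date_array"}
EXTERNAL_READERS = {"read_csv", "read_parquet"}


def classify_tvf(function_name: str) -> TVFKind:
    """Classify a TVF by name, ignoring case."""
    name = function_name.lower()
    if name in GENERATORS:
        return TVFKind.GENERATOR
    if name in EXTERNAL_READERS:
        return TVFKind.EXTERNAL
    return TVFKind.UNKNOWN


for fn in ("GENERATE_SERIES", "READ_CSV", "my_udtf"):
    print(f"{fn}: {classify_tvf(fn).value}")
```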



### Visualize Pipeline Lineage

Display the simplified column lineage for the TVF pipeline built in Example 3.

In [None]:
import shutil

from clgraph import visualize_pipeline_lineage

if shutil.which("dot") is None:
    print("⚠️  Graphviz not installed. Install it first, e.g. 'brew install graphviz' on macOS or 'apt-get install graphviz' on Debian/Ubuntu.")
else:
    print("TVF Pipeline - Simplified Lineage:")
    display(visualize_pipeline_lineage(pipeline.column_graph.to_simplified()))
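
When Graphviz is unavailable, the lineage can still be inspected as DOT text and rendered elsewhere later. This is a sketch on plain `(source, target)` tuples and is independent of clgraph's own visualization code; `edges_to_dot` is a hypothetical helper.

```python
def edges_to_dot(edges, graph_name="lineage"):
    """Render (source, target) pairs as a left-to-right DOT digraph string."""

    def quote(name):
        # Quoted DOT identifiers may contain ':' and '.'; escape embedded quotes
        return '"' + name.replace('"', '\\"') + '"'

    lines = [f"digraph {graph_name} {{", "  rankdir=LR;"]
    for src, dst in edges:
        lines.append(f"  {quote(src)} -> {quote(dst)};")
    lines.append("}")
    return "\n".join(lines)


# Stand-in edge mirroring the pipeline edge from Example 3
dot_text = edges_to_dot([("create_dates:unknown.num", "date_range.day_number")])
print(dot_text)
```

The resulting text can be saved to a `.dot` file and rendered on any machine that has the `dot` binary.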