Lineage Impact

YouTube video: https://www.youtube.com/watch?v=1himkwnKP3Y

A Python tool for generating large-scale data lineage graphs with customizable parameters.

Overview

The LineageGraphGenerator creates realistic data lineage graphs that model relationships between various data assets across different teams and data sources. These graphs can be used for:

Testing graph visualization and analysis tools
Benchmarking graph database performance
Simulating data lineage scenarios
Developing and testing impact analysis algorithms

Features

Generate large graphs (100K+ nodes)
Multiple data source types (warehouse, database, BI, etc.)
Realistic asset types, each restricted to the data sources that can contain it
Team-based ownership
Field-level lineage connections
Various output formats (JSON, GEXF, GraphML)

Data Source Restrictions

Each data source is restricted to specific asset types it can contain:

Data Warehouses (Snowflake, BigQuery, Redshift): schema, table, view, column
Databases (PostgreSQL, MySQL, ArangoDB): schema, table, view, column
Transformation (dbt): model, source, column
Orchestration (Airflow): workflow, job
BI Tools (Tableau, Looker, Power BI): dashboard, report, metric, dimension, measure
Streaming (Kafka): topic, schema
Storage (S3): bucket
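
The restrictions above can be pictured as a lookup table. The following is a minimal sketch, not the generator's actual code: the names ALLOWED_ASSET_TYPES and is_allowed are hypothetical, and the category keys are labels chosen here for illustration. Only the asset type lists are taken from the list above.

```python
# Hypothetical lookup table. The asset type lists mirror the README;
# the dictionary name and the category keys are illustrative assumptions.
ALLOWED_ASSET_TYPES = {
    "warehouse": {"schema", "table", "view", "column"},
    "database": {"schema", "table", "view", "column"},
    "transformation": {"model", "source", "column"},
    "orchestration": {"workflow", "job"},
    "bi": {"dashboard", "report", "metric", "dimension", "measure"},
    "streaming": {"topic", "schema"},
    "storage": {"bucket"},
}


def is_allowed(source_category: str, asset_type: str) -> bool:
    """Return True if the data source category may contain the asset type."""
    # Unknown categories fall back to an empty set, so they allow nothing.
    return asset_type in ALLOWED_ASSET_TYPES.get(source_category, set())
```

With this table, is_allowed("bi", "dashboard") is true, while is_allowed("storage", "table") is false because an S3 source may only contain buckets.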

Usage

from lineage_generator import LineageGraphGenerator

# Create a generator
generator = LineageGraphGenerator(
    min_nodes=100000,
    edge_multiplier=5,
    num_teams=10,
    output_dir="./output"
)

# Generate the graph
graph = generator.generate_graph()

# Get statistics
stats = generator.get_graph_stats()
print(stats)

# Save in different formats
generator.save_graph(format="json")
generator.save_graph(format="gexf")

You can also use the provided script:

python generate_lineage_graph.py

Graph Structure

The generated graph includes:

Nodes: Represent various data assets (tables, views, models, reports, etc.)
Edges: Represent relationships between assets (source_to_target, parent_child, etc.)
Properties: Each node and edge has properties like id, name, type, team, etc.
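
To make the structure concrete, here is a small sketch of how nodes and edges of this shape might be represented and traversed for impact analysis. It uses only the standard library and is not the generator's own representation: the sample values, the "source" and "target" edge keys, and the downstream helper are all assumptions made for illustration. Only the property names (id, name, type, team) and the edge type source_to_target come from the description above.

```python
from collections import deque

# Hypothetical node records keyed by id; values are made up.
nodes = {
    "n1": {"id": "n1", "name": "raw_orders", "type": "table", "team": "ingest"},
    "n2": {"id": "n2", "name": "stg_orders", "type": "model", "team": "analytics"},
    "n3": {"id": "n3", "name": "orders_report", "type": "report", "team": "bi"},
    "n4": {"id": "n4", "name": "raw_users", "type": "table", "team": "ingest"},
}

# Hypothetical edge records; the "source"/"target" keys are an assumption.
edges = [
    {"source": "n1", "target": "n2", "type": "source_to_target"},
    {"source": "n2", "target": "n3", "type": "source_to_target"},
]


def downstream(start, edges):
    """Breadth-first search for every node reachable from `start`."""
    # Build an adjacency list from source id to target ids.
    adjacency = {}
    for edge in edges:
        adjacency.setdefault(edge["source"], []).append(edge["target"])

    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


impacted = downstream("n1", edges)
print(sorted(nodes[node_id]["name"] for node_id in impacted))
# prints ['orders_report', 'stg_orders']
```

A change to raw_orders would affect both the staging model and the report, while raw_users has no downstream assets in this toy graph.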

Customization

You can customize the graph generation by modifying:

Number of nodes and edges
Number of teams
Valid relationships between asset types
Data source restrictions
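
As an illustration of what customizing the valid relationships could look like, the sketch below keeps the rules in a table keyed by asset type pairs. The rule table, its contents, and the function names are hypothetical and are not taken from the project's code. The project's actual rules may be organized quite differently.

```python
# Hypothetical rule table:
# (source asset type, target asset type) -> allowed edge types.
VALID_RELATIONSHIPS = {
    ("schema", "table"): {"parent_child"},
    ("table", "column"): {"parent_child"},
    ("table", "model"): {"source_to_target"},
    ("model", "report"): {"source_to_target"},
}


def is_valid_edge(source_type, target_type, edge_type):
    """Return True if the edge type is permitted between the two asset types."""
    allowed = VALID_RELATIONSHIPS.get((source_type, target_type), set())
    return edge_type in allowed


def filter_valid(candidates):
    """Keep only the (source_type, target_type, edge_type) triples that pass."""
    return [c for c in candidates if is_valid_edge(*c)]
```

Adding or removing an entry in the table changes which candidate edges survive filtering, without touching the checking logic.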

Output Files

Output files are saved in the specified output directory with timestamp-based filenames:

lineage_graph_YYYYMMDD_HHMMSS.json
lineage_graph_YYYYMMDD_HHMMSS.gexf
lineage_graph_YYYYMMDD_HHMMSS.graphml
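
A filename in this pattern can be produced with the standard library's strftime. The helper below is a sketch under that assumption. The name build_output_path is hypothetical, and the generator may construct its paths differently.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(output_dir, fmt, now=None):
    """Return output_dir/lineage_graph_YYYYMMDD_HHMMSS.<fmt>."""
    # Use the supplied time if given, otherwise the current local time.
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"lineage_graph_{timestamp}.{fmt}"


path = build_output_path("./output", "json", datetime(2025, 3, 1, 14, 30, 5))
print(path.name)  # prints lineage_graph_20250301_143005.json
```

Because the timestamp has one-second resolution, two saves in the same format within the same second would map to the same filename.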

Repository: sumandas0/Arango-networkx-cuGraph
