docs: how to use MetaDataProvider (#534)
reata committed Jan 7, 2024
1 parent 7740d10 commit 078b9b0
Showing 5 changed files with 203 additions and 4 deletions.
36 changes: 35 additions & 1 deletion README.md
@@ -96,7 +96,7 @@ Intermediate Tables:
```

### Dialect-Awareness Lineage
By default, sqllineage uses the `ansi` dialect to validate and parse your SQL. However, some SQL syntax you take for granted
By default, sqllineage uses the `ansi` dialect to parse and validate your SQL. However, some SQL syntax you take for granted
in daily life might not be in the ANSI standard. In addition, different SQL dialects have different sets of SQL keywords,
further weakening sqllineage's capabilities when a keyword is used as a table name or column name. To get the most out of
sqllineage, we strongly encourage you to pass the dialect to assist the lineage analysis.
@@ -163,6 +163,40 @@ $ sqllineage -f test.sql -l column
<default>.foo.col4 <- col4
```

### MetaData-Awareness Lineage
Looking at the column lineage generated in the previous step, you may notice that:
1. `<default>.foo.* <- <default>.quux.*`: the wildcard is not expanded.
2. `<default>.foo.col4 <- col4`: `col4` is not assigned a source table.

The result is incomplete because sqllineage doesn't know which columns `*` stands for in table `quux`. Likewise, given
the context, `col4` could come from `bar`, `baz` or `quux`. Without metadata, this is the best sqllineage can do.
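
To make the role of metadata concrete, here is a toy sketch of the two problems above. It is an illustration only and not sqllineage's actual implementation: the `METADATA` dict and the helpers `expand_wildcard` and `resolve_column` are hypothetical names introduced for this example.

```python
from typing import Dict, List

# Hypothetical metadata: "schema.table" -> list of column names.
METADATA: Dict[str, List[str]] = {
    "main.baz": ["bar_id", "col1", "col4"],
    "main.quux": ["quux_id", "col5", "col6"],
}


def expand_wildcard(table: str, metadata: Dict[str, List[str]]) -> List[str]:
    """Return qualified columns for `table`, or the unexpanded wildcard if unknown."""
    columns = metadata.get(table)
    if not columns:
        return [f"{table}.*"]
    return [f"{table}.{col}" for col in columns]


def resolve_column(column: str, candidates: List[str], metadata: Dict[str, List[str]]) -> str:
    """Qualify `column` only when exactly one candidate table is known to contain it."""
    owners = [t for t in candidates if column in metadata.get(t, [])]
    return f"{owners[0]}.{column}" if len(owners) == 1 else column


tables = ["main.bar", "main.baz", "main.quux"]
print(expand_wildcard("main.quux", {}))        # no metadata: wildcard stays as-is
print(expand_wildcard("main.quux", METADATA))  # metadata: wildcard is expanded
print(resolve_column("col4", tables, {}))        # no metadata: column stays unqualified
print(resolve_column("col4", tables, METADATA))  # metadata: column gets its source table
```

Without metadata, both helpers can only echo back what the SQL text says; with metadata, the wildcard and the unqualified column can each be tied to concrete source columns.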

Users can optionally provide metadata to sqllineage to improve the lineage result.

Suppose all the tables are created in a SQLite database stored in a file called `db.db`. In particular,
table `quux` has columns `col5` and `col6` and `baz` has column `col4`.
```shell
sqlite3 db.db 'CREATE TABLE IF NOT EXISTS baz (bar_id int, col1 int, col4 int)';
sqlite3 db.db 'CREATE TABLE IF NOT EXISTS quux (quux_id int, col5 int, col6 int)';
```

Now, given the same SQL, column lineage is fully resolved.
```shell
$ SQLLINEAGE_DEFAULT_SCHEMA=main sqllineage -f test.sql -l column --sqlalchemy_url=sqlite:///db.db
main.corge.col1 <- main.foo.col1 <- main.bar.col1
main.corge.col2 <- main.foo.col2 <- main.bar.col1
main.corge.col2 <- main.grault.col2
main.foo.col3 <- c.col3_sum <- main.qux.col3
main.foo.col4 <- main.baz.col4
main.foo.col5 <- main.quux.col5
main.foo.col6 <- main.quux.col6
```
The default schema name in SQLite is `main`. We have to specify it here because the tables in the SQL file are unqualified.

SQLLineage leverages [`sqlalchemy`](https://github.com/sqlalchemy/sqlalchemy) to retrieve metadata from different SQL databases.
See SQLLineage [MetaData](https://sqllineage.readthedocs.io/en/latest/gear_up/metadata.html) for more details.


### Lineage Visualization
One more cool feature, if you want a graph visualization for the lineage result, toggle graph-visualization option

46 changes: 45 additions & 1 deletion docs/first_steps/advanced_usage.rst
@@ -52,7 +52,7 @@ And if you want to see lineage for each SQL statement, just toggle verbose optio
Dialect-Awareness Lineage
=========================
By default, sqllineage uses the `ansi` dialect to validate and parse your SQL. However, some SQL syntax you take for granted
By default, sqllineage uses the `ansi` dialect to parse and validate your SQL. However, some SQL syntax you take for granted
in daily life might not be in the ANSI standard. In addition, different SQL dialects have different sets of SQL keywords,
further weakening sqllineage's capabilities when a keyword is used as a table name or column name. To get the most out of
sqllineage, we strongly encourage you to pass the dialect to assist the lineage analysis.
@@ -126,6 +126,50 @@ Suppose this sql is stored in a file called test.sql
<default>.foo.col4 <- col4
MetaData-Awareness Lineage
==========================
Looking at the column lineage generated in the previous step, you may notice that:

1. `<default>.foo.* <- <default>.quux.*`: the wildcard is not expanded.
2. `<default>.foo.col4 <- col4`: col4 is not assigned a source table.

The result is incomplete because sqllineage doesn't know which columns `*` stands for in table `quux`. Likewise, given
the context, col4 could come from `bar`, `baz` or `quux`. Without metadata, this is the best sqllineage can do.

Users can optionally provide metadata to sqllineage to improve the lineage result.

Suppose all the tables are created in a SQLite database stored in a file called `db.db`. In particular,
table `quux` has columns `col5` and `col6` and `baz` has column `col4`.

.. code-block:: bash

    sqlite3 db.db 'CREATE TABLE IF NOT EXISTS baz (bar_id int, col1 int, col4 int)';
    sqlite3 db.db 'CREATE TABLE IF NOT EXISTS quux (quux_id int, col5 int, col6 int)';

Now, given the same SQL, column lineage is fully resolved.

.. code-block:: bash

    $ SQLLINEAGE_DEFAULT_SCHEMA=main sqllineage -f test.sql -l column --sqlalchemy_url=sqlite:///db.db
    main.corge.col1 <- main.foo.col1 <- main.bar.col1
    main.corge.col2 <- main.foo.col2 <- main.bar.col1
    main.corge.col2 <- main.grault.col2
    main.foo.col3 <- c.col3_sum <- main.qux.col3
    main.foo.col4 <- main.baz.col4
    main.foo.col5 <- main.quux.col5
    main.foo.col6 <- main.quux.col6

The default schema name in SQLite is `main`. We have to specify it here because the tables in the SQL file are unqualified.

SQLLineage leverages `sqlalchemy`_ to retrieve metadata from different SQL databases.
See SQLLineage `MetaData`_ for more details.

.. _sqlalchemy: https://github.com/sqlalchemy/sqlalchemy
.. _MetaData: https://sqllineage.readthedocs.io/en/latest/gear_up/metadata.html

Lineage Visualization
=====================
111 changes: 111 additions & 0 deletions docs/gear_up/metadata.rst
@@ -0,0 +1,111 @@
********
MetaData
********

Column lineage requires metadata to accurately handle cases like ``select *`` or selecting unqualified columns from a join.
Without metadata, SQLLineage outputs only partially accurate column lineage.

MetaDataProvider is a mechanism sqllineage offers so that users can optionally provide metadata to sqllineage
to improve accuracy.

sqllineage ships with two MetaDataProvider implementations. You can also build your own by extending the base
class :class:`sqllineage.core.metadata_provider.MetaDataProvider`.


DummyMetaDataProvider
=====================

.. autoclass:: sqllineage.core.metadata.dummy.DummyMetaDataProvider

By default, a DummyMetaDataProvider instance constructed with an empty dict is passed to LineageRunner.
Users can instead instantiate DummyMetaDataProvider with a metadata dict of their own.

.. code-block:: python

    >>> from sqllineage.core.metadata.dummy import DummyMetaDataProvider
    >>> from sqllineage.runner import LineageRunner
    >>> sql1 = "insert into main.foo select * from main.bar"
    >>> metadata = {"main.bar": ["col1", "col2"]}
    >>> provider = DummyMetaDataProvider(metadata)
    >>> LineageRunner(sql1, metadata_provider=provider).print_column_lineage()
    main.foo.col1 <- main.bar.col1
    main.foo.col2 <- main.bar.col2
    >>> sql2 = "insert into main.foo select * from main.baz"
    >>> LineageRunner(sql2, metadata_provider=provider).print_column_lineage()
    main.foo.* <- main.baz.*

DummyMetaDataProvider is mostly used for testing purposes. The demo above shows that for another SQL query like
``insert into main.foo select * from main.baz``, this provider won't help because it only knows column information for
table ``main.bar``.

However, if users can retrieve metadata for all the tables through a bulk process, then as long as memory allows,
DummyMetaDataProvider can still be used in production.
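
As one possible shape for such a bulk process, the sketch below collects every table's columns from a SQLite database using only the standard-library ``sqlite3`` module. The helper name ``load_sqlite_metadata`` is hypothetical and not part of sqllineage, and an in-memory database stands in for a real one so the example is self-contained.

```python
import sqlite3
from typing import Dict, List


def load_sqlite_metadata(conn: sqlite3.Connection, schema: str = "main") -> Dict[str, List[str]]:
    """Collect column names for every table, keyed as "schema.table"."""
    metadata: Dict[str, List[str]] = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 holds the column name.
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[f"{schema}.{table}"] = [row[1] for row in rows]
    return metadata


# An in-memory database stands in for a real file such as db.db (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baz (bar_id int, col1 int, col4 int)")
conn.execute("CREATE TABLE quux (quux_id int, col5 int, col6 int)")
metadata = load_sqlite_metadata(conn)
conn.close()
print(metadata)
```

The resulting dict uses the ``schema.table`` keys and column-name lists that DummyMetaDataProvider expects, so it can be passed as ``DummyMetaDataProvider(metadata)``.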


SQLAlchemyMetaDataProvider
==========================

.. autoclass:: sqllineage.core.metadata.sqlalchemy.SQLAlchemyMetaDataProvider

On the other hand, SQLAlchemyMetaDataProvider doesn't require users to provide metadata for all the needed tables at once.
It only requires database connection information and queries the database for table metadata when needed.

.. code-block:: python

    >>> from sqllineage.core.metadata.sqlalchemy import SQLAlchemyMetaDataProvider
    >>> from sqllineage.runner import LineageRunner
    >>> sql1 = "insert into main.foo select * from main.bar"
    >>> url = "sqlite:///db.db"
    >>> provider = SQLAlchemyMetaDataProvider(url)
    >>> LineageRunner(sql1, metadata_provider=provider).print_column_lineage()

As long as ``sqlite:///db.db`` is the correct source that this SQL runs on, sqllineage will generate the correct lineage.
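
The idea of querying for table metadata only when needed can be illustrated with a small standard-library sketch. This is not how SQLAlchemyMetaDataProvider is implemented: the class ``LazyColumnLookup`` and its ``queries`` counter are hypothetical, and the caching shown is a design choice of this example rather than documented sqllineage behavior.

```python
import sqlite3
from typing import Dict, List


class LazyColumnLookup:
    """Hypothetical helper: look up a table's columns on first use, then cache them."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.cache: Dict[str, List[str]] = {}
        self.queries = 0  # number of database lookups, tracked for illustration only

    def columns(self, table: str) -> List[str]:
        if table not in self.cache:
            self.queries += 1
            # PRAGMA table_info returns one row per column; index 1 is the column name.
            rows = self.conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            self.cache[table] = [row[1] for row in rows]
        return self.cache[table]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (bar_id int, col1 int)")
lookup = LazyColumnLookup(conn)
print(lookup.columns("bar"))  # first call queries the database
print(lookup.columns("bar"))  # second call is served from the cache
print(lookup.queries)
```

Compared with loading everything up front, this approach only touches tables that actually appear in the SQL being analyzed, at the cost of needing a live database connection during analysis.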

As the name suggests, sqlalchemy is used to connect to the databases. SQLAlchemyMetaDataProvider is just a thin wrapper
around a sqlalchemy ``engine``. SQLAlchemy can connect to many data sources once the correct driver is installed.

Please refer to the SQLAlchemy `Dialect`_ documentation for connection information if you haven't used sqlalchemy before.


.. note::
    **SQLLineage only adds the sqlalchemy library as a dependency. Drivers are not bundled, meaning users have to install
    them on their own**. For example, if you want to connect to snowflake using `snowflake-sqlalchemy`_ in sqllineage, then
    you need to run

    .. code-block:: bash

        pip install snowflake-sqlalchemy

    to install the driver. After that is done, you can use a snowflake sqlalchemy url like:

    .. code-block:: python

        >>> user, password, account = "<your_user_login_name>", "<your_password>", "<your_account_name>"
        >>> provider = SQLAlchemyMetaDataProvider(f"snowflake://{user}:{password}@{account}/")

    Make sure ``<your_user_login_name>``, ``<your_password>``, and ``<your_account_name>`` are replaced with the
    appropriate values for your Snowflake account and user.

SQLLineage tries to connect to the data source when SQLAlchemyMetaDataProvider is constructed and raises
MetaDataProviderException immediately if the connection fails.


.. note::
    **Some drivers allow extra connection arguments.** For example, in `sqlalchemy-bigquery`_, to specify the location of
    your datasets, you can pass ``location`` to the sqlalchemy ``create_engine`` function:

    .. code-block:: python

        >>> from sqlalchemy import create_engine
        >>> engine = create_engine('bigquery://project', location="asia-northeast1")

    This translates to the following SQLAlchemyMetaDataProvider code:

    .. code-block:: python

        >>> provider = SQLAlchemyMetaDataProvider('bigquery://project', engine_kwargs={"location": "asia-northeast1"})

.. _Dialect: https://docs.sqlalchemy.org/en/20/dialects/
.. _snowflake-sqlalchemy: https://github.com/snowflakedb/snowflake-sqlalchemy
.. _sqlalchemy-bigquery: https://github.com/googleapis/python-bigquery-sqlalchemy
9 changes: 8 additions & 1 deletion docs/index.rst
@@ -4,10 +4,13 @@ SQLLineage: SQL Lineage Analysis Tool Powered by Python
Never get the hang of a SQL parser? SQLLineage comes to the rescue. Given a SQL command, SQLLineage will tell you its
source and target tables, without worrying about Tokens, Keywords, Identifiers and all the jargon used by a SQL parser.

Behind the scene, SQLLineage uses the fantastic `sqlparse`_ library to parse the SQL command, and bring you all the
Behind the scene, SQLLineage leverages the pluggable parser libraries `sqlfluff`_ and `sqlparse`_ to parse the SQL command,
analyzes the AST, stores the lineage information in a graph (using the graph library `networkx`_), and brings you all the
human-readable result with ease.

.. _sqlfluff: https://github.com/sqlfluff/sqlfluff
.. _sqlparse: https://github.com/andialbrecht/sqlparse
.. _networkx: https://github.com/networkx/networkx

First steps
===========
@@ -40,10 +43,14 @@ Gear Up
    :caption: Gear up

    gear_up/configuration
    gear_up/metadata

:doc:`gear_up/configuration`
    Learn how to configure sqllineage

:doc:`gear_up/metadata`
    Learn how to use MetaDataProvider


Behind the scene
================
5 changes: 4 additions & 1 deletion sqllineage/core/metadata/dummy.py
@@ -5,10 +5,13 @@

class DummyMetaDataProvider(MetaDataProvider):
    """
    A Dummy MetaDataProvider that accepts a dict with table name as key and a set of column names as value
    A Dummy MetaDataProvider that accepts metadata as a dict
    """

    def __init__(self, metadata: Optional[Dict[str, List[str]]] = None):
        """
        :param metadata: a dict with schema.table name as key and a list of unqualified column names as value
        """
        super().__init__()
        self.metadata = metadata if metadata is not None else {}

