docs: how to use MetaDataProvider (#534)

reata · Jan 7, 2024 · 078b9b0
1 parent 7740d10
commit 078b9b0
Showing 5 changed files with 203 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -96,7 +96,7 @@ Intermediate Tables:
 ```
 
 ### Dialect-Awareness Lineage
-By default, sqllineage use `ansi` dialect to validate and parse your SQL. However, some SQL syntax you take for granted
+By default, sqllineage uses the `ansi` dialect to parse and validate your SQL. However, some SQL syntax you take for granted
 in daily life might not be in ANSI standard. In addition, different SQL dialects have different set of SQL keywords,
 further weakening sqllineage's capabilities when keyword used as table name or column name. To get the most out of
 sqllineage, we strongly encourage you to pass the dialect to assist the lineage analyzing.
@@ -163,6 +163,40 @@ $ sqllineage -f test.sql -l column
 <default>.foo.col4 <- col4
 ```
 
+### MetaData-Awareness Lineage
+Looking at the column lineage generated in the previous step, you may notice that:
+1. `<default>.foo.* <- <default>.quux.*`: the wildcard is not expanded.
+2. `<default>.foo.col4 <- col4`: col4 is not assigned a source table.
+
+This is imperfect because sqllineage doesn't know which columns the `*` of table `quux` stands for. Likewise, given the context,
+col4 could be coming from `bar`, `baz` or `quux`. Without metadata, this is the best sqllineage can do.
+
+Users can optionally provide metadata to sqllineage to improve the lineage result.
+
+Suppose all the tables are created in a SQLite database stored in a file called `db.db`. In particular,
+table `quux` has columns `col5` and `col6` and `baz` has column `col4`.
+```shell
+sqlite3 db.db 'CREATE TABLE IF NOT EXISTS baz (bar_id int, col1 int, col4 int)';
+sqlite3 db.db 'CREATE TABLE IF NOT EXISTS quux (quux_id int, col5 int, col6 int)';
+```
+
+Now given the same SQL, column lineage is fully resolved.
+```shell
+$ SQLLINEAGE_DEFAULT_SCHEMA=main sqllineage -f test.sql -l column --sqlalchemy_url=sqlite:///db.db
+main.corge.col1 <- main.foo.col1 <- main.bar.col1
+main.corge.col2 <- main.foo.col2 <- main.bar.col1
+main.corge.col2 <- main.grault.col2
+main.foo.col3 <- c.col3_sum <- main.qux.col3
+main.foo.col4 <- main.baz.col4
+main.foo.col5 <- main.quux.col5
+main.foo.col6 <- main.quux.col6
+```
+The default schema in SQLite is called `main`; we have to specify it here because the tables in the SQL file are unqualified.
+
+SQLLineage leverages [`sqlalchemy`](https://github.com/sqlalchemy/sqlalchemy) to retrieve metadata from different SQL databases.
+See SQLLineage [MetaData](https://sqllineage.readthedocs.io/en/latest/gear_up/metadata.html) for more details.
+
+
 ### Lineage Visualization
 One more cool feature, if you want a graph visualization for the lineage result, toggle graph-visualization option
 

diff --git a/docs/first_steps/advanced_usage.rst b/docs/first_steps/advanced_usage.rst
@@ -52,7 +52,7 @@ And if you want to see lineage for each SQL statement, just toggle verbose optio
 
 Dialect-Awareness Lineage
 =========================
-By default, sqllineage use `ansi` dialect to validate and parse your SQL. However, some SQL syntax you take for granted
+By default, sqllineage uses the `ansi` dialect to parse and validate your SQL. However, some SQL syntax you take for granted
 in daily life might not be in ANSI standard. In addition, different SQL dialects have different set of SQL keywords,
 further weakening sqllineage's capabilities when keyword used as table name or column name. To get the most out of
 sqllineage, we strongly encourage you to pass the dialect to assist the lineage analyzing.
@@ -126,6 +126,50 @@ Suppose this sql is stored in a file called test.sql
     <default>.foo.col4 <- col4
 
 
+MetaData-Awareness Lineage
+==========================
+
+Looking at the column lineage generated in the previous step, you may notice that:
+
+1. `<default>.foo.* <- <default>.quux.*`: the wildcard is not expanded.
+2. `<default>.foo.col4 <- col4`: col4 is not assigned a source table.
+
+This is imperfect because sqllineage doesn't know which columns the `*` of table `quux` stands for. Likewise, given the context,
+col4 could be coming from `bar`, `baz` or `quux`. Without metadata, this is the best sqllineage can do.
+
+Users can optionally provide metadata to sqllineage to improve the lineage result.
+
+Suppose all the tables are created in a SQLite database stored in a file called `db.db`. In particular,
+table `quux` has columns `col5` and `col6` and `baz` has column `col4`.
+
+.. code-block:: bash
+
+    sqlite3 db.db 'CREATE TABLE IF NOT EXISTS baz (bar_id int, col1 int, col4 int)';
+    sqlite3 db.db 'CREATE TABLE IF NOT EXISTS quux (quux_id int, col5 int, col6 int)';
+
+Now given the same SQL, column lineage is fully resolved.
+
+.. code-block:: bash
+
+    $ SQLLINEAGE_DEFAULT_SCHEMA=main sqllineage -f test.sql -l column --sqlalchemy_url=sqlite:///db.db
+    main.corge.col1 <- main.foo.col1 <- main.bar.col1
+    main.corge.col2 <- main.foo.col2 <- main.bar.col1
+    main.corge.col2 <- main.grault.col2
+    main.foo.col3 <- c.col3_sum <- main.qux.col3
+    main.foo.col4 <- main.baz.col4
+    main.foo.col5 <- main.quux.col5
+    main.foo.col6 <- main.quux.col6
+
+The default schema in SQLite is called `main`; we have to specify it here because the tables in the SQL file are unqualified.
+
+SQLLineage leverages `sqlalchemy`_ to retrieve metadata from different SQL databases.
+See SQLLineage `MetaData`_ for more details.
+
+
+.. _sqlalchemy: https://github.com/sqlalchemy/sqlalchemy
+.. _MetaData: https://sqllineage.readthedocs.io/en/latest/gear_up/metadata.html
+
+
 Lineage Visualization
 =====================
 

diff --git a/docs/gear_up/metadata.rst b/docs/gear_up/metadata.rst
@@ -0,0 +1,111 @@
+********
+MetaData
+********
+
+Column lineage requires metadata to accurately handle cases like ``select *`` or selecting unqualified columns in a join.
+Without metadata, SQLLineage outputs only partially accurate column lineage.
+
+MetaDataProvider is a mechanism sqllineage offers so that users can optionally provide metadata information to sqllineage
+to improve the accuracy.
+
+There are two MetaDataProvider implementations that sqllineage ships with. You can also build your own by extending the base
+class :class:`sqllineage.core.metadata_provider.MetaDataProvider`.
+
+
+DummyMetaDataProvider
+=====================
+
+.. autoclass:: sqllineage.core.metadata.dummy.DummyMetaDataProvider
+
+By default, a DummyMetaDataProvider instance constructed with an empty dict is passed to LineageRunner.
+Users can instantiate DummyMetaDataProvider with a metadata dict of their own instead.
+
+.. code-block:: python
+
+    >>> from sqllineage.core.metadata.dummy import DummyMetaDataProvider
+    >>> from sqllineage.runner import LineageRunner
+    >>> sql1 = "insert into main.foo select * from main.bar"
+    >>> metadata = {"main.bar": ["col1", "col2"]}
+    >>> provider = DummyMetaDataProvider(metadata)
+    >>> LineageRunner(sql1, metadata_provider=provider).print_column_lineage()
+    main.foo.col1 <- main.bar.col1
+    main.foo.col2 <- main.bar.col2
+    >>> sql2 = "insert into main.foo select * from main.baz"
+    >>> LineageRunner(sql2, metadata_provider=provider).print_column_lineage()
+    main.foo.* <- main.baz.*
+
+DummyMetaDataProvider is mostly used for testing purposes. The demo above shows that for another SQL query like
+``insert into main.foo select * from main.baz``, this provider won't help because it only knows column information for
+table ``main.bar``.
+
+However, if users can retrieve metadata for all the tables through a bulk process, then, as long as memory allows,
+it can still be used in production.
+
+
+SQLAlchemyMetaDataProvider
+==========================
+
+.. autoclass:: sqllineage.core.metadata.sqlalchemy.SQLAlchemyMetaDataProvider
+
+On the other hand, SQLAlchemyMetaDataProvider doesn't require users to provide metadata for all the tables up front.
+It only requires database connection information and will query the database for table metadata when needed.
+
+.. code-block:: python
+
+    >>> from sqllineage.core.metadata.sqlalchemy import SQLAlchemyMetaDataProvider
+    >>> from sqllineage.runner import LineageRunner
+    >>> sql1 = "insert into main.foo select * from main.bar"
+    >>> url = "sqlite:///db.db"
+    >>> provider = SQLAlchemyMetaDataProvider(url)
+    >>> LineageRunner(sql1, metadata_provider=provider).print_column_lineage()
+
+As long as ``sqlite:///db.db`` is the correct source that this SQL runs on, sqllineage will generate the correct lineage.
+
+As the name suggests, sqlalchemy is used to connect to the databases. SQLAlchemyMetaDataProvider is just a thin wrapper
+around a sqlalchemy ``engine``. SQLAlchemy can connect to many data sources, provided the correct driver is installed.
+
+Please refer to the SQLAlchemy `Dialect`_ documentation for connection information if you haven't used sqlalchemy before.
+
+
+.. note::
+     **SQLLineage only adds the sqlalchemy library as a dependency. No drivers are bundled, so users have to install them
+     on their own**. For example, if you want to connect to Snowflake using `snowflake-sqlalchemy`_ in sqllineage, then
+     you need to run
+
+     .. code-block:: bash
+
+        pip install snowflake-sqlalchemy
+
+
+     to install the driver. After that is done, you can use snowflake sqlalchemy url like:
+
+     .. code-block:: python
+
+        >>> user, password, account = "<your_user_login_name>", "<your_password>", "<your_account_name>"
+        >>> provider = SQLAlchemyMetaDataProvider(f"snowflake://{user}:{password}@{account}/")
+
+     Make sure <your_user_login_name>, <your_password>, and <your_account_name> are replaced with the appropriate values
+     for your Snowflake account and user.
+
+     SQLLineage will try connecting to the data source when SQLAlchemyMetaDataProvider is constructed and throws a
+     MetaDataProviderException immediately if the connection fails.
+
+
+.. note::
+     **Some drivers allow extra connection arguments.** For example, in `sqlalchemy-bigquery`_, to specify the location of
+     your datasets, you can pass ``location`` to the sqlalchemy ``create_engine`` function:
+
+     .. code-block:: python
+
+        >>> engine = create_engine('bigquery://project', location="asia-northeast1")
+
+     This translates to the following SQLAlchemyMetaDataProvider code:
+
+     .. code-block:: python
+
+        >>> provider = SQLAlchemyMetaDataProvider('bigquery://project', engine_kwargs={"location": "asia-northeast1"})
+
+
+
+.. _Dialect: https://docs.sqlalchemy.org/en/20/dialects/
+.. _snowflake-sqlalchemy: https://github.com/snowflakedb/snowflake-sqlalchemy
+.. _sqlalchemy-bigquery: https://github.com/googleapis/python-bigquery-sqlalchemy
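
Note on the "bulk process" mentioned in the DummyMetaDataProvider section of docs/gear_up/metadata.rst above: a minimal sketch of what such a process might look like for SQLite, using only the standard library. The helper name `load_sqlite_metadata` is hypothetical and not part of sqllineage, and the in-memory database only stands in for `db.db`. The sketch builds a dict keyed by `schema.table`, the shape this commit documents for DummyMetaDataProvider.

```python
import sqlite3
from typing import Dict, List


def load_sqlite_metadata(conn: sqlite3.Connection, schema: str = "main") -> Dict[str, List[str]]:
    """Build a {"schema.table": [column, ...]} dict from a SQLite connection.

    Hypothetical helper for illustration; not part of sqllineage.
    """
    metadata: Dict[str, List[str]] = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[f"{schema}.{table}"] = [row[1] for row in columns]
    return metadata


# In-memory database standing in for db.db, with the two tables from the docs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baz (bar_id int, col1 int, col4 int)")
conn.execute("CREATE TABLE quux (quux_id int, col5 int, col6 int)")
metadata = load_sqlite_metadata(conn)
conn.close()

print(metadata)
```

The resulting dict could then be handed to `DummyMetaDataProvider(metadata)` as shown in the docs; whether that fits in memory depends on how many tables the database holds.
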
diff --git a/docs/index.rst b/docs/index.rst
@@ -4,10 +4,13 @@ SQLLineage: SQL Lineage Analysis Tool Powered by Python
 Never get the hang of a SQL parser? SQLLineage comes to the rescue. Given a SQL command, SQLLineage will tell you its
 source and target tables, without worrying about Tokens, Keyword, Identified and all the jagons used by a SQL parser.
 
-Behind the scene, SQLLineage uses the fantastic `sqlparse`_ library to parse the SQL command, and bring you all the
+Behind the scene, SQLLineage leverages pluggable parser libraries (`sqlfluff`_ and `sqlparse`_) to parse the SQL command,
+analyzes the AST, stores the lineage information in a graph (using the graph library `networkx`_), and brings you all the
 human-readable result with ease.
 
+.. _sqlfluff: https://github.com/sqlfluff/sqlfluff
 .. _sqlparse: https://github.com/andialbrecht/sqlparse
+.. _networkx: https://github.com/networkx/networkx
 
 First steps
 ===========
@@ -40,10 +43,14 @@ Gear Up
    :caption: Gear up
 
    gear_up/configuration
+   gear_up/metadata
 
 :doc:`gear_up/configuration`
     Learn how to configure sqllineage
 
+:doc:`gear_up/metadata`
+    Learn how to use MetaDataProvider
+
 
 Behind the scene
 ================

diff --git a/sqllineage/core/metadata/dummy.py b/sqllineage/core/metadata/dummy.py
@@ -5,10 +5,13 @@
 
 class DummyMetaDataProvider(MetaDataProvider):
     """
-    A Dummy MetaDataProvider that accept a dict with table name as key and a set of column name as value
+    A Dummy MetaDataProvider that accepts metadata as a dict
     """
 
     def __init__(self, metadata: Optional[Dict[str, List[str]]] = None):
+        """
+        :param metadata: a dict with schema.table name as key and a list of unqualified column names as value
+        """
         super().__init__()
         self.metadata = metadata if metadata is not None else {}
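
Note on the dict shape described in the docstring above (`schema.table` keys mapped to lists of unqualified column names): a minimal sketch of why that shape is enough to settle the `col4` case from the README hunk. This is an illustration only, not sqllineage's actual resolution logic. `resolve_column` is a hypothetical helper, and the column list for `main.bar` is an assumption, since the README only spells out `baz` and `quux`.

```python
from typing import Dict, List, Optional

# Columns for baz and quux follow the CREATE TABLE statements in the README hunk;
# the entry for bar is an assumption made only for this illustration.
METADATA: Dict[str, List[str]] = {
    "main.bar": ["col1"],
    "main.baz": ["bar_id", "col1", "col4"],
    "main.quux": ["quux_id", "col5", "col6"],
}


def resolve_column(column: str, candidates: List[str]) -> Optional[str]:
    """Return the single candidate table that owns `column`.

    Returns None when no candidate, or more than one candidate, has the column.
    """
    owners = [table for table in candidates if column in METADATA.get(table, [])]
    return owners[0] if len(owners) == 1 else None


# col4 exists only in baz, so the unqualified reference can be pinned down.
print(resolve_column("col4", ["main.bar", "main.baz", "main.quux"]))
# col1 exists in both bar and baz, so it stays unresolved in this sketch.
print(resolve_column("col1", ["main.bar", "main.baz"]))
```

Without the dict, every candidate table is equally plausible, which is the `<default>.foo.col4 <- col4` situation shown in the README.
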