docs: sqlfluff as default parser
reata committed Jan 1, 2024
1 parent b620d78 commit 292eb61
Showing 18 changed files with 174 additions and 194 deletions.
47 changes: 23 additions & 24 deletions README.md
@@ -56,7 +56,7 @@ Target Tables:
## Advanced Usage

### Multiple SQL Statements
Lineage result combined for multiple SQL statements, with intermediate tables identified:
Lineage is combined from multiple SQL statements, with intermediate tables identified:
```
$ sqllineage -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statements(#): 2
@@ -69,7 +69,7 @@ Intermediate Tables:
```
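The intermediate-table rule shown above can be illustrated in a few lines of Python (a toy sketch, not sqllineage's actual code): a table that is written by one statement and read by another is intermediate, and it is excluded from the combined source/target sets.

```python
# Toy illustration of how intermediate tables fall out of
# per-statement (source_tables, target_tables) pairs.
def combine(statements: list[tuple[set[str], set[str]]]):
    sources: set[str] = set()
    targets: set[str] = set()
    for src, tgt in statements:
        sources |= src
        targets |= tgt
    intermediate = sources & targets
    # intermediate tables are excluded from the final source/target sets
    return sources - intermediate, targets - intermediate, intermediate

src, tgt, mid = combine([
    ({"db2.table2"}, {"db1.table1"}),  # insert into db1.table1 select * from db2.table2
    ({"db1.table1"}, {"db3.table3"}),  # insert into db3.table3 select * from db1.table1
])
print(src, tgt, mid)  # {'db2.table2'} {'db3.table3'} {'db1.table1'}
```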

### Verbose Lineage Result
And if you want to see lineage result for every SQL statement, just toggle verbose option
And if you want to see lineage for each SQL statement, just toggle the verbose option
```
$ sqllineage -v -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statement #1: insert into db1.table1 select * from db2.table2;
@@ -96,30 +96,29 @@ Intermediate Tables:
```

### Dialect-Awareness Lineage
By default, sqllineage doesn't validate your SQL and could give confusing result in case of invalid SQL syntax.
In addition, different SQL dialect has different set of keywords, further weakening sqllineage's capabilities when
keyword used as table name or column name. To reduce the impact, user are strongly encouraged to pass the dialect to
assist the lineage analyzing.
By default, sqllineage uses the `ansi` dialect to validate and parse your SQL. However, some SQL syntax you take for granted
in daily life might not be part of the ANSI standard. In addition, different SQL dialects have different sets of SQL keywords,
further weakening sqllineage's capabilities when a keyword is used as a table or column name. To get the most out of
sqllineage, we strongly encourage you to pass the dialect to assist the lineage analysis.

Take below example, `analyze` is a reserved keyword in PostgreSQL. Default non-validating dialect gives incomplete result,
while ansi dialect gives the correct one and postgres dialect tells you this causes syntax error:
Take the example below: the `INSERT OVERWRITE` statement is only supported by big data solutions like Hive/SparkSQL, and `MAP`
is a reserved keyword in Hive, so it cannot be used as a table name there, while it can in SparkSQL. Both the ansi and hive
dialects tell you this causes a syntax error, while sparksql gives the correct result:
```
$ sqllineage -e "insert into analyze select * from foo;"
Statements(#): 1
Source Tables:
<default>.foo
Target Tables:
$ sqllineage -e "insert into analyze select * from foo;" --dialect=ansi
$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo"
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=hive
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=sparksql
Statements(#): 1
Source Tables:
<default>.foo
Target Tables:
<default>.analyze
$ sqllineage -e "insert into analyze select * from foo;" --dialect=postgres
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
<default>.map
```
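The keyword issue above can be sketched as a tiny lookup (the keyword sets below are hand-picked purely for illustration and deliberately incomplete; consult each dialect's reserved-keyword list for the real data):

```python
# Hand-picked, incomplete keyword sets, purely for illustration.
RESERVED = {
    "hive": {"MAP"},    # MAP is reserved in Hive (among other keywords)
    "sparksql": set(),  # ...but usable as a table name in SparkSQL
}

def usable_as_table_name(name: str, dialect: str) -> bool:
    # A dialect-aware parser makes exactly this kind of decision
    # when it sees a keyword in a table-name position.
    return name.upper() not in RESERVED[dialect]

print(usable_as_table_name("map", "hive"))      # False
print(usable_as_table_name("map", "sparksql"))  # True
```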

Use `sqllineage --dialects` to see all available dialects.
@@ -129,7 +128,7 @@ We also support column level lineage in command line interface, set level option to column, all column lineage will
be printed.

```sql
INSERT OVERWRITE TABLE foo
INSERT INTO foo
SELECT a.col1,
b.col1 AS col2,
c.col3_sum AS col3,
@@ -144,18 +143,18 @@ FROM bar a
ON a.id = sq.bar_id
CROSS JOIN quux d;

INSERT OVERWRITE TABLE corge
INSERT INTO corge
SELECT a.col1,
a.col2 + b.col2 AS col2
FROM foo a
LEFT JOIN grault b
ON a.col1 = b.col1;
```

Suppose this sql is stored in a file called foo.sql
Suppose this sql is stored in a file called test.sql

```
$ sqllineage -f foo.sql -l column
$ sqllineage -f test.sql -l column
<default>.corge.col1 <- <default>.foo.col1 <- <default>.bar.col1
<default>.corge.col2 <- <default>.foo.col2 <- <default>.baz.col1
<default>.corge.col2 <- <default>.grault.col2
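The chained `<-` output above can be read as walking column-level edges backwards from each target column. A minimal sketch, with edges hand-copied from the example (not sqllineage's real data structures):

```python
# target_column -> list of its source columns, taken from the README example
EDGES = {
    "<default>.corge.col1": ["<default>.foo.col1"],
    "<default>.foo.col1": ["<default>.bar.col1"],
    "<default>.corge.col2": ["<default>.foo.col2", "<default>.grault.col2"],
    "<default>.foo.col2": ["<default>.baz.col1"],
}

def paths(column: str) -> list[list[str]]:
    """Follow edges recursively until a column has no further source."""
    sources = EDGES.get(column)
    if not sources:
        return [[column]]
    return [[column] + p for s in sources for p in paths(s)]

for p in paths("<default>.corge.col2"):
    print(" <- ".join(p))
```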
@@ -1,11 +1,11 @@
***************
LineageAnalyzer
***************
********
Analyzer
********

LineageAnalyzer is an abstract class, supposed to include the core processing logic for one-statement SQL analysis.

Each parser implementation will inherit LineageAnalyzer and do parser-specific analysis based on the AST it generates
and store the result in ``sqllineage.core.holders``.
and store the result in ``StatementLineageHolder``.

LineageAnalyzer
========================================
@@ -1,13 +1,16 @@
***************
LineageHolder
***************
******
Holder
******

LineageHolder is an abstraction to hold the lineage result analyzed by LineageAnalyzer at different levels.

At the bottom, we have :class:`sqllineage.core.holder.SubQueryLineageHolder` to hold lineage at subquery level.
This is used internally for :class:`sqllineage.core.analyzer.Analyzer`, which generate
:class:`sqllineage.core.holder.StatementLineageHolder` as the result of lineage at SQL statement level.
And to assemble multiple :class:`sqllineage.core.holder.StatementLineageHolder` into a DAG based data structure serving
At the bottom, we have :class:`sqllineage.core.holders.SubQueryLineageHolder` to hold lineage at subquery level.
This is used internally by :class:`sqllineage.core.analyzer.LineageAnalyzer`.

LineageAnalyzer generates :class:`sqllineage.core.holders.StatementLineageHolder`
as the result of lineage at SQL statement level.

To assemble multiple :class:`sqllineage.core.holders.StatementLineageHolder` into a DAG based data structure serving
for the final output, we have :class:`sqllineage.core.holders.SQLLineageHolder`.
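How that assembly might look can be sketched with a toy stand-in for StatementLineageHolder (field names here are invented for illustration; the real holder API differs):

```python
from dataclasses import dataclass, field

@dataclass
class StatementHolder:  # toy stand-in for StatementLineageHolder
    read: set = field(default_factory=set)   # tables the statement selects from
    write: set = field(default_factory=set)  # tables the statement writes to

def assemble(holders):
    """Merge per-statement holders into one table-level edge set (the DAG)."""
    edges = set()
    for h in holders:
        edges |= {(src, tgt) for src in h.read for tgt in h.write}
    return edges

edges = assemble([
    StatementHolder(read={"db2.table2"}, write={"db1.table1"}),
    StatementHolder(read={"db1.table1"}, write={"db3.table3"}),
])
print(sorted(edges))
```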


9 changes: 9 additions & 0 deletions docs/basic_concepts/metadata_provider.rst
@@ -0,0 +1,9 @@
****************
MetaDataProvider
****************

sqllineage.core.metadata_provider.MetaDataProvider
==================================================

.. autoclass:: sqllineage.core.metadata_provider.MetaDataProvider
:members:
@@ -1,6 +1,6 @@
*************
LineageModels
*************
*****
Model
*****

Several data classes live in this module.

@@ -1,14 +1,14 @@
*************
LineageRunner
*************
******
Runner
******

LineageRunner is the entry point for SQLLineage core processing logic. After parsing command-line options, a string
representation of SQL statements will be fed to LineageRunner for processing. From a bird's-eye view, it contains
three steps:

1. Calling ``sqllineage.utils.helpers.split`` function to split string-base SQL statements into a list of ``str`` statement.
1. Calling ``sqllineage.utils.helpers.split`` function to split string-based SQL statements into a list of ``str`` statements.

2. Calling :class:`sqllineage.core.analyzer.LineageAnalyzer` to analyze each one statement sql string and return a list of
2. Calling :class:`sqllineage.core.analyzer.LineageAnalyzer` to analyze each one-statement SQL string and get a list of
:class:`sqllineage.core.holders.StatementLineageHolder` .

3. Calling :class:`sqllineage.core.holders.SQLLineageHolder.of` function to assemble the list of
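Step 1, the statement split, can be sketched with a naive stand-in (purely illustrative; the real ``split`` must also handle semicolons inside string literals and comments):

```python
def split(sql: str) -> list[str]:
    """Naive stand-in for sqllineage.utils.helpers.split: break a SQL
    string into one-statement strings."""
    return [s.strip() + ";" for s in sql.split(";") if s.strip()]

stmts = split("insert into db1.table1 select * from db2.table2; "
              "insert into db3.table3 select * from db1.table1;")
print(stmts[1])  # insert into db3.table3 select * from db1.table1;
```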
17 changes: 1 addition & 16 deletions docs/behind_the_scene/column-level_lineage_design.rst
@@ -44,26 +44,12 @@ Questions Before Implementation
JOIN tab3
ON tab2.col1 = tab3.col1
**Answer**: dd two edges, tab2.col2 -> tab1.col2, tab3.col2 -> tab1.col2. Meanwhile, these two edges should be marked
**Answer**: Add two edges, tab2.col2 -> tab1.col2, tab3.col2 -> tab1.col2. Meanwhile, these two edges should be marked
so that later in visualization, they can be drawn differently, like in dot line.
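The marked-edge idea can be sketched as follows (a toy edge store, not the real graph implementation):

```python
# Each column-lineage edge carries attributes so that edges whose source
# table is ambiguous (tab2 vs tab3 after a JOIN) can later be drawn
# differently, e.g. as dotted lines.
edges = []

def add_edge(src, tgt, ambiguous=False):
    edges.append({"source": src, "target": tgt,
                  "style": "dotted" if ambiguous else "solid"})

add_edge("tab2.col2", "tab1.col2", ambiguous=True)
add_edge("tab3.col2", "tab1.col2", ambiguous=True)
print([e["style"] for e in edges])  # ['dotted', 'dotted']
```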

Implementation Plan
===================

With `6308b50`_ splitting the logic into different handlers, we now have SourceHandler, TargetHandler and CTEHandler to
handle table level lineage. They're subclass of NextTokenBaseHandler, an abstract class to address an extract pattern
when a specified token indicates we should extract something from next token.

A newly introduced ColumnHandler will also be based on NextTokenBaseHandler (column token followed by keyword SELECT)
plus a end-of-(sub)query hook. Because only until end of query could we know all the source tables. If we don't have
all the source tables and their alias, we can't assign the column to table correctly.

.. warning::
To handle UNION clause, ColumnHandler is now merged into SourceHandler, due to the fact that we need source tables
info breaking down into sub-statement level, end of the whole query would be to late.

**Steps for Full Implementation**

1. Atomic column logic handling: alias, case when, function, expression, etc.
2. Subquery recognition and lineage transition from subquery to statement
3. Column to table assignment in case of table join
@@ -75,4 +61,3 @@ all the source tables and their alias, we can't assign the column to table correctly.


.. _JanusGraph docs: https://docs.janusgraph.org/schema/
.. _6308b50: https://github.com/reata/sqllineage/commit/6308b50e0b087e1bdab722dd531282a169131f4b
8 changes: 4 additions & 4 deletions docs/behind_the_scene/dialect-awareness_lineage_design.rst
@@ -6,9 +6,9 @@ Problem Statement
=================
As of v1.3.x release, table level lineage is perfectly production-ready. Column level lineage, under the no-metadata
background, is also as good as it can be. And yet we still have a lot of corner cases that are not yet supported.
This is really due to the long-tail of SQL language features and fragmentation of various SQL dialect.
This is really due to the long-tail of SQL language features and fragmentation of various SQL dialects.

Some typical issues:
Here are some typical issues:

* How to check whether syntax is valid or not?

@@ -30,7 +30,7 @@ Some typical issues:
* Presto UNNEST
* Snowflake GENERATOR

Over the years, we already have several monkey patches and utils on sqlparse, to tweak the AST generated, either because
Over the years, we already have several monkey patches and utils on sqlparse to tweak the AST generated, either because
of incorrect parsing result (e.g. parenthesized query followed by INSERT INTO table parsed as function) or not yet
supported token grouping (window function, for example). Due to the non-validating nature of sqlparse, that's the
bitter pill to swallow in exchange for all the convenience it brings.
@@ -75,7 +75,7 @@ From code structure perspective, we refactored the whole code base to introduce
* LineageAnalyzer now accepts single statement SQL string, split by LineageRunner, and returns StatementLineageHolder
as before
* Each parser implementation sits in folder **sqllineage.core.parser**. They extend the LineageAnalyzer, common
Models, and leverage Holders at different layer.
Models, and leverage Holders at different layers.

.. note::
Dialect-awareness lineage is now released with v1.4.0
17 changes: 6 additions & 11 deletions docs/behind_the_scene/dos_and_donts.rst
@@ -13,15 +13,10 @@ DOs

DONTs
=====
* Column-level lineage will not be 100% accurate because that would require metadata information. However, there's no
unified metadata service for all kinds of SQL systems. For the moment, in column-level lineage, column-to-table
resolution is conducted in a best-effort way, meaning we only provide possible table candidates for situation like
``select *`` or ``select col from tab1 join tab2``.
* Likewise for Partition-level lineage. Until we find a way to not involve metadata service, we will not go for this.

.. note::
100% accurate Column-level lineage is still do-able if we can provide some kind of a plugin system for user to
register their metadata instead of us maintaining it. Let's see what will happen in future versions.
* Column-level lineage will not be 100% accurate because that would require metadata information. Users can optionally
  leverage the MetaDataProvider functionality so that sqllineage can query metadata when analyzing. If not provided,
  column-to-table resolution is conducted in a best-effort way, meaning we only provide possible table candidates for
  situations like ``select *`` or ``select col from tab1 join tab2``.
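How metadata removes the ambiguity can be sketched like this (the dict stands in for what a MetaDataProvider implementation would return; table and column names are made up):

```python
# Hypothetical table schemas, as a MetaDataProvider might expose them.
METADATA = {"tab1": ["id", "col_a"], "tab2": ["id", "col_b"]}

def resolve(column: str, tables: list[str], metadata=METADATA):
    """Return the tables that actually own `column`; without metadata,
    every joined table would remain a candidate."""
    owners = [t for t in tables if column in metadata.get(t, [])]
    return owners or tables  # fall back to best-effort candidates

print(resolve("col_b", ["tab1", "tab2"]))  # ['tab2']
print(resolve("id", ["tab1", "tab2"]))     # ['tab1', 'tab2'] (still ambiguous)
```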

Static Code Analysis Approach Explained
=======================================
@@ -49,5 +44,5 @@ The alternative way is starting the lineage analysis on the abstraction layer of
ties lineage analysis tightly to the SQL system, so it won't function without a live connection to the database. But that
gives users an accurate result, and the database's own source code can be reused to save a lot of coding effort.

To combine the good side of both approaches, in the long term, SQLLineage will introduce an optional resolution phase,
followed by the current unresolved lineage result, where user can register metadata information in a programmatic way.
To combine the good side of both approaches, SQLLineage introduces an optional MetaDataProvider, where users can register
metadata information in a programmatic way to assist column-to-table resolution.
55 changes: 16 additions & 39 deletions docs/behind_the_scene/how_sqllineage_work.rst
@@ -6,22 +6,23 @@ Basically a sql parser will parse the SQL statement(s) into `AST`_ (Abstract Syntax Tree), which
is a tree representation of the abstract syntactic structure of source code (in our case, SQL code, of course). This is
where SQLLineage takes over.

With AST generated, SQLLineage will traverse through this tree, apply some pre-defined rules, so as to extract the part
With AST generated, SQLLineage will traverse through this tree and apply some pre-defined rules to extract the parts
we're interested in. With that being said, SQLLineage is an AST application, while there's actually more you can do with
AST:

- **born duty of AST: the starting point for optimization.** In compiler world, machine code,
or optionally IR (Intermediate Representation), will be generated based on the AST, and then code optimization,
resulting in an optimized machine code. In data world, it's basically the same thing with different words,
and different optimization target. AST will be converted to query execution plan for query execution optimization,
using strategy like RBO(Rule Based Optimization) or CBO(Cost Based Optimization), so that database/data warehouse
query engine can have an optimized physical plan for execution.
and different optimization target. AST will be converted to query execution plan for query execution optimization.
Using strategy like RBO(Rule Based Optimization) or CBO(Cost Based Optimization), the database/data warehouse
query engine outputs an optimized physical plan for execution.

- **linter**: quoting wikipedia, `linter`_ is a static code analysis tool used to flag programming errors, bugs,
stylistic errors and suspicious constructs. Oftentimes it's used interchangeably with a code formatter. Famous tools
like flake8 for Python, ESLint for JavaScript. Golang even provide an official gofmt program in their standard library.
Meanwhile, although not yet widely adopted in data world, we can also lint SQL code. `sqlfluff`_ is such an great tool.
Guess how it works to detect a smelly "`SELECT *`" or a mixture of leading and trailing commas. The answer is AST!
stylistic errors and suspicious constructs. Oftentimes the name linter is used interchangeably with code formatter.
Famous tools like flake8 for Python and ESLint for JavaScript are examples of real-life linters. Golang even provides an
official gofmt program in its standard library. Meanwhile, although not yet widely adopted in the data world, we can
also lint SQL code. `sqlfluff`_ is such a great tool. Guess how it works to detect a smelly "`SELECT *`" or a mixture
of leading and trailing commas. The answer is AST!

- **transpiler**: This use case is most famous in JavaScript world, where they're proactively using syntax defined in
latest language specification which is not supported by mainstream web browsers yet. Quote from its official document,
@@ -34,37 +35,14 @@ AST:
- **structure analysis**: IDE leverages this a lot. Scenarios like duplicate code detection, code refactor. Basically
this is to analyze the code structure. SQLLineage also falls into this category.

`sqlparse`_ is the underlying parser SQLLineage uses to get the AST. It gives a simple `example`_ to extract table names,
through which you can get a rough idea of how SQLLineage works. At the core is when a token is Keyword and its value is
"FROM", then the next token will either be subquery or table. For subquery, we just recursively calling extract function.
For table, there's a way to get its name.
`sqlfluff`_ is the underlying parser SQLLineage uses to get the AST. You heard it right! Even though sqlfluff is
mostly famous as a SQL linter, it also ships a parser so that linting can be done. The various SQL `dialects`_ it
supports save us a great deal of time.

.. warning::
This is just an over-simplified explanation. In reality, we could easily see ``Comment`` coming after "FROM", or
subquery without alias (valid syntax in certain SQL dialect) mistakenly parsed as ``Parenthesis``. These are all
corner cases we should resolve in real world.
As mentioned, the core of sqllineage is traversing the AST. Different SQL statement types require different
analyzing logic. We collect all kinds of SQL, handle various edge cases, and make our logic robust.
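A minimal sketch of such a tree traversal (the node layout here is invented for illustration and is not sqlfluff's real tree API): visit nodes, dispatch on node type, and collect table references.

```python
def collect_tables(node, found=None):
    """Walk a toy parse tree depth-first, collecting table references."""
    if found is None:
        found = []
    if node.get("type") == "table_reference":
        found.append(node["name"])
    for child in node.get("children", []):
        collect_tables(child, found)
    return found

# Invented tree shape for a simple SELECT statement.
tree = {"type": "select_statement", "children": [
    {"type": "from_clause", "children": [
        {"type": "table_reference", "name": "db2.table2"},
    ]},
]}
print(collect_tables(tree))  # ['db2.table2']
```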

.. note::
Strictly speaking, sqlparse is generating a parse tree instead of an abstract syntax tree. There two terms are often
used interchangeably, and indeed they're similar conceptually. They're both tree structure in slightly different
abstraction layer. In the AST, information like comments and grouping symbols (parenthesis) are not represented.
Removing comment doesn't change the code logic and parenthesis are already implicitly represented by the tree structure.

Some other simple rules in SQLLineage:

1. Things go after Keyword **"FROM"**, all kinds of **"JOIN"** will be source table.

2. Things go after Keyword **"INTO"**, **"OVERWRITE"**, **"TABLE"**, **"VIEW"** will be target table. (Though there are
exceptions like drop table statement)

3. Things go after Keyword **"With"** will be CTE (Common Table Expression).

4. Things go after Keyword **"SELECT"** will be column(s).

The rest thing is just tedious work. We collect all kinds of sql, handle various edge cases and make these simple rules
robust enough.

That's it for single statement SQL lineage analysis. For multiple statements SQL, it requires some more extra work to
This is for single-statement SQL lineage analysis. For multi-statement SQL, some extra work is required to
assemble the lineage from single statements.

We choose a `DAG`_ based data structure to represent multiple statements SQL lineage. Table/View will be vertex in this
Expand All @@ -76,8 +54,7 @@ easy to visualize lineage.
.. _AST: https://en.wikipedia.org/wiki/Abstract_syntax_tree
.. _linter: https://en.wikipedia.org/wiki/Lint_(software)
.. _sqlfluff: https://github.com/sqlfluff/sqlfluff
.. _dialects: https://docs.sqlfluff.com/en/stable/dialects.html
.. _Babel: https://babeljs.io/
.. _sqlglot: https://github.com/tobymao/sqlglot
.. _sqlparse: https://github.com/andialbrecht/sqlparse
.. _example: https://github.com/andialbrecht/sqlparse/blob/master/examples/extract_table_names.py
.. _DAG: https://en.wikipedia.org/wiki/Directed_acyclic_graph
