docs: sqlfluff as default parser
reata committed Jan 1, 2024
1 parent b620d78 commit 292eb61
Showing 18 changed files with 174 additions and 194 deletions.
47 changes: 23 additions & 24 deletions README.md
@@ -56,7 +56,7 @@ Target Tables:
## Advanced Usage

### Multiple SQL Statements
Lineage result combined for multiple SQL statements, with intermediate tables identified:
Lineage is combined from multiple SQL statements, with intermediate tables identified:
```
$ sqllineage -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statements(#): 2
@@ -69,7 +69,7 @@ Intermediate Tables:
```
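The intermediate-table rule shown above can be illustrated in a few lines of Python (a toy sketch, not sqllineage's actual code): a table that is written by one statement and read by another is intermediate, and it is excluded from the combined source/target sets.

```python
# Toy illustration of how intermediate tables fall out of
# per-statement (source_tables, target_tables) pairs.
def combine(statements: list[tuple[set[str], set[str]]]):
    sources: set[str] = set()
    targets: set[str] = set()
    for src, tgt in statements:
        sources |= src
        targets |= tgt
    intermediate = sources & targets
    # intermediate tables are excluded from the final source/target sets
    return sources - intermediate, targets - intermediate, intermediate

src, tgt, mid = combine([
    ({"db2.table2"}, {"db1.table1"}),  # insert into db1.table1 select * from db2.table2
    ({"db1.table1"}, {"db3.table3"}),  # insert into db3.table3 select * from db1.table1
])
print(src, tgt, mid)  # {'db2.table2'} {'db3.table3'} {'db1.table1'}
```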

### Verbose Lineage Result
And if you want to see lineage result for every SQL statement, just toggle verbose option
And if you want to see lineage for each SQL statement, just toggle the verbose option
```
$ sqllineage -v -e "insert into db1.table1 select * from db2.table2; insert into db3.table3 select * from db1.table1;"
Statement #1: insert into db1.table1 select * from db2.table2;
@@ -96,30 +96,29 @@ Intermediate Tables:
```

### Dialect-Awareness Lineage
By default, sqllineage doesn't validate your SQL and could give confusing result in case of invalid SQL syntax.
In addition, different SQL dialect has different set of keywords, further weakening sqllineage's capabilities when
keyword used as table name or column name. To reduce the impact, user are strongly encouraged to pass the dialect to
assist the lineage analyzing.
By default, sqllineage uses the `ansi` dialect to validate and parse your SQL. However, some SQL syntax you take for granted
in daily life might not be part of the ANSI standard. In addition, different SQL dialects have different sets of SQL keywords,
further weakening sqllineage's capabilities when a keyword is used as a table or column name. To get the most out of
sqllineage, we strongly encourage you to pass the dialect to assist the lineage analysis.

Take below example, `analyze` is a reserved keyword in PostgreSQL. Default non-validating dialect gives incomplete result,
while ansi dialect gives the correct one and postgres dialect tells you this causes syntax error:
Take the example below: the `INSERT OVERWRITE` statement is only supported by big data solutions like Hive/SparkSQL, and `MAP`
is a reserved keyword in Hive, so it cannot be used as a table name there, while it can in SparkSQL. Both the ansi and hive
dialects tell you this causes a syntax error, while sparksql gives the correct result:
```
$ sqllineage -e "insert into analyze select * from foo;"
Statements(#): 1
Source Tables:
<default>.foo
Target Tables:
$ sqllineage -e "insert into analyze select * from foo;" --dialect=ansi
$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo"
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=hive
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
$ sqllineage -e "INSERT OVERWRITE TABLE map SELECT * FROM foo" --dialect=sparksql
Statements(#): 1
Source Tables:
<default>.foo
Target Tables:
<default>.analyze
$ sqllineage -e "insert into analyze select * from foo;" --dialect=postgres
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
<default>.map
```
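The keyword issue above can be sketched as a tiny lookup (the keyword sets below are hand-picked purely for illustration and deliberately incomplete; consult each dialect's reserved-keyword list for the real data):

```python
# Hand-picked, incomplete keyword sets, purely for illustration.
RESERVED = {
    "hive": {"MAP"},    # MAP is reserved in Hive (among other keywords)
    "sparksql": set(),  # ...but usable as a table name in SparkSQL
}

def usable_as_table_name(name: str, dialect: str) -> bool:
    # A dialect-aware parser makes exactly this kind of decision
    # when it sees a keyword in a table-name position.
    return name.upper() not in RESERVED[dialect]

print(usable_as_table_name("map", "hive"))      # False
print(usable_as_table_name("map", "sparksql"))  # True
```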

Use `sqllineage --dialects` to see all available dialects.
@@ -129,7 +128,7 @@ We also support column level lineage in command line interface, set level option to column, all column lineage will
be printed.

```sql
INSERT OVERWRITE TABLE foo
INSERT INTO foo
SELECT a.col1,
b.col1 AS col2,
c.col3_sum AS col3,
@@ -144,18 +143,18 @@ FROM bar a
ON a.id = sq.bar_id
CROSS JOIN quux d;

INSERT OVERWRITE TABLE corge
INSERT INTO corge
SELECT a.col1,
a.col2 + b.col2 AS col2
FROM foo a
LEFT JOIN grault b
ON a.col1 = b.col1;
```

Suppose this sql is stored in a file called foo.sql
Suppose this sql is stored in a file called test.sql

```
$ sqllineage -f foo.sql -l column
$ sqllineage -f test.sql -l column
<default>.corge.col1 <- <default>.foo.col1 <- <default>.bar.col1
<default>.corge.col2 <- <default>.foo.col2 <- <default>.baz.col1
<default>.corge.col2 <- <default>.grault.col2
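The chained `<-` output above can be read as walking column-level edges backwards from each target column. A minimal sketch, with edges hand-copied from the example (not sqllineage's real data structures):

```python
# target_column -> list of its source columns, taken from the README example
EDGES = {
    "<default>.corge.col1": ["<default>.foo.col1"],
    "<default>.foo.col1": ["<default>.bar.col1"],
    "<default>.corge.col2": ["<default>.foo.col2", "<default>.grault.col2"],
    "<default>.foo.col2": ["<default>.baz.col1"],
}

def paths(column: str) -> list[list[str]]:
    """Follow edges recursively until a column has no further source."""
    sources = EDGES.get(column)
    if not sources:
        return [[column]]
    return [[column] + p for s in sources for p in paths(s)]

for p in paths("<default>.corge.col2"):
    print(" <- ".join(p))
```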
@@ -1,11 +1,11 @@
***************
LineageAnalyzer
***************
********
Analyzer
********

LineageAnalyzer is an abstract class, supposed to include the core processing logic for one-statement SQL analysis.

Each parser implementation will inherit LineageAnalyzer and do parser-specific analysis based on the AST it generates
and store the result in ``sqllineage.core.holders``.
and store the result in ``StatementLineageHolder``.

LineageAnalyzer
========================================
@@ -1,13 +1,16 @@
***************
LineageHolder
***************
******
Holder
******

LineageHolder is an abstraction to hold the lineage result analyzed by LineageAnalyzer at different levels.

At the bottom, we have :class:`sqllineage.core.holder.SubQueryLineageHolder` to hold lineage at subquery level.
This is used internally for :class:`sqllineage.core.analyzer.Analyzer`, which generate
:class:`sqllineage.core.holder.StatementLineageHolder` as the result of lineage at SQL statement level.
And to assemble multiple :class:`sqllineage.core.holder.StatementLineageHolder` into a DAG based data structure serving
At the bottom, we have :class:`sqllineage.core.holders.SubQueryLineageHolder` to hold lineage at subquery level.
This is used internally by :class:`sqllineage.core.analyzer.LineageAnalyzer`.

LineageAnalyzer generates :class:`sqllineage.core.holders.StatementLineageHolder`
as the result of lineage at SQL statement level.

To assemble multiple :class:`sqllineage.core.holders.StatementLineageHolder` into a DAG based data structure serving
for the final output, we have :class:`sqllineage.core.holders.SQLLineageHolder`.
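How that assembly might look can be sketched with a toy stand-in for StatementLineageHolder (field names here are invented for illustration; the real holder API differs):

```python
from dataclasses import dataclass, field

@dataclass
class StatementHolder:  # toy stand-in for StatementLineageHolder
    read: set = field(default_factory=set)   # tables the statement selects from
    write: set = field(default_factory=set)  # tables the statement writes to

def assemble(holders):
    """Merge per-statement holders into one table-level edge set (the DAG)."""
    edges = set()
    for h in holders:
        edges |= {(src, tgt) for src in h.read for tgt in h.write}
    return edges

edges = assemble([
    StatementHolder(read={"db2.table2"}, write={"db1.table1"}),
    StatementHolder(read={"db1.table1"}, write={"db3.table3"}),
])
print(sorted(edges))
```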


9 changes: 9 additions & 0 deletions docs/basic_concepts/metadata_provider.rst
@@ -0,0 +1,9 @@
****************
MetaDataProvider
****************

sqllineage.core.metadata_provider.MetaDataProvider
==================================================

.. autoclass:: sqllineage.core.metadata_provider.MetaDataProvider
:members:
@@ -1,6 +1,6 @@
*************
LineageModels
*************
*****
Model
*****

Several data classes live in this module.

@@ -1,14 +1,14 @@
*************
LineageRunner
*************
******
Runner
******

LineageRunner is the entry point for SQLLineage core processing logic. After parsing command-line options, a string
representation of SQL statements will be fed to LineageRunner for processing. From a bird's-eye view, it contains
three steps:

1. Calling ``sqllineage.utils.helpers.split`` function to split string-base SQL statements into a list of ``str`` statement.
1. Calling ``sqllineage.utils.helpers.split`` function to split string-based SQL statements into a list of ``str`` statements.

2. Calling :class:`sqllineage.core.analyzer.LineageAnalyzer` to analyze each one statement sql string and return a list of
2. Calling :class:`sqllineage.core.analyzer.LineageAnalyzer` to analyze each one-statement SQL string and get a list of
:class:`sqllineage.core.holders.StatementLineageHolder` .

3. Calling :class:`sqllineage.core.holders.SQLLineageHolder.of` function to assemble the list of
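Step 1, the statement split, can be sketched with a naive stand-in (purely illustrative; the real ``split`` must also handle semicolons inside string literals and comments):

```python
def split(sql: str) -> list[str]:
    """Naive stand-in for sqllineage.utils.helpers.split: break a SQL
    string into one-statement strings."""
    return [s.strip() + ";" for s in sql.split(";") if s.strip()]

stmts = split("insert into db1.table1 select * from db2.table2; "
              "insert into db3.table3 select * from db1.table1;")
print(stmts[1])  # insert into db3.table3 select * from db1.table1;
```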
17 changes: 1 addition & 16 deletions docs/behind_the_scene/column-level_lineage_design.rst
@@ -44,26 +44,12 @@ Questions Before Implementation
JOIN tab3
ON tab2.col1 = tab3.col1
**Answer**: dd two edges, tab2.col2 -> tab1.col2, tab3.col2 -> tab1.col2. Meanwhile, these two edges should be marked
**Answer**: Add two edges, tab2.col2 -> tab1.col2, tab3.col2 -> tab1.col2. Meanwhile, these two edges should be marked
so that later in visualization, they can be drawn differently, like in dot line.
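The marked-edge idea can be sketched as follows (a toy edge store, not the real graph implementation):

```python
# Each column-lineage edge carries attributes so that edges whose source
# table is ambiguous (tab2 vs tab3 after a JOIN) can later be drawn
# differently, e.g. as dotted lines.
edges = []

def add_edge(src, tgt, ambiguous=False):
    edges.append({"source": src, "target": tgt,
                  "style": "dotted" if ambiguous else "solid"})

add_edge("tab2.col2", "tab1.col2", ambiguous=True)
add_edge("tab3.col2", "tab1.col2", ambiguous=True)
print([e["style"] for e in edges])  # ['dotted', 'dotted']
```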

Implementation Plan
===================

With `6308b50`_ splitting the logic into different handlers, we now have SourceHandler, TargetHandler and CTEHandler to
handle table level lineage. They're subclass of NextTokenBaseHandler, an abstract class to address an extract pattern
when a specified token indicates we should extract something from next token.

A newly introduced ColumnHandler will also be based on NextTokenBaseHandler (column token followed by keyword SELECT)
plus a end-of-(sub)query hook. Because only until end of query could we know all the source tables. If we don't have
all the source tables and their alias, we can't assign the column to table correctly.

.. warning::
To handle UNION clause, ColumnHandler is now merged into SourceHandler, due to the fact that we need source tables
info breaking down into sub-statement level, end of the whole query would be to late.

**Steps for Full Implementation**

1. Atomic column logic handling: alias, case when, function, expression, etc.
2. Subquery recognition and lineage transition from subquery to statement
3. Column to table assignment in case of table join
@@ -75,4 +61,3 @@ all the source tables and their alias, we can't assign the column to table correctly.


.. _JanusGraph docs: https://docs.janusgraph.org/schema/
.. _6308b50: https://github.com/reata/sqllineage/commit/6308b50e0b087e1bdab722dd531282a169131f4b
8 changes: 4 additions & 4 deletions docs/behind_the_scene/dialect-awareness_lineage_design.rst
@@ -6,9 +6,9 @@ Problem Statement
=================
As of v1.3.x release, table level lineage is perfectly production-ready. Column level lineage, under the no-metadata
background, is also as good as it can be. And yet we still have a lot of corner cases that are not yet supported.
This is really due to the long-tail of SQL language features and fragmentation of various SQL dialect.
This is really due to the long-tail of SQL language features and fragmentation of various SQL dialects.

Some typical issues:
Here are some typical issues:

* How to check whether syntax is valid or not?

@@ -30,7 +30,7 @@ Some typical issues:
* Presto UNNEST
* Snowflake GENERATOR

Over the years, we already have several monkey patches and utils on sqlparse, to tweak the AST generated, either because
Over the years, we already have several monkey patches and utils on sqlparse to tweak the AST generated, either because
of incorrect parsing result (e.g. parenthesized query followed by INSERT INTO table parsed as function) or not yet
supported token grouping (window function, for example). Due to the non-validating nature of sqlparse, that's the
bitter pill to swallow in exchange for all the convenience it brings.
@@ -75,7 +75,7 @@ From code structure perspective, we refactored the whole code base to introduce
* LineageAnalyzer now accepts single statement SQL string, split by LineageRunner, and returns StatementLineageHolder
as before
* Each parser implementation sits in folder **sqllineage.core.parser**. They extend the LineageAnalyzer, common
Models, and leverage Holders at different layer.
Models, and leverage Holders at different layers.

.. note::
Dialect-awareness lineage is now released with v1.4.0
17 changes: 6 additions & 11 deletions docs/behind_the_scene/dos_and_donts.rst
@@ -13,15 +13,10 @@ DOs

DONTs
=====
* Column-level lineage will not be 100% accurate because that would require metadata information. However, there's no
unified metadata service for all kinds of SQL systems. For the moment, in column-level lineage, column-to-table
resolution is conducted in a best-effort way, meaning we only provide possible table candidates for situation like
``select *`` or ``select col from tab1 join tab2``.
* Likewise for Partition-level lineage. Until we find a way to not involve metadata service, we will not go for this.

.. note::
100% accurate Column-level lineage is still do-able if we can provide some kind of a plugin system for user to
register their metadata instead of us maintaining it. Let's see what will happen in future versions.
* Column-level lineage will not be 100% accurate because that would require metadata information. Users can optionally
  leverage the MetaDataProvider functionality so that sqllineage can query metadata when analyzing. If not provided,
  column-to-table resolution is conducted in a best-effort way, meaning we only provide possible table candidates for
  situations like ``select *`` or ``select col from tab1 join tab2``.
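How metadata removes the ambiguity can be sketched like this (the dict stands in for what a MetaDataProvider implementation would return; table and column names are made up):

```python
# Hypothetical table schemas, as a MetaDataProvider might expose them.
METADATA = {"tab1": ["id", "col_a"], "tab2": ["id", "col_b"]}

def resolve(column: str, tables: list[str], metadata=METADATA):
    """Return the tables that actually own `column`; without metadata,
    every joined table would remain a candidate."""
    owners = [t for t in tables if column in metadata.get(t, [])]
    return owners or tables  # fall back to best-effort candidates

print(resolve("col_b", ["tab1", "tab2"]))  # ['tab2']
print(resolve("id", ["tab1", "tab2"]))     # ['tab1', 'tab2'] (still ambiguous)
```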

Static Code Analysis Approach Explained
=======================================
@@ -49,5 +44,5 @@ The alternative way is starting the lineage analysis on the abstraction layer of
ties lineage analysis tightly to the SQL system, so it won't function without a live connection to the database. But that
gives users an accurate result, and the database's own source code can be reused to save a lot of coding effort.

To combine the good side of both approaches, in the long term, SQLLineage will introduce an optional resolution phase,
followed by the current unresolved lineage result, where user can register metadata information in a programmatic way.
To combine the good side of both approaches, SQLLineage introduces an optional MetaDataProvider, where users can register
metadata information in a programmatic way to assist column-to-table resolution.
55 changes: 16 additions & 39 deletions docs/behind_the_scene/how_sqllineage_work.rst
@@ -6,22 +6,23 @@ Basically a sql parser will parse the SQL statement(s) into `AST`_ (Abstract Syntax Tree), which
is a tree representation of the abstract syntactic structure of source code (in our case, SQL code, of course). This is
where SQLLineage takes over.

With AST generated, SQLLineage will traverse through this tree, apply some pre-defined rules, so as to extract the part
With AST generated, SQLLineage will traverse through this tree and apply some pre-defined rules to extract the parts
we're interested in. With that being said, SQLLineage is an AST application, while there's actually more you can do with
AST:

- **born duty of AST: the starting point for optimization.** In compiler world, machine code,
or optionally IR (Intermediate Representation), will be generated based on the AST, and then code optimization,
resulting in an optimized machine code. In data world, it's basically the same thing with different words,
and different optimization target. AST will be converted to query execution plan for query execution optimization,
using strategy like RBO(Rule Based Optimization) or CBO(Cost Based Optimization), so that database/data warehouse
query engine can have an optimized physical plan for execution.
and different optimization target. AST will be converted to query execution plan for query execution optimization.
Using strategy like RBO(Rule Based Optimization) or CBO(Cost Based Optimization), the database/data warehouse
query engine outputs an optimized physical plan for execution.

- **linter**: quoting wikipedia, `linter`_ is a static code analysis tool used to flag programming errors, bugs,
stylistic errors and suspicious constructs. Oftentimes it's used interchangeably with a code formatter. Famous tools
like flake8 for Python, ESLint for JavaScript. Golang even provide an official gofmt program in their standard library.
Meanwhile, although not yet widely adopted in data world, we can also lint SQL code. `sqlfluff`_ is such an great tool.
Guess how it works to detect a smelly "`SELECT *`" or a mixture of leading and trailing commas. The answer is AST!
stylistic errors and suspicious constructs. Oftentimes the name linter is used interchangeably with code formatter.
Famous tools like flake8 for Python and ESLint for JavaScript are examples of real-life linters. Golang even provides an
official gofmt program in its standard library. Meanwhile, although not yet widely adopted in the data world, we can
also lint SQL code. `sqlfluff`_ is such a great tool. Guess how it works to detect a smelly "`SELECT *`" or a mixture
of leading and trailing commas. The answer is AST!

- **transpiler**: This use case is most famous in JavaScript world, where they're proactively using syntax defined in
latest language specification which is not supported by mainstream web browsers yet. Quote from its official document,
@@ -34,37 +35,14 @@ AST:
- **structure analysis**: IDE leverages this a lot. Scenarios like duplicate code detection, code refactor. Basically
this is to analyze the code structure. SQLLineage also falls into this category.

`sqlparse`_ is the underlying parser SQLLineage uses to get the AST. It gives a simple `example`_ to extract table names,
through which you can get a rough idea of how SQLLineage works. At the core is when a token is Keyword and its value is
"FROM", then the next token will either be subquery or table. For subquery, we just recursively calling extract function.
For table, there's a way to get its name.
`sqlfluff`_ is the underlying parser SQLLineage uses to get the AST. You heard it right! Even though sqlfluff is
mostly famous as a SQL linter, it also ships a parser so that linting can be done. The various SQL `dialects`_ it
supports save us a great deal of time.

.. warning::
This is just an over-simplified explanation. In reality, we could easily see ``Comment`` coming after "FROM", or
subquery without alias (valid syntax in certain SQL dialect) mistakenly parsed as ``Parenthesis``. These are all
corner cases we should resolve in real world.
As mentioned, the core of sqllineage is traversing the AST. Different SQL statement types require different
analyzing logic. We collect all kinds of SQL, handle various edge cases, and make our logic robust.
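A minimal sketch of such a tree traversal (the node layout here is invented for illustration and is not sqlfluff's real tree API): visit nodes, dispatch on node type, and collect table references.

```python
def collect_tables(node, found=None):
    """Walk a toy parse tree depth-first, collecting table references."""
    if found is None:
        found = []
    if node.get("type") == "table_reference":
        found.append(node["name"])
    for child in node.get("children", []):
        collect_tables(child, found)
    return found

# Invented tree shape for a simple SELECT statement.
tree = {"type": "select_statement", "children": [
    {"type": "from_clause", "children": [
        {"type": "table_reference", "name": "db2.table2"},
    ]},
]}
print(collect_tables(tree))  # ['db2.table2']
```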

.. note::
Strictly speaking, sqlparse is generating a parse tree instead of an abstract syntax tree. There two terms are often
used interchangeably, and indeed they're similar conceptually. They're both tree structure in slightly different
abstraction layer. In the AST, information like comments and grouping symbols (parenthesis) are not represented.
Removing comment doesn't change the code logic and parenthesis are already implicitly represented by the tree structure.

Some other simple rules in SQLLineage:

1. Things go after Keyword **"FROM"**, all kinds of **"JOIN"** will be source table.

2. Things go after Keyword **"INTO"**, **"OVERWRITE"**, **"TABLE"**, **"VIEW"** will be target table. (Though there are
exceptions like drop table statement)

3. Things go after Keyword **"With"** will be CTE (Common Table Expression).

4. Things go after Keyword **"SELECT"** will be column(s).

The rest thing is just tedious work. We collect all kinds of sql, handle various edge cases and make these simple rules
robust enough.

That's it for single statement SQL lineage analysis. For multiple statements SQL, it requires some more extra work to
This is for single-statement SQL lineage analysis. For multi-statement SQL, some extra work is required to
assemble the lineage from single statements.

We choose a `DAG`_ based data structure to represent multiple statements SQL lineage. Table/View will be vertex in this
Expand All @@ -76,8 +54,7 @@ easy to visualize lineage.
.. _AST: https://en.wikipedia.org/wiki/Abstract_syntax_tree
.. _linter: https://en.wikipedia.org/wiki/Lint_(software)
.. _sqlfluff: https://github.com/sqlfluff/sqlfluff
.. _dialects: https://docs.sqlfluff.com/en/stable/dialects.html
.. _Babel: https://babeljs.io/
.. _sqlglot: https://github.com/tobymao/sqlglot
.. _sqlparse: https://github.com/andialbrecht/sqlparse
.. _example: https://github.com/andialbrecht/sqlparse/blob/master/examples/extract_table_names.py
.. _DAG: https://en.wikipedia.org/wiki/Directed_acyclic_graph
