
[NOT FOR MERGE] Feat/sql string parsing #131

Closed

Conversation

tools4origins
Collaborator

Let's introduce SQL parsing! And support things like spark main example:

[image: Spark main example]

But first let's figure out how 😄

This PR adds one dependency (Antlr4) and one file (SqlBase.g4).
It also contains many files generated from SqlBase.g4, which is why it is so big (and should not be merged).

Antlr is a parser generator, and SqlBase.g4 defines a SQL grammar: it formalizes how SQL strings such as SELECT * FROM table WHERE column IS NOT NULL are structured.

Note: SqlBase.g4 is derived from Spark's, which is itself derived from Presto's: this ensures that we introduce the same SQL grammar as the one used by Spark.

Based on this grammar, Antlr4 converts each string into a syntax tree, where each syntactic component is a node with a predefined type and predefined children that are themselves trees. This makes SQL string parsing much easier, as SQL is a rather complex language.

For instance it converts 42 > 1 into a tree like:

| ComparisonContext 
|-- ValueExpressionDefaultContext 
|---- ConstantDefaultContext 
|------ NumericLiteralContext 
|-------- IntegerLiteralContext 
|---------- TerminalNodeImpl                # 42
|-- ComparisonOperatorContext 
|---- TerminalNodeImpl                      # >
|-- ValueExpressionDefaultContext 
|---- ConstantDefaultContext 
|------ NumericLiteralContext 
|-------- IntegerLiteralContext 
|---------- TerminalNodeImpl                # 1
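
For illustration, here is a minimal sketch of how the generated Python parser could produce such a tree; the module and class names follow the usual ANTLR naming convention and are assumptions about this PR's layout:

from antlr4 import InputStream, CommonTokenStream
# Generated from SqlBase.g4 by the ANTLR tool (assumed module layout):
from pysparkling.sql.ast.generated.SqlBaseLexer import SqlBaseLexer
from pysparkling.sql.ast.generated.SqlBaseParser import SqlBaseParser

lexer = SqlBaseLexer(InputStream("42 > 1"))
tokens = CommonTokenStream(lexer)
parser = SqlBaseParser(tokens)
tree = parser.singleExpression()          # parse starting from the singleExpression grammar rule
print(tree.toStringTree(recog=parser))    # prints the syntax tree, roughly like the one above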

I am not opening this PR in order to have it merged: I do not think that we should add generated code to git.

Rather, I am opening it to discuss how to automate the code generation.

Currently, it requires the following steps:

  1. Download antlr-4.7.1-complete.jar from https://www.antlr.org/download/
  2. Run java -Xmx500M -cp "/path/to/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool ${project_dir}/pysparkling/sql/ast/grammar/SqlBase.g4 -o ${project_dir}/pysparkling/sql/ast/generated

But that's only for developers: I think we will want to package the app with these generated files.

These steps are why I think a bit more automation in the app lifecycle would be nice.
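
For developers, a minimal sketch of scripting these two steps could look like the following (the jar location and the Python3 target are assumptions for illustration):

# generate_parser.py -- hypothetical helper that wraps the ANTLR invocation above
import subprocess
from pathlib import Path

project_dir = Path(__file__).resolve().parent
antlr_jar = project_dir / "antlr-4.7.1-complete.jar"   # downloaded beforehand
grammar = project_dir / "pysparkling" / "sql" / "ast" / "grammar" / "SqlBase.g4"
output = project_dir / "pysparkling" / "sql" / "ast" / "generated"

subprocess.run(
    ["java", "-Xmx500M", "-cp", str(antlr_jar),
     "org.antlr.v4.Tool", str(grammar), "-o", str(output),
     "-Dlanguage=Python3"],
    check=True,
)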

What do you think?

@svenkreiss
Owner

Thanks for sharing the parser. That adds some context to the previous discussion.

I am not familiar with antlr. I had a brief look at the documentation and it seems it can generate a Python parser?
I don't see that in the generated files here.
During runtime, it runs as pure Python, right? Or does it do the parsing inside the JVM, and is that why it generated Java files?

@tools4origins
Collaborator Author

Oops, the actual command to generate the files is:
java -Xmx500M -cp "/path/to/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool ${project_dir}/pysparkling/sql/ast/grammar/SqlBase.g4 -o ${project_dir}/pysparkling/sql/ast/generated -Dlanguage=Python3

Now the Lexer, Listener and Parser files are in Python.

It does run as pure Python: neither the antlr jar nor any JVM is necessary at run time.

It is not in this PR, but here is how the Python parser is invoked: https://github.com/tools4origins/pysparkling/blob/feat/antl4rGrammar/pysparkling/sql/ast/parser.py. It is at this step that parsing is made case-insensitive and where, as in Spark, the backquote requirement for SQL identifiers is removed.

This parser can then be invoked as follows:

parser = ast_parser(string)
syntax_tree = parser.singleDataType()  # singleDataType is the name of one of the defined grammar rules

It is a WIP, but you can find a parse_datatype implementation here; on top of the parsing, it calls convert_tree, which converts the syntax tree into a Python object (here a DataType from pysparkling.sql.types):
https://github.com/tools4origins/pysparkling/blob/feat/antl4rGrammar/pysparkling/sql/ast/ast_to_python.py
And its usage:
https://github.com/tools4origins/pysparkling/blob/feat/antl4rGrammar/pysparkling/sql/ast/tests/test_type_parsing.py
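
To give an idea of the wiring, here is a rough sketch of what that entry point could look like; the exact signatures of ast_parser and convert_tree are not shown in this PR and are assumed here:

from pysparkling.sql.ast.parser import ast_parser            # assumed import paths
from pysparkling.sql.ast.ast_to_python import convert_tree

def parse_datatype(string):
    parser = ast_parser(string)        # build the ANTLR parser for this input
    tree = parser.singleDataType()     # parse using the singleDataType grammar rule
    return convert_tree(tree)          # turn the syntax tree into a pysparkling DataType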

@svenkreiss
Owner

Great, thanks. I am understanding this better now. This might be unpopular but how about following the Cython model for generated code: http://docs.cython.org/en/latest/src/userguide/source_files_and_compilation.html#distributing-cython-modules

It's certainly not my preference to have generated code under version control but it also means that developers who are not working on the parser won't need Java. I am using this approach in OpenPifPaf and it has been working well for me. Generated code can be excluded from the GitHub code statistics with an entry in .gitattributes.
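
For instance, a single entry like the one below (the path is an assumption about where the generated files would live) marks the files as generated for GitHub's linguist:

# .gitattributes
pysparkling/sql/ast/generated/** linguist-generated=true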

Do you think this could work here?

@tools4origins
Collaborator Author

An alternative solution would be to have a dedicated project for the antlr4 grammar, which generates the Python files when packaged and which pysparkling uses as a dependency.

This way, we also ensure that building pysparkling itself does not require Java.
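
As a hypothetical sketch, pysparkling would then only declare the grammar package (name illustrative) as a regular requirement in its setup.py:

from setuptools import setup, find_packages

setup(
    name="pysparkling",
    packages=find_packages(),
    install_requires=[
        "python-sql-parser",   # hypothetical package shipping the generated ANTLR files
    ],
)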

@svenkreiss
Owner

@tools4origins Yes, I like that! There are actually already some Python packages that are pure SQL parsers, but I assume they are not compatible with the grammar you need here.

@tools4origins
Collaborator Author

I extracted this into https://github.com/pysparkling/python-sql-parser/ and will refactor the code to rely on it as a dependency.

@tools4origins
Collaborator Author

Closing, as the whole PR logic will now be contained in an additional requirement.
