
[NOT FOR MERGE] Feat/sql string parsing #131

Closed

Conversation

tools4origins
Collaborator

Let's introduce SQL parsing! And support things like spark main example:

[image: Spark main example]

But first let's figure out how 😄

This PR adds one dependency (Antlr4) and one file (SqlBase.g4).
It also contains many files generated from SqlBase.g4, which is why it is so big (and should not be merged).

Antlr is a parser generator, and SqlBase.g4 defines a SQL grammar: it formalizes how SQL strings such as SELECT * FROM table WHERE column IS NOT NULL are structured.

Note: SqlBase.g4 is derived from Spark's, which is itself derived from Presto's: this ensures that we introduce the same SQL grammar as the one used by Spark.

Based on this grammar, Antlr4 converts each string into a syntax tree, where each syntactic component is a node with a predefined type and predefined children that are themselves trees. This makes SQL string parsing much easier, as SQL is a rather complex language.

For instance it converts 42 > 1 into a tree like:

| ComparisonContext 
|-- ValueExpressionDefaultContext 
|---- ConstantDefaultContext 
|------ NumericLiteralContext 
|-------- IntegerLiteralContext 
|---------- TerminalNodeImpl                # 42
|-- ComparisonOperatorContext 
|---- TerminalNodeImpl                      # >
|-- ValueExpressionDefaultContext 
|---- ConstantDefaultContext 
|------ NumericLiteralContext 
|-------- IntegerLiteralContext 
|---------- TerminalNodeImpl                # 1
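
For illustration, here is a minimal sketch of how the generated Python parser could produce such a tree; the module and class names follow the usual ANTLR naming convention and are assumptions about this PR's layout:

from antlr4 import InputStream, CommonTokenStream
# Generated from SqlBase.g4 by the ANTLR tool (assumed module layout):
from pysparkling.sql.ast.generated.SqlBaseLexer import SqlBaseLexer
from pysparkling.sql.ast.generated.SqlBaseParser import SqlBaseParser

lexer = SqlBaseLexer(InputStream("42 > 1"))
tokens = CommonTokenStream(lexer)
parser = SqlBaseParser(tokens)
tree = parser.singleExpression()          # parse starting from the singleExpression grammar rule
print(tree.toStringTree(recog=parser))    # prints the syntax tree, roughly like the one above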

I am not opening this PR in order to have it merged: I do not think that we should add generated code to git.

Rather, I am opening it to discuss how to automate the code generation.

Currently, it requires the following steps:

  1. Download antlr-4.7.1-complete.jar from https://www.antlr.org/download/
  2. Run java -Xmx500M -cp "/path/to/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool ${project_dir}/pysparkling/sql/ast/grammar/SqlBase.g4 -o ${project_dir}/pysparkling/sql/ast/generated

But that's only for developers: I think we will want to package the app with these generated files.

These steps are why I think a bit more automation in the app lifecycle would be nice.
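
For developers, a minimal sketch of scripting these two steps could look like the following (the jar location and the Python3 target are assumptions for illustration):

# generate_parser.py -- hypothetical helper that wraps the ANTLR invocation above
import subprocess
from pathlib import Path

project_dir = Path(__file__).resolve().parent
antlr_jar = project_dir / "antlr-4.7.1-complete.jar"   # downloaded beforehand
grammar = project_dir / "pysparkling" / "sql" / "ast" / "grammar" / "SqlBase.g4"
output = project_dir / "pysparkling" / "sql" / "ast" / "generated"

subprocess.run(
    ["java", "-Xmx500M", "-cp", str(antlr_jar),
     "org.antlr.v4.Tool", str(grammar), "-o", str(output),
     "-Dlanguage=Python3"],
    check=True,
)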

What do you think?

@svenkreiss
Owner

Thanks for sharing the parser. That adds some context to the previous discussion.

I am not familiar with antlr. I had a brief look at the documentation and it seems it can generate a Python parser?
I don't see that in the generated files here.
During runtime, it runs as pure Python, right? Or does it do the parsing inside the JVM, and is that why it generated Java files?

@tools4origins
Collaborator Author

Oops, the actual command to generate the files is:
java -Xmx500M -cp "/path/to/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool ${project_dir}/pysparkling/sql/ast/grammar/SqlBase.g4 -o ${project_dir}/pysparkling/sql/ast/generated -Dlanguage=Python3

Now the Lexer, Listener and Parser files are in Python.

It does run as pure Python: neither the antlr jar nor any JVM is necessary at run time.

It is not in this PR, but here is how the Python parser is invoked: https://github.com/tools4origins/pysparkling/blob/feat/antl4rGrammar/pysparkling/sql/ast/parser.py. It is at this step that parsing is made case-insensitive and where, as in Spark, the backquote requirement for SQL identifiers is removed.

This parser can then be invoked as follows:

parser = ast_parser(string)
syntax_tree = parser.singleDataType()  # singleDataType is the name of one of the defined grammar rules

It is a WIP, but you can find a parse_datatype implementation here; on top of the parsing, it calls convert_tree, which converts the syntax tree into a Python object (here a DataType from pysparkling.sql.types):
https://github.com/tools4origins/pysparkling/blob/feat/antl4rGrammar/pysparkling/sql/ast/ast_to_python.py
And its usage:
https://github.com/tools4origins/pysparkling/blob/feat/antl4rGrammar/pysparkling/sql/ast/tests/test_type_parsing.py
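
To give an idea of the wiring, here is a rough sketch of what that entry point could look like; the exact signatures of ast_parser and convert_tree are not shown in this PR and are assumed here:

from pysparkling.sql.ast.parser import ast_parser            # assumed import paths
from pysparkling.sql.ast.ast_to_python import convert_tree

def parse_datatype(string):
    parser = ast_parser(string)        # build the ANTLR parser for this input
    tree = parser.singleDataType()     # parse using the singleDataType grammar rule
    return convert_tree(tree)          # turn the syntax tree into a pysparkling DataType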

@svenkreiss
Owner

Great, thanks. I am understanding this better now. This might be unpopular but how about following the Cython model for generated code: http://docs.cython.org/en/latest/src/userguide/source_files_and_compilation.html#distributing-cython-modules

It's certainly not my preference to have generated code under version control but it also means that developers who are not working on the parser won't need Java. I am using this approach in OpenPifPaf and it has been working well for me. Generated code can be excluded from the GitHub code statistics with an entry in .gitattributes.
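
For instance, a single entry like the one below (the path is an assumption about where the generated files would live) marks the files as generated for GitHub's linguist:

# .gitattributes
pysparkling/sql/ast/generated/** linguist-generated=true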

Do you think this could work here?

@tools4origins
Collaborator Author

An alternative solution would be to have a dedicated project for the antlr4 grammar, which generates the Python files when packaged and which pysparkling uses as a dependency.

This way, we also ensure that building pysparkling itself does not require Java.
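
As a hypothetical sketch, pysparkling would then only declare the grammar package (name illustrative) as a regular requirement in its setup.py:

from setuptools import setup, find_packages

setup(
    name="pysparkling",
    packages=find_packages(),
    install_requires=[
        "python-sql-parser",   # hypothetical package shipping the generated ANTLR files
    ],
)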

@svenkreiss
Owner

@tools4origins Yes, I like that! There are actually already some Python packages that are pure SQL parsers, but I assume they are not compatible with the grammar you need here.

@tools4origins
Collaborator Author

I extracted this into https://github.com/pysparkling/python-sql-parser/ and will refactor the code to rely on it as a dependency.

@tools4origins
Collaborator Author

Closing, as the whole PR logic will now be contained in an additional requirement.
