Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark3: JOIN clause enhancements #2570

Merged

Conversation

R7L208
Copy link
Contributor

@R7L208 R7L208 commented Feb 7, 2022

Brief summary of the change made

Enhancements to joins in Spark3

  • Add additional join types that are permitted
  • Add NATURAL joins syntax support
  • Additional test cases

Are there any other side effects of this change that we should be aware of?

N/A for now; will update this section once ready for review.

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@R7L208
Copy link
Contributor Author

R7L208 commented Feb 7, 2022

Output from running SQLFluff below but hitting a few problems.

The updates to AliasExpressionSegment to allow for implicit aliasing (ie table_name alias_name where they keyword AS is not used) causes the first join keyword to be matched as table_expression. I've managed to add match_grammar to both AliasExpressionSegment and JoinClauseSegment to prevent this.

For some reason though, it works with every keyword BUT SEMI and ANTI. This causes an Exception in L031 and an Unparsabel Section caused by the parser still expecting a TableExpressionSegment.

Even with disabling L031 through noqa, the unparsable section still occurs. I cannot for the life of me figure out why this is ONLY happening with SEMI and ANTI when every other join keyword parses fine.

> sqlfluff lint --dialect spark3 -v test/fixtures/dialects/spark3/join_types.sql

==== sqlfluff ====
sqlfluff:                0.9.4 python:                 3.9.10
implementation:        cpython dialect:                spark3
verbosity:                   1 templater:               jinja

==== readout ====

=== [ path: test/fixtures/dialects/spark3/join_types.sql ] ===

CRITICAL   [L031] Applying rule L031 threw an Exception: local variable 'alias_name' referenced before assignment
Traceback (most recent call last):
  File "/Users/lorin.dawson/repos/sqlfluff/src/sqlfluff/core/rules/base.py", line 519, in crawl
    res = self._eval(
  File "/Users/lorin.dawson/repos/sqlfluff/src/sqlfluff/rules/L031.py", line 110, in _eval
    self._lint_aliases_in_join(
  File "/Users/lorin.dawson/repos/sqlfluff/src/sqlfluff/rules/L031.py", line 210, in _lint_aliases_in_join
    and used_alias_ref.raw == alias_name
UnboundLocalError: local variable 'alias_name' referenced before assignment
CRITICAL   [L031] Applying rule L031 threw an Exception: local variable 'alias_name' referenced before assignment
Traceback (most recent call last):
  File "/Users/lorin.dawson/repos/sqlfluff/src/sqlfluff/core/rules/base.py", line 519, in crawl
    res = self._eval(
  File "/Users/lorin.dawson/repos/sqlfluff/src/sqlfluff/rules/L031.py", line 110, in _eval
    self._lint_aliases_in_join(
  File "/Users/lorin.dawson/repos/sqlfluff/src/sqlfluff/rules/L031.py", line 210, in _lint_aliases_in_join
    and used_alias_ref.raw == alias_name
UnboundLocalError: local variable 'alias_name' referenced before assignment
== [test/fixtures/dialects/spark3/join_types.sql] FAIL
L:  28 | P:   1 | L055 | Use 'LEFT JOIN' instead of 'RIGHT JOIN'.
L:  61 | P:   1 |  PRS | Line 61, Position 1: Found unparsable section: 'SEMI'
L:  61 | P:   1 | L011 | Implicit/explicit aliasing of table.
L:  61 | P:   1 | L014 | Unquoted identifiers must be consistently lower case.
L:  61 | P:   6 | L003 | Indentation not consistent with line #60
L:  61 | P:   6 | L051 | 'INNER JOIN' must be fully qualified.
L:  70 | P:   1 | L031 | Unexpected exception: local variable 'alias_name'
                       | referenced before assignment; Could you open an issue at
                       | https://github.com/sqlfluff/sqlfluff/issues ? You can
                       | ignore this exception for now, by adding '-- noqa: L031'
                       | at the end of line 70
L:  70 | P:   7 | L039 | Unnecessary whitespace found.
L:  72 | P:   1 |  PRS | Line 72, Position 1: Found unparsable section: 'ANTI'
L:  72 | P:   1 | L011 | Implicit/explicit aliasing of table.
L:  72 | P:   1 | L014 | Unquoted identifiers must be consistently lower case.
L:  72 | P:   6 | L003 | Indentation not consistent with line #71
L:  72 | P:   6 | L051 | 'INNER JOIN' must be fully qualified.
L: 100 | P:  15 | L055 | Use 'LEFT JOIN' instead of 'RIGHT JOIN'.
==== summary ====
violations:       14 status:         FAIL
All Finished 📜 🎉!

@tunetheweb
Copy link
Member

It's because SEMI and ANTI are not reserved words so it swallows those up as alias's.

This is due to a mistake in the AliasExpressionSegment where it's missing a Sequence and so the terminator clause is not being used properly. This fixes it:

@spark3_dialect.segment(replace=True)
class AliasExpressionSegment(BaseSegment):
...
    parse_grammar = StartsWith(
        Sequence(
            Ref.keyword("AS", optional=True),
            OneOf(
                # maybe table alias and column aliases
                Sequence(
                    Ref("SingleIdentifierGrammar", optional=True),
                    Bracketed(Ref("SingleIdentifierListSegment")),
                ),
                # just a table alias
                Ref("SingleIdentifierGrammar"),
            ),
        ),
        terminator=OneOf(
            Ref("JoinTypeKeywords"),
            Ref("FromClauseTerminatorGrammar"),
        )
    )

@R7L208
Copy link
Contributor Author

R7L208 commented Feb 8, 2022

This removes the unparsable section error but ANTI and SEMI are still swallowed up as aliases.

[L: 69, P:  1]      |    comment:                                                  '-- anti join'
[L: 69, P: 13]      |    newline:                                                  '\n'
[L: 70, P:  1]      |    base:
[L: 70, P:  1]      |        select_statement:
[L: 70, P:  1]      |            select_clause:
[L: 70, P:  1]      |                keyword:                                      'SELECT'
[L: 70, P:  7]      |                [META] indent:
[L: 70, P:  7]      |                whitespace:                                   '  '
[L: 70, P:  9]      |                select_clause_element:
[L: 70, P:  9]      |                    column_reference:
[L: 70, P:  9]      |                        identifier:                           'employee'
[L: 70, P: 17]      |                        dot:                                  '.'
[L: 70, P: 18]      |                        identifier:                           'id'
[L: 70, P: 20]      |            newline:                                          '\n'
[L: 71, P:  1]      |            [META] dedent:
[L: 71, P:  1]      |            from_clause:
[L: 71, P:  1]      |                keyword:                                      'FROM'
[L: 71, P:  5]      |                whitespace:                                   ' '
[L: 71, P:  6]      |                from_expression:
[L: 71, P:  6]      |                    [META] indent:
[L: 71, P:  6]      |                    from_expression_element:
[L: 71, P:  6]      |                        table_expression:
[L: 71, P:  6]      |                            table_reference:
[L: 71, P:  6]      |                                identifier:                   'employee'
[L: 71, P: 14]      |                        newline:                              '\n'
[L: 72, P:  1]      |                        alias_expression:
[L: 72, P:  1]      |                            identifier:                       'ANTI'
[L: 72, P:  5]      |                    whitespace:                               ' '
[L: 72, P:  6]      |                    [META] dedent:
[L: 72, P:  6]      |                    join_clause:
[L: 72, P:  6]      |                        keyword:                              'JOIN'
[L: 72, P: 10]      |                        [META] indent:
[L: 72, P: 10]      |                        whitespace:                           ' '
[L: 72, P: 11]      |                        from_expression_element:
[L: 72, P: 11]      |                            table_expression:
[L: 72, P: 11]      |                                table_reference:
[L: 72, P: 11]      |                                    identifier:               'department'
[L: 72, P: 21]      |                        newline:                              '\n'
[L: 73, P:  1]      |                        whitespace:                           '    '
[L: 73, P:  5]      |                        join_on_condition:
[L: 73, P:  5]      |                            keyword:                          'ON'
[L: 73, P:  7]      |                            [META] indent:
[L: 73, P:  7]      |                            whitespace:                       ' '
[L: 73, P:  8]      |                            expression:
[L: 73, P:  8]      |                                column_reference:
[L: 73, P:  8]      |                                    identifier:               'employee'
[L: 73, P: 16]      |                                    dot:                      '.'
[L: 73, P: 17]      |                                    identifier:               'deptno'
[L: 73, P: 23]      |                                whitespace:                   ' '
[L: 73, P: 24]      |                                comparison_operator:
[L: 73, P: 24]      |                                    raw_comparison_operator:  '='
[L: 73, P: 25]      |                                whitespace:                   ' '
[L: 73, P: 26]      |                                column_reference:
[L: 73, P: 26]      |                                    identifier:               'department'
[L: 73, P: 36]      |                                    dot:                      '.'
[L: 73, P: 37]      |                                    identifier:               'deptno'
[L: 73, P: 43]      |                            [META] dedent:
[L: 73, P: 43]      |                        [META] dedent:
[L: 73, P: 43]      |    statement_terminator:                                     ';'

Sounds like a simple fix would be to add these to the reserved keyword list but they are not technically reserved keywords. I had a previous iteration on AliasExpressionSegment with Sequence wrapping contents of StartsWith.

I'm still struggling to understand why these are being swallowed as table aliases if they're included in the terminator argument via JoinTypeKeywords.

@codecov
Copy link

codecov bot commented Feb 8, 2022

Codecov Report

Merging #2570 (3d3c8f2) into main (c9c5acf) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main     #2570   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          163       163           
  Lines        11862     11863    +1     
=========================================
+ Hits         11862     11863    +1     
Impacted Files Coverage Δ
src/sqlfluff/dialects/dialect_spark3.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9c5acf...3d3c8f2. Read the comment docs.

@tunetheweb
Copy link
Member

tunetheweb commented Feb 8, 2022

I'm still struggling to understand why these are being swallowed as table aliases if they're included in the terminator argument via JoinTypeKeywords.

I see a couple of issues:

  1. You've set the terminator on the parse_grammar not the match_grammar (looks like you have these the wrong way around).
  2. However when changing these it still didn't work. Although your alias is optional, that means it's either used or it's not. And if it's not, it never sees the terminator clause.

I think the easiest thing is to change the Alias expression to this:

    type = "alias_expression"

    match_grammar = Sequence(
        Ref.keyword("AS", optional=True),
        OneOf(
            # maybe table alias and column aliases
            Sequence(
                Ref("SingleIdentifierGrammar", optional=True),
                Bracketed(Ref("SingleIdentifierListSegment")),
            ),
            # just a table alias
            Ref("SingleIdentifierGrammar"),
            exclude=Ref("JoinTypeKeywords"),
        ),
    )

That is, add an explicit exclude clause. Which is similar to terminator.

And just lose the parse_grammar completely. It was basically a repeat of the match_grammar anyway.

That seems to work for me.

@R7L208
Copy link
Contributor Author

R7L208 commented Feb 8, 2022

That is, add an explicit exclude clause. Which is similar to terminator.

This worked like a charm! Thank you

I'll clean everything up and take it out of Draft once all checks pass.

@R7L208 R7L208 changed the title [DRAFT] Spark3: JOIN clause enhancements Spark3: JOIN clause enhancements Feb 8, 2022
@tunetheweb tunetheweb merged commit c8da62d into sqlfluff:main Feb 9, 2022
@R7L208 R7L208 deleted the r7l208/spark3-join-type-enhancements branch August 30, 2022 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants