Spark3: Add Direct File Query #2553

Merged: 25 commits into sqlfluff:main on Feb 4, 2022

Contributor

@R7L208 R7L208 commented Feb 3, 2022

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

  • Reorganized some grammar for Files to isolate File keywords that support this clause.

  • Removed "XML" File format. No test cases and requires additional work before it will parse correctly.

  • Linting fails due to special characters in identifiers (L057). The special characters are required in these file-path identifiers, so this case should be an exception to L057. I haven't worked on rules yet, so I'd appreciate any guidance on where to start fixing this.

> sqlfluff lint --dialect spark3 test/fixtures/dialects/spark3/select_from_file.sql
== [test/fixtures/dialects/spark3/select_from_file.sql] FAIL
L:   6 | P:  14 | L057 | Do not use special characters in identifiers.
L:  13 | P:  14 | L057 | Do not use special characters in identifiers.
L:  20 | P:  10 | L057 | Do not use special characters in identifiers.
L:  27 | P:  11 | L057 | Do not use special characters in identifiers.
L:  34 | P:  11 | L057 | Do not use special characters in identifiers.
L:  41 | P:  11 | L057 | Do not use special characters in identifiers.
L:  48 | P:  11 | L057 | Do not use special characters in identifiers.
L:  55 | P:  17 | L057 | Do not use special characters in identifiers.
L:  62 | P:  17 | L057 | Do not use special characters in identifiers.
L:  69 | P:  10 | L057 | Do not use special characters in identifiers.
L:  76 | P:  12 | L057 | Do not use special characters in identifiers.
All Finished 📜 🎉!
[L: 76, P:  6]      |                from_expression:
[L: 76, P:  6]      |                    [META] indent:
[L: 76, P:  6]      |                    from_expression_element:
[L: 76, P:  6]      |                        table_expression:
[L: 76, P:  6]      |                            file_reference:
[L: 76, P:  6]      |                                keyword:                      'DELTA'
[L: 76, P: 11]      |                                dot:                          '.'
[L: 76, P: 12]      |                                identifier:                   '`/mnt/datalake/table`'
[L: 76, P: 33]      |                    [META] dedent:
[L: 76, P: 33]      |    statement_terminator:                                     ';'
[L: 76, P: 34]      |    newline:                                                  '\n'
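
The parse tree above shows a file reference breaking into a format keyword, a dot, and a backtick-quoted path. A minimal Python sketch of that split follows. It uses a plain regular expression rather than sqlfluff's grammar, and both the `FILE_FORMATS` set and the `split_file_reference` helper are hypothetical names; the format list is an assumption, not the dialect's actual keyword list.

```python
import re

# Assumed set of file-format keywords; the dialect's real list may differ.
FILE_FORMATS = {"DELTA", "CSV", "JSON", "PARQUET", "ORC", "TEXT"}

# A format keyword, a dot, then a backtick-quoted path.
_FILE_REF = re.compile(
    r"^(?P<keyword>[A-Za-z]+)(?P<dot>\.)(?P<identifier>`[^`]+`)$"
)


def split_file_reference(text):
    """Return (keyword, dot, identifier), or None if text is not a file reference."""
    match = _FILE_REF.match(text.strip())
    if match is None or match.group("keyword").upper() not in FILE_FORMATS:
        return None
    return match.group("keyword"), match.group("dot"), match.group("identifier")


parts = split_file_reference("DELTA.`/mnt/datalake/table`")
```

Here `parts` holds the same three pieces the parser reports for line 76: the keyword, the dot, and the quoted identifier.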

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

reorganized file grammar to isolate file formats that allow this

removed xml file format for now since requires additional work
@R7L208 R7L208 marked this pull request as draft February 3, 2022 16:09

codecov bot commented Feb 3, 2022

Codecov Report

Merging #2553 (cdad74f) into main (a8234db) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #2553   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          163       163           
  Lines        11828     11838   +10     
=========================================
+ Hits         11828     11838   +10     
Impacted Files                             Coverage Δ
src/sqlfluff/dialects/dialect_spark3.py    100.00% <100.00%> (ø)
src/sqlfluff/rules/L057.py                 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@R7L208 R7L208 marked this pull request as ready for review February 3, 2022 19:42
Member

@tunetheweb tunetheweb left a comment

Looks good. One question and one nit.

Contributor Author

R7L208 commented Feb 4, 2022

@tunetheweb - I'll work on the review comments today. Thank you!

Do you have a suggestion or similar PR you could point me to for adjusting L057 since identifier is required in this case? It shouldn't raise the rule here.

Member

tunetheweb commented Feb 4, 2022

> @tunetheweb - I'll work on the review comments today. Thank you!
>
> Do you have a suggestion or similar PR you could point me to for adjusting L057 since identifier is required in this case? It shouldn't raise the rule here.

In L057 there is already an exception for BigQuery:

            # BigQuery table references are quoted in back ticks so allow dots
            #
            # It also allows a star at the end of table_references for wildcards
            # (https://cloud.google.com/bigquery/docs/querying-wildcard-tables)
            #
            # Strip both out before testing the identifier
            if (
                context.dialect.name in ["bigquery"]
                and context.parent_stack
                and context.parent_stack[-1].name == "TableReferenceSegment"
            ):
                if identifier[-1] == "*":
                    identifier = identifier[:-1]
                identifier = identifier.replace(".", "")

You could do something similar for this:

            # Allow slashes in file references
            if (
                context.parent_stack
                and context.parent_stack[-1].type == "file_reference"
            ):
                identifier = identifier.replace("/", "")
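
The idea in the suggestion above can be tried outside sqlfluff with a small standalone sketch. Everything here is a stand-in: `has_special_characters`, its `parent_type` argument, and the regular expression are hypothetical names chosen for illustration, and stripping the backticks before the test is an assumption about how quoted identifiers are handled, not a copy of the rule's code.

```python
import re

# Hypothetical stand-in for the rule's "letters, digits, underscores only" test.
_PLAIN_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")


def has_special_characters(identifier, parent_type=None):
    """Return True if the identifier would be flagged.

    parent_type is a hypothetical string standing in for the type of the
    enclosing segment (context.parent_stack[-1] in the snippet above).
    """
    # Assumption: test the quoted content, not the surrounding backticks.
    identifier = identifier.strip("`")
    # Allow slashes in file references by stripping them before the test.
    if parent_type == "file_reference":
        identifier = identifier.replace("/", "")
    return _PLAIN_IDENTIFIER.match(identifier) is None


flagged_plain = has_special_characters("`/mnt/datalake/table`")
flagged_file = has_special_characters("`/mnt/datalake/table`", "file_reference")
```

With this sketch the same quoted path is flagged as an ordinary identifier but passes once the parent is a file reference, which is the behavior the PR needs. Paths containing other characters (dots, dashes, glob patterns) would still be flagged unless they are stripped in the same way.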

Contributor Author

R7L208 commented Feb 4, 2022

Yep, just added them and running pytest locally then will push

Member

@tunetheweb tunetheweb left a comment

LGTM

@tunetheweb tunetheweb merged commit cc414e5 into sqlfluff:main Feb 4, 2022
@R7L208 R7L208 mentioned this pull request Feb 7, 2022
@R7L208 R7L208 deleted the r7l208/spark3-select-from-file branch August 30, 2022 19:54
Development

Successfully merging this pull request may close these issues.

Issue with Databricks absolute path tables