Spark3: Add Direct File Query #2553

R7L208 · 2022-02-03T16:09:00Z

Brief summary of the change made

Adds support to directly query files of supported types.
Fixes Issue with Databricks absolute path tables #602

Are there any other side effects of this change that we should be aware of?

Reorganized some grammar for Files to isolate File keywords that support this clause.
Removed the "XML" file format: it has no test cases and requires additional work before it will parse correctly.
Linting fails on special characters in identifiers (L057). The backtick-quoted file paths need these characters, so they should be an exception to L057. I have not worked on rules yet, so I would appreciate any guidance on where to start fixing this.

> sqlfluff lint --dialect spark3 test/fixtures/dialects/spark3/select_from_file.sql
== [test/fixtures/dialects/spark3/select_from_file.sql] FAIL
L:   6 | P:  14 | L057 | Do not use special characters in identifiers.
L:  13 | P:  14 | L057 | Do not use special characters in identifiers.
L:  20 | P:  10 | L057 | Do not use special characters in identifiers.
L:  27 | P:  11 | L057 | Do not use special characters in identifiers.
L:  34 | P:  11 | L057 | Do not use special characters in identifiers.
L:  41 | P:  11 | L057 | Do not use special characters in identifiers.
L:  48 | P:  11 | L057 | Do not use special characters in identifiers.
L:  55 | P:  17 | L057 | Do not use special characters in identifiers.
L:  62 | P:  17 | L057 | Do not use special characters in identifiers.
L:  69 | P:  10 | L057 | Do not use special characters in identifiers.
L:  76 | P:  12 | L057 | Do not use special characters in identifiers.
All Finished 📜 🎉!

[L: 76, P:  6]      |                from_expression:
[L: 76, P:  6]      |                    [META] indent:
[L: 76, P:  6]      |                    from_expression_element:
[L: 76, P:  6]      |                        table_expression:
[L: 76, P:  6]      |                            file_reference:
[L: 76, P:  6]      |                                keyword:                      'DELTA'
[L: 76, P: 11]      |                                dot:                          '.'
[L: 76, P: 12]      |                                identifier:                   '`/mnt/datalake/table`'
[L: 76, P: 33]      |                    [META] dedent:
[L: 76, P: 33]      |    statement_terminator:                                     ';'
[L: 76, P: 34]      |    newline:                                                  '\n'
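
To make the parse output above easier to follow, here is a minimal sketch of how a direct file reference such as DELTA.`/mnt/datalake/table` breaks into the keyword, dot, and quoted identifier shown in the tree. It uses a plain regular expression rather than the sqlfluff grammar, and the names FILE_FORMATS, FileReference, and split_file_reference are hypothetical. The format list is an illustrative subset, not the list the dialect actually supports.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical, illustrative subset of format keywords. The real list lives in
# the dialect grammar and may differ.
FILE_FORMATS = ("DELTA", "PARQUET", "JSON", "CSV")

# keyword, then a dot, then a back-tick quoted path
FILE_REF = re.compile(
    r"(?P<keyword>" + "|".join(FILE_FORMATS) + r")"
    r"(?P<dot>\.)"
    r"(?P<identifier>`[^`]+`)",
    re.IGNORECASE,
)


class FileReference(NamedTuple):
    keyword: str
    dot: str
    identifier: str


def split_file_reference(text: str) -> Optional[FileReference]:
    """Return the three parts of a file reference, or None if it is not one."""
    m = FILE_REF.fullmatch(text.strip())
    if m is None:
        return None
    return FileReference(m["keyword"], m["dot"], m["identifier"])


if __name__ == "__main__":
    print(split_file_reference("DELTA.`/mnt/datalake/table`"))
    print(split_file_reference("my_db.my_table"))
```

The second call returns None because an ordinary two-part table name has no format keyword and no quoted path.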

Pull Request checklist

Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
- .yml rule test cases in test/fixtures/rules/std_rule_cases.
- .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
- Full autofix test cases in test/fixtures/linter/autofix.
- Other.
Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.

codecov · 2022-02-03T16:21:27Z

Codecov Report

Merging #2553 (cdad74f) into main (a8234db) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #2553   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          163       163           
  Lines        11828     11838   +10     
=========================================
+ Hits         11828     11838   +10

Impacted Files	Coverage Δ
src/sqlfluff/dialects/dialect_spark3.py	`100.00% <100.00%> (ø)`
src/sqlfluff/rules/L057.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tunetheweb

Looks good. One question and one nit.

src/sqlfluff/dialects/dialect_spark3.py

R7L208 · 2022-02-04T11:37:47Z

@tunetheweb - I'll work on the review comments today. Thank you!

Do you have a suggestion or similar PR you could point me to for adjusting L057 since identifier is required in this case? It shouldn't raise the rule here.

tunetheweb · 2022-02-04T11:51:47Z

In L057 there is already an exception for BigQuery:

            # BigQuery table references are quoted in back ticks so allow dots
            #
            # It also allows a star at the end of table_references for wildcards
            # (https://cloud.google.com/bigquery/docs/querying-wildcard-tables)
            #
            # Strip both out before testing the identifier
            if (
                context.dialect.name in ["bigquery"]
                and context.parent_stack
                and context.parent_stack[-1].name == "TableReferenceSegment"
            ):
                if identifier[-1] == "*":
                    identifier = identifier[:-1]
                identifier = identifier.replace(".", "")

You could do something similar for this:

            # Allow slashes in file references
            if (
                context.parent_stack
                and context.parent_stack[-1].type == "file_reference"
            ):
                identifier = identifier.replace("/", "")
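
The suggestion above can be tried out in isolation with a small stand-alone sketch. It does not import sqlfluff: FakeSegment and has_special_characters are hypothetical stand-ins, and the final regular-expression check is an assumption about what "special characters" means, not the rule's actual test. It only illustrates the idea of stripping slashes when the parent segment is a file_reference, and the merged change may handle this differently.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class FakeSegment:
    """Hypothetical stand-in for a parent segment in the parse tree."""

    type: str


def has_special_characters(identifier: str, parent_stack: List[FakeSegment]) -> bool:
    """Return True if the identifier would be flagged in this simplified model."""
    # Drop the surrounding back ticks of a quoted identifier
    identifier = identifier.strip("`")

    # Allow slashes in file references
    if parent_stack and parent_stack[-1].type == "file_reference":
        identifier = identifier.replace("/", "")

    # Assumed check: anything besides letters, digits, and underscores is special
    return re.fullmatch(r"[A-Za-z0-9_]+", identifier) is None


if __name__ == "__main__":
    path = "`/mnt/datalake/table`"
    print(has_special_characters(path, [FakeSegment("file_reference")]))
    print(has_special_characters(path, [FakeSegment("table_reference")]))
```

With a file_reference parent the path passes, while the same text under a table_reference parent is still flagged, which is the behaviour the exception is meant to produce.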

src/sqlfluff/rules/L057.py

R7L208 · 2022-02-04T16:49:09Z

Yep, just added them and running pytest locally then will push

tunetheweb

LGTM

R7L208 added 2 commits February 3, 2022 11:00

Add direct file query

e940988

reorganized file grammar to isolate file formats that allow this; removed xml file format for now since it requires additional work

black

50877c8

R7L208 marked this pull request as draft February 3, 2022 16:09

R7L208 added 5 commits February 3, 2022 11:24

fix segment decorator and add segment

402e79d

move FileReferenceSegment before TableReferenceSegment

35a4f0c

this ensures a match on file reference and files are not matched as tables

sqlfluff fix test file and update yml

e6d0d7f

update docstring and add link to docs

48ae3c7

switch QuotedIdentifierSegment to QuotedLiteralSegment inside FileReferenceSegment

2d7c8b2

ran sqlfluff fix on tests and updated yml again
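
The commit note above about moving FileReferenceSegment before TableReferenceSegment relies on the order of alternatives mattering when more than one of them can match the same text. The sketch below shows that effect with a simplified "first match wins" model. This is an assumption made for illustration and not a description of how sqlfluff resolves alternatives internally. All function names and both regular expressions are hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

Matcher = Tuple[str, Callable[[str], bool]]


def is_table_reference(text: str) -> bool:
    # Simplified: dot-separated parts, each bare or back-tick quoted
    part = r"(?:\w+|`[^`]+`)"
    return re.fullmatch(rf"{part}(?:\.{part})*", text) is not None


def is_file_reference(text: str) -> bool:
    # Simplified: a format keyword (illustrative subset), a dot, a quoted path
    pattern = r"(?:DELTA|PARQUET|JSON|CSV)\.`[^`]+`"
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None


def first_match(text: str, matchers: List[Matcher]) -> Optional[str]:
    """Return the name of the first alternative that matches, in list order."""
    for name, matches in matchers:
        if matches(text):
            return name
    return None


FILE_FIRST: List[Matcher] = [
    ("file_reference", is_file_reference),
    ("table_reference", is_table_reference),
]
TABLE_FIRST: List[Matcher] = list(reversed(FILE_FIRST))

if __name__ == "__main__":
    sql = "DELTA.`/mnt/datalake/table`"
    print(first_match(sql, FILE_FIRST))
    print(first_match(sql, TABLE_FIRST))
```

Because the file path also looks like a two-part table name in this model, listing the table alternative first would classify it as a table, while listing the file alternative first classifies it as a file.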

R7L208 marked this pull request as ready for review February 3, 2022 19:42

tunetheweb requested changes Feb 4, 2022

src/sqlfluff/dialects/dialect_spark3.py

R7L208 added 9 commits February 4, 2022 09:49

additional test cases for L057 exception

e343994

minor formatting

925801c

add semicolons at end of new tests

71ab460

noqa for line too long on html link

a4b45c3

revert back to QuotedIdentifierSegment and add NB

cdc7ed8

sqlfluff fix test cases

1c6cfc5

add exceptions for identifiers to L057

2846171

refresh yml

fc1e4a8

black

2f97dae

tunetheweb reviewed Feb 4, 2022

src/sqlfluff/rules/L057.py

R7L208 added 4 commits February 4, 2022 11:30

return None for identifiers in FileReferenceSegment

8138624

cleanup L057

b4fc23d

black

cd3159a

remove redundant test case

dd5cdf8

tunetheweb requested changes Feb 4, 2022

src/sqlfluff/rules/L057.py

new test cases for L057

4c46a18

R7L208 and others added 4 commits February 4, 2022 12:30

update generated yml

b1aeb6d

remove noqa

ade272e

Merge branch 'main' into r7l208/spark3-select-from-file

d3d0d65

Update src/sqlfluff/dialects/dialect_spark3.py

cdad74f

tunetheweb approved these changes Feb 4, 2022

tunetheweb merged commit cc414e5 into sqlfluff:main Feb 4, 2022

R7L208 mentioned this pull request Feb 7, 2022

[draft] Spark3: Join Improvements #2569

Closed

1 task

R7L208 deleted the r7l208/spark3-select-from-file branch August 30, 2022 19:54
