Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snowflake: Add CREATE FILE FORMAT Support #3104

Merged
merged 17 commits into from May 5, 2022

Conversation

jmc-bbk
Copy link
Contributor

@jmc-bbk jmc-bbk commented Apr 15, 2022

CREATE FILE FORMAT ... is currently unparsable as it is defined in CreateStatementSegment() which does not include parameters for file formats (it currently handles Warehouses, Pipes, and soon Integrations #3075).

An example of unparsable code:

CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV
    SKIP_HEADER = 1
    COMPRESSION = NONE
    FIELD_DELIMITER = '|'
    DATE_FORMAT = 'DD/MM/YYYY'
;

In this PR, I want to introduce CreateFileFormatSegment() so that we can:

  • Begin to breakdown CreateStatementSegment() into it's component parts.
  • Provide better support for optional parameters for file formats (e.g. CSV, JSON, ARVO).

I have made some good progress, and the above would now be parsable. But I want to get advice from the maintainers before I continue building out this pattern. I will also post on the Slack channel for some help.

One big issue that I'm unsure on how to resolve is that both comma separation and no comma separation are valid syntax:

-- No comma separation.
CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV
    SKIP_HEADER = 1

-- Comma separation.
CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV,
    SKIP_HEADER = 1,

This PR currently only handles for no comma separation.

To do

  • Handle for both comma and non-comma separated sets.
  • Update other similar classes (e.g. FileFormatSegment) to reduce duplication / redundant code.
  • Add test cases and a specific create_file_format.sql file.
  • Determine whether the above pattern is correct (or too specific) - and make changes.

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@codecov
Copy link

codecov bot commented Apr 15, 2022

Codecov Report

Merging #3104 (96f4457) into main (6954d68) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #3104   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          167       167           
  Lines        12507     12531   +24     
=========================================
+ Hits         12507     12531   +24     
Impacted Files Coverage Δ
...rc/sqlfluff/dialects/dialect_snowflake_keywords.py 100.00% <ø> (ø)
src/sqlfluff/dialects/dialect_snowflake.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6954d68...96f4457. Read the comment docs.

@tunetheweb
Copy link
Member

One big issue that I'm unsure on how to resolve is that both comma separation and no comma separation are valid syntax:

We don't currently have an OptionallyDelimted type, and don't think it's worth introducing for this.

Instead what I'd suggest something like this:

        OneOf(
            Delimited(
                Ref("CsvFileFormatTypeOptions"),  # TODO: Add Other File Formats.
             ),
            AnySetOf(
                Ref("CsvFileFormatTypeOptions"),  # TODO: Add Other File Formats.
            ),
        ),

@jmc-bbk
Copy link
Contributor Author

jmc-bbk commented Apr 26, 2022

Starting to go a bit crazy with this one. Here's some examples of "valid" syntax on Snowflake:

-- Repeated arguments, some delimited
CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV
    TYPE = CSV,
    TYPE = CSV,
    TYPE = CSV
;
-- No type specified (valid in CREATE OR REPLACE but not elsewhere)
CREATE OR REPLACE FILE FORMAT my_file_format
    SKIP_HEADER = 1,
    RECORD_DELIMITER = '|'
;
-- Duplicated type, mixed delimited, type succeeds optional parameters
CREATE or replace EXTERNAL TABLE dev_test
LOCATION = @database.schema.my_stage
FILE_FORMAT = (SKIP_HEADER = 1 TYPE = CSV, TYPE = CSV);

And then an example of "not valid"

-- No type specified (note this is acceptable in create or replace)
CREATE or replace EXTERNAL TABLE dev_test
LOCATION = @database.schema.my_stage
FILE_FORMAT = (SKIP_HEADER = 1);

🤷‍♂️

@tunetheweb
Copy link
Member

I suggest you get the basic working. Can always enhance it later.

@jmc-bbk
Copy link
Contributor Author

jmc-bbk commented Apr 30, 2022

I've added in the remaining file types, so made a bit more progress. But think it would be helpful for your opinions at this stage @tunetheweb !

The code can now handle:

  • Delimited arguments a = 1, b = 2, c = 3
  • Non-delimited arguments a = 1 b = 2 c = 3

But I have deliberately ignored mixed arguments a = 1 b = 2, c = 3 and these would cause a parse error.

Another thing I can't currently handle is:

Example (1)

-- TYPE coming after optional file parameters (which is valid syntax in CREATE OR REPLACE).

CREATE OR REPLACE my_file_format
  SKIP_HEADER = 1
  TYPE = CSV 

The above 👆 could be solved by moving type into the file optional parameters but this feels wrong, as it isn't really an optional parameter.

Moving into into the optional parameters, would mean that we would parse this "invalid" syntax where TYPE is required:

Example (2)

-- No type specified (note this is acceptable in CREATE OR REPLACE)
CREATE or replace EXTERNAL TABLE dev_test
LOCATION = @database.schema.my_stage
FILE_FORMAT = (SKIP_HEADER = 1);

Last thing!

We can't currently handle TYPE being anywhere in the parameters, as shown by this snippet in class FileFormatSegment(BaseSegment): that expects TYPE to come first.


                Sequence(
                    "TYPE",
                    Ref("EqualsSegment"),
                    Ref("FileType"),
                    # formatTypeOptions - To Do to make this more specific
                    Ref("FormatTypeOptionsSegment", optional=True),
                ),

If we did move type into the optional parameters I've created here, we would resolve the issue of TYPE appearing anywhere (see Example 1) - but introduce the issue that we parse code without TYPE (see Example 2).

Sorry that's really long to explain - Snowflake's parser makes really weird decisions in this area.

@tunetheweb
Copy link
Member

I would move it in to the params. SQLFluff syntax is not expected to 100% match the actual dialect - in part because we can't always match.

#2426 is a suggestion to add an ability to make an option mandatory, but that's not be implemented yet.

@jmc-bbk
Copy link
Contributor Author

jmc-bbk commented May 3, 2022

Okay, I think this is ready for review. Jumped around a bit here but the current thing "works".

Changes:

  • Replaced CompressionTypeGrammar with CompressionType, this can handle quoted compressions.
  • Removed FileFormatSegment which is replaced with specific classes (e.g. CsvFileFormatParameters).
  • No longer use FileType but have left this present. Prefer embedded RegexParser to make exact type match (e.g. only parse JSON parameters, when matched to JSON ~ rather than any valid file type).
  • Question whether a similar approach to above ☝️ would be valid for CompressionType. Here I have the opposite pattern, and will match to any compression type (I did this to avoid multiple parsers that are all similar to one another).

Improvements:

  • Can handle TYPE at any position in the parameters.
  • Can handle quoted compression types.
  • Begin to decompose CreateStatementSegment.
  • Better parse tree with more specific types (e.g. xml_file_format_parameters).
  • Separation of file format parameters allows easier future development.

Concerns:

  • Handle for both delimited (e.g. a, b, c) and non-delimited arguments (e.g. a b c) but Snowflake allows mixed arguments.
  • Consistency between CompressionType (non-specific) and file types (e.g. RegexParser(r"'?JSON'? ...) that are specific.
  • Now redundant FileType parser (can delete this if above approach is okay).
  • Added keywords to Snowflake dialect, that may no longer be required with above parsers.

@jmc-bbk jmc-bbk marked this pull request as ready for review May 3, 2022 18:46
@jmc-bbk jmc-bbk changed the title [WIP] Snowflake: Add Create File Format Segment, Enhance Parseable Grammar Snowflake: Add Create File Format Segment, Enhance Parseable Grammar May 3, 2022
src/sqlfluff/dialects/dialect_snowflake.py Outdated Show resolved Hide resolved
src/sqlfluff/dialects/dialect_snowflake.py Show resolved Hide resolved
src/sqlfluff/dialects/dialect_snowflake.py Outdated Show resolved Hide resolved
src/sqlfluff/dialects/dialect_snowflake.py Outdated Show resolved Hide resolved
@jmc-bbk
Copy link
Contributor Author

jmc-bbk commented May 5, 2022

Hopefully quick question @tunetheweb and @WittierDinosaur - and I will try make suggested changes today (or ASAP).

Should we:

(a) leave the file_type in the individual classes (e.g. CsvFileFormatTypeParameters) but with the StringParser()?

(b) change this pattern to align with CompressionType and have something like FileType =...Parser('(CSV|JSON|AVRO... and then have a single FileFormatParametersSegment class?

Advantages of (a):

  • Can make changes to individual file format classes without affecting the others
  • Parse tree will recognise (e.g.) csv_file_format_parameters

Advantages of (b):

  • Generalisable class handles all parameters
  • Changes affect all file formats / non-specific parse tree

The reason why I want to make sure, is there is the exact same argument for CompressionType - but at the moment I'm handling that in the style of (b) rather than (a).

The reason I went with the generalisable compression type is that:

  • CSV/JSON/AVRO/XML all handle the same compression types
  • PARQUET has a different set of compression types

I thought it starts to get messy if I have a different CompressionType class for each file format, or have two that handle the different formats above - but unsure on this.


Anyway! Let me know whether to go for method (a) or (b) for the FileFormat and I will try wrap this up ASAP.

@tunetheweb
Copy link
Member

I think if you're having different sections for each (which you have currently done), then it makes more sense to further restrict to the actual type, rather than the whole group. I think StringParser is better to prevent 'CSV or CSV' being allowed as per my comment above.

For the over Compression Type I think two RegexParsers are the way to go one with quotes, one without, wrapped in a OneOf(). But know @WittierDinosaur has some concerns on RegexParsers but did ask some questions above as to how valid those concerns are for this specific case.

The other alternative is to generalise this into one big statement. I.e. rather than having separate CsvFileFormatTypeParameters, JsonFileFormatTypeParameters...etc. grammars could just have a more generic FileFormatTypeParameters that accepts any of those formats for any of the types. Technically less accurate but could be a lot less code. But then you've done the work now, so might as well keep the specific ones.

@jmc-bbk jmc-bbk requested a review from tunetheweb May 5, 2022 16:39
src/sqlfluff/dialects/dialect_snowflake.py Outdated Show resolved Hide resolved
src/sqlfluff/dialects/dialect_snowflake.py Outdated Show resolved Hide resolved
jmc-bbk and others added 3 commits May 5, 2022 18:13
@jmc-bbk
Copy link
Contributor Author

jmc-bbk commented May 5, 2022

Hopefully done @tunetheweb !

@tunetheweb tunetheweb changed the title Snowflake: Add Create File Format Segment, Enhance Parseable Grammar Snowflake: Add CREATE FILE FORMAT Support May 5, 2022
@tunetheweb tunetheweb merged commit feb0aff into sqlfluff:main May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants