Snowflake: Add `CREATE FILE FORMAT` Support #3104

jmc-bbk · 2022-04-15T14:08:12Z

CREATE FILE FORMAT ... is currently unparsable as it is defined in CreateStatementSegment() which does not include parameters for file formats (it currently handles Warehouses, Pipes, and soon Integrations #3075).

An example of unparsable code:

CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV
    SKIP_HEADER = 1
    COMPRESSION = NONE
    FIELD_DELIMITER = '|'
    DATE_FORMAT = 'DD/MM/YYYY'
;

In this PR, I want to introduce CreateFileFormatSegment() so that we can:

Begin to breakdown CreateStatementSegment() into it's component parts.
Provide better support for optional parameters for file formats (e.g. CSV, JSON, ARVO).

I have made some good progress, and the above would now be parsable. But I want to get advice from the maintainers before I continue building out this pattern. I will also post on the Slack channel for some help.

One big issue that I'm unsure on how to resolve is that both comma separation and no comma separation are valid syntax:

-- No comma separation.
CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV
    SKIP_HEADER = 1

-- Comma separation.
CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV,
    SKIP_HEADER = 1,

This PR currently only handles for no comma separation.

To do

Handle for both comma and non-comma separated sets.
Update other similar classes (e.g. FileFormatSegment) to reduce duplication / redundant code.
Add test cases and a specific create_file_format.sql file.
Determine whether the above pattern is correct (or too specific) - and make changes.

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
- .yml rule test cases in test/fixtures/rules/std_rule_cases.
- .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
- Full autofix test cases in test/fixtures/linter/autofix.
- Other.
Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.

codecov · 2022-04-15T14:21:05Z

Codecov Report

Merging #3104 (96f4457) into main (6954d68) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #3104   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          167       167           
  Lines        12507     12531   +24     
=========================================
+ Hits         12507     12531   +24

Impacted Files	Coverage Δ
...rc/sqlfluff/dialects/dialect_snowflake_keywords.py	`100.00% <ø> (ø)`
src/sqlfluff/dialects/dialect_snowflake.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6954d68...96f4457. Read the comment docs.

tunetheweb · 2022-04-17T07:27:11Z

One big issue that I'm unsure on how to resolve is that both comma separation and no comma separation are valid syntax:

We don't currently have an OptionallyDelimted type, and don't think it's worth introducing for this.

Instead what I'd suggest something like this:

        OneOf(
            Delimited(
                Ref("CsvFileFormatTypeOptions"),  # TODO: Add Other File Formats.
             ),
            AnySetOf(
                Ref("CsvFileFormatTypeOptions"),  # TODO: Add Other File Formats.
            ),
        ),

jmc-bbk · 2022-04-26T16:46:27Z

Starting to go a bit crazy with this one. Here's some examples of "valid" syntax on Snowflake:

-- Repeated arguments, some delimited
CREATE OR REPLACE FILE FORMAT my_file_format
    TYPE = CSV
    TYPE = CSV,
    TYPE = CSV,
    TYPE = CSV
;

-- No type specified (valid in CREATE OR REPLACE but not elsewhere)
CREATE OR REPLACE FILE FORMAT my_file_format
    SKIP_HEADER = 1,
    RECORD_DELIMITER = '|'
;

-- Duplicated type, mixed delimited, type succeeds optional parameters
CREATE or replace EXTERNAL TABLE dev_test
LOCATION = @database.schema.my_stage
FILE_FORMAT = (SKIP_HEADER = 1 TYPE = CSV, TYPE = CSV);

And then an example of "not valid"

-- No type specified (note this is acceptable in create or replace)
CREATE or replace EXTERNAL TABLE dev_test
LOCATION = @database.schema.my_stage
FILE_FORMAT = (SKIP_HEADER = 1);

🤷‍♂️

tunetheweb · 2022-04-26T17:12:28Z

I suggest you get the basic working. Can always enhance it later.

jmc-bbk · 2022-04-30T18:07:38Z

I've added in the remaining file types, so made a bit more progress. But think it would be helpful for your opinions at this stage @tunetheweb !

The code can now handle:

Delimited arguments a = 1, b = 2, c = 3
Non-delimited arguments a = 1 b = 2 c = 3

But I have deliberately ignored mixed arguments a = 1 b = 2, c = 3 and these would cause a parse error.

Another thing I can't currently handle is:

Example (1)

-- TYPE coming after optional file parameters (which is valid syntax in CREATE OR REPLACE).

CREATE OR REPLACE my_file_format
  SKIP_HEADER = 1
  TYPE = CSV

The above 👆 could be solved by moving type into the file optional parameters but this feels wrong, as it isn't really an optional parameter.

Moving into into the optional parameters, would mean that we would parse this "invalid" syntax where TYPE is required:

Example (2)

-- No type specified (note this is acceptable in CREATE OR REPLACE)
CREATE or replace EXTERNAL TABLE dev_test
LOCATION = @database.schema.my_stage
FILE_FORMAT = (SKIP_HEADER = 1);

Last thing!

We can't currently handle TYPE being anywhere in the parameters, as shown by this snippet in class FileFormatSegment(BaseSegment): that expects TYPE to come first.


                Sequence(
                    "TYPE",
                    Ref("EqualsSegment"),
                    Ref("FileType"),
                    # formatTypeOptions - To Do to make this more specific
                    Ref("FormatTypeOptionsSegment", optional=True),
                ),

If we did move type into the optional parameters I've created here, we would resolve the issue of TYPE appearing anywhere (see Example 1) - but introduce the issue that we parse code without TYPE (see Example 2).

Sorry that's really long to explain - Snowflake's parser makes really weird decisions in this area.

tunetheweb · 2022-05-01T22:19:21Z

I would move it in to the params. SQLFluff syntax is not expected to 100% match the actual dialect - in part because we can't always match.

#2426 is a suggestion to add an ability to make an option mandatory, but that's not be implemented yet.

jmc-bbk · 2022-05-03T18:34:25Z

Okay, I think this is ready for review. Jumped around a bit here but the current thing "works".

Changes:

Replaced CompressionTypeGrammar with CompressionType, this can handle quoted compressions.
Removed FileFormatSegment which is replaced with specific classes (e.g. CsvFileFormatParameters).
No longer use FileType but have left this present. Prefer embedded RegexParser to make exact type match (e.g. only parse JSON parameters, when matched to JSON ~ rather than any valid file type).
Question whether a similar approach to above ☝️ would be valid for CompressionType. Here I have the opposite pattern, and will match to any compression type (I did this to avoid multiple parsers that are all similar to one another).

Improvements:

Can handle TYPE at any position in the parameters.
Can handle quoted compression types.
Begin to decompose CreateStatementSegment.
Better parse tree with more specific types (e.g. xml_file_format_parameters).
Separation of file format parameters allows easier future development.

Concerns:

Handle for both delimited (e.g. a, b, c) and non-delimited arguments (e.g. a b c) but Snowflake allows mixed arguments.
Consistency between CompressionType (non-specific) and file types (e.g. RegexParser(r"'?JSON'? ...) that are specific.
Now redundant FileType parser (can delete this if above approach is okay).
Added keywords to Snowflake dialect, that may no longer be required with above parsers.

src/sqlfluff/dialects/dialect_snowflake.py

jmc-bbk · 2022-05-05T11:41:40Z

Hopefully quick question @tunetheweb and @WittierDinosaur - and I will try make suggested changes today (or ASAP).

Should we:

(a) leave the file_type in the individual classes (e.g. CsvFileFormatTypeParameters) but with the StringParser()?

(b) change this pattern to align with CompressionType and have something like FileType =...Parser('(CSV|JSON|AVRO... and then have a single FileFormatParametersSegment class?

Advantages of (a):

Can make changes to individual file format classes without affecting the others
Parse tree will recognise (e.g.) csv_file_format_parameters

Advantages of (b):

Generalisable class handles all parameters
Changes affect all file formats / non-specific parse tree

The reason why I want to make sure, is there is the exact same argument for CompressionType - but at the moment I'm handling that in the style of (b) rather than (a).

The reason I went with the generalisable compression type is that:

CSV/JSON/AVRO/XML all handle the same compression types
PARQUET has a different set of compression types

I thought it starts to get messy if I have a different CompressionType class for each file format, or have two that handle the different formats above - but unsure on this.

Anyway! Let me know whether to go for method (a) or (b) for the FileFormat and I will try wrap this up ASAP.

tunetheweb · 2022-05-05T11:48:52Z

I think if you're having different sections for each (which you have currently done), then it makes more sense to further restrict to the actual type, rather than the whole group. I think StringParser is better to prevent 'CSV or CSV' being allowed as per my comment above.

For the over Compression Type I think two RegexParsers are the way to go one with quotes, one without, wrapped in a OneOf(). But know @WittierDinosaur has some concerns on RegexParsers but did ask some questions above as to how valid those concerns are for this specific case.

The other alternative is to generalise this into one big statement. I.e. rather than having separate CsvFileFormatTypeParameters, JsonFileFormatTypeParameters...etc. grammars could just have a more generic FileFormatTypeParameters that accepts any of those formats for any of the types. Technically less accurate but could be a lot less code. But then you've done the work now, so might as well keep the specific ones.

src/sqlfluff/dialects/dialect_snowflake.py

Co-authored-by: Barry Pollard <barry_pollard@hotmail.com>

…into snowflake-file-formats

jmc-bbk · 2022-05-05T17:17:21Z

Hopefully done @tunetheweb !

Add CreateFileFormatSegment, CSV Format Type Options

88f28c5

Merge main

27b3db2

Swapping to AnyNumberOf()

8e75770

Add other file formats, complete create file format

65c94b2

jmc-bbk added 4 commits May 3, 2022 18:19

Move TYPE to Parameters, Add tests, Remove old class

6ca706f

Big refactor, create tests

5e52f2d

Initial merge fix

811307c

Resolve merge conflicts fully

af08002

jmc-bbk marked this pull request as ready for review May 3, 2022 18:46

jmc-bbk changed the title ~~[WIP] Snowflake: Add Create File Format Segment, Enhance Parseable Grammar~~ Snowflake: Add Create File Format Segment, Enhance Parseable Grammar May 3, 2022

tunetheweb requested changes May 3, 2022

View reviewed changes

jmc-bbk added 2 commits May 5, 2022 15:44

Update Regex and String Parsers

602c96d

test fixtures

28bde92

jmc-bbk requested a review from tunetheweb May 5, 2022 16:39

jmc-bbk added 2 commits May 5, 2022 17:44

Remove commented out code

f9c3b2a

Remove artefact, ugh

88ea842

tunetheweb requested changes May 5, 2022

View reviewed changes

src/sqlfluff/dialects/dialect_snowflake.py Outdated Show resolved Hide resolved

src/sqlfluff/dialects/dialect_snowflake.py Outdated Show resolved Hide resolved

jmc-bbk and others added 3 commits May 5, 2022 18:13

Update src/sqlfluff/dialects/dialect_snowflake.py

4ea3da0

Co-authored-by: Barry Pollard <barry_pollard@hotmail.com>

Remove raw string

b3f2508

Merge branch 'snowflake-file-formats' of github.com:jmc-bbk/sqlfluff …

bbfcff1

…into snowflake-file-formats

jmc-bbk and others added 2 commits May 5, 2022 18:27

Generate yml fixtures

5ed2d89

Merge branch 'main' into snowflake-file-formats

96f4457

tunetheweb approved these changes May 5, 2022

View reviewed changes

tunetheweb changed the title ~~Snowflake: Add Create File Format Segment, Enhance Parseable Grammar~~ Snowflake: Add CREATE FILE FORMAT Support May 5, 2022

tunetheweb merged commit feb0aff into sqlfluff:main May 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snowflake: Add `CREATE FILE FORMAT` Support #3104

Snowflake: Add `CREATE FILE FORMAT` Support #3104

jmc-bbk commented Apr 15, 2022 •

edited

codecov bot commented Apr 15, 2022 •

edited

tunetheweb commented Apr 17, 2022

jmc-bbk commented Apr 26, 2022 •

edited

tunetheweb commented Apr 26, 2022

jmc-bbk commented Apr 30, 2022 •

edited

tunetheweb commented May 1, 2022

jmc-bbk commented May 3, 2022 •

edited

jmc-bbk commented May 5, 2022 •

edited

tunetheweb commented May 5, 2022

jmc-bbk commented May 5, 2022

Snowflake: Add CREATE FILE FORMAT Support #3104

Snowflake: Add CREATE FILE FORMAT Support #3104

Conversation

jmc-bbk commented Apr 15, 2022 • edited

To do

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

codecov bot commented Apr 15, 2022 • edited

Codecov Report

tunetheweb commented Apr 17, 2022

jmc-bbk commented Apr 26, 2022 • edited

tunetheweb commented Apr 26, 2022

jmc-bbk commented Apr 30, 2022 • edited

tunetheweb commented May 1, 2022

jmc-bbk commented May 3, 2022 • edited

jmc-bbk commented May 5, 2022 • edited

tunetheweb commented May 5, 2022

jmc-bbk commented May 5, 2022

Snowflake: Add `CREATE FILE FORMAT` Support #3104

Snowflake: Add `CREATE FILE FORMAT` Support #3104

jmc-bbk commented Apr 15, 2022 •

edited

codecov bot commented Apr 15, 2022 •

edited

jmc-bbk commented Apr 26, 2022 •

edited

jmc-bbk commented Apr 30, 2022 •

edited

jmc-bbk commented May 3, 2022 •

edited

jmc-bbk commented May 5, 2022 •

edited