
COPY TO align with CREATE EXTERNAL TABLE #10

Closed
wants to merge 8 commits

Conversation

metesynnada
Collaborator

Which issue does this PR close?

Closes apache#9369.

Rationale for this change

Changing

OPTIONS (
    format X,
    X.foo.bar baz
)

into

STORED AS X
OPTIONS (
    format.foo.bar baz
)

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@metesynnada
Collaborator Author

@ozankabak If the syntax meets the requirements, I will submit a PR to the upstream repository.

@ozankabak
Collaborator

Looks good

STORED AS PARQUET
OPTIONS (
'format.compression' snappy,
'parquet.bloom_filter_ndv' 100
)

It would be awesome if we could avoid having to repeat the "format." prefix for each option

Note this change may conflict with the changes from @tinfoil-knight in apache#9594

Collaborator

The challenge here is that an option may refer to a non-format setting too. Think about something like:

STORED AS PARQUET
OPTIONS (
  format.foo bar
  format.fizz buzz
  credentials.username admin
)

With the aligned/unified syntax here, all format entries refer to parquet and there is no repetition of parquet. But we repeat format instead, and this prefix is necessary as it separates format-related and non-format-related options.

As a next step, a consistent and generalizable way to remove repetitive patterns would be a syntax like:

STORED AS PARQUET
OPTIONS format (
  foo bar
  hey ho
)
OPTIONS other.prefix (
  fizz buzz
  paul atreides
)
OPTIONS (
  credentials.username admin
)

where multiple OPTIONS clauses are allowed, each with an optional prefix to avoid repetition. If an option is specified multiple times with different values, we'd generate an error. The above would be equivalent to

STORED AS PARQUET
OPTIONS (
  format.foo bar
  format.hey ho
  other.prefix.fizz buzz
  other.prefix.paul atreides
  credentials.username admin
)

where the user is explicitly repeating prefixes.
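The prefix expansion and the error-on-conflict rule described above can be sketched in Python (a hypothetical illustration only; DataFusion itself is implemented in Rust, and all names here are made up):

```python
def expand_options(option_groups):
    """Expand prefixed OPTIONS groups into flat dotted keys.

    option_groups is a list of (prefix, {key: value}) pairs, one pair per
    OPTIONS clause. An empty prefix models a bare OPTIONS ( ... ) clause.
    Raises ValueError when the same full key gets two different values.
    """
    flat = {}
    for prefix, opts in option_groups:
        for key, value in opts.items():
            full_key = f"{prefix}.{key}" if prefix else key
            if full_key in flat and flat[full_key] != value:
                raise ValueError(f"conflicting values for option '{full_key}'")
            flat[full_key] = value
    return flat

# The three OPTIONS clauses from the example above:
groups = [
    ("format", {"foo": "bar", "hey": "ho"}),
    ("other.prefix", {"fizz": "buzz", "paul": "atreides"}),
    ("", {"credentials.username": "admin"}),
]
expanded = expand_options(groups)
```

Each prefixed clause contributes its prefix to every key inside it, so the result is exactly the explicitly repeated form shown below.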


Got it -- I agree that being able to explicitly namespace out the options is valuable to disambiguate them

I think as a user it would be really nice to not have to worry about such collisions unless I actually had to disambiguate them

Like in my mind, this statement isn't ambiguous

STORED AS PARQUET
OPTIONS (
  compression snappy
)

Because the only possible "compression" option is on the format.

I think the way @tinfoil-knight handles this in apache#9594 is to internally try and resolve compression as though it were format.compression (or parquet.compression) if the compression configuration isn't valid.
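That fallback resolution could look something like the following Python sketch (purely hypothetical names and option tables; the real resolution in apache#9594 lives in DataFusion's Rust config code):

```python
# Hypothetical option tables; the real sets live in DataFusion's config code.
KNOWN_GLOBAL_OPTIONS = {"credentials.username", "credentials.password"}
KNOWN_FORMAT_OPTIONS = {"compression", "bloom_filter_ndv"}  # e.g. parquet keys

def resolve_option(key):
    """Resolve a possibly-unprefixed OPTIONS key to its full dotted name.

    Already-prefixed keys pass through unchanged; a bare key is tried as
    a format option (compression -> format.compression); anything else
    is rejected as unknown.
    """
    if "." in key:
        return key
    if key in KNOWN_FORMAT_OPTIONS:
        return f"format.{key}"
    raise ValueError(f"unknown option '{key}'")
```

With this, a bare `compression` resolves unambiguously to `format.compression`, while fully qualified keys such as `credentials.username` are untouched.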

Even something like this is not ambiguous (though I realize actually implementing this may not be feasible without much more work):

STORED AS PARQUET
LOCATION 's3://....'
OPTIONS (
  compression snappy
  access_key paul
  secret_key atreides
)

> As a next step, a consistent and generalizable way to remove repetitive patterns would be a syntax like:

That is definitely neat as well 🤔

TL;DR: I think it would be OK to introduce the syntax in this PR, but then I would likely propose (as a follow-on PR) some smarter configuration namespace resolution that doesn't require explicitly typing out the namespace each time

Collaborator

Sounds good. Let's get the "base" syntax in and then we can progressively optimize shortcuts for cases when there is no ambiguity.


@alamb alamb left a comment


FWIW I think this is a PR into the synnada fork, not the apache repo (not sure if that is the intent)

I have one question about a test change, but otherwise this PR looks good to me.

I would like to track the "make the syntax require less repetition" work discussed in https://github.com/synnada-ai/datafusion-upstream/pull/10/files#r1523875240 as a new ticket (or maybe add it to the existing one in apache#9575)

@@ -299,7 +299,7 @@ select * from validate_parquet_with_options;

# Copy from table to single file
query IT
-COPY source_table to 'test_files/scratch/copy/table.parquet' STORED AS PARQUET;
+COPY source_table to 'test_files/scratch/copy/table.parquet';

❤️

@@ -152,7 +152,7 @@ FileSinkExec: sink=ParquetSink(file_groups=[])
--MemoryExec: partitions=1, partition_sizes=[1]

# Error case
-query error DataFusion error: SQL error: ParserError\("Missing STORED AS clause in COPY statement"\)
+query error DataFusion error: Invalid or Unsupported Configuration: Format not explicitly set and unable to get file extension! Use STORED AS to define file format.

👍

@@ -23,7 +23,7 @@
# create.sql came from
# https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql
# Data file made with DuckDB:
-# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET);
+# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet';

I think this (comment) should not be updated, as it is a DuckDB comment

CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION 'test_files/scratch/copy/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

## Until the partition by parsing uses ColumnDef, this test is meaningless since it becomes an overfit. Even in

I don't understand this comment or why this test doesn't work anymore. Maybe it just needs to be updated?

Collaborator Author

Before this pull request, there was a problem with how escape characters in the PARTITIONED BY clause were processed in both the CREATE EXTERNAL TABLE and COPY TO statements. The issue stemmed from converting the Value to a string too early in the parsing process, leading to errors.

Although it appears that the test passes, in reality, during the execution of CREATE EXTERNAL TABLE, the specified field names do not match as expected, indicating a silent error. This issue should be addressed separately, as it falls outside the scope of the current PR.

I propose that we standardize the approach to handling column names in both the schema definitions and the PARTITIONED BY clauses. Once we establish a consistent method, the relevance of this test can be reassessed.

I will proceed to open an issue to address it.
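As a rough illustration of the quoting problem (a hypothetical Python sketch, not the actual Rust parser code): stringifying the parsed Value too early keeps the surrounding quotes, so the partition column name no longer matches the schema's field name.

```python
def unquote_ident(raw):
    """Strip one level of surrounding double quotes and unescape doubled
    quotes, the way a SQL parser should normalize a quoted identifier."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1].replace('""', '"')
    return raw

schema_field = "'test2'"       # field name as stored in the schema
raw_token = "\"'test2'\""      # token text from PARTITIONED BY ("'test2'")

# Premature stringification keeps the quotes and silently fails to match:
assert raw_token != schema_field
# Proper normalization matches:
assert unquote_ident(raw_token) == schema_field
```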


Filed apache#9714

@metesynnada
Collaborator Author

We usually scan the PRs internally first, then submit an upstream PR. The upstream PR was already opened at apache#9604. You can review it on upstream as well.

@alamb

alamb commented Mar 15, 2024

> We usually scan the PRs internally first, then submit an upstream PR. The upstream PR was already opened at apache#9604. You can review it on upstream as well.

Got it. Sorry about my confusion.

@ozankabak ozankabak closed this Mar 18, 2024
@ozankabak
Collaborator

Merged upstream.
