Description
Is your feature request related to a problem or challenge?
This is an idea that @robtandy brought up on the DataFusion sync call the other day and I think it would be pretty useful.
The use case is "I want to read more than one, but not an entire directory of, parquet files from a remote object store" -- I think in this case to look at some particular files.
For example, let's say you want to read just these two files:
- s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
- s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_2.parquet
There is currently no way to do so via SQL. You can either read the entire directory:
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/' options (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 2.928 seconds.
Or you can read each file separately:
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet' options (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 1.017 seconds.
Describe the solution you'd like
I would like to be able to read an arbitrary set of remote parquet files
It would also be awesome to support GLOB patterns (e.g. *), which has been requested before.
Describe alternatives you've considered
I suggest we implement a TableFunction similar to the DuckDB read_parquet function, ONLY in the datafusion-cli (source).
So to query the files listed above, this would look like:
SELECT * FROM read_parquet([
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_2.parquet'
]);
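For context, DataFusion's Rust DataFrame API already accepts multiple paths (DataFilePaths is implemented for a Vec of strings), so the request is essentially to expose the same capability through SQL. A minimal sketch, assuming an object store for the URLs' scheme has already been registered with the context:

use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Remote schemes (s3://, https://) resolve only if a matching
    // object store was registered with `ctx` beforehand (omitted here).
    let df = ctx
        .read_parquet(
            vec![
                "https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet",
                "https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_2.parquet",
            ],
            ParquetReadOptions::default(),
        )
        .await?;
    df.show_limit(5).await?;
    Ok(())
}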
From the duckdb docs: https://duckdb.org/docs/stable/data/parquet/overview.html
-- read file1, file2, file3
SELECT *
FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
-- Support GLOB access
SELECT *
FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
We already support the parquet_metadata function in datafusion-cli (docs):
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
Here is the code implementation, datafusion/datafusion-cli/src/functions.rs, line 322 at 85f6621:
impl TableFunctionImpl for ParquetMetadataFunc {
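A read_parquet table function could follow the same pattern. Below is a hedged skeleton (ReadParquetFunc is a hypothetical name; it assumes a DataFusion version where Expr::Literal is a single-field variant, and, for simplicity, one string literal per file rather than DuckDB's list syntax):

use std::sync::Arc;
use datafusion::catalog::{TableFunctionImpl, TableProvider};
use datafusion::common::{plan_err, ScalarValue};
use datafusion::error::Result;
use datafusion::logical_expr::Expr;

// Hypothetical `read_parquet` table function (sketch only).
#[derive(Debug)]
struct ReadParquetFunc;

impl TableFunctionImpl for ReadParquetFunc {
    fn call(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>> {
        // Collect one path or glob per string-literal argument, e.g.
        // read_parquet('s3://bucket/a.parquet', 's3://bucket/b.parquet')
        let mut paths = Vec::with_capacity(args.len());
        for arg in args {
            match arg {
                Expr::Literal(ScalarValue::Utf8(Some(s))) => paths.push(s.clone()),
                other => return plan_err!("read_parquet expects string literals, got {other}"),
            }
        }
        // The hard part: building the provider needs the file schema, and
        // schema inference is async for remote stores (see discussion below).
        todo!("construct a ListingTable over `paths`")
    }
}

It would be registered with SessionContext::register_udtf, the same way datafusion-cli registers parquet_metadata.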
We can also look at the ClickBench S3 command that is similar: https://clickhouse.com/docs/integrations/s3
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames');
Open questions
What to do if the files are on different object stores (e.g. S3 and http):
SELECT * FROM read_parquet([
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
-- note a different object store
's3://public-datasets/hits_compatible/athena_partitioned/hits_2.parquet'
]);
At first, I suggest we don't try to support this.
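One conservative option is to validate up front that every path shares a scheme and authority, and error otherwise. A minimal sketch using the url crate (validate_same_store is a hypothetical helper, not an existing DataFusion API):

use url::Url;

// Hypothetical check: reject file lists that span multiple object stores.
fn validate_same_store(paths: &[String]) -> Result<(), String> {
    let mut first: Option<(String, String)> = None;
    for p in paths {
        let url = Url::parse(p).map_err(|e| format!("invalid URL {p}: {e}"))?;
        // Identify the store by scheme + authority, e.g. ("s3", "bucket")
        let key = (url.scheme().to_string(), url.authority().to_string());
        match &first {
            None => first = Some(key),
            Some(f) if *f != key => {
                return Err(format!(
                    "all files must be on the same object store: {f:?} vs {key:?}"
                ))
            }
            _ => {}
        }
    }
    Ok(())
}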
Additional context
No response
Activity
comphead commented on Jun 6, 2025
That has actually been on my backlog for a couple of months. It would be nice to support an array of files or globs.
comphead commented on Jun 6, 2025
I'll try to take it on in 2 weeks if no one else beats me to it.
alamb commented on Jun 6, 2025
Maybe @robtandy could help
a-agmon commented on Jun 7, 2025
@alamb - I'm less familiar with this area in datafusion but might be able to give this a shot.
The idea is to add this as a table function, right?
I can see that ListingTableUrl::parse supports glob strings, so does it make sense to simply implement this as a listing table?

a-agmon commented on Jun 8, 2025
I gave it a shot but it ended up being somewhat messy. That's mostly due to the fact that, on the one hand, TableFunctionImpl::call() is synchronous, yet, on the other hand, it also has to get hold of the schema of the data, which in the case of remote blobs (like S3) requires IO and async to be done right.
I was trying to work around this by using the call() method to create a TableProvider that initially reports an empty schema. This satisfies the planner's synchronous API. The actual schema discovery is deferred until the scan() method is called during the asynchronous execution phase. But this creates an issue with projections that require schema validation, i.e., select X from read_csv(some-glob-pattern), though select * from read_csv(some-glob-pattern) will work.
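For illustration, a stripped-down version of that workaround might look like the following (DeferredSchemaProvider is a hypothetical name; the signatures assume a recent DataFusion where TableProvider::scan takes &dyn Session):

use std::any::Any;
use std::sync::Arc;
use async_trait::async_trait;
use datafusion::arrow::datatypes::{Schema, SchemaRef};
use datafusion::catalog::Session;
use datafusion::datasource::TableProvider;
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableType};
use datafusion::physical_plan::ExecutionPlan;

// Hypothetical provider that defers schema discovery to scan().
#[derive(Debug)]
struct DeferredSchemaProvider {
    paths: Vec<String>,
}

#[async_trait]
impl TableProvider for DeferredSchemaProvider {
    fn as_any(&self) -> &dyn Any {
        self
    }

    // The planner sees an empty schema, which is why `select x from ...`
    // fails projection validation while `select * from ...` happens to work.
    fn schema(&self) -> SchemaRef {
        Arc::new(Schema::empty())
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        _state: &dyn Session,
        _projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        _limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Now that we are in an async context, the real schema can be
        // inferred (e.g. by building a ListingTable over self.paths)
        // and the scan delegated to it.
        todo!("resolve {} paths and build the real plan", self.paths.len())
    }
}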
alamb commented on Jun 8, 2025
Thanks @a-agmon -- maybe this example would help: https://docs.rs/datafusion/latest/datafusion/catalog/trait.AsyncSchemaProvider.html
I agree the trick will be figuring out how to make the async calls.
alamb commented on Jun 8, 2025
Yes, this is what I would expect -- that the result of calling read_parquet is / uses the ListingTable implementation.
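To make that concrete, here is a hedged sketch of building a ListingTable over several URLs (including globs) in an async context; listing_table_for is a hypothetical helper, and it assumes ListingTableConfig::new_with_multi_paths plus an already-registered object store:

use std::sync::Arc;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::error::Result;
use datafusion::execution::context::SessionState;

// Async, so it cannot be called directly from TableFunctionImpl::call().
async fn listing_table_for(state: &SessionState, paths: &[&str]) -> Result<Arc<ListingTable>> {
    let urls = paths
        .iter()
        .map(|p| ListingTableUrl::parse(p)) // also accepts glob patterns
        .collect::<Result<Vec<_>>>()?;
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()));
    let config = ListingTableConfig::new_with_multi_paths(urls)
        .with_listing_options(options)
        .infer_schema(state) // async: may issue object store requests
        .await?;
    Ok(Arc::new(ListingTable::try_new(config)?))
}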
a-agmon commented on Jun 8, 2025
I have added a draft PR for this. Would be happy for your comments.
a-agmon commented on Jun 9, 2025
Hi @alamb,
@comphead raised a couple of good questions about the PR, so I'm linking it here to hear your thoughts.
#16332 (comment)
a-agmon commented on Jun 12, 2025
Hi @comphead and @alamb
I thought it might be a good idea to split this issue into several PRs:
1 - add support for using CREATE TABLE syntax with glob patterns and remote URL schemes, just as with local ones (the new PR above tried to handle this).
2 - add table functions (read_parquet, read_csv, etc.) to support glob reading (I'm working on your comments regarding this one).
Hope this makes sense, feel free to comment also if not...
alamb commented on Jun 13, 2025
I agree a few smaller, focused PRs make sense.
Thank you for working on this
robtandy commented on Jun 13, 2025
Thanks for creating this issue @alamb !!
Regarding the location of the code, if it is in datafusion proper rather than the CLI, it would be available in datafusion python, and any other projects that want to offer functionality backed by datafusion. I think it increases the utility of datafusion as a library and will get used.
Is it possible to have a configuration option for whether to enable it? Like how datafusion.catalog.information_schema enables the info schema in the SessionState? I do understand that it will be more code to maintain, but my intuition is that this is generally useful enough to offer within the core, as I think it will provide value. It's possible, though, that I don't fully appreciate the ramifications of this choice.
Curious what people think about this.
alamb commented on Jun 16, 2025
That is an interesting idea -- I agree that having CREATE EXTERNAL TABLE support this kind of URL / multiple files would be useful.
I am not sure I fully understand the ramifications either -- if we simply update the SQL planner (SqlToRel) to split the URL list on ' ' or ', ', that certainly seems straightforward to me (and would be backwards compatible...)
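For illustration, that planner-side change could be as small as a hypothetical helper like this (split_location is not an existing DataFusion function):

// Split a LOCATION string such as
//   's3://bucket/a.parquet, s3://bucket/b.parquet'
// into individual table paths.
fn split_location(location: &str) -> Vec<String> {
    location
        .split([',', ' '])
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect()
}

Each resulting path could then be handled the same way a single LOCATION is today.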