Description
Is your feature request related to a problem or challenge?
arrow-rs is in the process of gaining support for Parquet modular encryption - see apache/arrow-rs#7278. It would be useful to be able to read and write encrypted Parquet files with DataFusion, but it's not clear how to integrate this feature due to the complex configuration required.
Examples of this complex configuration are:
- Users may require different encryption or decryption keys to be specified per Parquet file
- The encryption and decryption keys specified may depend on the file schema
- The encryption keys may need to be generated per file by interacting with a user's key management service (KMS)
- Decryption keys may need to be retrieved dynamically based on the metadata read from Parquet files and require interaction with a KMS. This process would be opaque to DataFusion, but requires the FileDecryptionProperties in arrow-rs to be created with a callback that can't be represented as a string configuration option ("Allow retrieving Parquet decryption keys based on the key metadata", apache/arrow-rs#7257).
I have an example of what using a KMS to read and write encrypted files might look like, but this isn't yet merged in arrow-rs: https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs
Currently all Parquet format options can be easily encoded as strings or primitive types, and live in datafusion-common, which has an optional dependency on the parquet crate, although TableParquetOptions is always defined even if the parquet feature is disabled.
We're experimenting with using encryption in DataFusion by adding encoded keys to the ParquetOptions struct, but this is quite limited and doesn't support the more complex configuration options I mention above.
Describe the solution you'd like
One solution might be to allow users to arbitrarily customize the Parquet writing and reading options, e.g. with something like:
````diff
--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -1615,6 +1615,12 @@ pub struct TableParquetOptions {
     /// )
     /// ```
     pub key_value_metadata: HashMap<String, Option<String>>,
+    /// Callback to modify the Parquet WriterPropertiesBuilder with custom configuration
+    #[cfg(feature = "parquet")]
+    pub writer_configuration: Option<Arc<dyn Fn(WriterPropertiesBuilder) -> WriterPropertiesBuilder>>,
+    /// Callback to modify the Parquet ArrowReaderOptions with custom configuration
+    #[cfg(feature = "parquet")]
+    pub read_configuration: Option<Arc<dyn Fn(ArrowReaderOptions) -> ArrowReaderOptions>>,
 }

 impl TableParquetOptions {
````
These callbacks would probably need some other inputs like the file schema too. This would allow DataFusion users to specify encryption specific options without DataFusion itself needing to know about them, and might be useful for applying other Parquet options that aren't already exposed in DataFusion. This also supports generating different encryption properties per file.
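For illustration, here is a rough sketch of how a user might populate these callbacks to enable encryption. This assumes the hypothetical fields from the diff above and the parquet crate's `encryption` feature; the builder APIs (`FileEncryptionProperties::builder`, `with_file_encryption_properties`, `with_file_decryption_properties`) are from arrow-rs 55 and may change:

```rust
use std::sync::Arc;

use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::encryption::encrypt::FileEncryptionProperties;

fn encrypted_table_options() -> TableParquetOptions {
    // 16-byte AES key, hard-coded for illustration only.
    let footer_key = b"0123456789012345".to_vec();

    let mut options = TableParquetOptions::default();

    // Writing: attach file encryption properties via the WriterPropertiesBuilder.
    let write_key = footer_key.clone();
    options.writer_configuration = Some(Arc::new(move |builder| {
        let encryption = FileEncryptionProperties::builder(write_key.clone())
            .build()
            .expect("valid encryption properties");
        builder.with_file_encryption_properties(encryption)
    }));

    // Reading: attach file decryption properties via the ArrowReaderOptions.
    options.read_configuration = Some(Arc::new(move |reader_options| {
        let decryption = FileDecryptionProperties::builder(footer_key.clone())
            .build()
            .expect("valid decryption properties");
        reader_options.with_file_decryption_properties(decryption)
    }));

    options
}
```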
TableParquetOptions can currently be created from environment variables, which wouldn't be possible for these extra fields, but I don't think that should be a problem?
Another minor issue is that TableParquetOptions implements PartialEq, and I don't think it would be possible to sanely implement equality while allowing custom callbacks like this.
Describe alternatives you've considered
@alamb also suggested in delta-io/delta-rs#3300 that it could be possible to use an Arc<dyn Any> to allow passing more complex configuration types through TableParquetOptions.
I'm not sure exactly what this would look like though. Maybe the option would still hold a callback function but just hidden behind the Any trait, or maybe we would want to limit this to encryption-specific configuration options, but I think we'd need to maintain the ability to generate ArrowReaderOptions and WriterProperties per file.
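A minimal sketch of the Arc<dyn Any> idea, assuming a hypothetical EncryptionConfig type and field name that DataFusion itself would not need to understand:

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical opaque configuration carried in TableParquetOptions as
// something like `pub encryption: Option<Arc<dyn Any + Send + Sync>>`.
struct EncryptionConfig {
    footer_key: Vec<u8>,
}

// Code that knows the concrete type (e.g. an encryption-aware Parquet
// source) downcasts the opaque value back before building the per-file
// ArrowReaderOptions or WriterProperties.
fn encryption_config(value: &Arc<dyn Any + Send + Sync>) -> Option<&EncryptionConfig> {
    value.downcast_ref::<EncryptionConfig>()
}
```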
Additional context
No response
Activity
adamreeve commented on Mar 20, 2025
I had a go at seeing if I could use this callback-based configuration approach to integrate with encryption without DataFusion needing to know anything about Parquet encryption.
I tested with Rok's current encryption branch, and tried both reading and writing encrypted files. I wrote an example that nearly works here: https://github.com/adamreeve/datafusion/blob/18d1aca67d0e090823ebd407dc26fef2158f26fd/datafusion-examples/examples/parquet_encryption.rs
One obvious downside is that this isn't compatible with conversion of plans to protobuf. I've just ignored the new config fields there, although ideally we would at least raise an error if they're set and we try to convert a plan to protobuf, but that might require adding a "parquet" feature to the protobuf crates.
I think this is fine though as long as you don't want to use this in a distributed query engine.
I could get writing of encrypted Parquet working, but reading fails when trying to infer the schema, as this uses a ParquetMetaDataReader (here) which doesn't know about the ArrowReaderOptions but instead uses FileDecryptionProperties directly (here).

If I pass the schema to register_listing_table, it still fails at the same place when fetching statistics during query execution.

I'm not sure how best to work around that. Maybe arrow-rs could be refactored to support reader options that aren't Arrow specific, which could be passed to the ParquetMetaDataReader?

Or maybe DataFusion could change how metadata is read and use ParquetObjectReader::get_metadata_with_options instead? I made a brief attempt at that but didn't get very far. There is a comment on fetch_parquet_metadata though that says "This component is a subject to change in near future"...
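For reference, a sketch of that second idea; this assumes arrow-rs 55's get_metadata_with_options method on the AsyncFileReader trait and ArrowReaderOptions::with_file_decryption_properties, both of which may change:

```rust
use std::sync::Arc;

use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetObjectReader};
use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Fetch metadata for an encrypted file with decryption options attached,
// rather than going through ParquetMetaDataReader without them.
async fn fetch_encrypted_metadata(
    mut reader: ParquetObjectReader,
    footer_key: Vec<u8>,
) -> Result<Arc<ParquetMetaData>> {
    let decryption = FileDecryptionProperties::builder(footer_key).build()?;
    let options = ArrowReaderOptions::new().with_file_decryption_properties(decryption);
    reader.get_metadata_with_options(&options).await
}
```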
corwinjoy commented on Mar 21, 2025
So, to play the devil's advocate, here are some arguments for having encryption configurations encoded as plain strings:
So, even though it may be harder to configure encryption from strings, if we can enable this logic it would buy us a lot of flexibility in how end users can access this feature. It also may not restrict us that much. In practice, most users want to use just a few KMS classes that connect to standard cloud APIs such as AWS or Azure, and we could provide a few pre-built classes as is done in Java.
A final idea could be to use the typetag crate: https://github.com/dtolnay/typetag

We would require that classes implementing the KMS trait be serializable. Then we can serialize them to strings or distribute them across hosts as needed. This constraint might even help us in the end.
@adamreeve
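A minimal sketch of the typetag idea, with hypothetical KmsClient trait and InMemoryKms type names; typetag tags the serialized form with the concrete type, so trait objects can round-trip through plain strings:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical KMS trait; implementations must be serde-serializable.
#[typetag::serde]
trait KmsClient: Send + Sync {
    fn unwrap_key(&self, wrapped: &[u8]) -> Vec<u8>;
}

// A toy implementation, for illustration only (not real key unwrapping).
#[derive(Serialize, Deserialize)]
struct InMemoryKms {
    master_key: Vec<u8>,
}

#[typetag::serde]
impl KmsClient for InMemoryKms {
    fn unwrap_key(&self, wrapped: &[u8]) -> Vec<u8> {
        wrapped
            .iter()
            .zip(self.master_key.iter().cycle())
            .map(|(w, k)| w ^ k)
            .collect()
    }
}

fn main() -> serde_json::Result<()> {
    let client: Box<dyn KmsClient> = Box::new(InMemoryKms { master_key: vec![0x42; 16] });
    // The serialized form records which concrete type to reconstruct,
    // so the KMS configuration can travel as a configuration string.
    let config: String = serde_json::to_string(&client)?;
    let _restored: Box<dyn KmsClient> = serde_json::from_str(&config)?;
    Ok(())
}
```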
alamb commented on Mar 24, 2025
FWIW I agree with @corwinjoy that having string-based configuration is likely needed in several scenarios.

Maybe we could do string-based settings for any SQL-based interaction and then something programmatic for any use case that needs more control.
corwinjoy commented on Apr 24, 2025
@alamb @adamreeve With the modular encryption essentially complete in arrow-rs, we are interested in beginning to move forward with adding support for this feature in DataFusion and implementing more concretely what this looks like. It looks like with the upgrade to arrow 55.0.0 (#15749) DataFusion can now handle the new code? So, if that is true, I think we are in a position to proceed to add this feature. It may take us a bit of time; we've had a bunch of new work come up, but we are still interested in seeing this through. Anyway, I just wanted to add a status update that we are still interested in this piece.
adamreeve commented on Apr 24, 2025
With the KMS API not being included in arrow-rs but being built as a third-party crate (apache/arrow-rs#7387 (comment)), I would assume we probably don't want to depend on that in DataFusion, but keep the encryption integration more flexible to allow other approaches?
I think it should still be possible to achieve this while mostly using string-based configuration, but the encryption configuration might need to be an opaque string or an arbitrary JSON object.
We could support configuring encryption keys directly by default without needing programmatic access, but allow registering factory methods that could take the configuration string and produce file decryption or encryption properties. I believe you've suggested something like this previously @corwinjoy.
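A rough sketch of that registration idea; all names here (EncryptionFactory, EncryptionFactoryRegistry) are assumptions for illustration, not proposed DataFusion APIs:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow_schema::SchemaRef;
use parquet::encryption::encrypt::FileEncryptionProperties;
use parquet::errors::Result;

// A factory takes the opaque configuration string plus per-file context
// (here just the schema) and produces encryption properties for that file.
type EncryptionFactory =
    Arc<dyn Fn(&str, &SchemaRef) -> Result<FileEncryptionProperties> + Send + Sync>;

// Users register named factories in code; string-based configuration then
// only needs to reference a factory by name plus its configuration string.
#[derive(Default)]
struct EncryptionFactoryRegistry {
    factories: HashMap<String, EncryptionFactory>,
}

impl EncryptionFactoryRegistry {
    fn register(&mut self, name: impl Into<String>, factory: EncryptionFactory) {
        self.factories.insert(name.into(), factory);
    }

    fn get(&self, name: &str) -> Option<&EncryptionFactory> {
        self.factories.get(name)
    }
}
```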
alamb commented on Apr 28, 2025
I think that is correct.
I agree the key question here will be "how to integrate / use a KMS system without making DataFusion dependent on such a system". As I think you have been discussing, the major challenge will be configuration.
I personally suggest using the Arc<dyn Any> approach, but I don't really understand the requirements enough to know.

What I suggest as a next step is to make a public example of reading encrypted Parquet files, and then try to integrate that example with the KMS -- to get the example working you'll very likely have to add some new APIs to DataFusion, but the example will drive that implementation.
alamb commented on Apr 28, 2025
I did this, for example, with the parquet index API: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
Basically I started writing the example and what I wanted to show, and then when I couldn't do it with the existing DataFusion APIs I updated or added some new ones.
Having the example also helped with review, because reviewers could see what was going on in the example and how the newly added APIs got used.
alamb commented on May 2, 2025
Here is how Spark does encryption configuration:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
adamreeve commented on May 5, 2025
My understanding of how this works in Spark, from reading this and looking at some of the code:
- A crypto factory class is specified with the spark.hadoop.parquet.crypto.factory.class config setting, and that class needs to implement EncryptionPropertiesFactory and/or DecryptionPropertiesFactory to generate file encryption or decryption properties as required. The class gets access to extra context like the file schema so it knows what columns to provide keys for (see the getFileEncryptionProperties method).
- There is a built-in PropertiesDrivenCryptoFactory class that implements EncryptionPropertiesFactory and DecryptionPropertiesFactory. This requires also specifying a KmsClient implementation with the spark.hadoop.parquet.encryption.kms.client.class key, and this class must be defined by users (only a mock InMemoryKMS class is provided for testing).
- Users can implement their own EncryptionPropertiesFactory and DecryptionPropertiesFactory if they don't want to use the KMS-based API, for example if they want to define AES keys directly.

Starting with similarly flexible EncryptionPropertiesFactory and DecryptionPropertiesFactory traits in DataFusion seems like a reasonable approach to me.
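For concreteness, a rough sketch of what such traits could look like in Rust; all names and signatures here are assumptions for illustration, not a settled design:

```rust
use arrow_schema::SchemaRef;
use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::encryption::encrypt::FileEncryptionProperties;
use parquet::errors::Result;

// Mirrors Spark's EncryptionPropertiesFactory: given per-file context such
// as the schema (so column keys can be matched to columns) plus an opaque
// configuration string, produce the encryption properties for one file.
pub trait EncryptionPropertiesFactory: Send + Sync {
    fn file_encryption_properties(
        &self,
        config: &str,
        file_schema: &SchemaRef,
    ) -> Result<FileEncryptionProperties>;
}

// Mirrors Spark's DecryptionPropertiesFactory; implementations might talk
// to a KMS to retrieve keys based on the configuration.
pub trait DecryptionPropertiesFactory: Send + Sync {
    fn file_decryption_properties(&self, config: &str) -> Result<FileDecryptionProperties>;
}
```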
I'm not that familiar with Java, but from what I understand it's straightforward to define your own KmsClient in a JAR and then include that at runtime so it's discoverable by the configuration mechanism. This approach doesn't really translate to Rust though. If any custom code is needed it will need to be compiled in, unless we use something like WebAssembly or an FFI, but that seems overly complicated and unnecessary. We could maintain some level of string-configurability by letting users statically register named implementations of traits in code and then reference these in configuration strings. Corwin mentioned the typetag crate that can automate this, or it could be more manual.

I don't really understand the reason for using Any rather than a trait like Arc<dyn EncryptionPropertiesFactory>. At some point an Any would need to be downcast to something that DataFusion understands for it to be usable, right? But I agree we should come up with an example of how we'd like this to work, and that should provide more clarity.
adamreeve commented on May 5, 2025

Actually I think I remember now that this would let us include structs in config types in datafusion::common::config without needing to expose details of the parquet and encryption implementation.

adamreeve commented on Jun 3, 2025
I've created a draft PR with an example of what integration with a KMS could look like: #16237
Any feedback would be much appreciated!