Description
Is your feature request related to a problem or challenge?
arrow-rs is in the process of gaining support for Parquet modular encryption - see apache/arrow-rs#7278. It would be useful to be able to read and write encrypted Parquet files with DataFusion, but it's not clear how to integrate this feature due to the complex configuration required.
Examples of this complex configuration are:
- Users may require different encryption or decryption keys to be specified per Parquet file
- The encryption and decryption keys specified may depend on the file schema
- The encryption keys may need to be generated per file by interacting with a user's key management service (KMS)
- Decryption keys may need to be retrieved dynamically based on the metadata read from Parquet files and require interaction with a KMS. This process would be opaque to DataFusion, but requires the FileDecryptionProperties in arrow-rs to be created with a callback that can't be represented as a string configuration option ("Allow retrieving Parquet decryption keys based on the key metadata", apache/arrow-rs#7257).
I have an example of what using a KMS to read and write encrypted files might look like, but this isn't yet merged in arrow-rs: https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs
Currently all Parquet format options can be easily encoded as strings or primitive types, and live in datafusion-common, which has an optional dependency on the parquet crate, although TableParquetOptions is always defined even if the parquet feature is disabled.
We're experimenting with using encryption in DataFusion by adding encoded keys to the ParquetOptions struct, but this is quite limited and doesn't support the more complex configuration options I mention above.
Describe the solution you'd like
One solution might be to allow users to arbitrarily customize the Parquet writing and reading options, e.g. with something like:
````diff
--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -1615,6 +1615,12 @@ pub struct TableParquetOptions {
     /// )
     /// ```
     pub key_value_metadata: HashMap<String, Option<String>>,
+    /// Callback to modify the Parquet WriterPropertiesBuilder with custom configuration
+    #[cfg(feature = "parquet")]
+    pub writer_configuration: Option<Arc<dyn Fn(WriterPropertiesBuilder) -> WriterPropertiesBuilder>>,
+    /// Callback to modify the Parquet ArrowReaderOptions with custom configuration
+    #[cfg(feature = "parquet")]
+    pub read_configuration: Option<Arc<dyn Fn(ArrowReaderOptions) -> ArrowReaderOptions>>,
 }

 impl TableParquetOptions {
````
These callbacks would probably need some other inputs like the file schema too. This would allow DataFusion users to specify encryption specific options without DataFusion itself needing to know about them, and might be useful for applying other Parquet options that aren't already exposed in DataFusion. This also supports generating different encryption properties per file.
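For illustration, here is a rough sketch of how a user might populate these callbacks to enable encryption. This assumes the hypothetical fields from the diff above and the parquet crate's `encryption` feature; the builder APIs (`FileEncryptionProperties::builder`, `with_file_encryption_properties`, `with_file_decryption_properties`) are from arrow-rs 55 and may change:

```rust
use std::sync::Arc;

use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::encryption::encrypt::FileEncryptionProperties;

fn encrypted_table_options() -> TableParquetOptions {
    // 16-byte AES key, hard-coded for illustration only.
    let footer_key = b"0123456789012345".to_vec();

    let mut options = TableParquetOptions::default();

    // Writing: attach file encryption properties via the WriterPropertiesBuilder.
    let write_key = footer_key.clone();
    options.writer_configuration = Some(Arc::new(move |builder| {
        let encryption = FileEncryptionProperties::builder(write_key.clone())
            .build()
            .expect("valid encryption properties");
        builder.with_file_encryption_properties(encryption)
    }));

    // Reading: attach file decryption properties via the ArrowReaderOptions.
    options.read_configuration = Some(Arc::new(move |reader_options| {
        let decryption = FileDecryptionProperties::builder(footer_key.clone())
            .build()
            .expect("valid decryption properties");
        reader_options.with_file_decryption_properties(decryption)
    }));

    options
}
```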
TableParquetOptions can currently be created from environment variables, which wouldn't be possible for these extra fields, but I don't think that should be a problem?
Another minor issue is that TableParquetOptions implements PartialEq, and I don't think it would be possible to sanely implement equality while allowing custom callbacks like this.
Describe alternatives you've considered
@alamb also suggested in delta-io/delta-rs#3300 that it could be possible to use an Arc<dyn Any> to allow passing more complex configuration types through TableParquetOptions.
I'm not sure exactly what this would look like though. Maybe the option would still hold a callback function but just hidden behind the Any trait, or maybe we would want to limit this to encryption-specific configuration options, but I think we'd need to maintain the ability to generate ArrowReaderOptions and WriterProperties per file.
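A minimal sketch of the Arc<dyn Any> idea, assuming a hypothetical EncryptionConfig type and field name that DataFusion itself would not need to understand:

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical opaque configuration carried in TableParquetOptions as
// something like `pub encryption: Option<Arc<dyn Any + Send + Sync>>`.
struct EncryptionConfig {
    footer_key: Vec<u8>,
}

// Code that knows the concrete type (e.g. an encryption-aware Parquet
// source) downcasts the opaque value back before building the per-file
// ArrowReaderOptions or WriterProperties.
fn encryption_config(value: &Arc<dyn Any + Send + Sync>) -> Option<&EncryptionConfig> {
    value.downcast_ref::<EncryptionConfig>()
}
```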
Additional context
No response
Activity
adamreeve commented on Mar 20, 2025
I had a go at seeing if I could use this callback-based configuration approach to integrate with encryption without DataFusion needing to know anything about Parquet encryption.
I tested with Rok's current encryption branch, and tried both reading and writing encrypted files. I wrote an example that nearly works here: https://github.com/adamreeve/datafusion/blob/18d1aca67d0e090823ebd407dc26fef2158f26fd/datafusion-examples/examples/parquet_encryption.rs
One obvious downside is that this isn't compatible with conversion of plans to protobuf. I've just ignored the new config fields there, although ideally we would at least raise an error if they're set and we try to convert a plan to protobuf, but that might require adding a "parquet" feature to the protobuf crates.
I think this is fine though as long as you don't want to use this in a distributed query engine.
I could get writing of encrypted Parquet working, but reading fails when trying to infer the schema, as this uses a ParquetMetaDataReader (here) which doesn't know about the ArrowReaderOptions but instead uses FileDecryptionProperties directly (here).

If I pass the schema to register_listing_table, it still fails at the same place when fetching statistics during query execution.

I'm not sure how best to work around that. Maybe arrow-rs could be refactored to support reader options that aren't Arrow specific, which could be passed to the ParquetMetaDataReader?

Or maybe DataFusion could change how metadata is read and use ParquetObjectReader::get_metadata_with_options instead? I made a brief attempt at that but didn't get very far. There is a comment on fetch_parquet_metadata though that says "This component is a subject to change in near future"...
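For reference, a sketch of that second idea; this assumes arrow-rs 55's get_metadata_with_options method on the AsyncFileReader trait and ArrowReaderOptions::with_file_decryption_properties, both of which may change:

```rust
use std::sync::Arc;

use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetObjectReader};
use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Fetch metadata for an encrypted file with decryption options attached,
// rather than going through ParquetMetaDataReader without them.
async fn fetch_encrypted_metadata(
    mut reader: ParquetObjectReader,
    footer_key: Vec<u8>,
) -> Result<Arc<ParquetMetaData>> {
    let decryption = FileDecryptionProperties::builder(footer_key).build()?;
    let options = ArrowReaderOptions::new().with_file_decryption_properties(decryption);
    reader.get_metadata_with_options(&options).await
}
```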
corwinjoy commented on Mar 21, 2025
So, to play the devil's advocate, here are some arguments for having encryption configurations encoded as plain strings:
So, even though it may be harder to configure encryption from strings, if we can enable this logic it would buy us a lot of flexibility in how end users can access this feature. It also may not restrict us that much. In practice, most users want to use just a few KMS classes that connect to standard cloud APIs such as AWS or Azure, and we could provide a few pre-built classes as is done in Java.
A final idea could be to use the typetag crate: https://github.com/dtolnay/typetag

We would require that classes implementing the KMS trait be serializable. Then we can serialize them to strings or distribute them across hosts as needed. This constraint might even help us in the end.
@adamreeve
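A minimal sketch of the typetag idea, with hypothetical KmsClient trait and InMemoryKms type names; typetag tags the serialized form with the concrete type, so trait objects can round-trip through plain strings:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical KMS trait; implementations must be serde-serializable.
#[typetag::serde]
trait KmsClient: Send + Sync {
    fn unwrap_key(&self, wrapped: &[u8]) -> Vec<u8>;
}

// A toy implementation, for illustration only (not real key unwrapping).
#[derive(Serialize, Deserialize)]
struct InMemoryKms {
    master_key: Vec<u8>,
}

#[typetag::serde]
impl KmsClient for InMemoryKms {
    fn unwrap_key(&self, wrapped: &[u8]) -> Vec<u8> {
        wrapped
            .iter()
            .zip(self.master_key.iter().cycle())
            .map(|(w, k)| w ^ k)
            .collect()
    }
}

fn main() -> serde_json::Result<()> {
    let client: Box<dyn KmsClient> = Box::new(InMemoryKms { master_key: vec![0x42; 16] });
    // The serialized form records which concrete type to reconstruct,
    // so the KMS configuration can travel as a configuration string.
    let config: String = serde_json::to_string(&client)?;
    let _restored: Box<dyn KmsClient> = serde_json::from_str(&config)?;
    Ok(())
}
```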
alamb commented on Mar 24, 2025
FWIW I agree with @corwinjoy that having string-based configuration is likely needed in several scenarios.

Maybe we could do string-based settings for any SQL-based interaction and then something programmatic for any use case that needs more control.
corwinjoy commented on Apr 24, 2025
@alamb @adamreeve With the modular encryption essentially complete in arrow-rs, we are interested in beginning to move forward with adding support for this feature in DataFusion and implementing more concretely what this looks like. It looks like with the upgrade to arrow 55.0.0 (#15749) DataFusion can now handle the new code? So, if that is true, I think we are in a position to proceed to add this feature. It may take us a bit of time; we've had a bunch of new work come up, but we are still interested in seeing this through. Anyway, I just wanted to add a status update that we are still interested in this piece.
adamreeve commented on Apr 24, 2025
With the KMS API not being included in arrow-rs but being built as a third-party crate (apache/arrow-rs#7387 (comment)), I would assume we probably don't want to depend on that in DataFusion, but keep the encryption integration more flexible to allow other approaches?
I think it should still be possible to achieve this while mostly using string-based configuration, but the encryption configuration might need to be an opaque string or an arbitrary JSON object.
We could support configuring encryption keys directly by default without needing programmatic access, but allow registering factory methods that could take the configuration string and produce file decryption or encryption properties. I believe you've suggested something like this previously @corwinjoy.
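A rough sketch of that registration idea; all names here (EncryptionFactory, EncryptionFactoryRegistry) are assumptions for illustration, not proposed DataFusion APIs:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow_schema::SchemaRef;
use parquet::encryption::encrypt::FileEncryptionProperties;
use parquet::errors::Result;

// A factory takes the opaque configuration string plus per-file context
// (here just the schema) and produces encryption properties for that file.
type EncryptionFactory =
    Arc<dyn Fn(&str, &SchemaRef) -> Result<FileEncryptionProperties> + Send + Sync>;

// Users register named factories in code; string-based configuration then
// only needs to reference a factory by name plus its configuration string.
#[derive(Default)]
struct EncryptionFactoryRegistry {
    factories: HashMap<String, EncryptionFactory>,
}

impl EncryptionFactoryRegistry {
    fn register(&mut self, name: impl Into<String>, factory: EncryptionFactory) {
        self.factories.insert(name.into(), factory);
    }

    fn get(&self, name: &str) -> Option<&EncryptionFactory> {
        self.factories.get(name)
    }
}
```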
alamb commented on Apr 28, 2025
I think that is correct.
I agree the key question here will be "how to integrate / use a KMS system without making DataFusion dependent on such a system". As I think you have been discussing, the major challenge will be configuration.
I personally suggest using the Arc<dyn Any> approach, but I don't really understand the requirements enough to know.

What I suggest as a next step is to make a public example of reading encrypted Parquet files, and then try to integrate that example with the KMS -- to get the example working you'll very likely have to add some new APIs to DataFusion, but the example will drive that implementation.
alamb commented on Apr 28, 2025
I did this, for example, with the parquet index API: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
Basically I started writing the example and what I wanted to show, and then when I couldn't do it with the existing DataFusion APIs I updated or added some new ones.
Having the example also helped with review, because reviewers could see what was going on in the example and how the newly added APIs got used.
alamb commented on May 2, 2025
Here is how Spark does encryption configuration:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
adamreeve commented on May 5, 2025
My understanding of how this works in Spark, from reading this and looking at some of the code:
- A crypto factory class is specified with the spark.hadoop.parquet.crypto.factory.class config setting, and that class needs to implement EncryptionPropertiesFactory and/or DecryptionPropertiesFactory to generate file encryption or decryption properties as required. The class gets access to extra context like the file schema so it knows what columns to provide keys for (see the getFileEncryptionProperties method).
- There is a built-in PropertiesDrivenCryptoFactory class that implements EncryptionPropertiesFactory and DecryptionPropertiesFactory. This requires also specifying a KmsClient implementation with the spark.hadoop.parquet.encryption.kms.client.class key, and this class must be defined by users (only a mock InMemoryKMS class is provided for testing).
- Users can implement their own EncryptionPropertiesFactory and DecryptionPropertiesFactory if they don't want to use the KMS-based API, for example if they want to define AES keys directly.

Starting with similarly flexible EncryptionPropertiesFactory and DecryptionPropertiesFactory traits in DataFusion seems like a reasonable approach to me.
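For concreteness, a rough sketch of what such traits could look like in Rust; all names and signatures here are assumptions for illustration, not a settled design:

```rust
use arrow_schema::SchemaRef;
use parquet::encryption::decrypt::FileDecryptionProperties;
use parquet::encryption::encrypt::FileEncryptionProperties;
use parquet::errors::Result;

// Mirrors Spark's EncryptionPropertiesFactory: given per-file context such
// as the schema (so column keys can be matched to columns) plus an opaque
// configuration string, produce the encryption properties for one file.
pub trait EncryptionPropertiesFactory: Send + Sync {
    fn file_encryption_properties(
        &self,
        config: &str,
        file_schema: &SchemaRef,
    ) -> Result<FileEncryptionProperties>;
}

// Mirrors Spark's DecryptionPropertiesFactory; implementations might talk
// to a KMS to retrieve keys based on the configuration.
pub trait DecryptionPropertiesFactory: Send + Sync {
    fn file_decryption_properties(&self, config: &str) -> Result<FileDecryptionProperties>;
}
```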
I'm not that familiar with Java, but from what I understand it's straightforward to define your own KmsClient in a JAR and then include that at runtime so it's discoverable by the configuration mechanism. This approach doesn't really translate to Rust though. If any custom code is needed it will need to be compiled in, unless we use something like WebAssembly or an FFI, but that seems overly complicated and unnecessary. We could maintain some level of string-configurability by letting users statically register named implementations of traits in code and then reference these in configuration strings. Corwin mentioned the typetag crate that can automate this, or it could be more manual.

I don't really understand the reason for using Any rather than a trait like Arc<dyn EncryptionPropertiesFactory>. At some point an Any would need to be downcast to something that DataFusion understands for it to be usable, right? But I agree we should come up with an example of how we'd like this to work, and that should provide more clarity.
adamreeve commented on May 5, 2025

Actually I think I remember now that this would let us include structs in config types in datafusion::common::config without needing to expose details of the parquet and encryption implementation.

adamreeve commented on Jun 3, 2025
I've created a draft PR with an example of what integration with a KMS could look like: #16237
Any feedback would be much appreciated!