-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
Is your feature request related to a problem or challenge?
One of the common criticisms of parquet based query systems is that they don't have some particular type of index (e.g. HyperLogLog and more specialized / advanced structures)
I have written extensively about why these arguments are not compelling to me, for example: Accelerating Query Performance of Apache Parquet using Specialized Indexes: https://youtu.be/74YsJT1-Rdk
Here are relevant examples in datafusion of how to use such indexes:
- https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
- https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
However, both of those examples use "external indexes" -- the index is stored separately from the parquet file.
Manage the index information separately from the parquet file is likely more operationally complex (as you have to now keep 2 files in sync, for example) and this is sometimes cited (again!) as a reason we need a new file format. For example, here is a recent post to this effect from amudai:
https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md#extensible-metadata-and-hierarchical-statistics
Parquet lacks a standardized and extensible mechanism for augmenting data with index artifacts, most notably inverted term indexes for full-text search. While workarounds exist, such as maintaining indexes and their metadata outside of Parquet, these solutions quickly become complex, fragile, and difficult to manage
However there is no reason you can't add such an index inside a parquet file as well (though other readers will not know what to do it and will ignore it)
Describe the solution you'd like
I would like an example that shows how to write and read a specialized index inside a parquet file
Describe alternatives you've considered
Ideally I would love to see a full text inverted index stored in the parquet file but that might be too much for an example
Something simpler might be a "distinct values" type index. I think a good example might be:
- Read an existing parquet file, and compute distinct values (using a Datafusion plan perhaps) for one column
- Write a new parquet file that includes the index (write the index bytes to the file somewhere and then add custom key/value metadata to the parquet footer that references it)
- Show how to open the parquet file, read the footer metadata, use the custom metadata to find the special index, and decide it.
Basically something like this:
Example creating parquet file that
contains specialized indexes that are
ignored by other readers
┌──────────────────────┐
│┌───────────────────┐ │
││ DataPage │ │ Standard Parquet
│└───────────────────┘ │ Data / pages
│┌───────────────────┐ │
││ DataPage │ │
│└───────────────────┘ │
│ ... │
│ │
│┌───────────────────┐ │
││ DataPage │ │
│└───────────────────┘ │
│┏━━━━━━━━━━━━━━━━━━━┓ │
│┃ ┃ │ key/value metadata
│┃ Special Index ┃◀┼ ─ ─ that points at the
│┃ ┃ │ │ special index
│┗━━━━━━━━━━━━━━━━━━━┛ │
│╔═══════════════════╗ │ │
│║ ║ │
│║ Parquet Footer ║ │ │ Footer includes
│║ ║ ┼ ─ ─ ─ thrift-encoded
│║ ║ │ ParquetMetadata
│╚═══════════════════╝ │
└──────────────────────┘
Parquet File
Additional context
No response
Activity
zhuqi-lucas commentedon Jun 12, 2025
take
zhuqi-lucas commentedon Jun 12, 2025
I am interested in this, and i want to be familiar with embedding indexes.
alamb commentedon Jun 12, 2025
Nice @zhuqi-lucas -- BTW I am not sure how easy it will be to use the parquet APIs to do this (specifically write arbitrary bytes to the inner writer) so it may take some fiddling / using the lower level API / adding a new API
zhuqi-lucas commentedon Jun 12, 2025
Thank you @alamb, i will investigate and explore the APIs and see what’s possible.
adriangb commentedon Jun 12, 2025
Very excited about this!
zhuqi-lucas commentedon Jun 13, 2025
Thank you @alamb @adriangb , submit a simple example PR for review, i can add more examples follow-up:
#16395
zhuqi-lucas commentedon Jun 13, 2025
I am also preparing to cook a advanced_embedding_indexes example which covers page index later after the simple one merged.
Support write to buffer api for SerializedFileWriter (#7714)
JigaoLuo commentedon Jun 21, 2025
@alamb @zhuqi-lucas Thank you for this issue and the PR. This could significantly aid query processing on Parquet.
I was previously never aware of
key_value_metadata
and am grateful for the insight: today marks my first discovery of its presence in both ColumnMetaData and FileMetaData. Also @alamb's argument also reminded me of a paper from the German DB Conference: https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b0e-9e3f-94f39a9e7090/content for reference andat the end of Section 2.3 of it:
In my experience, there’s a widespread underappreciation for the configurability of Parquet files. Many practitioners default to blaming Parquet’s performance or feature limitations, such as HLL. This often leads to unfair comparisons with proprietary formats, which are fine-tuned and cherry-picked.
JigaoLuo commentedon Jul 5, 2025
Hi @zhuqi-lucas,
While proofreading the blog, I had one major general question: What are the limitations of such an embedded index?
I imagine other blog readers might have similar questions about the limitations—or the potential—of this embedded_index approach. If there are no strict limitations, then my follow-up discussion is: Could we potentially supercharge Parquet with techniques inspired by proprietary file formats? For example:
alamb commentedon Jul 5, 2025
No -- you could put as many indexes as you want (of course each new index will consume space in the file and add something to the metadata
No, it is possible to have indexes with whatever granularity you want (
Yes it is a good point -- we should make sure to point this out on the blog
If there are no strict limitations, then my follow-up discussion is: Could we potentially supercharge Parquet with techniques inspired by proprietary file formats? For example:
Absolutely! Maybe just for fun someone could cook up special indexes for each clickbench query and put it in the files -- and we could show some truly crazy speed
The point would not be that any of those indexes is actually general purpose, but that parquet lets you put whatever. you like in it
adriangb commentedon Jul 5, 2025
Index suggestion: a tablesample index.
And a general thought: exploring these sorts of indexes could do very cool stuff for DataFusion in general in terms of pushing us to develop good APIs to make use of things like stats in join optimization.
JigaoLuo commentedon Jul 5, 2025
Thanks! This gave me the impression of a kind of User-Defined Index. I can now imagine that users could embed arbitrary binary data into this section of Parquet. As long as the Parquet reader knows how to interpret that binary using a corresponding User-Defined Index Function, it could enable powerful capabilities, such as pruning & precomputed results for query processing, or even query optimization.
zhuqi-lucas commentedon Jul 6, 2025
Thank you @alamb @JigaoLuo @adriangb , i agree current example is the start, we can further add more advanced examples!
alamb commentedon Jul 6, 2025
I also made a PR to clarify the comments in the example
alamb commentedon Jul 6, 2025
I think this is a really good term -- I will update the blog post in apache/datafusion-site#79 to use that
zhuqi-lucas commentedon Jul 6, 2025
Thank you @alamb , a minor topic is i may pick up this:
Find a way to communicate the ordering of a file back
To use this user-defined index or parquet SortColumn metadata. So we can restore the sort info in a better way.