
Add an example of embedding indexes *inside* a parquet file #16374

Description

@alamb (Contributor)

Is your feature request related to a problem or challenge?

One of the common criticisms of Parquet-based query systems is that they lack some particular type of index (e.g. HyperLogLog and other more specialized / advanced structures).

I have written extensively about why these arguments are not compelling to me, for example: Accelerating Query Performance of Apache Parquet using Specialized Indexes: https://youtu.be/74YsJT1-Rdk

Here are relevant examples in DataFusion of how to use such indexes:

However, both of those examples use "external indexes" -- the index is stored separately from the parquet file.

Managing the index information separately from the Parquet file is likely more operationally complex (you now have to keep two files in sync, for example), and this is sometimes cited (again!) as a reason we need a new file format. For example, here is a recent post to this effect from amudai:
https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md#extensible-metadata-and-hierarchical-statistics

Parquet lacks a standardized and extensible mechanism for augmenting data with index artifacts, most notably inverted term indexes for full-text search. While workarounds exist, such as maintaining indexes and their metadata outside of Parquet, these solutions quickly become complex, fragile, and difficult to manage.

However, there is no reason you can't add such an index inside a Parquet file as well (though other readers will not know what to do with it and will simply ignore it).

Describe the solution you'd like

I would like an example that shows how to write and read a specialized index inside a Parquet file.

Describe alternatives you've considered

Ideally I would love to see a full-text inverted index stored in the Parquet file, but that might be too much for an example.

Something simpler might be a "distinct values" type index. I think a good example might be:

  1. Read an existing parquet file, and compute the distinct values (perhaps using a DataFusion plan) for one column
  2. Write a new parquet file that includes the index (write the index bytes to the file somewhere and then add custom key/value metadata to the parquet footer that references it)
  3. Show how to open the parquet file, read the footer metadata, use the custom metadata to find the special index, and decode it.

Basically something like this:

    Example creating parquet file that                      
  contains specialized indexes that are                     
         ignored by other readers                           
                                                            
                                                            
                                                            
         ┌──────────────────────┐                           
         │┌───────────────────┐ │                           
         ││     DataPage      │ │      Standard Parquet     
         │└───────────────────┘ │      Data / pages         
         │┌───────────────────┐ │                           
         ││     DataPage      │ │                           
         │└───────────────────┘ │                           
         │        ...           │                           
         │                      │                           
         │┌───────────────────┐ │                           
         ││     DataPage      │ │                           
         │└───────────────────┘ │                           
         │┏━━━━━━━━━━━━━━━━━━━┓ │                           
         │┃                   ┃ │        key/value metadata 
         │┃   Special Index   ┃◀┼ ─ ─    that points at the 
         │┃                   ┃ │     │  special index      
         │┗━━━━━━━━━━━━━━━━━━━┛ │                           
         │╔═══════════════════╗ │     │                     
         │║                   ║ │                           
         │║  Parquet Footer   ║ │     │  Footer includes    
         │║                   ║ ┼ ─ ─ ─  thrift-encoded     
         │║                   ║ │        ParquetMetadata    
         │╚═══════════════════╝ │                           
         └──────────────────────┘                           
                                                            
               Parquet File                                 
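The three steps above can be sketched in miniature. This is a hypothetical illustration (the variable names, JSON encoding, and metadata keys such as `distinct_index_offset` are invented for this sketch, not a DataFusion or parquet-rs API): compute the distinct values, serialize them as an index blob to be written after the data pages, and record its location as custom key/value entries destined for the Parquet footer.

```python
import json

# Step 1: compute the distinct values of one column (here, in plain Python;
# in the real example this would come from a DataFusion plan).
column = ["a", "b", "a", "c", "b", "a"]
distinct_values = sorted(set(column))

# Step 2: serialize the index as a blob to be written into the file
# somewhere after the data pages (JSON is just a stand-in encoding).
index_bytes = json.dumps(distinct_values).encode("utf-8")

# Step 3: custom key/value metadata for the Parquet footer that points at
# the index; readers that don't understand these keys simply ignore them.
data_pages_len = 1024  # pretend the data pages occupy the first 1024 bytes
kv_metadata = {
    "distinct_index_offset": str(data_pages_len),
    "distinct_index_length": str(len(index_bytes)),
}

print(distinct_values)  # → ['a', 'b', 'c']
```

A reader that knows the keys can seek to `distinct_index_offset`, read `distinct_index_length` bytes, and decode the index; any other reader sees only unrecognized footer metadata.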

Additional context

No response

Activity

zhuqi-lucas (Contributor) commented on Jun 12, 2025

take

zhuqi-lucas (Contributor) commented on Jun 12, 2025

I am interested in this, and I want to become familiar with embedding indexes.

alamb (Contributor, Author) commented on Jun 12, 2025

Nice @zhuqi-lucas -- BTW I am not sure how easy it will be to use the parquet APIs to do this (specifically, writing arbitrary bytes to the inner writer), so it may take some fiddling / using the lower-level API / adding a new API.
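Whatever API route is taken, the extra index bytes have to land before the footer without disturbing the documented Parquet trailer layout: the file ends with the thrift-encoded footer, a 4-byte little-endian footer length, and the magic `PAR1`. A minimal sketch of locating the footer from that trailer (using a placeholder blob in place of real thrift metadata, so this is illustrative only):

```python
import struct

# Stand-in bytes for the thrift-encoded ParquetMetadata (not real thrift).
fake_footer = b"\x15\x02\x19\x3c"

# Trailer layout per the Parquet format: footer, 4-byte LE length, magic.
trailer = fake_footer + struct.pack("<I", len(fake_footer)) + b"PAR1"

def footer_bytes(tail: bytes) -> bytes:
    """Recover the footer bytes from the tail of a Parquet file."""
    assert tail[-4:] == b"PAR1", "missing Parquet magic"
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    return tail[-8 - footer_len : -8]

print(footer_bytes(trailer) == fake_footer)  # → True
```

Because the footer is always found by walking backwards from the end, arbitrary bytes written between the last data page and the footer are invisible to readers that don't look for them, which is exactly what the embedded-index scheme relies on.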

zhuqi-lucas (Contributor) commented on Jun 12, 2025

Thank you @alamb, I will investigate and explore the APIs to see what's possible.

adriangb (Contributor) commented on Jun 12, 2025

Very excited about this!

zhuqi-lucas (Contributor) commented on Jun 13, 2025

Thank you @alamb @adriangb, I have submitted a simple example PR for review; I can add more examples in follow-ups:

#16395

zhuqi-lucas (Contributor) commented on Jun 13, 2025

I am also preparing an advanced_embedding_indexes example that covers the page index, to follow after the simple one is merged.

JigaoLuo commented on Jun 21, 2025

@alamb @zhuqi-lucas Thank you for this issue and the PR. This could significantly aid query processing on Parquet.

I was previously unaware of key_value_metadata and am grateful for the insight: today marks my first discovery of its presence in both ColumnMetaData and FileMetaData. @alamb's argument also reminded me of a paper from the German DB conference: https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b0e-9e3f-94f39a9e7090/content

For reference, at the end of its Section 2.3:

The only statistics available in Parquet files are the cardinality of the contained dataset and each page's minimum and maximum values. Unfortunately, the minimum and maximum values are optional fields, so Parquet writers are not forced to use them. ... These minimum and maximum values, as well as the cardinality of the datasets, are the only sources available for performing cardinality estimates. Therefore, we get imprecise results since we do not know how the data is distributed within the given boundaries. As a consequence, we get erroneous cardinality estimates and suboptimal query plans.

... This shows how crucial a good cardinality estimate is for a Parquet scan to be an acceptable alternative to database relations. The Parquet scan cannot get close to the execution times of database relations as long as the query optimizer cannot choose the same query plans for the Parquet files.

In my experience, there is a widespread underappreciation of how configurable Parquet files are. Many practitioners default to blaming Parquet's performance or feature limitations (such as the lack of HLL). This often leads to unfair comparisons with proprietary formats, which are fine-tuned and cherry-picked.
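The quoted point about imprecise estimates can be made concrete with a toy calculation (all numbers here are invented for illustration): given only a page's min and max, an optimizer can at best assume a uniform distribution, which badly misestimates skewed data.

```python
# A skewed column: 90% of values are 1, 10% are 100.
data = [1] * 90 + [100] * 10
lo, hi = min(data), max(data)  # the only per-page statistics available

# Selectivity estimate for the predicate `x < 10` under the uniform
# assumption, versus the true selectivity on the skewed data.
threshold = 10
est = (threshold - lo) / (hi - lo)                  # uniform-distribution guess
actual = sum(x < threshold for x in data) / len(data)

print(round(est, 2), actual)  # → 0.09 0.9
```

The uniform assumption predicts the predicate matches ~9% of rows when it really matches 90%, a tenfold error of exactly the kind that leads to the suboptimal query plans the paper describes, and that richer embedded statistics could avoid.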


Metadata

Labels

enhancement (New feature or request)

Participants

@alamb, @adriangb, @zhuqi-lucas, @JigaoLuo
