Add an example of embedding indexes *inside* a parquet file

### Is your feature request related to a problem or challenge?

One of the common criticisms of parquet based query systems is that they don't have some particular type of index (e.g. HyperLogLog and more specialized / advanced structures)

I have written extensively about why these arguments are not compelling to me, for example: Accelerating Query Performance of Apache Parquet using Specialized Indexes: https://youtu.be/74YsJT1-Rdk

Here are relevant examples in datafusion of how to use such indexes:

* https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
* https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs

However, both of those examples use "external indexes" -- the index is stored separately from the parquet file. 

Manage the index information separately from the parquet file is likely more operationally complex (as you have to now keep 2 files in sync, for example) and this is sometimes cited (again!) as a reason we need a new file format. For example, here is a recent post to this effect from amudai: 
https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md#extensible-metadata-and-hierarchical-statistics

> Parquet lacks a standardized and extensible mechanism for augmenting data with index artifacts, most notably inverted term indexes for full-text search. While workarounds exist, such as maintaining indexes and their metadata outside of Parquet, these solutions quickly become complex, fragile, and difficult to manage

However there is no reason you can't add such an index *inside* a parquet file as well (though other readers will not know what to do it and will ignore it)


### Describe the solution you'd like

I would like an example that shows how to write and read a specialized index *inside* a parquet file



### Describe alternatives you've considered

Ideally I would love to see a full text inverted index stored in the parquet file but that might be too much for an example

Something simpler might be a "distinct values" type index. I think a good example might be:

1. Read an existing parquet file, and compute distinct values (using a Datafusion plan perhaps) for one column
2. Write a new parquet file that includes the index (write the index bytes to the file somewhere and then add [custom key/value metadata](https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_key_value_metadata) to the parquet footer that references it)
3. Show how to open the parquet file, read the footer metadata, use the custom metadata to find the special index, and decide it. 

Basically something like this:

```text
    Example creating parquet file that                      
  contains specialized indexes that are                     
         ignored by other readers                           
                                                            
                                                            
                                                            
         ┌──────────────────────┐                           
         │┌───────────────────┐ │                           
         ││     DataPage      │ │      Standard Parquet     
         │└───────────────────┘ │      Data / pages         
         │┌───────────────────┐ │                           
         ││     DataPage      │ │                           
         │└───────────────────┘ │                           
         │        ...           │                           
         │                      │                           
         │┌───────────────────┐ │                           
         ││     DataPage      │ │                           
         │└───────────────────┘ │                           
         │┏━━━━━━━━━━━━━━━━━━━┓ │                           
         │┃                   ┃ │        key/value metadata 
         │┃   Special Index   ┃◀┼ ─ ─    that points at the 
         │┃                   ┃ │     │  special index      
         │┗━━━━━━━━━━━━━━━━━━━┛ │                           
         │╔═══════════════════╗ │     │                     
         │║                   ║ │                           
         │║  Parquet Footer   ║ │     │  Footer includes    
         │║                   ║ ┼ ─ ─ ─  thrift-encoded     
         │║                   ║ │        ParquetMetadata    
         │╚═══════════════════╝ │                           
         └──────────────────────┘                           
                                                            
               Parquet File                                 
```



### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add an example of embedding indexes inside a parquet file #16374

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

9 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Add an example of embedding indexes *inside* a parquet file #16374

Description

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Activity

zhuqi-lucas commented on Jun 12, 2025

zhuqi-lucas commented on Jun 12, 2025

alamb commented on Jun 12, 2025

zhuqi-lucas commented on Jun 12, 2025

adriangb commented on Jun 12, 2025

zhuqi-lucas commented on Jun 13, 2025

zhuqi-lucas commented on Jun 13, 2025

JigaoLuo commented on Jun 21, 2025

9 remaining items

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions

Add an example of embedding indexes inside a parquet file #16374