Add Vortex file metadata segments by gatesn · Pull Request #7954 · vortex-data/vortex

gatesn · 2026-05-15T19:56:09Z

Summary

Adds keyed user-defined metadata segments to Vortex files. Writers can attach ByteBuffer metadata by string key, and readers can retrieve those buffers from VortexFile or Footer after opening a file.

The segment locations are stored in the postscript alongside the existing dtype, layout, statistics, and footer segment specs. The footer deserializer includes metadata segments in its initial footer read window, so large metadata payloads are read back correctly even when they sit before the footer flatbuffers.

No existing issue was found for this change, so this PR does not use a Fixes #... trailer.

Validation

cargo +nightly fmt --all
./scripts/public-api.sh
cargo clippy --all-targets --all-features
cargo test -p vortex-file
git diff --check

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

codspeed-hq · 2026-05-15T20:06:30Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1219 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`chunked_varbinview_canonical_into[(100, 100)]`	307.8 µs	273 µs	+12.72%
❌	Simulation	`chunked_varbinview_opt_canonical_into[(1000, 10)]`	187.6 µs	225 µs	-16.6%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing ngates/file-metadata-segments (eb72ffd) with develop (d71d3d3)}

gatesn · 2026-05-15T20:11:23Z

@claude review this

github-actions · 2026-05-15T20:11:44Z

Claude finished @gatesn's task in 3m 45s —— View job

PR Review: [codex] Add Vortex file metadata segments

Overall this is a clean, well-structured feature. The FlatBuffers schema design is backward-compatible, the round-trip test is solid, and the alignment/padding logic in write_buffer is correct. The three self-review concerns you flagged are the main design-level questions worth resolving before merge. I've also spotted a few smaller issues below.

Design / API

1. Eager loading in the initial read window (your own comment on deserializer.rs:126)

In deserializer.rs, all metadata segments are unconditionally folded into read_more_offset:

for metadata in &postscript.metadata {
    read_more_offset = read_more_offset.min(metadata.segment.offset);
}

For callers that never touch metadata, this inflates the initial read window and pays the I/O cost of fetching every metadata buffer on every file open. This is fine for small metadata (the primary motivation seems to be "sidecar data for the file"), but it becomes a tax for large payloads.

Your suggested include_metadata flag on VortexOpenOptions / FooterDeserializer is a good solution, though lazy/on-demand fetching (returning a separate segment location to fetch) would be even more composable long-term. At minimum an opt-out flag should be present before this ships as a stable API.

Fix this →

2. Slice vs. HashMap for storage (your own comment on footer/mod.rs:48)

Arc<[(String, ByteBuffer)]> preserves insertion order (which may matter if callers iterate metadata), but metadata_segment(key) is then O(n) per lookup. For small metadata counts (the typical case) the linear scan is negligible. If the API contract is that metadata keys are unordered, switching to Arc<HashMap<String, ByteBuffer>> gives O(1) lookups. The current choice is defensible; worth making explicit in the doc comment so callers know the complexity.

3. metadata_segments() return type (your own comment on footer/mod.rs:155)

Returning &[(String, ByteBuffer)] exposes the internal storage layout. An impl Iterator<Item = (&str, &ByteBuffer)> (or a named wrapper type) would decouple the API from the representation and make a future switch to HashMap non-breaking. With the current API, changing the internal type would be a public API break. Fix this →

Correctness

4. O(n²) duplicate-key check in Postscript::read_flatbuffer

// postscript.rs:87-93
for (idx, entry) in metadata.iter().enumerate() {
    if metadata[..idx].iter().any(|prev| prev.key == entry.key) {
        return Err(...);
    }
}

This is quadratic in the number of metadata entries. Since the postscript size is bounded by MAX_POSTSCRIPT_SIZE = 65527 bytes the worst-case n is small, but a HashSet is cleaner and documents intent better:

let mut seen = std::collections::HashSet::with_capacity(metadata.len());
for entry in &metadata {
    if !seen.insert(&entry.key) {
        return Err(vortex_err!("Postscript contains duplicate metadata key {}", entry.key));
    }
}

Fix this →

5. No validation of empty keys

An empty-string key "" satisfies the FlatBuffers (required) constraint but is semantically ambiguous. Consider adding a check in read_flatbuffer and in with_metadata_segment (writer.rs) to reject empty keys early. Fix this →

6. Missing key() returns an empty string on required-field absence

The generated accessor fb.key() calls .unwrap() internally (because key is (required)) and would panic on a malformed buffer. The FlatBuffers verifier in read_flatbuffer_bytes → root::<Postscript>() does verify required fields (visit_field(..., true)), so in practice the unwrap is unreachable after successful verification. No code change needed, but this is worth a short comment explaining the invariant.

Testing

7. Duplicate-key rejection has no test

The guard at postscript.rs:87-93 has no test. Since it's a format invariant (a well-formed file won't trigger it, only a malformed one), it's low risk — but adding a small test that manually assembles a duplicate-key postscript would give confidence the error path works. A #[test] in tests.rs constructing a Postscript with repeated keys and calling write_flatbuffer_bytes / read_flatbuffer_bytes would cover this.

8. Large metadata test exercises the NeedMoreData path ✓

The test_file_metadata_segments_roundtrip test intentionally writes a large buffer of MAX_POSTSCRIPT_SIZE + 1024 bytes to force the deserializer's second-read path. This is a good check.

Documentation

9. File format diagram in lib.rs is missing the Statistics flatbuffer

The illustrated diagram shows "Per-Column Statistics" (layout-embedded stats) and skips to "User Metadata Segments", but there is a separate File Statistics Flatbuffer segment emitted by FooterSerializer::serialize() between layout data and the postscript. This was missing before this PR and is a pre-existing gap, but since the diagram was touched in this PR it's a good moment to add:

├────────────────────────────┤
│  File Statistics Flatbuffer│
├────────────────────────────┤
│  User Metadata Segments    │

Performance regression (CodSpeed)

The 12.25% regression on encode_varbin[(1000, 2)] is almost certainly a measurement artifact — this PR doesn't touch array encoding code at all, and CodSpeed itself flags both "Unknown Walltime execution environment" and "Different runtime environments detected" for this benchmark. The other 1,216 benchmarks are untouched. Safe to dismiss.

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

Add Vortex file metadata segments

2d1c59b

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

gatesn commented May 15, 2026

View reviewed changes

Comment thread vortex-file/src/footer/deserializer.rs Outdated

Comment thread vortex-file/src/footer/mod.rs Outdated

Comment thread vortex-file/src/footer/mod.rs Outdated

gatesn changed the title ~~[codex] Add Vortex file metadata segments~~ Add Vortex file metadata segments May 15, 2026

gatesn added the changelog/feature A new feature label May 15, 2026

Address metadata segment review feedback

c228cba

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

gatesn marked this pull request as ready for review May 15, 2026 21:08

gatesn enabled auto-merge (squash) May 15, 2026 21:08

gatesn commented May 15, 2026

View reviewed changes

Comment thread vortex-file/src/footer/deserializer.rs Outdated

gatesn added 3 commits May 15, 2026 22:13

Avoid copying footer metadata slices

12c0c43

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

Limit file metadata postscript entries

35fa27d

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

Measure metadata keys in bytes

eb72ffd

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

gatesn requested a review from lwwmanning May 15, 2026 22:18

gatesn added the tldr label May 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Vortex file metadata segments#7954

Add Vortex file metadata segments#7954
gatesn wants to merge 5 commits into
developfrom
ngates/file-metadata-segments

gatesn commented May 15, 2026

Uh oh!

codspeed-hq Bot commented May 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gatesn commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gatesn commented May 15, 2026

Summary

Validation

Uh oh!

codspeed-hq Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Performance Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gatesn commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: [codex] Add Vortex file metadata segments

Design / API

Correctness

Testing

Documentation

Performance regression (CodSpeed)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codspeed-hq Bot commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading