Skip to content

Add Vortex file metadata segments#7954

Open
gatesn wants to merge 5 commits into
developfrom
ngates/file-metadata-segments
Open

Add Vortex file metadata segments#7954
gatesn wants to merge 5 commits into
developfrom
ngates/file-metadata-segments

Conversation

@gatesn
Copy link
Copy Markdown
Contributor

@gatesn gatesn commented May 15, 2026

Summary

Adds keyed user-defined metadata segments to Vortex files. Writers can attach ByteBuffer metadata by string key, and readers can retrieve those buffers from VortexFile or Footer after opening a file.

The segment locations are stored in the postscript alongside the existing dtype, layout, statistics, and footer segment specs. The footer deserializer includes metadata segments in its initial footer read window, so large metadata payloads are read back correctly even when they sit before the footer flatbuffers.

No existing issue was found for this change, so this PR does not use a Fixes #... trailer.

Validation

  • cargo +nightly fmt --all
  • ./scripts/public-api.sh
  • cargo clippy --all-targets --all-features
  • cargo test -p vortex-file
  • git diff --check

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 15, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1219 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_varbinview_canonical_into[(100, 100)] 307.8 µs 273 µs +12.72%
Simulation chunked_varbinview_opt_canonical_into[(1000, 10)] 187.6 µs 225 µs -16.6%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ngates/file-metadata-segments (eb72ffd) with develop (d71d3d3)

Open in CodSpeed

Comment thread vortex-file/src/footer/deserializer.rs Outdated
Comment thread vortex-file/src/footer/mod.rs Outdated
Comment thread vortex-file/src/footer/mod.rs Outdated
@gatesn
Copy link
Copy Markdown
Contributor Author

gatesn commented May 15, 2026

@claude review this

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

Claude finished @gatesn's task in 3m 45s —— View job


PR Review: [codex] Add Vortex file metadata segments

  • Reading changed files
  • Reviewing flatbuffers schema changes
  • Reviewing serializer / postscript changes
  • Reviewing deserializer changes
  • Reviewing public API & writer
  • Reviewing tests
  • Checking performance regression
  • Final summary

Overall this is a clean, well-structured feature. The FlatBuffers schema design is backward-compatible, the round-trip test is solid, and the alignment/padding logic in write_buffer is correct. The three self-review concerns you flagged are the main design-level questions worth resolving before merge. I've also spotted a few smaller issues below.

Design / API

1. Eager loading in the initial read window (your own comment on deserializer.rs:126)

In deserializer.rs, all metadata segments are unconditionally folded into read_more_offset:

for metadata in &postscript.metadata {
    read_more_offset = read_more_offset.min(metadata.segment.offset);
}

For callers that never touch metadata, this inflates the initial read window and pays the I/O cost of fetching every metadata buffer on every file open. This is fine for small metadata (the primary motivation seems to be "sidecar data for the file"), but it becomes a tax for large payloads.

Your suggested include_metadata flag on VortexOpenOptions / FooterDeserializer is a good solution, though lazy/on-demand fetching (returning a separate segment location to fetch) would be even more composable long-term. At minimum an opt-out flag should be present before this ships as a stable API.

Fix this →

2. Slice vs. HashMap for storage (your own comment on footer/mod.rs:48)

Arc<[(String, ByteBuffer)]> preserves insertion order (which may matter if callers iterate metadata), but metadata_segment(key) is then O(n) per lookup. For small metadata counts (the typical case) the linear scan is negligible. If the API contract is that metadata keys are unordered, switching to Arc<HashMap<String, ByteBuffer>> gives O(1) lookups. The current choice is defensible; worth making explicit in the doc comment so callers know the complexity.

3. metadata_segments() return type (your own comment on footer/mod.rs:155)

Returning &[(String, ByteBuffer)] exposes the internal storage layout. An impl Iterator<Item = (&str, &ByteBuffer)> (or a named wrapper type) would decouple the API from the representation and make a future switch to HashMap non-breaking. With the current API, changing the internal type would be a public API break. Fix this →

Correctness

4. O(n²) duplicate-key check in Postscript::read_flatbuffer

// postscript.rs:87-93
for (idx, entry) in metadata.iter().enumerate() {
    if metadata[..idx].iter().any(|prev| prev.key == entry.key) {
        return Err(...);
    }
}

This is quadratic in the number of metadata entries. Since the postscript size is bounded by MAX_POSTSCRIPT_SIZE = 65527 bytes the worst-case n is small, but a HashSet is cleaner and documents intent better:

let mut seen = std::collections::HashSet::with_capacity(metadata.len());
for entry in &metadata {
    if !seen.insert(&entry.key) {
        return Err(vortex_err!("Postscript contains duplicate metadata key {}", entry.key));
    }
}

Fix this →

5. No validation of empty keys

An empty-string key "" satisfies the FlatBuffers (required) constraint but is semantically ambiguous. Consider adding a check in read_flatbuffer and in with_metadata_segment (writer.rs) to reject empty keys early. Fix this →

6. Missing key() returns an empty string on required-field absence

The generated accessor fb.key() calls .unwrap() internally (because key is (required)) and would panic on a malformed buffer. The FlatBuffers verifier in read_flatbuffer_bytesroot::<Postscript>() does verify required fields (visit_field(..., true)), so in practice the unwrap is unreachable after successful verification. No code change needed, but this is worth a short comment explaining the invariant.

Testing

7. Duplicate-key rejection has no test

The guard at postscript.rs:87-93 has no test. Since it's a format invariant (a well-formed file won't trigger it, only a malformed one), it's low risk — but adding a small test that manually assembles a duplicate-key postscript would give confidence the error path works. A #[test] in tests.rs constructing a Postscript with repeated keys and calling write_flatbuffer_bytes / read_flatbuffer_bytes would cover this.

8. Large metadata test exercises the NeedMoreData path ✓

The test_file_metadata_segments_roundtrip test intentionally writes a large buffer of MAX_POSTSCRIPT_SIZE + 1024 bytes to force the deserializer's second-read path. This is a good check.

Documentation

9. File format diagram in lib.rs is missing the Statistics flatbuffer

The illustrated diagram shows "Per-Column Statistics" (layout-embedded stats) and skips to "User Metadata Segments", but there is a separate File Statistics Flatbuffer segment emitted by FooterSerializer::serialize() between layout data and the postscript. This was missing before this PR and is a pre-existing gap, but since the diagram was touched in this PR it's a good moment to add:

├────────────────────────────┤
│  File Statistics Flatbuffer│
├────────────────────────────┤
│  User Metadata Segments    │

Performance regression (CodSpeed)

The 12.25% regression on encode_varbin[(1000, 2)] is almost certainly a measurement artifact — this PR doesn't touch array encoding code at all, and CodSpeed itself flags both "Unknown Walltime execution environment" and "Different runtime environments detected" for this benchmark. The other 1,216 benchmarks are untouched. Safe to dismiss.


@gatesn gatesn changed the title [codex] Add Vortex file metadata segments Add Vortex file metadata segments May 15, 2026
@gatesn gatesn added the changelog/feature A new feature label May 15, 2026
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
@gatesn gatesn marked this pull request as ready for review May 15, 2026 21:08
@gatesn gatesn enabled auto-merge (squash) May 15, 2026 21:08
Comment thread vortex-file/src/footer/deserializer.rs Outdated
gatesn added 3 commits May 15, 2026 22:13
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
@gatesn gatesn requested a review from lwwmanning May 15, 2026 22:18
@gatesn gatesn added the tldr label May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant