Skip to content

[codex] Implement native tabular cache path#464

Merged
k82cn merged 2 commits into
xflops:mainfrom
k82cn:flm_318_3
May 19, 2026
Merged

[codex] Implement native tabular cache path#464
k82cn merged 2 commits into
xflops:mainfrom
k82cn:flm_318_3

Conversation

@k82cn
Copy link
Copy Markdown
Contributor

@k82cn k82cn commented May 19, 2026

Summary

  • Add native Arrow table cache payload support in the Rust object cache and disk storage.
  • Update FlamePy to cache PyArrow tables/batches and optional pandas/polars DataFrames directly instead of wrapping them as opaque bytes.
  • Refresh the RFE318 cache design notes and add Rust, Python unit, and E2E coverage for the native path.

Impact

DataSet/DataFrame-style cache writes can now preserve Arrow schemas and record batches as first-class cache payloads while keeping the existing opaque object path, ObjectRef shape, and legacy cache file compatibility.

Validation

  • cargo check -p flame-object-cache
  • cargo test -p flame-object-cache
  • cargo clippy -p flame-object-cache --all-targets -- -D warnings
  • cargo fmt --all -- --check
  • sdk/python/.venv/bin/python -m pytest sdk/python/tests/test_cache.py
  • sdk/python/.venv/bin/ruff check sdk/python/src/flamepy/core/cache.py sdk/python/tests/test_cache.py e2e/tests/test_cache.py
  • sdk/python/.venv/bin/ruff format --check sdk/python/src/flamepy/core/cache.py sdk/python/tests/test_cache.py e2e/tests/test_cache.py
  • git diff --check

Note

The new E2E tests were added but not run against a live local Flame cache here because the local flame.yaml lacks a current-context.

@k82cn k82cn marked this pull request as ready for review May 19, 2026 01:32
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a native direct cache path for tabular data (PyArrow, pandas, and polars) to avoid the overhead of cloudpickle and opaque binary wrapping. Key changes include a new ObjectPayload enum in Rust to distinguish between opaque and native Arrow table payloads, updated storage logic to persist original Arrow schemas with reserved metadata, and a Python-side classifier to detect tabular objects. Feedback highlights an efficiency concern in get_flight_info where a full disk load is triggered just to retrieve a schema, and a reliability issue in the Python SDK where pa.ipc.new_file should use a context manager to ensure proper file closure during exceptions.

Comment thread object_cache/src/cache.rs Outdated
Comment on lines +1188 to +1196
match self.cache.get(&object_key).await {
Ok(object) => match object.payload {
ObjectPayload::ArrowTable { schema, .. } => {
Bytes::from(encode_schema(&schema)?)
}
ObjectPayload::Opaque(_) => Bytes::new(),
},
Err(_) => Bytes::new(),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling self.cache.get(&object_key).await inside get_flight_info can be inefficient for large native tables that are not currently resident in memory. This operation triggers a full load of all record batches from disk just to extract the schema. Consider implementing a lighter-weight storage operation that only retrieves the schema from the Arrow IPC file header.

Comment thread sdk/python/src/flamepy/core/cache.py Outdated
Comment on lines 882 to 885
writer = pa.ipc.new_file(object_path, payload.schema)
for batch in payload.batches:
writer.write_batch(batch)
writer.close()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The pa.ipc.RecordBatchFileWriter (returned by pa.ipc.new_file) should be used as a context manager to ensure the file is properly closed and the footer is written even if an exception occurs during write_batch. While writer.close() is called at the end of the block, it won't be reached if an error occurs in the loop.

Suggested change
writer = pa.ipc.new_file(object_path, payload.schema)
for batch in payload.batches:
writer.write_batch(batch)
writer.close()
with pa.ipc.new_file(object_path, payload.schema) as writer:
for batch in payload.batches:
writer.write_batch(batch)

@k82cn k82cn merged commit 644a414 into xflops:main May 19, 2026
6 checks passed
@k82cn k82cn deleted the flm_318_3 branch May 19, 2026 02:11
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 89.75904% with 17 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
object_cache/src/cache.rs 88.54% 11 Missing ⚠️
object_cache/src/storage/disk.rs 94.11% 4 Missing ⚠️
object_cache/src/storage/none.rs 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant