[codex] Implement native tabular cache path#464
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a native direct cache path for tabular data (PyArrow, pandas, and polars) to avoid the overhead of cloudpickle and opaque binary wrapping. Key changes include a new ObjectPayload enum in Rust to distinguish between opaque and native Arrow table payloads, updated storage logic to persist original Arrow schemas with reserved metadata, and a Python-side classifier to detect tabular objects. Feedback highlights an efficiency concern in get_flight_info where a full disk load is triggered just to retrieve a schema, and a reliability issue in the Python SDK where pa.ipc.new_file should use a context manager to ensure proper file closure during exceptions.
| match self.cache.get(&object_key).await { | ||
| Ok(object) => match object.payload { | ||
| ObjectPayload::ArrowTable { schema, .. } => { | ||
| Bytes::from(encode_schema(&schema)?) | ||
| } | ||
| ObjectPayload::Opaque(_) => Bytes::new(), | ||
| }, | ||
| Err(_) => Bytes::new(), | ||
| } |
There was a problem hiding this comment.
Calling self.cache.get(&object_key).await inside get_flight_info can be inefficient for large native tables that are not currently resident in memory. This operation triggers a full load of all record batches from disk just to extract the schema. Consider implementing a lighter-weight storage operation that only retrieves the schema from the Arrow IPC file header.
| writer = pa.ipc.new_file(object_path, payload.schema) | ||
| for batch in payload.batches: | ||
| writer.write_batch(batch) | ||
| writer.close() |
There was a problem hiding this comment.
The pa.ipc.RecordBatchFileWriter (returned by pa.ipc.new_file) should be used as a context manager to ensure the file is properly closed and the footer is written even if an exception occurs during write_batch. While writer.close() is called at the end of the block, it won't be reached if an error occurs in the loop.
| writer = pa.ipc.new_file(object_path, payload.schema) | |
| for batch in payload.batches: | |
| writer.write_batch(batch) | |
| writer.close() | |
| with pa.ipc.new_file(object_path, payload.schema) as writer: | |
| for batch in payload.batches: | |
| writer.write_batch(batch) |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Summary
Impact
DataSet/DataFrame-style cache writes can now preserve Arrow schemas and record batches as first-class cache payloads while keeping the existing opaque object path, ObjectRef shape, and legacy cache file compatibility.
Validation
Note
The new E2E tests were added but not run against a live local Flame cache here because the local flame.yaml lacks a current-context.